
Developing a Schematron-Owning Your Content Markup


Speech transcript (automated media analysis, beta)
So I'm going to talk today about how we implemented Schematron into our workflow.
So, a bit about who we are. SAGE Publications is the world's fifth-largest journals publisher. Our portfolio includes more than 645
journals spanning the humanities, social sciences, and science, technology, and medicine, and more than 280 of those are published on behalf of societies and institutions. All of our journals are available electronically on the award-winning SAGE Journals platform, powered by HighWire Press. So, a bit of history. SAGE previously published journals electronically using a back-converted proprietary DTD. Most of our journals were displayed as header-footer only,
meaning just abstracts and references were available for display in HTML. If you wanted to read the rest of the article, you had to access the PDF. There were a small number of journals that did display in full text, and HighWire Press controlled the content conversion for those. In
2010 we began a transition to the Journal Publishing Tag Set version 2.3. We worked with our typesetters to create XML-first workflows, so we were no longer relying on back conversion. By November of 2010 we had successfully transitioned our full corpus of journals into an XML-first workflow using the NLM DTD. This offered us the flexibility to provide full text for a large number of journals, as well as open up features that were available on the newly upgraded HighWire Press platform. Our XML files are deposited into our content management system, which validates them against the DTD and then delivers the content on to HighWire Press for online production. So what
happens when we move files to HighWire for production? [unintelligible] That was supposed to say "Danger, Will Robinson, danger" in the robot voice. So this is a common error that we receive from HighWire; it indicates that an e-mail address contains a white space. This would fail the entire issue on HighWire and require a resupply. So what was our main problem? There was no quality assurance of the XML other than DTD validation. So this is the back story: files would often fail upon delivery to HighWire, and troubleshooting, correcting, and redelivery were commonplace. So
how would we find this error before submitting it to HighWire? The structure validates against the DTD just fine. Enter Schematron. Schematron is a language for making assertions about the presence or absence of patterns in XML documents. It even has a cute little mascot. So how do you go about
creating a Schematron? What types of patterns are you going to look for? We based ours off of two types of data. First, we collected 8 months of data based off of issue error reports that we downloaded from HighWire. Corrections that were sent to the typesetters were analyzed, and errors that occurred multiple times were logged. Some errors were as simple as an e-mail address that contained a white space, as I showed previously, and some were as complex as a footnote being incorrectly encoded as an author note. Second, we consulted our typesetter encoding guidelines to write rules based off of our encoding specifications, to ensure that we receive a uniform XML format from our typesetters. Some examples of this are checking for correct table graphic tagging, or verifying correct related-article tagging. I'm going to show you some
examples of rules we wrote based off of our issue error reports. First up is checking the validity of e-mail addresses. As you can see, there's a white space after the e-mail tag and before the e-mail address. This is a common typesetting error. Obviously the e-mail is going to be invalid, and it's going to fail. The space actually has to come before the e-mail tag. So this is the rule we wrote to catch that. Here we're using a report test, which will produce an error report if the e-mail tag contains a white space. The second
example will check for article notes that are incorrectly encoded as author notes. This is an intriguing concept for typesetters to grasp: when is a note an article note, and when is it related specifically to the author instead? So we use Schematron to help us make sure the content is correct. This XML snippet shows an author note that contains a footnote identifying supplementary material. This footnote is related to the article as a whole and not the author; therefore the placement of the footnote is incorrect, and it should be moved to a footnote group. And this is the rule written to catch this error. We use the assert test with all possible footnote types that one could expect to appear in an author note. As you can see here, we're looking for com, con, current-aff, deceased, etc. Because the footnote type of supplementary-material is not one of those options, an error is thrown: this is not a valid type value for author notes. Example 3:
we want to check for a missing list type. While it's valid to have a list element without a list-type attribute, HighWire Press will not accept this, and they're going to fail the issue and require a resupply. So first we're going to use an assert test to check for the existence of the list type, and second we use another assert test to further check that the list type is one of the following options. You'll see we have order, bullet, alpha-lower, and so on.
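The rules shown on the slides were written in Schematron and are not reproduced in this transcript. As a rough illustration only, the three checks just described (white space in an e-mail address, an article-level footnote inside author notes, and a missing or unexpected list type) can be sketched in Python with the standard library. The function name, the sample document, and both value sets are assumptions: the talk names only some footnote types and list types aloud.

```python
import xml.etree.ElementTree as ET

# Assumed value sets: the talk names only some of these values aloud.
ALLOWED_AUTHOR_FN_TYPES = {"com", "con", "current-aff", "deceased"}
ALLOWED_LIST_TYPES = {"order", "bullet", "alpha-lower", "alpha-upper",
                      "roman-lower", "roman-upper", "simple"}


def check_issue_report_rules(xml_text):
    """Return error messages for the three checks described in the talk."""
    root = ET.fromstring(xml_text)
    errors = []

    # Example 1: an <email> element whose text contains any white space.
    for email in root.iter("email"):
        text = email.text or ""
        if any(ch.isspace() for ch in text):
            errors.append("email contains white space: %r" % text)

    # Example 2: a footnote inside <author-notes> must carry an fn-type
    # from the allowed set (a missing fn-type is also flagged here).
    for notes in root.iter("author-notes"):
        for fn in notes.iter("fn"):
            fn_type = fn.get("fn-type")
            if fn_type not in ALLOWED_AUTHOR_FN_TYPES:
                errors.append("invalid fn-type in author-notes: %r" % fn_type)

    # Example 3: every <list> needs a list-type, and it must be a known value.
    for lst in root.iter("list"):
        list_type = lst.get("list-type")
        if list_type is None:
            errors.append("list is missing list-type")
        elif list_type not in ALLOWED_LIST_TYPES:
            errors.append("unexpected list-type: %r" % list_type)

    return errors


# Hypothetical sample that trips all three checks.
SAMPLE = """<article>
  <front><article-meta><author-notes>
    <corresp>Email: <email> jane@example.org</email></corresp>
    <fn fn-type="supplementary-material"><p>Supplementary data online.</p></fn>
  </author-notes></article-meta></front>
  <body><list><list-item><p>First item</p></list-item></list></body>
</article>"""

for message in check_issue_report_rules(SAMPLE):
    print(message)
```

In Schematron itself the first check would be a report and the other two asserts, as the speaker describes; the sketch only mirrors their outcome.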
So how do you manage varying journal styles while adhering to a uniform XML style? Based off of the NLM DTD, we have encoding guidelines which outline and illustrate our specific implementation of the Journal Article Tag Set. Our typesetters consult the guidelines and generate the XML accordingly. Occasionally new features are rolled out on the platform, which requires us to update the guidelines, and then we can also add to the Schematron rules. Now some
examples of rules we wrote based off of our guidelines. This guideline illustrates that a table must contain an alternate graphic, and the alternate-form-of attribute must be used. And this rule was written to ensure that the alternate-form-of attribute has been added to the table graphic. You can see the context here is table-wrap/graphic, and the assert test is looking for the existence of the alternate-form-of attribute.
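A minimal Python sketch of the same check, assuming the context is a graphic that is a direct child of table-wrap. The function name and sample markup are hypothetical, and the original rule is a Schematron assert that is not reproduced in this transcript.

```python
import xml.etree.ElementTree as ET


def table_graphics_missing_alternate_form(xml_text):
    """Return ids of table-wraps whose graphic lacks alternate-form-of."""
    root = ET.fromstring(xml_text)
    failures = []
    for table_wrap in root.iter("table-wrap"):
        # Context table-wrap/graphic: only direct <graphic> children.
        for graphic in table_wrap.findall("graphic"):
            if graphic.get("alternate-form-of") is None:
                failures.append(table_wrap.get("id", "(no id)"))
    return failures


# Hypothetical sample: the first table is tagged correctly, the second is not.
SAMPLE = """<body>
  <table-wrap id="t1">
    <caption><p>Table 1</p></caption>
    <graphic alternate-form-of="t1"/>
    <table><tbody><tr><td>1</td></tr></tbody></table>
  </table-wrap>
  <table-wrap id="t2">
    <caption><p>Table 2</p></caption>
    <graphic/>
    <table><tbody><tr><td>2</td></tr></tbody></table>
  </table-wrap>
</body>"""

print(table_graphics_missing_alternate_form(SAMPLE))
```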
So, example 5: verifying related-article tagging. This guideline illustrates the structure to follow when a related-article element is required, for instance on an erratum. As you can see, we're looking for a number of different attributes here: an external link type, a volume and issue if one is provided, and a page number if one is available. And this rule is written to
ensure that this coding is correct. This is basically going to check and make sure that you have a DOI, or a volume and a page number, either of which will ensure correct linkage back to the original article. Next
up is checking that a graphical version of the table has been provided and that it is placed correctly alongside the XML table markup. As you can see, this encoding guideline shows that a table graphic is required and details the placement of that element: the graphic should come between the caption and the table. The assert checks that the graphic is included, within the context of table-wrap, as the preceding sibling of the table, which just happens to be directly below the caption. As you can see here, the graphic is directly between the caption and the table, so this is correct XML and the rule would not fire. So what happens when
you run publish-ahead-of-print XML through our Schematron? Let's say our Schematron file happens to have rules which check for the existence of a volume and issue [unintelligible]. Obviously a publish-ahead-of-print article won't have a volume and an issue, and it shouldn't. So how can we make Schematron work for us for a variety of content types? This is where
a content management system comes into play. When we started building our Schematron, it was determined that we would need a number of different Schematron rule sets for the various types of content our CMS handled. However, instead of creating separate Schematron files, we incorporated the use of phases. Phases can be used in a Schematron file to group sets of patterns so that only a specific set of rules is checked against a particular type of content. This way only one Schematron file is needed for the CMS, which makes updates, corrections, and version control much simpler. Our XML files are ingested into the CMS by different workflows based off of the type of content. That content is then checked against the corresponding Schematron phase. This table shows the type of content and the corresponding Schematron phase: we have archive content, publish-ahead-of-print content, and current-issue content. So here's an example of a rule that
we wrote for our current-issue content phase. It has additional rules to check the publication dates. This example checks for the current year within the pub-date in the article metadata, by using the current-date function to get the year from the current date. And this is an example from the publish-ahead-of-print phase. We're going to check for the existence of a volume element, which should not be present in a publish-ahead-of-print article. This rule is written to alert us if the content contains a volume in the article metadata: the test will alert us if the nonexistence of volume is not true.
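To make the phase idea concrete, here is a hedged Python sketch in which each pattern is written once and each phase simply lists the patterns that are active for it. The phase names, pattern names, and message wording are assumptions for illustration; the real file uses Schematron phase and pattern elements, which this transcript does not show.

```python
import datetime
import xml.etree.ElementTree as ET


def email_pattern(root, today):
    # Shared pattern: written once, active in every phase below.
    return ["email contains white space"
            for email in root.iter("email")
            if any(ch.isspace() for ch in (email.text or ""))]


def current_year_pattern(root, today):
    # Current-issue content: each pub-date year must match the current year.
    return ["pub-date year %s is not the current year" % year.text
            for year in root.findall(".//article-meta/pub-date/year")
            if (year.text or "").strip() != str(today.year)]


def no_volume_pattern(root, today):
    # Ahead-of-print content: article-meta must not contain a volume.
    if root.find(".//article-meta/volume") is not None:
        return ["volume present in ahead-of-print article"]
    return []


# Hypothetical phase names; each phase lists the patterns that are active.
PHASES = {
    "archive": [email_pattern],
    "ahead-of-print": [email_pattern, no_volume_pattern],
    "current-issue": [email_pattern, current_year_pattern],
}


def validate(xml_text, phase, today=None):
    """Run only the patterns that belong to the chosen phase."""
    today = today or datetime.date.today()
    root = ET.fromstring(xml_text)
    messages = []
    for pattern in PHASES[phase]:
        messages.extend(pattern(root, today))
    return messages


ARTICLE = ("<article><front><article-meta>"
           "<pub-date><year>2012</year></pub-date><volume>40</volume>"
           "</article-meta></front></article>")

# A fixed date keeps the example deterministic.
print(validate(ARTICLE, "ahead-of-print", datetime.date(2012, 10, 1)))
print(validate(ARTICLE, "current-issue", datetime.date(2012, 10, 1)))
```

The same article passes one phase and fails another, which is the behaviour the single-file, multi-phase design is meant to give.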
Now, many rules as they're written can create a false positive. With this in mind, we developed our Schematron with the use of roles. By using roles, we can set a rule to report an error or a warning. If the Schematron report returns a warning, the CMS will give the user an option either to accept the content as it is and override the report, or to fail the content and require correction and resubmission. However, if the Schematron report returns an error, the files will not be ingested into the CMS and will have to be corrected before the issue can move forward in the system. Next I'll show you some examples of rules using the warning role.
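The error and warning handling described here can be sketched as a small decision function. This is an illustration under assumed names (the outcome labels and the shape of the findings list are hypothetical), not the CMS's actual logic.

```python
def ingestion_decision(findings):
    """Decide what happens to a delivery.

    findings is a list of (role, message) tuples, where role is
    "error" or "warning", mirroring the role set on each rule.
    """
    roles = {role for role, _message in findings}
    if "error" in roles:
        # Errors block ingestion until the files are corrected.
        return "reject"
    if "warning" in roles:
        # Warnings go to a person, who may override or fail the content.
        return "review"
    return "ingest"


print(ingestion_decision([("warning", "article title is all caps")]))
print(ingestion_decision([("warning", "white space in surname"),
                          ("error", "email contains white space")]))
```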
Many errors are made by incorrectly tagging contributor names. Name segmentation, such as encoding part of a given name incorrectly as a surname, requires a human eye to determine if an actual error exists. As you can see here, the name shown should actually be encoded as a given name and a surname. Since this type of error requires some fact checking to determine if it's an actual error, we specify the role of the rule in the Schematron as a warning. With the following rule, we're basically going to be checking for white spaces within a surname. Many surnames can contain multiple names, so they're going to have a white space, and we have built exceptions into the rule for the most common multiple-word surnames, so that the rule won't fire on known surnames with white space. As you can see, we've got "van" and others among the exceptions. So this rule,
checking for all caps in an article title, was written for journals that present the article titles in print with all capital letters, but the XML must be captured as initial caps. However, there are going to be rare instances where an article title should be in all capital letters, so we have to set the role of the rule to a warning. And here we're using a regular expression to check for the existence of capital letters. Next, checking for the S prefix on page numbers in issue supplements. This rule checks that page numbers of supplemental issues include the S prefix. Now, we publish a lot of content, and not all the journals follow the same format, so unfortunately we have to write this rule as a warning, and human intervention is needed to determine if that one can be pushed through as is or if a correction is required. And this one
is kind of one of my favorites, and it actually came out of PMC. We send a lot of journals to PMC, and we were getting a lot of feedback saying, "Hey, your XML is not structured for structured abstracts; please fix this." As you can see here, the display of the section titles in a structured abstract should actually be controlled by the online templates and not by italic formatting. This is a little snippet of a PDF. You'll see that we've got the headings Objective, Study Design, and Setting in italics. I'm going to show you the incorrect XML for this. You can see that we've got a p, then italic, and then the paragraph continues. So this is incorrect, and we want to make sure that we can check this. And this is the way it should actually look, with the sec and title. So this is the rule that we wrote in order to catch this. This rule, however, has the possibility of returning a false positive; for instance, an abstract could begin with a Latin species name, and the use of italics would then be necessary. So we set the role to a warning.
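Three of the warning-role checks described above (white space in a surname, an all-caps article title, and an abstract paragraph that opens with italics) can be approximated in Python as follows. This is a sketch under stated assumptions: the exception prefixes, the regular expression, and the sample names are hypothetical, since the transcript does not preserve the actual rule text.

```python
import re
import xml.etree.ElementTree as ET

# Assumed exception list; the talk mentions "van" among several prefixes.
SURNAME_PREFIXES = ("van ", "von ", "de ")
# At least one capital letter and no lower-case letters at all.
ALL_CAPS = re.compile(r"[^a-z]*[A-Z][^a-z]*")


def warning_checks(xml_text):
    """Return warning messages; each one needs a human decision."""
    root = ET.fromstring(xml_text)
    warnings = []

    for surname in root.iter("surname"):
        text = (surname.text or "").strip()
        if " " in text and not text.lower().startswith(SURNAME_PREFIXES):
            warnings.append("white space in surname: %r" % text)

    for title in root.iter("article-title"):
        text = "".join(title.itertext()).strip()
        if ALL_CAPS.fullmatch(text):
            warnings.append("article title is all caps: %r" % text)

    for abstract in root.iter("abstract"):
        for p in abstract.findall("p"):
            # True when the paragraph has no text before its first child.
            starts_with_child = len(p) > 0 and not (p.text or "").strip()
            if starts_with_child and p[0].tag == "italic":
                warnings.append("abstract paragraph starts with italic; "
                                "possible untagged section title")
    return warnings


# Hypothetical sample: one suspicious surname, one excepted surname,
# an all-caps title, and an abstract heading tagged as italic.
SAMPLE = """<article><front><article-meta>
  <title-group><article-title>DNA REPAIR IN YEAST</article-title></title-group>
  <contrib-group>
    <contrib><name><surname>Mary Cook</surname></name></contrib>
    <contrib><name><surname>van Dijk</surname>
      <given-names>Anna</given-names></name></contrib>
  </contrib-group>
  <abstract><p><italic>Objective:</italic> To test the rule.</p></abstract>
</article-meta></front></article>"""

for message in warning_checks(SAMPLE):
    print(message)
```

An abstract that legitimately opens with an italicized species name triggers the same message, which is why the speaker gives the rule a warning role instead of an error role.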
As we developed our Schematron for our CMS, we also created a version of it to share with our typesetters. The typesetters built Schematron validation into their XML-first workflow as a step. Our typesetters now check the XML they produce against the appropriate Schematron prior to submitting the files to us. By instituting Schematron into the XML-first workflow as close to the beginning of the process as possible, errors in content and coding are caught much sooner. This also shifts the quality assurance burden of the XML back to the typesetter. If they encounter errors, they must correct them prior to submission to our system. If warnings are generated in the returned reports, the typesetters must evaluate those warnings and determine if they can be ignored or must be corrected. When there's a question regarding an error or warning, the typesetters are instructed to contact us prior to delivery of the files. The typesetter also benefits from seeing error reports, so that they can catch common problems up front in their workflow and ensure that they don't occur in the first place.
So, prior to the implementation of Schematron, our average number of file deliveries to HighWire Press per issue for online publication was 1.8. This means that on average, issue files were sent to HighWire Press approximately 2 times before an issue was published. The second delivery of files indicates that corrections were needed. After implementation of the Schematron, our average number of deliveries prior to publication fell to 1.35. This represents a 25 percent reduction in errors prior to online publication. This is a significant reduction in the amount of time we spend on an issue prior to approving it.
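As a quick check on the two averages quoted above (a worked calculation, not a figure taken from the slides), the relative reduction in deliveries per issue is:

```latex
\[
\frac{1.8 - 1.35}{1.8} = \frac{0.45}{1.8} = 0.25
\]
```

That is, the drop from 1.8 to 1.35 deliveries per issue corresponds to a 25 percent reduction.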
So in conclusion, completing the hurdle of transitioning all of our content into an XML-first workflow using the NLM DTD was only the first step in the process of gaining control of our content. Careful planning was necessary to evaluate common errors in the XML files in order to write Schematron rules which can catch those errors up front in the workflow. This enables us to push the quality assurance of the XML back to the typesetter, takes the emphasis off production editor intervention, and ultimately saves publication delays. Referencing the encoding guidelines is also a necessary component of building a useful Schematron. This allows rules to be written to ensure that our implementation of the DTD is being employed for all XML files. Finally, building Schematron phases and using roles in our CMS further allows us to ensure that the XML we deliver to our online host provider is free from systemic errors, which helps us keep costs down and reduces staff time and publication delays. So using Schematron can transform your XML workflow into one that works for you.
I think that was a great presentation. I have an observation and a comment. It makes sense to use Schematron at different points of the production workflow, to push errors upstream as much as possible. However, the introduction of phases raises a question about scalability. Not only can Schematron be used to check the same content type at different points in the production workflow, it can also be used to check different genres of publications, such as books and journal articles. Because JATS is used, a large portion of content will come from the same stock of elements, with the same structures. I wonder whether using the phase construct in Schematron is the optimal solution. It seems to me it would be more robust to have a modular design for the Schematron and to use includes to build Schematrons that could be used across different genres of publications.

This is also something that we could do. We actually have a file we call the journal reference file, which basically contains a bunch of metadata at the journal level, such as the journal frequency and the months that it publishes in. We're actually enhancing that file so that we can make better use of Schematron, and we could start using it for, say, one rule for humanities, one for the social sciences, and so on. So it's an option.

I'll follow on a little. If I heard you correctly, you said you passed a version of the Schematron to your typesetters. Does that mean you have differences between what you use internally and what the typesetters are using?

It's mostly the same Schematron, but the CMS version had specific metadata added so that it would work in the CMS.

Thanks.

Rebecca [surname unintelligible]. My first thought was: those errors that you're getting from HighWire, what did they use to generate those, and did they not have a Schematron they could pass along to you?

That was a common question
that I asked. I never received anything that we could use internally to help our system, so I'm not sure what their process is.

Kevin Hawkins, MPublishing. It occurs to me, though, that the sorts of rules that you're writing here are kind of applicable to everyone, right? They are really not too specific to any particular publication; they just reflect good JATS practice. It seems to me that in the wiki we would like to build, or whatever is to come of this, this is the sort of thing that would be good for us all to share with each other and make use of collectively.

[Speaker name unintelligible.] How much of your Schematron development has been reactive, meaning there was a problem that got caught, so we need a rule, and how much has been proactive, meaning we should look for things that somebody might get wrong? And the second part of the question is: how much ongoing development of this are you doing as you find new things?

When we first started developing the Schematron, and I'm going to give a shout out to [name unintelligible], who actually helped us create this Schematron file, we definitely looked at errors that we were already seeing, and then she basically combed through the typesetter encoding guidelines looking for any errors that would be possible to check. I don't think we were trying to come up with any sort of strange thing that we could think of, but you can correct me if I'm wrong. [unintelligible] I would say it's definitely a living document. I have added rules since we instituted it, because we often acquire new journals, and with those new journals we're going to get typesetters who [unintelligible], and we get some crazy things. So we're often adding rules.

[Speaker name unintelligible.] Two comments, one partially in response to the observation about proactive use. Schematron could of course be used proactively, for example to enforce the editorial style
[unintelligible] that the publisher employs. That would be an example of proactive use. And the second comment: CrossRef of course has a publicly available Schematron that they use to check incoming content, and it seems to me a publisher would be well advised to incorporate that Schematron up front [unintelligible], so that it would cut down on the rejection of deposits by CrossRef.

[unintelligible] We don't have a [unintelligible] yet that would check that.

In a way, all these rules capture, loosely speaking, the stupid errors. At what point do you check the text itself, the real thing? [unintelligible] The production editor used to read the whole text and check that all changes were applied, and so on. So at what stage in your process is that done? Is your Schematron pushed up to the typesetter?

It's actually done at the final stage, when the typesetters basically ingest their files into our system. This is when the Schematron is checked, and then it's going to basically stop if there's an error or warning. Now, the typesetters do have the file to run on their end, so we hope that the files are getting checked before we see them.

That's the XML. What about the rest of the content, which is not XML-aware in any sense? At what point do you check that?

That's done at the QA stage, when it's staged on HighWire, before the approval.

[unintelligible] That's XHTML, and the conversion is done by HighWire, a black box in between. That's my biggest concern: many things can happen there [unintelligible].

It's a concern, but we haven't figured that out yet.

Thank you.

Jenny [surname unintelligible], Nature. How many rules, approximately, have you
got? [unintelligible]

I did not spend my morning counting them. I want to say there are approximately 50 rules.

Steve Haynal from Dartmouth Journal. A couple of questions. One: have you considered having your typesetters apply cleanup scripts, that sort of thing, to catch problems like a space in the middle of an e-mail address? [unintelligible] And the second question is: how the heck long did it take you to go through 8 months of error reports, and how many people were involved?

So, the first question: we did ask the typesetters to create scripts and catch these errors up front, but we kept getting spaces in e-mail addresses. Our typesetters are offshore, and there is not a lot of implementation that we can work on with them; they've kind of got their own process. So Schematron was the easiest way to make sure that errors weren't coming through. And it was basically me looking at the error reports, since I was basically checking the deliveries, how many deliveries per month. I would basically look at errors every month at the end of the month. So, 8 months of error reports.

[unintelligible] You mentioned the Schematron file, and you mentioned, for example, the e-mail with the space. Because you have three phases, did you have to put that e-mail rule in there three times, once in each phase?

No, you only need the rule once. We have what's called a pattern, and the rule is within the pattern, and the pattern is basically plopped into each phase that needs it.

OK. And one other thing I want to mention: with all the PMC errors that we ever got [unintelligible], each one of them, for example [unintelligible] for subscript, if I ever got an error, we added it to the Schematron, and that's why I think all our files pretty much go in clean and we don't get [unintelligible]. Whenever
we got an error, we added it [unintelligible]. We had over [unintelligible] rules in our Schematron [unintelligible].

Metadata

Formal metadata

Title Developing a Schematron-Owning Your Content Markup
Subtitle A Case Study
Series title JATS-Con 2012
Part 15
Number of parts 16
Author Blair, Julie
License CC Attribution 3.0 Unported:
You may use and modify the work or its content for any legal purpose, and reproduce, distribute, and make it publicly available in unchanged or modified form, provided you credit the author/rights holder in the manner they have specified.
DOI 10.5446/30569
Publisher River Valley TV
Release year 2016
Language English
Production year 2012
Production place Washington, D.C.

Content metadata

Subject area Computer science
Abstract This paper will detail an organization's development and implementation of Schematron in its workflow process to cut down on errors as well as develop consistent markup across articles and journals. The process for developing the Schematron will be explored. This consisted of compiling error reports from 8 months of data as the basis for writing rules. The paper will examine how the Schematron was implemented into a Content Management System and broken up into Phases for the varied workflows of the organization. Upon content ingestion, files are validated against a specific Phase in the Schematron, based on the workflow, and reports are generated if any rules throw an error or warning. The results of the implementation of the Schematron will be summarized. A decline in errors was realized which reduced the average number of deliveries prior to online approval. The case study demonstrates how introducing Schematron into an XML workflow can help a publisher drive their content markup while reducing publishing delays and cost of corrections.
