How Well do you Know Your Data? Converting an Archive of Proprietary Markup Schemes to JATS: A Case Study


Speech transcript
Well, OK — good morning, everyone. We're here to talk about a journey: our migration to JATS. We've done that over the last year. We were participants at the last two conferences just as spectators, but now we have some real-life experience that we'd like to share with you. We're going to present it in three specific parts: the content transformations and the trouble spots, which matches up with the title of the presentation, and then the validation and QC piece that had to go on after we had done all these transformations. What you'll find, if you've been working with any kind of converters of your own, is that a lot of what you'll see is probably very familiar; you'll see many commonalities, and much of it might even seem self-evident, depending on what you've come across in your own data. But one of the things pointed out by our peer reviewers — and thank you, whoever you are — is that while everyone understands the trouble spots, a lot of times we underestimate how important they are, and it's really, really important to make sure that you know exactly what you need to do and how you go about doing it. We didn't want to take anything for granted. The fact is that the systems, tools, and markup that you've used over years, sometimes decades, change over the course of time; but the data itself — the thing we're trying to capture with all these tools, our most important asset — doesn't really change, and we want its quality preserved throughout its historical lifetime. So what we found as we were doing this is that it's not really just about the tags: there's a ripple effect, and it also touches the surrounding infrastructure and the support that you need to build to make the tags as useful as possible. I will be speaking about what AIP chose to do, and of course your circumstances may vary. The key thing is that, as we all know, the world
doesn't stop for a project like this: you all have obligations and responsibilities that you need to keep up with as you go about doing it. So, just a little bit about AIP. You can
see the metrics and some factoids about AIP here: we've been around for quite a while, and we support a wide portion of the physics community. But what I'd like to draw your attention to mostly is the mission statement at the bottom, because the premise that runs throughout it forms the background and backbone of why we wanted to move to JATS, and of how JATS can help us achieve that particular mission. The earlier presentation actually led into this nicely, because the things it brought up will come back throughout. We also want to mention that we found JATS a really great vehicle for achieving our goal, and as you can see, the mission statement is really applicable to anybody out there: just replace the physical and applied sciences with your own consumers, clients, or end users, and replace AIP with your own organization, and you'll see this is what we're all really trying to achieve. So even though we work in the physics space, it really can apply and be helpful to just about anyone. This is where we had
to start: the scope of the challenge ahead of us. The AIP content collection was about 800,000 records, divided up over three particular markup schemes. They all arose, at one point, out of the ISO 12083 standard, but over the years they morphed into a very AIP-specific type of format. We had everything primarily in a header-plus-references XML/SGML format; we had full-text SGML, which was used for our online platform for quite a while; and we had an XML that was derived from the full-text SGML, which evolved mostly over the course of this particular time. How did we use it? Like most people: to create the PDFs and print PDFs, and as a source for HTML rendering on the online platform. And quite frankly, it really did work well. But, as was noted, with the evolution of the XML world into JATS, with more and more people using it and new uses coming along all the time, AIP — which had a number of new products in train at the time — needed to do something; and if we were to have any kind of effective archive of our own XML, it would really involve something
special. So what was the real problem — why did we want to change? The format had morphed into something very AIP-centric: it became overly specialized to our products, and it was becoming a product-based model as opposed to a content-based model. With that comes the support, infrastructure, and expense of maintaining something proprietary, and the many data transformations, and so on and so forth; it was becoming quite cumbersome to work with. And as new standards come along and the community starts embracing them, it becomes more difficult and costly to enhance our products and incorporate those standards into them. In the end, what we had to do was recognize that the format was at the end of its life cycle: it had matured to a point where we really couldn't take it any further. So what we needed to
do was focus on what we wanted to do next. The key was two areas: standardization, and joining the community. As was mentioned earlier, the community is a very important thing. It used to be that, say, over a lunch break, we could only discuss AIP XML among ourselves; now we have a whole wider community that we can take advantage of. We wanted to better position ourselves to adopt standards, and obviously the best practices that go along with using them, which would make data interchange and distribution of the data — our primary function — that much easier; and mostly, it was just a great way to showcase the content. At first we thought it would simply be a matter of re-tagging the content we had, but it quickly became evident that it was not only a matter of changing tags: to do it right, we needed to take our systems themselves and transform them too, and that would mean setting ourselves up for success. Converting to JATS is not the answer in and of itself; it's really about making sure you make the best use of it. And that affected workflow, the content management aspects, staff roles, and, further down the line, the business rules and everything else that comes with something like this. So the first area I want to talk about is the standards. What
happened? Everybody was doing it, and we succumbed to peer pressure. Yes, we did — but it's not just the XML; what also mattered were the tools that support it. Once you have a widely used XML tag set such as JATS, you have XSLT, schema languages, Schematron, and all the other standards that can help, and that eliminates the proprietary types of work that might be too costly to maintain yourself. It also widens the base of your own staff who can operate on, manipulate, and work with the content, so you can make the process as efficient as you can. And, as was said earlier, it enables us to evolve with the community, and to participate, join in, contribute, and hopefully innovate along with the community — obviously our presence here is testament to that. JATS fit nicely in with our traditional usage, so it wasn't so bad, and we had the benefit of dealing with standards to begin with, so we already knew there would be a benefit in working that way. This is what you see:
pages and pages and pages of math. You would think that extremely long mathematical equations, and tables that take up 400 pages in a journal, would be the most devastating, difficult part of your transform. But since we were using MathML already, and JATS uses MathML as well, these were actually the easiest part of the transformation, believe it or not. Why all this math? We don't know; it's just the way it is. All we needed to do in the transform was change the namespace prefix to mml:. If we had had to go in and change all of this markup by hand, it would have been a very different story.
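Since both the old markup and JATS carry math as MathML, the math "transform" really was little more than a prefix change. Here is a minimal sketch of that idea using Python's standard library — the source prefix `m` and the tiny fragment are invented for illustration, not taken from AIP's actual data:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: MathML bound to a non-standard prefix in the source.
src = ('<article xmlns:m="http://www.w3.org/1998/Math/MathML">'
       '<m:math><m:mi>x</m:mi></m:math></article>')

# ElementTree stores names as {namespace-uri}local, so re-serializing with a
# registered "mml" prefix is all this "transform" has to do.
ET.register_namespace('mml', 'http://www.w3.org/1998/Math/MathML')
root = ET.fromstring(src)
out = ET.tostring(root, encoding='unicode')
print(out)
```

Serializing with a registered `mml` prefix is all it takes; the element names and the math content pass through untouched.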
We were using JATS, XSLT, and Schematron — obviously no surprise — and we chose the green Archiving and Interchange tag set, because it was very important for us to have an open model and to be able to distribute our content; that is a large part of what we profess to do. And since you all know these tools well too, we get a lot of benefit from that. The other aspect was that our old systems, the old management structures, and everything else really weren't sufficient to make maximum use of this, so I'll speak a little bit about the foundational work we needed
to do to get this started. The first thing was, obviously, to communicate. It's really easy to sit in a room and say, "Let's adopt a standard — that's a great idea," put your foot down, and make the decision; but that's only half the battle. What we needed to do was communicate it out to everybody. Of course there was a small core within the content technology group working on this particular project, but everyone was aware of it; they understood its importance, and they understood that converting to JATS and maximizing the potential of the data was really a cornerstone of where we wanted to go as an organization. So it was not done in a vacuum. The next thing
we needed for success was ownership. The way AIP was initially set up, we had a couple of different content groups that would work with the material: we had online groups, we had production groups, we had support groups for the platform, and they reported to different managerial structures. Although everyone had AIP's best interests at heart, they had their own projects and occasionally their own agendas, and occasionally there would be completely unintentional conflicts; or sometimes a decision would be made that wasn't officially communicated out, and then we'd get surprises in the content — you'll hear about some later — that we wouldn't necessarily have anticipated. So the first thing we did was make sure everyone understood that we have a unified message and a certain way we want to approach and handle the data. We consolidated the content groups into a single group following that same principle, and officially designated it as the owner of the content and the gatekeeper, to make sure that any change to the markup, or anything that's going to affect the product, is handled properly. You know the Monty Python Black Knight — "None shall pass"? We don't really intend to end up in his state, but we do intend to embody (or disembody) his tenacity and make sure that everything is done properly. And believe it or not, there was surprisingly little resistance to this reorganization; everybody realized it was the best thing to do. In a lot of ways it actually freed quite a few departments, because they no longer had to worry about making unilateral decisions: if you work in the online area and support our platform, you concentrate on online; if you're a project manager, you manage projects; if you work in production, you focus on production. Focus on your core skill sets, do the best you can at your own job, and leave the content technical details to the content technology group to pass muster on. Everyone still has a say in how we change things and what we need to do, but it has to go through a more formal process, and the result is that we're able to maintain the content better. That leads to the next piece, which was the infrastructure. We did invest
in a new content management system; the JATS content is the first piece implemented in it. What it will allow us to do is effectively manage the content, avoid unneeded work and duplication, and avoid those workarounds — done to get content to look right or fit right — that don't really have a solid XML-based reason behind them. We'll have more extensibility; but most importantly, we're going to have great versioning capabilities, and that's going to be very important to everything, so we can maintain and document every change. Because, in a way, we're almost back on square one: we had all our old content, built over the years, and now we're going back to JATS 1.0. This is where we start to be able to manage the content with a really firm grasp, so that it goes where it needs to go and we know exactly who, what, when, and where whenever we need to manipulate something. When you have authors submitting corrections and publisher's notes and things like that, that's very important information to track. You can have the greatest data in the world, and no matter how well you tag it, if you don't manage it properly it's not going to be effective for you.

So what was the next step? We knew we had JATS — check — and we decided that we're going to use XSLT instead of the custom transformation programs we had been using. We did need to decide what we were going to convert. I mentioned 800,000 records; we really have more than that, but we decided to translate just the header records with the references, plus our full-text XML, since it goes back to about 2005. We decided to put the full-text SGML on hold for now — we made a business decision to do that. The next step was to test the XSLT transformation and feed the findings back into the specifications based on the results. We needed to introduce a good quality-control system — this is where we started to implement the Schematron process, so again we're taking up another standardization tool. And we needed to document what we were doing, so that future generations of AIP-ers will not curse the day we were born, but will instead be thankful that we took care of everything and they know exactly why something was done. Then, obviously, we had to train the staff and our partners in the AIP usage, because the key factor is this: yes, we're using the green Archiving DTD, but that doesn't mean we don't have our own way we would like to use it — for instance, how we use named-content, what custom metadata and attributes we allow, all those types of things — and those details get incorporated into the Schematron, so we can manage the data and have it stay as predictable as possible for us. With that foundation set, the next step was: time to go wrestle with all the angle brackets and find out what surprises are lurking in the forest. And with that, I'll hand it over.

Yes — when it comes to wrestling with angle brackets, that has always been my favorite part.
So — our process. Once we agreed on what the challenges ahead of us were, we had to get specific about how we were going to go about converting: what would need to be done, what it would be helpful to do, and what was going to have to wait for another time.
The first and most critical piece in getting started was our document analysis. We needed to know what was out there, what we were dealing with, and how we were going to handle it. Keeping our datasets in mind, we needed to determine the tagging rules we were going to follow; and once those decisions were made — at least enough to get started — we needed a means of keeping track of them. So the creation and maintenance of a document map, which we refer to internally as our spec, our specification, became the first piece of work and the first challenge. This is also where the time goes. And this is also a really good time to gather good sample XML files, and to take copies of the very specific bits of data structure that may cause problems later and will need to be checked.
This is not a big surprise slide: these are the tagging principles that we decided upon. One of the first things we did was define how AIP was going to use JATS, and that was our first-step approach to all of these things — the main facets of the document analysis and the creation of the spec. So, what were we converting? The header records and the full-text XML,
both converted to JATS under the tagging principles agreed upon earlier in the analysis. We made very strict distinctions about element and attribute usage. We decided to hold off on customized content models — that is, not to use them right away; we could use JATS just out of the box, as it were, in the short term. But we did get a little bit tricky with the x markup: we reserved it, and in
this instance we used it to wrap problem areas in our data that we knew we would have to go back to later. For instance, wherever we had an unnumbered list in our source that erroneously failed to indicate what type of label was to be used — a bullet or a number, say — we output a label anyway, but we wrapped that label in x tags, so that later on we could do a data search, find those instances, and deal with them on a case-by-case basis. Eventually we can remove the x tags; for now, we were willing to live with them.
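The mechanics of that x-tag trick can be sketched like this — a hypothetical list item and a stand-in bullet character; whether `<x>` is actually permitted at a given spot is of course governed by the tag set:

```python
import xml.etree.ElementTree as ET

# Hypothetical list item whose source never said what kind of label to use.
item = ET.fromstring('<list-item><p>First point</p></list-item>')

# Output the label we guessed at, but wrap the guess in <x> so it is findable.
label = ET.Element('label')
guess = ET.SubElement(label, 'x')
guess.text = '*'          # stand-in for the assumed bullet character
item.insert(0, label)

# Later, one search turns up every guess for case-by-case review.
flagged = item.findall('.//label/x')
print(len(flagged))
```

One query over the converted archive then lists every place a human assumption was baked into the output.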
In creating the document map, we came upon this very rough formula: our tagging principles, multiplied by our existing documentation — all the piles and piles of paper sitting in the backs of people's desks, in basements, holding up chairs — and then the institutional memory. We were fortunate to have a lot of people who were there at the very inception of our markup and our online publications, so they were able to help us with why certain decisions were made, so that we could carry that decision-making forward and not lose anything, or what it meant, along the way. This is a
sample of one of the pages of the resulting documentation. The first column is what the element is, followed by an example of our proprietary tagging, then the target JATS. The fourth column was the most critical for us: the instructions to the programmer — the person writing our XSLT — explaining exactly what needs to be done. Frequently there were notes on why a decision was made, and then, as you can see, any updates that came later in the process. The idea is that we have a document we can always go back to, to see why we made a choice, what changed, and how it was handled.

One of the benefits of the conversion process was the ability to polish up the archive. Step away from reviewing your completely tagged articles and instead just read the DTD from top to bottom; it helps identify the ambiguities in the existing tag set. Taken outside of an article — where you get a sense of what tag names mean just from reading their content — a tag sitting on its own can turn out to make no sense at all by itself. In our case, tags along the lines of "xa1" through "xa3" are pretty darn meaningless if the tag name is all you have. We were able to take situations like that and use the more meaningful, intuitive tag set from JATS, which allows anyone to open up an XML file and say, "OK, I see what this is — that all makes sense." The conversion to JATS was an ideal opportunity to adjust these cases. The only trick came in deciding whether the adjustments should be made using XSLT, or whether it was wiser to pre-process in advance of the conversion. In this instance, doing it with XSLT was fairly straightforward for us.
Going in, we had some expected trouble spots — things we knew were going to be a problem. Of course, the very first one, which so many people face, is generated text. It's the most pervasive of the known trouble spots. We handled it both by accounting for instances of generated text in the specification — for instance, a spec rule that says "take this part and output this text" — and by leveraging some of our existing conversions, existing programs that in the past would populate data that had been generated text. We would pre-process to a first XML file, then transform, so that the final output file had all the data within it. Style
variations were another issue for us. Each of these scenarios is a possible way of presenting the title "Introduction," depending on which journal you're in and what the styles used to be. Previously we could handle this with CSS or with on-page layouts; but now everything has to be in the file. So we had to locate and document every instance of every possible configuration of our titles.

Multi-purpose tags, or tag reuse — we also call these our heavy-lifting tags — were a number of tags that, while the tag stayed the same, were handled very differently. One of the most problematic is shown here: it's a catch-all in our file which, for some reason, we called "other-info." Its position within the reference determined its handling: it could indicate straight text to be carried forward; it could indicate punctuation that specifically needed to be removed before carrying the rest forward; or it could have a mathematical formula within it and need a whole separate treatment. So again, each of these instances had to be located, documented, and accounted for in the specification. Of course,
no list of problems is complete without the various forms of multimedia. This example was mentioned earlier, and it really reflects what we saw at the very beginning: for a time we had multimedia as an external file — we had a link to it within the main XML, and it linked over to our supplementary database. That was our first form. Then they started coming in such that we had to find a way to tag the multimedia in our data, because people wanted the file right in the article file, so we came up with an embedding structure. Then we saw a lot more of them, so we had to move to a real solution — one that accounted for the different formats we were receiving. So now we have, not one — one, two, three, four, five — six separate possible tagging structures for multimedia that we had to locate, account for, write transforms for, and beg the programmer to include. And this only works if all the tagging inside the file is correct; since the tagging was done by hand, we weren't all that confident it would be. And finally,
probably the biggest hurdle, as Rich mentioned before: we are the whole technology chain — we look larger in real life — and we had to do all of this work we're talking about on top of everything else you do in a regular workday: e-mails, meetings, more e-mails, support calls. That in itself was a balancing act. The biggest reason we were able to do it is that we had the support of the rest of the staff and management; everybody understood the importance of this project to the company as a whole. OK, we found a middle ground — we couldn't promise "sure, we can do that, it'll be ready tomorrow" — but for the most part that support was a big bonus. Now, the unexpected trouble spots. Language: we
had this group of people who speak in angle brackets all day long writing a specification for someone who speaks in Perl. We thought maybe English would be a good answer. That turned out to not really be the answer — it wasn't a common language either. What we decided to do was use XPath, which in retrospect is the obvious solution. In the example you can see "aff" — a tag that we all know: we know where it goes, we know what it does, we know what its limitations are, so it's not going to trip anybody up. But there were other questions: where can I find it? Is it always going to be in the front matter? Could it be in a reference? These are things we had to consider, because that knowledge was all in our heads. Using XPath in our specification let us show our programmers — and anyone else down the line who doesn't have the strange knowledge pattern we've built up — exactly what we meant, so that anybody else can follow it.
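As a sketch of why an XPath says more than a tag name: in an invented skeleton where `aff` occurs in two places, the spec row can point at exactly the occurrence a rule covers, while a bare "aff" matches them all (the document shape below is illustrative, not AIP's actual model):

```python
import xml.etree.ElementTree as ET

# Hypothetical skeleton: <aff> can occur in more than one place, so the
# spec row names the exact occurrence with an XPath instead of prose.
doc = ET.fromstring(
    '<article>'
    '<front><contrib-group><aff>Univ. A</aff></contrib-group></front>'
    '<back><ref-list><ref><aff>Univ. B</aff></ref></ref-list></back>'
    '</article>')

in_contribs = doc.findall('./front/contrib-group/aff')  # the rule's target
everywhere = doc.findall('.//aff')                      # what bare "aff" matches
print(len(in_contribs), len(everywhere))
```

The programmer no longer has to guess which occurrence the analyst meant.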
Nasty surprises: you only think you know your data — you really don't. The moment a human being puts fingers on data, you've lost a measure of positive control. These nasty surprises are out there no matter how careful you are, and we were all pretty darn careful throughout the years; we have a rigorous quality-checking process. Still, there were unforeseen gaps, and a lot of them were identified on our first conversion run. The example we're showing here is bold text: we — perhaps foolishly — assumed it was there, because it's in our rules; it's there in the instructions we give to our vendors that this gets included in the first paragraph. We were confident it was there — until we saw the online displays, and all the paragraphs which needed to appear bold online weren't. We had no idea. So this turned into a whole round of going back to identify the problem areas, and rules were written for the transform again; and the documentation of what was done now guards against it happening
again. That leads, loosely, into quality control and testing. As discussed, we had about 800,000 files, produced and marked up over 20-plus years, and during that time business rules change, publishing styles change, and technology changes. We separated this phase — quality control and testing — into four pieces: the prerequisite training, content tagging checks, incorporating Schematron, and the online displays. The
first step: the prerequisite training. AIP chose to expand the knowledge of the entire team. Jennifer and I were already DTD experts, but sending us out to learn the NLM/JATS DTD was very valuable: we learned the industry tagging practices at the source, and we came to understand the customization possibilities that JATS allows. We also had a two-day class in XSLT and Schematron; with a two-day class it's not as if we came away speaking fluent XSLT, but we learned to write much clearer and more concise instructions for the programmer, and learning Schematron actually enabled us to write our own Schematron — we didn't have to rely on the programmers. So it was really a great class. The next step was the content and
tagging checks. The first part of that was the QC performed while the XSLT was in progress — while the analyst was typing lots of XSLT code — and the first job was to confirm that the programmer understood the instructions. Daily meetings were held to discuss any new findings or clarifications to the instructions. It was early on, at this point, that we found one trouble spot in the specification, which Jennifer touched on: we realized our specification was way too simple. We would write something like, "Convert the AIP artwork tag to the JATS graphic tag." To us that made sense; it sounds simple, but it's actually much more complicated when you're writing the mapping, because you have to think about context: the AIP artwork tag converts to a JATS graphic when it's a child of, for example, a display formula, but in other places it's an inline-graphic — when it falls, say, within an inline formula.
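That context-dependence can be sketched as a simple dispatch on the parent element — the parent names and the dispatch rule here are illustrative, not AIP's actual content model:

```python
import xml.etree.ElementTree as ET

# "Convert artwork to graphic" is really context-dependent. The parent
# names below are illustrative, not AIP's actual model.
BLOCK_CONTEXTS = {'disp-formula', 'fig'}

def jats_name_for_artwork(parent_tag):
    # Display contexts get <graphic>; running text gets <inline-graphic>.
    return 'graphic' if parent_tag in BLOCK_CONTEXTS else 'inline-graphic'

doc = ET.fromstring('<body><disp-formula><artwork/></disp-formula>'
                    '<p>see <artwork/> here</p></body>')

for parent in doc.iter():
    for child in parent:
        if child.tag == 'artwork':
            child.tag = jats_name_for_artwork(parent.tag)

out = ET.tostring(doc, encoding='unicode')
print(out)
```

A spec row that spells out each context, rather than a one-line "convert A to B," is what keeps the analyst and the programmer in sync.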
The next step during this phase was the batch processing, performed once the whole XSLT was complete. The first goal was to see that the XSLT was working and that the files were valid — that was basically our goal at this point. Here we found, as we expected to, some hidden problems; and when you get to these hidden problems you have to decide: is it better to fix the source material, to fix the XSLT, or to fix the JATS after the converted files are complete? You have to step back and ask yourself some questions: How many errors are there? Will fixing this impact your schedule? Is it easy to fix? Each problem was handled on a case-by-case basis. I'll give you one example. In our AIP markup, a reference item had two parts, with a comma between the two parts, and that was allowed in our old DTD. When it converted to JATS, you had the two parts, but the comma had to be put inside one of them — it could not be left floating as PCDATA. In this case we decided to make an XSLT change; of course you're dealing with spaces as well, so it wasn't quite that easy, but it was something we worked out in the XSLT. So you have to decide where it's best to fix the errors.
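A rough sketch of that punctuation fix, using invented element names (in ElementTree terms, the floating comma shows up as the first part's `.tail`):

```python
import xml.etree.ElementTree as ET

# Hypothetical two-part reference with a comma floating between the parts;
# the target model wants that punctuation inside an element, not loose.
ref = ET.fromstring(
    '<citation><surname>Smith</surname>, <year>2005</year></citation>')

# Fold each child's trailing punctuation (its .tail) into the child,
# keeping a single space between the parts.
for part in ref:
    if part.tail and part.tail.strip():
        part.text = (part.text or '') + part.tail.strip()
        part.tail = ' '

out = ET.tostring(ref, encoding='unicode')
print(out)
```

The whitespace handling is exactly the fiddly part the speaker mentions: the comma has to move, but the word spacing has to survive.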
The next step was the group testing, performed when all the converted files were valid. It was done by running approximately 200 files from various journals and different article types, and the entire group checked the same files. More hidden problems were found as we went. We looked, for example, for dropped text — we actually proofread the files: if we knew we had a tag in the source, we made sure the tag was also in the target and that all the text was inside it. This is also where we started running the Schematron. For example, if we knew a table footnote had to be in the format "T1" and it came in in a different format, we would get an error. I'll explain the Schematron a little later in the slides; at this stage we would also check the Schematron errors. The next
step we refer to as bulk processing, performed once those 200 files were approved in the group testing. At this point we ran all 800,000 files through the XSLT, and it was great: we had a 99 percent accuracy rate. You might think that's a great number; however, when you're dealing with 800,000 files, that still leaves you with about 8,000 files with errors. So again we had to decide where to fix the errors: do we fix the XSLT, or do we wait and fix it in the JATS when it's done? I'll give you an example of what we found: with our conference proceedings we have multiple editors, and they were grouped together in a way that didn't follow our intended specifications. In this case we decided to fix the source. We found a few other errors like that, and then we re-ran the XSLT and all the files came out valid. So
here, the final step, is what we refer to as "analyze flagged data," and this was actually done on the converted JATS files — Jennifer touched on this. What we call analyzing flagged data was basically finding the known problems that we had identified at the beginning. I'll give you an example. Here we have a 10 with a superscript minus 8 — that's how our PDF looked. In our older data we had a tag called "other," and this "other" tag represented unknown characters from our first generation of transforms, done in the early nineties. If we didn't know what a character was at that time, we tagged it as "other" and put in — in this case — an at sign (@), which really had no meaning; it was just an unknown character. Because we didn't know what was behind this tag, and we still had to convert the data, we chose to carry it through wrapped in an x tag. That way it would go all the way through the conversion and still be a valid file, and at the end the programmer would supply a list of all the x-tag cases; then we could analyze them and modify the JATS files. Again, a business decision was made to do this at the end, but it really made the most sense to us. There were a few leftovers, and those we fixed manually. The
The next step in the quality control and testing was incorporating the Schematron, which I have to say was my favorite piece of the process. It is the centerpiece of the QC process, and it was derived from a pre-existing proprietary QC program. The Schematron is a list of checks — assertions written in XPath — and most of them are specific to our data. As mentioned by Rich, we decided early on not to make any modifications — not to customize the JATS DTD — and to rely on the Schematron as the way to have more control over AIP styles. For example, we have one journal, our Chaos journal, which has structured abstracts; the rest don't. So here was a case where, in the Schematron, we could make sure that every Chaos article will always have a structured abstract. That is the type of thing we checked for, and the Schematron was run throughout the whole QC process from the start, because every time we found something specific to a journal we would write up a Schematron rule. If a style rule only holds, say, 80 percent of the time, we would put it in as a warning. This way we captured every single rule, as far as we could, in the Schematron. It is extremely helpful: you don't have to say "I'm reviewing this journal, let me make sure that A, B, C, and D are correct" — if you put it all in the Schematron, it makes sure everything is correct. Right now we have about 250 rules in place, and we are continuing to write more.
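As a sketch of what such rules might look like — the journal-id value and the warning condition are illustrative assumptions, not AIP's actual rules:

```xml
<pattern xmlns="http://purl.oclc.org/dsdl/schematron">
  <!-- House rule: every Chaos article must carry a structured abstract -->
  <rule context="article[front/journal-meta/journal-id = 'chaos']">
    <assert test="front/article-meta/abstract/sec">
      A Chaos article must have a structured abstract.
    </assert>
  </rule>
  <!-- A convention that holds only most of the time is flagged as a warning -->
  <rule context="abstract">
    <report test="count(p) &gt; 1" role="warning">
      Abstract has multiple paragraphs; verify against house style.
    </report>
  </rule>
</pattern>
```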
Here is an example. If you look at the top of the screen, at the content keywords: the orange compound keyword actually has four parts inside, and that is valid in JATS — this is a problem we found earlier. However, our style wants only two parts inside a compound keyword, as at the bottom: one part carrying the code and one carrying the value, for this specific content type. So what did we do? We wrote a Schematron rule to allow only two parts inside a compound keyword. So here the file was valid, but the Schematron check on the first one actually gave us an error. You don't have to
understand the actual Schematron rule; what you see is the error that comes out when you run the Schematron: the first one saying a content keyword must have two parts, and the second one saying that the value of the attribute must be "code" or "value." So again, we were able to raise an error that enforces our style even though the file was valid JATS — without having to modify the JATS DTD.
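A rule of this shape might read as follows — the element names follow the JATS compound-keyword model, and the messages are paraphrased from the slide:

```xml
<pattern xmlns="http://purl.oclc.org/dsdl/schematron">
  <rule context="compound-kwd">
    <!-- House style: exactly one code part and one value part -->
    <assert test="count(compound-kwd-part) = 2">
      A content keyword must have exactly two parts.
    </assert>
    <assert test="not(compound-kwd-part[not(@content-type = 'code' or
                                            @content-type = 'value')])">
      The content-type of a keyword part must be "code" or "value".
    </assert>
  </rule>
</pattern>
```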
The final stage, which again Jennifer touched on, is the online display — the icing on the cake, as we call it. The assumption at this point is that the file is valid, the Schematron has been run, and everything is perfect. The testing here was expanded beyond our small group to the online publishing group, and we selected random testers throughout the organization so they could also check the files. Of course, at this point there were still errors along the way, but fewer and fewer. And, as Jennifer found, when we were actually viewing the files we saw that a paragraph in one of our journals was emboldened — sure enough it was, and it was so much easier to find when viewing the files. So what did we do? We wrote a Schematron check so this will never,
ever happen again, because we found the error. It really is a great way to confirm that all of your business rules are being followed — this online display testing.
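A check of that kind could be sketched as follows — the XPath test is our guess at how one would detect a fully emboldened paragraph, not AIP's actual rule:

```xml
<pattern xmlns="http://purl.oclc.org/dsdl/schematron">
  <rule context="p">
    <!-- Flag paragraphs whose entire text content sits inside <bold> -->
    <report test="bold and normalize-space(bold) = normalize-space(.)">
      Entire paragraph is bold; check the source styling.
    </report>
  </rule>
</pattern>
```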
So, to sum it up, here are some general lessons learned and conclusions. The bottom line is: don't go it alone. I think that was a key point brought out in the first presentation — you have this whole community out
here: brilliant people who have often gone through exactly what you are trying to do. When you follow industry best practices and standards, it enables you to get that much more in the way of resources from the community. Set yourself up for success: make sure that when you do all this work, you build a system and an environment in which you can constantly stay successful with it. It is also impossible to overstate the importance of document analysis — but as we pointed out, no matter how much document analysis you do, there is always something that slips through. With a figure like 800 thousand records, you forget how many characters could possibly be in there after all those years; something is going to look good on the surface but not be right underneath the covers. And use the analysis as an opportunity to correct any known problems you always wanted to fix but never had the opportunity to — because when you undertake a project like this, when else do you ever get
the resources and the attention of the organization to make sure you get it all done? So take advantage of it while you can. [Audience question:] Hi, good to see you again. Could you go back over the distinction you drew between garbage data and incorrect data?
[Answer:] I apologize — I didn't write my notes on that one. The difference is this: garbage data is data that simply makes no sense — for instance, a user put in an empty tag that serves no purpose. Incorrect data is where someone chose a tag that may be valid but is not really what you want in that particular location. As you go through your process of analysis, that distinction makes a big difference in how you go about fixing things. One of the solutions is the mapping document I showed an example of before: the more information you can put into it, the better off you are going forward. It is a living document that hopefully gets passed on to anyone who joins the organization, or who is already in the organization and wants to understand what you've done and why you've done it — keep it as detailed and as current as possible. The Schematron, as they mentioned, was very valuable, mostly because it gave us a second language to work in. I may not know XPath cold, having worked in markup for so long, but it has become a tool I didn't even realize would be so useful until I started working with it — it has changed how I think about our text. The Schematron is the centerpiece of the QC process. We have always had QC checks, but they had been done first by our programmers — proprietary, written in-house. The Schematron has given Faye and me and Rich the opportunity to create our own rules; we do not have to go to the programmer. We know the data best — not to sound vain, but we work closely with it all day, every day, so we know what has to happen, and we know whom to ask to make sure that, say, a particular heading is accurate. It is much more expedient for us to
update and maintain that document ourselves than to have to put it all into some kind of request, have someone else create it, and go through another round of questions. And the first and foremost issue, of course, is to work as a team. Hopefully, by watching us up here, you get the idea that we do, and that spirit was really with us throughout the entire process — the whole organization was behind this project, and that made it so much easier: the support of the staff, the managers, and the whole production group. Just a little closing statement, as in the paper that we submitted: we really did achieve what we wanted to do. Of course there is always room for improvement, and we will continue to improve, and we hope we will be active participants in the community with all of you. And with that, we will open it up to questions from the audience.
[Audience question:] I'm from a publishing organization, and I'm mostly interested in the slide you had about the heading "Introduction." It is handled differently by different journals — each journal has its own particular style — and what really surprised me was that in your old tagging these headings were not tagged uniformly, and yet they were handled correctly by a stylesheet on output; but in the new tagging you instead wanted to encode them differently — that is, to move this information out of the stylesheet and into the JATS markup. Why did you choose that rather than rewriting the CSS to continue to handle it? [Answer:] The decision was made so that we had more control over the data. Our work process can be fragmented: if we put that decision in the CSS, someone can change the CSS going forward, and who is going to make sure that all these journals still come out right? Taking it out of the stylesheet and putting it in our XML — where, you could argue, it is overstepping to have that kind of instruction inside the XML — gives us the control. It lets us say: OK, in this scenario this list has a label number on it, so wherever we send this content, we know exactly how it is going to come out. I don't know too terribly much about CSS, but browsers render things differently — what Firefox does, Explorer doesn't — and we have hit those situations before. This gives us the ability to always make that call ourselves. [Second answer:] To follow up on that: on one slide we showed that we used to generate tags and text on the fly to help our production systems and processes. We had an awful lot of generated text — all those labels were generated — and what we wanted to do was put that into the content, for the broader reason Jennifer just spoke to: when we
distribute the content, it is up to whoever receives it to display it however they need to. They do not have to guess, "do I generate arabic numerals for this particular journal, or roman numerals for that one?" And quite frankly, in this day and age AIP is always reassessing its display requirements anyway; so by removing the generated text from the environment and putting the labels explicitly in the content, we keep the control we were just speaking about over the final product. [Audience question:] Here's another question — I work for PMC. You mentioned that you use RSuite, and I was curious about that; infrastructure was probably important to your success, so what do you think of RSuite? [Answer:] I can't speak in great detail — that hasn't really been my focus — so let's have that conversation offline if you like. We are in the middle of a major RSuite implementation. The core of the facility is to have a real archive with version control — a master archive — so we have a production archive and then the publication archive, both under version control. Also, we are in a world where we have a lot of external systems:
external typesetters, external hosting providers, external content-enrichment vendors, and so on. That particular project is to provide the plumbing that connects all of that together. And of course we have
a large number of external data-feed customers, which Rich mentioned in his slides. One of the real benefits of JATS, and one of the reasons we chose not to customize JATS at all, is that we can hand our data to any external provider and say: this is JATS according to the Tag Set, out of the box — don't bother us, go read the Tag Library. [Audience question:] Wendell Piez, Piez Consulting Services — thanks for the plug. I have two related questions. One has to do with the fact that you were implementing a migration from one tag set into another, and therefore you had control over both ends — whereas a lot of people are coming at this with something rather different in terms of production, pulling data in from outside and trying to force it into shape. So the question is: what lessons from your experience would also apply to those other kinds of applications of JATS? The second part has to do with something you touched on with respect to training — the expertise you developed internally, which I think is really interesting, because one of the trends we are seeing is that the boundary line between the technical people and the editorial people is not as strong as it used to be; the editorial people are getting a lot of technical expertise, leveraging it, and that improves communication as well as control. So what lessons would you offer others who are working with JATS as a team? [Answer:] I would think, from the
training perspective: AIP was fortunate to have very talented technology programmers whom we could work with. But if you need to use a consulting firm, or you don't have those resources in house, having the training just to speak their language matters, because nobody you hire is going to know your data as well as you do — but if you are able to communicate in a common language with them and explain what you need, it makes it that much easier. So even if you are not writing the Schematron or the XSLT yourself, understanding how they work, and knowing that those are more than likely the tools someone is going to use to do the work for you, is a great benefit. Another thing Jennifer has touched on quite a bit: we would think, "we know the data, so the program will do this every single time" — well, that's not what happens. It really makes you think through every single scenario, because if a rule is not written right, files just drop through and you never know about it. You need to make sure that every instance is accounted for — only you know that — and you need to communicate in a way that is almost neutral: not your organization's jargon, but an industry-standard kind of speech. That is just so much more helpful. And the first part of your question — oh yes, what lessons would apply to organizations that are not migrating an old dataset into JATS but are actively publishing new material sourced from Word or whatever the case may be — a scenario that is maybe a little less stable. In that scenario, do you think that
the same idea about the technical level of expertise within the editorial group applies in the same way? [Answer:] Certainly — and something I want to clarify for myself, primarily because I am now also involved in the data analysis and in writing specifications. I don't mean to sound trite, but: documentation, at any point, whether you come from a migration or are starting new. The JATS documentation online is wonderful, with great examples, but when we went to figure out some of the tags that were sitting in our own dataset, we were at a loss — why is this a string of numbers? I had no clue; nobody was there who really remembered, and it took a lot of digging. So even at the very beginning, something might seem minor and make perfect sense to you and everybody around you — it is still important to have it noted somewhere, to have that information in a central repository that anybody can get hold of, so that going forward there is a trail for people to find how you got from A to B. And we are in a situation just as you are describing: we have a production unit, an editorial unit, that processes information on a daily basis. They had been familiar with working in XML editors, and their role has changed — now they are looking at the content itself, making sure that the authors' information is being preserved properly. What it comes down to is that documentation, and — more importantly, from our standpoint or anybody's standpoint — you need to annotate it with your own usage. Just because the JATS documentation says A, B, and C — well, how do you interpret A, B, and C in your environment? One of the things we have been doing to help with that is, when we write a Schematron rule, typing out why we are doing it; it really
helps, through the rule's comments, because it becomes that extra piece of documentation. It helps people in production understand: why are we doing this, why is this happening, what business rule is being enforced in this particular case — we can break it down and translate it into language the editorial side is more familiar with. One thing I'd like to add about the XSLT: if you could say it in a pattern, it could be written in the XSLT. That is basically the philosophy we went with — if you can find a pattern, it is really pretty simple to write, and that is how we worked. And as a little more context, just so the audience understands: AIP no longer copyedits or typesets in-house. The authors' manuscripts all go to offshore vendors and come back to us — right now in AIP XML, but the documentation that was produced as part of this project will of course be the basis for them returning JATS to us. They can't do that until we switch our platform, because we cannot currently host JATS online until we launch the new platform — it is never easy to change all the parts at the same time. But in the end, documentation is absolutely central, and the Schematron enforces that as well. [Audience question:] National Library of Medicine — thank you for this paper; I have already shared it, from the preliminary proceedings, with some organizations who were thinking about their data, as an example of doing it the right way. And I wanted to follow on the back of one of those questions: when you were writing your transformation specifications using XPath, did you realize how close you were to actually writing your own XSLT? [Answer:] We were excited about that — and we do write XSLT now. Jennifer and I now write small XSLTs ourselves; the big one is still to come, but whenever an XSLT needs to be written in the company, we are now the official XSLT writers.
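The "if you can say it in a pattern" philosophy mentioned above might look like this in practice — the legacy input element names (auth, sur, fnm) are invented for illustration; the output side is JATS:

```xml
<!-- Sketch: one proprietary construct, one template — a legacy author
     element is rewritten as a JATS <contrib>. -->
<xsl:template match="auth" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <contrib contrib-type="author">
    <name>
      <surname><xsl:value-of select="sur"/></surname>
      <given-names><xsl:value-of select="fnm"/></given-names>
    </name>
  </contrib>
</xsl:template>
```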
That was something we gained from XPath — learning it was really great. [Comment from the questioner:] We had a similar experience with a strictly written specification format — by the time it is handed over, you are most of the way there. The key thing is that organizations used to do these transformations in Perl or something like that, depending on what practical resources were available, and now it is much more self-contained. [Audience question:] Hi, I'm from Nature Publishing Group, and I'm just wondering how many people were involved in this project and how long it took, end to end. [Answer:] Primarily it was the three of us, and we had one dedicated resource to handle the main XSLT, which was really monstrous, plus maybe two or three developers giving a piece of their time to do the large amounts of processing — we work on Unix platforms, so we needed their help to run the files. So really it boiled down to four individuals spending most of their time on it, and we all had other responsibilities at the same time: we were responsible for sustaining the whole publishing operation, with its questions and support, simultaneously. We started around August–September 2011, when we began writing the initial specs, but because of other responsibilities and resources it was really about January of this year that we got into it full-fledged, and we spent most of the first and second quarters finishing
up and running tests on the files. So overall, about nine to ten months. One more piece I'd like to add: at the same time this team was converting the XML, we were also restructuring all the content assets — the graphics, the packaging of the content — as we moved from our old repository to our new repository, and of course that had to tie in with the XML correctly. So there was another set of projects going on as well. Thank you.


Formal Metadata

Title How Well do you Know Your Data? Converting an Archive of Proprietary Markup Schemes to JATS: A Case Study
Title of Series JATS-Con 2012
Part Number 2
Number of Parts 16
Author Faye Krawitz,
Jennifer McAndrews,
Richard O'Keeffe
License CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI 10.5446/30581
Publisher River Valley TV
Release Date 2016
Language English
Production Year 2012
Production Place Washington, D.C.

Content Metadata

Subject Area Information technology
Abstract The presentation will describe the challenges, benefits, and opportunities resulting from converting an archival collection of approximately 750,000 files to JATS. The goal was to migrate the American Institute of Physics (AIP) and member society archival collection from multiple generations of proprietary markup to an industry standard to create a true archive, all managed within a new, more controlled content management system. Integral to the process was the adoption and application of the XML technologies XSLT, XPath, and Schematron to transform and check the content. Sounds straightforward doesn't it? Perform a thorough document analysis, map out the transformation rules, convert the data. But is it? Have you accounted for all historical variations, generated text, metadata, nomenclature variations on XML file assets? Beside your core, don't forget about reuse for other products, edge cases, online presentation, distribution channels and staff training!
