mPach: Integrated Publishing and Archiving of Journals in HathiTrust



Automated Media Analysis

Speech transcript
I'm Kevin Hawkins, and I'm going to begin and hand it off to the others later. We all work for Michigan Publishing, which is based in the University of Michigan Library. We're presenting work that's been accomplished by various current and former staff members of the library, so it's not just us, but we'll do what we can to speak on behalf of those who are working on the project or have worked on it.

So, a brief outline. I'm going to give you an overview of the system that we're calling mPach, a package of tools for publication of born-digital journals in HathiTrust: what the system is and especially where it came from, the context for it. Then Bryan is going to give an introduction to the interface of Prepper, a component of mPach, and then Seth is going to give a little technical discussion of the Norm utility that's built into this, which is designed to convert, at this time, just Word .docx files to JATS XML.

Michigan Publishing is the publisher of the University of Michigan, based in the university library, as I was saying. We publish under various brands, including the University of Michigan Press, our flagship brand, and we offer all sorts of services related to publishing. We've long used a publishing platform called DLXS, which was developed at the University of Michigan Library, as our primary platform for online content. But we're hitting up against the boundaries of that system: it wasn't designed for publishing born-digital content such as journals, and so we need an architecture that will scale better in order for us to continue to grow. So we need a new architecture to work on, and one that is at hand for us is HathiTrust.

The name HathiTrust can mean either the partnership or the digital library repository being created by this partnership. It's a partnership of research libraries from around the world, administratively based at the University of Michigan Library. These partner institutions have created and continue to develop a shared repository, and it is certified according to the Trustworthy Repositories Audit and Certification (TRAC) criteria, a tool developed by the Center for Research Libraries. It is one of only a handful of repositories in the world that have gone through that very strict certification process. It has over 11 million digitized volumes in it; the last time I gave a version of this talk, in October, it was around 10 million, so I had to update that part of the slide. We're at nearly 500 terabytes of data. What's in here is mostly content that has been digitized by Google and given back to its partner libraries, including the University of Michigan, but we also have content in here from member institutions that were members of the Open Content Alliance or that digitized things on their own. And now we are going to have HathiTrust move into a new phase in supporting born-digital content, not just content already held on paper or other physical media in the member libraries as part of their print collections.

We're a library-based publisher, and so we exhibit this tension between the roles of publishers and archives. Publishers require flexibility in the work they do, because they're trying to be innovative, they're competing with one another, and they're trying to offer better services and new things in response to user demand. Libraries are also trying to respond to user demand, but they (and here I'm using libraries and archives interchangeably) need a certain level of stability in the systems they build and the content they preserve, because they carry with them this mandate to preserve the content in the long run. We believe that HathiTrust provides us with an infrastructure on which we can provide long-term preservation of born-digital open-access journal content, and discoverability of this content, while also allowing us to build innovative services on top in a way that won't interfere with preservation.

The main design principle of the system, which I want to spend just a moment on because it's key to the whole way we're constructing this, is that the process of archiving the content is going to happen as a by-product of the act of publication rather than something that happens after the fact. The way I tend to imagine systems working in general, and I think most of them out there in the world work this way, is that the publisher publishes the content and then someone else takes care of archiving it. Portico would be an example of this, and I don't mean to call them out in any way; that's really great work. But in the system here, the very act of publishing is tightly integrated with the repository, so that, one, you can only publish if you have successfully archived, and, two, you can only produce a revision, to fix a typo or issue some kind of erratum, by redepositing to the archive. Because they're integrated, you can't bypass the archive just because that would be more convenient for you when you want to fix that typo. We're really trying to build these in together in a way that ensures that the version in the archive is the very latest, the very best version, and the whole system is built around using that version to derive the access version for the user.

So why are we here at the JATS conference? Well, for born-digital open-access journals we've chosen JATS as our format for preserving the content. Of course I don't need to tell you this; you all know it. There's an increasing coalescence of the publishing industry around this open, non-proprietary standard. In particular, of the three flavors of JATS, we've chosen the Publishing, or Blue, tag set for this born-digital literature. Why Blue, you might ask? Because we're not just archiving journals; we're also publishing them. As you know, the Blue tag set is the middle one. It is a bit more constrained than Green, and because this is new content going in, we feel that we can in fact constrain the structure enough that we don't need the full flexibility of Green; but unlike Orange, we do need the important metadata elements that go with it. So that's why we settled on Blue.

Now a quick overview of the architecture of the system. Here is a very high-level diagram of the components of mPach. Broadly speaking there are three main components. On the left you have a peer-review and editorial management workflow. This is quite open-ended, and we're going to consider Open Journal Systems (OJS) to be the default option here, because it's very commonly used in university libraries among HathiTrust members and it's open source and freely available to use. But we want the system to work with any other peer-review and editorial workflow management system you might have, assuming it could talk through an API. We will be working on OJS and developing a plug-in for it that will talk through an API. Over here in the middle we have Prepper, the part of mPach that we'll talk about for the rest of the presentation. This prepares the content for ingest and archiving into HathiTrust, which is here on the right. HathiTrust already exists and is in full production, but we are making some enhancements to it in order to support born-digital open-access content stored in JATS, because right now most of its systems are designed around digitized, scanned page images of books from libraries' collections.

In brief, the envisioned workflow is that the authors will submit their articles, the manuscripts for consideration, into OJS or another system. The editors of the journal will put those articles through whatever peer-review process is in place for this academic journal, and after some discussions with the author, etc., when they finally choose to accept an article, they will prepare it to send to Prepper. What we're looking at at this point, and this shouldn't come as a surprise to you, is the applying of styles in Word according to a predefined set of styles that will serve as a guide for the conversion tool to convert to JATS format. So we're starting with .docx as our lowest-common-denominator format, the one that is commonly used across journals and authors, and we're open to supporting other formats in the future.

Prepper takes a .docx file. It provides a web interface to upload the file, runs it through the Norm tool that converts it to JATS, and lets you review the conversion. You get to see the article as rendered and check whether something didn't come through right. It will tell you if there were errors in the conversion, and you go back and fix them and upload the file again. It allows you to supply metadata that wasn't in the document, to supply alt text for images, to upload higher-resolution versions of the images than you might have had in the Word document, and to upload supplementary material.

All these pieces are put together as a whole package and then submitted to HathiTrust, reusing HathiTrust's architecture. The whole package goes in with a METS metadata wrapper describing the package, the JATS itself, all the accompanying files (the embedded images and the supplementary material), and a MARC record for the metadata for that individual journal article. It then goes into HathiTrust, and it's going to live in that repository and be discoverable among all those other objects in the repository. But we're also going to be providing a way in the system to view whole issues and journals within HathiTrust, rather than having them just float around there as loose articles, so you will have a way to discover and navigate among just the articles in a particular journal.

HathiTrust also provides not only an API for metadata but also an API to retrieve full content from HathiTrust when the object can legally be made available to you, so essentially if it's in the public domain. We're going to be setting that up so that somebody could retrieve the original, or the derived HTML, or the derived PDF, or, if you chose to upload a PDF that was manually typeset, it could deliver that back, so that you could perhaps create your own interface to the journal that lives outside of the one provided through HathiTrust. That's a high-level overview of the system.
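As a rough illustration of the design principle described in this part of the talk (publication happens only as a by-product of a successful deposit, and the access copy is derived from what was archived), here is a minimal Python sketch. Every name in it (`SubmissionPackage`, `InMemoryRepository`, `publish`) is hypothetical and is not part of mPach or of any HathiTrust API; the in-memory repository merely stands in for the real ingest process, and the package fields only mirror the pieces the speaker lists.

```python
from dataclasses import dataclass, field


@dataclass
class SubmissionPackage:
    """Hypothetical stand-in for the package described in the talk."""
    jats_xml: str                      # the article itself, in JATS
    mets_xml: str                      # METS wrapper describing the package
    marc_record: str                   # MARC record for the article
    images: dict = field(default_factory=dict)         # file name -> bytes
    supplementary: dict = field(default_factory=dict)  # file name -> bytes


class InMemoryRepository:
    """Toy archive that keeps only the latest deposit per article."""

    def __init__(self):
        self.items = {}

    def deposit(self, article_id, package):
        # Refuse incomplete packages, so a failed deposit leaves the archive untouched.
        if not (package.jats_xml and package.mets_xml and package.marc_record):
            raise ValueError("incomplete package; nothing archived")
        self.items[article_id] = package


def publish(repo, article_id, package):
    """Archive first, then derive the access copy from what was archived."""
    repo.deposit(article_id, package)   # raises if archiving fails
    archived = repo.items[article_id]
    return archived.jats_xml            # the access version comes from the archive
```

In this sketch a correction is just another call to `publish` with a revised package, which replaces the archived copy; there is no code path that changes the access version without going through `deposit`.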
I'm now going to turn it over to Bryan, who's going to take things from here.

Hi, I'm Bryan. I'm going to talk briefly about Prepper. Prepper is where, as an editor, you log in. You can organize articles into volumes and issues, and you can upload new content and prepare it for publication in HathiTrust. Prepper mainly does these things, but it also guides the conversion process for you: it will use Norm or some other tool to do the conversion from .docx to JATS. Finally, this is a Ruby on Rails application hosted on HathiTrust infrastructure.

So let's just jump to the process of uploading a new article and publishing it to HathiTrust. When you click Choose File, a file dialog pops up, and you just select the .docx file that you want to convert. When you're ready, just click Submit, and at this point you've already initiated the process of converting it. In the first step of our upload process you get a conversion report. At this point Norm has already run and the metadata has been extracted, so you get a chance to go through and review it and make sure that it's all correct. If you're happy with this, scroll to the bottom of the page and click the Next button, which you can't see here, and go on to the second step. This is where you review the media that was extracted from the document. This gives you the opportunity to provide things like alternative text, and to replace images with archival-quality images, for example at a higher resolution than what was in the Word file. When you're happy with this step, you go on to the third, where you get a preview of the content as it will appear in HathiTrust. This gives you an opportunity to see exactly what you're going to publish when you finish with this process. Once you've reviewed this and you're ready, you go on to the next step, which is the supplemental material step. All of this material is optional. You can upload a camera-ready copy, a PowerPoint presentation, a database, or other supporting information here. When you're ready, you click Next, and you're at the final step, where you get the opportunity to publish. Even at this point you don't actually have to publish; you can go back to the dashboard and do other things. I should say that you can pause at any point in the process and come back to it. When you're ready to actually publish to HathiTrust, you click Submit and get a confirmation, and at this point you're returned to the journal page. Here there's an articles section; this is where you would see anything that you've started work on and haven't completed. This particular article has been submitted. It will take a couple of days, and then it will appear in HathiTrust. So the process of actually uploading content to HathiTrust through Prepper is a pretty intuitive and simple process.

I'll just show a couple of screenshots of how it will appear in HathiTrust, to give you an idea of how the content will look to end users. This is an article view. If you look on the left-hand side, you have some simple branding, metadata, and license information, and you can download the article in multiple formats, along with the supplemental material. On the right-hand side you can access the content. The content here appears exactly the same way that you saw in step 3 when you were uploading content through Prepper. And here you get a general idea of the journal view: you can browse through the volumes and issues and so on. So that's it in terms of Prepper and HathiTrust. I'll hand it over to Seth.
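The five-step upload process walked through above could be modeled, very roughly, as a small state machine. The sketch below is only an illustration: the step names are paraphrased from the talk, and the `UploadSession` class and its methods are hypothetical rather than anything taken from Prepper (which, per the talk, is a Ruby on Rails application; Python is used here purely for illustration).

```python
from enum import IntEnum


class Step(IntEnum):
    """Upload steps as described in the talk (names are paraphrased)."""
    CONVERSION_REPORT = 1      # review what the converter extracted
    REVIEW_MEDIA = 2           # alt text, higher-quality images
    PREVIEW = 3                # see the article as it will appear
    SUPPLEMENTAL_MATERIAL = 4  # optional extra files
    PUBLISH = 5                # final confirmation


class UploadSession:
    """Hypothetical session object; its state persists between visits."""

    def __init__(self):
        self.step = Step.CONVERSION_REPORT
        self.submitted = False

    def next(self):
        # Advance one step; stays on the final step once reached.
        if self.step < Step.PUBLISH:
            self.step = Step(self.step + 1)
        return self.step

    def submit(self):
        # Publishing is only possible from the final step.
        if self.step != Step.PUBLISH:
            raise RuntimeError("finish the earlier steps first")
        self.submitted = True
```

Because the current step lives on the session object, pausing and resuming amounts to storing the session and picking it up again later, which matches the speaker's remark that you can stop at any point and come back.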
I'm going to talk about the conversion, and that means talking about Norm. Norm is the tool that we have developed. It's a command-line application, written in Python, that converts documents to JATS XML. It's pretty simple conceptually: it unzips the .docx, parses the relevant XML to build an internal representation of the content and its elements, and maps that, based on a config file, to JATS elements. Then it takes that representation and creates JATS XML. Right now we use it with Prepper, not as a standalone command-line application, but the idea is that you can do both, so in the future you could pipe things through it in a batch process. Right now it takes .docx. We kind of jumped the gun and thought maybe we could do other formats too, so there is some code written, but it's not really there yet; we need to go back and rethink how this is working a little bit.

The output right now is just a folder. In there is the JATS XML file, a folder for assets and resources, and the supplemental materials if there are any, as well as the typeset PDF. This is the output that goes to Prepper. There a METS file will join it, the MARC record will join it, and it will all be wrapped up into the SIP, the submission information package, and that's what actually gets put into HathiTrust.

As Kevin mentioned, we use Word styles; this is how we're doing things. You can see here, it's small, but the abstract right there is the blue paragraph: you style it with the abstract style in Word, and you have to go through the entire article and do this for the headings, the body, all the different parts. That's how we're doing it right now, anyway. Conceptually it's pretty similar to how other style-based tools work. The front matter, which you see here on the slide, is simple; the body gets kind of complicated depending on the kind of content, which we'll talk about.

So this is how it all works together. You take the Word doc, and the Word styles correspond to JATS elements in this configuration file: this is the Word style, this is the JATS element. The configuration also says which section these styles and elements belong in: the front, the body, or the back. So you unzip the .docx and parse the XML, and then Norm starts with an empty internal representation, a data structure, and it goes through the .docx. For each element it determines the style, which JATS element it should be marked up as, and which section it belongs in (front, body, back, or whatever), and it creates a tuple. You get a giant list, basically, of all these elements.

So this is what the Word doc looks like. This is a really simple example. This is the title; the style is called article title, spelled out. When you unzip that Word doc and look inside of it, this is what you see: the title is in there, with some italics inside the title. And this is what the configuration file looks like. You see the front section there, so the article title style equals article-title, the JATS element, and it belongs in front. Then there's the parents section down here, which says that article-title belongs to title-group, title-group belongs to article-meta, and article-meta belongs to front.

This is put into Norm's internal representation as tuples. In the tuple you see here, there is the JATS element; the title text itself, with the italics marked up inside it; the path of parent elements; and last, the Word style itself. You end up with this giant data structure internally. Norm then creates an empty DOM tree and starts to fill it in: it recursively goes through the structure and puts things in the right places, ideally, and then it prints out the XML. And this is an example of what that would end up looking like. It's an extremely simple example.
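The pipeline just described (unzip the .docx, read each paragraph's Word style, map styles to JATS elements and parent paths through a configuration, collect tuples, then fill in a tree and print XML) can be sketched in a few dozen lines of Python. This is not Norm's code: the mapping tables, the style IDs such as `ArticleTitle`, and all function names are assumptions made for illustration, the real configuration file format is not reproduced in the transcript, and inline formatting such as italics is simply flattened here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical configuration: Word style ID -> JATS element, and parent paths.
STYLE_TO_ELEMENT = {"ArticleTitle": "article-title", "BodyText": "p"}
PARENTS = {
    "article-title": ("front", "article-meta", "title-group"),
    "p": ("body",),
}


def read_styled_paragraphs(docx_bytes):
    """Return (style_id, text) for each paragraph in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    pairs = []
    for para in root.iter(f"{{{W}}}p"):
        style = para.find(f"{{{W}}}pPr/{{{W}}}pStyle")
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
        pairs.append((style_id, text))
    return pairs


def to_tuples(pairs):
    """Build (jats_element, text, parent_path, word_style) tuples; skip unmapped styles."""
    return [
        (STYLE_TO_ELEMENT[s], text, PARENTS[STYLE_TO_ELEMENT[s]], s)
        for s, text in pairs
        if s in STYLE_TO_ELEMENT
    ]


def build_article(tuples):
    """Fill an empty tree, creating parent elements on demand."""
    article = ET.Element("article")
    for name, text, path, _style in tuples:
        node = article
        for step in path:
            child = node.find(step)
            if child is None:
                child = ET.SubElement(node, step)
            node = child
        ET.SubElement(node, name).text = text
    return article


def make_demo_docx():
    """Create a tiny in-memory .docx containing only word/document.xml."""
    doc = (
        f'<w:document xmlns:w="{W}"><w:body>'
        '<w:p><w:pPr><w:pStyle w:val="ArticleTitle"/></w:pPr>'
        "<w:r><w:t>A Simple Title</w:t></w:r></w:p>"
        '<w:p><w:pPr><w:pStyle w:val="BodyText"/></w:pPr>'
        "<w:r><w:t>First paragraph.</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("word/document.xml", doc)
    return buf.getvalue()


article = build_article(to_tuples(read_styled_paragraphs(make_demo_docx())))
print(ET.tostring(article, encoding="unicode"))
```

A flat list of tuples keeps the conversion easy to follow for a simple article like this one; it also makes visible why the speaker goes on to say that the representation becomes hard to reason about once nested body content is involved.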
We got that far and we were pretty excited about it, and it worked pretty well for the one article that we had made it work for. Then we started to feed different kinds of articles through it, scientific articles for instance. We threw a lot of different kinds of things at it, and it's not doing the things we want it to do, so we're going to have to make changes. That internal representation is fine for small projects, but it's pretty hard to reason about. Also the styles themselves: you can see a world where, with all these different kinds of articles, the styles get out of control and there will be hundreds of them, so how do you manage that? Those are a couple of different challenges.

Right now, this week, we're actually going through and looking at options. There is actually a lot in this space, I guess. There's a tool called Pandoc, and the one we've looked at a lot is called meTypeset, which is a lot of XSLT with a Python wrapper. It's built on top of OxGarage: it takes Word docs, turns them into TEI, and then turns that into NLM. I'm not going to go through it completely. It's an existing project. It doesn't do a lot with front matter, just the body, but it handles the body well in most cases. So we're looking at taking Norm and using the Word style concept for the front matter, and maybe one of these other tools for the body, depending on whether they can handle the kind of content we throw at them. Or we might attempt to refactor Norm, but maybe use something more sophisticated this time to handle the complexity going forward. And that's it.
Any questions?

I have a question about who is doing the styling of the content and who is uploading it. Is it the authors, or the journals?

Our initial plan is to have Michigan Publishing staff do the styling of manuscripts that come in, especially in the early stages as we're working out the bugs, which essentially replicates our current workflow for doing things in DLXS. The hope here is that this will ultimately be easier to use than our current internal tools, which we haven't bothered trying to get other people to use. So there will also be the option that perhaps a graduate assistant working on an academic journal, or the journal's managing editor (it depends on how they're staffed; these tend to be small journals), will be able to do this themselves and not wait on us. In the long run it's possible that we would say that if you do it yourself it's free, and otherwise you can pay us a small amount to do it, but we're a long way from that.

Thank you.

A very interesting topic. Just to confirm: the Word file has to follow some kind of template structure? You cannot just style it arbitrarily in Word?

That's right. When we showed the screenshot of the Word document with styles, all the different paragraphs were different colors, because the template that includes the styles defines them in these crazy colors, just to help you do the styling in Word. When they come through and are rendered in HathiTrust and in JATS, the colors disappear. So you have to use our set of styles. But we showed that configuration file in which the names of JATS elements are mapped to the names of styles, and the system is written in a way that you could use your own custom template in Word, with its own names of styles, and write the mapping for that. Perhaps you have a whole set of Word documents that have already been styled using some other set of styles; you could write a configuration file to do that mapping, and then you wouldn't have to restyle them according to the default styles.

Very interesting. You talked about front matter, which is nice metadata that Norm handles really well. You talked about the body, where Norm is not as refined, but you're working on alternative solutions or on refactoring Norm. What you didn't really talk about was the back matter, the references. What do you do for those?

Right now, and I should have brought that up, Norm doesn't do that. We found that the meTypeset tool actually does it quite well. What's interesting is that it actually infers those references in the text in an interesting way: if there's a parenthesis after a name, it assumes that's a reference and tries to link them up, and it's somewhat successful. A lot of this stuff is still something we need to work on.

But even though meTypeset attempts to match up the reference within the article, like an author-year citation, with the reference at the end, it doesn't at this point parse the citations to identify authors, titles, names of publishers, dates of publication, and those other component pieces within the citations. We knew from the beginning that would be quite a task, so that's been near the bottom of the list. For now the idea is to have a citation labeled as a citation without all the pieces inside. We'd be interested in offering the option, and again, this is all meant to be pretty modular in terms of which pieces you choose to use the system with. We would certainly like people to build a plug-in to a citation-parsing tool, for example the one from Inera, or any other tool that you might have on hand to help you with the parsing of the citations.

Thank you, that's what I wanted to know: internal citation pieces, things like that.

I'm from the Optical Society. In the beginning of your presentation you mentioned that the event of publishing and the event of archiving are tightly integrated, so each event, a correction for example, leads to the creation of a new version of the document. Well, by definition, publishing something is making something public, let's say for the first time. So my question is: is there a facility or an option in the system to set a version of the article as the baseline version of record, so that the user would be able not only to get the latest and greatest version but also to see it exactly as it was published at the time it was made available for the first time?

Right now HathiTrust does not have versioning built into it, so if an item in HathiTrust has been reingested, you can't see the earlier version. But we want to see that developed, so that a user could go back and see earlier versions. This interface here doesn't yet have a place where you would be able to navigate to earlier versions, but somewhere, in a location to be determined, once HathiTrust has this in place, there would be a way to have the latest version and view earlier versions, something akin to the way Wikipedia displays revisions, so that you could go back and see which version is which. We haven't imagined a way that you can make an annotation saying something like "this was the first published version", because the idea here is that the very first one is the first published version. You wouldn't be able to go back and see versions that were under way in Prepper in the earlier days, because they never made it to HathiTrust.

Liam Quin, from the organization responsible for the World Wide Web. I just wanted to file a minor bug report against your paper, because it's a good paper and a lot of people will be looking at it, so it's worth pointing out. I mentioned this in October at the meeting: XSLT stylesheets can of course read configuration files, and although you know that, people reading your paper may not, because your paper implies that you can't. So it's quite possible to have a stylesheet that reads configuration files.

Yes. Thank you for reminding us; we were trying to remember which things we wanted to revise.

Optical Society. The HTML I'm looking at looks very nice. I know you mentioned some things you're still working on in terms of more granular tagging, etc. Is XSL-FO one potential target for the XML that you're producing, or is it really mainly for HTML display?

Over here, under "Get this article", there are links: we have one for downloading the PDF and one for the XML. The rendering here is based on the JATS Preview XSLT package. For generation of the PDF, we started out looking at XSL-FO and then later decided to go the route of using one of the various tools out there that take HTML and turn it into PDF, that is, taking this rendering and turning it into a PDF. We're planning to use that for this, so we're not using XSL-FO.

In your config file I see how you list the parents of the element that you're getting ready to write. How do you control the order of those elements within the parent? Because there could be other things: there could be a subtitle and an alt-title and a trans-title-group, maybe. So how do you get them in the right order in the XML? I imagine the body just follows the document order, but in the metadata you might have to rearrange things.

I was under the impression that this was working already, but recently we have had an issue with the surname and the given name, where order really counts. So it's possible that the ordering is not actually a feature yet. For what it's worth, the programmer who wrote this used the equals sign in both cases in that syntax, and that is of course misleading. It doesn't mean equivalence in that second set; it tells you that this one belongs to that one, but it is not an equivalency.

Any other questions?
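The ordering concern raised in the last question (metadata elements may arrive in document order but must appear in the order the tag set expects) can be illustrated with a small post-processing pass in Python. This is a sketch under stated assumptions, not a description of how Norm works or was later fixed: the `CHILD_ORDER` table is a hand-written, partial approximation of two JATS content models and should be checked against the tag set documentation before being relied on, and `reorder_children` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed, partial child orderings for illustration only.
CHILD_ORDER = {
    "title-group": ["article-title", "subtitle", "trans-title-group", "alt-title"],
    "name": ["surname", "given-names", "prefix", "suffix"],
}


def reorder_children(root):
    """Sort the children of known parents into the expected order, in place."""
    for parent in list(root.iter()):
        order = CHILD_ORDER.get(parent.tag)
        if order is None:
            continue
        rank = {tag: i for i, tag in enumerate(order)}
        # sorted() is stable, so repeated elements keep their relative order;
        # unknown children are moved to the end.
        parent[:] = sorted(parent, key=lambda child: rank.get(child.tag, len(order)))
    return root


name = ET.fromstring(
    "<name><given-names>Kevin S.</given-names><surname>Hawkins</surname></name>"
)
reorder_children(name)
print(ET.tostring(name, encoding="unicode"))
```

Keeping the ordering in a separate table, rather than overloading the parent mapping, also avoids the ambiguity the speaker mentions, where the same `=` syntax was used for two different relationships.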


Formal Metadata

Title mPach: Integrated Publishing and Archiving of Journals in HathiTrust
Title of Series JATS-Con 2013
Part Number 15
Number of Parts 16
Author Johnson, Seth
Smith, Bryan
Hawkins, Kevin S.
License CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI 10.5446/21795
Publisher River Valley TV
Release Date 2016
Language English
Production Place Washington, D.C.

Content Metadata

Subject Area Information technology
Abstract mPach is a package of tools being developed to provide a modular platform to enable the publication of born-digital open-access journals in the HathiTrust repository. One of the chief technological challenges for this system is the conversion of edited manuscripts to an archivable format. We selected JATS as our preservation format because of the increasing coalescence of the publishing industry around this open, non-proprietary standard. This paper provides a technical overview of the mPach platform, with special attention paid to the design and functionality of Norm, a tool being developed to convert Microsoft Word documents to JATS.

Related Material

