Thank you. I am Katoh, from Digital Communications, and I am going to start my presentation. This is today's agenda. First of all, I will briefly introduce J-STAGE and its services. Then I will talk about the bibliographic XML creation tool from a technical aspect, and I will explain this tool with an online demonstration. In the end, I will show you the conversion rates and some future improvements. So,
let's begin with a brief introduction to what J-STAGE is. Some of this was already presented yesterday, so some items have been covered. J-STAGE is the major e-journal publishing platform in Japan, provided by JST, the Japan Science and Technology Agency. About 2.4 million articles are now registered. In May this year J-STAGE3 was launched, and this new platform accepts articles in JATS format, as full text or as bibliographic information only. Here is a
diagram of the J-STAGE publishing platform and related services. J-STAGE aggregates journals and distributes them to the world. J-STAGE communicates with JaLC, the Japan Link Center, and the Japan Link Center recently became a new DOI registration agency. The tool I introduce today is provided within J-STAGE. This
diagram describes the system in detail. The main part of J-STAGE consists of two parts: the registration system and the public system. Users access the public system from the Internet. Besides these systems, there is the bibliographic XML creation tool, and it is provided for the academic societies using J-STAGE. This tool accepts PDF files and converts them to JATS bibliographic metadata XML. The created data can be registered in the J-STAGE management system. OK. So you
might have a question: why was this tool developed? There are some reasons. I believe that the people who come here feel that XML is ideal, because it is very simple and its structure is easy to understand, and that it would best be created from the author's stage. Many people have presented about this, and I think many people understand the merits of XML. But what is preventing this from happening is the fact that it is difficult for authors to write in XML format, and in most cases they do not produce XML at all. Some printing companies do not have the capability to work with XML, and to use XML a higher cost is required. There are also some reasons
why we convert from PDF. As I said, various tools are used in the production process: for writing, Word or TeX is used; for printing, DTP tools; for publishing, systems are also used; and for distributing, PDF or HTML formats, and XML, are also used. But it is unrealistic to try to support all the various formats like this. So we considered the target group, the users of J-STAGE, and the intended purpose of the tool. More than anything, almost all academic societies have PDFs. So the method of extracting information from the general layout of the PDF was selected. So
I will show you the workflow of this tool. The workflow of this tool has two phases. Phase one is template creation. In this phase, users create a template using some sample article PDFs. The template holds items such as the page layout, block feature information like line spacing, font style, and font size, and character string information in the PDF. The template uses this information for extracting metadata. In phase two, the conversion of the PDFs to JATS XML takes place. Individual article PDFs are uploaded to the tool, the bibliographic information is then extracted from each PDF based on the template, and then JATS XML is generated. I will describe it in detail in the demonstration. Before the demonstration, I will speak about the input and the output. Here, the source
data of this tool is PDF. There are some restrictions on the PDF. The supported PDF versions go up to 1.5. This tool also accepts higher PDF versions, but new functions of PDF are not recognized. There are other restrictions: rasterized, scanned PDFs cannot be used, and with PDF security the PDF content cannot be analyzed, so PDF security needs to be turned off. OK. Next, the output format is the JATS XML format. It has bibliographic elements only and is compliant with the J-STAGE XML submission guidelines. These guidelines were also talked about yesterday. So let's start the
demonstration. The contents
are here, and I will describe them using the tool. How many people here can understand Japanese? The interface is only in Japanese, so please use your imagination. So I
let practice is an kinds character can you understand and he has very very ordered Chinese character fonts the almost the media what the clown this is not in and FIL long what the same yes this is the last these 3 has 3 words so XML in barrier here's small kinds characters beds OK and Hobbit piece OK this is a fish laughter and yes this is 3 dimensions character so I can rotate like these I this is all just 1 game is that these can have we're not be used in cities and I have this at but I have
no idea Peter so of course I create a new template in phase on and mentioned before so it will be in this eastern article 2 new templates being made here this is
also a radio beams and here's comfort setting think aren't these attempt fall in the created and add new template pot and here all day so now new template manager so core and the animal and resurrects something PDF the so some PDFs part I use these PDFs here can
see the Japanese title and subtitle, the English titles, and the author names and institutions here, and at the end the abstract, written in Japanese and English. As a sample PDF, it is desirable to use one that has items such as the title and subtitles. These are radio buttons; this is where you choose whether to create a new template or import an existing template. I use "create new" and press the button. The pages of the sample PDF are all rendered. Next.
So the next operation is the margin setting. With this operation, the body region and the header and footer regions are separated based on the settings. After the margin settings, the PDF is analyzed by the PDF analyzer library. This library was developed in house, and headings, paragraphs, columns, and other text boxes are recognized on each page. It checks the fonts, the text, and other objects on all the pages. First, the PDF analyzer takes the characters and sorts them by coordinate, because the order of the text that you can see and the content order in the PDF are not the same. After that, the PDF analyzer creates text boxes and text lines, makes columns by finding the gaps beside the lines, and then decides the text flow
order. At last, it makes the lines into one paragraph by recognizing indentation and trailing blank space. The blank space characters are cleaned up, and the hyphenation needs to be removed at the line ends. This is done on a dictionary basis, so the line-end hyphens are removed where the joined word is found.
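The analysis steps described above (sorting characters by coordinate, grouping them into lines, and removing line-end hyphens on a dictionary basis) can be illustrated with a minimal Python sketch. This is not the in-house PDF analyzer from the talk: the `Char` record, the tolerance values, and the function names are assumptions made for illustration, and column detection by gaps is left out.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """One character with its page position (hypothetical record)."""
    x: float
    y: float  # assumed to grow downward on the page
    text: str


def group_into_lines(chars, y_tolerance=2.0, space_gap=1.5):
    """Sort characters by coordinate and rebuild text lines.

    Characters whose y differs by at most y_tolerance from the first
    character of a row are treated as the same line. A horizontal
    distance larger than space_gap between neighbours becomes a space.
    """
    rows = []
    for ch in sorted(chars, key=lambda c: c.y):
        if rows and abs(ch.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(ch)
        else:
            rows.append([ch])

    lines = []
    for row in rows:
        row.sort(key=lambda c: c.x)
        parts = [row[0].text]
        for prev, cur in zip(row, row[1:]):
            if cur.x - prev.x > space_gap:
                parts.append(" ")
            parts.append(cur.text)
        lines.append("".join(parts))
    return lines


def join_lines(lines, dictionary):
    """Join lines into one paragraph, removing line-end hyphens.

    A trailing hyphen is dropped only when the joined word is in the
    dictionary; otherwise the hyphen is kept as a real hyphen.
    """
    words = []
    for line in lines:
        tokens = line.split()
        if words and words[-1].endswith("-") and tokens:
            candidate = words[-1][:-1] + tokens[0]
            if candidate.lower() in dictionary:
                words[-1] = candidate
            else:
                words[-1] = words[-1] + tokens[0]
            tokens = tokens[1:]
        words.extend(tokens)
    return " ".join(words)
```

For example, `join_lines(["The conver-", "sion of well-", "known files"], {"conversion"})` joins "conver-" and "sion" because "conversion" is in the dictionary, and keeps the hyphen in "well-known" because "wellknown" is not.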
OK, the analysis has finished. On the PDF image you can see the recognized text boxes, and on the right side is the block list, which we can use to extract the metadata information. So I select a text box and generate one sample block. "Create a new block" is here. OK, this is the block setting panel. Here is the block name, like "title". The next part is the section range. There are two types of blocks: one is paragraphs with a heading, and the other is paragraphs only. I use paragraphs only. The next part is the text setting: its format, font name, font size, style, and prefix and suffix. The font sizes and text styles are taken from the PDF, so I uncheck these. The last part is the searching range: this is the page numbers setting. And here is the region setting by coordinates, like here, but I don't do this now. So I press OK.
persistent OK this book peace and recolonize at so say this a you don't have created here and some more samples for pregnant
here you we are in which here's a humorous that hitting on or so I use these the attack form quality the OK keywords the same for every but so unchecked here arms prefix but its contents ideas this and
he's OK you know why the this is
the end of block selection. This sample has 16 blocks. The next process is to assign JATS items to the blocks. I said "JATS item"; it is only a name, and it is not the same as an XML element of JATS, because titles need article-title and trans-title elements, and for keywords, kwd-group and kwd elements are also needed. So I call these JATS items, or simply items. So, let's go back to the
system. Here is the assignment screen. The list shows the JATS items: title in Japanese, subtitle, and the English title, subtitle, and abstract. So I select the Japanese title here. OK. I can assign JATS items to blocks one by one, but more complicated cases need further settings. Here is the Japanese title: the Japanese title block is one string, and the whole string is the title, so I don't need to change this setting. But keywords
need to be divided by commas, so a setting is needed here. Here is the JATS item setting: this is the block name and the JATS item name, and the next part chooses whether to use whole blocks or sections by condition, headings or paragraphs. There is one more setting: this is the separator. I use the comma. OK.
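The separator setting for keywords can be sketched in Python as follows. This is a minimal illustration, not the tool's own code: `split_block`, `build_kwd_group`, and the "Keywords:" prefix are hypothetical names and values. The element names `kwd-group` and `kwd` are from JATS.

```python
import xml.etree.ElementTree as ET

# Namespace of the built-in "xml" prefix, used for xml:lang.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def split_block(text, prefix="", separator=","):
    """Strip an optional heading prefix and split on the separator."""
    if prefix and text.startswith(prefix):
        text = text[len(prefix):]
    return [part.strip() for part in text.split(separator) if part.strip()]


def build_kwd_group(keywords, lang="en"):
    """Build a JATS kwd-group element with one kwd child per keyword."""
    group = ET.Element("kwd-group")
    group.set("{%s}lang" % XML_NS, lang)
    for keyword in keywords:
        ET.SubElement(group, "kwd").text = keyword
    return group


if __name__ == "__main__":
    block_text = "Keywords: PDF, XML , JATS"  # hypothetical block content
    group = build_kwd_group(split_block(block_text, prefix="Keywords:"))
    print(ET.tostring(group, encoding="unicode"))
```

A title block, by contrast, would be assigned as one whole string, which is why no separator is needed there.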
our the he is and this conversion botherin efficiencies and all new window here
will get converted his Japanese title aren't misrecognised books the she us here aren't field he was divide this quantity and not being called will not to uh temperature
here's almost completed temperate of here many box here and height is here as abstracts you can see so I think this where region on this book His doing
this conversion results on title and his hot typos these design is similar to His essays is public the standard and Japanese tighten keywords and abstract on English here on foreign is is divided into these items OK so I use this template for next explanation a lot
to the slides again. Next is phase two: converting PDF to JATS XML using this template.
sample PDF file is really the same result so I adopted some pdf into cheap fire I his seat then of gold to the
and here's boring and issues in these and create the issue in the the which is this the sheer all after is no so by pdfs static the template using conversion concentric fighters I used the steep 5 carbons for the last OK and upload of it for what files he's appeared on the status is in progress convergence may be done in party but I think the on the rear seats and all the the Pacific Ocean responses various also I I using the same old piece you know all conversion duration is an test accounts for image creation and 15 seconds for extend collision with the other 10 paces so 1 must be used only the to what Boeing and initiated and recognized is up here this year and but it's still from these hooks on the status of the the this is our sample temperature PDF the here's recognized the this is
XML editing doing the so a lift binary is PDF marriage on same as templates sitting and right side the it's I like title subtitle English heightens abstracts used here all and if you stepped takes a box recognizes region you will appear on it is also changing a is include page she's divided so but so this might be we we must quite OK 1 the stadium we know how was time time here's title and abstracts much relief mission like but you type except dates Stefan L. pages copyrights both as institutions 2 hours and before and is and also John our information and if Hughes fish so say Boston and it is
this guideline check with ground and now 1 analyzed you of war publish this is not it is here greater you on this is wrong so I in thank you for handle these time has also when reviews if you change life and so said
temporary Fred and you will be updated soon new prior this year but if you find some recombination you can get quite data from CDF like this that so said here on after
editing you can down load xm here and open G. priorities and also for these foreigners are Xist is guide rang compliant is John recall this volumes and each issuer's on the page she's of here always you it will always have XML and takes in extracted from PDF on exam as here which of scene this is a result XML of to my regret our duty binaries z a pointer follow and here's a j stages John our called and out general tighter is an excited font user on his ideas then for people that and the but the season remains valid because and yes it is registration system where you sort it all must kind here's titles RT tighter on hepatitis and quickly pair English 1 you uses all authors here's the fossils are higher scores you by the Japanese the sensory style and listen site will minus 1 is here OK for think in
editing window all and go to all of the top here you can add also this here hostile to please see my name found English can't tools couple don't call take hold the case under his books and I can no this is the section is hidden and just by here it gen here's a tunnel walls the current optional items like our mountainous you always all biographees here it and but institution have the item that MIT institutions my company
all most might be the such
heroes of the by the you the in the yeah
here on the concepts here in New our institution by taking like this the left to this extent of our the the in in in our head new my name you you institution and a pair creation 90 if is set up In temperate theories and right before inside you coalition of so the will be here I like here on correct before us is created in the extent also and and just issues in front page is found of stocks here aren't you guys this year and differences here about to be friends on it if I seasonal commodified these detailed element like 40 minutes and issues at this time we have plant to cooperate resampling center on more complicated automation with down the people want to use the going
Next are the conversion results, shown for each journal. Some journals have few articles, because some of the PDFs offered by the academic societies were not article PDFs. This is the journal name and the language: one code means a Japanese article with an English title and abstract, "E" is an article in English, and "J" is an article in Japanese. And this is the recognition rate, with its mean, minimum, and maximum. For some of these journals the rate is very low. This low rate was caused by font recognition failure: keywords, titles, or authors were not selected by the conditions, or all of the items on a page were not extracted, so the recognition rate went down. OK. Next are the upcoming future
improvements. The details are undecided at this time. This tool extracts the bibliographic information from PDF, so it is natural for this tool to enhance the accuracy of conversion and to move on to full-text conversion. This tool consists of two stages: the first stage is the process of recognizing the text blocks of the PDF, and the second stage is the process of assigning the text blocks to JATS elements. So the future improvements are to the PDF analyzer library and to the JATS context recognition abilities. First, the PDF file format places characters at any position, and any PDF reader draws the text with its location, orientation, and size on a two-dimensional display. With a PDF reader, the text is visible as a normal string of text, but in the content of the PDF it is not arranged in the same sequence as it appears in the reader. Human readers can recognize the text through the display, but the PDF analyzer does not read like human readers. For this reason, the enhancement of the accuracy of the PDF analyzer is indispensable in order to increase the overall accuracy. OK. In the second stage of processing, this tool maps text blocks to JATS elements. In other words, the second step is a process to give meaning to the text blocks. For a PDF, the layout may change depending on the situation. Therefore it will be necessary to develop a way to react to layout changes and to understand the meaning of each block; there is a need for improvement. Last, there is a possibility to offer not only the bibliographic information in the XML but also, through further efforts, a conversion of the full article. For this, the tool will need to handle other objects in the PDF, like images, vector graphics, and tables. And when targeting science and technology journals, it would be important to convert mathematical expressions. This requires the ability to recognize the precise arrangement of characters and to understand automatically the context of the mathematical expression itself. This is of course a big challenge.
This is the conclusion of the presentation. I introduced the bibliographic XML creation tool. It is easy to configure, but it needs more improvements. Here is the utilization trend of the tool; the data covers only less than half a year. Unfortunately I cannot show the effect on cost reduction, but from the access statistics, some academic societies are using the tool for their publication. In conclusion, there is room to improve the recognition rate and to handle the variations of PDF, but the tool will help the XML-ization of a number of journals. OK, thank you. If you have questions about J-STAGE, please contact JST; technical questions can go to Digital Communications, which is me. If you want to know more, please ask me in
black tie I repeat up full you're talented me
Thank you very much. That was fascinating. I, at least, am astonished at how quickly and effectively that worked; I would have said it was impossible. Questions? Comments?
It's really interesting work you're doing. I have a more detailed question. When you talked about the dehyphenation, are you designing it so that you can have dictionaries on a per-journal basis? Because from an editorial style perspective, the rules for hyphenation in English can vary from one journal to the next.
The dehyphenation is done with a dictionary. I think it can also be done with per-journal dictionaries, but it is still under discussion, and I cannot say much about it, so I cannot answer you fully.
OK, thank you.
I am with a publishing group. Your interface was all in Japanese. I may have missed this, but do you plan to offer the interface in English? Your buttons had Japanese characters.
I cannot talk about that, because J-STAGE is developed by JST. The public system has both Japanese and English, but this is an internal tool for the Japanese academic societies using J-STAGE. So if societies outside Japan want to use it, that would have to be discussed.
Fair enough. Anybody else?
Formal Metadata

Title Reducing Costs and Expanding XML Submissions with PDF to JATS Conversion
Title of Series JATS-Con 2012
Part Number 14
Number of Parts 16
Author Katoh, Keishi
Kobayashi, Tokushige
Mitsuru, Kitazawa
License CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI 10.5446/30583
Publisher River Valley TV
Release Date 2016
Language English
Production Year 2012
Production Place Washington, D.C.

Content Metadata

Subject Area Information technology
Abstract The paper presents a brief overview of the challenges facing institutions with the XML-ization of academic journals and the steps being taken in Japan to meet both those challenges with the new J-STAGE3 implementation and a solution for automatically analyzing and converting PDF into XML for JATS metadata and bibliographic information. J-STAGE3 has fully adopted the metadata and bibliographic JATS format. The automated solution is currently achieving more than a 90% accuracy rate and future plans are to expand it to be able to produce full-text XML from PDF.

