Transforming the arXiv to XHTML+MathML
Converting arXiv into XHTML+MathML: an opportunity for blind and partially sighted to access scientific papers
We describe an experiment in transforming large collections of LaTeX documents into more machine-understandable representations. Concretely, we are translating the collection of scientific publications of the Cornell e-Print Archive (arXiv) using the LaTeX to XML converter, which is currently under development.
So, yeah, right, I have to wake you up. I am probably the only person here who doesn't come from the perspective of helping disabled people. I'm actually interested in semantics, and it turns out that my work might actually help people with problems I had not thought about. So I'm going to tell you about my part of it, and then someone else will tell you why it might be useful.
This is some work we started about two years ago. At the time we were actually trying to build a formula search engine for the web. We have a very nice search engine for formulae that builds on automated theorem proving technology, and the nice thing was that it actually did semantic search. The bad thing was that we needed content MathML. We had a crawler scouring the Internet for three months, and it found 13 pages that we didn't know about before. So we concluded that we were doing something wrong and said: well, we have to create some content.
And so we got extremely ambitious, helped by a colleague of mine who had envisioned this earlier but failed because he didn't have the right friends. The idea was that we have vast resources of scientific papers out there. That's the Cornell e-Print Archive; some of you may still know it as xxx.lanl.gov. By now, and we just crossed that threshold two weeks ago, it has half a million papers in LaTeX, full of mathematical formulae. If we could only transform them, we would have our content.
I have very good students who were also very motivated, so we basically just sat down and started this as a project. The good thing was that we knew somebody, and not everybody knows him: Bruce Miller, who had actually sat down and reimplemented the TeX parser and used that to create XML out of LaTeX. So the only thing we had to do was to take the files, run this program over them, and pick up the pieces when something goes wrong. That's the story, basically. There is a lot to learn when you're applying technology to large examples. One of the things is that if you run Bruce's program over this corpus, it takes years or something like that, and everything goes wrong, so you have to do something about it. That's the story I would like to tell you.
The other thing I would like to tell you about, because it goes beyond what we are doing right now, is semantic recovery, which we are also doing in the small because we can do cool things with semantics: why is the recovery of semantics actually a hard thing? And again, we wanted to use LaTeX, because that is, at least in the area I'm looking at, the standard format. So let's look
at math notation. If you look at the formula at the top, you can see that we have differently colored alphas. That is just something I took from my undergraduate lecture. The alphas look identical, but they mean different things. That is something my students immediately see as different, and it also tells us why notations are difficult for the machine: if we look into the LaTeX, it just says \alpha, with no distinction.
And it is not only my own notations but also other people's. You may know this here as the binomial coefficient. If you are in France, I think you write it like this, and if you are in Russia, you write it like that. So if you are a machine trying to find out what was meant here, you have to know whether the author is Russian or French, or perhaps a Frenchman who learned his mathematics in Russia. Without knowing that, there is actually no way of distinguishing these. So semantic recovery is something that we call AI-hard, meaning that if you could do this, you could do artificial intelligence, and we are not there yet.
The other thing that is interesting is that mathematicians actually follow complex rules in introducing notations. Usually a notation is introduced before it is used, but we also have things like "where W is the ...", where something is introduced after its use. So notation is a complex beast that we have to get our hands on.
The other problem you face when you are reconstructing semantics is that mathematicians sometimes write down less than what they should. So we have lots of ellipses and lots of abuse of notation. I like this one very much, where you actually have two equations written as one. We also have ad hoc exceptions, where infinity is treated differently. Here we have an expression which might be of this form or might be of that form, and it really depends on what is down there. And of course the sizes of gaps in proofs really depend on how clever you think your interlocutor is.
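As a small illustration of the notation problem just described, here are several notations that different authors use for one and the same binomial coefficient. Which community prefers which form is deliberately not asserted here. The point is only that the presentation alone does not determine the meaning, and that two of the forms differ only in which index goes up and which goes down.

```latex
% Each of these is used by some authors for "n choose k":
\[
  \binom{n}{k}, \qquad C_n^k, \qquad C^n_k, \qquad {}_nC_k, \qquad C(n,k)
\]
% and all of them are meant to denote
\[
  \frac{n!}{k!\,(n-k)!}
\]
```

A converter that sees only `C^n_k` in the source cannot tell which index plays the role of n without knowing the convention the author follows.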
One thing we experiment with is extending LaTeX so that you can write the semantics directly into the TeX source, where it can be recovered. That is something I have already done for about a thousand pages of my lecture slides, and I can generate a lot out of them. But that is not a candidate for the arXiv, because there are about 10 million pages of TeX out there, so nobody is going to decorate all these sources, every \alpha or something. We have to do something else.
And here we basically have the problem that we have a wonderful conversion tool, but it actually needs us to look at every single style file. In the arXiv there are about 6,000 of them. How many of those are really different we don't know, because there are so many and we are not going to look at 6,000 files. OK, so we need a very potent tool.
Let me show you the kind of things you can encounter. Can anybody read LaTeX? Well, if you run this through TeX, it is going to evaluate to a Christmas song. And if you run it through Bruce Miller's LaTeXML, you get something that is very close to what the DVI actually looks like. So you can just forget your Perl scripts that look for backslash-something here. In the LaTeX world there are things that actually change the tokenizer as it goes along. What surprised me most was what happens to TeX and LaTeX in the wild. I was wrong, and if you have a feeling that people do not do such things in the wild, you are wrong. We have also built
a conversion harness that actually runs the converter over the TeX. I'm not going to talk about how it really works; I just want to show you that, if the web connection works, you can just go to the web page here, which you can't really read, and you'll find something like this. We are running it on 40 machines, and there is a lot going on. The important thing here is the result.
Right now we have one category, about 40 percent, which means that there are a couple of semantic warnings. A warning basically means that LaTeXML itself couldn't figure out whether f was actually a function or an individual, things like that, or that something was used inconsistently over the course of the document. "No problems" means that LaTeXML is completely happy, and I hope you can understand that we have not checked all of these by hand.
Then there is a category, which is where the 6,000 style files come in, which basically means that there are a couple of macros that are not implemented. We have no way of doing statistics over which environments or keywords are affected, or something like that. What really matters is that you have some weird corner cases that are still missing.
For errors, we have this thing here, which is mostly alignment with graphics and text or something like that, which confuses LaTeXML. And a fatal error basically means that there are more than 100 errors, in which case LaTeXML really just gives up. So we have about 90 percent in the range where you can actually see something.
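The status categories just described can be sketched as a small tally over per-document log summaries. This is only an illustration: the names `ConversionLog`, `classify` and `tally` are hypothetical, the order of severity between "missing macros" and "error" is an assumption, and the only number taken from the talk is the limit of 100 errors after which the converter gives up.

```python
from collections import Counter
from dataclasses import dataclass

# Limit mentioned in the talk: beyond this many errors the converter gives up.
MAX_ERRORS = 100


@dataclass
class ConversionLog:
    """Hypothetical per-document summary of one conversion run."""
    warnings: int = 0
    missing_macros: int = 0
    errors: int = 0


def classify(log):
    """Return the most severe status category that applies to this log."""
    if log.errors > MAX_ERRORS:
        return "fatal error"
    if log.errors > 0:
        return "error"
    if log.missing_macros > 0:
        return "missing macros"
    if log.warnings > 0:
        return "warning"
    return "no problems"


def tally(logs):
    """Return the percentage of documents that fall into each category."""
    counts = Counter(classify(log) for log in logs)
    total = sum(counts.values())
    return {status: 100.0 * n / total for status, n in counts.items()}


if __name__ == "__main__":
    sample = [
        ConversionLog(),
        ConversionLog(warnings=2),
        ConversionLog(missing_macros=1),
        ConversionLog(errors=150),
    ]
    for status, percent in tally(sample).items():
        print(f"{status}: {percent:.1f}%")
```

Keeping the classification separate from the tally means the same summary can be recomputed after every run, which is what makes progress visible from one run to the next.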
Let me show you a random article. This is what you get, and I am sure you believe that this is real MathML, rendered in Firefox. If we view the source, the last thing to see is that we have presentation MathML here. [unintelligible] But it is still pretty. If you want better content MathML, talk to the person doing the content work and buy him a beer. But at least it is something you can actually read. Let me continue and come
back to my talk. What we have to do is basically, for every macro that is used, write something like this, a constructor or a macro. The nice thing about having to do that is that we are able to actually recover some of the semantics. If somebody says \Reals, then we know that he was thinking of the real numbers. If we were actually using the real TeX parser, what it would do is take a blackboard font at 14 points, scale it to 12 points, and put the glyph where required. OK, so that would be a lot less useful, because then we would have to recover the meaning from the glyph.
But we have to remember that somebody has to write these things, and I have a couple of very dedicated students who are actually willing to do this. But I think they need something like this: they write macros, and then with the next run they can see these numbers going up. It is very, very rewarding to see that they are actually doing something.
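To make the point about macros concrete, here is a hypothetical author preamble in plain LaTeX. It is not the converter's binding format, and the document is invented for illustration. The macro name records what the author meant, while its expansion only describes which glyph to print.

```latex
\documentclass{article}
\usepackage{amssymb} % provides \mathbb
% Hypothetical author-defined macro: the name says "real numbers",
% the expansion only says "blackboard-bold R".
\newcommand{\Reals}{\mathbb{R}}
\begin{document}
Let $f\colon \Reals \to \Reals$ be continuous.
\end{document}
```

A converter that works at the level of `\Reals` can attach the meaning "real numbers" to the symbol. One that only sees the fully expanded output is left with a glyph.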
What's the state? We can do something like 85 percent, of which about 40 percent is without errors. What we are starting just now, again with a couple of graduate students, is linguistic analysis. For instance, for formula search we need to be able to find universal variables, something like "for all x, y and z" or "let ... be ...", because those are variables we want to instantiate later. This is not just a logic thing; we need to do linguistic analysis. We have to recognize something like "Let S and G be ...", and we would have to actually find the conditions that go with it.
So we have built a system, and we are inviting everybody who wants to play with natural language and mathematics. We have a lot of it: about 10 million formulae, 100 million formulae. So if you have a little program that actually spots universal variables, or definitions, or complex named entities, or something like this, you can run it, or we will run it over our corpus, so that we can learn from it. The only thing we are asking is that you give the data back to us: give us a copy of what you have found. If you find a universal variable, just give us the XPath, and we'll try to do something interesting. If you're
interested in trying these things, there is our build system. If you have any LaTeX file that you want to convert into XHTML+MathML, so that it looks like the one I showed and is more accessible, just basically send it to this URL. It is likely that we already have whatever style files it needs; otherwise I ask you to buy my friends a couple of beers.
Somebody else is going to talk about the application, about what this can do for the vision-impaired. What I am interested in is search with universal variables, or semantic search by academic discipline. If we knew more about the structure of the entire arXiv, we would like to ask: is there anything that can prove this formula or compute this formula, but I only want something that holds in an expanding universe where the proton decays slowly, or something like that. And I am stopping here.
[unintelligible]
One of the things that I want to get out of being here is to learn from you where semantics might actually help you, because I have a lot of graduate students who want to do meaningful work with semantics. They would love to have an application, and it would be wonderful to cooperate.
[unintelligible]
Up till now we can only do it from the formula itself. Right now we are actually starting to do linguistic analysis of the formula context. But that again, just like this project, is a project that will take decades, because if we can do that, I think we can do AI. I'm a bit skeptical, but we'll see what we are going to do in the next five years.
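The kind of "little program that spots universal variables" mentioned in the talk could start out as something like the following sketch. It is a deliberately naive illustration under strong assumptions: it works on plain English text rather than on the converted XHTML+MathML, it only knows the two phrasings quoted in the talk ("for all x, y and z" and "let S and G be"), and it treats every isolated single letter in those phrases as a variable. The function name `universal_variables` is hypothetical, and the sketch returns variable names only, not positions in the document.

```python
import re

# A variable is assumed to be a single isolated letter, e.g. "x" or "G".
VAR = r"\b[A-Za-z]\b"

# A list such as "x", "x and y", "x, y and z" or "x y and z".
VAR_LIST = rf"{VAR}(?:(?:\s*,\s*(?:and\s+)?|\s+and\s+|\s+){VAR})*"

# The two declaration phrasings quoted in the talk.
PATTERNS = [
    re.compile(rf"\bfor\s+all\s+({VAR_LIST})", re.IGNORECASE),
    re.compile(rf"\blet\s+({VAR_LIST})\s+be\b", re.IGNORECASE),
]


def universal_variables(sentence):
    """Return the variables declared in the sentence, in order of appearance."""
    found = []
    for pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            # Pull the individual letters out of the matched variable list.
            for name in re.findall(VAR, match.group(1)):
                if name not in found:
                    found.append(name)
    return found


if __name__ == "__main__":
    print(universal_variables("for all x, y and z we have x + y = z"))
    print(universal_variables("Let S and G be groups."))
```

A pattern matcher like this will produce false positives on ordinary prose such as "for all I know", which is one reason real linguistic analysis of the formula context is needed.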