
Transforming the arXiv to XHTML+MathML


Formal Metadata

Title
Transforming the arXiv to XHTML+MathML
Alternative Title
Converting arXiv into XHTML+MathML: an opportunity for blind and partially sighted people to access scientific papers
Part Number
16
Number of Parts
19
License
CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose, as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
We describe an experiment in transforming a large collection of LaTeX documents into a more machine-understandable representation. Concretely, we are translating the collection of scientific publications of the Cornell e-Print Archive (arXiv) using the LaTeX-to-XML converter LaTeXML, which is currently under development.
Transcript: English (auto-generated)
So, can you hear me? I have to warn you: I'm probably the only person here who doesn't come from the perspective of helping disabled people. I'm actually interested in semantics, and to my great delight I found out from Christian that some of my work is actually useful here. I'm going to tell you about my part of this, and then Christian has promised to tell you why it might be useful.

This is some work we started about two years ago. We were actually trying to build a formula search engine for the web. We had a very nice way of searching for formulae that was built out of automated theorem proving technology. The nice thing was that it actually did semantic search. The bad thing was that you needed Content MathML to feed it, and we had a crawler scouring the internet for three months and it found 13 pages that we didn't know about beforehand. So we figured we were doing something wrong and said we have to create some more Content MathML. We got extremely ambitious, helped by a colleague of Neil's who had been ambitious earlier but had failed because he didn't have the right friends. The idea was: we have a very nice resource of scientific papers out there, the Cornell e-Print Archive. Some of you may still know it as xxx.lanl.gov. By now it has half a million papers in LaTeX, full of mathematical formulae; we just crossed that threshold two weeks ago, ten days ago, something like that. If we could only transform those into Content MathML, then we would be in business. I have very good students who are also very motivated to do fun things, so we basically sat down and started this enormous project.
The good thing was that we know somebody not everybody knows, Bruce Miller, who had actually sat down and re-implemented the TeX parser and used that to create XML out of TeX. The only thing we had to do was download half a million files, just short of a terabyte, and run his program over them, picking up the pieces whenever something went wrong. That's the story, basically. There is a lot to learn when you're applying technology to large examples; that's basically what we're doing. One of the things is that if you run Bruce's program over this corpus, it takes about a year, a year and a half, something like that, if everything goes wrong. You have to do something about it. That's the story I would like to tell you.
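(As a back-of-envelope check of that figure, assuming, and this is my assumption, not the speaker's, roughly a minute of processing per paper:

    5 \times 10^{5}\ \text{papers} \times 1\ \text{min/paper}
        \approx 3.5 \times 10^{5}\ \text{min} \approx 347\ \text{days sequentially};
    347\ \text{days} \mathbin{/} 40\ \text{machines} \approx 9\ \text{days}

which is consistent with the "about a year" estimate and with the 40-machine cluster mentioned later in the talk.)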
The thing I would also like to tell you, because that's the other hat I'm wearing (instead of doing semantic recovery in the large, for a large corpus, we are also doing it in the small, because we can do cool things with semantics), is why recovery of semantics is actually a hard thing. There too, we wanted to use LaTeX, because that's the standard format, at least in the bracket of education I'm looking at.

Let's look at math notation. If you look at the formula at the top, you can see that we have three differently coloured alphas in there. That's just something I took out of my graduate lecture; it's about the I combinator, and we have alpha-equivalence there, that's the blue alpha at the bottom. We have an alpha which is normally all black and looks identical but is a type, namely the type of this variable x. And I have an alpha up here which is a reminder of a type that's involved: it's the I combinator at type alpha, which just happens to have not type alpha but type alpha to alpha. That's something I ask my students to immediately see and differentiate. Of course it also tells us that notations are difficult for the machine, because if we look into the archive, they just say backslash alpha there; there's no distinction. I use the distinction in my own papers, but almost all people don't.

Even standard notations vary: you may know this as the binomial coefficient, n over k. If you learned your math in France, I think you write it one way, and if you learned your math in Russia, you write it another. Think of yourself as a machine trying to find out what's being said here: you have to know whether the author is Russian or French, or even a Frenchman who learned his math in Russia. Short of understanding the math, namely knowing that both of these mean the same thing, there's no way of distinguishing them. Semantic recovery is, as we say, AI-hard, meaning that if you can do this, you can do artificial intelligence. We're not that hopeful, in other words.
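(To make the ambiguity concrete, here is a sketch in LaTeX of the kind of collisions the speaker describes; the particular spellings are my illustration, not taken from his slides:

    % One source token, three conceptually different alphas:
    \alpha              % alpha-equivalence, a meta-level relation
    x_{\alpha}          % alpha as the *type* of the variable x
    I_{\alpha}          % the I combinator at type alpha,
                        %   itself of type \alpha \to \alpha
    % Standard notations also diverge by tradition:
    \binom{n}{k}        % the anglophone binomial coefficient (amsmath)
    C_n^k \quad C_k^n   % the French and Russian conventions swap the
                        %   indices; which is which is exactly the kind
                        %   of thing a machine cannot guess

)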
The other thing that's interesting is that notations actually follow complex rules. You can have notation that's introduced before it's used, but we also have things like "where w is the such-and-such", where notation is introduced after it's used, stuff like that. So notation is a complex beast, and we're trying to get our hands on it. There are other problems you face when reconstructing semantics. Sometimes mathematicians actually leave out more than half of what they should say: we have log x which actually means log base 2 of x. We have condensed notations, and I like this one very much, where you actually have two equations in one: f of x plus-or-minus 1 equals g of x minus-or-plus 1, the same thing with the signs flipped. We have ad hoc exceptions, where an infinity is treated differently. We have sine of x over y, which might be one reading or the other; it really depends on what's down there. And of course you know that the sizes of gaps in proofs really depend on how clever you think your interlocutor is. It's difficult.
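(In LaTeX terms, those examples are roughly of this shape; my rendering, assuming the usual readings:

    \log x                     % often silently means \log_2 x
    f(x \pm 1) = g(x \mp 1)    % condensed: two equations in one line
    \sin x / y                 % (\sin x)/y or \sin(x/y)? context decides

)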
There's one thing we've explored, which is actually fiddling with LaTeX to write the semantics into the TeX so that it can be recovered. That's something I've already done a thousand pages of; even these slides are actually in this LaTeX format, and I can generate XML out of them and treat it with my OMDoc tools. But of course you can't do that for the archive, because, depending on how you count, there are about 10 million pages of TeX out there, so you're not going to decorate all the pluses with backslash plus or something like that. We have to do something else here.
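(A minimal sketch of that decoration idea, assuming a semantic macro of my own naming rather than the speaker's actual OMDoc-based format:

    % A semantic macro: typesets as usual, but a converter can map it
    % to Content MathML (<apply><plus/>...</apply>) instead of a bare +.
    \def\plus#1#2{#1 + #2}
    $\plus{a}{b}$   % written instead of $a + b$

)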
The problem here is that we have a conversion tool, but one that needs to look at every single style file out there. The archive has about 6,000 of them, of which about 1,000 are duplicates; we don't know exactly, because there are so many that we're not even going to look at all 6,000 files. With all these style files we need a very potent tool, and let me just show you the kind of things you can encounter. Can anybody read LaTeX? Well, if you run this through TeX, it's going to evaluate to a Christmas song. If you run it through Bruce Miller's LaTeXML, you get this, which is very close to what the DVI would look like. So forget your Perl scripts that look for backslash alpha and so on: you have a parser here, in a LaTeX world where the input actually changes the tokenizer as it goes along. What surprised me most is that I had the feeling I knew what LaTeX would be like out in the wild. I was wrong. If you have a feeling for what's out there in terms of TeX in the wild, you're wrong too. There's nothing like really looking.

We built a conversion harness that runs the converter over the TeX. I'm not going to talk about how it really works; I'm going to show you the web interface, so you can just go to the web page up here, which you can't really read. You'll find something like this. We're running it on 40 machines, and there's a lot of metadata you can get. The important thing is the results. Right now we have a green category that allows a couple of semantic warnings; a warning basically means that LaTeXML couldn't figure out whether, say, i was used as a function or as an individual, things like that, or that it was used inconsistently over the course of the document. "No problems" means that LaTeXML is completely happy; I hope you understand that we haven't checked all of these by hand. Then there's a category where these 6,000 style files come in, which basically means there are a couple of macros that are not implemented. We have a way of doing statistics over what these are; they're of the class "keywords" or something like that, so it's not really that the math is bad, but some weird corner cases are still missing. Then of course we have the category for when people are doing graphics in TeX or something like that. A fatal error basically means that the converter encountered more than 100 errors, at which point it really says "I'm confused" and gives up. We have about 90% in the green category, where you can actually see something.

Let me show you a random article. This is what you get, and I'm sure you'll believe me that it's real MathML if I show you. This is in Firefox, so we can see the MathML source. The nice thing is that we have Presentation MathML here that's relatively near to the semantics. We usually even have Content MathML in there (in this case we don't, I don't know why), but it's still pretty bad. If you want better Content MathML, you should talk to Dejan and buy him a beer, because he's actually the one doing the content work. But at least it's something the sighted people can actually read.
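(For readers unfamiliar with the distinction: in MathML's parallel markup, presentation and content can sit side by side, roughly like this minimal hand-written example, which is not taken from the demo:

    <math xmlns="http://www.w3.org/1998/Math/MathML">
      <semantics>
        <!-- Presentation MathML: how the formula looks -->
        <mrow><mi>x</mi><mo>+</mo><mn>1</mn></mrow>
        <!-- Content MathML: what the formula means -->
        <annotation-xml encoding="MathML-Content">
          <apply><plus/><ci>x</ci><cn>1</cn></apply>
        </annotation-xml>
      </semantics>
    </math>

)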
Let me come back to my talk. What we have to do is, basically, for every macro that's ever used, write something like this: a "def constructor" with the macro name and then the XML equivalent. The nice thing about having to do this is that we are able to actually recover semantics here. If somebody writes backslash reals, then we know some semantics: he was presumably thinking of the real numbers. If we were using the real TeX parser, what we would get would be "take the math blackboard font at 14 point and a design size of 12 point and put a double-struck R there". That's not going to be a lot of use, because then we would have to do something like OCR. But we do have at least some of the semantics, and to recover it we have to say something like that.
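(In LaTeXML, such bindings live in Perl definition files; a plausible shape, reconstructed from my reading of LaTeXML rather than from the speaker's slide, is:

    # In a LaTeXML binding file (*.ltxml): map the macro to an XML token
    # that records the intended meaning, not just the glyph.
    DefConstructor('\reals',
      '<ltx:XMTok meaning="reals">&#x211D;</ltx:XMTok>');

so that \reals carries the meaning "real numbers" through into the generated Content MathML.)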
I have a couple of very dedicated students, like Diane, who are helping me do this. The nice thing for them is that they can do something like attack the "keywords" macro in the RevTeX styles, and then with the next run they can see these numbers actually going up. It's very nice for undergraduate students that they're actually doing something for the world; it gives them a good feeling. What's the state? We can do something like 85-90% with errors, and about 40% without errors.

What we're starting just now, again with a couple of graduate students, is linguistic analysis. For instance, for our search we need to be able to find universally quantified variables, something of the form "for all x, y and z", "let f be such-and-such" and so on. Those are the variables we want to instantiate later to find applicable theorems. You can understand that this is not just a logic topic; we need to do linguistic analysis. We have to understand something like "let f and g be such-and-such" to actually find the applicability conditions. We've built a system where everybody, and I'm inviting everybody over, if you want to play with natural language, we have a lot of it: about 100 million formulae. If you have a little program that spots universal variables or definitions or theorems or complex named entities or something like this, we'll run it over our corpus so that you can learn from it. The only thing we're asking is that you leave the data behind, meaning give us a copy of what you found: when you've found a universal variable, just give us the XPath, and we'll try to do something interesting with it.
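(A toy spotter in that spirit, my sketch against generic MathML output rather than the project's actual interface, might look like:

    use strict;
    use utf8;
    use XML::LibXML;

    # Load a converted article and look for identifiers that directly
    # follow a universal quantifier; report each hit as an XPath.
    my $doc = XML::LibXML->load_xml(location => 'paper.xhtml');
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(m => 'http://www.w3.org/1998/Math/MathML');

    for my $mi ($xpc->findnodes(
        '//m:mo[normalize-space(.)="∀"]/following-sibling::m:mi')) {
        print $mi->nodePath(), "\n";   # the XPath to hand back
    }

Here "paper.xhtml" is a placeholder file name, and the quantifier-followed-by-identifier pattern is of course far too naive; the point is only the shape of the contract: find something, return its XPath.)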
If you're interested in trying these things, our build system is here. If you have a LaTeX file and you want to convert it into XHTML plus MathML so that it looks like the one I've been showing you, which is more accessible, then just send it to this URL. If you're lucky, we've already converted the style files you need; otherwise I'd ask you to buy Dejan's friends a couple of beers.
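(For local experiments, the LaTeXML distribution itself provides command-line drivers; as far as I recall its documentation, the invocation is roughly:

    latexml --destination=paper.xml paper.tex
    latexmlpost --destination=paper.xhtml paper.xml

with latexml doing the TeX-to-XML conversion and latexmlpost producing the XHTML+MathML; check the LaTeXML manual for the exact flags.)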
Christian is going to talk about the application, what this can do for the vision impaired. What I'm interested in is generalisation search; for that I need universal-variable spotters. Or semantic search by academic discipline: if we know more about the structure of things, then I, as a physicist for instance, would like to ask whether there is anything that can prove this formula or compute this formula, but I only want something that's valid in an expanding universe where the proton decay is lower than such-and-such. Yes, I'm stopping, even though there are nice applications still to be had.
Just a few words to place this in the context of this workshop. Until now we have spoken mainly about tools, also about how to produce tools, content and so on. With this talk we wanted to focus on some content which can actually be made available in XHTML plus MathML, and I have conducted some experiments in reading it with a screen reader. It means that the thousands of papers from physics, mathematics and computer science which are available in this repository may become more and more accessible, both for blind and for partially sighted people. It is true that blind persons can read LaTeX in some situations, but with XHTML plus MathML the solution may be better, both for speech output and Braille output, and also for the possibility of enlarging some parts of an expression. And second, having the semantics available, as far as possible, would be useful to generate better-quality speech renderings, or also to facilitate creating new sound renderings. Thank you.

May I make another small comment? Since I'm interested in semantics, one of the things I'm hoping to get out of being here is your view on where semantics might actually help you, because I have a lot of grad students who want to do meaningful work, and want to do meaningful work with semantics. So when you have an application, that would be a wonderful way to cooperate.
We have time for one question.

To make Content MathML from whatever source, you need to know a lot about the context of the formula. Where do you get the context from? Just from the formula itself and the use of the macros within the formula, or do you also use the text environment, say the chapter or paragraph around the formula?

Up till now we can only do it from the formula itself; right now we're doing dumb conversion. We are actually starting to do linguistic analysis of formulae in context. But that, again, just like his project, is at least a 20-person-year project, because if we can do that, I think we can do all of AI. Of course I would like to pass the Turing test, maybe just by reading the archive, but I'm a bit sceptical that we're going to do it in the next five years.

Let's thank the speakers.