Mathematical Document Classification via Symbol Frequency Analysis
Formal Metadata
Title: Mathematical Document Classification via Symbol Frequency Analysis
Part Number: 4
Number of Parts: 14
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/21270 (DOI)
Transcript: English (auto-generated)
00:00
Okay, there we go. So you might think of this as a research by-product, you know, like meat by-products. This is actually a research result, but it wasn't the aim of what we were looking to do. The goal was to try to help find things to improve mathematical
00:25
handwritten character recognition, and I'll say a little bit more about that later. But we noticed something along the way, looked into it a bit further, and I thought it was interesting and might be useful to mention to a group like this. Okay, so the problem we're looking at
00:49
is this: I could hand anyone in this room any document from the University of Birmingham library, any book, and you could flip through it and within three seconds have a pretty good estimate of what that book was about. You know the
01:03
area, roughly. So how are we doing this? We're recognizing some expressions, we're recognizing the use of some symbols, maybe some words, and so on. So how far can we go by looking at the set of symbols that are being used in a mathematical document, or the set of sequences of symbols that are occurring in it?
01:24
So that's the problem. For this talk I'll say a little bit more about that: I'll talk about how we computed symbol and symbol-sequence frequencies in collections of documents, show how the frequencies behave by area, and then talk about how, given frequency
01:45
information, we can work backwards to find the area of the document, and I'll say a few things about ongoing work. All right, so the problem is how to identify the subject area of a mathematical document, by examining the symbols in particular. And why would we
02:05
want to identify subjects in mathematical documents rather than just believe the keyword classification, or the MSC, the Mathematics Subject Classification, information that was typed onto the document when it was submitted? Well, it may be that documents don't have a classification provided
02:25
by the author, or it may be that we have some document that was written at one time and then, some years later, a new area of mathematics was identified, and these documents, written before the field was
02:41
invented, are in fact right in this area, but they're not tagged with it, because those subject classifications didn't exist at the time the document was written. So somehow we should believe the document more than we should believe the classification number somebody put on it one day. All right, going a little bit further, we would like to understand
03:03
what's in these documents. Rather than just seeing a document as a collection of pixels, it would be nice to identify what mathematical objects are involved in the discourse, and even if we succeed only partially in this, we can use this information to help us do
03:23
other things with the document. In particular, I'm interested in pen-based mathematical interfaces. And lastly, I just think it's a really neat problem that we can identify the area of a piece of mathematics so quickly, with just a glance, and yet our machines are hopeless at it. So there are obviously
03:43
some macroscopic features that we're identifying, and it would be nice to be able to figure out what those are. All right, so I talked a little bit about retro-classification; I won't go into that now. Okay: to aid in document understanding. If I'm going to be
04:01
parsing some mathematical document so that I can cut and paste the mathematics in it into my mathematical workbook or my Maple worksheet, for example, I've got to be able to figure out what these expressions mean. And even if I can correctly parse them into expression trees, the semantics associated with them can vary quite a lot. So if we
04:20
were reading a paper on semi-algebraic sets, this would for sure mean that G and H were real numbers and G was bigger than H, but if I were reading a paper in group theory, it would mean that H was a normal subgroup of G, and there's no way to tell the difference unless you know something about the
04:40
subject area. Likewise here: is this an ordered pair, or is it an open interval? It could be any one of a number of things. This one probably has the most choices: is this the set of nonzero elements of a field, or is it the dual of the Faraday tensor? And here: is this an exponent, a power of
05:05
some number u, or is it the i-th component of a vector? Who knows? We don't know unless we know something. So people who have dreams about parsing TeX to put it into Maple and so on are just dreaming
05:20
until we can assign some sort of meaning to these, even in an approximate way. To do this, we can either ask the user a lot of annoying questions (we'll probably have to ask a few anyway), but at least we can get the broad guesses out of the way first by looking at what they're doing. All right, so, for disambiguation: if I had written sin of
05:42
something that looks like an omega-t in handwriting, is it an omega or a w? Then there are some cases where the characters are exactly the same; so for example, we do have a blackboard here:
06:03
so what is that? Well, if it's written under something like this, then it's obviously not; but if I write z-dot equal to f of t, it's obviously the time
06:24
derivative of z, even though pixel by pixel they're exactly the same. So knowledge of these things is useful, even to determine when the symbols are different, for handwriting. Okay, so the point is, I'd like to try to look
06:48
a little bit at document classification based on the mathematical formulas that appear in the document, and not on the text. You can imagine you could go a lot further or do interesting things if you analyzed the text and looked
07:01
for common words, but sometimes we don't have words: if we're using pen-based math interfaces, which is where I was coming from, or at other times if you just have the notes of a lecture, you don't have a lot of words in it; you just have something that was written on the blackboard, which would be a bunch of formulas. So I think it's an interesting problem to look at being able to
07:21
classify based just on looking at the mathematics. So the idea then is to analyze the frequencies of the symbols and the n-grams that appear, and by n-grams what I mean are sequences of consecutive symbols. We're going to consider identifiers and operators separately, because it turns out that's
07:43
useful in getting some differentiation in the frequencies. And I slipped something by you a moment ago: I talked about n-grams, and a lot of people are nodding, but doesn't mathematics have two-dimensional structure? Well, the point is that for each operator we're going to ascribe a particular writing
08:04
order, like: you write the subscript first, then the superscript, for the sigma when you're doing a summation. For each of these we picked a writing order, and it doesn't really matter whether the writing order is correct or incorrect for this purpose, because we just need to do linearization. For the
08:23
mathematical handwriting recognition purposes it is important, and then we can talk about doing weighted sums of how many people write this way versus that way and so on, but it turns out that what I'm talking about today is completely invariant to that choice. Okay, so what we did was, we
08:40
needed to get our hands on some data, and we have two corpora. One was from the arXiv preprint server, where we downloaded approximately 20,000 articles from the years 2000 to 2005, which was essentially all of them that had
09:01
TeX sources and Mathematics Subject Classifications. So that's one corpus. Another, completely different corpus is the set of engineering mathematics texts for second-year engineering math: the first mathematics beyond the freshman calculus level that engineering students would see in North America, and
09:24
there is a set of very popular textbooks. I'm thinking of Konrad Lorenz, the guy with the rubber boots and the ducks following the rubber boots: they thought the boots were their mother, they imprinted on those boots, and so for the rest of their lives they thought rubber boots were mother ducks. I'm thinking that engineers, when they see
09:44
their engineering maths from some textbook in second year, will somehow imprint on that, and they will continue to use that kind of notation until somehow it's beaten out of them. So that's just an assumption. We can either think of this as just being, you know, an academic problem, or you might actually be doing it for a reason, which would be to recognize
10:03
handwriting for engineering students, in which case you'd want to be able to see what kinds of things they might be trying to write. Okay, so we got the original TeX sources for the three most popular engineering books in North America. So in some sense what we're doing is constructing an
10:23
empirical measure on the space of n-grams, where the measure is determined by the set of expressions, with the set of n-grams appearing in each of these books weighted by the sales figures. It's very empirical: you talk to
10:41
your academic book representative and you get the hard data on how many copies of each book were sold, and that somehow tells you how much of what notation is out there. The total number of mathematical expressions ever printed on paper is finite, so we just count them up; we've got a space there. So we've got the TeX from the author in one case and from the publisher in another case. And Erwin Kreyszig is now an emeritus
11:05
professor at the University of Ottawa, and I think he's in his late 80s or early 90s now. A very high percentage of the books that engineering math is taught from in North America are his, and he wasn't willing to give up anything, so we scanned all 1,500 pages of the book using Infty, and it
11:25
didn't necessarily generate correct TeX in all cases, so we then went and hand-corrected every single page, and now we've got the set of TeX expressions that occur in this book. So it's not for free, yeah; I actually
11:40
had to pay people to do this, I didn't do it for fun. All right, so as we said, the reason we were interested in doing that for Kreyszig's book was because most of the engineering students actually use that one. All right, now we've got the TeX; how do we get the n-grams? The first thing we have to do is evaluate the macros, because whether it's for the
12:03
arXiv server or the engineering books, TeX is essentially all about macros: you have no idea what the form is going to be until you process the macros. It's not just a simple substitution; you actually have to run a TeX emulator to see what sort of low-level native TeX you get. And then, given the TeX, we have to actually form the
12:23
expression tree, because TeX does not give you one. We think TeX gives you some kind of expression tree, but in fact it's your eye that's parsing it in most cases. Indeed, for most TeX that's written, when you see a closing parenthesis with a superscript two, it's only the parenthesis that is squared, not the whole expression. So we
12:40
have to go back and figure out what the expression trees are. The way we did this was using the TeX-to-MathML converter that we developed at our laboratory. So we convert the TeX to MathML; now we've got expression trees; now we can define a traversal order on them, and now we've got our sequences. From those we can compute n-grams, and then we can
13:00
go off and compute the symbol frequencies. So here are some symbol frequencies for the entire corpora, for the arXiv data and the engineering texts. These are the identifiers, and these are the number of times each occurred per million
13:23
mathematical symbols, okay, so per million. So if you're writing a handwriting recognizer and you don't know what somebody wrote, guess n; or if it's an engineer, guess x. Well, n isn't such a bad guess,
13:40
but x is better. Okay, I didn't need to tell you that; but whether k or t was more popular, that's another thing. So this is for the whole corpus, and you see that the frequencies drop off pretty quickly. Now, everybody here is familiar with the Mathematics Subject Classification, so we just looked at the
14:03
top-level categories; we didn't go any smaller than this because some of them had not very much data, not very many papers in them. This is what we cooked up as being the mathematical subjects for engineering:
14:21
ODEs, linear algebra, vector calculus, PDEs, Fourier analysis, multivariate calculus, complex analysis, numerical analysis, linear programming, graph theory, probability and statistics. For each of the books we decided which chapters were in which area, and we just counted the symbols for those chapters. So we defined what we thought should be a set of areas. So now, here we have, for
14:45
the arXiv data, examples of frequencies according to area. We see that in logic, i and n and x and
15:01
capital X are the leading symbols; in number theory it seems that n and p and k are more usual; and in partial differential equations, x and t and mu. So even though the distributions look kind of similar, in that big numbers go to small numbers in fairly drastic steps, the order, the
15:21
ranking of symbols is totally different. So the question is: how similar are these distributions? That's something worth looking at, and it turns out they're very similar indeed. If you put the logic and number theory and PDE graphs on top of all the subjects, it's very difficult to distinguish the curves, and so we have something very closely
15:46
approximating a Zipf distribution for these symbols, if we view frequency by symbol rank. Okay, same thing with the engineering math symbols: each line represents a chapter, and these are the symbols, where the
16:04
horizontal axis is labeled differently for each, so the most frequent character in PDEs will be different from the others, but the distribution is very similar. These are different ways of looking at the same data: here are the cumulative distributions, where we're adding them all up, and here we're looking at the log plot, and you see that everything is locally linear, but
16:25
here we see that it's pretty linear in the log distribution over a wide range, with some exceptions for the leading characters and some exceptions for the rarest ones. The same thing is true for the n-grams. We've looked at the n-gram frequencies only for the engineering maths, but here we have
16:46
n-grams for the three different authors; these are bigrams, trigrams, 4-grams and 5-grams, and even though each author has their own style, the n-grams are very similarly distributed. So we see that we've
17:03
got a fairly steep curve. And so now here's the data: when we look at the data in all the areas by frequency, we can rank the letters for each one from most common to least common, and I put a paragraph mark in to show where the sequence starts to get unique.
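That "paragraph mark" idea can be sketched in code: rank each area's symbols by frequency, then find the shortest prefix of the ranking that differs from every other area's ranking. This is an illustrative sketch, not the speaker's code, and the rankings below are invented stand-ins, not the real corpus data.

```python
def unique_prefix_length(ranking, others):
    """Smallest k such that ranking[:k] differs from the first k
    symbols of every other ranking; len(ranking) if none is unique."""
    for k in range(1, len(ranking) + 1):
        if all(ranking[:k] != other[:k] for other in others):
            return k
    return len(ranking)

# Invented per-area rankings, most common symbol first.
logic  = ["i", "n", "x", "A", "p"]
number = ["i", "n", "x", "p", "k"]
pde    = ["x", "t", "u", "n", "k"]

print(unique_prefix_length(logic, [number, pde]))   # 4
print(unique_prefix_length(pde, [logic, number]))   # 1
```

With real per-area frequency tables, each `ranking` would simply be the area's symbols sorted by descending count.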
17:26
So for example, in logic, you have other areas that also have i and n and x as the three most common symbols, but it starts to get different at the fourth symbol. The longest one, I think, is something like this, where we've got x,
17:47
i, k, t, and it goes on to some other symbols. Here's the rest; you see the paragraph marker comes fairly early, and so we only really need to look at the top ten most common symbols to get an idea of what area we're in. With
18:03
the mathematical operators, there's another figure in the proceedings that looks like this, and from the back of the room all you can see is a bunch of gray with some blobs that go up and down. The point I'm trying to make is that the blobs are close at this end and far from that end, and for the math
18:21
operators, the blobs are further out. So it seems that using letters is the best thing. We're now in the process of doing experiments. So, talking about conclusions and future work: we're in the process of doing experiments to separate the data, so take the arXiv data, split it in half,
18:42
train a classifier on half the data, and see how often we correctly classify and misclassify documents from the other half. I don't have those numbers today, we're just in the middle of doing this, but I thought it was in any case an interesting thing to show how very different from
19:02
each other even the symbol frequency use is in the arXiv data, let alone n-gram sequences. So do we need to see the forms of the expressions to know what area we're in? Maybe not; maybe seeing that there are a lot of capital N's is enough. Maybe even at that very superficial level you can
19:22
do some classification. At first we're looking at classifiers based on order statistics, saying which symbol is the most common, which the least common. If that doesn't pan out, then we can look at classifiers based on how different the frequencies are: a transposition where one symbol is very common and another is not quite so common might be
19:43
a more important transposition than one where they're almost equal, so we might need to use the frequency-count signatures. But already we've got some conclusions here: the symbol and n-gram rankings vary quite significantly by subject area, whether we're at the research level or at the
20:00
teaching level; that was a bit of a surprise to me. And lastly, the frequency as a function of rank follows an astonishingly similar distribution from one area to the next. So the fact that the different areas have distributions that are so close to each other might tell us something,
20:21
and maybe it's based on the fact that we can only remember seven things, or something about the way we think and the way we write, but it really does show up in our writing. And that's all I have to say about this today; thank you. I would be very interested in running a very similar experiment, but not with
20:42
areas, but comparing it to citation cliques, or the Mathematics Genealogy Project. I kind of have the intuition that you inherit your notation from your supervisor,
21:00
right? We all use Fraktur letters because of Crelle's journal, and so on. And I think that comparing that, instead of areas, which is a very coarse-grained thing, to other, more community-based things would be really interesting; I really encourage you to try that. Okay, okay, this is like the
21:27
equivalent of figuring out who really wrote Shakespeare's plays. Yes, I wanted you to do the experiment of, say, taking the linear algebra from the three different authors and seeing if you get the same ordering of frequency
21:45
of use; it wasn't clear to me just how reliable that is. Right, so we did that, because there were only three authors and there were a few
22:01
subjects. And the goal, of course: the authors were different, but they were quite similar, though some author would use some symbol that another author didn't use at all. So what we did then was we constructed a weighting, where we took the symbol
22:25
counts from all the authors and multiplied them by the weights. But it turns out that among the most popular symbols for author one were the most popular symbols for author two, and they were sort of in the same order when they existed. So the similarity between the different authors was still greater than the
22:40
similarity across different subjects, even for the same author. Yes, that's correct, so there's more similarity between authors than between areas, with one exception: one of the authors used so many circumflexes; there were tons of circumflexes in his
23:02
text, and that was across all areas. Yes, yes, that's the next question. So, just to be clear: there's what I saw, and there's what
23:21
I'm after. What I was after was to be able to generate some Markov models: to say, I've seen a sigma, I've seen an i, and the next character is going to be an equals sign, or a greater-than, or an element-of; to predict what the next character was going to be, for handwriting recognition. But along the way we discovered this. And so another
23:42
question that we would like to answer, but haven't, is how strong the
24:16
discrimination from such a table is. Yes? There's a... we have a paper at Document Analysis Systems,
24:24
which is being held in September, where we look at multi-grams for French. Did you have some idea to compare your statistics with the
24:52
Right, that's a really good question, and one that would be nice to investigate, but we sort of saw that one go by as we drove past;
25:02
somebody should look at that question. I think it would be very interesting, and it may be that the words are an even stronger predictor than the mathematics, but I was interested in the formulas.
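The next-symbol Markov models mentioned a moment ago (seen a sigma, then an i; predict an equals sign) might be sketched like this. The training sequence is a toy stand-in, not the corpus data.

```python
from collections import Counter, defaultdict

def train_bigrams(seq):
    """Count, for each symbol, what follows it."""
    model = defaultdict(Counter)
    for cur, nxt in zip(seq, seq[1:]):
        model[cur][nxt] += 1
    return model

def predict(model, cur):
    """Most frequent symbol seen after `cur`, or None if unseen."""
    return model[cur].most_common(1)[0][0] if model[cur] else None

# Toy linearized symbol stream: three summation-like fragments.
seq = ["Sigma", "i", "=", "1", "Sigma", "i", "in", "S", "Sigma", "i", "="]
model = train_bigrams(seq)
print(predict(model, "i"))       # '='
print(predict(model, "Sigma"))   # 'i'
```

A real recognizer would condition on longer n-gram contexts and smooth the counts, but the shape of the model is the same.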
25:21
You talked about the linearization and said that, to be certain, it didn't matter which one you pick;
25:41
so for example, sigma-i is probably a very common summation, i from... yes, this sigma with a superscript infinity, or a lowercase one. But depending on which linearization, because it's really, you know, this token and the one n
26:08
tokens later. Yeah, it may indeed be the case that, if you're looking at a finite horizon, when the most strongly correlating thing
26:25
appears early in the sequence and you push it out too far, then you miss a correlation. Yes, but that effect is like a second-order effect compared to
26:46
the case where, if you're feeding me a sequence of characters and I think you're writing in the superscript but you're really writing the subscript, that's a major effect. Yeah, well, from the position you deduce that he's writing a superscript, and you
27:05
know that it's a superscript of a summation of what might be something, and then you look at a different database. Right; but you're quite right, I was a little bit too quick to say that it doesn't matter. I guess the point is that which order you pick might help you find stronger
27:29
correlations. But, so long as you're doing a classifier... yeah, I guess the point is that if you're trying to do classification, and the sequence you give for the test is built with the same strategy
27:44
as was used to build the sequences for the database, for the training, then it doesn't matter which order you use, so long as they're consistent. The question becomes which order is going to give you a stronger classifier than the others. You can have consistency by construction. Right, that's a good point.
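That consistency-by-construction point can be made concrete: pick any fixed writing order for each operator, linearize every expression tree with it, and extract n-grams; as long as training and test documents use the same traversal, the n-grams are comparable. A minimal sketch, with an assumed node layout and an assumed writing order (operator, then subscript, superscript, body), not the speaker's actual code:

```python
def linearize(node):
    """Flatten an expression tree into a symbol sequence.
    A node is either a bare symbol (str) or (operator, [children]);
    children are emitted in the order listed, which encodes the
    chosen writing order."""
    if isinstance(node, str):
        return [node]
    op, children = node
    seq = [op]
    for child in children:
        seq.extend(linearize(child))
    return seq

def ngrams(seq, n):
    """All runs of n consecutive symbols."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

# Sum over i = 1 to infinity of x_i, in the assumed writing order:
# operator, subscript, superscript, body.
expr = ("Sigma", [("=", ["i", "1"]), "infinity", ("_", ["x", "i"])])
symbols = linearize(expr)
print(symbols)                 # ['Sigma', '=', 'i', '1', 'infinity', '_', 'x', 'i']
print(ngrams(symbols, 2)[0])   # ('Sigma', '=')
```

A different writing order would produce different n-grams, but as long as the same `linearize` is applied to both corpora, the comparison between documents stays valid.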
28:06
You apparently only scanned one pre-TeX object, right? Pardon me? You only scanned in one document; everything else... That's right. I'm wondering,
28:22
I'm wondering... okay, I hear what you're saying. So I think, though, that I might not have been
28:43
sufficiently clear. What you're saying is certainly true, that usage has changed over time, absolutely; you see that with conference proceedings from the 1960s that are all done with typewriters, and then the IBM typeball was invented, and then there's
29:01
typesetting running parallel to it. But it's not that this textbook was not available in TeX; it's that the author would not provide access to the sources, even for research purposes, just to get histogram counts. Because this book is... okay, you say you didn't have time, but we would be naive
29:23
to ignore the fact that this book is big business. So the TeX sources are somehow, you know, his livelihood, and a large part of his publisher's livelihood, and they probably have an armed guard outside the study. So, yeah.
29:40
So that's not about it; it was before TeX was invented, it's just... I mean, this does have the flavor of the tail lagging on, and it would be interesting to see. Yeah, I think that, without having numbers to
30:01
back it up, I think all of us would see that if you go back far enough, the symbol use is just very different; I mean, you know, old journals from a couple of centuries ago. And it's probably changed continuously in between. Thank you.
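As a closing illustration: a classifier based on order statistics, as proposed in the talk, might compare a document's symbol ranking to each area's reference ranking using Spearman's footrule and pick the nearest area. This is a sketch under invented reference rankings, not the real corpus data.

```python
def footrule(a, b):
    """Spearman's footrule: sum of rank displacements between two
    rankings; symbols absent from `b` are placed just past its end."""
    pos_b = {s: i for i, s in enumerate(b)}
    return sum(abs(i - pos_b.get(s, len(b))) for i, s in enumerate(a))

def classify(doc_ranking, areas):
    """Pick the area whose reference ranking is closest."""
    return min(areas, key=lambda name: footrule(doc_ranking, areas[name]))

# Invented reference rankings, most common symbol first.
areas = {
    "logic": ["i", "n", "x", "A"],
    "pde":   ["x", "t", "u", "n"],
}
print(classify(["x", "t", "n", "u"], areas))   # pde
```

The frequency-count signatures mentioned in the talk would refine this by weighting transpositions between very common and rare symbols more heavily than transpositions between near-equal ones.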