Mathematical Document Classification via Symbol Frequency Analysis
00:00
So, I like to think of this as being a research byproduct. You have meat byproducts or other byproducts, and this is something like that: it is actually a research result, but it was not the aim of what we were looking to do. The goal was to try to find things to improve mathematical handwritten character recognition, and I'll say a bit more about that later. But we noticed something along the way, and I thought it was interesting and that it might be useful to mention it to a group like this. OK, so
00:46
the problem that we're looking at is this: given a mathematical document, say any document from the University of Birmingham library, can you tell, at least roughly, what area it is in? And how far can we get without recognizing expressions, or recognizing the uses of symbols and words? How far can you go just by looking at the symbols being used in a mathematical document, and at sequences of symbols? So in the talk I'll talk about how we went and computed symbol and symbol-sequence frequencies in collections, show how the frequencies vary by area, and then talk about how we can take frequency information and work backwards to infer the area. I should say that this is all work we are still doing. All right, so the
01:57
problem is how to identify the subject area of mathematical documents by examining their content. Why would we want to identify the area of a mathematical document? Can't we just read the keyword classification, the MSC, the Mathematics Subject Classification information typed onto the document when it was submitted? Well, there may be documents that don't have a classification by the authors. Or maybe we have some document that was written at one time, and then some years later a new area of mathematics was identified, and now all these documents that were written before the area was invented are in fact right in this area, and they were not tagged with it because the classification did not exist at the time. So somehow we should be able to classify the document ourselves rather than rely on the classification typed on it. All right, going a little further, I would like to understand what's in these documents as a whole, rather than just seeing the document as a collection of pixels. It would be nice to identify what the mathematical objects are that are involved in the discourse, and even if we can succeed only partially, we can use this information to help us do other things with the document. In particular, I'm interested in pen-based mathematical interfaces. And lastly, I just think it's a really neat problem: we can identify the area of a piece of mathematics at just a glance, and yet our machines can't. So there are obviously some macroscopic features that we are identifying, and it would be nice to figure out what those are. All right, at the top
03:53
will be the rhetoric classification local then I think actually
03:58
it often helps understanding. So if I'm going to pass some mathematics to a mathematical back end, like cut and paste mathematics and put it in my Maple worksheet for example, I need to have correctly parsed expression trees, and what this parse is varies a lot. If we were reading a paper in analysis and I had a relation between H and G, that would for sure mean that G was a real number that was bigger than H, and H is real. But if I were reading a paper in group theory, that would mean H was a normal subgroup of G, and there's no way to tell without knowing the subject area. Likewise here: is this an open interval? It could be any one of a number of things. Is this star the nonzero elements of the field, or is it the dual? Is this superscript the power of something, or is it something else? We don't always know. So we cannot parse completely until we can assign some sort of meaning to the symbols, and to do this we would rather not ask the user a lot of annoying questions. We may have to ask a few anyway, but we would like to get most of the way without pestering the user. All right, so there's
05:37
a lot of this. Something that looks like a w in handwriting: is it a w or an omega? And there are some cases where the characters are exactly the same. So for example, we do have a blackboard here: if it's written in one setting it's obviously one symbol, and in another setting it would have to be something else, even though
06:26
pixel by pixel they're exactly the same so knowledge of the state is useful even when the
06:37
system is determining whether symbols are similar or different in handwriting. OK, so what we're trying to look at is classification based on the mathematical formulas that appear in the document, and not on the text. You could imagine going further and doing interesting things if you analyzed the text and looked for common words, but sometimes you don't have words. If you're using a pen-based math interface, which is where this is coming from, or if you just have the notes of a lecture, there are not a lot of words in it: it just has what was written on the blackboard, which would be a bunch of formulas. And so we set the problem as classifying just based on that. The idea is to analyze the frequencies of the symbols and of the n-grams that appear, and by n-grams what I mean are sequences of n consecutive symbols. We're going to consider identifiers and operators separately, because it turns out that is useful in getting some differentiation in the frequencies. Now, you might ask what I'm talking about with n-grams when mathematics has got two-dimensional structure. Well, for each two-dimensional operator we're going to ascribe a particular writing order: for example, you write the subscript first and then the superscript, or the sigma first if you're doing a summation. For each of these we picked a writing order, and it doesn't really matter whether the writing order is correct or incorrect for this purpose, because we just need a linearization. For mathematical handwriting recognition purposes it is important, but it turns out that what I'm talking about today is completely invariant under that choice. So what we did
08:40
was, we needed to get our hands on some data, and so we have two corpora. One was from the arXiv preprint archive: we downloaded about 20 thousand articles from the years 2000 and 2001, which is essentially all of them that had TeX sources and a mathematical subject classification. That is one corpus. Another, completely different corpus is a set of engineering mathematics texts for second-year engineering, that is, the first mathematics beyond the freshman calculus level that engineering students would see in North America. There is a set of very popular textbooks, and I'm thinking of Konrad Lorenz, the guy with the rubber boots and the ducks following the rubber boots: the thought was that the ducklings imprinted on his boots, and so they followed them for the rest of their lives. It is something like that with the notation that engineers see in their engineering textbooks: they will somehow imprint on it and continue to use that notation. On that assumption, this is just the data we need if we might actually want to recognize engineers' handwriting years later, in which case you want to get it right. So we got the real TeX sources for the three most popular engineering mathematics books in North America. In some sense what we are doing is constructing an empirical measure on the space of n-grams, where the measure is determined by the set of expressions, as n-grams, appearing in each of these books, weighted by the sales figures. You go to your book representative and you get the hard data on how many copies of each book were sold, and that tells you what fraction of the total number of mathematical expressions printed on paper each book accounts for. So we got the TeX from the author in one case and from the publisher in another case. And Erwin Kreyszig is now an emeritus professor, I think in his early nineties or something like that, and his book has a very high percentage of the market, and he was not willing to give up anything,
so we scanned all 1500 pages of the book, and the recognition did not necessarily generate correct TeX in all cases, and so we then went and hand corrected every single page. So we've got TeX expressions for the whole book. We actually paid people to do this. So that is the story of
11:45
Kreyszig, and the reason we were interested in doing that for his book was that most of the engineering students actually use it.
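The sales-weighted empirical measure described in this part of the talk can be illustrated with a minimal Python sketch. It is not the speaker's code: the function name `sales_weighted_measure`, the input layout, and the numbers in the usage note are hypothetical, and the sketch assumes that each printed copy of a book contributes that book's n-gram counts once.

```python
from collections import Counter

def sales_weighted_measure(books):
    """Combine per-book n-gram counts into one probability measure.

    `books` is a list of (ngram_counts, copies_sold) pairs (a hypothetical
    layout). Each n-gram's mass is proportional to its count in a book times
    the number of copies of that book sold, summed over all books.
    """
    weighted = Counter()
    for ngram_counts, copies_sold in books:
        for gram, count in ngram_counts.items():
            weighted[gram] += count * copies_sold
    total = sum(weighted.values())
    if total == 0:
        return {}
    # Normalize so that the masses sum to 1.
    return {gram: weight / total for gram, weight in weighted.items()}
```

For instance, with invented figures, `sales_weighted_measure([({"x": 3, "y": 1}, 100), ({"x": 1, "y": 1}, 300)])` gives `x` a mass of about 0.6 and `y` about 0.4, so the better-selling book pulls the measure towards its own usage.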
11:55
All right, TeX. The first thing is that, whether it was for the arXiv or for the textbooks, with TeX you have no idea what is going on until you expand the macros, so you have to run a TeX simulator to see what is there at a low level. Even then, low-level TeX does not give you the form of an expression: a factorial, for instance, is attached in TeX only to the closing parenthesis in front of it, so there are questions about how to group things into expression trees. The way we did this was using a converter developed in our laboratory. Those conversions already give us the expression trees, and we do a traversal on them, and now we've got our sequences, and from those we can form any n-grams we like, and then
13:01
compute the symbol frequencies. So here are the symbol frequencies for the entire corpuses, or corpora, whatever you want to call them. For the arXiv data, on this axis are the identifiers, sorted, and here is the number of times each one appears per million mathematical symbols. So if you are writing a handwriting recognizer and you don't know which of two symbols somebody wrote, you can look here to see which one is the better bet, and it can already tell you that one letter is more popular than another. This is for the whole corpus, and you can see how things drop off pretty quickly.
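The counting described so far (linearized symbol sequences, n-grams of consecutive symbols, frequencies per million, identifiers and operators kept apart) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the talk: the `OPERATORS` set and the function names are hypothetical, and for n greater than 1 the sketch normalizes by the total number of n-grams rather than symbols.

```python
from collections import Counter

# Hypothetical operator set for illustration; the talk does not list one.
OPERATORS = {"+", "-", "=", "<", ">", "(", ")", "^", "_"}

def ngrams(symbols, n):
    """Yield every run of n consecutive symbols as a tuple."""
    for i in range(len(symbols) - n + 1):
        yield tuple(symbols[i:i + n])

def per_million(expressions, n=1):
    """Count n-grams over linearized expressions, scaled to occurrences per million."""
    counts = Counter()
    for expr in expressions:
        # n-grams never span two expressions.
        counts.update(ngrams(expr, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c * 1_000_000 / total for gram, c in counts.items()}

def split_kinds(symbols):
    """Separate identifiers from operators so each can be tabulated on its own."""
    identifiers = [s for s in symbols if s not in OPERATORS]
    operators = [s for s in symbols if s in OPERATORS]
    return identifiers, operators
```

Calling `per_million(expressions, 2)` on a list of symbol sequences gives bigram rates, and passing only the identifier part of each sequence from `split_kinds` gives an identifier-only table of the kind shown on the slide.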
13:57
Now, everybody here is familiar with the Mathematics Subject Classification. We looked at the top-level categories; we didn't go any finer than this because some of them did not have very much data, not very many papers. And then this is what we cooked
14:15
up as the mathematical subjects for the engineering mathematics texts, areas such as multivariate calculus. We decided which chapters were in which area and just counted the symbols for those chapters, so we had to define what we thought should be the set of areas. So now
14:43
here we have, for the arXiv data, examples of frequencies according to area. This one is logic and this one is number theory: in logic the most frequent symbols are one set, and in number theory they are another. You can also see that the bulk of the distribution looks kind of similar: the numbers fall off in similar steps, even though the ordered ranking of the symbols is different. That raises the question of how similar the distributions are, which is something worth looking at, and it turns
15:30
out that they are very similar. If you plot the number theory graph on top of those for all the other subjects, it is very difficult to distinguish them. So we have something which very closely approximates a Zipf distribution for these, where we plot frequency by symbol rank.
15:55
The same thing holds for the engineering symbols. Each line here represents a chapter, and these are the symbols, where the horizontal axis labels are different for each line: the most frequent character in one is not necessarily the most frequent in the others, but the distribution is very similar. Looked at a different way, this is the same data as a cumulative distribution plot, and here it is on a log-log scale. You see this pretty linear log-log relation over a wide range. There are some exceptions for the most common characters and some exceptions for the rarest ones.
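The "pretty linear" log-log behaviour mentioned here is what a Zipf-like distribution predicts: frequency falls off roughly as a power of rank. A minimal sketch of how one might estimate that power from a frequency table is shown below. The function name `zipf_slope` is hypothetical, and fitting the whole range by ordinary least squares is an assumption of this sketch; the talk notes exceptions at both ends of the ranking, which a more careful fit would exclude.

```python
import math

def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    A value near -1 corresponds to a classical Zipf distribution.
    """
    # Rank 1 is the most frequent symbol; zero counts cannot be logged.
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

Comparing the slopes obtained for different subject areas is one simple way to check the observation that the distributions are very similar from one area to another even when the ranked symbols differ.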
16:36
The same thing is true for the n-grams. We have plotted the n-gram frequencies only for the engineering data. So here we have n-grams for the three different authors, bigrams through 5-grams, and we know each author has his own style, but the n-grams are very similarly distributed. So we see that we get fairly steep curves.
17:05
Now here is the data. If we look at the data in all the areas by frequency, we can write down the letters for each one, starting from the most common, and I have put a paragraph mark in to show where the sequence starts to become unique. So for example, in logic there are other areas with the same three most common symbols, but it is certainly different in the fourth. What I think is interesting is something like this one, where we've got a few symbols first and then some other symbols for the
17:51
rest. You can see that the paragraph mark comes fairly early, and so we really only need to look at the top 10 most common symbols to get an idea of what area we're in. With the mathematical operators, there is a table in the paper in the proceedings that looks like this. From the back of the room all you can see is where the marks fall, and the point I'm trying to make is that for the letters the marks are close to the start, and for the mathematical operators they are a lot further out here. So it seems using letters is the best thing. So, on to the experiments. Talking about the conclusions and
18:32
future work: we are in the process of doing experiments with the arXiv data, taking half of the arXiv data for training and seeing how often we correctly classify or misclassify documents from the other half of the data. I don't have those numbers today; I was in the middle of doing this, but I thought it was an interesting thing to show. The areas come out very different from each other even on the simple symbol frequencies in the arXiv, let alone the n-grams. So do we need to see the forms of the expressions? Maybe not. At first we are trying to build classifiers based on order statistics, saying which symbol is most common, and then we will look at classifiers based on how different the frequencies are: something which is very noticeable, though not the most common, might be more important than the most common symbols, which might all have about the same frequency. But we have already got some conclusions here. The symbol and n-gram rankings vary quite significantly by subject area. And the last thing is that the frequency as a function of rank follows astonishingly similar distributions from one area to another. The fact that the curves for the different areas are so close to each other might tell us something: it is nothing that anyone does consciously, but it really does show up in the writing. And that's all I have to say. and it would be the 1st of all the reality I I think that this kind of a region that American but I don't know these factors letters crowds during yet the on other thing that comparing that to the area which is very costly on the other hand the new things in the really encouraging to try them the president and his father was like the people of the area who of Shakespeare's plays against the I want to experiment that thinking anew after a the different authors concede same ordering of 3 it was a few years we 
I'm right so that that's right so DNA but because there were only 3 authors and artists few subjects the young and the goal was of course they were were different but they were quite similar and like some author use some simple and author did and so what we didn't want to be constructed we need to be there symbols we I need to be constantly testimony out of my life which the terms of the deal among the most popular symbols from the 1 with most participants from here and it was sort of the same or eliminate this the similarity between the 2 there's still greater difference both the subject to the same yes that is correct so that it is more similar to the areas than it on with 1 exception 1 of the authors with the government is itself so that there were tons of certain facts in his that that was across all I just to really In all you have to be flown to yes that's the next question so so that was little it has to be fierce declared what what I was doing what I saw as well as well was the actor was to be able to get to generate some market models assessing the services they might see like the next character is going to be an equal sign greater than all the predictable in its characters the great readiness but along the way discovered this sort of another question that we would like to answer perhaps is how she promised discrimination it was really a nice little revenue yet which can be used as the people of the view that also features will do in this article would you like to do is to find some features we should get rid of this coalition on yet another delivery within the letter of the law .period because we used to think I guess there's a lot paper the this is the 1st of kind the it said but we will get the most out of range Have you have some idea compared the last you you have to do more of their right and that that's a really good question and 1 that would be nice to investigate but that was the suicide of globalizing drove past that it somebody should look at 
that question that I think that it would be really very interesting and it may be that the words are an even stronger predictor of mathematics that I was interested In the past couple of years to you can also use the "quotation mark all of the the the issue of yes this summer you through him according is another 1 the you can buy because ultimately this and 1 in the world the amount of the aid agency the case that because fuel at time of incident that is the most strongly correlated with increases in the future and and the and the next which you yes but the effect is like that of the 2nd quarter compared to the handwriting if you're using characters dash events in which the rights of others you as well you know about some of the can used to you you have to think of what might have been able to get it to do so that you can get quite like the ability to say I guess point which ordered that might help find each were on the correlation so long as originally thought a supplies yes that is why is it that you do investigation at the scene and you get for cats the is same as that of the same strategy was working with the database training in the laboratory used questionable not so long the question is what were widely watched from across the Union can have consists of a structure that is a good 1 but I the regularly standing in on the back of I think you're always on that's right that was great thank I want know what during the the last thing some of the kind of guy who just want so I think that I have not been sufficiently clear of what you're saying is certainly true that over time is just as easy that with every competencies 1967 roundtable favor and then almost and then there is was like sitting on heralded the start of this textbook was not available in the office would not provide access to sources for research purposes to justice system him because this book is stated that talk but it would be the he naive to ignore the fact that this book is big business so the of 
the text sources on how his livelihood and large part of publishers 1 the price armed guards outside so the axle that thought it was important was that this would be the end of the 1st half of the year regular yeah I think that without having numbers that all of us in the legal battle foreign civilians is we know that we old the central government and the states so that is pledging to continuously to begin with many of have questions background the
Metadata
Formal Metadata
Title  Mathematical Document Classification via Symbol Frequency Analysis
Series title  DML 2008 workshop  Towards Digital Mathematics Library.
Part  4
Number of parts  14
Author
Watt, Stephen M.

License
CC Attribution 3.0 Unported: You may use and modify the work or its content for any legal purpose, and reproduce, distribute, and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner they specify.
DOI  10.5446/21270
Publisher  River Valley TV
Release year  2012
Language  English