
From Text to Map, a state of art.


Formal Metadata

Title
From Text to Map, a state of art.
Title of Series
Number of Parts
295
Author
Contributors
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Natural Language Processing has been revivified by Deep Learning approaches. This presentation will show what we can already achieve to convert plain text to a map. Think, for instance, of retrieving the geometries conveyed by an article, a book or a tweet.
Keywords
Transcript: English (auto-generated)
All right, ladies and gentlemen, if I may have your attention, I would like to give you Olivier Courtin. So this presentation is about text to map. The point is to present the kind of issues you face if you want to convert from text to map. Starting from some kind of text, the goal is to extract the geographical information conveyed by that text, even if it is unstructured text, so plain text, and to convert it into geographical coordinates that can be rendered on a map. So how can we do that right now? The first operation is the ability to extract the parts of the text referring to geographical things. The second part of the problem is, once we have extracted several toponyms, to match them against an already existing gazetteer.
Once we pose the problem like that, we can say, naively, that the first operation is just named entity recognition, and there are already tools providing that kind of thing, and the other one is just a GeoNames query. So we have the impression that it is quite easy. But in fact the named entity recognition must be reliable enough to make the difference between Washington the famous person and Washington the place, and sometimes that is not obvious at all. And on the other part, the GeoNames one, there are so many places called Paris. So it is not obvious to find the right toponym, even if you have extracted one.
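As an illustration of how ambiguous that naive second step is, here is a minimal sketch against the public GeoNames search web service (it requires a registered account; "demo" below is only a placeholder username):

```python
import requests

# Query the GeoNames search API for a single toponym.
# NOTE: "demo" is a placeholder; a real GeoNames account username is required.
resp = requests.get(
    "http://api.geonames.org/searchJSON",
    params={"q": "Paris", "featureClass": "P", "maxRows": 10, "username": "demo"},
    timeout=10,
)
resp.raise_for_status()

for place in resp.json().get("geonames", []):
    # Many distinct populated places share the name "Paris" (France, Texas, ...).
    print(place["name"], place["countryName"], place["lat"], place["lng"])
```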
So why is NLP, natural language processing, hard? First, you know that a comma can save a grandma. That kind of little difference is obvious to you, it is common sense, but for a purely statistical process there is no common sense, so a comma can really kill a grandma for statistical approaches. Second is irony: irony expresses roughly the opposite of what you literally say, because you bet that the one who hears it is able to translate it back, and machines do not understand irony well. Other difficulties are based on sound, plays on words like ICQ read as "I seek you", and so on. In fact, if you really look at language, a well-known philosopher dedicated most of his life to it, and one of his famous points is that there is no absolute semantics conveyed by a single word: it really depends on the people involved in the conversation. And the same holds for something bigger than a word, a sentence.
If you look at a sentence, there is no absolute semantics conveyed by the sentence either. The classical way to handle language understanding was historically based on ontologies: relations between one entity and another, expressed as triplets. That is really huge to put in place and to maintain, so in fact it is only workable for small applications, because it is too big to model at large scale; it does not work broadly. One other way was to say, okay, we forget the ontology, we forget the semantics, and we focus only on statistical approaches: what are we already able to extract from unstructured data, so text, with statistics alone? This kind of approach, with bags of words, with n-grams and so on, is already able to do something. It does not deal with semantics at all, it does not understand the text, but it is able to extract things from it. So it is a beginning, and it is quite efficient, even if it has no idea what it is looking at.
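As a minimal illustration of that purely statistical view, not something shown in the talk, a bag-of-words / n-gram representation with scikit-learn counts token sequences without any notion of meaning:

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "Paris is the capital of France",
    "She visited Paris, Texas last summer",
]

# Unigrams and bigrams: pure token statistics, no semantics involved.
vectorizer = CountVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(texts)

# Both "Paris" mentions end up in the same column, whatever they refer to.
print(vectorizer.get_feature_names_out())
print(matrix.toarray())
```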
So what kind of open-source NLP libraries do we have right now to play with? One comparison, provided by spaCy, which is my current favourite, compares the functions offered by the different libraries. The point we are interested in is named entity recognition, the ability to recognize parts of a sentence. There are also newer libraries, because it is a really lively area; new libraries come up, not every week, but it is quite an old field with new solutions still arising. If we look at spaCy, we can use it for several languages and with different kinds of models. The point with this kind of library is the ability to train them, it is machine learning, to train them on data where there is enough information, in fact labeled data. The point is to train the model with text plus labels.
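To make that concrete, labeled data for named entity recognition is essentially raw text plus character-offset entity spans. The sketch below uses the tuple style of spaCy's older training examples; the exact training API differs between spaCy versions, so treat it as an illustration of the format only:

```python
# Each example pairs raw text with labeled character spans.
TRAIN_DATA = [
    ("Olivier flew from Paris to Bucharest.",
     {"entities": [(18, 23, "GPE"), (27, 36, "GPE")]}),
    ("Washington wrote about Washington.",
     {"entities": [(0, 10, "PERSON"), (23, 33, "GPE")]}),
]

# Sanity-check the offsets against the text they are supposed to label.
for text, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        print(label, "->", text[start:end])
```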
There is not an infinity of labeled text available, so not all languages are taken into account, only a small number of them. But if you are one of the lucky ones, it can already be interesting. And for some languages, especially English, there are different kinds of models: a small one, a medium one, a big one, and so on. So it can be interesting to see the kind of difference we can expect between a large model and a small model. The first difference is the number of entity types you can get from the two. The ones that interest us are only the ones related to location: here the location label, the orange one, and GPE as well; on this one it is only GPE. And the point is that with a larger model you can bet that it will be able to make finer distinctions between entities that are harder to classify. For example, here "Bluegrass" is identified as an organization, so somehow as people, but here, in this context, the small model identifies "Bluegrass" as a location, so as a toponym. So it really depends on the model itself: was it well trained, was it trained with enough data, with good enough data, to then be able to classify your text correctly and tell which words are supposed to be a toponym, yes or no?
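As a rough sketch of that small-versus-large comparison with spaCy, assuming the en_core_web_sm and en_core_web_lg models have been downloaded (e.g. with `python -m spacy download en_core_web_sm`); the two models can disagree on borderline entities:

```python
import spacy

text = "Bluegrass hosted the meeting after the delegation left Washington."

# The same text run through a small and a large English model.
for model_name in ("en_core_web_sm", "en_core_web_lg"):
    nlp = spacy.load(model_name)
    doc = nlp(text)
    # Keep only the entity types that can carry a toponym.
    places = [(ent.text, ent.label_) for ent in doc.ents
              if ent.label_ in ("GPE", "LOC", "FAC")]
    print(model_name, places)
```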
If we look at a bigger tool, able to do both operations at the same time, there is Mordecai. Mordecai can use a named entity recognizer such as spaCy for the first operation, but it is also able to give you, at the end, a confidence that the toponym you extracted is or is not related to a given country. It gives you that first confidence information because it looks in the GeoNames gazetteer and finds whether this toponym exists there, and if there are many candidates it gives a clue about the matching, based essentially on the population or on the size of the feature. So it is a basic way to give a first confidence rate, but it is really only a first step, and there is room to improve it.
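If Mordecai is installed together with the GeoNames Elasticsearch index it expects, its use is roughly as below; this is a sketch following its documented Geoparser interface, and field names may vary between versions:

```python
from mordecai import Geoparser

# Requires the GeoNames Elasticsearch index that Mordecai expects to be running.
geo = Geoparser()

results = geo.geoparse("Protests were reported in Paris and in Washington this week.")

for hit in results:
    # Each hit pairs the extracted toponym with the predicted country
    # and, when a gazetteer match is found, resolved coordinates.
    print(hit.get("word"), hit.get("country_predicted"), hit.get("geo"))
```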
GeoNames itself is a list of millions of toponyms with coordinates and extra information, so there is metadata attached to each entry stored in GeoNames. The point is that GeoNames is not complete: there is a lot of information in it, but there are still missing names, just as there are missing maps. So even if there is already a lot in it, no one pretends that all the information is in GeoNames.
So if we look at the whole problem and ask how we can go further, to have something really able to convert text to map, the first subject is named entity recognition. First, we can imagine using richer structured text, for example DBpedia, which conveys some kind of ontology inside, to extract more information for the labeling and so train models more efficiently. That is the first lead. Another interesting one would be to find a way to deal better with multi-language text, so that the approach is generic enough to work on more than one language. And the other point is that it is a really lively area right now: some years ago computer vision was the hottest thing in machine learning and deep learning, now it is NLP, so there is a lot of research in progress. For instance, this link gives a list of all the latest papers on named entity recognition alone.
The second subject is the gazetteer: what can we expect there to improve the text-to-map conversion? The first idea could be to find an automatic way to complete the gazetteer. There are some projects that OCR existing maps to extract toponyms, and so to have a beginning of automatic completion. It could be a way, at least, to present toponyms for a human to validate, with an automatic extraction that helps them save time. Could we use other kinds of data, like Wikipedia, anything, to complete the gazetteer? The same kind of thing is already done for OpenStreetMap; can we also do it for the gazetteer, to complete the names?
The third subject is toponym matching. Right now we only use one toponym at a time, so we do not use the context of the text to help the matching. But obviously, in a text, if you are talking about one toponym and a few sentences later about another toponym, there is a kind of relationship between them. So it would really be a good next step to try the matching not one toponym at a time but in batches, keeping all the context of the text. The other idea: as Mordecai already uses the population dimension to increase the matching confidence, we can imagine using other dimensions. Population is a good one, but for us the geographical dimension could also be a good one: the distance between the different toponyms, and if we work in batches we can compute cross-distances between all of them. So there are several ways, at each step, to improve this kind of conversion.
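As a sketch of that geographic-coherence idea (entirely hypothetical code, not a tool from the talk): given candidate coordinates for each toponym in a batch, keep the combination that minimizes the total pairwise great-circle distance.

```python
from itertools import product
from math import asin, cos, radians, sin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def most_coherent(candidates):
    """candidates: {toponym: [(lat, lon), ...]} -> the combination whose
    chosen candidates are geographically closest to one another."""
    names = list(candidates)
    best, best_score = None, float("inf")
    for combo in product(*(candidates[n] for n in names)):
        score = sum(haversine_km(combo[i], combo[j])
                    for i in range(len(combo)) for j in range(i + 1, len(combo)))
        if score < best_score:
            best, best_score = dict(zip(names, combo)), score
    return best

# "Paris" mentioned near "Versailles" should resolve to Paris, France, not Paris, Texas.
print(most_coherent({
    "Paris": [(48.8566, 2.3522), (33.6609, -95.5555)],
    "Versailles": [(48.8049, 2.1204)],
}))
```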
If you want to play with this, here are some machine learning courses; this one is the best if you want to enter the field. There is a small workshop dedicated only to named entity recognition. If you want to track what is happening in NLP, it is there. This one is about keeping context in deep learning, how to keep some information in memory and not forget what was learned from one batch to the next. And here are some books related to the field. As a takeaway: NLP is harder than any other pattern recognition domain, so it is fun.
There are already named entity recognition software and models available that you can play with. It is really a hot and lively area, so new progress keeps arriving. We have first tools to convert text to map, but there is still a lot to do to have something really reliable, that scales and works in every case. So yes, we already have tools, but not yet something you can fully trust. That's it. Comments or questions?

Thanks for the presentation. I think it was a nice summary of the field,
quite useful, with lots of useful pointers. But what I am missing here is some experimental data, some numbers, to see how well the current state of the art performs. I was initially thinking of, you know, taking a post from Twitter for example, or a generic text on the web, applying one of the tools and seeing how well it does, where things go wrong and where they go right. I was just wondering, to get a feel for the performance of current tools right now.

The two kinds of issues are really these ones. The first is if your
named entity recognition fails to recognize something as a toponym, so you can miss toponyms. Or, even worse, it believes that something is a toponym when it is not, and that false toponym really is in the gazetteer. I would say it is something like 65 percent reliability, so it is better than nothing. It works in general, but it is not as reliable as computer vision, which is at something like 80 or 90 percent. So it is already something, it already works, but at some point you are not that sure whether it really is this toponym or not. What we really have to improve is the reliability and the consistency. If you want a metric, yes, something like 65 or 70.

Hi, I'm Daniel Dufour. I work on First Draft GIS, an open-source artificial
intelligence that makes maps. I was wondering, did you stumble across it? Anyway, it does geoparsing. We are getting about 80-85 percent accuracy, but it really depends on what dataset you are using, and there is just a lack of training data. So I would encourage anyone who wants to, to look at that. Honestly, it is open source and it is just me right now, so please help. It's github.com slash first draft GIS. Thank you. And it is really interesting: we skipped the entity recognition part because it was so hard.
Yeah, a quick question. When you are referring to the text-to-map tools, are you referring to countries or cities, or is there something like mapping streets? Are there text-to-map tools that bring you down to the street?

In this design, as we are looking at toponyms inside GeoNames, the question is: do we have a gazetteer matching the kind of toponym we extract? And if we go really sub-level, we have to be sure which city we are talking about. So if you extract a toponym related to a street, you must be sure that this street belongs to this city before doing anything else, because there are so many streets with the same name in different cities. The first point, in that case, would be to be sure that, because of the context, you are confident enough about the city we are talking about, and right now, no, there is no solution like that.

About training datasets: if we are talking about Twitter and tweets,
generally Twitter also gives you, not for all but for some of the tweets, the exact position of the tweet. So if somebody is talking, for example, about Bucharest, with reference to this conference happening right now at this location, we could use that information to train our dataset and, based on that, generate better text-to-map results. As a training dataset it would be a good starting point, though of course you have to verify it, because somebody can be talking about Bucharest while sitting in Berlin. So: using tweets and their positions to look into the context of those tweets, or whatever is written there, as training data. Thank you.

The point of a training dataset is to increase the ability to recognize entities well inside a sentence. So even if we have metadata attached to a message, a tweet for instance, it is not enough, because we are interested in the labeling inside the sentence. It is something interesting, but not, by itself, enough
to improve this kind of training data and models.

We have time for one more question. Have you looked at Wikipedia as training data, and what has your experience been there?
For now I just mentioned it as a perspective, but I think it is a good idea, because there is already something like a massive ontology base inside it. We can already use extra information from Wikipedia because there are tags on the articles, but with Wikipedia we also have some ontology, so it is something more than plain text. I can see some kind of hybrid approach, both statistical and semantic, based on an ontology, so I think it is a lead to see how we can enrich the information and get a better model, and better labeled output, in named entity recognition.

Thank you very much, Olivier. Another round of applause, please. Thank you.