Natural Language Processing meets FOSS4G
Formal Metadata
Title: Natural Language Processing meets FOSS4G
Title of Series: FOSS4G Bucharest 2019
Number of Parts: 295
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/43409 (DOI)
Transcript: English (auto-generated)
00:07
Ladies and gentlemen, I give you Eita Horishita. Thank you. Thanks for coming to my presentation. My name is Eita Horishita and I'm an individual participant of this event and I'm from Japan.
00:24
And today's topic is natural language processing for FOSS4G. What I would like to show today is this kind of map. There are two types. This one is a choropleth map, which shows some values of the regions by colors, as you know.
00:45
But it's not a typical choropleth map; it shows the relative similarity of the cities' master plans by colors. So, okay, let me point it out. For example, this place is red and this city is almost red as well.
01:13
It means these cities have similar master plans, similar content in their master plans.
01:21
So that is one kind of map. The other one is the document mapping, and it's not related to geography in any way. It has an XY coordinate, but the XY coordinate is completely virtual.
01:44
So it's not a geographic map; each point is text data. More concretely, it is Twitter data, so each point is a tweet, and the location also shows the similarity of the documents.
02:10
So similar tweets, similar texts, are located closer together. In other words, the documents are automatically categorized and classified by their meanings.
02:26
So I have a video, so let me start. And if you zoom in, you can visualize it as a heat map.
02:43
So this place has similar Twitter content; it looks like a bot. The other places have other content. Similar documents are plotted closer together, at the same place, so you can automatically categorize these documents.
03:10
I just visualized this in QGIS. So for example, this isolated island is weather reports or so.
03:21
That kind of thing. Today, I'd like to explain why I created these kinds of maps and these kinds of techniques, and how I created them, with some use cases and challenges as well.
03:41
But before talking about that, let me introduce myself for a better understanding of the background. I used to be a city planner in Tokyo as my first career, and at that time I started using GIS, which is typical in urban planning.
04:01
After a few years of experience, I decided to move to Sri Lanka to start up my coffee project, because I am a coffee geek, maybe more than a GIS geek. But at the same time, I used GIS for coffee, because good cultivation of coffee is strongly related to geographic features,
04:23
such as elevation, soil conditions, temperature, rainfall, and conditions for transportation. So in both phases, I used GIS and found it very useful.
04:43
I found I can visualize or evaluate numeric, quantitative things with GIS. However, how can I evaluate qualitative matters such as people's emotions, a friendly atmosphere, motivations, and so on?
05:03
That is why I started learning natural language processing, NLP for short, because I believe it can be a kind of breakthrough for this question. About my technical background: I use QGIS.
05:21
Sometimes I also use Leaflet, a JavaScript library, for visualizing on the web, and some Python modules, especially GIS- and NLP-related ones. Now let's get back to the maps I introduced in the beginning.
05:41
There are two maps, and both show the similarity of the contents of documents. One, the conceptual corpus map, shows it by colors, and the other, the document map, by distance. The input of the conceptual corpus map is 120 urban plans around Tokyo and 170 city abstracts from Wikipedia.
06:07
Similar plans are illustrated with similar colors. The input of the document map is 5,000 tweets, and similar texts are plotted closer together.
06:21
The map is a completely virtual, non-geographic XY coordinate space. About the technical features: the similarity of documents is numerically calculated based on keywords, basically the frequency of keywords. A keyword that appears frequently in a document should be regarded as a feature word of that document.
06:49
However, if it is used in most of the other documents as well, it's not a feature word but just a common word, so it is less important. In this way, the importance of each keyword and its partiality, how specific it is to the document, are calculated automatically by the program.
07:14
And the document is converted into numerical features.
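What is described here matches the standard TF-IDF weighting; the talk does not name it explicitly, so this formula is given only as a hedged reference. With term frequency $\mathrm{tf}(t, d)$, total document count $N$, and document frequency $\mathrm{df}(t)$:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{1 + \mathrm{df}(t)}$$

A word that is frequent in one document but rare across the corpus gets a high weight; a word common to most documents gets a weight close to zero.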
07:24
About the clustering and deciding the locations of the coordinates: an unsupervised machine learning approach is used for the clustering process. Unsupervised means that, for example for this document map, you don't have to define any categories before processing.
07:45
Instead, the computer program automatically decides the locations and classifies the categories by way of soft clustering.
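The talk does not name the clustering algorithm; as one hedged example of soft clustering on document coordinates, a Gaussian mixture model from scikit-learn assigns each document a probability of belonging to each cluster. The coordinates and cluster count below are placeholders.

```python
# A minimal sketch of soft clustering on 2-D document coordinates.
# X_2d stands in for the output of a dimensionality-reduction step.
import numpy as np
from sklearn.mixture import GaussianMixture

X_2d = np.random.RandomState(0).rand(500, 2)   # placeholder coordinates

gmm = GaussianMixture(n_components=8, random_state=0).fit(X_2d)
hard_labels = gmm.predict(X_2d)          # one category per document
soft_labels = gmm.predict_proba(X_2d)    # membership probability per category

print(hard_labels[:5])
print(soft_labels[:5].round(2))
```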
08:02
Document vectorization and clustering are very typical natural language processing topics, so there are many kinds of methods and many kinds of open source programs for this purpose. The important thing is that you have to choose the best algorithm and the best modules.
08:26
They are usually available in Python. And you have to understand how they work and how the hyperparameters for machine learning work.
08:44
Choosing the best one is a very important point. A little bit more about the use cases: it is very useful for understanding unexpected similarities between cities in a visualized way. For example, on this map, this place and maybe this place have a similar color.
09:10
So the content of their planning is somewhat similar, but I didn't know that. So we can find unexpected things with this method.
09:26
It also categorizes the input in an objectively evaluated way. Machine learning is used, so the computer defines these things; it is a totally objective evaluation.
09:41
And you may find interesting documents by exploring the maps without reading all of them. Let's talk about the document map. It is very difficult for us to read over a thousand documents. However, if they are plotted on this kind of map, you can exclude some categories
10:03
because you don't have to read them. You can find the categories you are really interested in, and after that you can dive into those specific categories and find the documents you really wanted to see.
10:25
For this purpose, this kind of method is used. Next, I'd like to explain how to create these things. There are six or seven steps in total for creating these maps.
10:44
Firstly, of course, you have to collect the documents. As long as it is text data, and as long as you can collect it, you can use all kinds of data as input. For these cases, I used city plans, Wikipedia data, and Twitter because they are easy to get.
11:02
But if you have any other kind of text data, you can use it as input. After collecting the documents, the next step is morphological analysis. It aims to divide the sentences into words, together with the class (part of speech) or the stem of each word.
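The speaker mentions NLTK for English later in the Q&A; a minimal sketch of this step might look as follows (the sample sentence is only an illustration):

```python
# A minimal sketch of morphological analysis for English with NLTK:
# tokenize, tag parts of speech, and reduce each word to a stem.
import nltk
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "The city published a new master plan for flood protection."
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)             # (word, part-of-speech) pairs
stems = [PorterStemmer().stem(w) for w in tokens]

print(tagged)
print(stems)
```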
11:27
That is morphological analysis. After dividing the text into words with a class or a stem, the next step is vector space modeling,
11:40
so calculating the frequency of each word, its importance, and its partiality. In this phase, the text data is converted into numeric features, in the form of a sparse matrix.
12:04
After that, the numeric features of the documents are compared with each other and the similarity is calculated. The third step is dimensionality reduction.
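The talk does not name the library used for this step; as a hedged sketch, scikit-learn's TfidfVectorizer produces exactly this kind of sparse keyword-weight matrix, and cosine similarity compares the resulting document vectors (the sample documents are placeholders):

```python
# A minimal sketch of vector space modeling and document similarity:
# build a sparse TF-IDF matrix, then compare every pair of documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "master plan focused on flood protection and river management",
    "urban plan about public transportation and station areas",
    "plan for rivers, flood control and green infrastructure",
]

X = TfidfVectorizer().fit_transform(docs)   # sparse matrix, one row per document
sim = cosine_similarity(X)                  # (n_docs, n_docs) similarity matrix

print(X.shape)
print(sim.round(2))
```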
12:23
The output of this second phase, vector space modeling, has over 1,000 dimensions, and sometimes over 10,000 dimensions. We cannot understand or see that kind of high-dimensional data.
12:41
That is why we have to reduce this high dimensionality to two or three dimensions; we can only understand two or three dimensions. So sometimes a machine learning process is used in this phase as well.
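The talk does not say which reduction method was used; as one hedged example, t-SNE from scikit-learn reduces high-dimensional document vectors to 2-D XY coordinates that QGIS can load as a delimited-text point layer (the input array is a placeholder):

```python
# A minimal sketch: reduce high-dimensional document vectors to 2-D XY
# coordinates with t-SNE and write them to a CSV for QGIS.
import csv
import numpy as np
from sklearn.manifold import TSNE

X = np.random.RandomState(0).rand(500, 1000)   # placeholder ~1,000-dim vectors

xy = TSNE(n_components=2, init="random", random_state=0).fit_transform(X)

with open("document_map.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["doc_id", "x", "y"])
    for i, (x, y) in enumerate(xy):
        writer.writerow([i, float(x), float(y)])
```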
13:02
If it is reduced to two dimensions, an XY coordinate, you can visualize it in GIS software like QGIS; that is the document mapping example. If it is reduced to three dimensions, an XYZ coordinate, it can also be visualized in GIS software,
13:26
but this time I converted the XYZ coordinates into RGB values with normalization and related them to the cities' geographic features.
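The exact normalization is not spelled out in the talk; a straightforward reading is a min-max rescaling of each of the three reduced dimensions to the 0-255 range, giving one color per city plan (the input array below is a placeholder):

```python
# A minimal sketch of turning 3-D reduced document coordinates into RGB
# colors via min-max normalization, one color per city plan.
import numpy as np

xyz = np.random.RandomState(0).rand(120, 3) * 10 - 5   # placeholder 3-D coordinates

mins, maxs = xyz.min(axis=0), xyz.max(axis=0)
rgb = ((xyz - mins) / (maxs - mins) * 255).astype(int)  # each axis mapped to 0-255

# Join these colors back to the city polygons (for example by city id)
# and style the choropleth with them in QGIS.
print(rgb[:5])
```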
13:41
That is the conceptual corpus map I introduced in the beginning. The last step is visualizing it on the map; for visualizing, I used QGIS and Leaflet.js. Talking about more use cases: for the conceptual corpus map, the first one,
14:08
it is useful for visualizing the results of unstructured interviews or free-form questions. When I was a city planner, I conducted many kinds of resident participatory meetings.
14:27
It was kind of a trend among Japanese city planners at that time, so each city has records, the minutes, of its resident participatory meetings. If these minutes are collected together and visualized on the map as a conceptual corpus map,
14:49
you can find out that this city and this city have similar problems or similar discussions as a result of the meetings.
15:08
I've never tried this, but it could be used for this kind of purpose.
15:21
The second one is illustrating political classes by region, or finding unexpected similarities between regions. Those are the use cases for the conceptual corpus map. As for document mapping, it can be utilized for trend analysis. Each document is plotted as a point, so you can detect the centroid of the points.
15:47
If it is visualized chronologically, for example, last year's centroid of topics was at this point and this year's is here, so it moved from here in this direction.
16:07
So next year, the centroid of the points, the centroid of the topic, will likely keep moving in this direction. That kind of predictive analysis can be done with this method. It may seem like fantasy, but I am actually doing that using patent data and academic papers.
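A minimal sketch of the centroid idea, assuming each document already has 2-D map coordinates and a year attribute (the data below are placeholders):

```python
# A minimal sketch of topic-centroid trend analysis: average the 2-D
# document coordinates per year and watch how the centroid moves.
import numpy as np

years = np.array([2017] * 100 + [2018] * 100 + [2019] * 100)
xy = np.random.RandomState(0).rand(300, 2)   # placeholder document coordinates

for year in np.unique(years):
    centroid = xy[years == year].mean(axis=0)
    print(year, centroid.round(3))
```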
16:34
Then we can realize that next year this kind of technology will rise, or something like that.
16:44
So this kind of trend analysis can be done with document mapping. White space detection as well: for example, if I visualize many kinds of academic papers,
17:01
and they are categorized and visualized on the heat map, this place is really well developed or well researched, but the space between these areas is not researched well. This kind of white space detection can also be done with document mapping.
17:25
These are the use cases. But actually, there are many challenges as well. In the first place, the accuracy is not good enough, actually. I need to choose the best algorithm for the purpose, and I need to choose the best parameters for the machine learning process.
17:46
So it is still in the development and trial phase, actually. Another challenge is that I want to add some additional data, like the results of emotion (sentiment) analysis.
18:03
But I couldn't realize that using open source software, so it is still in the development phase. The third one is the development of QGIS programming for NLP-based analysis.
18:22
This kind of analysis, the creation of document maps or conceptual maps, can be done through QGIS programming, but if it could be done with just one click, it would be very useful.
18:41
It can be very useful, I think. So actually, there are many things to be done, and I understand there is much room for improvement, but I think there are many possibilities for NLP for FOSS4G.
19:01
So if you come up with any good ideas to utilize this kind of natural language processing method for FOSS4G, please let me know. That's all from me. Thank you so much.
19:28
Which natural language were you using, and are there differences between languages on how well this method can be applied?
19:44
I think it is very useful at least for patent data or academic papers, because they are very structured and the contents are really rich. Twitter data is a little bit short for processing. And the city planning documents I introduced today are not so good, because they are often published as PDF files,
20:10
and I convert them from PDF to text, but the PDF files are unstructured, so it is very difficult to implement accurate processing.
20:24
What I really meant was: were you using Japanese or English? Okay, okay. Both are possible. I used this kind of open source or free software; for English, I used NLTK, the Natural Language Toolkit.
20:44
For Japanese, there are many software packages or programs for morphological analysis. For example, I used MeCab. Some languages are very difficult because, for example, in Japanese there is no space between words.
21:06
That is why, for morphological analysis, we have to use a good internal dictionary to divide the words. But it is possible; you can do it with open source software or programs.
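A minimal sketch of Japanese morphological analysis, assuming the mecab-python3 bindings and a dictionary (such as ipadic or unidic-lite) are installed; the sample sentence is only an illustration:

```python
# A minimal sketch of Japanese morphological analysis with MeCab.
import MeCab

tagger = MeCab.Tagger()
text = "東京の都市計画マスタープランを分析します。"

# parseToNode() walks through the tokens one by one; each node carries
# the surface form and comma-separated features (POS, base form, ...).
node = tagger.parseToNode(text)
while node:
    if node.surface:
        print(node.surface, node.feature.split(",")[0])
    node = node.next
```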
21:25
So, language doesn't matter, I think. Sorry, again.
21:51
If there is infrastructure, how many virtual machines do you need to process? If the size of the document or the total size of the file becomes massive, I think it's impossible to process on my laptop.
22:14
So you have to use something more powerful, yes; especially for a larger collection of documents,
22:23
you have to use a good machine for deep learning, or one specialized for deep learning, or something. So, you mentioned that you were searching for better hyperparameters
22:42
for the machine learning algorithms, and also different machine learning algorithms. How do you evaluate which works better than the other? For evaluating the accuracy, it is a difficult point, actually, and I cannot really evaluate it.
23:02
So after the processing, the first thing I have to do is check and compare the results with my own impression.
23:22
I'm sorry for my poor English. For example, I processed this kind of map using city planning documents, and actually, this is my hometown, and this is another town.
23:44
In my impression, these cities should be similar. However, in the result they are not similar but totally different. They may be somewhat different, but they should be similar to a certain extent, so I thought the accuracy is not good enough.
24:03
So I think after processing you have to compare your first-impression answer with the outcome. And if 60 percent is correct, that is, I think, a good result.
24:20
And another 20 percent are new findings beyond your impression, something like that. I'm sorry, it's not a good answer, but please follow up with me.
24:44
You can contact me. Okay, how can I? So, yeah. So, do you have any? Yeah, yeah, yeah. Yes, of course. Okay, let me write down, okay?
25:05
Are there any other questions? Any other questions? Peter, thank you very much. Thank you very much.