
Geosocial Big Data Analysis Using Python and FOSS4G with the Case Study of Korean Data


Formal Metadata

Title
Geosocial Big Data Analysis Using Python and FOSS4G with the Case Study of Korean Data
Series Title
Number of Parts
183
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Germany:
You may use, modify, and reproduce the work or its content for any legal and non-commercial purpose, and distribute and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner specified by them and pass on the work or this content, including in modified form, only under the terms of this license.
Identifiers
Publisher
Publication Year
Language
Producer
Production Year: 2015
Production Place: Seoul, South Korea

Content Metadata

Subject Area
Genre
Abstract
Nowadays, there is much research on the analysis of geosocial big data, such as geotweets and Foursquare venues, and OSS (Open Source Software) plays an important role in this. Analyzing geosocial big data involves several different steps, such as data collection, data parsing, data conversion, statistical analysis, visualization, and database management. An integrated system architecture and a compatible analysis environment therefore play a key role in acquiring relevant analysis results. The Python programming language supports an interoperable analysis environment across the various software functions and enables geosocial big data to be processed on an integrated platform. FOSS4G provides the software environment for geovisualization and management of the collected data. In this study, the process of geosocial big data analysis is introduced with a case study of geotweets and Foursquare venues, and the analysis results are presented with the case study of Korean data. For this study, the Python API libraries for Twitter (Tweepy) and Foursquare (pyfoursquare) were used to collect the geosocial data; Pandas and Simplejson were used to parse and extract the valid data; and GDAL and PySAL were used to convert and analyze the GIS data. PyTagCloud and WordCloud were used to visualize the qualitative text. MongoDB was used to store the collected dataset, and QGIS was applied for the geovisualization.
Transcript: English (auto-generated)
This is Young Hong, from Namseoul University. Today my topic is geosocial big data analysis using Python and FOSS4G, with a case study of Korean data. As you know, Twitter and Facebook are very popular nowadays; they are among the killer applications for the smartphone, so everyone is using this kind of application. And because the smartphone has GPS functionality, the data nowadays comes with coordinates. Social data that has a geotag is called geosocial data, and geotweets from Twitter and venues from Foursquare are the most famous examples. Until now, there has been a lot of research on geosocial data.
The first is by Fujita Hideyuki; his topic is a geotagged tweet collection and visualization system. As he is a computation researcher, he focused more on the computation and data collection methods, with some visualization. The second one takes a different approach; its topic is word-cloud-based qualitative geovisualization of geotweets. In this case, the author used a qualitative approach with content analysis, so he focused more on the content: what are people talking about in their tweets? Fujita used almost three million geotweets for his analysis, but in this second case only about 14,000 tweets were used. Another one is on spatial, temporal, and socioeconomic patterns in the use of Twitter and Flickr. Here the authors used statistical analysis methods, trying to find geospatial relationships between the tweeted content and geodemographic variables. And the last one is the geography of happiness, which used sentiment analysis.
In that case, they used computational linguistic approaches. So many different researchers use their own approaches for geotweet analysis; we can say that geotweet analysis has a multidisciplinary aspect. Social data itself arguably belongs to sociology, art and science, journalism, and media. However, the data is stored on Facebook or Twitter, so to access it we need some web programming. After collecting the data, we have a data management issue. Analyzing the data involves different approaches, like qualitative analysis for linguists and quantitative analysis like spatial statistics. And the last issue is data visualization: how to map and present these data. So the challenges of this research are that there are many different data types and formats, like tweets, Foursquare venues, and Facebook posts, and that each research group has its own analysis environment and methods: different language systems, software packages, databases, statistical analysis methods, and geovisualization methods. Moreover, as I mentioned, the analyses belong to very different domains: sociology, geography, statistics, linguistics.
So my question is: this kind of study requires interdisciplinary cooperation, so is there any way to integrate these methods? I tried to find the solution in FOSS4G and Python, because Python provides an integrated environment and software libraries. As you know, Python is free and open source, and there are various scientific distributions like WinPython, Anaconda, and Enthought Python. Python provides a lot of different libraries; if you go to PyPI, you can currently find over 66,000 packages there. And lastly, Python offers a very simple coding environment: easy to learn, easy to code, and very clear and readable. So my research purpose is to introduce an integrated platform to analyze this data using Python and FOSS4G, covering data collection, data management, data analysis with qualitative and quantitative methods, sentiment analysis, and geovisualization. And I will present case studies with Korean geosocial data, like the geosocial data distribution, the spatial pattern of Foursquare venues, and some sentiment analysis of Korean geotweets. Actually, I have been doing this for around two years now. In the beginning the workflow was a very generic pattern: there is social media data, and we access it using the API.
We get the file in JSON format. Converting it to an Excel file, we can create charts or some descriptive statistics; and if we have the coordinates, we can convert it to a shapefile and make a map with a desktop GIS like QGIS. A minimal sketch of this conversion step follows below.
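As a rough illustration of that generic workflow, here is a minimal sketch (not the author's actual script) that parses collected geotweets, assumed to be stored one JSON object per line, with simplejson and pandas, and writes the coordinates to a point shapefile with GDAL/OGR; the file names and field choices are placeholders.

```python
# Minimal sketch: line-delimited geotweet JSON -> pandas -> shapefile.
import simplejson as json
import pandas as pd
from osgeo import ogr, osr

rows = []
with open('geotweets.json') as f:              # placeholder input file
    for line in f:
        t = json.loads(line)
        if t.get('coordinates'):               # keep only geotagged tweets
            lon, lat = t['coordinates']['coordinates']
            rows.append({'id': t['id_str'], 'text': t['text'],
                         'created': t['created_at'], 'lon': lon, 'lat': lat})

df = pd.DataFrame(rows)
print(df.describe())                           # quick descriptive statistics

drv = ogr.GetDriverByName('ESRI Shapefile')    # write points for QGIS
ds = drv.CreateDataSource('geotweets.shp')
srs = osr.SpatialReference()
srs.ImportFromEPSG(4326)                       # WGS84 lon/lat
layer = ds.CreateLayer('geotweets', srs, ogr.wkbPoint)
layer.CreateField(ogr.FieldDefn('id', ogr.OFTString))
for _, r in df.iterrows():
    feat = ogr.Feature(layer.GetLayerDefn())
    feat.SetField('id', str(r['id']))
    pt = ogr.Geometry(ogr.wkbPoint)
    pt.AddPoint(float(r['lon']), float(r['lat']))
    feat.SetGeometry(pt)
    layer.CreateFeature(feat)
    feat = None
ds = None                                      # flush and close the file
```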
For data collection, I use the Twitter streaming API with the Tweepy library. But it usually has a limitation per user; there is a rate limit, something like 550 calls per hour for one user. For this reason, I added multiple user accounts: when one user ID reaches the limit, I switch users and collect more data. Even so, not that much data comes in, because as you know, geotweets are just about 1% of the total collected tweets. Nowadays it has increased a little bit, to maybe 5 to 7%. A sketch of this rotating collection follows below.
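A minimal sketch of such a rotating collector, written against the Tweepy 3.x-style streaming API; the credential list and the Korean bounding box are placeholders, and since the slides do not show the rotation logic, this is only an assumption of how it could look.

```python
# Minimal sketch: stream geotweets with Tweepy, rotating credentials
# to the next user account when one account gets rate limited.
import tweepy

CREDENTIALS = [  # placeholder (consumer key/secret, access token/secret)
    ('CK1', 'CS1', 'AT1', 'AS1'),
    ('CK2', 'CS2', 'AT2', 'AS2'),
]
KOREA_BBOX = [124.5, 33.0, 131.9, 38.6]        # lon/lat bounding box

class GeotweetListener(tweepy.StreamListener):
    def on_status(self, status):
        if status.coordinates:                 # keep only geotagged tweets
            print(status.id_str, status.coordinates['coordinates'])

    def on_error(self, status_code):
        # HTTP 420 means the streaming connection is rate limited;
        # returning False disconnects so we can switch accounts.
        return status_code != 420

for ck, cs, at, ats in CREDENTIALS:            # switch user on rate limit
    auth = tweepy.OAuthHandler(ck, cs)
    auth.set_access_token(at, ats)
    stream = tweepy.Stream(auth, GeotweetListener())
    try:
        stream.filter(locations=KOREA_BBOX)
    except Exception:
        continue                               # try the next credential set
```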
From a tweet we get the tweet text, which can be used for qualitative analysis like text mining or sentiment analysis. Using the tweet ID, we can analyze behavioral features of the tweet users, like heavy users. A tweet also has location coordinates, so we can make maps or geovisualizations and do spatial analysis. And it has a date and time, so we can also apply temporal analysis. Until now I have done two pieces of research.
One is a spatial analysis of location-based social networks in Seoul, and the second is the spatial distribution of Korean geotweets. The first is a study evaluating location-based social network data using GIS. I collected the Foursquare venues in Seoul using the Python Foursquare API, created a heat map and cluster analysis like hotspots, and applied a GWR (geographically weighted regression) model, which I will show in a moment. A sketch of the venue collection follows below.
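The talk used a Python Foursquare API wrapper; as a hedged illustration, here is a minimal venue search against the public Foursquare v2 REST endpoint using requests, with placeholder credentials and a Seoul coordinate.

```python
# Minimal sketch: search Foursquare venues around a point in Seoul.
import requests

params = {
    'client_id': 'CLIENT_ID',            # placeholder credentials
    'client_secret': 'CLIENT_SECRET',
    'v': '20150801',                     # API version date
    'll': '37.5665,126.9780',            # Seoul city hall, lat,lon
    'intent': 'browse',
    'radius': 1000,                      # metres
    'limit': 50,
}
resp = requests.get('https://api.foursquare.com/v2/venues/search',
                    params=params)
for venue in resp.json()['response']['venues']:
    cats = [c['name'] for c in venue.get('categories', [])]
    loc = venue['location']
    print(venue['name'], cats, loc.get('lat'), loc.get('lng'))
```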
This map shows the heat map of the distribution of Foursquare venues in Seoul. As you can see, the hot place is the Gangnam area, and here are Itaewon and the Hongdae area. I also analyzed the categories, because each venue has one of ten top-level categories. The number one category is food and professional services, and the second is shops and services. Together these categories cover almost 80 to 90%; most of the venues belong to them.
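Such a category breakdown is straightforward with pandas; a minimal sketch, where the list stands in for the top-level category of each collected venue:

```python
# Minimal sketch: percentage share of venues per top-level category.
import pandas as pd

categories = ['Food', 'Shop & Service', 'Food', 'Nightlife Spot']  # placeholder
share = pd.Series(categories).value_counts(normalize=True) * 100
print(share)                             # percentage of venues per category
```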
This one is the hotspot analysis of the Foursquare venues. The first map uses the number of venues, showing how heavily the venues are used, and the second and third use the percentage of each category; to these I applied hotspot analysis. And this one shows the GWR model: I used the working population and the land price as variables for the model, and it shows that the central districts, like Gangnam, have a high R-squared score. A sketch of the hotspot step follows below.
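The slides do not show the analysis code; as one hedged possibility with the PySAL library mentioned in the abstract, here is a minimal local Getis-Ord G* hotspot sketch over per-district venue counts. The shapefile and field name are placeholders, and the GWR step itself would need a dedicated GWR implementation, which classic PySAL does not bundle.

```python
# Minimal sketch: local Getis-Ord G* hotspots with classic PySAL 1.x.
import numpy as np
import pysal
from pysal.esda.getisord import G_Local

w = pysal.queen_from_shapefile('districts.shp')  # placeholder polygons
db = pysal.open('districts.dbf')
y = np.array(db.by_col('VENUES'))                # placeholder count field

g = G_Local(y, w, star=True)                     # G* includes the self-neighbor
hot = (g.Zs > 1.96) & (g.p_sim < 0.05)           # significant high-value clusters
print('hotspot districts:', hot.sum())
```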
The next one is the study of geotweets. I analyzed the Korean geotweet distribution using the Twitter API, and it shows the distribution of geotweets in Korea and some temporal features of the geotweets. Lastly, I also applied some text analysis. This map shows the distribution of geotweets in Korea. I cannot remember the exact number I collected, over a billion tweets; I think around 10% were geotweets. Anyway, the map shows that most of them are located in Seoul and the central region. This one shows a monthly view; I collected it in November 2014. From the first day to the last day you can see ups and downs: these days are Sundays and Saturdays, so the count goes up and down in a weekly pattern. And this one is the daily pattern: it goes up, peaks around 2 p.m., and comes down again. That is the daily pattern of the tweets.
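A minimal sketch of this kind of temporal aggregation with pandas, using placeholder timestamps:

```python
# Minimal sketch: aggregate geotweet timestamps by day and hour of day
# to expose the weekly and daily patterns described above.
import pandas as pd

created = pd.to_datetime([
    '2014-11-01 13:10', '2014-11-01 14:05', '2014-11-02 14:30',
    '2014-11-03 09:00',                      # ... one entry per geotweet
])
ts = pd.Series(1, index=created)

daily = ts.resample('D').sum()               # tweets per day (weekly rhythm)
hourly = ts.groupby(ts.index.hour).sum()     # tweets per hour of day
print('peak hour:', hourly.idxmax())         # the talk found about 2 p.m.
```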
I also applied some text analysis. For the regions with the most geotweets, I collected the tweet texts and made word clouds to see the keywords of each region. One thing I found is a high percentage of retweets, because many of these keywords come from retweets. The keywords also represent some locality, like the names of the regions. I used the word cloud libraries for this. Also interesting is Pyeongtaek: a U.S. Army base is located there, so most of the tweets are in English, and one famous word there is, well, "shit".
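A minimal sketch of the word cloud step with the wordcloud library (the abstract also mentions PyTagCloud); the tweet texts are placeholders, and Korean text would additionally need font_path set to a Korean font.

```python
# Minimal sketch: render a word cloud image from region tweet texts.
from wordcloud import WordCloud

texts = ['great coffee in Hongdae', 'club night', 'retweet this']  # placeholder
wc = WordCloud(width=800, height=400,
               background_color='white').generate(' '.join(texts))
wc.to_file('region_wordcloud.png')           # open the image or put on a slide
```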
Anyway, now the problems. Until now I have been doing exploratory statistical analysis, but it is very repetitive work: every time I have to select a subset from my dataset, use different variables, and make a map, case by case, so it takes a lot of time manually. Another problem is that, as time goes by, the data gets bigger and bigger. I have been managing it as text files, but at this point I cannot anymore: I have six months of collected data, over one gigabyte of text files, and I cannot handle that. And another one: I have been using two different systems, a Linux server for the data collection and a Windows desktop for the analysis, and there are some portability issues there. So I am trying to use Python and FOSS4G to integrate this analysis environment, to handle the large amount of data, and to make an automated process for this analysis.
This is the new architecture I am working on. Here is the social media server, and I use SpatiaLite as the database on the data server, with Python libraries in between. There are two clients: a visualization client and an analysis client. For visualization, I am now trying to use Quantum GIS (QGIS) and some word cloud and tag cloud tools. On the analysis client there is a natural language processing library for the sentiment analysis, some statistical analysis tools for Python, and Pandas for the data analysis; instead of Excel, I think I can use Pandas.
This shows the analysis process. There is social media data; if it has a geotag, I convert it into the GIS database. The data has qualitative and quantitative aspects. For the qualitative data, I do some text mining and apply sentiment analysis. For the quantitative aspect, I can apply spatial analysis, creating heat maps or thematic maps, and using spatial statistical analysis I can make hotspot or choropleth maps. This one shows the SpatiaLite database. The reason I am using SpatiaLite is that it is a standalone, file-based database.
So it is very easy to handle. In the beginning I tried using MongoDB, but I found some problem connecting it with QGIS, and nowadays the website of that connector does not work, so I could not solve the problem and looked for an alternative. But I think this one is much better, because it is just file-based; in my case I do not need a multi-user platform, so until now it has been fine. It shows very high compatibility and portability, it is easy to use, and it has good GUI support. A sketch of writing to SpatiaLite from Python follows below.
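A minimal sketch of writing geotweets into SpatiaLite from Python through sqlite3 with the mod_spatialite extension; the extension module name, table layout, and sample row are assumptions.

```python
# Minimal sketch: store geotweets as points in a SpatiaLite file.
# Run the metadata/table setup only once on a fresh database.
import sqlite3

conn = sqlite3.connect('geotweets.sqlite')
conn.enable_load_extension(True)
conn.load_extension('mod_spatialite')        # SpatiaLite extension module

cur = conn.cursor()
cur.execute("SELECT InitSpatialMetaData(1)")  # set up spatial metadata
cur.execute("CREATE TABLE tweets (id TEXT, body TEXT)")
cur.execute("SELECT AddGeometryColumn('tweets', 'geom', 4326, 'POINT', 'XY')")

cur.execute("INSERT INTO tweets (id, body, geom) "
            "VALUES (?, ?, GeomFromText(?, 4326))",
            ('42', 'hello Seoul', 'POINT(126.9780 37.5665)'))
conn.commit()
conn.close()                                  # the file loads directly in QGIS
```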
For the sentiment analysis, I am now using NLTK. If we put in the Twitter text, it converts it into three variables: a positive number, a negative number, and a neutral number. These variables can be used for sentiment mapping of the geotweets; a sketch follows below.
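The talk does not name the exact NLTK tool; one NLTK component that returns exactly such positive/negative/neutral scores is the VADER analyzer, so here is a minimal sketch with it (English only, as noted later in the talk).

```python
# Minimal sketch: positive/negative/neutral scores with NLTK's VADER.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')               # one-time lexicon download
sia = SentimentIntensityAnalyzer()

scores = sia.polarity_scores("I love this sunny day in Seoul!")
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```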
Here is just one case: a heat map made with QGIS. Last July I collected geotweets with sentiment values for one month. This area shows high positive values and this one low positive values; using the positive number as a weight, this heat map shows the most positive regions in the sentiment analysis results. Here I found certain clusters: Jongno, Hongdae, and Yongsan. I extracted the tweet texts from each region and made word clouds of what they contain. Jongno is a little different, with words like hiring and job, because it is the central business district. Hongdae is a very popular place for the young generation, so there are clubs and so on. Each one is a little different, like food and cafes over there. And this one shows the most positive tweets collected; as you can see, there are words like happy, best, and good. Actually, the NLTK library so far supports only English, so I cannot convert Korean tweets into sentiment values, but building on these cases I am now trying to do sentiment analysis on more Korean tweets.
So, in conclusion: as I mentioned, the analysis of geosocial data is a very complex and multidisciplinary process. In this talk I tried to present an integrated architecture using Python and FOSS4G. As future work, I am not done yet: I plan some advanced statistics with Python, and I am working on making automated Python scripts. Thank you.
Thank you. I have two questions. My first question is: you know, the data from social media is a bit messy, and most of the data is produced, I mean created, by a small number of users; one person can generate many tweets. In the architectural framework you displayed earlier, I did not see a data-cleaning part for this; are you considering that? That is my first question. And my second question is: for tweets, are you using the geolocation of the cell phones or smart devices being used? If so, people may be tweeting about an event in a different location than where they are.
So how do you address that? Oh, actually, I did not think about those two things, but I think that is also a feature of geotweet users. Here is an example, actually: each of these regions is among the regions with the most geotweets in Korea, but in this region I think one guy produces 70 or 80% of the tweets. Nowadays many researchers try to study this kind of heavy-user feature of tweets; that is another topic for research, I think. But to do better research, I think we have to collect as much data as possible; the cases are very different, in some regions there are many users, in some just a few heavy users, and these things never show otherwise. And the second one was, what was the second? Yes: people might be tweeting about one region while they are in another region, as desktop users or smartphone users. There is one column in the data, the location in the user's profile, as well as the tweet content, so I know the data is there, but I still need to test it with more research. Any other questions? I have a question for you, is this working? A lot of times with Twitter there is free text and things like that, and this is sort of outside of geography.
But there are also a lot of hashtags, special characters, emojis, things like that. Were you able to process or map any of those in your sentiment analysis? I am not done yet. But as I mentioned with the last tweets, there are things like this one, yes. Like the smiley face; does that map to a sentiment value? Some qualitative content analysts have tried to use these, studying them one by one, manually, not automatically. But as I mentioned, NLTK in Python uses a library to get the sentiment value; if we want to use this kind of symbol, I think we have to build our own additions to get the sentiment. Okay. Are there any other questions?