Empowering social scientists with web mining tools
Formal Metadata
Title: Empowering social scientists with web mining tools
Series: FOSDEM 2020 (talk 136 of 490)
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifier: 10.5446/46919 (DOI)
Transcript (English, auto-generated)
00:06
Okay, hello everyone. So I'm really glad today to be in this new room, which is all about open science and tools and technologies. I'm here to speak about empowering social scientists with web mining tools. We will see together what web mining is and how we can teach researchers
00:24
how to do so, and what tools we developed to help them achieve amazing tasks. So, hello everyone. I am Guillaume Plique, aka Yomguithereal on the internet — a mistake of youth. I am a research engineer for a research laboratory in
00:44
France called Sciences Po médialab, but we will talk about that a bit more later. So, what is web mining? Who here knows about web mining? Okay, that's nice, then I can skip ahead. So what is web mining? Just a reminder for everyone.
01:03
So I will only talk about web mining as a tool to collect data from the web, and afterwards how we are going to analyze this data and produce insights from it. So basically, from a technical point of view, web mining is actually two or three things. The first thing being scraping. So what is scraping? Scraping is the act of reverse-engineering the
01:25
HTML of a web page to extract back the data that produced the HTML page. So for instance, here you have an example, a page from the EchoJS website, which is a news aggregator for JavaScript, basically. Scraping would be to open your inspector, check how the HTML has been written to
01:46
display this visual page, and try to extract from the HTML the data we are interested in. So for instance, here it would be the title of each shared article and the link to the article, and so on.
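To make the idea concrete, here is a minimal sketch of that reverse-engineering step in Python, using only the standard library. The HTML snippet below is a made-up stand-in for a news-listing page (EchoJS's real markup differs); only the technique — walking the markup to pull titles and links back out — is the point.

```python
# Sketch: extract (title, url) pairs from a hypothetical news-listing page.
from html.parser import HTMLParser

SNIPPET = """
<article class="news">
  <h2><a href="https://example.com/post-1">First shared article</a></h2>
</article>
<article class="news">
  <h2><a href="https://example.com/post-2">Second shared article</a></h2>
</article>
"""

class NewsExtractor(HTMLParser):
    """Collects (title, url) pairs from <a> tags found inside <h2> headings."""
    def __init__(self):
        super().__init__()
        self.in_h2 = False          # are we inside a heading?
        self.in_heading_link = False  # are we inside a link in a heading?
        self.current_url = None
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
        elif tag == "a" and self.in_h2:
            self.in_heading_link = True
            self.current_url = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False
        elif tag == "a":
            self.in_heading_link = False

    def handle_data(self, data):
        # Text nodes inside a heading link are the article titles.
        if self.in_heading_link and data.strip():
            self.items.append((data.strip(), self.current_url))

parser = NewsExtractor()
parser.feed(SNIPPET)
```

In practice a scraper would use CSS selectors or XPath over fetched HTML rather than a hand-rolled parser, but the logic is the same: the page's structure is the implicit schema you reverse-engineer.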
02:00
So this is the first thing: scraping — extracting data from web pages by reverse-engineering them. The second thing web mining is, is actually crawling. Crawling is a bit different: here we are going to design a bot, a spider, a program which is going to browse the web automatically and slowly compose a network of pages, of sites, etc.
02:26
And we are interested in two things: what the actual content on those pages is, and what network is drawn by this whole navigation of the web. So: scraping, crawling, and the third thing is collecting data from APIs.
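The two outputs just mentioned — page content and the link network — fall naturally out of a breadth-first traversal. This is a toy sketch, not any particular crawler's implementation: the `fetch_links` function is injected so the example stays offline, where a real crawler would fetch each URL over HTTP and parse its links.

```python
# Toy breadth-first crawler: records visited pages and the link network.
from collections import deque

def crawl(start_url, fetch_links, max_pages=100):
    """Breadth-first crawl; returns the visited pages and the edges traversed."""
    visited, edges = [], []
    seen = {start_url}
    frontier = deque([start_url])
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)          # "content" side: the pages we saw
        for target in fetch_links(url):
            edges.append((url, target))  # "network" side: one edge of the web graph
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return visited, edges

# A fake three-page web standing in for real HTTP fetches.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": [],
}
pages, network = crawl("a", lambda url: FAKE_WEB.get(url, []))
```

The `edges` list is exactly the graph that tools like Gephi can later lay out to study the structure of a web corpus.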
02:44
So nowadays, for instance, Facebook or Twitter or LinkedIn share some data with you, and we can leverage their APIs to collect data and gain some insights. So this is it: web mining, for the purpose of this talk, will be scraping, crawling, and APIs.
03:05
So the question is why this is useful to social sciences. I'm putting "social" in brackets because basically it could be useful to any science, I guess — physics or chemistry and so on. But since I am working for social scientists, I will speak from the point of view of the social sciences.
03:22
So why is it useful to collect data on the web? The bad take on this goes: every social science data collection is biased. If you do questionnaires, for instance, or interviews, you get biased data, mostly due to what we call the observer's paradox — when you ask people something, their answers are biased, because you are asking them the thing and you are in the room observing them.
03:48
The thing which is really interesting with the internet is that people express themselves without being asked to. They are just going to express their opinion while nobody is observing. Well — except that I am observing right now.
04:01
And so it's less biased; web mining would be a superior source of data for social sciences because it's not biased. This is the bad take. The good take is that internet data comes with its own biases. For instance, if you collect data on Google Trends, you will of course find other biases, and you should be aware of them.
04:21
And to be able to control and manage those new biases, you have to apply meta-studies and science and technology studies, which is a large field of the social sciences studying those issues. So the conclusion, the good take, is that web mining is still another very interesting and very large data source.
04:41
So why not collect it? We should, because it's a good thing and we can. The issue here is that web mining is hard. To be able to perform web mining tasks — to scrape, to crawl — you need to know the web. And when I say the web, I mean the whole web. You need to know how
05:01
DNS works, HTTP works, HTML works, CSS works, JS, the DOM, Ajax, SSR, CSR, XPath and so on. You've got a lot of things to know and learn about the web to be able to reverse-engineer it. So how do you teach researchers — for instance, social scientists — those web technologies?
05:23
Basically the same as everyone else: you could teach them CSS and HTML and so on and try to empower them through this teaching. But while many consider the web an easy layer of technologies — there is a misconception that the web is actually really easy — it's really not.
05:43
And we are really standing on the shoulders of giants. Has anyone here already tried to teach someone who is new to web technologies how the web actually works? Did someone do this job? Okay. Usually when you do that, you notice that you are standing on a huge mountain of skills, which is actually really daunting.
06:05
So it's not really easy to teach people about web technologies. Another question is how to teach researchers how to scrape, for instance. Say they know about web technologies, they know a bit of JavaScript and Python — how can we empower them and teach them how to scrape?
06:23
And then you also have other issues which are a bit different, such as fighting the platforms and their APIs: platforms will try to prevent you from scraping and crawling. You've got legal issues in some countries. In some countries, for example Denmark, teachers avoid teaching people scraping because it's considered something like lock-picking.
06:43
It's considered a bit illegal, or gray. And you have to be evasive when you publish something using scraping, because sometimes you have to say: oh no, I did not scrape, I had a monkey army clicking on the button really fast. So you have a lot of hoops to jump through.
07:00
And what's more — and this is something I really want to stress today — "Jupyterizing" researchers is not a solution. Sometimes we say: okay, we are going to empower researchers, we are going to teach them everything they need to know; they are going to learn Python, Jupyter, web technologies, and they are going to scrape by themselves. It sounds like a good solution, but it's not really applicable to the real world.
07:22
In the social sciences especially, some researchers don't have the time nor the will to learn all those skills. And as a community, we should be okay with that. It's okay — researchers don't have to learn these skills — and the question then is how we are going to empower them all the same.
07:43
And what's more — this is the second point against the Jupyterization of researchers — web mining is actually really, really, really hard. It really is a craftsmanship. Basically, web mining is a job, not a skill. The internet, for instance, is a dirty, dirty, dirty place. You've got conventions:
08:05
you are supposed to code a website correctly, cleanly, but basically everything is really badly implemented. So browsers today are really heuristical wonders: they have a lot of routines and programs to make sure that the messy web page you send will still be read by the browser correctly.
08:24
You have to know all of those things when you want to do web mining. What's more, you need to know about things which are considered advanced in computing, such as how to multi-thread a program, how to parallelize things, how to throttle your HTTP requests.
08:45
And if you don't know how to do that, you will harm actual people. For instance, at the beginning of our journey we did not know how to throttle HTTP requests, so we basically cut off our whole university's access to Google. Which is a bit problematic. Not too much.
09:02
And you need to know about all those kinds of things, which are really complicated, if you want to actually perform web mining. You need a lot of skills. So what I mean here is that it really is a craftsmanship, it really is a job. And you can't expect people to be both researchers and web miners.
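The university-Google incident above is exactly what throttling prevents: spacing requests out so a target host never sees an uncontrolled burst. Here is a minimal per-host throttle sketch — not Minet's actual implementation, just the idea, with an artificially small delay so it runs quickly.

```python
# Minimal per-host throttle: guarantee a delay between requests to one host.
import time
from collections import defaultdict

class Throttler:
    """Ensures at least `delay` seconds elapse between calls for the same host."""
    def __init__(self, delay=0.2):
        self.delay = delay
        # -inf means "never called", so the first request is never delayed.
        self.last_call = defaultdict(lambda: float("-inf"))

    def wait(self, host):
        elapsed = time.monotonic() - self.last_call[host]
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)  # back off before hitting the host
        self.last_call[host] = time.monotonic()

throttler = Throttler(delay=0.05)
start = time.monotonic()
for _ in range(3):
    throttler.wait("google.com")  # a real crawler would issue the request here
spent = time.monotonic() - start  # first call is free, the next two each wait
```

Real crawlers combine this with retries, timeouts, and randomized delays, but even this much would have kept a whole university from being banned from Google.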
09:23
So the question then is how we are going to empower researchers all the same. And the answer is: by designing tools suited to their research questions. So we need to have designers. Who is a designer in this room? Him. Yeah. So we need more designers.
09:42
So how did we do that? I work for a laboratory called Sciences Po médialab, and the seminal idea of the lab was to gather three kinds of people: social science researchers, designers such as this guy, and engineers such as me. We mix those people, and we design tools suited to the researchers' questions and work.
10:07
This is basically it. So what I propose here is to guide you through some of the tools we designed to be able to empower, really empower social scientists to perform web mining tasks.
10:25
And the first one we made is called artoo.js — it's a bad pun on R2-D2. The idea started from the following observation: if you know about modern web technologies, you will quickly encounter something
10:42
called dynamic rendering, which means that the page is not rendered on the server; it's rendered on the client using JavaScript and so on. It's really complicated, and if you want to emulate this to be able to scrape, it's kind of difficult. So the idea was to parasite the web browser itself to perform some web mining tasks.
11:02
I know it's a bit abstract, so I'm going to try a small demo time so everything will break now. This is how it works. So for instance, let's say you have a researcher who wants to scrape this web page, get the whole list as a CSV table. So you're going to go to the page and then you are going to inject some
11:22
parasite code to help you scrape the data and provide the researcher with it. So I use a bookmarklet, called artoo, which is loaded directly into the web page's context. And artoo is going to help me do some stuff.
11:41
First, it can do some sound, which is its most interesting feature. Then I will be able to use something like old school jQuery and using CSS stuff, really basic stuff, I will be able to scrape data.
12:04
So here I'm just attempting to scrape the data from the website, but directly within the web page's JavaScript context. And when I have that, I'm also able to help the researchers by doing this thing.
12:26
Ah yes, it doesn't work because... I don't know. Sorry, it's a bad live-coding situation.
12:46
Yeah, yeah. And so now I have the data — well, that part does not work — but basically I've scraped the thing as a CSV file and I'm now able to provide it to the researchers.
13:03
The main point here is that it's still code. But you can use this same code to generate bookmarklets — custom bookmarklets — for the researchers. It means I go to this kind of interface, paste my code here, and create something which is actually a bookmark for the researcher. He just has to copy it into his web browser, and then he only has to go to the page and click on the button, and it will download the CSV for him.
13:27
For this kind of scenario, we have researchers who do really qualitative research on websites and just want to pick some lists and aggregate them. And so we use this tool, artoo.js, to provide them with these kinds of ad hoc, tailored bookmarklets.
13:45
So this is the first thing, artoo.js. The next question is: can we do something a bit more hefty?
14:06
So we created something which is now called Minet. What is the goal of Minet? The goal of Minet is to provide you with a command line tool which handles all the pesky details of web mining for you. Basically, all the things in bold are the things you are going to focus on and actually work on.
14:26
Everything around them is what I handle for you, so you don't have to: multi-threading, multi-processing and so on, I do that for you. You just focus on the task, which is: I want to get, for instance, one million pages from the web and then extract the actual content and data from them.
14:44
Do we have time for a demo? No. Maybe so. So basically, Minet is a CLI tool. It's a Unix tool which complies with the Unix philosophy: it does one thing, only one thing, well — and this thing is web mining, which is a big, hefty thing, but it does it well.
15:04
And you can pipe commands and so on, and it looks like this, for instance. Basically you are going to use the command line tool to, for instance, fetch a lot of pages from the web, then scrape massively, in a parallelized fashion, from a lot of web pages.
15:22
You can also extract raw content from articles so you can do NLP stuff on it afterwards, and so on. So it's really a Swiss Army knife for web mining, which is scalable and which is lo-fi: it doesn't run on any database, it works on CSV files, it pipes to stdout and works with Unix pipes — it's really, yeah, lo-fi.
15:44
And what's more — and this is really important for us, and true for both artoo and Minet — it relocalizes data collection onto the researchers' own computers. Sometimes you need a server, but sometimes you don't. And it's important for the researcher to be able to control his or her data by
16:02
being able to do this stuff on his or her own computer, so they are really in control. And basically, in the social sciences we rarely do what we would call big data™ — we don't do that stuff. Everything stands and fits on a single computer.
16:21
And what's more, if you really want to do some Jupyter stuff — and that's alright — you have a programmatic API which hides all the complexity for you. If you want to do something like: okay, I want to fetch one million pages from the internet and I want to do it right, you can just write a simple for loop and all this stuff is handled for you.
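What such a "just write a for loop" API hides underneath is roughly a thread pool yielding results as they complete. The sketch below shows that shape using only the standard library — Minet's real programmatic API differs, and `fetch_one` here is a stand-in for an actual throttled HTTP call.

```python
# Sketch of a multithreaded fetch loop: submit urls to a pool, yield results
# as they finish, so the caller can consume them with a plain for loop.
from concurrent.futures import ThreadPoolExecutor, as_completed

def multithreaded_fetch(urls, fetch_one, threads=25):
    """Yield (url, result) pairs, fetching up to `threads` urls concurrently."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {pool.submit(fetch_one, url): url for url in urls}
        for future in as_completed(futures):
            yield futures[future], future.result()

# Stand-in for downloading a page body over HTTP.
results = dict(multithreaded_fetch(
    ["u1", "u2", "u3"],
    lambda url: f"<html>{url}</html>",
    threads=2,
))
```

A production version adds per-host throttling, retries, and streaming results to disk as CSV — which is exactly the bookkeeping the researcher should never have to see.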
16:41
So we have seen how to enable researchers to scrape and to collect data from APIs. The question is: what's the next step? Can we design something a bit more ambitious, something more like a GUI? We actually can. For instance, in the lab we are developing a tool called Hyphe, which is a web
17:02
crawler with a dedicated interface that enables researchers to crawl the web — crawl a subset of the web — and make sense of it without any kind of technical knowledge. So how do we enable researchers to crawl the web? Using this tool, Hyphe. It looks like this, for instance: you have an interface, everything is push-button, you input
17:25
things using the keyboard, you don't have to know how to code, you don't have to know how to program, and you are still able to like crawl millions of web pages and be able to like construct, build a corpus which is a subset of the web on which you are actually working on.
17:43
And so, finally, we used design — actual designers — to serve a robust methodology which has been proven to work in many sociological works. And we designed the tool to embody this methodology.
18:02
And I just want to emphasize one last time that this is really non-trivial. Has anyone here already tried to build a crawler and crawl the web — to build and program the spider? Is it really easy? No. You have to build some things yourself; for instance, for this particular tool we
18:23
had to build our own indexing database to be able to index multi-layered graphs. There is a talk which was given here two years ago, if you want to check it out — it's called the Traph, basically. That's the job. So, as a conclusion, my main point here is that researchers should not
18:46
be expected to learn all the ins and outs of web mining and programming. And when you design tools suited to their needs, there is always a tradeoff between scalability — how much data they can handle and fetch — and usability — how easy the tool is to use.
19:05
So we need to design a user path, and to do so, we need to take a step back on what we are doing and take the time to abstract our design path. I hope that's what we are doing right now. So what's next? We would like to build a GUI for Minet, so researchers are able to use it without needing to learn the Unix command line.
19:29
And so if anyone is up for it, we need people. We are recruiting also. And thank you for listening.
19:48
Questions? Questions. Yes. With what?
20:00
Whether we respect robots.txt? I will tell you officially that I do — but I do not. No, basically for us it's not an ethical question, it's more of a technical question, because it's really heavy for us to fetch it and we don't handle that very well, so we don't. But we could. We could do it.
20:22
Do we use Scrapy? Yes, we use Scrapy in Hyphe, for instance. But I think the version we use does not respect robots.txt. Yes.
20:53
We try to use those tools with researchers a lot. So those are basically tools that scrape automatically by learning what you are trying to extract.
21:02
It did not work very well. We also try to design our own tools but failed miserably basically. And it's something we are still interested in, but we haven't found the thing which really works yet with our researchers.
21:26
Are there researchers actually working with these tools? Yeah, sure. For instance, right now we are working... Repeat the question? Yeah, so the question is: is there an example of an actual research project or question we are using web mining for?
21:43
And currently we are working on a project which aims at studying how people in France share and read the media. We are using web mining in multiple fashions: for instance, we collect the text of all articles from 400 media outlets, and we collect all the tweets mentioning those URLs.
22:03
We collect all the Facebook posts mentioning those URLs, and much more — YouTube videos from those media outlets and so on. So we collect a really large amount of data to be able to see whether the media polarize around political questions or not, etc.
22:20
Yes. To avoid what? Okay. So for Facebook there is a trick which is actually interesting, which is that
22:44
they still need to serve a mobile version which is heavily used in India, and they can't rate-limit it too aggressively, because otherwise they would block those users. So you can hit it like a madman and it still works. It's a good option, but the fact is they won't serve you the actual data.
23:03
Sometimes you get relative dates, and you need to parse those relative dates to be able to date things. But it's usually a good solution for scraping Facebook massively, for instance. So we found workarounds. And when we don't find workarounds, we use proxy meshes, which let us hit from multiple addresses quickly.
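The relative-date problem mentioned here — pages serving "3 hrs ago" instead of a timestamp — means scraped dates must be resolved against the collection time. A minimal sketch, handling only a few hypothetical English formats (real pages vary by locale and wording):

```python
# Sketch: turn a relative date string like "3 hrs ago" into an absolute
# datetime, anchored at the moment of collection.
import re
from datetime import datetime, timedelta

# Hypothetical unit spellings; a real scraper needs many more, per locale.
UNITS = {"min": "minutes", "mins": "minutes",
         "hr": "hours", "hrs": "hours",
         "day": "days", "days": "days"}

def resolve_relative_date(text, now):
    """Resolve '<n> <unit> ago' against `now`; return None if unparseable."""
    match = re.match(r"(\d+)\s+(\w+)\s+ago", text.strip())
    if not match:
        return None
    amount, unit = int(match.group(1)), match.group(2)
    if unit not in UNITS:
        return None
    return now - timedelta(**{UNITS[unit]: amount})

# `now` would normally be the scrape timestamp recorded with each page.
now = datetime(2020, 2, 1, 12, 0)
resolved = resolve_relative_date("3 hrs ago", now)
```

The key design point is storing the scrape timestamp alongside each record: without it, relative dates collected yesterday are unrecoverable today.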
23:27
Not yet. It will soon because we need to. But if you want you can help us. Any other question? Thanks.