
Empowering social scientists with web mining tools


Formal Metadata

Title
Empowering social scientists with web mining tools
Subtitle
Why and how to enable researchers to perform complex web mining tasks
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Web mining, as represented mostly by scraping and crawling practices, is not a straightforward task and requires a variety of skills related to web technologies. However, web mining can be incredibly useful to the social sciences, since it enables researchers to tap into a formidable source of information about society. But researchers may not be able to invest copious amounts of time into learning web technologies inside and out, and they usually rely on engineers to collect data from the web. The object of this talk is to explain how Sciences Po's médialab designed and developed tools to empower researchers and enable them to perform web mining tasks to answer their research questions. Here are examples of issues we will tackle during this talk:
- How life in a social sciences laboratory can be a very fruitful context for tool R&D regarding web mining
- How to create performant and effective web mining tools that anyone can use (multithreading, parallelism, JS execution, complex spiders, etc.)
- How to re-localize data collection: researchers should be able to conduct their own collections without being dependent on external servers or resources
- How to teach researchers the necessary skills: HTML, the DOM, CSS selection, etc.
Examples will be taken mainly from the minet CLI tool and the artoo.js bookmarklet.
Speaker: Guillaume Plique is a research engineer working for Sciences Po's médialab. He assists social sciences researchers daily with their methods and maintains a variety of FOSS tools geared toward the social sciences community and also developers.
Transcript: English (auto-generated)
Okay, hello everyone. I'm really glad to be in this new room today, which is all about open science, tools and technologies. I'm here to speak about empowering social scientists with web mining tools. We will see together what web mining is, how we can teach researchers to do it, and what tools we developed to help them achieve amazing tasks. So, hello everyone. I am Guillaume Plique, aka Yomguithereal on the internet (a youthful mistake), and I am a research engineer for a research laboratory in France called the Sciences Po médialab. But we will talk about that a bit more later. So, what is web mining? Who here knows about web mining? Okay, that's nice, so I can keep this short. What is web mining? Just a reminder for everyone.
I will only talk about web mining as a way to collect data from the web; how we then analyze this data and produce insights from it is another topic. Basically, from a technical point of view, web mining is two or three things. The first thing is scraping. What is scraping? Scraping is the act of reverse-engineering the HTML of a web page to extract back the data that produced the page. For instance, here you have an example, a page from the EchoJS website, which is basically a news site for JavaScript. Scraping would be to open your inspector, check how the HTML has been written to display this visual page, and try to extract from the HTML the data we are interested in. Here, for instance, it would be the title of each shared article, the link to the article, and so on.
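To make this concrete, here is a minimal sketch of such a scrape in Python (the talk itself does not show this code). The URL is real but the CSS selector is a guess: you would find the actual one with the inspector, exactly as described above.

```python
# Minimal scraping sketch with requests + BeautifulSoup.
# The selector is hypothetical: open the inspector on the real page
# to find out how the HTML is actually written.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.echojs.com/").text
soup = BeautifulSoup(html, "html.parser")

for link in soup.select("article h2 a"):  # hypothetical selector
    print(link.get_text(strip=True), link.get("href"))
```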
So this is the first thing: scraping, i.e. extracting data from web pages by reverse-engineering them. The second thing web mining is, is crawling. Crawling is a bit different: here we design a bot, a spider, a program which is going to browse the web automatically and slowly compose a network of pages, of sites, etc. And we are interested in two things: what is the actual content of those pages, and what is the network drawn by this whole navigation of the web. So: scraping, crawling, and the third thing is collecting data from APIs. Nowadays, for instance, Facebook or Twitter or LinkedIn share some data with you, and we can leverage their APIs to collect data and then gain some insights. So this is it: web mining, for the purpose of this talk, will be scraping, crawling and APIs.
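As a companion sketch (again, not from the talk), here is what a deliberately naive crawler looks like in Python: it browses pages automatically and records the network of links it draws, with none of the politeness (throttling, robots.txt) a real spider needs.

```python
# Naive breadth-first crawler sketch: collects the link network drawn
# by browsing. A real spider adds throttling, robots.txt handling,
# URL normalization, deduplication and error handling.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=20):
    seen, queue, edges = {start_url}, deque([start_url]), []
    while queue:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            target = urljoin(url, a["href"])
            edges.append((url, target))  # one edge of the network
            if target not in seen and len(seen) < max_pages:
                seen.add(target)
                queue.append(target)
    return edges
```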
So the question is: why is this useful to the social sciences? I'm putting "social" into brackets because it could basically be useful to any science, I guess: physics, chemistry and so on. But since I work for social scientists, I will speak from the point of view of the social sciences. So why is it useful to collect data from the web? The bad take on this goes: every social science data collection is biased. If you do questionnaires, or if you do interviews, you get biased data, mostly because of what we call the observer's paradox: when you ask people something, their answers are biased because you are asking them the question and you are in the room observing them. The really interesting thing about the internet is that people express themselves without being asked to. They just go and express their opinions, and nobody is observing them (well, except that I am observing them right now). So it's less biased, and web mining would therefore be a superior source of data for the social sciences because it's not biased. That's the bad take. The good take is that internet data comes with its own biases. If you collect data from Google Trends, for instance, you will of course find other biases, and you should be aware of them. To control and manage those new biases, you have to rely on meta-studies and on science and technology studies, a large field of the social sciences that studies exactly those issues. So the conclusion, the good take, is that web mining is still another very interesting and very large data source.
So why not collect it? We should, because it's a good thing and we can. The issue here is that web mining is hard. To be able to perform web mining tasks, to scrape and to crawl, you need to know the web. And when I say the web, I mean the whole web: you need to know how DNS works, how HTTP works, HTML, CSS, JS, the DOM, Ajax, SSR, CSR, XPath and so on. You have a lot of things to know and learn about the web to be able to reverse-engineer it. So how do you teach researchers, for instance social scientists, those web technologies?
Basically, the same way as everyone else: you could teach them CSS and HTML and so on and try to empower them through this teaching. But there is a common misconception that the web is an easy layer of technologies. It's really not, and we are really standing on the shoulders of giants. Has anyone here already tried to teach someone new to web technologies how the web actually works? Does someone do this job? Okay. Usually when you do that, you notice that you are standing on a huge mountain of layers, which is actually really daunting. So it's not easy to teach people about web technologies. Another question is how to teach researchers to scrape, for instance: they know a bit about web technologies, they know a bit of JavaScript and Python, so how can we empower them and teach them how to scrape?
Then you also have other issues, which are a bit different. For instance, you are fighting the platforms and their APIs: platforms will try to prevent you from scraping and crawling. You've got legal issues in some countries. In some countries, Denmark for example, teachers avoid teaching scraping because it's considered something like lock picking; it's seen as a bit illegal, or at least gray. And you have to wiggle around the topic when you publish something using scraping, because sometimes you have to say: oh no, I did not scrape, I had a monkey army clicking on the buttons really fast. So you have a lot of hoops to jump through.
What's more, and this is something I really want to stress today, "Jupyterizing" researchers is not a solution. Sometimes we say: okay, we are going to empower researchers, we are going to teach them everything they need to know; they will learn Python, Jupyter, web technologies, and they will scrape by themselves. This sounds like a really good solution, but it's not really applicable to the real world. In the social sciences especially, some researchers have neither the time nor the will to learn all those skills. And we, as a community, should be okay with that. It's okay: researchers don't have to learn these skills, and the question then is how we are going to empower them all the same.
The second point against the Jupyterization of researchers is that web mining is actually really, really hard. It really is a craftsmanship. Basically, web mining is a job, not a skill. The internet, for instance, is a dirty, dirty place. You've got conventions: you are supposed to code a website correctly, cleanly, but in practice everything is really badly implemented. Browsers today are heuristical wonders: they run a lot of routines and programs to make sure that the really messy web page you send them will still be read correctly. You have to know all of those things when you want to do web mining. What's more, you need to know about things considered advanced in computing: how to multi-thread a program, how to parallelize things, how to throttle your HTTP requests.
And if you don't know how to do that, you will harm actual people. For instance, at the beginning of our journey we did not know how to throttle HTTP requests, so we basically cut off our whole university's access to Google. Which is a bit problematic. Not too much. You need to know all this kind of stuff, which is really complicated, if you want to actually perform web mining. You need a lot of skills. What I mean here is that it really is a craftsmanship, it really is a job, and you can't expect people to be both researchers and web miners.
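To illustrate just one of those pesky details, here is a minimal per-domain throttling sketch in Python (an editorial illustration, not the médialab's actual code): the kind of guard that would have avoided the Google incident above, and that tools like minet implement for you alongside multithreading and retries.

```python
# Per-domain throttling sketch: never hit the same host more than once
# every `delay` seconds. Illustrative only; real tools combine this
# with multithreading, retries and backoff.
import time
from urllib.parse import urlparse

import requests

last_hit = {}  # domain -> monotonic timestamp of the last request

def polite_get(url, delay=2.0):
    domain = urlparse(url).netloc
    wait = last_hit.get(domain, 0.0) + delay - time.monotonic()
    if wait > 0:
        time.sleep(wait)
    last_hit[domain] = time.monotonic()
    return requests.get(url, timeout=10)
```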
So the question remains: how are we going to empower researchers all the same? And the answer is: by designing tools suited to their research questions. So we need to have designers. Who is a designer in this room? Him. Yeah. We need more designers. So how did we do that? I work for a laboratory called the Sciences Po médialab, and the seminal idea of the lab was to gather three kinds of people: social science researchers, designers (such as this guy), and engineers (such as me), to mix those people and design tools suited to the researchers' questions and work. That's basically it. So what I propose here is to guide you through some of the tools we designed to really empower social scientists to perform web mining tasks.
The first one we made is called artoo.js; it's a bad pun on R2-D2. The idea started from the following observation: if you know about modern web technologies, you will quickly encounter something called dynamic rendering, which means that the page is not rendered on the server but on the client, using JavaScript and so on. It's really complicated, and if you want to emulate this in order to scrape, it's kind of difficult. So the idea was to parasitize the web browser itself to perform web mining tasks.
I know it's a bit abstract, so I'm going to try a small demo (so everything will break now). This is how it works. Let's say you have a researcher who wants to scrape this web page and get the whole list as a CSV table. You go to the page, and then you inject some parasitic code to help you scrape the data and provide the researcher with it. So I use a bookmarklet, called artoo, which is loaded directly into the web page's context, and artoo is going to help me do some things. First, it can play a sound, which is its most interesting feature. Then, using something like old-school jQuery and really basic CSS selection, I can scrape data. Here I'm just attempting to scrape the data from the website, directly within the web page's JavaScript context. And once I have that, I'm also able to help the researchers by doing this. ...Yes, it doesn't work, and I don't know why. Sorry, bad live-coding situation. Yeah, yeah. So now I have the data (that part did not work on stage): I've scraped the thing as a CSV file and I can now hand it to the researchers.
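For the record, the kind of snippet typed during such a demo looks roughly like this, in JavaScript, in the page's console once the artoo bookmarklet is loaded. The selector and field names are hypothetical; artoo.scrape and artoo.saveCsv are the library's documented helpers.

```javascript
// Scrape directly inside the page's JavaScript context with artoo.
// The selector and keys are hypothetical; adapt them to the page.
var data = artoo.scrape('ul.results li a', {
  title: 'text',        // text content of each matched element
  url: {attr: 'href'}   // its href attribute
});

artoo.saveCsv(data);    // triggers a CSV download for the researcher
```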
The main point here is that it's still code. But you can use this same code to generate custom bookmarklets for the researchers. It means that I go to this kind of interface, paste my code, and create what is actually a bookmark for the researcher. He just has to copy it into his web browser; then he only has to go to the page and click the button, and it will download the CSV for him. We have researchers who do really qualitative work on websites and just want to pick some lists and aggregate them; for this kind of scenario we use this tool, artoo.js, to provide them with ad hoc, tailored bookmarklets.
So this is the first thing: artoo.js. The next question is: can we go bigger? Which is a bit more hefty. So we created something which is now called minet. What is the goal of minet? The goal of minet is to provide you with a command-line tool that handles all the pesky details of web mining for you. Basically, all the things in bold on the slide are the things you are going to focus on and actually work on; everything around them is what the tool handles for you, so you don't have to: multi-threading, multi-processing and so on, minet will do that for you. You just focus on the task, which is, for instance: I want to fetch one million pages from the web and then extract the actual content and data from them.
Do we have time for a demo? No. Maybe. So basically, minet is a CLI tool, a Unix tool that complies with the Unix philosophy: it does one thing and does it well, and this thing is web mining, which is a big hefty thing, but it does it well. You can pipe commands and so on, and it looks like this, for instance. You use the command line to, say, fetch a lot of pages from the web, then scrape massively, in a parallelized fashion, from a lot of web pages. You can also extract raw text content from articles so you can do NLP stuff on it afterwards, and so on. It's a real Swiss Army knife for web mining, one which is scalable and lo-fi: it doesn't run on any database, it works on CSV files, it pipes to stdout, and so on. Really lo-fi.
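As a hedged example of such a pipeline (the fetch and extract subcommands exist in minet, but the exact arguments and file names here are made up and may differ between versions):

```sh
# Hypothetical minet pipeline: fetch a CSV column of URLs
# (multithreaded and throttled for you), then extract raw article
# text from the fetched pages. Check `minet --help` for real flags.
minet fetch url urls.csv > fetch_report.csv
minet extract fetch_report.csv > article_texts.csv
```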
What's more, and this is really important for us, and true for both artoo and minet: it re-localizes data collection onto the researchers' own computers. Sometimes you need a server, but sometimes you don't. It's important for researchers to be able to control their own data by doing this work on their own computers, so they are really in control. And basically, in the social sciences, we rarely do what people call "big data". We don't do this stuff; everything usually fits on a single computer.
What's more, if you really want to do some Jupyter stuff, and that's alright, you have a programmatic API which hides all the complexity for you. So if you want to do something like: okay, I want to fetch one million pages from the internet and I want to do it right, you can just write a simple for loop and all of this is handled for you.
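From the era of the talk, that loop looked roughly like this. The multithreaded_fetch helper and the result attributes are an assumption based on minet's README of the time; the current API may differ, so check the documentation.

```python
# Roughly the "simple for loop" mentioned above. Helper name and
# result attributes are assumptions from minet's README of the time.
from minet import multithreaded_fetch

urls = ["https://example.com/a", "https://example.com/b"]  # made-up URLs

for result in multithreaded_fetch(urls):
    if result.error is None:
        print(result.url, result.response.status)
```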
So we have seen how to enable researchers to scrape and to collect data from APIs. The question is: what's the next step? Can we design something a bit more ambitious, something more like a GUI? We actually can. In the lab we are developing a tool called Hyphe, a web crawler with a dedicated interface which enables researchers to crawl a subset of the web and make sense of it without any kind of technical knowledge. So how do we enable researchers to crawl the web? Using this tool, Hyphe. It looks like this, for instance: you have an interface, you push buttons and you input things using the keyboard. You don't have to know how to code, you don't have to know how to program, and you are still able to crawl millions of web pages and build a corpus, a subset of the web, on which you actually work.
And finally, we used design, actual designers, to serve a robust methodology which has been proven to work in many sociological works, and we designed the tool to embody this methodology. I also want to emphasize one last time that this is really non-trivial. Some people here have probably already tried to build a crawler, to program the spider that crawls the web. Is it really easy? No. You have to build some things yourself; for instance, for this particular tool we had to build our own indexing database to be able to index multi-layered graphs. There is a talk about it which was given here two years ago, if you want to check it out: the data structure is basically called the Traph. That's the job.
Yeah, so as a conclusion, my main point is that researchers should not be expected to learn all the ins and outs of web mining and programming. When you design tools suited to their needs, there is always a tradeoff between scalability, i.e. how much data they can handle and fetch, and usability, i.e. how easy the tool is to use. We need to design a user path, and to do so we need to take a step back from what we are doing and take the time to abstract our design path. I hope that's what we are doing right now. So what's next? We would like to build a GUI for minet, so researchers are able to use it without needing to learn the Unix command line.
And if anyone is up for it, we need people; we are also recruiting. Thank you for listening.
Questions? Yes. With what? The robots.txt, you mean? I will tell you officially that we respect it, but we do not. No, basically, for us it's not an ethical question, it's more of a technical one: it's really heavy for us to fetch it, and we don't know how to do that very well, so we don't. But we could. We could do it. Do we use Scrapy? Yes, we use Scrapy in Hyphe, for instance. But I think the version we use does not respect robots.txt. Yes.
We have tried to use those tools with researchers a lot. So, those are basically tools that scrape automatically by learning what you are trying to extract. It did not work very well. We also tried to design our own, but failed miserably, basically. It's something we are still interested in, but we haven't found the thing that really works with our researchers yet.
[Question from the audience, inaudible.] Yeah, sure. For instance, right now we are working... Repeat the question? Yes, so the question is: is there an example of an actual research project or question we are using web mining for? Currently we are working on a project which aims at studying how people in France share and read the media. We are using web mining in multiple fashions there. For instance, we collect all the article texts from 400 media outlets, we collect all the tweets mentioning those URLs, we collect all the Facebook posts mentioning those URLs, and many more things, such as YouTube videos from those media outlets and so on. So we collect a really large amount of data to be able to see whether the media polarize around political questions or not, etc.
Yes. [Question about scraping Facebook, inaudible.] Okay, so for Facebook there is an interesting trick: they still need to serve a mobile version of the site, which is heavily used in India, and there they cannot throttle aggressively, because otherwise they would block those users. So you can hit it like a madman and it still works. It's a good option, but the fact is they won't serve you all the actual data. Sometimes you get relative dates, for instance, and you need to parse those relative dates to be able to find things. But it's usually a good solution for scraping Facebook massively. So we found workarounds, and when we don't find workarounds we use proxy meshes, which are able to hit from multiple angles quickly.
Not yet. It will soon, because we need it. But if you want, you can help us. Any other question? Thanks.