Dive into Scrapy
Formal Metadata

Title: Dive into Scrapy
Series: EuroPython 2015, part 74 of 173
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers: 10.5446/20145 (DOI)
Production Place: Bilbao, Euskadi, Spain
Transcript: English (auto-generated)
00:04
Okay, good morning. Well, I'm a developer at Scrapinghub. I'm a Pythonista and a Django user since the early releases of Django. And I also like to, well, reverse engineer stuff, so let's check why we need
00:28
web scraping. Okay. First of all, we all love APIs. I mean, if every service offered an API, it would be really awesome, but people don't know
00:44
all the trade-offs that come with an API. For example, the service knows how you act, and the most interesting parts of those services are normally not available through the API.
01:03
I can think of a couple of examples. For example, if you're checking Google Places, they only offer five reviews for each business, and normally you want to get all the reviews for a single business. So, the only workaround is to use web scraping. There's also this semantic web term that
01:27
came up a couple of years ago; who of you knows what RDF is, or those semantic vocabularies? I mean, nobody uses them.
01:43
The web is really broken. There were some stats, from around five years ago, from Opera. They were checking how broken the web really is, and they found that the most popular tag was title, not body, so you can get a sense of how broken the web really is.
02:08
So what is web scraping? The main goal of web scraping is to get structured data out of unstructured sources, in this case web pages, and you may be asking, what are the kinds of things that we can
02:26
do with web scraping? Well, as the last bullet point says, your imagination is the limit, but the most common examples are price monitoring, lead generation, and aggregating information;
02:45
let's say I want to aggregate job postings, or any other kind of information. And, well, if we want to start with web scraping, we need to know HTTP. We need to speak HTTP.
03:00
Some people think that it is obvious, but it's not. Well, all of us know that there are some methods, like GET and POST, those are the typical ones, but there are way more methods than those. We also need to know the status codes, like 200, which is OK.
03:25
404 is Not Found, for example. 418 is the teapot code. The 500s are the server error codes, and who knows what code 999 is for?
03:41
Well, that's the code that Yahoo responds with when you get blocked by them. We also need to know how to deal with headers and the query string. For example, the Accept-Language header is quite interesting because it determines in which
04:03
language you receive the website. Also, the User-Agent is quite useful, not just for emulating a real browser; you can also try to emulate a mobile device.
04:21
On many occasions, it is way easier to scrape a mobile layout than a desktop one. We also need to know how to deal with proxies and with cookies. So, if we want to perform a request using Python, well, we just check the standard library; as we remember, Python is batteries included, so let's check the standard library,
04:43
and we find urllib2 there, but if you check the API, well, I don't recommend it to you unless you want to suffer. This kind of pain is what led to the Python requests library, which is like HTTP for humans.
05:04
It has a really clean API, and I recommend it; it is as easy to use as it looks. So, if you want to perform a request, well, it's one line, plus the import. You perform the request, and you get back a big chunk of string.
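As a rough sketch, with a placeholder URL, that one-liner looks like this, and extra headers such as User-Agent or Accept-Language can be passed in the same call:

    import requests

    # One line, plus the import: fetch a page and get its body back as text.
    response = requests.get("https://europython.eu")        # placeholder URL
    print(response.status_code)                             # e.g. 200
    html = response.text                                    # the big chunk of string

    # Headers like User-Agent or Accept-Language are just a keyword argument.
    response = requests.get(
        "https://europython.eu",
        headers={"User-Agent": "Mozilla/5.0", "Accept-Language": "en"},
    )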
05:24
So, how do you deal with that big chunk of data? Well, many people think about using regular expressions or string manipulation methods, but what they don't know is that HTML is not a regular language.
05:43
This is a really famous Stack Overflow answer. Basically, the last line is: have you tried using an XML parser instead? Okay, so what HTML parsers do we have available in the Python ecosystem?
06:03
Well, we have lxml. It's a really fast C library, and it's the de facto way to parse HTML, I think. And there's also Beautiful Soup. Beautiful Soup is not a parser.
06:22
It's just a wrapper around parsers, for example the html.parser from the standard library. It also offers a wrapper over lxml, and there is also html5lib, which works really well if you are trying to scrape really broken websites.
06:44
I recommend that one. Okay, so let's take a full example of how to perform a request and get some data. In this case, I want to get all the talks from this conference. So, I perform a request, parse it with lxml, and we just run some XPath queries to get the data.
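A reconstruction of that kind of snippet, not the exact code from the slide; the URL and the XPath are made up and depend on the real page structure:

    import requests
    import lxml.html

    # Fetch the conference programme page (hypothetical URL).
    response = requests.get("https://ep2015.europython.eu/en/events/sessions/")
    tree = lxml.html.fromstring(response.text)

    # Hypothetical XPath: the title text of every talk link on the page.
    for title in tree.xpath('//h3[@class="session-title"]/a/text()'):
        print(title.strip())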
07:09
It is really clean. It's really easy. I think everybody understands that piece of code. And one thing that I want to say is that many people don't learn XPath.
07:27
Well, if you want to be in the web scraping business, you need to learn XPath. Otherwise, you end up like the people who try to reinvent XPath.
07:41
I've seen so many, for example, Golang libraries lately that try to do some kind of homegrown XPath. It doesn't work. So, well, we have this piece of code, and let's say I want to perform two million requests to amazon.com or whatever site.
08:02
How does that piece of code scale? How do we test it, etc.? It's not that easy. So, you can say, okay, let's spawn some threads, or let's use gevent. I mean, it doesn't work, or it works for a little while until it gets painful.
08:28
So, what I recommend is to use Scrapy early on. It has a bit of a learning curve, but it's really worth it. For those of you who don't already know Scrapy, Shane has been telling you about it.
08:45
So, well, the creators of Scrapy are here in this room. Scrapy is an open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.
09:00
Of course, it's open source, and it has a really healthy community. And okay, let's get to it. Scrapy has an interactive console, the interactive shell. We can just launch it with scrapy shell and the URL that we want to check.
09:21
And it's a really good tool for checking some XPaths and doing some quick tests. But, well, it's also useful for debugging your spiders. It's as easy as running the command scrapy shell <URL>, and we land in the interactive shell.
09:42
So, we can play with some objects that are already populated. For example, we can check the response URL, we can run an XPath on the response, we can open the response in the browser, or even fetch any new website. So, if you are starting with Scrapy, I recommend you launch the Scrapy shell console and play with it.
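A quick sketch of such a session, with a placeholder URL:

    $ scrapy shell "https://ep2015.europython.eu"
    ...
    >>> response.url                                  # objects are already populated
    'https://ep2015.europython.eu'
    >>> response.xpath('//title/text()').extract()    # try out an XPath quickly
    [u'<page title>']
    >>> view(response)                                # open the response in your browser
    >>> fetch('http://example.com')                   # fetch any other page in the same shell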
10:11
So, let's start a Scrapy project. It's as easy as scrapy startproject plus the name of the project, and it creates a layout. This is very similar to what Django does.
10:22
And yeah, I mean, we got a lot of ideas from Django. It's a really good project. And what I really like about Django is that it offers a single way of doing things that works well every time. Scrapy enforces that too, and I really like it.
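For reference, this is roughly the layout that the command generates ("myproject" is just a placeholder name):

    $ scrapy startproject myproject
    myproject/
        scrapy.cfg            # deploy configuration
        myproject/
            __init__.py
            items.py          # item definitions
            pipelines.py      # item pipelines
            settings.py       # project settings
            spiders/          # your spiders live here
                __init__.py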
10:44
So, okay, let's check what a spider looks like. Not that kind of spider; it's more like this. Well, it is just a Python class with some attributes and, as you can see, a method. So, how do we write a spider? What's the anatomy of a spider?
11:03
Well, we have some mandatory attributes at the class level, like the name of the spider, the allowed domains and the start URLs. The start URLs are going to become the requests that are going to be performed. And that method, the parse method, is just a callback.
11:22
So, basically, the Scrapy engine performs a request to example.com/1.html, and we just get the response in that callback. In this case, we're just logging that we got a response.
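A minimal sketch of that first spider (example.com is the placeholder from the slide):

    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ["example.com"]
        start_urls = ["http://example.com/1.html"]

        def parse(self, response):
            # The engine downloads each start URL and calls back into parse().
            self.logger.info("Got a response from %s", response.url)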
11:41
Now let's check another example. In this case, we are not using start URLs but a function, a generator. We are generating three requests: example.com/1, /2 and /3. And we are pointing them to the parse method as the callback.
12:01
And we are extracting data from those pages in the callback. Basically, what we're doing is running an XPath, going through some h3 elements, and yielding items. We'll get to what items are later, but basically, they are like dictionaries.
12:24
They are a core data structure of Scrapy, and they're quite useful. Then we are going through the links and yielding requests. So, from the same callback, we can yield either items or requests.
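A sketch of that second spider, with made-up XPaths, where the requests come from a generator and the same callback yields both items (plain dicts here) and follow-up requests:

    import scrapy


    class FollowSpider(scrapy.Spider):
        name = "follow"
        allowed_domains = ["example.com"]

        def start_requests(self):
            # A generator for the initial requests: /1, /2 and /3.
            for i in range(1, 4):
                yield scrapy.Request("http://example.com/%d" % i, callback=self.parse)

        def parse(self, response):
            # Yield one item per h3 element...
            for title in response.xpath("//h3/text()").extract():
                yield {"title": title}
            # ...and follow every link on the page with the same callback.
            for href in response.xpath("//a/@href").extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse)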
12:41
It works out of the box. And, well, the same example: since we released Scrapy 1.0, we don't even need to use items. I will show what items are, but we can just yield dictionaries, normal Python dictionaries,
13:01
and it works as well. So, items. Items are just a class with some attributes, and they define what a structure looks like. They are pretty good for validating data; we have an item pipeline, and we can do plenty of stuff with them.
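A minimal item definition, along the lines of the talk-scraping example (the field names are made up):

    import scrapy


    class TalkItem(scrapy.Item):
        title = scrapy.Field()
        speaker = scrapy.Field()
        date = scrapy.Field()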
13:25
So, what kind of stuff? For example, we have the concept of item loaders. Basically, they populate items; it's a bit like an ORM. And we have input and output processors, so pre- and post-processors that are normal Python functions.
13:48
So, basically, imagine that we are scraping, let's say, a date from some website, and we want to format that date into, let's say, the ISO standard.
14:02
With item loaders, that is just a function that gets the date and transforms it. And then we have item exporters. Scrapy has built-in support for generating feed exports in multiple formats, like JSON, CSV and XML, and storing them in multiple backends.
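A sketch of an item loader with a processor that reformats a scraped date into ISO format, in the spirit of the example above; the assumed input format and the XPaths are made up, and TalkItem is the item sketched earlier:

    from datetime import datetime

    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import MapCompose, TakeFirst


    def to_iso(value):
        # e.g. "21/07/2015" -> "2015-07-21" (assumed incoming format)
        return datetime.strptime(value.strip(), "%d/%m/%Y").date().isoformat()


    class TalkLoader(ItemLoader):
        default_output_processor = TakeFirst()
        date_in = MapCompose(to_iso)        # input processor for the "date" field

    # Inside a spider callback it would be used roughly like this:
    #   loader = TalkLoader(item=TalkItem(), response=response)
    #   loader.add_xpath("title", "//h1/text()")
    #   loader.add_xpath("date", "//span[@class='date']/text()")
    #   yield loader.load_item()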
14:23
So, it's quite cool. We can just run any spider, and the result can go to an FTP server or to the local file system, out of the box. And for all of you who use Django, we have a thing called DjangoItem.
14:42
So, basically, we map an item to the definition of a Django model, and it just works. That's pretty useful. So, what happens under the hood? Well, let's check the architecture. Basically, we have a thing called the Scrapy engine that runs on top of Twisted.
15:06
And we have a thing called the scheduler that is in charge of scheduling requests. Requests go to the downloader, which fetches the pages from the internet and feeds the responses back to the spiders.
15:21
And we have different stages, with middlewares between all the stages. So, we can modify the requests, modify the responses, modify the items. It's pretty pluggable. And then the spiders return either requests or items.
15:41
The requests go back to the scheduler, and the items go through the item pipeline. So, well, I think it's quite easy to see what the flow is. What kinds of things can we do in an item pipeline? Well, we can set default values for fields.
16:03
Imagine that we have some fields on an item and we want to set a default value. It's also quite useful for validating scraped data, so we can say which items are not valid. It's also quite useful for checking for duplicates.
16:24
Imagine that some website is really broken and pagination doesn't work, okay? You are getting the same page again and again, but this way you don't end up with the same item twice. It is also useful for storing items. Imagine that you want to save items to Amazon DynamoDB or any other DB out there.
16:46
You can just write an item pipeline that saves each item to Amazon DynamoDB, for example. And it is also the place to write third-party integrations.
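A minimal sketch of a pipeline combining two of the uses above, a default value and a duplicates check, assuming the spider yields dict-like items with a "url" field; it would be enabled through ITEM_PIPELINES in settings.py:

    from scrapy.exceptions import DropItem


    class DefaultsAndDedupPipeline(object):

        def __init__(self):
            self.seen_urls = set()

        def process_item(self, item, spider):
            # Default value for a field (the field names are assumptions).
            if not item.get("currency"):
                item["currency"] = "EUR"
            # Drop items we have already seen, e.g. when pagination is broken.
            if item.get("url") in self.seen_urls:
                raise DropItem("Duplicate item: %s" % item.get("url"))
            self.seen_urls.add(item.get("url"))
            return item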
17:02
So, if you have an item with a product description and you want to translate it into another language, you could say, okay, I will integrate with the Google Translate API, translate some fields,
17:21
and, well, the item gets the translated fields. We also have middlewares. What are middlewares for? Well, they can process requests and responses. Basically, they are useful for session handling.
17:40
That comes out of the box in Scrapy; it's already working, so it handles cookies for you. There is also the retry middleware: imagine that you get a 500 response or a malformed response from some website.
18:01
You can say, okay, let's retry this request, and it goes back to the scheduler and will be scheduled later. We can also modify requests; for example, say I want to route this request through a specific proxy. And we can also use it for randomising user agents, so each request uses a different user agent.
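A sketch of a downloader middleware that does exactly that, picking a random user agent per request; the user-agent strings are placeholders, and it would be enabled through DOWNLOADER_MIDDLEWARES in settings.py:

    import random


    class RandomUserAgentMiddleware(object):

        user_agents = [
            "Mozilla/5.0 (Windows NT 6.1; rv:38.0) Gecko/20100101 Firefox/38.0",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0 Safari/537.36",
        ]

        def process_request(self, request, spider):
            # Called for every outgoing request before it reaches the downloader.
            request.headers["User-Agent"] = random.choice(self.user_agents)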
18:26
And, well, Scrapy is batteries included: since Scrapy 1.0 it has logging, using the Python standard logging. It also has a very powerful module for stats collection.
18:43
It also supports testing spiders through contracts; they're called contracts, I think it's a Twisted term. And it also offers a telnet console, which is a way to inspect an already running Scrapy process. So, you can inspect an already running spider.
19:02
You can do quite a lot of things, for example check if there are memory leaks, or pause and resume the spider. The telnet console is really handy. So, for all of you who want to check an example of a Scrapy project, just go to GitHub.
19:25
There we have a lot of spiders for different conferences. We basically scrape data for all the speakers of these conferences and do some visualisation.
19:41
So, we can know how many of them are male or female, and it is quite interesting to see how the number of women attendees is growing year by year. Now, this is the interesting part for all of you who already know Scrapy.
20:03
So, how to avoid getting banned. There's a handful of quick tips. First of all, rotate your user agent, or use a user agent that simulates a real browser, or even Googlebot. Also, disabling cookies is mandatory.
20:24
If you are not accessing protected user data, you can just disable cookies and it works, and it works really well. I mean, normally these websites try to track you through cookies.
20:44
Also, randomising the download delays helps. They might be tracking how much time there is between requests, so you can randomise the time between requests so they can't detect you so easily.
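In settings.py, those first tips map onto a handful of standard settings; the values here are just examples:

    # A browser-like user agent (or rotate it with a middleware, as shown earlier).
    USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; rv:38.0) Gecko/20100101 Firefox/38.0"

    # Don't let sites track the crawl through cookies.
    COOKIES_ENABLED = False

    # Base delay between requests (seconds), randomised between 0.5x and 1.5x.
    DOWNLOAD_DELAY = 2
    RANDOMIZE_DOWNLOAD_DELAY = True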
21:03
Also, use a pool of rotating IPs. Well, that's the most classic approach: you can buy a bunch of proxies and route all the requests through those proxies. But there is also Crawlera.
21:21
This is a product from us, from Scrapinghub. Basically, we provide you an IP and you perform all the requests through that IP, and we take care of handling bans, blocks, replacing proxies, rotating them. So, it's like magic, okay?
21:40
Let's say I want to perform 2 million requests to one website, okay? It works. So, I've been speaking about Scrapy mostly in terms of doing targeted crawls. But in case you want to know how you can approach a broad crawl,
22:04
there is a library that we have open sourced recently. It's called Frontera. Well, yesterday my coworker gave a presentation about it. So, please check those slides and check YouTube.
22:21
Or come by our booth and we can discuss how to use Frontera. So, okay, let's say we have written a bunch of spiders and now we want to deploy them somewhere. Well, we have a tool called scrapyd, the Scrapy daemon.
22:43
Of course, it's open source. It provides a web service, all over JSON, so basically it's a service daemon to run Scrapy spiders. You can deploy your project, schedule new spiders and new jobs,
23:03
and check the status of those jobs, and it works okay.
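A rough sketch of talking to a running scrapyd daemon over its JSON API, assuming the default port 6800 and a project that is already deployed; "myproject" and "myspider" are placeholders:

    import requests

    # Schedule a spider run...
    r = requests.post(
        "http://localhost:6800/schedule.json",
        data={"project": "myproject", "spider": "myspider"},
    )
    print(r.json())      # e.g. {"status": "ok", "jobid": "..."}

    # ...and check the pending / running / finished jobs of the project.
    r = requests.get("http://localhost:6800/listjobs.json",
                     params={"project": "myproject"})
    print(r.json())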
23:26
But we also have Scrapy Cloud. It's also from Scrapinghub. It's a commercial platform, but we have a free quota, and basically it's a visual web interface where you can deploy your spiders, schedule them, manage them and monitor them. It's also really useful for QA people. If you are into Scrapy, I recommend you give it a try.
23:42
We have a free quota, and if you come by our booth, we can provide you with a bigger one. About us, a bit about us: well, we do tons of open source, starting from Scrapy. We have open sourced Frontera recently.
24:02
You can just check our GitHub profile. I'm really proud of our team and how they approach open source. And we are a fully remote, distributed team. We are 110 people worldwide, fully remote.
24:23
And we have really great talent out there. So, well, this is the mandatory sales slide. Basically we do professional services around Scrapy, and we have two products, Scrapy Cloud and Crawlera.
24:41
I've already told you what they are. So please, if you're interested, just ask at our booth about them. Well, we're hiring constantly. So if you want to feel like Spiderman, get in contact with us.
25:03
It's a nice place to work, a fully remote team. And well, that's all. Gracias, thank you. And I think it's time for Q&A, so if you have any questions... We have a lot of time for questions.
25:34
Anybody? Okay, so I'll ask something. Oh, did somebody raise their hand?
25:40
Oh, there, sorry. What about JavaScript-intensive websites, with lots of Ajax requests? In my experience, I've been using Scrapy, but on top of that some headless browser like Splinter or something like that.
26:06
Do you have plans to integrate something like that? There's Splash. You can find it on our GitHub profile, and basically it's a scriptable headless WebKit engine that offers a JSON API.
26:23
And we also have another open source project called ScrapyJS that integrates with Splash directly. So, yeah, you can use Splash for, I mean, for performing all the requests through a WebKit engine.
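As a quick sketch, fetching a JavaScript-heavy page through a running Splash instance looks roughly like this, assuming the default port 8050; the target URL is a placeholder:

    import requests

    # Ask Splash to render the page in its headless WebKit engine and
    # return the resulting HTML.
    r = requests.get(
        "http://localhost:8050/render.html",
        params={"url": "http://example.com", "wait": 2},
    )
    html = r.text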
26:44
Any more questions? So I would like to ask: both of the previous talks have been about HTML. Do you have any solution, for example, if the data I want is in a PDF?
27:01
That's a tricky one. Well, I mean, yeah, you can feed Scrapy with any kind of data, really. But I think you need to check some PDF library for Python to deal with the PDF.
27:21
But, yeah, you can use the whole Scrapy engine, the pipeline, definitely. At the Scrapy level, there is no support for PDF; you need to use a third-party library. Thanks. Not so much a question, but a comment.
27:42
Of course, one of the big drawbacks of Scrapy is that it's not on Python 3 yet. So I just wanted to mention we'll be doing a sprint on that at the weekend; if anybody's available, please come. Yeah, I mean, we are holding a Scrapy workshop this Friday as well, for all of you who want to learn more about Scrapy.
28:02
We have our booth outside, so please come by and say hi. And we have some cool stuff as well. And also we are trying to hold some sprints this weekend about Scrapy. So if you're interested, please tell us. We are really open to that.
28:23
Thanks.