Web Scraping Best Practices
Formal Metadata

Title: Web Scraping Best Practices
Part Number: 79
Number of Parts: 173
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers: 10.5446/20204 (DOI)
Production Place: Bilbao, Euskadi, Spain
EuroPython 2015, 79 / 173
Transcript: English (auto-generated)
00:05
Okay, hi everyone. Thanks for coming. I'm very sorry about the technical difficulties; we clearly should have had a bit more time to set up and prepare. And please try not to look ahead too far in the slides. I know it's going to be difficult, but there you go.
00:21
Okay, so I'm going to talk about web scraping best practices. I originally called this "advanced web scraping", because we're going to touch on a lot of advanced topics, but it's not advanced in the sense that you need to be past the beginner level to understand it. So I changed it to best practices, and I hope that
00:40
everybody can follow this talk and understand what's going on; if you can't, please just shout or let me know. So, a bit about me: about eight years ago I started scraping in anger, and that was around the time when we did Scrapy, the web scraping framework. Since then we've been involved in a couple of other projects,
01:04
including Portia and Frontera, and if you don't know what they are, don't worry, I'll get to them later. So why would you want to scrape? Well, there are lots of good sources of data on the internet, and we come across a lot of companies, universities and research labs of all different sizes who are using web scraping.
01:25
But getting data from the web is difficult. You can't rely on APIs, you can't rely on semantic markup, and that's where web scraping comes in. These are some stats; you probably can't read them very well because they're small,
01:42
but basically web scraping has been on the increase recently. We've seen that ourselves, and it's also something other companies report. These stats are from a company called Incapsula, which provides anti-bot technology, and they cover a sample of its customers, so they're probably not completely representative of the internet as a whole.
02:02
But it's still very interesting to see, and another thing I can see from this is that smaller websites have a larger percentage of bot traffic, probably because they have fewer users. That's something to keep in mind, especially if you write bad bots: they cause more trouble for smaller websites. Smaller websites might have bandwidth limits, for example,
02:24
and many HTTP libraries don't compress content, so you can easily go over their bandwidth limits. Also, of course, doing a bad job means your web scrapers are very hard to maintain. That's a notorious problem, because websites change.
02:42
So when I think about web scraping, I like to think of it in two parts. The first is actually getting the content: finding good sources of content and downloading it. The second is extraction: actually extracting structured data from that downloaded content. I've structured this talk in two parts as well, following that split.
03:05
As an example of web scraping: I just said that Scrapinghub gets scraped all the time, and it's not just people testing out Scrapy or our other tools. A couple of weeks ago we posted a job ad on our website, and the next day it was up on a job listing board.
03:24
None of us posted it there, so we thought, well, how did that happen? We think we were probably scraped. So a question for the audience is to think about how you would write that scraper. I would break it down into: how do I find good sources of content, and how do I extract that data? It turns out we tweeted about the job,
03:45
with the hashtag #remoteworking, so maybe somebody picked it up from Twitter after it got retweeted; that would be an easy source of content. And we did use semantic markup, so perhaps they extracted it from that. Writing such a scraper is a relatively easy task; you could do it in a day, maybe.
04:04
But if you wanted to handle cases where people didn't use semantic markup, or you wanted to find people who didn't tweet about a job or post it to some other website, then it becomes a much bigger and much more complex task. I think that highlights the scope of web scraping, from the very easy,
04:22
cool, fun hacks that don't take very long, to the very ambitious and very difficult projects. So, moving on to downloading. I'm going to mention the Python requests library; many people probably know it. It's a great library for HTTP, and
04:43
it makes doing simple things as simple as it should be. But when you start scraping at a bit more scale, you really want to worry about a few other things, like retrying requests that fail. Certainly when we started out, you'd run a web scrape and it might take days to finish,
05:03
and then about three-quarters of the way through you'd get a network error, or the website you're scraping would suddenly return 500 Internal Server Error for 10 minutes. If you don't have some policy to handle this, it's a huge pain, so you want to think about that.
05:21
Also, in this example you can see I'm using a session. I don't know if you can see it or not because it's small, but consider using sessions with Python requests. Sessions handle cookies, and they also use connection keep-alive, so you don't end up repeatedly opening and closing connections to the sites you scrape.
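The snippet from the slide isn't reproduced in the transcript, but a minimal sketch of the session-plus-retry idea with requests might look like this (the URL is just a placeholder):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry failed requests a few times, backing off between attempts,
# and retry transient 5xx responses as well as connection errors.
retries = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])

session = requests.Session()  # keeps cookies and reuses connections (keep-alive)
session.mount("http://", HTTPAdapter(max_retries=retries))
session.mount("https://", HTTPAdapter(max_retries=retries))

response = session.get("https://example.com/jobs", timeout=30)
response.raise_for_status()
```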
05:41
But I would say, as soon as you start crawling, you really want to think about using Scrapy right away. This little example here is not much code. It uses Scrapy's CrawlSpider, which is a common pattern for crawling: you just define one rule and a start URL, and that's enough to go from the EuroPython website for this conference to actually
06:05
follow all the links to speakers; you just need to fill in some code to parse the speaker details. So it's really not much code, and it solves all the problems I highlighted, like retrying and so on, and you can cache the data locally, which is good if you're going to live-demo stuff.
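The CrawlSpider example itself isn't visible in the transcript; a minimal sketch of the pattern being described might look like this (the start URL and link pattern are assumptions, not the actual site layout):

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SpeakerSpider(CrawlSpider):
    name = "speakers"
    # Hypothetical conference site and speaker-page URL pattern.
    start_urls = ["https://ep2015.europython.eu/"]
    rules = [
        Rule(LinkExtractor(allow=r"/speaker/"), callback="parse_speaker"),
    ]

    def parse_speaker(self, response):
        # Fill in whatever speaker details you need here.
        yield {
            "name": response.css("h1::text").get(),
            "url": response.url,
        }
```

Scrapy's retry middleware covers the retrying, and enabling the HTTP cache (HTTPCACHE_ENABLED = True in the settings) gives you the local caching mentioned above.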
06:25
A single crawl like that often turns into crawling multiple websites. At PyCon US in 2014 we did a demo, and it's up on Scrapinghub's GitHub account; it's called PyCon speakers. We actually scraped
06:43
data from a whole lot of tech conferences. This is a really good example to look at, because it shows a way to manage a scraping project when you've got a lot of spiders, and Scrapy provides a lot of facilities for managing that; for example, you can list all the spiders that are there. A
07:01
spider is a bit of logic that we write for a given website. The project also shows best practices in terms of how easy it is with Scrapy to put common logic in common places and share it across multiple websites when they're crawling the same type of thing; there's a lot of scope for code reuse. So Scrapy is definitely good for scraping multiple websites.
07:21
So, some tips for crawling and finding good sources of things. Some people might not think about using sitemaps, and Scrapy actually has a SitemapSpider that makes this very easy and transparent. It can often be a much more efficient way to get to the content.
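A minimal sketch of that sitemap approach (the domain and URL pattern are made up):

```python
from scrapy.spiders import SitemapSpider


class ProductSitemapSpider(SitemapSpider):
    name = "products_from_sitemap"
    # Hypothetical sitemap location; only product URLs get parsed.
    sitemap_urls = ["https://example.com/sitemap.xml"]
    sitemap_rules = [("/product/", "parse_product")]

    def parse_product(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```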
07:42
That also means, of course: don't follow unnecessary links. You can waste an awful lot of resources, for everybody, following things that don't need to be followed. Also consider crawl order: if you're discovering links on a website, it might make sense
08:00
to crawl breadth-first and to limit the depth you go to. This can help you avoid crawler traps, where maybe you're repeatedly scraping a calendar and just walking through the dates; that's a common example.
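In Scrapy, that breadth-first, depth-limited crawl order can be expressed with a few settings; a sketch with illustrative values:

```python
# settings.py (illustrative values)
DEPTH_LIMIT = 5      # don't follow links deeper than this
DEPTH_PRIORITY = 1   # prefer shallower requests ...
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"  # ... with FIFO queues, giving breadth-first order
```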
08:23
every permutation and combination of search And this generated huge load, of course So as you decide to scale up So I was talking here about maybe single website scrapes, which is I guess the most common use case at least especially for scraping
08:43
And single-website scrapes can be big, right? We frequently do maybe hundreds of millions of pages. But at scale, say you're writing a vertical search engine or a focused crawler, then we're talking about maybe tens of billions or even hundreds of billions of discovered URLs.
09:03
So you might crawl a certain set of pages, but the number of URLs you discover on those pages, and so the entire state you need to keep in your URL frontier, can be much, much larger. Maintaining all of that is a bit of a headache; it's a lot of data. One common way to do it is to just write all that data somewhere
09:23
and then perform a big batch computation to figure out the next set of unique URLs to crawl, typically using Hadoop or MapReduce; it's a very common approach. Incremental crawling, or really continuous crawling, is where you're continuously feeding URLs to your
09:46
crawlers. That has the advantage that you can respond much more quickly to changes; you don't need to stop the crawl and resume it. Also, nowadays you might want to repeatedly hit some websites; maybe you're following social media or something like that as a good source of links. So it's much more useful,
10:04
but it's much more complex at the same time, and it's a harder problem to solve. Maintaining politeness is a little point at the bottom, but it's something you really want to consider when you're doing this at any scale. Almost anybody can fire up a lot of instances nowadays on EC2
10:20
or their favourite cloud platform and download loads of pages really quickly, without putting much thought into what those pages are, or particularly into the impact it's going to have on the websites being crawled. In a larger crawl, where you're crawling from multiple servers, you would typically
10:42
only crawl a single website from a single server, and that server can then maintain politeness, so you can ensure that whatever your crawling policies are, you don't break them. So, Frontera: I thought I'd briefly mention it. Alexander Sibiryakov gave a talk on it yesterday.
11:02
It's a Python project that we worked on, and are still working on, that implements this crawl frontier. It maintains all the state about visited URLs and tells you what you should crawl next, and there are a few different configurable backends for it. You can use it embedded in your Scrapy crawl,
11:21
or you can use it via an API with your own thing. It implements some more sophisticated revisit policies, so if you want to go back to some pages more often than others and keep content fresh, it can do that. And I think Alexander particularly talked about doing it at scale:
11:41
he did a crawl of the Spanish internet, and he'll also be talking about that in the poster session, so please come visit. So, just to summarize quickly what we talked about for downloading: requests is an awesome library for simple cases, but once you start crawling it's better to move to Scrapy quickly;
12:01
maybe you even want to start there. And if you need to do anything really complicated or sophisticated, or at scale, consider Frontera. So, moving on to extraction, the second part I wanted to talk about. Of course, Python is a great language for extracting
12:20
content, for messing with strings and data. There are probably a lot of talks at this conference about managing data with Python, and even just the built-in features of the language and the standard library make it very easy to play with text content. Regular expressions, of course, are one thing that's built into the standard library, and
12:45
we should probably say something about them. Regular expressions are brilliant for textual content; they work great with things like telephone numbers or postcodes.
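As a tiny illustration of the kind of purely textual pattern where a regular expression shines (the pattern is a rough UK-style postcode matcher, just for the example):

```python
import re

text = "Ship to EC1V 9BX, or to M1 1AE if the first office is closed."
# Rough UK-style postcode pattern; good enough for an illustration.
postcode_re = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b")
print(postcode_re.findall(text))  # ['EC1V 9BX', 'M1 1AE']
```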
13:02
But if you ever find yourself matching against HTML tags or HTML content, you've probably made a mistake, and there's probably a better way to do it. I see this kind of regular-expression code all the time, and yeah, sometimes it works, but it's hard to understand and modify, and often it doesn't actually work. So, other techniques: use HTML parsers.
13:22
There we have some great options. This is for when you want to extract based on the structure of HTML pages: often you'll say, OK, this area here, surrounded by this, underneath that table. For that, HTML parsers are absolutely the way to go.
13:45
Just as a brief example: on the right-hand side I have some examples of HTML parsers: lxml, html5lib, Beautiful Soup, Gumbo, and of course Python has its own built-in HTML parser.
14:04
I'll talk about them a bit more in a minute, so don't worry if you can't see that. As a brief example of what they do: they take some raw HTML, which looks like text, and create a parse tree, and then you use some technique to navigate that parse tree and extract the bits you're interested in; these parsers usually provide some method for doing that.
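A minimal sketch of that parse-then-navigate flow with lxml, one of the parsers just listed, using XPath (which comes up next) as the navigation step; the HTML snippet is made up:

```python
from lxml import html

raw = "<html><body><div><b>first</b></div><div>second <b>bold</b></div></body></html>"
tree = html.fromstring(raw)          # build the parse tree from raw HTML

print(tree.xpath("//b/text()"))      # text of all <b> tags: ['first', 'bold']
print(tree.xpath("//div/b"))         # <b> elements that sit under a <div>
print(tree.xpath("//div[2]/text()")) # text of the second <div>: ['second ']
```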
14:22
I don't know if you can see this, so I'll skip over it quickly, but I quite like XPath as a way to do it. It's very powerful: in this case you can select all the bold tags, or a bold tag under a div,
14:41
or the text from the second div tag. It lets you specify rules, and it's really worth learning if you're going to be doing a lot of this. Here's an example from Scrapy; you don't really need to read it, but basically Scrapy provides a nice way for you to call XPath or CSS selectors
15:03
directly on responses. This is probably the most common way to scrape content from a small set of known websites.
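The Scrapy snippet on the slide isn't readable in the transcript; a small stand-in showing the same idea, calling XPath and CSS selectors the way a spider callback would on a response (the HTML is invented):

```python
from scrapy.selector import Selector

body = "<ul><li class='talk'>Web Scraping</li><li class='talk'>Frontera</li></ul>"
sel = Selector(text=body)  # inside a spider you would call response.xpath / response.css directly

print(sel.xpath("//li[@class='talk']/text()").getall())  # ['Web Scraping', 'Frontera']
print(sel.css("li.talk::text").getall())                 # same result via a CSS selector
```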
15:20
I definitely want to mention Beautiful Soup as well; it's a very popular Python library. Maybe in the early days it was a bit slow, but with more recent versions you can use different parser backends, so you can even run Beautiful Soup on top of lxml. The main difference from the example I showed previously is that Beautiful Soup offers a pure Python API, so you navigate content using Python constructs and objects rather than XPath expressions.
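A minimal Beautiful Soup sketch of the same kind of extraction, navigating with Python objects instead of XPath (it uses the lxml backend if that is installed):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><b>first</b></div><div>second <b>bold</b></div>", "lxml"
)
print([b.get_text() for b in soup.find_all("b")])  # ['first', 'bold']
print(soup.find_all("div")[1].contents[0])         # 'second ', the text inside the second <div>
```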
15:46
The other thing is, of course, that you might not need to do any of this at all; maybe somebody has already written something to extract what you're looking for, maybe even things you wouldn't think of. Some examples of things we've done:
16:02
we wrote a login form module for Scrapy that automatically fills in forms and logs in to websites. We have a date parser module that takes textual strings and builds date objects from them.
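For the date-parsing part, a library like dateparser (Scrapinghub's open-source date parsing library, assuming that is the module being referred to) does roughly this:

```python
import dateparser

print(dateparser.parse("12 July 2015"))           # datetime(2015, 7, 12, 0, 0)
print(dateparser.parse("2 weeks ago"))            # relative dates work too
print(dateparser.parse("21 de octubre de 2014"))  # as do other languages
```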
16:22
And webpager is another project we wrote: it looks at an HTML page and pulls out the links that perform pagination, which is often useful. I was going to live-demo this, but I think we're short on time, and maybe it's not worth tempting fate; we've had enough technical problems already.
16:41
Portia is a visual way to build web scrapers. It's applicable in many of the cases I mentioned previously where we would use XPath or Beautiful Soup, but its advantage is a nice UI where you can visually say: I want to select this element, this is the title, this is the
17:03
image, this is the text. I was going to demo this by scraping the EuroPython website; maybe if somebody wants to drop by our booth later I can show you. It's really good and it can save you a lot of time. However, it's not as applicable if you have some kind of complex rules,
17:25
complex extraction logic; it might not always work for that. And of course, if you want to use any of the previously mentioned things, like automatically extracting dates, they might not be built into Portia yet.
17:41
So, scaling up extraction. Portia is great; it's much quicker for writing extraction for websites, but at some point it becomes impractical. You might be scraping 20 websites, that's fine; a hundred; people have used it to scrape thousands. But what about tens of thousands, or maybe even hundreds of thousands? At that point you want to look at different techniques.
18:03
There are some libraries that can extract articles from any page, and they're easy to use. But I want to focus quickly on a library called webstruct that we worked on, which helps with automatically extracting data from HTML pages, and the example I'm going to use is named entity recognition.
18:23
In this case we want to find elements in the text and assign them to categories. We start by annotating web pages: we label web pages with examples of what we want to extract. We're going to use a tool called WebAnnotator, but there are others.
18:44
Here's an example of labelling. In this case we want to find organization names, so "The Old Tea Cafe" is an organization, and we would label it within a sentence, within a page. That format is not so useful for machine learning and the kind of tools we want to use, so we convert it.
19:06
The text is split into tokens; each token in this case is a word, and we label every single token in the whole page as being either outside what we're looking for, at the beginning of an organization, or inside an organization.
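A tiny illustration of that token-level labelling, using the usual outside/begin/inside (BIO) scheme; the sentence and the organization name are hypothetical:

```python
# O = outside, B-ORG = beginning of an organization, I-ORG = inside one.
tokens = ["Meet", "us", "at", "the", "Old",   "Tea",   "Cafe",  "tomorrow"]
labels = ["O",    "O",  "O",  "O",   "B-ORG", "I-ORG", "I-ORG", "O"]
```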
19:24
Given that encoding, we can then apply more standard machine learning algorithms. In our case we found conditional random fields to be a good way to go about it, but an important point is that the model needs to take into account the sequencing of the tokens, the sequencing of the information.
19:43
As for features: we feed it not just the tokens themselves but actual features, and the features can be things about the token itself, but they can also take into account the surrounding context. This is a very important point: we can take into account the surrounding text, or even the HTML elements it's embedded in.
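A sketch of the kind of per-token feature function such a tagger consumes; the function name and the specific features are illustrative, not webstruct's actual API:

```python
def token_features(tokens, i):
    """Features for the i-th token, including a little surrounding context."""
    token = tokens[i]
    return {
        "lower": token.lower(),
        "is_title": token.istitle(),
        "is_digit": token.isdigit(),
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }
```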
20:03
So it can be quite powerful. One way to do it, and this is what we've been doing recently, is to use our webstruct project. It helps you load the annotations that were done previously in WebAnnotator, and it calls back to Python modules that you write yourself to do the feature extraction,
20:25
and then it interfaces with something like CRFsuite, via python-crfsuite, to actually perform the extraction. So, just to briefly summarize: we use slightly different technologies depending on the scale of the extraction.
20:44
HTML parsing and Portia are very good for a single page or a single website, or for multiple websites if we don't have too many. The machine learning approaches are very good if we have a lot of data; we compromise a bit on accuracy, but that's the nature of it. I
21:01
just wanted to briefly mention a sample project we've done recently; actually, we're still working on it. You might know the Saatchi Art Gallery, a gallery of contemporary art in London. We did a project with them to create content for their Global Gallery Guide. This is an ambitious project to
21:20
showcase artworks, artists and exhibitions from around the world. It's a fun project, and it's nice to look at artworks all day. Of course we use Scrapy for the crawling; we deployed it to Scrapy Cloud, which is a Scrapinghub service for running Scrapy crawls; and we use webpager, one of the tools I mentioned earlier, to
21:44
actually paginate. For the crawl, we prioritize the links to follow, and we do so using machine learning, so we don't waste too many resources on each website we scrape; once we hit the target web pages, we use webpager to paginate. So that's the crawling side. On the extraction side,
22:02
we use webstruct very much as I previously described. One interesting thing that came up was that when we were extracting images for artists, we often got them wrong, and we had to add a classification step: we actually classified the images
22:21
based on the image content, using face recognition, to see which ones were artists versus artworks. It's working pretty well. This is now scraping 11,000 websites, and hopefully that will continue to increase. One important thing, of course, is to measure accuracy, to test everything, and to improve incrementally.
22:42
It's also good not to treat these things too much like a black box; try to understand what's going on, and don't make random changes, because that tends not to work so well. So, briefly, we've covered downloading content and we've covered extracting it. It seems like we have everything we need to go and scrape at large scale, but there are still plenty of problems, and I'm just going to touch on a few in the last five minutes.
23:08
Of course, web pages have irregular structure, and this can break your crawl pretty badly; it happens all the time. Superficially, some websites look like they're structured, but it turns out somebody was using a template in a word processor or something, and there are just loads of variations that kill you.
23:25
Other times, maybe the developers have too much time on their hands and they write a million different kinds of templates. You can discover halfway through that the website is doing multivariate testing, and it looks different the next time you run your crawl. I wish there were a silver bullet or some solution I could offer you for these, but there isn't.
23:42
Another problem that will come up is sites requiring JavaScript and browser rendering. We have a service called Splash, which is a scriptable browser that presents an HTTP API, so it's very useful to integrate with Scrapy and other services.
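Splash's HTTP API can be used from plain Python as well; a minimal sketch, assuming a Splash instance is listening on localhost:8050:

```python
import requests

resp = requests.get(
    "http://localhost:8050/render.html",                  # Splash's render endpoint
    params={"url": "https://example.com", "wait": 2.0},    # wait a bit for JavaScript to run
    timeout=60,
)
rendered_html = resp.text  # the page after the browser has executed its JavaScript
```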
24:00
You can write your scrapers in Python and just have Splash take care of the browser rendering, and you can script extensions for it in Lua. Selenium is another project: if you start thinking "follow this link, type this here", Selenium is a great way to go. And finally, of course, you can open the browser's web inspector and see what's happening.
24:24
That is maybe the most common approach for Scrapy programmers, because it's quite efficient: often there's an API behind the scenes that you can use instead. Proxy management is another thing you might want to consider, because some websites will give you different content depending on where you are.
24:41
We crawled one website that did currency conversion, so I thought I was being very clever by selecting the currency at the start, but it turned out the website did a double conversion on some products. Websites also often ban hosting centres where they've had one or two abusive bots; it could have been somebody else,
25:02
but that's just part of the nature of scraping from the cloud. So, for reliability, and sometimes for speed, you might want to consider proxies. Please don't use open proxies: they sometimes modify content, and it's just not a good idea.
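Routing requests through a private proxy is straightforward; a sketch with requests (the proxy address and credentials are placeholders):

```python
import requests

proxies = {
    "http": "http://user:password@proxy.example.com:8080",
    "https": "http://user:password@proxy.example.com:8080",
}
resp = requests.get("https://example.com/", proxies=proxies, timeout=30)
```

In Scrapy, the equivalent is usually a downloader middleware, or setting request.meta["proxy"] per request.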
25:21
Tor: I generally don't like it for large-scale content scraping; it's not really what it's intended for. But we've done some things with government agencies, or in the security area, where we really don't want any blowback from the scraping and it needs to be truly anonymous. Otherwise, there are plenty of private providers, but they vary in quality.
25:42
Finally, the last slide: I just want to briefly mention the ethics of web scraping. I think the most important question to ask yourself is: what harm is your web scraping doing, either on the technical side or with the content that you scrape? Are you misusing it? Are you hurting the sites you're getting it from?
26:01
On the technical side, crawl at a reasonable rate, and it's best practice to identify yourself via the user agent and to respect robots.txt, especially on broad crawls, that is, when you visit lots of websites. And that's it. Do we have some questions?
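That politeness advice maps onto a handful of Scrapy settings; a minimal sketch, with placeholder values and a placeholder contact URL:

```python
# settings.py (illustrative)
USER_AGENT = "example-crawler (+https://example.com/bot-info)"  # identify yourself, with a contact point
ROBOTSTXT_OBEY = True               # respect robots.txt
DOWNLOAD_DELAY = 1.0                # crawl at a reasonable rate
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # don't hammer any single site
```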
26:33
Thank you. Imagine you have to log in to some websites, and you use a tool that generates some fake
26:41
credentials and such, for example a profile of a programmer, or a profile of a farmer, or a rock star, and so on. Thanks. Okay, so, about logging in to websites: the tool I mentioned just finds the login form and lets you configure the user ID that you want to use,
27:00
so it doesn't handle managing multiple accounts. I have seen people do that, but it's not something I've done myself, so sorry, that's all I can say about it, really. Any other questions?
27:27
Hi. First of all, thanks for the Scrapy library; it's an awesome thing and we're using it. That's great to hear. Actually, these are the guys you should be thanking, here in the audience. I may have gotten the ball rolling, but stand up, guys, stand up.
27:43
But these are the contributors. I think there are more of them up there, but I don't know why they're being shy. Sorry, yeah, I probably have a few questions, but I'll only ask a couple, I guess.
28:02
First, I'd like to mention PyQuery. That was an awesome change for us, coming from XPath. Have you maybe tried that? It's one thing we use regularly, and it has proved itself. Yeah, I've heard of it, but I haven't really tried it properly,
28:23
so yeah, we'll check it out, and I think there might be scope for including other approaches to extraction. Okay. And another one: did you maybe think about master spiders? You said that APIs are brittle, and
28:42
yeah, but you could still think of web frameworks: some behave in similar ways, and maybe you could find a way to extract certain information from certain kinds of websites. Yeah, absolutely. For example, we have a collection of spiders for all the forum engines;
29:02
it's not individual websites, it's the underlying engine powering them, and that works really well. We're building collections of those kinds of things. About my API example: I didn't really mean to diss APIs in general, and they're often quite useful, but in some cases they don't have the content you're after, and in some cases the content maybe lags behind what's on the website.
29:23
That's been my experience. But definitely, if there is an API available, you should check it out; it works fine with Scrapy too. Okay, and just one last question, a little bit more technical: do you have plans for anything to
29:45
handle throttling, or to handle robots.txt, or to reschedule 500 errors, or something like that? Yeah, I know there's the AutoThrottle plugin, but it slows you down significantly on a good website,
30:03
though it does work for slow websites. Thanks. You're welcome. Yeah, throttling is an interesting one. Often, internally, what we do is deploy with AutoThrottle by default and then override it when we know the website can handle more. So, especially when you're crawling a single website or a small set of websites, it's worth tuning that yourself.
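The kind of per-project tuning being described, including the retry-code override that comes up just below, might look like this in settings.py (values are illustrative):

```python
# settings.py (illustrative)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0

RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]  # which responses count as retryable errors
```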
30:24
It's hard to find good heuristics, and it's definitely something we do all the time when we write individual scrapers. I'd be interested in your thoughts about how we could come up with better heuristics by default; it's definitely a very interesting topic.
30:41
And retrying: again, Scrapy does retry things by default, but you can configure, for example, the HTTP error codes that signify an error you want retried, because they're not always consistent across websites. Thank you. A slight follow-up to the retry thing; you mentioned this briefly during the talk: do you actually do things like
31:05
backoffs and jitter and so on? Because at my job we have very interesting situations with synchronized clients and other fun things that are good to avoid. Yeah, definitely, definitely. Actually, I glossed over a lot of details.
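The backoff-with-jitter idea the question refers to is, in general form, something like this (a generic sketch, not Scrapy's built-in retry behaviour):

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff with full jitter: random delay in [0, min(cap, base * 2**attempt))."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


for attempt in range(5):
    time.sleep(backoff_delay(attempt))  # then retry the request
```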
31:23
I said we run in Scrapy Cloud, and that takes care of a lot of the infrastructure we typically need. Alexander gave a talk on the crawl frontier, which is about crawling at scale, and there's a lot more that goes into that which happens outside of Scrapy itself. The first thing we noticed as soon as we started crawling from EC2, for example, was DNS errors all over the place,
31:46
and there are several technical hurdles you need to overcome, I think, to do a larger crawl at any scale. Okay, thank you, Shane, thanks very much.