
Web Scraping in Python 101


Formal Metadata

Title
Web Scraping in Python 101
Title of Series
Part Number
103
Number of Parts
119
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Production Place
Berlin

Content Metadata

Subject Area
Genre
Abstract
M.Yasoob Khalid - Web Scraping in Python 101

This talk is about web scraping in Python, why web scraping is useful and what Python libraries are available to help you. I will also look into proprietary alternatives and will discuss how they work and why they are not useful. I will show you the different libraries used for web scraping and some example code so that you can choose your own personal favourite. I will also explain why writing your own scraper in Scrapy gives you more control over the scraping process.

-----

Who am I?
=========
* a programmer
* a high school student
* a blogger
* Pythonista
* and tea lover
- Creator of freepythontips.wordpress.com
- I made soundcloud-dl.appspot.com
- I am a main contributor to youtube-dl.
- I teach programming at my school to my friends.
- This is my first programming-related conference.
- The life of a Python programmer in Pakistan

What is this talk about?
========================
- What web scraping is and why it is useful
- Which libraries are available for the job
- Open source vs proprietary alternatives
- Which library is best for which job
- When and when not to use Scrapy

What is web scraping?
=====================
"Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites." - Wikipedia

### In simple words:
It is a method to extract data from a website that does not have an API, or when we want to extract a LOT of data which we cannot get through an API due to rate limiting. Through web scraping we can extract any data which we can see while browsing the web.

Usage of web scraping in real life
==================================
- to extract product information
- to extract job postings and internships
- to extract offers and discounts from deal-of-the-day websites
- to crawl forums and social websites
- to extract data to make a search engine
- to gather weather data, etc.

Advantages of web scraping over using an API
============================================
- web scraping is not rate limited
- anonymously access the website and gather data
- some websites do not have an API
- some data is not accessible through an API, etc.

Which libraries are available for the job?
==========================================
There are numerous libraries available for web scraping in Python. Each library has its own weaknesses and plus points. Some of the most widely known libraries used for web scraping are:
- BeautifulSoup
- html5lib
- lxml
- re (not really for web scraping, I will explain later)
- Scrapy (a complete framework)

A comparison between these libraries
====================================
- speed
- ease of use
- what do I prefer
- which library is best for which purpose

Proprietary alternatives
========================
- a list of proprietary scrapers
- their price
- are they really useful for you?

Working of proprietary alternatives
===================================
- how they work (they render JavaScript)
- why they are not suitable for you
- how custom scrapers beat proprietary alternatives

Scrapy
======
- what is it
- why is it useful
- asynchronous support
- an example scraper
Keywords
Transcript: English (auto-generated)
So, first of all, I would like to thank the organizers of EuroPython, who gave me the chance to stand here and speak in front of you.
So my talk today is Web Scraping in Python 101. If you're already experienced with web scraping, this is not the right place for you to be, because it's a 101. So first of all, a little bit of introduction about me. I'm Muhammad Yasoob Ullah Khalid. I'm a programmer, a high school student, a blogger, a Python instructor, and a tea lover.
So, my experience. I'm the creator of the Free Python Tips blog, and I've made a couple of open source programs. I'm a contributor to youtube-dl. It's a downloader for videos which supports more than 300 websites where you can download
videos from. And, finally, I teach programming at my school to my friends, and this is my first ever conference. So, what this talk is going to be about. This talk is about web scraping, the libraries which are available for this job in Python, and which library is better for which job. I will also give an introduction to Scrapy and some of its internals, and I will also
tell you when and when not to use Scrapy. So what is web scraping? Web scraping, web harvesting, or web data extraction is a computer software technique of extracting information from websites. Usually, such a software program simulates human exploration of the World Wide Web by
either implementing the low-level Hypertext Transfer Protocol or embedding a fully-fledged web browser such as Internet Explorer or Firefox. So that was Wikipedia. Now let's come to my understanding of web scraping. In simple words, it's a method to extract data from a website that does not have an API, or when we want to extract a lot of data which we cannot get through an API
due to rate limiting. Some websites only allow a specific number of calls to their API, so if you want to extract a lot of data, you cannot do that through the API; you have to turn to web scraping for that purpose. And with web scraping, we can extract any data which we can see while surfing the web.
So, the usage of web scraping in real life. There are a lot of use cases for web scraping. We can extract product information, job postings and internships. We can extract offers and discounts from deal-of-the-day websites. We can extract information from curation websites. We can crawl forums and social websites.
We can extract data to make a search engine, just like Google or Yahoo. And finally, we can gather weather data. These are just some use cases; there are a lot of others as well. So, the advantages of web scraping over using an API. First of all, web scraping is not rate limited. You can cycle and rotate IP addresses for web scraping.
It is anonymous: you can access a website anonymously through the Tor network. Some websites do not have an API. For example, Wikipedia didn't have an API until a few years ago, so you could only use web scraping to extract data from it. And some data is not accessible through an API, for example on YouTube.
If you use YouTube's API, you cannot access the direct URL of the video, the MP4 URL, and so on. So, the essential parts of web scraping. Web scraping follows a basic workflow. First of all, you have to fetch the website using an HTTP library.
Then you have to parse the HTML documents using a parsing library, and then you have to store the results for further usage and analysis. I'll focus more on parsing, because it's the main bottleneck in web scraping. So, the libraries available for the job in Python. These are the parsing libraries: first of all, we have BeautifulSoup, lxml and re.
re is the regular expressions library of Python. It is not really meant for web scraping, and I'll explain that later. Lastly, we have Scrapy. It's a full-blown web scraping framework made by Pablo Hoffman. So, some HTTP libraries for web scraping. For that purpose, we have the requests library.
You can simply do requests.get with the URL and then .text, and you have the HTML. Then you can use urllib and urllib2: you can just do urllib2.urlopen with the URL and then .read. And finally, you can use httplib and httplib2 if you want to go low level. But most of the time, the requests library is the best one for this purpose.
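For illustration, a minimal sketch of that fetch step (the URL is a placeholder; requests works on both Python 2 and 3, while urllib2 is the Python 2 name and corresponds to urllib.request on Python 3):

    import requests
    import urllib2  # Python 2 only; on Python 3 use urllib.request instead

    url = "http://example.com"

    # Option 1: the requests library - usually the most convenient
    html = requests.get(url).text

    # Option 2: the standard-library urllib2
    html = urllib2.urlopen(url).read()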
Then we have the parsing libraries. First of all, we have BeautifulSoup. It's really easy and has a beautiful API. You can simply call BeautifulSoup, pass in the HTML document as the argument, and then traverse the tree simply using .title to get the title, or .b to
get the b tags. Then we have lxml. You can simply do lxml.html.fromstring, pass in the HTML document as a string, and then apply XPath expressions to extract the data from the document. Finally, we have regular expressions. You can simply do re.findall or re.search and pass in the regular expression and the HTML document.
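As a rough sketch of the three approaches just described (the HTML snippet is invented, and BeautifulSoup is imported from the bs4 package):

    import re
    import lxml.html
    from bs4 import BeautifulSoup

    html = "<html><head><title>Example</title></head><body><b>bold</b></body></html>"

    # BeautifulSoup: build a tree and traverse it by attribute access
    soup = BeautifulSoup(html)
    print(soup.title.text)  # Example
    print(soup.b.text)      # bold

    # lxml: build a tree and query it with XPath
    tree = lxml.html.fromstring(html)
    print(tree.xpath("//title/text()"))  # ['Example']

    # re: match a pattern directly against the raw string
    print(re.findall(r"<title>(.*?)</title>", html))  # ['Example']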
So let's look at them in a little more detail. First of all, BeautifulSoup. It has a beautiful API: you just need the find and find_all methods, and that's all. It's really easy to use.
It can handle broken markup really easily. A lot of websites do not have proper HTML markup, so if you come across a website which does not have proper markup, you should use BeautifulSoup, because it can handle broken markup. It's pure Python, but it's really slow, so most people disregard BeautifulSoup in production.
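A small sketch of what handling broken markup means in practice (the snippet is invented; note the unclosed tags, which BeautifulSoup repairs while building the tree):

    from bs4 import BeautifulSoup

    # unclosed <b> and <p> tags, as often found on real-world pages
    broken = "<html><body><p>first paragraph<b>bold text<p>second paragraph"

    soup = BeautifulSoup(broken)
    for p in soup.find_all("p"):
        print(p.get_text())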
Then we have lxml. The lxml toolkit provides Pythonic bindings for the C libraries libxml2 and libxslt without sacrificing speed. It's just a wrapper around the C libraries, and it's really fast. It's not pure Python, as it's a binding for the C libraries. If you have no pure-Python requirement, use lxml.
When Google App Engine started, they didn't support lxml in the beginning. They do now, but in the beginning they supported other libraries, not lxml. lxml works with all Python versions from 2.4 to 3.3, or should I say 3.4, because that's the latest version right now. Then we have regular expressions, the re library.
It's part of the standard library. It's the regex library for Python. It should be used only to extract a minute amount of text; parsing an entire HTML document is not possible with regular expressions. As for its unpopularity, it first requires you to learn its symbols, which are really difficult, like dot, asterisk, dollar sign, caret, backslash b, backslash w, backslash s,
backslash d, and it can become quite complex. You have to combine all those symbols, and then you have a regex pattern to extract data from the document. However, it is baked into Python, by which I mean it's part of the standard library.
It's very fast, as I will show later, and it supports every Python version from 2.4 to 3.4. Now, a comparison of BeautifulSoup, re, and lxml. I've written a simple test to measure the parsing speed of the three libraries. After the test, we find that BeautifulSoup took 1851 milliseconds, lxml took 232 milliseconds,
and regex took 7 milliseconds just to parse the title from an HTML document. So we can conclude that lxml took 32x more time than re, and BeautifulSoup took 245x more time than re. So if you only want to extract a minute amount of information, you should go with regular expressions.
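The test itself is not shown in the talk; a minimal sketch of how such a timing comparison could be reproduced (the saved page, the statements and the repeat count are assumptions, not the speaker's actual benchmark):

    import timeit

    setup = (
        "import re\n"
        "import lxml.html\n"
        "from bs4 import BeautifulSoup\n"
        "html = open('page.html').read()  # any HTML page saved to disk\n"
    )

    print(timeit.timeit("BeautifulSoup(html).title.text", setup=setup, number=100))
    print(timeit.timeit("lxml.html.fromstring(html).xpath('//title/text()')", setup=setup, number=100))
    print(timeit.timeit("re.findall(r'<title>(.*?)</title>', html)", setup=setup, number=100))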
So what do you do when your scraping needs are high? You want to scrape millions of web pages every day, like Google. You want to make a broad-scale web scraper. You want to use something that is thoroughly tested. So is there any solution? We have two solutions: you can deploy your own custom-made scraper, or you can use a framework like Scrapy.
I'll focus more on Scrapy. It's a full-blown, thoroughly tested framework, and it's really fast. It's asynchronous, so you can make a lot of requests in parallel. It's easy to use. It has everything you need to start scraping, from the HTTP libraries to the parsing libraries to the storing libraries.
And it's made in Python. So how does Scrapy compare to BeautifulSoup or lxml? BeautifulSoup and lxml are libraries for parsing. Scrapy is an application framework for writing web scrapers that crawl websites and extract data from them. In other words, comparing BeautifulSoup or lxml to Scrapy is like comparing
Jinja2 to Django. I hope all of you know about Jinja2 and Django; that's what the Scrapy docs say. But the major negative point about Scrapy is that it only supports Python 2.7, not Python 3.x. The main reason is that it is based on the Twisted networking library. They're already working on getting 3.x support into Twisted.
So when Twisted gets 3.x support, Scrapy is on the way. So, when to use Scrapy? When you have to scrape millions of pages, when you want asynchronous support out of the box, when you don't want to reinvent the wheel, and when you're not afraid to learn something new. There's a beautiful quote I ran across recently: if you're not willing to risk the unusual,
you will have to settle for the ordinary. That's by Jim Rohn. So, starting out with Scrapy. The workflow in Scrapy is very simple. First of all, you define a scraper. You define the items you're going to extract from the HTML document. You define the items pipeline; it is optional, it is just for storing the data. And finally you run the scraper. I'll just demonstrate the basic building blocks of Scrapy
because I don't have enough time to write a full scraper. And in Scrapy, a scraper is called a spider. So if you see the term spider, don't worry. It's the same. So using the Scrapy command line tool. Scrapy provides a handy command line tool which you can use to generate a basic skeleton of a scraper.
Just run scrapy startproject and then the project name, here tutorial. You'll get the following directory structure: the configuration file, and the package with items.py, pipelines.py, settings.py and then the spiders folder. A project can have multiple spiders.
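For reference, the command and the skeleton it generates (the project name tutorial follows the Scrapy tutorial):

    $ scrapy startproject tutorial

    tutorial/
        scrapy.cfg          # the configuration file
        tutorial/           # the project's Python package
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/        # a project can hold several spiders
                __init__.py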
So what is an item? Items are containers that will be loaded with the scraped data. They work like simple Python dictionaries but provide additional protection against populating undeclared fields, to prevent typos. So you know which data you are going to store and which you are not. Declaring an item class is really simple: just import scrapy.
Then define a class. Here it's DmozItem; it's taken from the Scrapy tutorial. I've just defined title, link and description fields. scrapy.Field is really simple, and if you want to classify the fields further, you can pass arguments to scrapy.Field.
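The item class being described, roughly as it appears in the Scrapy tutorial:

    import scrapy

    class DmozItem(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()
        desc = scrapy.Field()  # the description field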
So, extracting the data. If you want to test your XPath expressions, you can simply use the handy shell tool that Scrapy ships with. You can simply type scrapy shell and then the URL on which you want to test your XPath expressions, and Scrapy will open a session for you. Scrapy provides XPath, CSS selectors and regexes
to extract data from an HTML document. Extracting the title using XPath is really simple: sel.xpath with the XPath expression in brackets, then you simply call extract, and that's it. That's how you extract data using Scrapy. So, writing the first scraper. A spider is a class written by the user
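Roughly, a shell session of the kind described (the URL is the one from the old dmoz-based Scrapy tutorial; the exact XPath and output depend on the page):

    $ scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

    >>> sel.xpath('//title/text()').extract()
    [u'Open Directory - Computers: Programming: Languages: Python: Books']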
to scrape data from a website. Writing a scraper is easy; just follow these steps. First of all, you have to subclass scrapy.Spider. Define the start_urls list, the list of URLs from which your spider will start crawling. Then you have to define the parse method in your spider; that is where you parse the data and decide how you want to store it.
So here's the full spider. First of all, we have to write the name of the spider in the class; the name is required to run the spider later on. Then the allowed domains, so that your scraper does not deviate from the required domain. Then the start URLs, so that the spider knows where to start scraping.
Then there is the parse method. Here we are looping over the response, putting the values into the fields, and then yielding the items, and that's it. So, unleash the Scrapy powers. We have just defined a scraper. We can just type scrapy crawl dmoz after getting into the project folder.
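The full spider just described, roughly following the old dmoz tutorial (the name dmoz is what the crawl command above refers to, and the fields match the DmozItem shown earlier):

    import scrapy
    from tutorial.items import DmozItem

    class DmozSpider(scrapy.Spider):
        name = "dmoz"                    # used to run the spider: scrapy crawl dmoz
        allowed_domains = ["dmoz.org"]   # keeps the crawl from wandering off-site
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        ]

        def parse(self, response):
            # loop over the listing entries and fill one item per entry
            for sel in response.xpath('//ul/li'):
                item = DmozItem()
                item['title'] = sel.xpath('a/text()').extract()
                item['link'] = sel.xpath('a/@href').extract()
                item['desc'] = sel.xpath('text()').extract()
                yield item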
So, storing the scraped data. Here we have two choices. First, we can use feed exports. It's really simple: it stores the data based on the fields which are defined. Secondly, we have the items pipeline. It allows you to customize the way your scraped data is stored.
Using feed exports, we can simply do scrapy crawl dmoz and add -o, which stands for output, and then the file where we want to store the scraped data. Item pipelines are a separate topic and will be covered another time. If you want to read up on that, just open the Scrapy docs.
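Spelled out, the feed-export invocation looks like this (dmoz is the spider name from above, items.json is just one possible output file):

    $ scrapy crawl dmoz -o items.json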
They have really good information there. When not to use Scrapy. There are certain points you have to keep in mind. If you just want to make a throwaway script, don't use Scrapy. If you want to crawl a small number of pages, you don't need Scrapy, because it is really useful only when you want to scrape a lot of pages.
If you want to make something simple, don't use Scrapy. If you want to reinvent the wheel, learn the basics and make a project to compete with Scrapy, go for it. So, what should you use? If you want to make a script that does not have to extract a lot of information, and you're not afraid of learning something new,
then use regular expressions. You should use them only if you want to extract a minute amount of information from a web page. If you want to extract a lot of data and do not have a pure-Python library requirement, then use lxml. It's really fast. If you want to extract information from broken markup, then you have to settle for BeautifulSoup.
And if you want to scrape a lot of pages and want to use a mature scraping framework, then use Scrapy. So what do I prefer? Seriously speaking, I prefer regular expressions and Scrapy. I started web scraping with BeautifulSoup, as it was the easiest and all the Stack Overflow questions had BeautifulSoup as the preferred solution.
Then I started using lxml and soon found BeautifulSoup really slow. I already showed you the test; it took a lot of time compared to lxml. Then I used regular expressions for some time and fell in love with them for their speed. And now I use Scrapy only to make large scrapers or when I need to get a lot of data.
Once I used Scrapy to scrape 69,000 torrent links from a website. So now let's talk about youtube-dl. It's a program I developed, and it also uses web scraping on the backend. It's a Python script that allows you to download videos and music from various websites like Facebook, YouTube, Vimeo, Dailymotion, Metacafe and almost 300 more websites.
So, that was it. I hope you learned something about web scraping from this talk. It was my first conference, so forgive me for any mistakes. If you want to talk to me, just meet me outside, and if you want to ask something, then don't hesitate; I will now try to answer questions.
Thank you very much for a very fast talk. We have plenty of time for questions, so please go ahead. One challenging thing in using Scrapy is that,
let's say there is a change in the HTML or DOM structure of a website. Is there any exception handling we can use to detect a change in the DOM structure of the website, and how do we fall back or handle such a case?
So with any scraping library you will face this: if the markup changes, your scraper will break and it will not store any data. You will have to change your scraper based on the changed layout of the website. There is usually no way around it, so you'll have to modify your scraper a bit.
What if the site blocks your IP address? So, if a site blocks our IP address, there are some workarounds we can use.
First of all, we can use IP rotation, or we can use the Tor network. There are some professional services which allow you to scrape through their servers; they have IP rotation. You can buy a lot of IPs and then rotate them. That's the only way you can bypass that.
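A minimal sketch of the rotation idea with the requests library (the proxy addresses are placeholders; in practice they would come from Tor or a commercial proxy pool):

    import itertools
    import requests

    # placeholder proxy addresses
    proxies = itertools.cycle([
        "http://10.0.0.1:8080",
        "http://10.0.0.2:8080",
        "http://10.0.0.3:8080",
    ])

    def fetch(url):
        proxy = next(proxies)  # rotate to the next proxy on every request
        return requests.get(url, proxies={"http": proxy, "https": proxy}).text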
Yeah? Scrapinghub? Yeah, I've been there. I've used it. Yeah, it's good. I like your mascot a lot, your logo. Somewhat related question.
Does Scrapy have any kind of rate limiting support? For example, I don't want to DDoS a site and I don't care much about latency, so I want to rate limit my scraping to one page per second. Yeah, you can do that. There's an option in the configuration file. You can limit how many web pages you want to open in parallel.
You can open two pages at a time, or just one. You can also set an option so that your scraper will wait before opening another page. If you want to wait, say, two minutes before opening the next page because you don't want to put a lot of load on the server, you can use those settings.
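The settings being referred to live in the project's settings.py; a sketch with example values (the numbers are illustrative only):

    # settings.py - throttle the crawl so the target server is not overloaded
    CONCURRENT_REQUESTS = 1   # open only one page at a time
    DOWNLOAD_DELAY = 2        # wait two seconds between requests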
The only negative point of Scrapy is that it currently doesn't support 3.4. Everyone is rushing towards Python 3, but Scrapy still doesn't support it. The Twisted networking library already has about 60% support for Python 3, and they are going to reach that milestone in a couple of months.
So I hope Scrapy will be there. Any more questions? Yeah, one question: how do you deal with pages that are purely Ajax-based, or that render the page with JavaScript?
What is your workaround? Because Scrapy uses the DOM and XPath, right? So what is your suggestion there? The way I work around that problem is that you can simply use the Chrome inspector and inspect the Ajax calls. You can copy those Ajax calls. Usually there's an API, and you can work out the pattern of the API URLs
and then pass those URLs to Scrapy, and it will use those API URLs, because those API URLs return their data in HTML form.
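A sketch of that workaround: find the endpoint with the browser's network inspector, then request it directly (the endpoint URL and its page parameter are invented for illustration):

    import requests

    # hypothetical endpoint discovered by inspecting the page's Ajax calls
    api_url = "http://example.com/api/items?page={page}"

    for page in range(1, 4):
        data = requests.get(api_url.format(page=page)).text
        # parse 'data' (often JSON, sometimes an HTML fragment) as usual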
Any more questions? No? Then thank you very much again, and thank you very much for all your time.