We're sorry but this page doesn't work properly without JavaScript enabled. Please enable it to continue.
Feedback

Web Scraping in Python 101

Formal Metadata

Title
Web Scraping in Python 101
Title of Series
Part Number
103
Number of Parts
119
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Production PlaceBerlin

Content Metadata

Subject Area
Genre
Abstract
M.Yasoob Khalid - Web Scraping in Python 101 This talk is about web scraping in Python, why web scraping is useful and what Python libraries are available to help you. I will also look into proprietary alternatives and will discuss how they work and why they are not useful. I will show you different libraries used in web scraping and some example code so that you can choose your own personal favourite. I will also tell why writing your own scrapper in scrapy allows you to have more control over the scraping process. ----- Who am I ? ========= * a programmer * a high school student * a blogger * Pythonista * and tea lover - Creator of freepythontips.wordpress.com - I made soundcloud-dl.appspot.com - I am a main contributor of youtube-dl. - I teach programming at my school to my friends. - It's my first programming related conference. - The life of a python programmer in Pakistan What this talk is about ? ================== - What is Web Scraping and its usefulness - Which libraries are available for the job - Open Source vs proprietary alternatives - Whaich library is best for which job - When and when not to use Scrapy What is Web Scraping ? ================== Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. - Wikipedia ###In simple words : It is a method to extract data from a website that does not have an API or we want to extract a LOT of data which we can not do through an API due to rate limiting. We can extract any data through web scraping which we can see while browsing the web. Usage of web scraping in real life. ============================ - to extract product information - to extract job postings and internships - extract offers and discounts from deal-of-the-day websites - Crawl forums and social websites - Extract data to make a search engine - Gathering weather data etc Advantages of Web scraping over using an API ======================== - Web Scraping is not rate limited - Anonymously access the website and gather data - Some websites do not have an API - Some data is not accessible through an API etc Which libraries are available for the job ? ================================ There are numerous libraries available for web scraping in python. Each library has its own weaknesses and plus points. Some of the most widely known libraries used for web scraping are: - BeautifulSoup - html5lib - lxml - re ( not really for web scraping, I will explain later ) - scrapy ( a complete framework ) A comparison between these libraries ============================== - speed - ease of use - what do i prefer - which library is best for which purpose Proprietary alternatives ================== - a list of proprietary scrapers - their price - are they really useful for you ? Working of proprietary alternatives =========================== - how they work (render javascript) - why they are not suitable for you - how custom scrapers beat proprietary alternatives Scrapy ======= - what is it - why is it useful - asynchronous support - an example scraper
Keywords