
Beyond scraping

Transcript
Let's go on with the standard sessions; our next talk is going to be about scraping. Please welcome the speaker.

OK, thank you for the introduction. "Beyond scraping" is the main title, and what "beyond" means of course depends on which side you are coming from; I am coming from the past. If you look at scraping twenty years ago, it was very easy: the data that was delivered, that was built up for the user, could easily be retrieved in an automated fashion. Nowadays that is not possible anymore. JavaScript is used to make the experience much nicer for the end user, and the data is presented for the end user, but not necessarily in a way that makes automated downloading easy; it can be very hard to get at something.

Before I start properly, I would like to see some hands. Who has used urllib from the standard library? Who has used requests? Who has used Beautiful Soup? Who has used lxml, most likely fewer. Who has used selenium? Who has used ZeroMQ, ah, it gets interesting. And who has used pyvirtualdisplay, still some people. That is all the exercise you get, unless you want to leave early of course. The talk is not very technical, you will not see much actual code, but these are the buzzwords: if you glue all of this together in the proper way, with the right idea behind it, you will be able to scrape current websites, I would say 99 per cent of them, without too much trouble.
Some background, because many people here do not know me. By education I am a computational linguist, although I never did much with that. While I was writing my thesis I was doing 3D and 2D computer graphics, and there I actually missed an opportunity: in '93 one of the students from the University of Amsterdam who started working for me introduced me to Python. We already had a C program with two interpreted languages hanging off it, and we did not add a third one to that program, but I liked Python, and I actually liked it because of the indentation. A lot of people do not understand that when they first look at it, but I came from using transputers and Occam, which also uses indentation, so for me that was familiar. I did some small things with Python and in 1998 I finally got the opportunity to do something commercial, with Python 1.5.2 on Windows and a graphical user interface. Some people might know me from the C implementation of ordereddict, a very complete ordered dictionary, much more complete than the one in the standard library, re-implemented in C back in 2007; that was my first experience with making Python packages. More recently I picked up YAML support, which seemed to be kind of dead, and made it YAML 1.2 compatible with round-trip preservation. I started on that because I find it kind of strange to have a human-readable data format that throws away the comments when you read it in and write it back out; now it is round-trip preserving, there are all kinds of extra things, and those are available from PyPI as you would expect.

So, scraping: what is the actual problem? You want to download information from all kinds of websites, but sometimes you also want to set some state: you interact with the website and change its state rather than just downloading data. You may already know what is there, but you want to increase your score somewhere, or you want to make sure that somebody knows you visited, although you are actually on holiday lying on the beach and did not even start your browser.

Before we go into that, let's briefly look at HTML pages, so you know what terminology I use. A page is mostly a tree structure of tags; tags can have attributes, and tags can have data. If you look at a small example HTML file, the tree structure is shown by the indentation; if you use the debugger within your browser it indents things for you, so you can actually see the structure. You do not have to write HTML like that, you can write it all behind each other, but then it is difficult to see what the structure is. If you look at the 'a' tag in the example, it has three attributes, href, id and class, and it has some data. Depending on what kind of library you use to go through the HTML, you can also say that the data of an element is everything inside it, its body. It sometimes helps to treat several things together, especially with inline elements like italics: you often do not want the extraction to stop at the nested element, you want all the intermediate text of the children put together as the data. So such a page maps a URL to some data, and that mapping is often unique, but it does not have to be: you might get something different back for the same URL.
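To make that terminology concrete, here is a minimal Beautiful Soup session; the HTML is a stand-in I made up, since the example file from the slides is not in the transcript.

```python
# An invented stand-in for the small example page from the slides:
# one <a> tag with three attributes (href, id, class) and some data.
from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <div class="important">
      <a href="https://example.com/report.pdf" id="dl-link" class="download">
        the <i>latest</i> report
      </a>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
a = soup.find("a")
print(a.name)                        # tag name: 'a'
print(a.attrs)                       # the three attributes: href, id, class
print(a.get_text(" ", strip=True))   # data including the children's text: 'the latest report'
```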
We will come back to that later. The old-fashioned way of changing data is a form: you have some form data, you submit the form, and depending on how you filled out the form you get a different result back, although you are going to the same URL. It also happens that there is some state involved, and that state can influence what kind of data you get back for a given URL. And nowadays, with a lot of JavaScript in play, you have websites with only one URL, which never changes, but all the time you get different data, depending on the state and on what you click on that single page.

A brief interlude on different ways of developing software; I just want to touch on this so you understand why I did things the way I did. You can use a complete framework that covers everything you want to do: you learn it, and then you implement the little part that you want to do within that framework, using configuration or by writing some code, depending on the framework. There are frameworks like that for back-end development, and there are also framework-like tools that you can use for scraping. The other way is going from the bottom up, using existing building blocks and gluing them together with your own code.
If you develop, like I do, for a customer who is interested in getting results, a framework is not necessarily the best way to go. If the framework does exactly what you need and you do not have to change the framework itself, you may be fine with it. But if you need to dive into the framework and change the small percentage of its code that you actually use, you first have to find the piece to change and then make the change, and the biggest problem is that after the code has been running for a year without you looking at it, you have completely forgotten how the framework works, so you have a big problem updating your own code and understanding how it fits in. If you glue building blocks together, and the blocks essentially do what you want, you only have to look at your own glue; that is code you wrote yourself in the first place, so after a year you are much more likely to understand what you did, and you might even conclude that if you had to start from scratch you would do it the same way. So I am going to show something that glues together the building blocks I listed earlier.

On one end you have the simple websites: those you can actually access by using urllib2 and requests. Sometimes you have to submit form data to get access to the data you need, and requests in particular helps you do that. If you have been getting your data with urllib2 and you have not used requests yet, I recommend you look at it. These libraries do some basic things for you, like following redirects; things like handing over cookies are a bit more involved. And if there is some JavaScript on the site, things get harder, because you have to look at what the script does and work out how to do that by hand: the scripts get data with URL requests of their own and then insert it into the page.

Cookies are used to keep state, and I specifically mention them because they are often used to preserve your authentication information. Data that is valuable to get at might not be available for free, so it is not the case that you request some URL and get the data; you have to log in first, and only then can you proceed to getting the data. There is still some built-in authentication in the browser itself, with a very coarse pop-up window where you fill in a username and password, but more often there is a form you have to fill out on the page; the information from that form goes to the back end, which creates some cookie, and that cookie is then used at each subsequent step. Over the last seven years or so OpenID has come up, which allows the developer of a site to concentrate on getting the information across that they want, without having to write too much of the login code themselves. It also has the advantage that the authentication is delegated to, say, Google, and that, if necessary, the person who logged in can be traced more physically: nowadays, if you set up a new account, Google will ask you for a telephone number to which it can send a code, and in France and Germany right now it is not possible to get a telephone number without showing a passport. So some extra checks are being done; that may be for convenience, but it may also be that people want to know that you are a real person, with a real telephone number associated with the account that accesses the site.
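A small sketch of that "simple site" case, using a requests Session so that the cookie set by the login form is sent back on later calls; the URL and the form field names are placeholders, not from the talk.

```python
# Submit a login form and let the Session carry the authentication cookie.
import requests

with requests.Session() as s:
    # requests follows redirects for us; the Session stores any cookie
    # the back end sets after a successful login.
    s.post("https://example.com/login",
           data={"username": "scraper", "password": "secret"})
    # Later requests in the same Session send those cookies back,
    # so this page is fetched as a logged-in user.
    r = s.get("https://example.com/data?since=2016-07-01")
    print(r.status_code, len(r.text))
```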
If a site uses JavaScript, then urllib2 and requests are of little use. When JavaScript first came up I did the parsing of what a script does by hand, but you have to read through the script, and it is often difficult to trace what it actually does. If you compare what you get with urllib2 to what you see in your browser, it is usually different, unless of course you switch off JavaScript in your browser; that comparison often gives a first indication of whether I can easily scrape a website or whether I have to use more advanced tools to get to the data. JavaScript, as you probably all know, updates parts of the tree, typically by requesting additional data from the back end. Why do developers do that? Primarily because it gives a nice user experience: if you do not have to reload the whole page, updates are quicker, which makes for a nicer experience, and it reduces the bandwidth used. From a scraping perspective JavaScript has several downsides; the first is that you no longer get easy access to the data of the website.
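A rough version of that first check: fetch the page with plain requests and test whether an element you can see in your browser is present in the raw HTML. The URL and the selector are made up for illustration.

```python
# If the element is missing from the static HTML, the content is most likely
# filled in by JavaScript and urllib2/requests alone will not be enough.
import requests
from bs4 import BeautifulSoup

r = requests.get("https://example.com/results")
soup = BeautifulSoup(r.text, "html.parser")

if soup.select("table#results tr"):
    print("data is in the static HTML, requests is probably enough")
else:
    print("data not found, probably rendered by JavaScript, use a browser")
```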
An even bigger problem is that with JavaScript you essentially do not know when the page is finished. If you do a urllib2 or requests call, when the page comes back you know you have all the data. If you have a page with JavaScript, you have to wait until it is done processing, but it might never be done processing: it might wait in a loop, or it might keep a channel open for additional data to come from the back end; you never know.

So if you can see something in your browser, you can probably use selenium: start a browser, talk to the browser from Python, and get your material. You use selenium as if you were using the mouse: you drive to the pages, you click things if necessary, you fill things out. Selenium was originally used for testing, and that use is comparatively easy, because when you test something you made the page yourself: you just have to check that the page is what you expect it to be, you already know the structure, you know which ids and classes you used, you know how to get to the particular elements in the HTML. The advantage of using selenium is that there is never a discrepancy between what you see in the browser selenium has opened and what a normal user will see, because you are actually using a browser; in principle you can get at anything a normal user can get at. Another nice thing is that the browser stays open: if your program is not proceeding, because it sleeps in a loop or is waiting for something, you can just start the debugger and see what the page looks like, in Firefox or whatever works for you. The important thing is that the program has to keep running: as soon as the program stops, selenium shuts down and closes your browser, and then you can no longer see what went wrong. Because if something goes wrong, say you try to access an element that is not there in the tree, your program might crash, depending of course on how you wrote it, and any useful information you could still have gotten from the browser is gone; you have to start up the browser again, go back to the page, look at the structure, check what you expected and whether the element has changed, and try again.

So with selenium you can do a superset of what urllib2 and requests can do, because all the JavaScript is handled correctly. There are two main differences. One is that you open a browser: with urllib2 or requests you do not open a browser, so you can run them easily from a job on a server without any problems; that is not possible with selenium without doing some extra work, because selenium opens a browser and the browser needs a display to run in.
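A minimal sketch of driving a browser with selenium in the way just described, using the selenium 2/3-style API that was current at the time of the talk (newer versions spell element look-ups as find_element(By.ID, ...)); site, ids and credentials are placeholders.

```python
# Drive a real browser from Python as if you were using the mouse.
from selenium import webdriver

driver = webdriver.Firefox()          # opens a real browser window
try:
    driver.get("https://example.com")
    driver.find_element_by_id("login-button").click()
    driver.find_element_by_name("username").send_keys("scraper")
    driver.find_element_by_name("password").send_keys("secret")
    driver.find_element_by_id("submit").click()
    print(driver.current_url)         # check where the redirects ended up
finally:
    driver.quit()                     # stopping selenium also closes the browser
```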
Let's look at some more of the problems this introduces. You are never sure when the data is there: the page loads, JavaScript starts; JavaScript itself has a special hook to wait until the complete page is loaded before it starts executing, but you have no clue when it stops executing. Sometimes you just wait for five seconds, because you know that in a normal situation things will be there by then, but it is much safer to check whether the particular piece of data you are interested in has actually been loaded. If you have a table of elements, a few rows might already be loaded and you do not know how many more are going to come, whether it is done loading or not.

The second issue is that the web page is a structure, and there are different ways of getting to the particular piece of data you want to extract: the data of an element, or some attribute value, a URL to a PDF file or to something else. You can get at it by id; ids should be unique, although I have seen several pages, especially ones generated with Microsoft CMS systems, where the same id is used multiple times on the same page. At that point I decided not to use the id, because I do not know whether the browser and Beautiful Soup will agree: the browser might take the first occurrence and Beautiful Soup the second, so better not to rely on it. Depending on how the website is structured you can search by class: if something is coloured in a specific way and the colouring is done with a specific class, you can get at that one item, but it is not always the case that a class is not reused in several positions in the document. You can also walk the tree programmatically, starting at the top with html, then body, and so on down, but that is not particularly fast. Then there is XPath; if you have not used it yourself, it is more or less like a regular expression for getting to a particular piece of data based on tag names and some attributes. XPath is not very complicated, but if you do not use it on a daily basis it is kind of hard to remember how to do things. There is a better, reusable option that I tend to use, and that is CSS select. It is not as powerful as XPath, but it is powerful enough for all of my purposes. For instance, the selector shown here says: take any 'a' element whose href starts with a particular string (making sure it really starts with it, not merely contains it), and the 'a' has to be inside an element that has "important" as its class; you can express all kinds of rules like this. Selenium may not support every selector, but Beautiful Soup does support this, and CSS select allows you to get to particular elements: once you are pointed at the 'a' element you can get the full URL or the data, whichever you are interested in. CSS select has my preference over XPath also because I can use it when I make a website: the same selectors are used in the CSS files that determine the look and feel of the site. But like I said, there are restrictions you have to be aware of, because selenium and Beautiful Soup do not implement CSS selectors as completely as your browser does.
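The kind of CSS selector discussed here, written out with Beautiful Soup's select(); the selector on the slide is not in the transcript, so this one is an invented equivalent: any 'a' whose href starts with a given string, inside an element with class "important".

```python
# CSS select with Beautiful Soup: attribute prefix match plus class ancestor.
from bs4 import BeautifulSoup

html = '''<div class="important">
  <a href="https://example.com/doc/1.pdf">one</a>
</div>
<div><a href="https://example.com/doc/2.pdf">two</a></div>'''

soup = BeautifulSoup(html, "html.parser")
for a in soup.select('.important a[href^="https://example.com/"]'):
    print(a["href"], a.get_text())    # only the first link matches
```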
So what does a typical selenium session look like, before we go into how to do it differently? You open a browser and go to the site; you click the log-in button, which means you have to authenticate; you wait until the redirect to the OpenID provider's site has happened and you provide your credentials. How to provide credentials automatically is a whole subject in itself, because you do not want everybody to be able to read your login name and password. One of the simpler approaches, if you are running Linux, is to put them in a subdirectory of the .ssh directory, which already has the restriction that it is accessible only by the owner of the files. Then you wait until you get back to the requested page, after the OpenID system has notified the site that everything is OK. Then you fill out the search criteria to restrict the results to new data that has been added since the last time you checked. Then you might get a table, a list of items; you click one of the references in the table, you arrive at the final page, and you get the data from there, extracted from the HTML, or you find a link, and that link might point to some file, a media file or a PDF.

The main problem with this is that it is very time consuming. Every time you log in you have to wait, and we are talking about seconds. If your program does not yet know exactly how to analyse the structure of that last page, the one where you actually retrieve the file or the textual data, you have to restart your program and it has to log in all over again. So we are talking about tens of seconds, if not minutes, before you get to where you want to be, and if your client is waiting for results, they are likely to conclude that it is not working anymore.
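One possible shape of that credentials handling, assuming a file under ~/.ssh; the file name and its user:password format are my own choice for this sketch, not the speaker's.

```python
# Read credentials from a permission-restricted file and refuse to use it
# if group or others can read it.
import os
import stat

def read_credentials(path=os.path.expanduser("~/.ssh/scraper/credentials")):
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise RuntimeError("%s is accessible by group/others, refusing" % path)
    with open(path) as fp:
        user, password = fp.read().strip().split(":", 1)
    return user, password
```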
So how can we improve on that, so that you do not have to restart selenium every time? There are always several ways, but the way I solved it is by going to a client-server architecture, in which the server talks to selenium and my client can crash, or be restarted, and continue where it left off. The server keeps the selenium session open, and thereby keeps the browser open, even if the client goes away. To do that you need some protocol. It does not have to be very sophisticated: requests go from the client to the server, data comes from the server back to the client for analysis, and you need to know what state the program, that is, the website, is in, so that you can take the appropriate action, or rewrite your client program to decide what the appropriate action is. When I set this up a couple of years ago I first thought: I will write files with increasing numbers in their names, and the server will just watch the directory and pick things up from there. But then I looked at ZeroMQ, which lets you do this kind of thing pretty easily. Among other things it allows a many-to-one connection between many clients and one server, and it allows multiple transports, so your clients can come and go while the server keeps running. With ZeroMQ it is also trivial to put the server side on a different machine, by using port numbers and specifying on which machine things are running if they are not on localhost. ZeroMQ, not by default but easily, lets the exchanges be Unicode based, and that matters for getting data: you might not use special characters in your protocol itself, but on the websites you download you are certain at some point to run into non-ASCII characters, and you have to deal with those, so you might as well set up the whole thing to use Unicode.
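A minimal sketch of that ZeroMQ transport: many clients with REQ sockets talking to one server with a REP socket, exchanging Unicode strings. The command text here is a placeholder; the actual commands are discussed below.

```python
# Many-to-one transport: several REQ clients, one REP server, Unicode messages.
import zmq

def server(port=5555):
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REP)
    sock.bind("tcp://*:%d" % port)       # clients may live on other machines
    while True:
        cmd = sock.recv_string()          # Unicode in ...
        sock.send_string(u"ok: " + cmd)   # ... Unicode out

def client(port=5555):
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REQ)
    sock.connect("tcp://localhost:%d" % port)
    sock.send_string(u"goto win0 https://example.com")
    print(sock.recv_string())
```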
So if we look at the session we walked through before, getting to some data, with a client-server based solution things look slightly different. You open a browser, but only if it is not already open; you log in, but only if you are not logged in yet; if you are already past the OpenID site you do not have to go to the OpenID site, et cetera, et cetera: you do not redo things that are already done. You just pick up where you left off last time, and you only check which steps still need to be done.
It might well be that only the final step, the data exchange, is left: you skip all the initial steps, check that they are done, and directly get your data. The turnaround time for starting a client program goes down from tens of seconds or a minute to a fraction of a second, and you get your data without much waiting.

So if you define a protocol, what do you need? The protocol sends some command with some parameters and gets a result back. Which commands do we need, and with which parameters? There are only very few of them. First, you have to be able to open a window, and I use a specific window id for it, so that I can open multiple windows on the server side. If you do not do that, you have only one window to work with, and it becomes very difficult to do many-to-one or run multiple clients, because they would all be competing for the same window. Then, using that window id, you say: go to some URL, and the page shows up in the browser that selenium runs. The next thing you need is to select a specific item, based on an id you can reuse on a specific page and window, and then interact with that item: click on it, to get a radio button selected or to follow a specific link; or clear an input or text area that may already contain something and then type into it, for example to clear out the old, incorrect password and enter the new one. Then, very importantly: return the HTML starting at a particular id. You can of course fetch the complete HTML page, but that is inefficient; often you already know that you are only interested in one table, so you select that table via selenium and get just that table back. The other thing that is almost indispensable is: what is the current URL I am looking at? Because if you go to an OpenID page and click somewhere, you need to know that you got back to your original site before you continue working, so the client wants to be able to ask the server where it is. You can extend this protocol with whatever makes things more efficient; this is essentially where I stopped a year and a half ago, after adding a few more things.

It can be more efficient to do work on the client side than to push it to the server. Once you get HTML back you need to analyse it, and for that I use Beautiful Soup; it is faster than walking the tree through selenium and fetching the individual items, although of course that does not help if you actually have to click on the items, which still has to happen on the server side. As I already indicated, Beautiful Soup supports CSS select, so once you have the additional data back on the client you can handle it there.
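A sketch of what a thin client for such a protocol could look like; the command names and wire format are my own paraphrase of the commands listed (open a window, go to a URL, click, clear and type, return HTML starting at an id, report the current URL), not the actual implementation.

```python
# A hypothetical thin client speaking a simple text protocol over ZeroMQ.
import zmq

class BrowserClient(object):
    def __init__(self, addr="tcp://localhost:5555"):
        self._ctx = zmq.Context()
        self._sock = self._ctx.socket(zmq.REQ)
        self._sock.connect(addr)

    def _cmd(self, *parts):
        self._sock.send_string(u" ".join(parts))
        return self._sock.recv_string()

    def open_window(self, win):
        return self._cmd(u"open", win)

    def goto(self, win, url):
        return self._cmd(u"goto", win, url)

    def click(self, win, elem_id):
        return self._cmd(u"click", win, elem_id)

    def type_text(self, win, elem_id, text):
        self._cmd(u"clear", win, elem_id)      # e.g. clear an old password first
        return self._cmd(u"type", win, elem_id, text)

    def html(self, win, elem_id):
        return self._cmd(u"html", win, elem_id)

    def current_url(self, win):
        return self._cmd(u"url", win)

# Example usage (requires a running server):
# client = BrowserClient()
# client.open_window(u"w0")
# client.goto(u"w0", u"https://example.com")
# table_html = client.html(u"w0", u"results")   # only the part you care about
```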
That solves the first problem of the client-server architecture: the client can crash and you do not have to start from scratch. But the whole setup introduces a new problem: you need a desktop where the browser can actually start, and if you want to run this on a server, or you simply do not want a browser popping up while you are typing an email, that is a nuisance. The solution is pyvirtualdisplay: it creates a virtual display on which you can start the browser. You will not actually see that display, but for development and debugging purposes you can still get at it, for instance through a VNC session. What I normally do is develop without the virtual display back end, so I see the real browser; once things are running it is fine, and if my client gets stuck anyway I connect to the virtual display to see why the browser stopped, for whatever reason. Sometimes you get surprises, like a website that requires you to change your password every six months and you have not done so; of course the page then differs from what your program expected, because it was never prompted for that before. There are different ways of extending all of this. One thing I have already done is suppress advertisements in the browser used in the back end, which of course makes pages that use ads load much faster. Something that does not work with selenium, but that the client-server architecture is capable of, is using the Tor network, by starting Firefox with its own extension; that is slightly less powerful than selenium, but it works for most purposes.

About the availability of the software: like in the previous talk, the software is not yet available. I have to take out some things on the client side that were developed for particular customers and that you would recognise, so I need to get those out first; then it will be on GitHub as a browser client and a browser server package, and I will attach a note to the video with that information once it is all there. That is almost the end of my talk; I can take some questions now, and I can also give some real-world examples of what I use it for, but let's do the questions first.

Question: usually this kind of problem comes up with single-page applications, and those usually talk to an API, right? Answer: right, but if there is an API available you might just want to use that to get the data; I am looking at sites that are not designed for that and do not have an API to get you to the data.

Question: OK, so the main problem is that you need to be sure that the page is completely loaded? Answer: yes, and that is why you actually look at some specific element on the page, whether it is there yet or not. If you check immediately, you might not have the table at all, or the table is there but you do not know whether all rows have been loaded. There might be some indication that there are going to be, say, fifteen results, so if your table has ten items you know five results still have to arrive; sometimes you are simply waiting and hoping that everything arrives.

Question: selenium looks pretty complicated and complex, with a lot of machinery; isn't it easier to do something like sleeping and then checking the content? Answer: you still need selenium to get the content there in the first place. If you do not use selenium and go back to, say, requests, anything that gets loaded by JavaScript you will not get at all, because requests does not interpret it; these are just different ways of addressing the same thing.

Question: one problem we had when using selenium to access data was that pages sometimes have date pickers and other elements that do not let you type in data directly, and those are usually very complicated to automate. Have you run into these problems, and do you have ideas for dealing with them?
Answer: there are multiple things you can do; I have seen these problems. If I recall correctly, there are calls to write directly into such a field, but there are also cases where you click somewhere and then just send characters, and then you have to make sure the cursor is in the right position so that they end up where they should. The Khan Academy website, for instance, has that kind of problem. You can get around it, but it is not trivial; there are different ways of getting the data in, and you would have to see whether the protocol needs an option for which of the two to use.

Question: what have you run into when using selenium for this kind of thing? A lot of sites do not want to be scraped, so they use services, proxies, that try to detect scraping patterns, and when they find a match you get a really good captcha. Did you encounter this, and how do you deal with it? Answer: one of the reasons to go to the client-server architecture is exactly that. One of the most frequent things I have seen is that a site notices that you log in, say, seven times a day and starts wondering why that is. Another example: StackOverflow will actually restrict how often you can refresh, so if you want to churn through something like a thousand review tasks you would have to do special things. Whether a site actually looks at your patterns depends on the site, but especially if your program behaves like a normal person they can hardly kick you out. For me that means that for some sites the scraping I do for a client takes two hours, but it only has to be done once a day. A site does not want you to download all the PDF files that appeared that day within five seconds, but if you download one every two minutes, I can still deliver my client the, say, ten new PDF files at the end of the day. That is the way I do it: I just let my program behave as if it were human, and that tends to be acceptable. Or you set up a second account.

Moderator: there is time for one last question. Question: I just wanted to add that there are ways to reduce the overhead, for instance running headless with PhantomJS or with Chromium, so there is less to run and you do not get a browser window. Answer: the disadvantage is that you are then not using a real browser, so that might be detected, and the other thing is that if something goes wrong you have nothing to look at, you just have your HTML structure. The nice thing about using pyvirtualdisplay with selenium is that the browser you would normally be using is always there, in its current state.
If the site now, after six months, asks you to change your password, that is much easier to recognise: you see the prompt that you have to change your password, instead of having to deduce it from the HTML you get back, although that is also possible. There are simply multiple ways of addressing these things, and everything has its advantages and disadvantages. Moderator: OK, good, thank you.

Metadata

Formal Metadata

Title Beyond scraping
Series Title EuroPython 2016
Part 101
Number of Parts 169
Author Neut, Anthon van der
License CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You may use, change, and reproduce, distribute and make publicly available the work or its content, in unchanged or changed form, for any legal and non-commercial purpose, provided that you credit the author/rights holder in the manner they specify and that you pass on the work or this content, including in changed form, only under the terms of this license
DOI 10.5446/21108
Publisher EuroPython
Publication Year 2016
Language English

Content Metadata

Subject Computer Science
Abstract Anthon van der Neut - Beyond scraping. Scraping static websites can be done with `urllib2` from the standard library, or with slightly more sophisticated packages like `requests`. However, as soon as JavaScript comes into play on the website you want to download information from, for things like logging in via OpenID or constructing the page's content, you almost always have to fall back to driving a real browser. For websites with variable content this can be a time-consuming and cumbersome process. This talk shows how to create a simple, evolving client-server architecture combining zeromq, selenium and beautifulsoup, which allows you to scrape data from sites like Sporcle, StackOverflow and KhanAcademy. Once the page analysis has been implemented, regular "downloads" can easily be deployed without cluttering your desktop, on a headless server, and/or anonymously. The described client-server setup allows you to restart your changed analysis program without having to redo all the previous steps of logging in and stepping through instructions to get back to the page where you got "stuck" earlier on. This often decreases the time between entering a possible fix in your HTML analysis code and testing it to less than a second, down from a few tens of seconds in case you have to restart a browser. Using such a setup you have time to focus on writing robust code instead of code that breaks with every little change the site's designers make.
