Anthon van der Neut - Beyond scraping
Scraping static websites can be done with `urllib2` from the standard
library, or with some slightly more sophisticated packages like
`requests`.
However, as soon as JavaScript comes into play on the website you want
to download information from, for things like logging in via OpenID or
constructing the page's content, you almost always have to fall back to
driving a real browser.
For websites with variable content this can be a time-consuming and
cumbersome process.
This talk shows how to create a simple, evolving client-server
architecture combining zeromq, selenium and beautifulsoup, which
allows you to scrape data from sites like Sporcle, StackOverflow and
KhanAcademy. Once the page analysis has been implemented, regular
"downloads" can easily be deployed without cluttering your desktop,
on your headless server and/or anonymously.
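As a rough illustration of the server half of such an architecture, the sketch below keeps one Selenium-driven browser alive and answers requests over a zeromq REP socket. It is not the code from the talk: the JSON command names (`get`, `source`, `quit`), the `handle` and `serve` functions, the choice of Firefox, and the endpoint `tcp://127.0.0.1:5555` are all assumptions made for this example.

```python
def handle(driver, request):
    """Translate one request dict into a browser action and a reply dict.

    The command vocabulary is hypothetical; `driver` is anything with the
    Selenium WebDriver attributes used below.
    """
    cmd = request.get("cmd")
    if cmd == "get":
        driver.get(request["url"])  # navigate the long-lived browser
        return {"ok": True, "url": driver.current_url}
    if cmd == "source":
        # Hand back the DOM as rendered after JavaScript has run.
        return {"ok": True, "url": driver.current_url,
                "html": driver.page_source}
    return {"ok": False, "error": "unknown command: %r" % (cmd,)}


def serve(endpoint="tcp://127.0.0.1:5555"):
    """Run the browser server until a {"cmd": "quit"} request arrives."""
    # Third-party imports live here so `handle` can be exercised without them.
    import zmq
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumed browser; any WebDriver would do
    context = zmq.Context()
    socket = context.socket(zmq.REP)
    socket.bind(endpoint)
    try:
        while True:
            request = socket.recv_json()
            if request.get("cmd") == "quit":
                socket.send_json({"ok": True})
                break
            try:
                reply = handle(driver, request)
            except Exception as exc:
                # A REP socket must answer every request, even on failure.
                reply = {"ok": False, "error": str(exc)}
            socket.send_json(reply)
    finally:
        driver.quit()
        socket.close()
        context.term()
```

Calling `serve()` in its own terminal (or on a headless machine with a virtual display) leaves the browser, and whatever login state it has accumulated, running independently of the program that analyses the pages.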
The described client-server setup allows you to restart your changed
analysis program without having to redo all the previous steps of
logging in and stepping through instructions to get back to the page
where you got "stuck" earlier on. This often decreases the time
between entering a possible fix in your HTML analysis code and testing
it from a few tens of seconds, when you have to restart a browser,
to less than a second.
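The quick edit-and-retry loop could look roughly like the client sketch below. It assumes a separate, long-running process that owns the browser and answers JSON requests on a zeromq REP socket; the commands `get` and `source`, the reply fields `ok`, `html` and `error`, the endpoint, and the function names are all invented for illustration and are not taken from the talk.

```python
def connect(endpoint="tcp://127.0.0.1:5555"):
    """Open a REQ socket to the (assumed) browser-holding process."""
    import zmq  # imported lazily so the helpers below work without it

    socket = zmq.Context.instance().socket(zmq.REQ)
    socket.connect(endpoint)
    return socket


def request_page(socket, url=None):
    """Return the HTML of the page the remote browser is currently showing.

    If `url` is given, the browser is asked to navigate there first;
    otherwise the page it is already on (e.g. after a login) is reused.
    """
    if url is not None:
        socket.send_json({"cmd": "get", "url": url})
        reply = socket.recv_json()
        if not reply.get("ok"):
            raise RuntimeError(reply.get("error", "navigation failed"))
    socket.send_json({"cmd": "source"})
    reply = socket.recv_json()
    if not reply.get("ok"):
        raise RuntimeError(reply.get("error", "could not fetch page source"))
    return reply["html"]


def extract_links(html):
    """Example analysis step: list (text, href) pairs for every link."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", href=True)]
```

Only this client is restarted after each change to the analysis code, for instance by rerunning `extract_links(request_page(connect()))`. The browser process keeps its session, so a fix is tested against the very page where the previous attempt got stuck.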
Using such a setup, you have time to focus on writing robust code
instead of code that breaks with every little change the site's
designers make.