Beyond scraping

EuroPython

Neut, Anthon van der

Formale Metadaten

Titel

Beyond scraping

Serientitel

EuroPython 2016

Teil

101

Anzahl der Teile

169

Autor

Neut, Anthon van der

Lizenz

CC-Namensnennung - keine kommerzielle Nutzung - Weitergabe unter gleichen Bedingungen 3.0 Unported:
Sie dürfen das Werk bzw. den Inhalt zu jedem legalen und nicht-kommerziellen Zweck nutzen, verändern und in unveränderter oder veränderter Form vervielfältigen, verbreiten und öffentlich zugänglich machen, sofern Sie den Namen des Autors/Rechteinhabers in der von ihm festgelegten Weise nennen und das Werk bzw. diesen Inhalt auch in veränderter Form nur unter den Bedingungen dieser Lizenz weitergeben

Identifikatoren

10.5446/21108 (DOI)

Herausgeber

EuroPython

Erscheinungsjahr

2016

Sprache

Englisch

Inhaltliche Metadaten

Fachgebiet

Informatik

Genre

Konferenz/Talk

Abstract

Anthon van der Neut - Beyond scraping Scraping static websites can be done with `urllib2` from the standard library, or with some slightly more sophisticated packages like `requests`. However as soon as JavaScript comes into play on the website you want to download information from, for things like logging in via openid or constructing the pages content, you almost always have to fall back to driving a real browser. For web sites with variable content this is can be time consuming and cumbersome process. This talk show how a to create a simple, evolving, client server architecture combining zeromq, selenium and beautifulsoup, which allows you to scrape data from sites like Sporcle, StackOverflow and KhanAcademy. Once the page analysis has been implemented regular "downloads" can easily be deployed without cluttering your desktop, your headless server and/or anonymously. The described client server setup allows you to restart your changed analysis program without having to redo all the previous steps of logging in and stepping through instructions to get back to the page where you got "stuck" earlier on. This often decreases the time between entering a possible fix in your HTML analysis code en testing it, down to less than a second from a few tens of seconds in case you have to restart a browser. Using such a setup you have time to focus on writing robust code instead of code that breaks with every little change the sites designers make.