
Beyond scraping

Transcript
Let's go on with the standard sessions; our next talk is going to be about scraping. Please welcome the speaker.

OK, thank you for the introduction. "Beyond scraping" is the main title, and what "beyond" means of course depends on which side you are coming from; I am coming from the past. If you look at scraping twenty years ago, it was very easy: the data that was delivered, that was built up for the user, could easily be retrieved in an automated fashion. Nowadays that is not possible anymore. JavaScript is used to make the experience much nicer for the end user, and the data is presented for the end user, but not necessarily in a way that makes automated downloading easy; it can be very hard to get at something.

Before I start properly, I would like to see some hands. Who has used urllib from the standard library? Who has used requests? Who has used Beautiful Soup? Who has used lxml, most likely fewer. Who has used selenium? Who has used ZeroMQ, ah, it gets interesting. And who has used pyvirtualdisplay, still some people. That is all the exercise you get, unless you want to leave early of course. The talk is not very technical, you will not see much actual code, but these are the buzzwords: if you glue all of this together in the proper way, with the right idea behind it, you will be able to scrape current websites, I would say 99 per cent of them, without too much trouble.
Some background, because many people here do not know me. By education I am a computational linguist, although I never did much with that. While I was writing my thesis I was doing 3D and 2D computer graphics, and there I actually missed an opportunity: in '93 one of the students from the University of Amsterdam who started working for me introduced me to Python. We already had a C program with two interpreted languages hanging off it, and we did not add a third one to that program, but I liked Python, and I actually liked it because of the indentation. A lot of people do not understand that when they first look at it, but I came from using transputers and Occam, which also uses indentation, so for me that was familiar. I did some small things with Python and in 1998 I finally got the opportunity to do something commercial, with Python 1.5.2 on Windows and a graphical user interface. Some people might know me from the C implementation of ordereddict, a very complete ordered dictionary, much more complete than the one in the standard library, re-implemented in C back in 2007; that was my first experience with making Python packages. More recently I picked up YAML support, which seemed to be kind of dead, and made it YAML 1.2 compatible with round-trip preservation. I started on that because I find it kind of strange to have a human-readable data format that throws away the comments when you read it in and write it back out; now it is round-trip preserving, there are all kinds of extra things, and those are available from PyPI as you would expect.

So, scraping: what is the actual problem? You want to download information from all kinds of websites, but sometimes you also want to set some state: you interact with the website and change its state rather than just downloading data. You may already know what is there, but you want to increase your score somewhere, or you want to make sure that somebody knows you visited, although you are actually on holiday lying on the beach and did not even start your browser.

Before we go into that, let's briefly look at HTML pages, so you know what terminology I use. A page is mostly a tree structure of tags; tags can have attributes, and tags can have data. If you look at a small example HTML file, the tree structure is shown by the indentation; if you use the debugger within your browser it indents things for you, so you can actually see the structure. You do not have to write HTML like that, you can write it all behind each other, but then it is difficult to see what the structure is. If you look at the 'a' tag in the example, it has three attributes, href, id and class, and it has some data. Depending on what kind of library you use to go through the HTML, you can also say that the data of an element is everything inside it, its body. It sometimes helps to treat several things together, especially with inline elements like italics: you often do not want the extraction to stop at the nested element, you want all the intermediate text of the children put together as the data. So such a page maps a URL to some data, and that mapping is often unique, but it does not have to be: you might get something different back for the same URL.
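To make that terminology concrete, here is a minimal Beautiful Soup session; the HTML is a stand-in I made up, since the example file from the slides is not in the transcript.

```python
# An invented stand-in for the small example page from the slides:
# one <a> tag with three attributes (href, id, class) and some data.
from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <div class="important">
      <a href="https://example.com/report.pdf" id="dl-link" class="download">
        the <i>latest</i> report
      </a>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
a = soup.find("a")
print(a.name)                        # tag name: 'a'
print(a.attrs)                       # the three attributes: href, id, class
print(a.get_text(" ", strip=True))   # data including the children's text: 'the latest report'
```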
We will come back to that later. The old-fashioned way of changing data is a form: you have some form data, you submit the form, and depending on how you filled out the form you get a different result back, although you are going to the same URL. It also happens that there is some state involved, and that state can influence what kind of data you get back for a given URL. And nowadays, with a lot of JavaScript in play, you have websites with only one URL, which never changes, but all the time you get different data, depending on the state and on what you click on that single page.

A brief interlude on different ways of developing software; I just want to touch on this so you understand why I did things the way I did. You can use a complete framework that covers everything you want to do: you learn it, and then you implement the little part that you want to do within that framework, using configuration or by writing some code, depending on the framework. There are frameworks like that for back-end development, and there are also framework-like tools that you can use for scraping. The other way is going from the bottom up, using existing building blocks and gluing them together with your own code.
If you develop, like I do, for a customer who is interested in getting results, a framework is not necessarily the best way to go. If the framework does exactly what you need and you do not have to change the framework itself, you may be fine with it. But if you need to dive into the framework and change the small percentage of its code that you actually use, you first have to find the piece to change and then make the change, and the biggest problem is that after the code has been running for a year without you looking at it, you have completely forgotten how the framework works, so you have a big problem updating your own code and understanding how it fits in. If you glue building blocks together, and the blocks essentially do what you want, you only have to look at your own glue; that is code you wrote yourself in the first place, so after a year you are much more likely to understand what you did, and you might even conclude that if you had to start from scratch you would do it the same way. So I am going to show something that glues together the building blocks I listed earlier.

On one end you have the simple websites: those you can actually access by using urllib2 and requests. Sometimes you have to submit form data to get access to the data you need, and requests in particular helps you do that. If you have been getting your data with urllib2 and you have not used requests yet, I recommend you look at it. These libraries do some basic things for you, like following redirects; things like handing over cookies are a bit more involved. And if there is some JavaScript on the site, things get harder, because you have to look at what the script does and work out how to do that by hand: the scripts get data with URL requests of their own and then insert it into the page.

Cookies are used to keep state, and I specifically mention them because they are often used to preserve your authentication information. Data that is valuable to get at might not be available for free, so it is not the case that you request some URL and get the data; you have to log in first, and only then can you proceed to getting the data. There is still some built-in authentication in the browser itself, with a very coarse pop-up window where you fill in a username and password, but more often there is a form you have to fill out on the page; the information from that form goes to the back end, which creates some cookie, and that cookie is then used at each subsequent step. Over the last seven years or so OpenID has come up, which allows the developer of a site to concentrate on getting the information across that they want, without having to write too much of the login code themselves. It also has the advantage that the authentication is delegated to, say, Google, and that, if necessary, the person who logged in can be traced more physically: nowadays, if you set up a new account, Google will ask you for a telephone number to which it can send a code, and in France and Germany right now it is not possible to get a telephone number without showing a passport. So some extra checks are being done; that may be for convenience, but it may also be that people want to know that you are a real person, with a real telephone number associated with the account that accesses the site.
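A small sketch of that "simple site" case, using a requests Session so that the cookie set by the login form is sent back on later calls; the URL and the form field names are placeholders, not from the talk.

```python
# Submit a login form and let the Session carry the authentication cookie.
import requests

with requests.Session() as s:
    # requests follows redirects for us; the Session stores any cookie
    # the back end sets after a successful login.
    s.post("https://example.com/login",
           data={"username": "scraper", "password": "secret"})
    # Later requests in the same Session send those cookies back,
    # so this page is fetched as a logged-in user.
    r = s.get("https://example.com/data?since=2016-07-01")
    print(r.status_code, len(r.text))
```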
If a site uses JavaScript, then urllib2 and requests are of little use. When JavaScript first came up I did the parsing of what a script does by hand, but you have to read through the script, and it is often difficult to trace what it actually does. If you compare what you get with urllib2 to what you see in your browser, it is usually different, unless of course you switch off JavaScript in your browser; that comparison often gives a first indication of whether I can easily scrape a website or whether I have to use more advanced tools to get to the data. JavaScript, as you probably all know, updates parts of the tree, typically by requesting additional data from the back end. Why do developers do that? Primarily because it gives a nice user experience: if you do not have to reload the whole page, updates are quicker, which makes for a nicer experience, and it reduces the bandwidth used. From a scraping perspective JavaScript has several downsides; the first is that you no longer get easy access to the data of the website.
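A rough version of that first check: fetch the page with plain requests and test whether an element you can see in your browser is present in the raw HTML. The URL and the selector are made up for illustration.

```python
# If the element is missing from the static HTML, the content is most likely
# filled in by JavaScript and urllib2/requests alone will not be enough.
import requests
from bs4 import BeautifulSoup

r = requests.get("https://example.com/results")
soup = BeautifulSoup(r.text, "html.parser")

if soup.select("table#results tr"):
    print("data is in the static HTML, requests is probably enough")
else:
    print("data not found, probably rendered by JavaScript, use a browser")
```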
An even bigger problem is that with JavaScript you essentially do not know when the page is finished. If you do a urllib2 or requests call, when the page comes back you know you have all the data. If you have a page with JavaScript, you have to wait until it is done processing, but it might never be done processing: it might wait in a loop, or it might keep a channel open for additional data to come from the back end; you never know.

So if you can see something in your browser, you can probably use selenium: start a browser, talk to the browser from Python, and get your material. You use selenium as if you were using the mouse: you drive to the pages, you click things if necessary, you fill things out. Selenium was originally used for testing, and that use is comparatively easy, because when you test something you made the page yourself: you just have to check that the page is what you expect it to be, you already know the structure, you know which ids and classes you used, you know how to get to the particular elements in the HTML. The advantage of using selenium is that there is never a discrepancy between what you see in the browser selenium has opened and what a normal user will see, because you are actually using a browser; in principle you can get at anything a normal user can get at. Another nice thing is that the browser stays open: if your program is not proceeding, because it sleeps in a loop or is waiting for something, you can just start the debugger and see what the page looks like, in Firefox or whatever works for you. The important thing is that the program has to keep running: as soon as the program stops, selenium shuts down and closes your browser, and then you can no longer see what went wrong. Because if something goes wrong, say you try to access an element that is not there in the tree, your program might crash, depending of course on how you wrote it, and any useful information you could still have gotten from the browser is gone; you have to start up the browser again, go back to the page, look at the structure, check what you expected and whether the element has changed, and try again.

So with selenium you can do a superset of what urllib2 and requests can do, because all the JavaScript is handled correctly. There are two main differences. One is that you open a browser: with urllib2 or requests you do not open a browser, so you can run them easily from a job on a server without any problems; that is not possible with selenium without doing some extra work, because selenium opens a browser and the browser needs a display to run in.
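A minimal sketch of driving a browser with selenium in the way just described, using the selenium 2/3-style API that was current at the time of the talk (newer versions spell element look-ups as find_element(By.ID, ...)); site, ids and credentials are placeholders.

```python
# Drive a real browser from Python as if you were using the mouse.
from selenium import webdriver

driver = webdriver.Firefox()          # opens a real browser window
try:
    driver.get("https://example.com")
    driver.find_element_by_id("login-button").click()
    driver.find_element_by_name("username").send_keys("scraper")
    driver.find_element_by_name("password").send_keys("secret")
    driver.find_element_by_id("submit").click()
    print(driver.current_url)         # check where the redirects ended up
finally:
    driver.quit()                     # stopping selenium also closes the browser
```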
Let's look at some more of the problems this introduces. You are never sure when the data is there: the page loads, JavaScript starts; JavaScript itself has a special hook to wait until the complete page is loaded before it starts executing, but you have no clue when it stops executing. Sometimes you just wait for five seconds, because you know that in a normal situation things will be there by then, but it is much safer to check whether the particular piece of data you are interested in has actually been loaded. If you have a table of elements, a few rows might already be loaded and you do not know how many more are going to come, whether it is done loading or not.

The second issue is that the web page is a structure, and there are different ways of getting to the particular piece of data you want to extract: the data of an element, or some attribute value, a URL to a PDF file or to something else. You can get at it by id; ids should be unique, although I have seen several pages, especially ones generated with Microsoft CMS systems, where the same id is used multiple times on the same page. At that point I decided not to use the id, because I do not know whether the browser and Beautiful Soup will agree: the browser might take the first occurrence and Beautiful Soup the second, so better not to rely on it. Depending on how the website is structured you can search by class: if something is coloured in a specific way and the colouring is done with a specific class, you can get at that one item, but it is not always the case that a class is not reused in several positions in the document. You can also walk the tree programmatically, starting at the top with html, then body, and so on down, but that is not particularly fast. Then there is XPath; if you have not used it yourself, it is more or less like a regular expression for getting to a particular piece of data based on tag names and some attributes. XPath is not very complicated, but if you do not use it on a daily basis it is kind of hard to remember how to do things. There is a better, reusable option that I tend to use, and that is CSS select. It is not as powerful as XPath, but it is powerful enough for all of my purposes. For instance, the selector shown here says: take any 'a' element whose href starts with a particular string (making sure it really starts with it, not merely contains it), and the 'a' has to be inside an element that has "important" as its class; you can express all kinds of rules like this. Selenium may not support every selector, but Beautiful Soup does support this, and CSS select allows you to get to particular elements: once you are pointed at the 'a' element you can get the full URL or the data, whichever you are interested in. CSS select has my preference over XPath also because I can use it when I make a website: the same selectors are used in the CSS files that determine the look and feel of the site. But like I said, there are restrictions you have to be aware of, because selenium and Beautiful Soup do not implement CSS selectors as completely as your browser does.
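The kind of CSS selector discussed here, written out with Beautiful Soup's select(); the selector on the slide is not in the transcript, so this one is an invented equivalent: any 'a' whose href starts with a given string, inside an element with class "important".

```python
# CSS select with Beautiful Soup: attribute prefix match plus class ancestor.
from bs4 import BeautifulSoup

html = '''<div class="important">
  <a href="https://example.com/doc/1.pdf">one</a>
</div>
<div><a href="https://example.com/doc/2.pdf">two</a></div>'''

soup = BeautifulSoup(html, "html.parser")
for a in soup.select('.important a[href^="https://example.com/"]'):
    print(a["href"], a.get_text())    # only the first link matches
```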
So what does a typical selenium session look like, before we go into how to do it differently? You open a browser and go to the site; you click the log-in button, which means you have to authenticate; you wait until the redirect to the OpenID provider's site has happened and you provide your credentials. How to provide credentials automatically is a whole subject in itself, because you do not want everybody to be able to read your login name and password. One of the simpler approaches, if you are running Linux, is to put them in a subdirectory of the .ssh directory, which already has the restriction that it is accessible only by the owner of the files. Then you wait until you get back to the requested page, after the OpenID system has notified the site that everything is OK. Then you fill out the search criteria to restrict the results to new data that has been added since the last time you checked. Then you might get a table, a list of items; you click one of the references in the table, you arrive at the final page, and you get the data from there, extracted from the HTML, or you find a link, and that link might point to some file, a media file or a PDF.

The main problem with this is that it is very time consuming. Every time you log in you have to wait, and we are talking about seconds. If your program does not yet know exactly how to analyse the structure of that last page, the one where you actually retrieve the file or the textual data, you have to restart your program and it has to log in all over again. So we are talking about tens of seconds, if not minutes, before you get to where you want to be, and if your client is waiting for results, they are likely to conclude that it is not working anymore.
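One possible shape of that credentials handling, assuming a file under ~/.ssh; the file name and its user:password format are my own choice for this sketch, not the speaker's.

```python
# Read credentials from a permission-restricted file and refuse to use it
# if group or others can read it.
import os
import stat

def read_credentials(path=os.path.expanduser("~/.ssh/scraper/credentials")):
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise RuntimeError("%s is accessible by group/others, refusing" % path)
    with open(path) as fp:
        user, password = fp.read().strip().split(":", 1)
    return user, password
```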
So how can we improve on that, so that you do not have to restart selenium every time? There are always several ways, but the way I solved it is by going to a client-server architecture, in which the server talks to selenium and my client can crash, or be restarted, and continue where it left off. The server keeps the selenium session open, and thereby keeps the browser open, even if the client goes away. To do that you need some protocol. It does not have to be very sophisticated: requests go from the client to the server, data comes from the server back to the client for analysis, and you need to know what state the program, that is, the website, is in, so that you can take the appropriate action, or rewrite your client program to decide what the appropriate action is. When I set this up a couple of years ago I first thought: I will write files with increasing numbers in their names, and the server will just watch the directory and pick things up from there. But then I looked at ZeroMQ, which lets you do this kind of thing pretty easily. Among other things it allows a many-to-one connection between many clients and one server, and it allows multiple transports, so your clients can come and go while the server keeps running. With ZeroMQ it is also trivial to put the server side on a different machine, by using port numbers and specifying on which machine things are running if they are not on localhost. ZeroMQ, not by default but easily, lets the exchanges be Unicode based, and that matters for getting data: you might not use special characters in your protocol itself, but on the websites you download you are certain at some point to run into non-ASCII characters, and you have to deal with those, so you might as well set up the whole thing to use Unicode.
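A minimal sketch of that ZeroMQ transport: many clients with REQ sockets talking to one server with a REP socket, exchanging Unicode strings. The command text here is a placeholder; the actual commands are discussed below.

```python
# Many-to-one transport: several REQ clients, one REP server, Unicode messages.
import zmq

def server(port=5555):
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REP)
    sock.bind("tcp://*:%d" % port)       # clients may live on other machines
    while True:
        cmd = sock.recv_string()          # Unicode in ...
        sock.send_string(u"ok: " + cmd)   # ... Unicode out

def client(port=5555):
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REQ)
    sock.connect("tcp://localhost:%d" % port)
    sock.send_string(u"goto win0 https://example.com")
    print(sock.recv_string())
```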
So if we look at the session we walked through before, getting to some data, with a client-server based solution things look slightly different. You open a browser, but only if it is not already open; you log in, but only if you are not logged in yet; if you are already past the OpenID site you do not have to go to the OpenID site, et cetera, et cetera: you do not redo things that are already done. You just pick up where you left off last time, and you only check which steps still need to be done.
It might well be that only the final step, the data exchange, is left: you skip all the initial steps, check that they are done, and directly get your data. The turnaround time for starting a client program goes down from tens of seconds or a minute to a fraction of a second, and you get your data without much waiting.

So if you define a protocol, what do you need? The protocol sends some command with some parameters and gets a result back. Which commands do we need, and with which parameters? There are only very few of them. First, you have to be able to open a window, and I use a specific window id for it, so that I can open multiple windows on the server side. If you do not do that, you have only one window to work with, and it becomes very difficult to do many-to-one or run multiple clients, because they would all be competing for the same window. Then, using that window id, you say: go to some URL, and the page shows up in the browser that selenium runs. The next thing you need is to select a specific item, based on an id you can reuse on a specific page and window, and then interact with that item: click on it, to get a radio button selected or to follow a specific link; or clear an input or text area that may already contain something and then type into it, for example to clear out the old, incorrect password and enter the new one. Then, very importantly: return the HTML starting at a particular id. You can of course fetch the complete HTML page, but that is inefficient; often you already know that you are only interested in one table, so you select that table via selenium and get just that table back. The other thing that is almost indispensable is: what is the current URL I am looking at? Because if you go to an OpenID page and click somewhere, you need to know that you got back to your original site before you continue working, so the client wants to be able to ask the server where it is. You can extend this protocol with whatever makes things more efficient; this is essentially where I stopped a year and a half ago, after adding a few more things.

It can be more efficient to do work on the client side than to push it to the server. Once you get HTML back you need to analyse it, and for that I use Beautiful Soup; it is faster than walking the tree through selenium and fetching the individual items, although of course that does not help if you actually have to click on the items, which still has to happen on the server side. As I already indicated, Beautiful Soup supports CSS select, so once you have the additional data back on the client you can handle it there.
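A sketch of what a thin client for such a protocol could look like; the command names and wire format are my own paraphrase of the commands listed (open a window, go to a URL, click, clear and type, return HTML starting at an id, report the current URL), not the actual implementation.

```python
# A hypothetical thin client speaking a simple text protocol over ZeroMQ.
import zmq

class BrowserClient(object):
    def __init__(self, addr="tcp://localhost:5555"):
        self._ctx = zmq.Context()
        self._sock = self._ctx.socket(zmq.REQ)
        self._sock.connect(addr)

    def _cmd(self, *parts):
        self._sock.send_string(u" ".join(parts))
        return self._sock.recv_string()

    def open_window(self, win):
        return self._cmd(u"open", win)

    def goto(self, win, url):
        return self._cmd(u"goto", win, url)

    def click(self, win, elem_id):
        return self._cmd(u"click", win, elem_id)

    def type_text(self, win, elem_id, text):
        self._cmd(u"clear", win, elem_id)      # e.g. clear an old password first
        return self._cmd(u"type", win, elem_id, text)

    def html(self, win, elem_id):
        return self._cmd(u"html", win, elem_id)

    def current_url(self, win):
        return self._cmd(u"url", win)

# Example usage (requires a running server):
# client = BrowserClient()
# client.open_window(u"w0")
# client.goto(u"w0", u"https://example.com")
# table_html = client.html(u"w0", u"results")   # only the part you care about
```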
That solves the first problem of the client-server architecture: the client can crash and you do not have to start from scratch. But the whole setup introduces a new problem: you need a desktop where the browser can actually start, and if you want to run this on a server, or you simply do not want a browser popping up while you are typing an email, that is a nuisance. The solution is pyvirtualdisplay: it creates a virtual display on which you can start the browser. You will not actually see that display, but for development and debugging purposes you can still get at it, for instance through a VNC session. What I normally do is develop without the virtual display back end, so I see the real browser; once things are running it is fine, and if my client gets stuck anyway I connect to the virtual display to see why the browser stopped, for whatever reason. Sometimes you get surprises, like a website that requires you to change your password every six months and you have not done so; of course the page then differs from what your program expected, because it was never prompted for that before. There are different ways of extending all of this. One thing I have already done is suppress advertisements in the browser used in the back end, which of course makes pages that use ads load much faster. Something that does not work with selenium, but that the client-server architecture is capable of, is using the Tor network, by starting Firefox with its own extension; that is slightly less powerful than selenium, but it works for most purposes.

About the availability of the software: like in the previous talk, the software is not yet available. I have to take out some things on the client side that were developed for particular customers and that you would recognise, so I need to get those out first; then it will be on GitHub as a browser client and a browser server package, and I will attach a note to the video with that information once it is all there. That is almost the end of my talk; I can take some questions now, and I can also give some real-world examples of what I use it for, but let's do the questions first.

Question: usually this kind of problem comes up with single-page applications, and those usually talk to an API, right? Answer: right, but if there is an API available you might just want to use that to get the data; I am looking at sites that are not designed for that and do not have an API to get you to the data.

Question: OK, so the main problem is that you need to be sure that the page is completely loaded? Answer: yes, and that is why you actually look at some specific element on the page, whether it is there yet or not. If you check immediately, you might not have the table at all, or the table is there but you do not know whether all rows have been loaded. There might be some indication that there are going to be, say, fifteen results, so if your table has ten items you know five results still have to arrive; sometimes you are simply waiting and hoping that everything arrives.

Question: selenium looks pretty complicated and complex, with a lot of machinery; isn't it easier to do something like sleeping and then checking the content? Answer: you still need selenium to get the content there in the first place. If you do not use selenium and go back to, say, requests, anything that gets loaded by JavaScript you will not get at all, because requests does not interpret it; these are just different ways of addressing the same thing.

Question: one problem we had when using selenium to access data was that pages sometimes have date pickers and other elements that do not let you type in data directly, and those are usually very complicated to automate. Have you run into these problems, and do you have ideas for dealing with them?
Answer: there are multiple things you can do; I have seen these problems. If I recall correctly, there are calls to write directly into such a field, but there are also cases where you click somewhere and then just send characters, and then you have to make sure the cursor is in the right position so that they end up where they should. The Khan Academy website, for instance, has that kind of problem. You can get around it, but it is not trivial; there are different ways of getting the data in, and you would have to see whether the protocol needs an option for which of the two to use.

Question: what have you run into when using selenium for this kind of thing? A lot of sites do not want to be scraped, so they use services, proxies, that try to detect scraping patterns, and when they find a match you get a really good captcha. Did you encounter this, and how do you deal with it? Answer: one of the reasons to go to the client-server architecture is exactly that. One of the most frequent things I have seen is that a site notices that you log in, say, seven times a day and starts wondering why that is. Another example: StackOverflow will actually restrict how often you can refresh, so if you want to churn through something like a thousand review tasks you would have to do special things. Whether a site actually looks at your patterns depends on the site, but especially if your program behaves like a normal person they can hardly kick you out. For me that means that for some sites the scraping I do for a client takes two hours, but it only has to be done once a day. A site does not want you to download all the PDF files that appeared that day within five seconds, but if you download one every two minutes, I can still deliver my client the, say, ten new PDF files at the end of the day. That is the way I do it: I just let my program behave as if it were human, and that tends to be acceptable. Or you set up a second account.

Moderator: there is time for one last question. Question: I just wanted to add that there are ways to reduce the overhead, for instance running headless with PhantomJS or with Chromium, so there is less to run and you do not get a browser window. Answer: the disadvantage is that you are then not using a real browser, so that might be detected, and the other thing is that if something goes wrong you have nothing to look at, you just have your HTML structure. The nice thing about using pyvirtualdisplay with selenium is that the browser you would normally be using is always there, in its current state.
If the site now, after six months, asks you to change your password, that is much easier to recognise: you see the prompt that you have to change your password, instead of having to deduce it from the HTML you get back, although that is also possible. There are simply multiple ways of addressing these things, and everything has its advantages and disadvantages. Moderator: OK, good, thank you.

Metadata

Formal Metadata

Title Beyond scraping
Series Title EuroPython 2016
Part 101
Number of Parts 169
Author Neut, Anthon van der
License CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You may use, change, and reproduce, distribute and make publicly available the work or its content, in unchanged or changed form, for any legal and non-commercial purpose, provided that you credit the author/rights holder in the manner they specify and that you pass on the work or this content, including in changed form, only under the terms of this license
DOI 10.5446/21108
Publisher EuroPython
Publication Year 2016
Language English

Content Metadata

Subject Computer Science
Abstract Anthon van der Neut - Beyond scraping. Scraping static websites can be done with `urllib2` from the standard library, or with slightly more sophisticated packages like `requests`. However, as soon as JavaScript comes into play on the website you want to download information from, for things like logging in via OpenID or constructing the page's content, you almost always have to fall back to driving a real browser. For websites with variable content this can be a time-consuming and cumbersome process. This talk shows how to create a simple, evolving client-server architecture combining zeromq, selenium and beautifulsoup, which allows you to scrape data from sites like Sporcle, StackOverflow and KhanAcademy. Once the page analysis has been implemented, regular "downloads" can easily be deployed without cluttering your desktop, on a headless server, and/or anonymously. The described client-server setup allows you to restart your changed analysis program without having to redo all the previous steps of logging in and stepping through instructions to get back to the page where you got "stuck" earlier on. This often decreases the time between entering a possible fix in your HTML analysis code and testing it to less than a second, down from a few tens of seconds in case you have to restart a browser. Using such a setup you have time to focus on writing robust code instead of code that breaks with every little change the site's designers make.
