Beyond scraping
Formal Metadata

Title: Beyond scraping
Part Number: 101
Number of Parts: 169
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/21108 (DOI)
Transcript: English(auto-generated)
00:00
So we can start the next session. Welcome back, everyone, for our next talk; Anton is going to be talking about scraping the web, so let's all help welcome him. Thank you, Fabio, for the introduction. Beyond scraping, that is the main title, and of course what counts as "beyond scraping" depends on what side you're coming from.
00:26
And I'm coming from the past. If you look at how scraping was 20 years ago, it was very easy. The way the web was built up for the user could be easily retrieved in an automated fashion. But nowadays that's not possible anymore. You have JavaScript to make the experience
00:45
much nicer for the end user, and if the data is presented for the end user, but not in a way that lends itself to automated downloading, it can be very hard to get anything. Before I start with the
01:02
proper talk, I would like to see some hands. Who has used urllib from the standard library? Who has used requests? Maybe use the other hand. Who has used Beautiful Soup, preferably version 4?
01:23
Who has used Selenium? Slightly fewer, still a lot. Who has used ZeroMQ? Okay, here it gets interesting. And who has used pyvirtualdisplay? Okay, good, still some people. So this is all the exercise you get,
01:41
unless you want to leave early, of course. The talk is not very technical; you will not see any Python code. But these are the buzzwords: if you glue all of this together in the proper way, with the right idea behind it, you'll be able to scrape current websites, and I would say like 99% of them, without too much trouble.
02:04
Some background for people who don't know me: beyond that, I fold the t-shirts at the Python conference. By education I'm a computational linguist. Unfortunately, I couldn't do anything with Python during that time, because at the time I was writing my thesis,
02:21
Guido was writing the first Python interpreter. After that, or partly during that, I was doing 3D and 2D computer graphics, and there I actually missed an opportunity in '93 to start using Python. One of the students from the University of Amsterdam who started working for me introduced me to the language, but they already had a C program with two
02:44
interpreted languages hanging off it, and I didn't want to have a third one in that program. But I liked Python; I actually liked it because of the indentation. A lot of people don't understand that when they first look at Python's indentation, but I came from using Transputers and Occam 2, which uses indentation, and a folding editor, so that was fine for me.
03:05
I did some stuff with Python, and in 1998 I finally got an opportunity to do something commercial in Python: 1.5.2 on Windows, with Tcl/Tk as the graphical user interface. Some people might know me from the C implementation of the ordered dictionary by Foord and Larosa,
03:25
a very complete ordered dictionary, much more complete than the one in the standard library. I reimplemented that in C back in 2007, and it was my first experience with making Python packages. More recently, I
03:40
picked up a YAML parser that seemed to be kind of dead, the PyYAML parser, and made it into a YAML 1.2 compatible parser. I started that because I found it kind of strange to have a human-readable data format that would throw away the comments when you read it in and wrote it
04:02
back out. So it's a round-tripping parser; it now does all kinds of extra things, and those are available from PyPI as packages. So, scraping the web: what is the actual problem? Well, you want to download information from all kinds of websites. But sometimes you want to keep some state, you want to interact with a
04:24
website and change the state, not necessarily download the data. You already know what is there, but you want to increase your score somewhere, or you want to make sure that somebody knows that you visited, although you're actually on holiday, lying on the beach, and didn't want to fire up your browser.
04:42
So before I go into detail, let's briefly look at web pages, so you know what terminology I use. For me a web page, coarsely, is a structure of text, a tree structure; the text can have attributes, and
05:02
the text can have data. So if you look at this small example of an HTML file, the tree structure is shown by the indentation. If you use a debugger within your browser, it often indents that for you, so you can actually see what the structure is. Of course, you don't have to write HTML like that; you can write it all on one line,
05:22
which makes it difficult to see what the structure is. If you look at the a tag there, second from the bottom, it has three attributes, href, id and class, and it has some data, "other site". Depending on what kind of library you use to go into the HTML,
05:42
you can also say that "other site" is data that is associated with body. It sometimes helps to have multiple things together, especially if you have things like italics: you might not want to just take the innermost tag and
06:01
pick up the data from that tag, and a library often automatically does away with all the intermediate tags and just puts together the data that you have. So a web page maps some URL to some data. That's often unique, but it might not be; you might get something different. Well, we'll look at that later.
06:28
Looking at it now, the old version of changing data is form data: you submit a form, and depending on how you filled out the form, you'll get a different result on the page that you go to, although it's the same URL.
06:46
What also happens is that if you have some state in a cookie, that might influence what kind of data you get for a specific URL. And nowadays, what you actually get depends a lot on JavaScript.
07:02
We have websites that have only one URL; it never changes, but all the time you get different data, depending on your state and the JavaScript that is executed on that single page. First interlude: there are different ways of developing software, and I just want to
07:22
touch on that so you understand why I did things the way I did. You can use a complete framework that covers anything that you want to do, learn it, and then implement the little part that you want to do within that framework, using configuration or writing some code, depending on the framework.
07:42
There are frameworks for doing web back-end development, and there are also more framework-like tools that you can use for scraping. The other way is going from the bottom up, using some existing building blocks and gluing them together with your own code.
08:01
If you develop like I do, for some customer who is interested in getting some results, a framework is not necessarily the best way to go. If the framework does exactly what you need, and you don't have to change the framework itself, then you might be better off with a framework. But if you need to dive into the framework and change
08:24
the 10% of the code that you use there, you first have to find that 10%, then do the changes. And the biggest problem is that after running the code for a year without looking at it, you've completely forgotten how the framework works, so you have a big problem updating your own code and understanding your own changes. If you glue blocks together, and the blocks
08:45
essentially do what you want, you only have to look at your own glue. That's the code you wrote yourself in the first place, and after a year you're much more likely to understand what you did. Given that you had to start from scratch, you might even do it the same way.
09:01
So I'm going to present something like gluing together the building blocks that I showed you earlier on, the ones I had you raise your hands for. Simple websites, those are the ones you can actually access by using urllib2 and
09:20
requests. Sometimes you want to use form data to actually get to the data that you need, and especially requests helps you do that. If you can get the data that you want with urllib2 and haven't used requests, I recommend that you actually look at it. These
09:41
libraries do some basic stuff for you, like redirection. Doing things like handing over cookies is more complex, and if there's some JavaScript on the site, things really get bad, because you have to look at: what does the JavaScript do? How can I do that by hand? Can I get the data that the scripts get with some urllib2
10:05
request and then insert it in the page, or directly use it? Cookies are used to keep state, and I specifically mention them because they are often used to preserve your authentication information.
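The talk deliberately shows no code, but a minimal sketch of a cookie-preserving form login with requests might look like this; the URLs and form field names are placeholders, since they depend entirely on the target site:

```python
import requests

LOGIN_URL = "https://example.com/login"  # placeholder, site-specific
DATA_URL = "https://example.com/data"    # placeholder, site-specific

def build_login_payload(username, password):
    # Field names must match the site's actual login form; these are guesses.
    return {"username": username, "password": password}

def fetch_with_login(username, password):
    # A Session keeps cookies across requests, so the authentication
    # cookie set by the login response is sent automatically with the
    # later data request.
    with requests.Session() as session:
        session.post(LOGIN_URL, data=build_login_payload(username, password))
        return session.get(DATA_URL).text
```

Without a Session, each `requests.get` starts cookie-less and the site would treat you as logged out again.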
10:23
Data that is valuable to get off the web might not be available for free. So it's not like you get some URL and you get the data; you might have to log in first, and then be able to proceed to getting the data.
10:43
As for authentication: originally there was, or there still is, some built-in authentication in your web browser. It's seldom used; it is a very coarse pop-up window where you put in a username and a password. More often there's some form you have to fill out on the web page, and
11:03
the information from that form creates some cookie on the back end, and that is used to keep state. Over the last, I'm not sure exactly how much, seven years or so, OpenID has come up, which allows you as a web developer to concentrate on getting
11:22
across the information that you want and not have to write too much of the login code. And it has an advantage, if you have your website redirect to Yahoo or to Google, in that you, if necessary, have some more
11:42
physical comeback: you can physically trace the person who logged in, because nowadays Google and Yahoo, if you set up a new account, will ask you for a telephone number where they can send some PIN code that you have to type in. And for instance in Germany, where I'm living now, it's not possible to get a telephone
12:01
account without showing your passport. So some back-tracing can be done there. It might be for convenience, but it might also be that people want to know that you're a real person, or at least that there's some real telephone associated with the person that actually accesses the site.
12:20
So if a site has JavaScript, then urllib2 and requests are of little use. When things came up with JavaScript, I have done this parsing of what the JavaScript does by hand, but you have to read the JavaScript, and it's often difficult to trace what it actually does. If you have a browser
12:42
and compare what you get with urllib2 to what you see in your browser, it is normally different, unless of course you switch off JavaScript in your browser. And that is often a good first indication of: can I easily scrape this website, or do I have to use more advanced tools to get to the data that I want? What JavaScript does,
13:06
we probably all know: it can update parts of the HTML tree by requesting additional data from the back end. So why do we, or why do web developers, do that? Primarily because it's a nicer user
13:25
experience, and if you don't have to update all of the website you get quicker updates, which adds to the nicer user experience and reduces the bandwidth that you need. JavaScript has
13:41
several downsides from a scraping perspective. One is that you don't get to the data too easily. There's also a big problem in that with JavaScript you essentially don't know when the page is finished. If you do a urllib2
14:02
request to a page, it comes back and you know you have all the data. If you have a page that has JavaScript, you have to wait until it's done processing, but it might never be done processing: it might wait in a loop, or it might have some channel open for additional data to come from the back end, and you never know when it stops.
14:25
So if you can see something in your browser, you probably can use Selenium to start that browser and then talk to the browser from Python and get your material out. So you just use Selenium like you would
14:42
using the mouse: you drive the pages and you click on things, if that's necessary, and fill things out. Selenium originally was used for testing, or I used it originally for testing, and that is easy. Why is it easy? Because if you test something, you made the page yourself, and you just have to see if the page actually is what you expect it to be. You already know the
15:04
structure, you know what IDs you've used, what classes you've used, you know how to get to the particular elements in the HTML tree. But the advantage, if you use Selenium, is that there's never a discrepancy, because you're actually using a browser, between what you see
15:26
talking to the Selenium-opened browser and what a normal user will see. So in principle you can get to anything that a normal user gets. A nice advantage of Selenium is also, because the browser is open, if the program has not exited yet,
15:43
because you have a sleep loop or you're waiting for some input, then you can just start the debugger and you can see what the page looks like, with the built-in debugger or Firebug, whatever works for you. But a
16:01
big, important thing is that the program has to keep running: as soon as your program stops, Selenium closes down, and it closes down your browser, and you will not be able to see what went wrong. Because if something went wrong, and you try to access an element that is not there in the HTML tree, your program might crash, depending of course on how you write it, and any useful information
16:25
that you could get from the browser is gone. You would have to start up a browser externally, go to the page, actually look at what the structure is: what did I expect? Oh, there's a new element there, they changed the back end. And try to fix these things. So if you
16:43
use Selenium, you can do a superset of what you can do with urllib2 and core requests, because all the JavaScript is handled correctly. And there are two main differences.
17:01
One of the differences is that you open a browser, and if you use urllib2 you don't open a browser. You can use urllib2 or requests easily from a cron job on a headless server without any problems; that is not possible with Selenium without doing some extra stuff. Selenium opens the browser, and the browser needs a window, so you need a desktop. So let's look at these.
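Although the talk shows no code, driving a page the way just described, clicking and filling things out, could be sketched like this with Selenium; the element IDs are hypothetical and depend on the site:

```python
def login_and_click(url, username, password):
    # Lazy imports, so the sketch reads fine even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()  # opens a real browser window
    driver.get(url)
    # The IDs below are assumptions; a real site needs its own selectors.
    driver.find_element(By.ID, "username").send_keys(username)
    driver.find_element(By.ID, "password").send_keys(password)
    driver.find_element(By.ID, "login-button").click()
    return driver  # keep the browser open so you can inspect its state
```

Returning the driver instead of quitting it is deliberate: as long as it stays open, you can poke at the page interactively.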
17:28
We have some more problems. I already mentioned this: you're never sure when the data is there. The page loads, the JavaScript is started; the JavaScript has to use some special
17:43
function to actually wait until the complete page is loaded before it starts executing, and you have no clue when it stops executing. Sometimes you just wait for five seconds, because you know that in a normal situation things will be there by then. But it is much safer to check whether the particular piece of data that you are interested in actually got loaded.
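The safer wait just mentioned, checking for the actual data instead of sleeping a fixed five seconds, can be sketched with Selenium's `WebDriverWait`; the selector and the minimum-row idea here are illustrative assumptions:

```python
def rows_present(css, minimum=1):
    # Predicate for WebDriverWait: truthy once at least `minimum`
    # elements match the CSS selector, falsy so the wait keeps polling.
    # "css selector" is the string value behind By.CSS_SELECTOR.
    def predicate(driver):
        rows = driver.find_elements("css selector", css)
        return rows if len(rows) >= minimum else False
    return predicate

def wait_for_rows(driver, css, minimum=1, timeout=30):
    # Lazy import, so the pure predicate above works without Selenium.
    from selenium.webdriver.support.ui import WebDriverWait
    return WebDriverWait(driver, timeout).until(rows_present(css, minimum))
```

Note that this still cannot tell "three rows loaded so far" from "three rows, done"; if the page gives no completion signal, a minimum count plus a timeout is about the best you can do.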
18:04
But if you have a table of elements, there might be three elements already loaded, and you don't know how many there are going to be: is it done loading or not? So, the second interlude. We saw that the web page has a structure, and there are different ways of getting
18:24
to a particular piece of data on that web page that you might want to extract. The things you probably want to extract are either data or some attribute value: a URL to a PDF file or to some other page.
18:42
You can get at that, depending on how the web page is built up, by using the ID. The ID should be unique, although I have seen several pages, especially ones generated with Microsoft's CMS systems, that reused the same ID on the same page.
19:00
At that point I just decided not to use the ID, because I don't know what the browser and Beautiful Soup will actually do: the browser might take the first ID and Beautiful Soup the second, so let's not use that. Depending on how the website is structured, you can search by class.
19:21
If something is colored in a specific way, and the coloring is done via a specific class, you can get at an item that way, but it's not always the case that these classes are not reused in several positions in the web page that you're looking at. You can programmatically walk over the tree: at the top there's
19:43
html, and then I have the body, and then I go down, down, down. It's not particularly fast. And there's something called XPath which, if you haven't used it yourself, is more or less like a regular expression to get to a particular piece of data based on the tag names and some attributes.
20:04
XPath is not very complicated, but if you don't use it on a daily basis, it's kind of hard to remember how to do things. There's a better, reusable option that I tend to use, and that's CSS select. It's not as powerful
20:20
as XPath, I think, but it's powerful enough for all of my purposes. It looks like, for instance, this: here it says, get any URL that is an HTTPS URL on somesite.com, but it might be longer; the caret actually makes sure that the value only has to start with that.
20:41
So the href of an a element has to start with this string, and then the a has to be under a div element that has import as a class. And there are all kinds of rules like this. This kind of thing, I think,
21:00
Selenium might not support, but Beautiful Soup, as far as I know, does. CSS select allows you to get to particular elements, and at that point, as soon as you point to the a element, you can get, if you're interested in it, the full URL that is in the href attribute.
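A selector like the one described can be tried out directly with Beautiful Soup; the page content below is made up for illustration, only the selector shape follows the talk:

```python
from bs4 import BeautifulSoup

HTML = """
<html><body>
  <div class="import">
    <a href="https://somesite.com/doc/1.pdf" id="first" class="doc">one</a>
    <a href="http://elsewhere.org/x" id="second">two</a>
  </div>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")
# ^= means "attribute value starts with": only https URLs on
# somesite.com under a div with class "import" match.
links = soup.select('div.import a[href^="https://somesite.com"]')
urls = [a["href"] for a in links]
```

Here `urls` contains only the somesite.com link; the plain-http link to elsewhere.org is filtered out by the `^=` prefix match.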
21:26
CSS select is my preference over XPath, because I can also use it when I make a website, in the CSS files that actually determine the look and feel of the site. But like I said, there are restrictions that you have to be aware of:
21:44
both Selenium and Beautiful Soup don't implement CSS selectors as completely as your browser does. So what is the typical Selenium session, before we go into how to do it differently? You open a browser and go to some URL.
22:01
You click a login button; we assume that you have to authenticate. You wait until the redirection to the OpenID provider site is reached. You provide your credentials; this is of course a whole subject in itself: how do you automatically provide credentials? You don't want to have everybody read your login name and password.
22:24
There are a few options, but one of the simpler ones is to make a subdirectory in the .ssh directory, if you're running Linux, which already has checked restrictions on accessibility: only the owner can read the files. Then, with your credentials filled in, you wait until you get back to the requested page, after the OpenID
22:46
session has notified your website that everything is okay. Then you fill out some search criteria to restrict to, or look for, new data that has been added since the last time you checked.
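The credentials idea mentioned a moment ago, a file that only its owner can access, might be sketched like this; the JSON layout of the file is my own assumption:

```python
import json
import os
import stat

def load_credentials(path):
    # Refuse a credentials file that group or others can access,
    # similar in spirit to the checks OpenSSH applies in ~/.ssh.
    mode = os.stat(path).st_mode
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(
            "%s must be accessible only by its owner" % path)
    with open(path) as f:
        return json.load(f)  # e.g. {"username": "...", "password": "..."}
```

Creating the file with `chmod 600` (and the containing directory with `chmod 700`) keeps the check above happy.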
23:03
Then you might get a table or a list of items, and you click on one of the references in that table, and then you're finally there: you might be on the final page and get the data from there, extracting it from the HTML, or you find a link, and the link might be to some file, a PDF file or some other
23:26
file. The main problem with this is that debugging is very time-consuming: every time you log in, you have to wait, and you're not talking about seconds. If in the end your program doesn't exactly know how to
23:41
analyze the structure of the last page, where you actually retrieve the file information or the textual data, then you have to restart your program, and it has to log in again. So we're talking about tens of seconds, if not a minute, before you can get to where you want. And if you have a client waiting, it's like: oh, your software is not working anymore. This is kind of bad.
24:07
So how can we improve on that, so that you don't have to restart Selenium every time? There are probably several ways, but the way I solved this is going to a client-server architecture, where the server talks with Selenium, and my client can just crash, or can be restarted, and
24:25
continue where I left off. The server keeps the Selenium session open, and that keeps the browser open even if the client crashes. To do that, you need some protocol. So if you think about how to set that up:
24:42
it doesn't have to be very sophisticated. You send data to the server, which is essentially requests, and you get data from the server back to the client for analysis, to know what state the program, or the website, is in, so you can take appropriate action, or rewrite your client program to take other appropriate action.
25:04
Originally, when I set this up a couple of years ago, I thought: I'll write some files with increasing file-name numbers, and the server will just look at the directory and get stuff from that. But then I looked at ZeroMQ, and it actually allows you to do this kind of thing pretty easily.
25:23
Among other things, it allows you to have a many-to-one connection between many clients and one server. That also allows you to have multiple clients, or multiple threads within your clients, and still have one server open.
25:48
Using ZeroMQ, it's trivial to get the server side onto a different machine: you use port numbers and specify which machine things are running on, if they're not on the localhost.
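A sketch of the server side of such a setup with ZeroMQ, one REP socket answering many REQ clients; the JSON message shape is my own assumption, not a format given in the talk:

```python
import json

def encode_message(cmd, **params):
    # UTF-8 JSON, so non-ASCII page data survives the client/server trip.
    return json.dumps(dict(params, cmd=cmd)).encode("utf-8")

def decode_message(raw):
    return json.loads(raw.decode("utf-8"))

def serve(endpoint="tcp://*:5555"):
    import zmq  # lazy import; requires pyzmq
    context = zmq.Context()
    socket = context.socket(zmq.REP)  # one REP pairs with many REQ clients
    socket.bind(endpoint)
    while True:
        request = decode_message(socket.recv())
        # ... drive the Selenium browser here based on request["cmd"] ...
        socket.send(encode_message("ok", echoed=request["cmd"]))
```

Clients connect a REQ socket to `tcp://<serverhost>:5555` and do matching `send`/`recv` pairs; ZeroMQ handles the fan-in from multiple clients.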
26:05
ZeroMQ, not by default, but it allows sending Unicode-based exchanges, and that makes it especially easy to get data across. You might not use special characters in your protocol itself, but on the websites that you download you're almost certain at some point to find non-ASCII characters, and
26:25
you have to deal with those, so you might as well set up the whole thing using Unicode. So if you look at the thing that we did before, the session of getting to some data, if you have a client-server based solution,
26:44
then the thing looks slightly different. You open a browser, but only if it's not already opened. You click on the login button, but only if you're not logged in yet. If you're not logged in yet, but you're already at the OpenID site, you don't have to go to the OpenID site.
27:02
Et cetera, et cetera: you don't have to do things that are already done, and you just have to pick up where you left off last time, and you have to check for those things. So it might just be, if only the final page with the data has changed, that you don't do any of these initial things.
27:23
You just check that they're done, and then you directly get your data. So your turnaround time for starting a client program goes down from tens of seconds to a minute, to a fraction of a second, and then you have your data. Much easier to debug, so debugging goes very fast.
27:41
So if you define a protocol, what do you need? Well, the protocol sends some commands with some parameters and gets a result back. So let's look at what kind of commands we need and what kind of parameters these commands have. There are only very few of them, so please stay with me. You have to be able to open a window, and I use a specific
28:04
window ID for that, so I can open multiple windows on the server side. If you don't do that, you essentially have only one window to work with, and then it is very difficult to do many-to-one, or to have multiple clients running, because they would be competing for the same window.
28:24
Using that window ID, you can say: go to some URL, and the page will show up in the web browser that Selenium has opened in the meantime. The next protocol thing that you need is: select some specific item by an item ID.
28:41
The item ID you can reuse again on a specific page with a window ID. And then you want to interact with the specific item based on its ID: you might want to click on it, to have a radio button clicked or to go to some specific link. Clear an input or text area: there might already be something where you want to write.
29:05
That's the next thing that you want to do: you might want to clear out the old password that is incorrect and give the new password. And then, very important: return some HTML starting at a particular ID. You can of course go get the complete
29:24
HTML page, but that's inefficient; often you already know: oh, I'm only interested in this table. You select the table using Selenium, and then you get the whole table back. And the other thing that is almost necessary to have is:
29:41
what is the current URL that I'm looking at? Because if you go to an OpenID page and you say: click somewhere, you need to know that it actually gets back to your original site to continue working. So the client wants to be able to ask the server: what is the current URL that we're looking at? You can extend this protocol.
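The command set just described could be captured as a small table plus a validator; the command and parameter names here are my own labels for what the talk describes, not a wire format from it:

```python
# Commands from the talk and the parameters each one needs.
PROTOCOL = {
    "open_window": ["window_id"],
    "goto_url":    ["window_id", "url"],
    "select_item": ["window_id", "item_id", "selector"],
    "click":       ["window_id", "item_id"],
    "clear":       ["window_id", "item_id"],
    "get_html":    ["window_id", "item_id"],
    "current_url": ["window_id"],
}

def make_request(cmd, **params):
    # Reject unknown commands and missing parameters before sending
    # anything over the wire.
    expected = PROTOCOL[cmd]
    missing = [name for name in expected if name not in params]
    if missing:
        raise ValueError("%s needs %s" % (cmd, ", ".join(missing)))
    return dict(params, cmd=cmd)
```

Validating on the client keeps malformed requests from ever reaching the server that holds the long-lived Selenium session.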
30:02
whatever makes things more efficient. This is essentially where I stopped a year and a half ago, after adding a few things. It might just be more efficient to do some things on the client side and then push them to the server.
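The handful of commands above can be sketched as a tiny message format. This is only an illustration: the command and field names below are my own invention, not the actual wire protocol.

```python
import json

def make_command(command, **params):
    """Serialize one protocol command (name plus parameters) as JSON."""
    return json.dumps({"command": command, "params": params})

# The commands the talk enumerates, expressed in this hypothetical format:
msgs = [
    make_command("open_window", window_id="w1"),
    make_command("goto", window_id="w1", url="https://example.com/login"),
    make_command("select_item", window_id="w1", item_id="login-form"),
    make_command("click", window_id="w1", item_id="submit"),
    make_command("clear", window_id="w1", item_id="password"),
    make_command("get_html", window_id="w1", item_id="results-table"),
    make_command("current_url", window_id="w1"),
]

# Every message round-trips through JSON and carries its window ID,
# which is what lets several clients share one server:
assert all(json.loads(m)["params"]["window_id"] == "w1" for m in msgs)
```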
30:24
So the client gets the HTML back and needs to do an analysis of it. I use BeautifulSoup for that: it is faster than walking the tree in Selenium and trying to get the individual items. That is of course not useful if you actually have to click on the items; then you still have to do it on the server side.
30:44
As I already indicated, BeautifulSoup has CSS select support. There is one caveat, though: you get a piece of an HTML page back, and BeautifulSoup wants to have a whole HTML page. So you put the fragment into a template string, between the curly braces, and then you can actually hand it over.
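A minimal sketch of that wrapping trick, using BeautifulSoup's `select()` for the CSS part; the exact template string is my reconstruction of what the talk describes:

```python
def wrap_fragment(fragment):
    """Embed an HTML fragment between the curly braces of a full-page
    template, so a parser that expects a complete document is satisfied."""
    return "<html><head></head><body>{}</body></html>".format(fragment)

def select_from_fragment(fragment, css):
    """Run a CSS selector over an HTML fragment returned by the server."""
    # Imported lazily so wrap_fragment() works even without bs4 installed.
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(wrap_fragment(fragment), "html.parser")
    return soup.select(css)

wrapped = wrap_fragment("<table id='t'><tr><td>42</td></tr></table>")
assert wrapped.startswith("<html>") and wrapped.endswith("</html>")
```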
31:07
So the first problem that I solve with this client-server architecture is that your clients can crash and you don't have to start from scratch. But the whole thing introduces a new problem: you have to have a headed
31:24
desktop where you actually start the browser. If you want to run something on a headless server, or you don't want a browser to start up at some point while you are typing some email, there is a solution using PyVirtualDisplay: it creates a virtual
31:45
display that you can use to start the browser. You will not actually see the display, but for debugging purposes you can still get at it if you start a VNC session. So what I normally do is: I don't use the VNC or PyVirtualDisplay backend while I'm developing.
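A sketch of that setup, assuming `pyvirtualdisplay` (which drives Xvfb) and Selenium are installed; the `DEBUG_BROWSER` environment switch is my own convention for toggling between a real desktop display during development and a virtual one in production:

```python
import os

def want_visible_display():
    """Use the real desktop display while developing, a virtual one otherwise."""
    return os.environ.get("DEBUG_BROWSER", "") == "1"

def start_browser():
    display = None
    if not want_visible_display():
        # Lazy import: only needed in the headless case; requires Xvfb.
        from pyvirtualdisplay import Display
        display = Display(visible=False, size=(1280, 1024))
        display.start()
    from selenium import webdriver
    driver = webdriver.Firefox()  # opens on whichever display is active
    return display, driver

# pyvirtualdisplay also offers an "xvnc" backend, which is what makes it
# possible to attach a VNC viewer later and watch the browser for debugging.
assert want_visible_display() in (True, False)
```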
32:04
And then when it starts running, that's fine; and if my client crashes anyway, I use VNC to connect to the PyVirtualDisplay that was started up, and then I can see: oh, the browser stopped for whatever reason. Sometimes you get stupid things, like a website that requires you to change your password every six months
32:22
and you haven't done that, and of course the site goes to a completely different page than you expected, because you never programmed for that. There are different ways of extending this. What I have already done is suppress the advertisements: I often use the Firefox browser in the backend with a configuration that, of course, loads
32:43
ad-heavy pages much faster. What doesn't work with Selenium, but what the client-server architecture is capable of, is using the Tor network, by starting Firefox with its own extensions so that you can drive it. That is slightly less powerful than Selenium, but for most purposes it is good enough.
33:06
Then, about the availability of the software: as with the previous talk, it is not yet on PyPI. I need to remove some stuff from the client side that is proprietary to the clients I develop software for, from which you would recognize
33:20
which sites I scrape for them. So I need to get that out. But once it gets up there, you will be able to find the ruamel browser client and ruamel browser server on PyPI, and I will also update the YouTube video with that information when it is available.
33:44
So that is almost the end of my talk. I can take some questions now; I can also give some real-world examples of what I use it for, apart from client work. Let's do the questions first.
34:01
There is a microphone. One, two. Hi. Usually we have this kind of problem when the page is a single-page application or
34:24
is JavaScript-driven, and it usually talks to an API.
Right; if there is an API available, you might just want to use the API to get the data. I am looking at pages that are not designed for that: I don't have an API to get at the data.
34:43
Okay, but the main problem is that you need to be sure that the page is completely loaded.
Yes, and what I do is look at some specific element on the page, to see whether it is already there or not. If the table
35:01
is checked for immediately, you might not have it at all; then the table gets there, but you don't know whether all the rows have been loaded. There might be some indication that there are going to be, say, fifteen results while your table has ten items, so you know that five results still need to be loaded. And sometimes it is just waiting and hoping that everything arrives in time.
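That check-until-ready loop can be sketched generically. The helper below is my own illustration; in real use the predicate would ask Selenium, or parse the returned HTML, for the expected element or row count:

```python
import time

def wait_for(predicate, timeout=10.0, interval=0.25):
    """Poll `predicate` until it returns a truthy value or `timeout` passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("condition not met within {}s".format(timeout))

# Example with a fake "page": rows appear one per poll until complete.
rows = []
def table_complete(expected=3):
    rows.append("row")                       # simulate one more row arriving
    return len(rows) >= expected and rows    # falsy until all rows are there

assert len(wait_for(table_complete, timeout=2.0, interval=0.01)) == 3
```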
35:22
But this, with Selenium, looks pretty complicated and complex, and a lot of machinery is used. Isn't it easier to do something like, I don't know, sleep one second and check the content of the page?
Yes, but you still need to use Selenium to find out whether the content is there, or
35:45
download the whole page; you have to ask Selenium for that as well. And if you don't use Selenium but go back to using requests, anything that gets loaded by JavaScript you will not get at all, because requests doesn't handle the JavaScript.
36:02
So there are different ways of addressing this.
Yeah. One problem we had when using Selenium to access data was that these pages sometimes have date pickers and other elements that do not allow you to type in data, and
36:24
these are usually very complicated to automate. Have you had these problems, and do you have ideas on how to handle them?
Actually, there are multiple things that you can do. I have seen these problems. If I recall correctly, there are
36:42
Selenium calls that just write into a field, but there are also calls where you click somewhere and start sending individual characters; then you have to make sure that your cursor is at the right position so that they end up in the right place. I have done that, for instance, with Khan Academy: their website has that kind of problem, and you can get around it.
37:04
It is not trivial, though. There are different ways of actually getting the data in, and I would have to check whether the protocol has an option for choosing which of the two to use.
37:29
One problem I got when using, oh sorry, one problem I got when using Selenium to do about the same thing, not in the same way but with the same tools, is that a lot of people don't actually want their data scraped, so they are using
37:44
services like Distil Networks or Cloudflare, that is, proxies that try to detect scraping patterns, and when they think you are not human, they put up a captcha. Did you encounter this problem?
38:02
That ties into the client-server architecture. One of the most frequent things I've seen is that they notice that you log in, like, seven times a day, and wonder: why is that, why doesn't the cookie persist, and those kinds of things. That is one of the examples that I have:
38:22
sites like Stack Overflow will actually detect how often you refresh and restrict that, and if you want to advance in the review queues and get, like, a thousand reviews to earn the gold badge, you have to do special things and balance where you are actually looking.
38:42
It depends of course on the site. This is a case of the thief and the better lock: they will look at the patterns that you are using and try to detect them, but essentially, if you have your program behave like a normal person,
39:01
they can hardly kick you out. For me, for some sites I scrape for clients, that means the scraping takes two hours, but they only want it done once a day. And the site cannot really disallow you. Say they put up ten references to ten PDF files in a day;
39:21
well, they can assume that you need to read the PDFs, and they don't want you to download all the PDF files within five seconds. But if you download one every two minutes, you can still provide your client at the end of the day with the ten PDF files that were uploaded. That is the way I handled it: I just have my program behave as if it were a human, and then the data has to remain accessible.
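That pacing idea can be sketched in a few lines. The function and parameter names here are my own, and `fetch` stands in for whatever actually retrieves a file:

```python
import time
import random

def polite_download(urls, fetch, base_delay=120.0, jitter=30.0, sleep=time.sleep):
    """Fetch each URL, pausing roughly `base_delay` seconds (plus or minus
    some jitter) between requests so the access pattern looks human."""
    results = []
    for i, url in enumerate(urls):
        results.append(fetch(url))
        if i < len(urls) - 1:   # no need to wait after the last one
            sleep(base_delay + random.uniform(-jitter, jitter))
    return results

# Usage with a fake fetch and a no-op sleep so the example runs instantly:
fetched = polite_download(["a.pdf", "b.pdf"],
                          fetch=lambda u: "data:" + u,
                          sleep=lambda s: None)
assert fetched == ["data:a.pdf", "data:b.pdf"]
```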
39:43
[Inaudible question from the audience.]
40:00
You could set up a second account and have that second account look at it.
We have time for one last question.
Yes, I just wanted to add that there are ways to run Selenium headless without
40:22
using PyVirtualDisplay: with PhantomJS or Chromium, I mean, there are ways to run Selenium headless without a virtual display.
Yeah, Selenium has some modes where you don't get a browser window.
40:42
The disadvantage is that they are not using a real browser, so that might be detected. And the other thing is: if things go wrong, you have nothing to look at, you just have your HTML structure. The nice thing is that if you use PyVirtualDisplay, you can use VNC and see the browser that you would normally be using:
41:03
it's like, oh, it is in that state. It is much more recognizable if the site now, after six months, asks you to change your password and you actually see that page, instead of getting the HTML back and wondering what it is actually trying to do. But that is also possible; there are just multiple ways of addressing these things, and everything has its advantages and disadvantages.
41:30
Okay, thank you Anton. Thank you very much, everyone.