
Frontera: open source large-scale web crawling framework


Formal Metadata

Title: Frontera: open source large-scale web crawling framework
Part Number: 130
Number of Parts: 173
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Production Place: Bilbao, Euskadi, Spain

Content Metadata

Abstract
Alexander Sibiryakov - Frontera: open source large-scale web crawling framework

In this talk I'm going to introduce Scrapinghub's new open source framework, Frontera. Frontera allows you to build real-time distributed web crawlers as well as website-focused ones, offering:
- customizable URL metadata storage (RDBMS or key-value based),
- crawling strategies management,
- transport layer abstraction,
- fetcher abstraction.
Along with the framework description I'll demonstrate how to build a distributed crawler using Scrapy, Kafka and HBase, and hopefully present some statistics of the Spanish internet collected with the newly built crawler. Happy EuroPythoning!
Transcript: English (auto-generated)
I will just hand it over to him. He will present Frontera, a large-scale open-source web crawling framework. Welcome, Alex. Thank you. Thanks. Hello, participants. So, a few words about myself. I was born in Yekaterinburg, in Russia.
It's roughly in the middle of the country, about 1,500 km east of Moscow. I worked five years at Yandex. Yandex is the so-called Russian Google, the number one search giant in Russia. I worked in the search quality department and was responsible for the development of social search and Q&A search.
At the time, we had access to the whole Twitter data, so we built our search on top of Twitter data. Later, I moved to the Czech Republic and worked two years at Avast, the antivirus company. These days it is one of the most popular antiviruses in the world.
It has about 200 million users. I was responsible for automatic false-positive resolution and large-scale prediction of malicious download attempts. So, let's get on with Frontera. I put this quote here because "crawl frontier" has become such a common term in the web crawling community.
So, basically, crawling works this way: you put the seeds in, the crawler starts to go there,
gets some links from there, and then keeps following the links it finds. The place where those links are stored before they are fetched is called the frontier. The term itself comes from shipping.
Obviously, all the Spanish folks know what "frontera" means. But I just realized that "frontera" is not such a widely used word, especially in countries with no sea. It's the place where people and goods wait before they go on to the land or out to the sea.
So, a few words about motivation. Why did we decide to build Frontera? We had a client.
They came to us and said they wanted us to process these pages and tell them which are the biggest, frequently changing hubs. So we had a look at it.
One billion pages. What does that mean? It means about 150 million pages per day, and about 1,500 per second. That was quite a lot. Later I will show you that current Scrapy throughput is about 1,500 pages per minute, not per second.
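A quick back-of-the-envelope check of those figures (a sketch; the one-billion total and the per-day rate are the numbers from the talk, the rest is plain arithmetic):

```python
# Back-of-the-envelope check of the crawl volume quoted in the talk.
total_pages = 1_000_000_000      # the client's one billion pages
pages_per_day = 150_000_000      # rate quoted in the talk

pages_per_second = pages_per_day / 86_400     # seconds in a day
days_needed = total_pages / pages_per_day

print(f"{pages_per_second:,.0f} pages/s")         # ~1,736 pages/s
print(f"{days_needed:.1f} days for 10^9 pages")   # ~6.7 days
```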
Here you see a very important picture. It's an illustration of hyperlink-induced topic search (HITS) by Jon Kleinberg.
Hubs are the nodes with lots of outgoing links, while the bigger nodes are authority sites, which have lots of incoming links. It's similar to scientific publications: if you are cited a lot, it means the community supports you.
The paper showed how to calculate hub and authority scores on a link graph, and the rest is history. Now every major search system uses this to rank pages and to study the link graph.
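For reference, here is a minimal sketch of the iterative hub/authority computation from HITS on a toy link graph (the graph and all names are made up for illustration):

```python
# Minimal HITS (hyperlink-induced topic search) sketch on a toy link graph.
# Hubs point to many authorities; authorities are pointed to by many hubs.
links = {                      # page -> pages it links to (toy data)
    "hub1": ["auth1", "auth2", "auth3"],
    "hub2": ["auth1", "auth2"],
    "auth1": ["hub1"],
    "auth2": [],
    "auth3": ["hub2"],
}

pages = set(links) | {p for out in links.values() for p in out}
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):            # power iteration
    auth = {p: sum(hub[q] for q in links if p in links[q]) for p in pages}
    norm = sum(v * v for v in auth.values()) ** 0.5
    auth = {p: v / norm for p, v in auth.items()}

    hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
    norm = sum(v * v for v in hub.values()) ** 0.5
    hub = {p: v / norm for p, v in hub.items()}

print(sorted(auth.items(), key=lambda kv: -kv[1])[:3])   # top authorities
print(sorted(hub.items(), key=lambda kv: -kv[1])[:3])    # top hubs
```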
Another thing: it's not that Scrapy wasn't suitable for broad crawls. That's not true. But broad crawls with Scrapy were hard. Really hard. And nobody did them; people favored Apache Nutch over Scrapy.
We didn't like that, so we wanted to let people crawl whatever they want with Scrapy. There are two modes of execution of Frontera: single-threaded and distributed. Frontera is mostly about what to crawl next, and when.
It basically guides the crawler on what to do next. Single-threaded mode is appropriate for up to about 100 websites, because throughput heavily depends on the intensity of your parsing task. Your documents can have a lot of links, which yields more documents.
Or some of your websites can be less responsive than others. And sometimes spiders do additional post-processing, which is also CPU intensive.
It's basically all about CPU. For high-performance broad crawls, there is the distributed mode. Here are the main features of the single-threaded version.
The main feature, from my point of view, is that it's real-time. Compare it to Nutch. What does that mean? When you work with Nutch, the first thing you do is inject the seeds, and then you run one crawl iteration.
Then the whole thing stops. You need to run a command to process what was crawled and to generate new links to crawl, a new batch, and then continue with crawling. So it's batch-based and always proceeds in steps.
Frontera is the opposite: everything is online, so it never stops. At the end of every batch, the next batch is requested, and then it continues.
Therefore, we avoid waiting for the last URLs, which take too long to download. Those of you who have experience with crawling probably know about this. Actually, can you raise your hands: who has done broad crawls before?
Okay, one person. Okay, who knows about Scrapy? Okay, that's much better. Well, another thing is that we have a storage abstraction. Out of the box you have SQLAlchemy and HBase.
SQLAlchemy means you can plug in any popular database you know: MySQL, Postgres, Oracle, and so on. Or you can implement your own; there is a pretty straightforward interface. The third thing: we have a canonical URL resolution abstraction.
This is a usually underestimated problem: if you just treat each page as unique content, each page on each website can have many URLs.
So it's always a question which one to use. If you reach the same content through two URLs and don't pay attention to this, you will end up with duplicates in your database. Here we provide an interface to implement your own canonical URL resolution.
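As an illustration of the kind of heuristics involved (a generic sketch, not Frontera's actual interface), a canonicalizer might lowercase the host, strip the fragment, drop common tracking parameters, and, for a redirect chain, keep the last URL:

```python
# Generic URL-canonicalization sketch (not Frontera's own API): normalize the
# URL and, for a redirect chain, keep the final URL, as mentioned in the talk.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}

def canonicalize(url):
    scheme, netloc, path, query, _fragment = urlsplit(url)
    clean_query = urlencode(
        [(k, v) for k, v in parse_qsl(query) if k not in TRACKING_PARAMS]
    )
    return urlunsplit((scheme, netloc.lower(), path or "/", clean_query, ""))

def canonical_from_redirects(redirect_chain):
    # Heuristic from the talk: take the last URL in the redirect chain.
    return canonicalize(redirect_chain[-1])

print(canonicalize("http://Example.com/a?utm_source=x&id=1#top"))
# -> http://example.com/a?id=1
```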
It can differ depending on your application. The last thing is the Scrapy ecosystem: we have a big community and good documentation, I believe. And it's really easy to customize, mostly thanks to Python.
So, you benefit from Frontera when you need your own metadata storage or content storage: you have a website and you want to store its content, or you have an intranet and you want to store content or metadata in a database. Then Frontera is the right tool.
The second case is when you want to isolate URL ordering and queueing from the spider. And the third is when you have pretty advanced URL ordering logic for big websites. If a website is so big
that there is no way to crawl it in full, you can adjust the crawling logic so that it selects the best pages to crawl. Here's the architecture of the single-threaded version.
Let's go from right to left. You see the database, and you see the backend. The backend is mostly responsible for communication with the database. The backend also encodes the model for URL ordering and queueing.
So it's tightly coupled to the type of storage you use; that's why it lives in the backend. Frontera middlewares allow you to modify the contents of requests or responses however you want.
You can add your own fingerprinting, change the meta fields, add scoring fields, or anything else you need. The Frontera API is basically the API facing outward from the Frontera framework,
which can be used by any other process-management code or crawler. The crawler is basically the component that does DNS resolution and fetches the content from the web. You can put anything you want here.
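To make the backend idea more concrete, here is a rough skeleton of a custom backend. It is a plain stand-in class; a real one would subclass Frontera's backend component, and the method names below are my recollection of the Frontera API of that period, so treat them as an assumption and check the documentation:

```python
# Rough skeleton of a custom backend. Plain class for illustration; a real
# implementation would subclass frontera.core.components.Backend, and these
# method names approximate the Frontera 0.x API (verify against the docs).
class InMemoryBackend(object):
    """Keeps the frontier in a plain list/set, just to show the shape."""

    def __init__(self):
        self.queue = []          # requests waiting to be fetched
        self.seen = set()        # URLs we already know about

    def add_seeds(self, seeds):
        for request in seeds:
            if request.url not in self.seen:
                self.seen.add(request.url)
                self.queue.append(request)

    def page_crawled(self, response, links):
        # Schedule newly discovered links; the ordering policy lives here.
        for link in links:
            if link.url not in self.seen:
                self.seen.add(link.url)
                self.queue.append(link)

    def request_error(self, request, error):
        pass                     # e.g. reschedule or drop failed requests

    def get_next_requests(self, max_n_requests, **kwargs):
        batch = self.queue[:max_n_requests]
        self.queue = self.queue[max_n_requests:]
        return batch
```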
Obviously, we have everything ready for Scrapy. And we also have an example using the requests library, just to demonstrate that Frontera works well outside of Scrapy. And the site is internet, filled with shots during a bell.
I put that image here because it shows how we are friends with Scrapy. Basically, Frontera is implemented as a custom scheduler plus a spider middleware for Scrapy.
All of that is pluggable, and Frontera doesn't require Scrapy; it can be used separately. Mostly, Scrapy is used for process management and fetching. And, of course, we are friends forever.
Guys from Scrapy, well, not from Scrapy but from Scrapinghub, keep pushing me: let's integrate it even more. My task is to stand against that, because I have to think about the community and the bandwidth.
So, here's a short quick start to try Frontera in single-threaded mode. First, you install it. Then you write a simple spider, maybe 20 lines of code including imports; or you can take the example one.
Edit the spider's settings.py and put the scheduler and the Frontera spider middleware there, so Scrapy knows which scheduler to use, and the scheduler will later load all the Frontera components. Run the crawl. Finished. That's it. Check the database, if you use one.
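A minimal settings.py sketch for that quick start; the class paths follow the Frontera documentation as I recall it and may differ between versions, so double-check them against the docs:

```python
# settings.py -- hooking Frontera into a Scrapy project (single-threaded mode).
# Class paths follow the Frontera docs as I recall them; verify for your version.
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
}

# Module with Frontera's own settings (BACKEND, queue storage, etc.);
# 'myproject.frontera_settings' is a hypothetical name.
FRONTERA_SETTINGS = 'myproject.frontera_settings'
```

After that, scrapy crawl runs as usual, with Frontera deciding which requests to fetch next.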
Here is the list of use cases for the distributed version. It's a completely different story. The single-threaded version is meant for maybe 50 or 100 websites,
and you know all these websites. But when you do a broad crawl, you don't know what you will face. So: you have a set of URLs and you need to revisit them, and by a set of URLs I mean hundreds of thousands. Or you are building your own search engine and you need to get content from somewhere.
If you are doing research on the web graph, Frontera could also be useful; in that case you don't need to save the content, which makes the work a bit easier. Or you have a topic and you want to crawl the documents about that topic.
Imagine you want to crawl everything about sports cars. You run the crawl, and after some time you have a lot of documents. Much better than Google, because Google will only show you the first few pages, and it's hard to get those pages out of Google anyway.
And more general focused crawling tasks, as I mentioned previously: if you want to search a topic for big hubs, you will probably benefit from Frontera.
So, here is the architecture of the distributed version. Let's go from Scrapy. I will just describe the data flow and how all of this operates. You put your seeds into the spiders.
Then these seeds are passed to the spider log by means of the Kafka transport; the spider log is a Kafka topic. From Kafka, they get to the strategy worker and the DB worker. The strategy worker is responsible for all the scoring and for deciding
when we have to stop crawling, that is, when the crawling goal is achieved. The DB worker is responsible for storing URLs, new or old, it doesn't matter, and for producing new batches.
The scoring log is where all the URL scores are passed to the DB worker. So, the seeds go to the strategy worker and the DB worker. The strategy worker sees that these are new URLs and that we have to crawl them,
calculates a score for them, the score is propagated to the DB worker, and the DB worker makes a new batch out of them. The new batch is propagated to the spiders, and the spiders download this batch of URLs.
After that, we get the content and we send it, well, we actually also do the parsing, and then we send it by means of the spider log,
again, to the strategy worker and the DB worker. The strategy worker extracts the links and looks at them: if they are new, they need to be scheduled. It calculates scores again and puts them into the scoring log, and the DB worker saves the information about what was downloaded, and so on. So basically, we have a closed loop.
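A very rough sketch of that closed loop; plain in-process queues stand in for the Kafka topics (spider log, scoring log, spider feed), and everything here is illustrative rather than distributed Frontera's actual code:

```python
# Toy, in-process simulation of the distributed data flow. Queues stand in
# for the Kafka topics: spider log, scoring log and spider feed.
from collections import deque

spider_log = deque()    # spiders -> strategy worker / DB worker
scoring_log = deque()   # strategy worker -> DB worker
spider_feed = deque()   # DB worker -> spiders (new batches)
seen = set()

def spider(batch):
    # Pretend to download and parse; emit (url, extracted links) to the spider log.
    for url in batch:
        links = [url + "/a", url + "/b"]          # fake extracted links
        spider_log.append((url, links))

def strategy_worker():
    # Score newly discovered links and push the scores to the scoring log.
    while spider_log:
        _url, links = spider_log.popleft()
        for link in links:
            if link not in seen:
                seen.add(link)
                scoring_log.append((link, 0.5))   # constant score for the toy

def db_worker(batch_size=10):
    # Persist crawl info (skipped here) and produce the next batch of URLs.
    batch = [scoring_log.popleft()[0]
             for _ in range(min(batch_size, len(scoring_log)))]
    if batch:
        spider_feed.append(batch)

# One turn of the loop, starting from a single seed.
spider(["http://example.com"])
strategy_worker()
db_worker()
print(spider_feed.popleft())   # ['http://example.com/a', 'http://example.com/b']
```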
Actually, I'm running out of time now. So: you can put any strategy you want into the strategy worker.
Both the strategy worker and the DB workers are implemented in Python. Well, let's go on. Here are the main features of distributed Frontera. We use Kafka as the communication layer, and we have a crawling strategy abstraction: as I mentioned, in the strategy worker you can implement your crawling goal, URL ordering and scoring model as a separate module.
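To give an idea of what such a strategy can look like, here is a toy sketch; the method names only approximate the strategy-worker API of that era, so take them as an assumption and follow the real base class in the documentation:

```python
# Toy crawling-strategy sketch. Method names approximate the strategy-worker
# API of that era; a real strategy subclasses the project's base class.
class StopAfterNPagesStrategy(object):
    """Schedule every new link with a fixed score and declare the crawling
    goal reached after a given number of downloaded pages."""

    def __init__(self, limit=10_000_000):
        self.limit = limit
        self.downloaded = 0

    def add_seeds(self, seeds):
        for seed in seeds:
            self.schedule(seed, score=1.0)

    def page_crawled(self, response, links):
        self.downloaded += 1
        for link in links:
            # The score could instead come from a link-graph model
            # (HITS, PageRank, ...), as mentioned in the future plans.
            self.schedule(link, score=0.5)

    def finished(self):
        # Polled to decide when the crawling goal is reached.
        return self.downloaded >= self.limit

    def schedule(self, request, score):
        # In the real system this writes to the scoring log (Kafka);
        # here it is just a stub for illustration.
        print("schedule", request, score)
```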
It's polite by design, which means you will not get blocked, because each website is downloaded by at most one spider. This is achieved by means of partitioning.
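The idea behind that politeness guarantee is to partition URLs by host, so that all requests for one host always land on the same spider. A minimal sketch of such a partitioner (a simple hash-based illustration, not Frontera's exact code):

```python
# Minimal host-based partitioning sketch: every URL of a given host maps to
# the same spider partition, so one site is never hit by two spiders at once.
from hashlib import sha1
from urllib.parse import urlsplit

N_SPIDER_PARTITIONS = 12   # e.g. one Kafka partition per spider

def partition_for(url, n_partitions=N_SPIDER_PARTITIONS):
    host = urlsplit(url).netloc.lower()
    digest = sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % n_partitions

print(partition_for("http://example.com/a"))   # same partition...
print(partition_for("http://example.com/b"))   # ...for the same host
```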
Everything is in Python. Requirements: you need HBase and Kafka, and Scrapy 0.24 at least. The first two are easiest to get by installing Cloudera CDH.
You also need a DNS service, because we are doing DNS-intensive work, so it's better if your DNS service points to upstream servers from big providers, maybe Amazon or OpenDNS. Hardware requirements: quite an interesting slide.
Here's how to calculate, from your needs, what hardware you need for Frontera. Typically, each spider gives you roughly 1,200 pages per minute, including parsing, and the ratio of spiders to workers is about four to one.
So, here's an example. If you have 12 spiders, that will give you about 14,000 pages per minute. That means three strategy workers and three DB workers, 18 cores in total, because each worker consumes one core. Plenty of memory is also nice for the strategy workers.
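Those rules of thumb can be turned into a tiny sizing calculator (the per-spider throughput and the four-to-one ratio are the talk's rules of thumb; the function itself is just arithmetic):

```python
# Tiny sizing calculator based on the rules of thumb from the talk:
# ~1,200 pages/min per spider (parsing included), ~4 spiders per worker pair.
import math

def size_cluster(target_pages_per_min,
                 pages_per_spider_per_min=1200, spiders_per_worker=4):
    spiders = math.ceil(target_pages_per_min / pages_per_spider_per_min)
    strategy_workers = math.ceil(spiders / spiders_per_worker)
    db_workers = strategy_workers            # one DB worker per strategy worker
    cores = spiders + strategy_workers + db_workers   # one core each
    return spiders, strategy_workers, db_workers, cores

print(size_cluster(14_000))   # -> (12, 3, 3, 18), matching the example above
```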
Some gotchas. I'd better skip these, because we are running out of time. So, here's a short quick start, but it's not quick at all.
Prepare HBase and Kafka and install distributed Frontera. I think that if you already have HBase and Kafka, you will need two or three hours to get it running from scratch. All the instructions are mostly on this website.
Of course, we will keep working on this; at the moment the documentation is not in its best state. So, we made a quick Spanish crawl, as I mentioned just before the presentation. To test Frontera, we wanted to find out what you guys are doing here in Spain,
besides playing football. So, we decided to check out what your biggest websites are. I took all the Spanish content, all the Spanish URLs, from DMOZ, used them as seeds, ran 12 spiders, and let it run for one and a half months.
You probably know at least one of the websites at the top of this list. In the end, we crawled about 47 million pages.
We know that we have at least 22 websites with more than a million pages. But considering the count of domains found, we should have found many more, I think. Here are some future plans.
We definitely want a revisiting strategy out of the box. That means that if you performed a crawl, you probably need to recrawl it to see what has changed in the content, and you want to recrawl in an order that is based on how the content changes.
PageRank- and HITS-based ordering: I already talked about HITS; PageRank is just another link graph algorithm. We also want our own URL parsing.
That is in Scrapy, so I guess we will get it soon. And yes, we will test it at larger scales. Questions?
Questions, anyone? Okay, so I have a small question: how do you work out canonical URLs?
Because I think that might get really tricky on some pages. There are a few approaches. Actually, some webmasters provide the canonical URL in the page content, so you can take it if it is there.
That's the best case. If it is not there, you can analyze the structure; for example, if you have a chain of redirects, you can take the last one in the chain. Basically it's a set of heuristics; there is no clear-cut decision.
The goal of Frontera is to provide an interface for this. So that's it. Actually, if you look in the code, you will find that we just pick the last one from the redirect chain. That lets us avoid duplicates.
I have a question. As far as I know, Scrapy has a web-based dashboard. Do your spiders work with it too?
Actually, they should work, because you can put your own scheduler and spider middleware into those spiders, and that should potentially work. As far as I know, with this web board you create some rules and then Scrapy uses those rules.
I'm sorry, what rules? Like spider rules. Sorry, I don't remember. The thing is, I'm more dedicated to crawling.
Honestly, I'm not well aware of what that Scrapy dashboard is all about. Let's talk later; I will point you to the right guys. Second question: do you use some asynchronous library? If you run your application single-threaded, do you use some asynchronous code?
Yes, we mostly use Twisted, because it helps to call some functions asynchronously and just makes the code more readable.
Thank you. Okay, thank you. Maybe one quick last question before we change rooms. Is there anyone? Otherwise, Alex will be outside. I can show you something interesting if you don't have questions. Okay, really quick, we are at 45; I think the next talk will begin.
This was done 15 years ago by Andrei Broder and others; they are from Yahoo Research. This is the structure of the internet as they see it.
In the middle, we have the strongly connected component: a lot of links, highly connected inside. The whole thing looks like a butterfly. On one side, there are incoming links into this strongly connected component;
on the other side, there are outgoing links, and a lot of them. The butterfly has tendrils, so it's a bit like an octopus. These tendrils have outgoing links;
some tendrils have only incoming links, attached to the IN or to the OUT component. There are also tubes: you can bypass the strongly connected component and go from the IN links straight to the OUT links. And then there is also disconnected stuff.
That means there are pages we will never find if we just go and try to crawl the internet. So I hope that one day we will reproduce this picture and prove whether it is wrong or true.
Okay, perfect. Thank you, everyone.