URL Frontier, an open source API and implementation for crawl frontiers
Formal Metadata

Title: URL Frontier, an open source API and implementation for crawl frontiers
Title of Series: Berlin Buzzwords 2022
Number of Parts: 56
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/67201 (DOI)
Language: English
Transcript (English, auto-generated)
00:07
Thanks for coming. My name is Julien. It's great to be back in Berlin. I was at the very first Berlin Buzzwords years and years ago, so it's a pleasure to be here. My talk this year is about URL Frontier, an open source API and implementation for Crawl
00:25
Frontiers. What is a crawl frontier? It's very simple: it's basically the information that a web crawler has about the URLs it has visited and the ones it still needs to visit. You can think of it this way: you start with a number of seed URLs, your
00:47
starting points for the crawl, and then the crawler iteratively expands it as it discovers new pages. Some links are recursive, some are not followed. The frontier expands over
01:03
time as the crawl grows. That's basically what a crawl frontier is. My motivation for this work was that although there are plenty of open source crawlers around, like Apache Nutch
01:23
and StormCrawler, all these solutions have their own approach to how they deal with the frontier; each has its own solution. What I was trying to do with this work was to see whether we could find a common approach to dealing with a crawl frontier. Could we
01:41
find a consensus on what an API would look like? What actions do web crawlers typically perform when they deal with the frontier? And ideally, could we come up with a good implementation that would be useful for crawlers, something scalable and robust that people could use?
02:04
URL Frontier as a project is funded by an organization called NLnet, based in the Netherlands. It was funded as part of the NGI Zero Discovery program, and I was very fortunate to get funding not once but twice for this work. The initial project
02:24
ran last year, and we're now in the second iteration of the project. It is open source, under the Apache license, and it's a sub-project of Crawler Commons which, if you don't know it, is worth a look. It's a great little project about
02:43
providing resources in Java that web crawlers can use to do the things they typically do, like parsing sitemaps, parsing robots.txt, and so on. Having URL Frontier as part of Crawler Commons made a lot of sense. The project is pretty straightforward.
03:05
It's organized into four different modules, the main one being the API itself. There's also a service implementation, or rather implementations, plural, a command line interface tool, and a test suite, which is used to check that the implementations
03:24
behave as expected. Let's start with the API. The API is defined using gRPC. With gRPC, you define the services and the messages used for the services using protocol
03:43
buffers. gRPC is high-performance and uses HTTP/2, and it is also cross-language: from that language-neutral definition in protocol buffers, you can generate code in various programming languages. One thing that gRPC also gives us is the ability
04:05
to have streaming methods, which is really useful for web crawlers. From the gRPC definition, and we'll have a closer look at it in a minute, we can generate code, which is useful for writing a client or a URL Frontier service. This is deployed on Maven Central, so if
04:26
you use Java, you can very easily import the library.
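To make that concrete, here is a minimal sketch of what connecting to a URL Frontier service from Java might look like once the gRPC stubs have been generated. The package and class names, as well as the host and port, are assumptions for illustration; the authoritative source is the project's .proto file and the artifacts it publishes.

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// Hypothetical names for the classes generated from the URL Frontier .proto;
// check the crawler-commons/url-frontier repository for the real ones.
import crawlercommons.urlfrontier.URLFrontierGrpc;
import crawlercommons.urlfrontier.URLFrontierGrpc.URLFrontierBlockingStub;

public class FrontierConnectionSketch {

    public static void main(String[] args) {
        // Plaintext channel to a frontier instance assumed to run locally.
        ManagedChannel channel = ManagedChannelBuilder
                .forAddress("localhost", 7071)
                .usePlaintext()
                .build();

        // Blocking stub produced by the protobuf/gRPC compiler; streaming
        // methods are also available through an asynchronous stub.
        URLFrontierBlockingStub frontier = URLFrontierGrpc.newBlockingStub(channel);

        // ... call ListQueues, GetURLs, PutURLs, etc. on the stub ...

        channel.shutdown();
    }
}
```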
04:44
The main concept in URL Frontier is that the URLs are organized into queues. Each queue has a key, which can be anything you want, but it's typically the host name of the URL, or the domain, or the IP address, whatever you want. Within the queue, the URLs have a priority. This priority is like a scheduling date: it's when you want this URL to be fetchable by a crawler. For instance, you could say that a URL needs
05:07
to be revisited in two days' time. Imagine there's been an HTTP error or something: you could reschedule it for the future. Within the queue, the URLs are sorted in this way. Also, what you get with URL Frontier is that it enforces some sort of politeness.
05:25
To give you an example, it tracks the URLs currently being processed by crawlers and will make sure that no more URLs from a particular queue are handed out while those are still being processed. It also enforces a reasonable delay between calls for a particular queue. This way, you won't
05:46
be in a situation where the crawler is getting loads of URLs for a single host; instead you get a good diversity of sources, which is ideal for web crawlers. It can crawl politely.
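To make the queue model just described concrete, here is a tiny, purely illustrative helper showing how a crawler might derive a queue key from a URL using the host name. The frontier itself treats the key as an opaque string, so domain-based or IP-based keys work exactly the same way; this helper is not part of the URL Frontier API.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class QueueKeySketch {

    // Derive the queue key for a URL. Here we simply use the host name,
    // but as noted in the talk it could equally be the registered domain
    // or the IP address; URL Frontier does not interpret the key.
    static String queueKey(String url) throws URISyntaxException {
        String host = new URI(url).getHost();
        return host == null ? "unknown" : host.toLowerCase();
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(queueKey("https://www.example.com/page.html")); // www.example.com
        System.out.println(queueKey("https://example.org/a/b?q=1"));       // example.org
    }
}
```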
06:03
With URL Frontier, you delegate part of the politeness logic of the crawler to the frontier, and you make your crawler code a little bit simpler. This is a quick overview of some of the methods defined by the URL Frontier API.
06:22
As you can see, it's things like ListQueues, where you ask URL Frontier to give you the list of all the queues it has internally. The two main ones, obviously, are GetURLs and PutURLs, where you get URLs out of and into the frontier. Let's have a closer look at GetURLs. This is what the message sent by
06:45
the crawler or by the client looks like. You can define a maximum number of URLs per queue to retrieve, also a maximum number of queues to get results from in one go.
07:01
You can also query for a particular queue by giving it a key, and so on. What you get in return is a stream of messages like this one, where you have the URL to be fetched, the key it came from, corresponding to the queue, and arbitrary metadata. That's typically the stuff the crawler has previously found out about the URL, which you can store in the frontier, and so on.
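As a sketch of that exchange, and assuming message and field names along the lines of what is described here (the real ones are defined in the project's protocol buffer file), a client using the blocking stub could ask for a batch of URLs like this:

```java
import java.util.Iterator;

// Hypothetical generated classes, as in the earlier connection sketch.
import crawlercommons.urlfrontier.URLFrontierGrpc.URLFrontierBlockingStub;
import crawlercommons.urlfrontier.Urlfrontier.GetParams;
import crawlercommons.urlfrontier.Urlfrontier.URLInfo;

public class GetURLsSketch {

    // Ask the frontier for work: at most 10 URLs from any single queue,
    // drawn from at most 50 queues in one call. Field names are assumptions.
    static void fetchBatch(URLFrontierBlockingStub frontier) {
        GetParams request = GetParams.newBuilder()
                .setMaxUrlsPerQueue(10)
                .setMaxQueues(50)
                // .setKey("example.com")  // optionally restrict to one queue
                .build();

        // Server-streaming call: URLs arrive as the frontier selects them,
        // so the client can start working before the whole batch is ready.
        Iterator<URLInfo> urls = frontier.getURLs(request);
        while (urls.hasNext()) {
            URLInfo info = urls.next();
            System.out.println(info.getKey() + " -> " + info.getUrl());
            // info.getMetadataMap() carries whatever was stored with the URL
        }
    }
}
```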
07:24
Just to illustrate the basic interactions between a crawler and the frontier: again,
07:42
the initial step, very often called the injection, is to inject the seed URLs into the frontier. You usually do that on the command line, using the client, and you'd call the PutURLs method. Then, when the crawler kicks in and starts doing its work,
08:03
let's see how it interacts with the frontier. Here, I'm not talking about a particular crawler, although it does look a bit like StormCrawler in the way it breaks things down. I've just split it into three abstract steps. The first one, the spout, is the bit that
08:21
gets the work to do. It gets the URLs, so it queries URL Frontier for work to do and gets, as I said earlier, a stream of URLs. That's good because it's not really blocking; it just gets work to do as it comes, which is nice. Then, the spout would typically pass
08:44
that on to a fetcher to get the pages. The fetcher can then update URL Frontier: imagine you have a redirection, it could go straight to the frontier and say, right, I'll update this URL, it had HTTP code so-and-so, and so on.
09:02
Then, often, you get onto the next stage of parsing the document, where we extract the outlinks. That's where we stream loads of information to the PutURLs method, for every single outlink that has been found within that page. Then, typically, the crawler does whatever it's meant to do. Often, it's
09:25
indexing into Solr or Elasticsearch or the engine of your choice. Then, at the end, it will just update the information for each URL it has completed, in the same way.
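The sketch below illustrates those two kinds of updates, again with hypothetical message names (the authoritative shapes are in the project's .proto file): a newly discovered outlink sent to PutURLs, and an already-known URL marked as processed with a refetch date two days in the future, the "revisit in two days" case mentioned earlier. PutURLs being a streaming call, a real client would push items like these through a stream obtained from the asynchronous stub.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical generated message classes; the actual names and fields are
// defined in the URL Frontier protocol buffer file.
import crawlercommons.urlfrontier.Urlfrontier.DiscoveredURLItem;
import crawlercommons.urlfrontier.Urlfrontier.KnownURLItem;
import crawlercommons.urlfrontier.Urlfrontier.URLInfo;
import crawlercommons.urlfrontier.Urlfrontier.URLItem;

public class PutURLsSketch {

    // A newly discovered outlink: the frontier decides whether it is new
    // and which queue it goes to, based on the key sent along with it.
    static URLItem discovered(String url, String key) {
        URLInfo info = URLInfo.newBuilder().setUrl(url).setKey(key).build();
        return URLItem.newBuilder()
                .setDiscovered(DiscoveredURLItem.newBuilder().setInfo(info))
                .build();
    }

    // A URL the crawler has just processed: schedule it to become
    // fetchable again in two days' time.
    static URLItem completed(String url, String key) {
        URLInfo info = URLInfo.newBuilder().setUrl(url).setKey(key).build();
        long refetchEpochSeconds = Instant.now().plus(2, ChronoUnit.DAYS).getEpochSecond();
        return URLItem.newBuilder()
                .setKnown(KnownURLItem.newBuilder()
                        .setInfo(info)
                        .setRefetchableFromDate(refetchEpochSeconds))
                .build();
    }
}
```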
09:44
As I mentioned earlier, there's also the command line interface. It's very straightforward, implemented in Java using the code stubs generated from the API, from the protocol buffer file we saw earlier. It's not something you'd use for crawling; it's more, as we've seen, for the seed injection, but also for debugging and
10:03
monitoring the state of the frontier. You'll notice the commands mimic the methods defined in URL Frontier. Now, the other interesting bit, of course, is the implementations. The first two listed were done during the first
10:25
phase of the project last year. I'll skip the in-memory one; it's not very interesting, it's not scalable, it's just for testing. The one that was the end result of the first phase of the project last year was an implementation based on RocksDB, which is persisted: if you turn the frontier off and then you
10:46
turn it back on, it hasn't lost anything. But this one is not distributed and the content is not replicated. Still, it is pretty solid: as we'll see in a minute, it's been used quite heavily and with good results.
11:01
Now, this year, with the second phase of the project, there's been more focus on trying to distribute things, and we have two implementations, well, two and a half. One is a distributed version of the RocksDB implementation. That's, again, persisted, and there's distribution, but it's not replicated, which means that if one frontier instance dies, then the section of the frontier
11:25
that it's in charge of won't be available until that instance restarts. There's also one based on Apache Ignite, which does the replication. So if an instance of the frontier were to die, then the data it has is replicated
11:44
by another node, and that node will take over in serving the results. So these two implementations are still work in progress. The sharded RocksDB one is probably a bit more advanced compared to the Ignite version. One other thing worth mentioning is that
12:04
there's another implementation which is not, strictly speaking, part of the project, but is based on OpenSearch. It's some work I'm doing for a client of mine, Presearch. If you don't know Presearch, check them out. They have pretty ambitious plans. They're pretty cool.
12:24
As part of the work I'm doing with them, we did that implementation using OpenSearch, and that one ticks all the boxes: it's persisted, but also distributed and replicated. It uses a totally different approach from the other ones, but all of them, of course, are totally compatible with the API.
12:44
And every single one of them will work in the same way as far as your crawler is concerned. So using it is straightforward: you can just pull the Docker image and then, with one command,
13:02
just run it, and you have a working instance of the frontier which you can use straight away for crawling. As part of the first phase of the project, the last bit was to check it at scale with a large-scale crawl,
13:21
just to make sure that things worked and that some of my assumptions were valid. For that, I used StormCrawler, which I mentioned earlier, a distributed web crawler based on Apache Storm and something I've been working on for the last eight or nine years. For this experiment, the hardware was provided by Fed4FIRE,
13:49
which was a European project. So we got sponsoring from them to do that. And the crawl was running on the cluster of machines
14:00
on a testbed called Virtual Wall 2 in Belgium. We had Apache Storm installed on those nodes, which is needed for StormCrawler: five nodes for crawling and a single node to host URL Frontier, because at the time, in version one of the project,
14:21
we had only the non-distributed version, and the implementation was the RocksDB one. The hardware on Virtual Wall 2 was relatively modest, nothing to write home about, but still pretty decent.
14:41
The code of the experiments is available on GitHub. What we did was take some stats from the Common Crawl project; again, if you don't know about it, have a look, it's absolutely fantastic. They provide for free, on a monthly basis, billions of web
15:02
pages that you can use for your experiments or for whatever you want. And by the way, they use StormCrawler as well for one of their crawls. Starting from that top 1 million seed list, we crawled with a maximum depth of five steps
15:20
from the seeds, and generated web archives (WARC files), the standard used by the web archiving community, on Amazon S3. The idea is that Common Crawl will, at some point, make those available as part of their datasets as well. So we really wanted not just to run the experiment to make sure that URL Frontier worked,
15:41
but also to make sure that the outcomes were usable. So we let that run for several weeks, on and off, because I was fixing bugs as I was finding them. By the end of it, we had fetched 354 million URLs, and 1.2 billion URLs had been discovered but not yet fetched, of course, because of the politeness.
16:02
If you hit a pretty heavy site in terms of volume, then processing it politely does take time. And yes, we ended up with a pretty large RocksDB on disk and nearly 37 terabytes of WARC files on S3. So that was quite a success.
16:21
All of this, the roughly 1.5 billion URLs, was held on a single instance of URL Frontier with RocksDB, and that was pretty good; it proved that it was pretty solid. Now, on to the current work with URL Frontier: as I mentioned, there's a new version of the project.
16:45
And we're getting there; we're nearly finished, I think. The first stage was about adding functionality to make it easy to monitor and report things with URL Frontier, and
17:01
to make it more, although I hate the term, enterprise-like. That's done: for instance, being able to export metrics to Prometheus and display them with Grafana, that sort of stuff. There's also a bit about multi-tenancy, so having, within one frontier, different, let's call them logical crawls.
17:25
That's done as well. I've already mentioned the discovery and clustering. The bit we're working on now is making it even more robust and even more resilient in case of failure, so that's work in progress.
17:41
Next steps: I'll probably be running another similar large-scale crawl to make sure that it works fine in distributed mode, but also getting more crawlers to use URL Frontier. Obviously we already have some crawlers, but there's also been some work
18:01
on using it with crawler4j. If you look at my company's website, DigitalPebble, we have a blog, and one of our guests wrote a post about how he used URL Frontier with crawler4j. It was great because it validated that it allows you to reduce the amount of code you have in your crawler
18:22
by just delegating it. And there have been initial attempts to use it with Heritrix and Scrapy. I'm hopeful that there'll be more in that direction. But yeah, what I really want to increase
18:41
is getting more people involved. I've listed on the wiki a number of ways in which people can get involved. It ranges from just giving it a try to talking to your local dev group about it, and so on. So yeah, you'll find them listed over there.
19:01
And yeah, that's it from me today. Thank you very much for your attention. Thank you for the talk, Julien. So maybe we can have one quick question and then we go to the break. I'll bring you the mic.
19:22
Thanks, Julien, for the nice talk and the great work. I was wondering what your thoughts are on URL de-duplication and URL normalization. Where do you see them fitting into your architecture? Normalization, did you say? Yeah, de-duplication and then normalization. Okay, so the de-duplication is something that,
19:42
yes, the frontier takes care of. Obviously, if you send the same URL twice, the implementation has to recognize that it already knows that URL. So that's taken care of. The normalization is something, same as for the filtering, for instance, where you could decide not to keep some URLs.
20:01
The normalization is part of that, but it's external to URL Frontier. It's more the crawler's responsibility to normalize the URLs and filter out the ones that it doesn't need. But the de-duplication, yes, it is assumed that a valid URL Frontier service implementation will take care of the de-duplication.
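As a purely illustrative sketch of that crawler-side responsibility (this helper is not part of URL Frontier, and real crawlers such as Nutch or StormCrawler ship far more complete URL normalizers and filters), a minimal normalization pass before sending URLs to PutURLs might look like this:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class NormalizationSketch {

    // A deliberately simple normalization: lower-case the scheme and host,
    // drop the fragment, and remove a default port. Anything the crawler
    // does not normalize away is treated as a distinct URL by the frontier.
    static String normalize(String url) throws URISyntaxException {
        URI u = new URI(url);
        String scheme = u.getScheme() == null ? null : u.getScheme().toLowerCase();
        String host = u.getHost() == null ? null : u.getHost().toLowerCase();
        boolean defaultPort = ("http".equals(scheme) && u.getPort() == 80)
                || ("https".equals(scheme) && u.getPort() == 443);
        int port = defaultPort ? -1 : u.getPort();
        return new URI(scheme, u.getUserInfo(), host, port,
                u.getPath(), u.getQuery(), null).toString(); // null drops the fragment
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(normalize("HTTP://Example.COM:80/a/b#section"));
        // -> http://example.com/a/b
    }
}
```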
20:23
All right, so for the sake of time, I'll direct you to Julien. I think he'll be around, so you can catch up with him. Let's thank Julien one more time. Thank you.