Maintaining Spatial Data Infrastructures (SDIs) using distributed task queues
Formal Metadata
Title: Maintaining Spatial Data Infrastructures (SDIs) using distributed task queues
Number of Parts: 208
License: CC Attribution 4.0 International: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/41024 (DOI)
Transcript: English (auto-generated)
00:00
Good morning, this is Ben Lewis and Paolo Corti from the Center for Geographic Analysis at Harvard University. Today we are going to talk about two web applications we are developing, which are
00:22
WorldMap and HyperMap. Ben, do you want to say a word? So I'm just going to give a really quick background on this and then Paolo will go into the details. So our center, the Harvard Center for Geographic Analysis, its reason for existence is to
00:46
provide geospatial support to a big community; it's about 40,000 students, faculty, graduate students and staff across the school, and we also work with many groups
01:02
outside of Harvard as well in various ways. So we're just a total of nine people but we've got a potential sort of request shed of like 40,000 so we are always looking for ways to scale ourselves and one area
01:21
that we get into scaling ourselves is by building platforms. We do a lot of other things, we teach and we do consulting in various ways but we have gotten into a bit of platform development. Two of the platforms that we've developed are WorldMap and HyperMap.
01:45
WorldMap actually started quite a few years ago. It's a giant GeoNode instance, I think it's the biggest one in the world. And it's in the process of catching up with the mothership and
02:03
getting reunited with the current version of GeoNode. It's a general purpose mapping platform. Anybody in the world can create an account on it and upload data, and anything that's in there becomes a web service.
02:23
I'm often asked by higher-ups why we're hosting data for the world at Harvard, and so far just making the case that the best data in the world is not at Harvard actually works. And by the way, most folks are collaborating with people outside of Harvard, and
02:44
this makes it a lot easier. HyperMap is a map service registry. It actually got funded initially by the National Endowment for the Humanities. It's designed to make it possible to create and
03:03
maintain a large map service registry. It's really a set of tools for both being a registry and creating and maintaining a big registry. HyperMap uses Solr in some new ways for traditional SDIs.
03:22
And there's a strong emphasis on improving the search experience for end users looking through lots of data, lots of spatial data. In developing it, we brought in ace Solr developer David Smiley, who's at the conference but not here, to add a key new capability,
03:44
which was 2D faceting in Lucene and Solr. It's really sort of a big data visualization capability that Solr and Lucene now have, and we make use of it. We also improved temporal faceting and built a time miner to extract temporal
04:03
metadata, because we wanted to also make it possible for people to leverage the time dimension of what is often very heterogeneous, very sparse metadata. So the HyperMap approach of using spatial and temporal faceting actually inspired Boundless, and
04:23
Boundless used it as the basis for Boundless registry. I don't know how much of HyperMap they're still using, but that was a tremendous compliment for us. So WorldMap and HyperMap are loosely coupled. They can each run fine without the other.
04:41
But we built a plugin within WorldMap for enabling HyperMap to be its search capability. And I'm gonna just do a really quick demo of sort of how that search works. So you sort of get the idea of how a big,
05:00
what a big... consider this the early stages of figuring this out. But how a big SDI might provide search such that it would need some of the scaling capabilities that Paolo will be telling you about in a minute.
05:21
So, as a note, we're also working on a separate platform we'll be talking about on Friday. And the underlying technology there is really just a scaled-up version of what we developed for HyperMap. It's sharded Lucene/Solr to enable
05:43
interactive visualization of a billion space-time objects. We'll be giving a talk on that on Friday afternoon. So, a demo of HyperMap and WorldMap, let's see if this will allow... is that helpful?
06:23
Okay, so here we are in WorldMap, where you'd come in if you went to worldmap.harvard.edu. So click on Create a Map, and you'll go into a blank map template.
06:43
If you want to add layers, you pop this open and come to a search tool that currently has 112,000 layers in it. This is a distribution of all of the layers, dynamically created, and you can mouse over and see how many. It's basically a heat map of overlapping bounding boxes.
07:03
We're also doing a MapProxy thing where we're trying to let people see what a layer looks like instantly. So you get a sense of stuff, especially when metadata is light, as it is for a lot of map services. So this combines both local layers and remote map services.
07:21
So we did a keyword search for "oil", and you can instantly see the distribution of where there's stuff in the system related to energy. And again, we can get a sense of where the holdings are, mouse over. This date column is pretty well populated.
07:44
So we're clicking on layers and turning them on, and in the course of that, adding them to the shopping cart. We can zoom in here for a little more detail if we want. So we find stuff we like and we load it into the map. And that's really it.
08:01
Now, once we're in here, we're in a traditional spatial layer-based UI, and at this point we can save and share and so on. But we're doing so with a combination of local and remote layers. And with that, I'm going to pass it back to Paolo.
08:22
So let's give an overview of the technologies behind these two applications, WorldMap and HyperMap. We will focus mostly on the task queue component,
08:43
which we are using to do asynchronous processing. So one of the problems we are experiencing in WorldMap and HyperMap is that there are some kinds of operations which the client initiates from the browser
09:03
and for which the server response can be time demanding. Some examples of these kinds of interactions are, for instance, when the user's actions require fetching an external API
09:22
to harvest the metadata of an external service and its layers. We also have a search engine, which is based on Solr, which needs to be synchronized. This should be processed asynchronously.
09:41
We also have a gazetteer, which is composed from any layer the user allows to be part of the gazetteer. This is also a time-demanding action for the regular request-response cycle in the browser.
10:06
Two other typical use cases are when the user is uploading spatial datasets to the server, and when they are creating new layers using table joins. As long as the request-response HTTP cycle is fast,
10:25
and we are talking about a few hundred milliseconds, everything goes well and we don't have problems. But several features of the two web applications require a more time-demanding process
10:42
to be done in this cycle, and this cannot be done synchronously: it must be delegated to a task queue. A task queue is a system for parallel execution of tasks
11:01
in a non-blocking fashion, and this is a good solution for doing asynchronous processing in the context of a web application. So if we go back to the model of the request-response cycle, what we are doing in HyperMap and WorldMap is to send a task to a broker
11:21
for each kind of action in the web application which is time demanding. The task in the broker will at some point be processed by an asynchronous processor. In the code base,
11:42
we have some code, run by the client, which can be considered the producer of a task, and some other code which will be the consumer of the task. A message queue, or broker,
12:01
is a way to exchange messages between different services and applications. The work of these tasks can be distributed across threads and machines.
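The producer/broker/consumer split Paolo describes can be sketched in a few lines of plain Python, using a stdlib `queue.Queue` as a stand-in for the broker. In the real system the broker is RabbitMQ and the consumers are Celery workers; the task name here follows the `check_layer` example later in the talk, and everything else is a hypothetical miniature:

```python
import queue
import threading

broker = queue.Queue()  # stand-in for the real broker (RabbitMQ)
results = []

def producer(layer_id):
    # Producer: code run during the web request; it only enqueues
    # a small message and returns immediately.
    broker.put({"task": "check_layer", "args": [layer_id]})

def worker():
    # Consumer/worker: a separate thread (a daemon process in the
    # real system) pulls tasks from the broker and executes them.
    while True:
        message = broker.get()
        if message is None:  # sentinel to stop the worker
            break
        results.append("checked layer %d" % message["args"][0])

t = threading.Thread(target=worker)
t.start()
for layer_id in (1, 2, 3):
    producer(layer_id)
broker.put(None)
t.join()
print(results)  # ['checked layer 1', 'checked layer 2', 'checked layer 3']
```

The point of the pattern is that the producer returns immediately; the slow work happens on the worker's own thread (or machine), exactly as in the request-response cycle described above.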
12:21
Typically, in the context of a web application, the producer is a client application that creates messages based on user interaction, while the consumer is generally a daemon process which will consume those messages and execute the tasks. Let's define some glossary terms.
12:42
A task queue is a system for parallelization of tasks in a non-blocking fashion. A broker, or message queue, is an interface for message exchange. A producer is the code that places tasks in the broker. A consumer, or worker, is the code
13:01
that pulls tasks from the broker and processes them. The exchange component is responsible for taking messages from the producer and routing them to zero or more queues. This is the message routing. Tasks must be consumed faster than they are produced.
13:20
If that is not the case, we need to add more workers, or consumers. There are several use cases for task queues. We are mainly using the task queue to process web requests asynchronously, but task queues can also be used for interaction
13:41
between heterogeneous application services in a microservices architecture. They can also be used as a replacement for crontab when processing periodic operations. And they can be used as a way of doing
14:03
multiprocessing, or parallelizing tasks. Task queues generally also provide a monitoring process, so it's possible to analyze the tasks in a nice way.
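The "parallelizing tasks" use case can be illustrated on a single machine with the standard library; a task queue generalizes this same fan-out across processes and machines. The `slow_task` function here is a hypothetical stand-in (in HyperMap it could be fetching one remote service's metadata):

```python
from concurrent.futures import ThreadPoolExecutor

def slow_task(n):
    # Hypothetical time-demanding task; imagine a network fetch here.
    return n * n

# Run the tasks concurrently instead of one after the other,
# analogous to several workers consuming from one queue.
with ThreadPoolExecutor(max_workers=4) as pool:
    squares = list(pool.map(slow_task, range(5)))

print(squares)  # [0, 1, 4, 9, 16]
```

`pool.map` preserves input order in its results even though the tasks run concurrently, which is convenient when the caller needs the outputs matched back to the inputs.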
14:20
Let's see some typical use cases in a web application which generally require a task queue. Thumbnail generation is one of the most common cases. When a user uploads an image to a content management system, it's generally required that thumbnails are generated.
14:41
This should be delegated to a task queue. Sending bulk email is another process. Fetching large amounts of data from APIs. Performing time-intensive calculations. Running expensive queries. Search engine synchronization. Interaction between applications and services.
15:03
And replacing cron jobs. These are all typical use cases in web applications which are well handled through the use of a task queue. Focusing on geospatial web applications and portals, typical use cases are, for example, in GeoNode:
15:23
the user uploading a shapefile or a GeoTIFF to the server, and the generation of thumbnails for layers and maps. In HyperMap, we harvest OGC services, and this should be done asynchronously.
15:42
Also, table join operations are a kind of example where the interaction between user and client can generate a long process. And geospatial data maintenance. Here we have a look at different configurations. The most simple architecture of the task queue
16:02
is just one server containing the producer, the broker, and the consumer. Then we can scale the producers out to different servers, or we can scale both the consumers and the producers to different servers, keeping the broker still on the same server.
16:23
There are many broker implementations; many of them are open source. At CGA, for these two projects, HyperMap and WorldMap, we are using RabbitMQ, and for another project, which is a platform, Kafka and Zookeeper.
16:41
We also need a task processor. HyperMap and WorldMap are two Django web applications, and we opted for the de facto standard for web applications based on Python, which is Celery. Celery can use different brokers,
17:01
and we have opted for RabbitMQ. Celery focuses very well on real-time operation, but it can support scheduling as well. Tasks can be executed concurrently
17:21
on one or more worker servers. It provides support for different brokers, most typically RabbitMQ and Redis. It's written in Python, but it can be used from any language, and it provides very good integration with Django. In fact, Celery was originally written as a Django application
17:44
and now it's more general. There are also good monitoring tools. Flower is a great web application which lets administrators inspect the task queue in real time. RabbitMQ is a message broker
18:00
and it's very widely deployed. It supports many message protocols and many operating systems and languages. The Python client which can be used with it from Celery is Pika. RabbitMQ itself is written in Erlang.
18:23
This is an overview of the task queue architecture. We have some code, the producer, which is generally triggered by the user of the web application. The producer sends the task to an exchange node,
18:44
which will route the task to one or more queues in the message broker. And then we have the consumers, the Celery workers, which are Python code as well, pulling the tasks from the broker
19:02
and executing them. It's not mandatory, but it's very useful to store the results in a result backend. This is the user interface of Harvard HyperMap. We are making massive use of the task queue here
19:23
because we have several operations which cannot be handled synchronously between the user and the server. Here we have a slide depicting the architecture of HyperMap and WorldMap.
19:41
WorldMap is based on GeoNode, so it provides a catalog. We are using GeoNetwork as the catalog right now. And GeoNode provides a map engine which is based on GeoServer and GeoWebCache. And we use a relational database
20:03
based on Postgres/PostGIS to store vector data, and we use the file system to store raster data. Thanks to GeoNode, in the mapping client we can generate maps using OGC standards like WMS, WFS and WMTS.
20:24
These are the layers which are local to our system, and we are talking about more than 25,000 layers. We also have almost 200,000 layers which are remote and which are stored in the HyperMap relational database.
20:45
We keep the relational database synchronized with a search engine which is based on Solr. And thanks to the Solr REST API, we provide the search features for the remote layers in the mapping client,
21:04
as we have seen before with WorldMap. HyperMap also provides a catalog which can be used to query metadata in an OGC-standard way. The catalog is based on pycsw.
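As an illustration of how the mapping client might hit the Solr REST API mentioned above, a keyword query like the "oil" search from the demo can be assembled as a plain HTTP GET. The host, core name and field names below are assumptions for illustration, not the actual HyperMap schema:

```python
from urllib.parse import urlencode

# Hypothetical Solr core and fields; the real names depend on the
# HyperMap configuration.
base_url = "http://localhost:8983/solr/hypermap/select"
params = {
    "q": "oil",       # keyword search, as in the demo
    "rows": 10,       # page size
    "wt": "json",     # response format
}
query_url = base_url + "?" + urlencode(params)
print(query_url)
```

A real deployment would add spatial and temporal facet parameters on top of this, which is exactly the 2D and temporal faceting work described earlier in the talk.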
21:20
We are using MapProxy to cache most of the layers. And of course we use the task queue, typically to keep the search engine in sync with the relational database and to fetch the APIs of the external services.
21:43
We are fetching both OGC services and Esri services. So let's have a very quick view of what is happening. This is the typical use case in which the request cannot be handled synchronously,
22:00
because in the request we are fetching a WMS endpoint, so we are parsing the GetCapabilities document, and this can be slow because this document can be large and the server can respond in an untimely fashion. And also we are iterating over all of the layers
22:21
to store their information in our database. So the best thing is to send this task to the task queue. So we will have some producer code which is activated in the web request. This code, which is named the producer, will send a message to the queue.
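A sketch of why this work is slow and what the producer enqueues instead: iterating the layers of a GetCapabilities document can take a while for large services, so the web request only sends a small message naming the task. The XML below is a made-up, drastically simplified (namespace-free) miniature, and the message fields follow the `check_layer` example in the talk:

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a real (often multi-megabyte) GetCapabilities
# response; real WMS documents are namespaced and far larger.
CAPABILITIES = """
<WMS_Capabilities>
  <Capability>
    <Layer><Name>roads</Name></Layer>
    <Layer><Name>rivers</Name></Layer>
  </Capability>
</WMS_Capabilities>
"""

def iterate_layers(xml_text):
    # The slow part: walking every layer of the remote service
    # to store its information in the database.
    root = ET.fromstring(xml_text)
    return [name.text for name in root.iter("Name")]

# Producer side: instead of doing the parsing inline, the web
# request just enqueues a small message identifying the layer.
message = {"id": "uuid-1234", "task": "check_layer", "args": [42]}

print(iterate_layers(CAPABILITIES))  # ['roads', 'rivers']
```

The worker later receives the message, fetches the real endpoint, and runs the iteration outside the request-response cycle.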
22:42
This is the typical Celery message we can have. So we store the ID of the message in the task queue, the task name, check_layer in this case, and other arguments. Most important here is the ID of the layer to be checked. This way, at some point, the worker,
23:01
and the Python process in the worker, which is named the consumer, will consume the task. This Python code will take the ID of the layer from the task and will process the layer. We also use Celery as a replacement for cron jobs,
23:20
which is a good thing, because we are storing the cron jobs in the relational database, and this way it's much easier to redeploy the system when we need to. Django is very well integrated with Celery, so it provides a nice and convenient administrative interface to create and maintain scheduled jobs.
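A periodic job expressed as a Celery beat schedule entry, replacing a crontab line, looks roughly like this in a Django settings module. The task path, interval, and layer id are assumptions for illustration; with django-celery-beat the same schedule can live in the relational database and be edited in the Django admin:

```python
# Hypothetical Django settings fragment: run the (assumed)
# check_layer task every hour for a given layer.
CELERY_BEAT_SCHEDULE = {
    "check-layer-hourly": {
        "task": "hypermap.tasks.check_layer",  # assumed task path
        "schedule": 3600.0,                    # every hour, in seconds
        "args": (42,),                         # hypothetical layer id
    },
}
```

Keeping the schedule in settings or in the database, rather than in the system crontab, is what makes redeployment easier: the jobs travel with the application.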
23:43
We use two tools to monitor the processes. One tool is the htop Linux command. From this tool we can see each CPU and what each single worker is doing.
24:01
And then we use Flower, which is a web application and is handy because it shows in real time all the tasks which have been processed by the task queue, and for each task it provides useful information like the kind of task, the parameters and the runtime. So now we are done, and if there is time we are willing to answer questions.
24:28
I think there is one there.
24:44
Do you have an answer for this? Solr already had powerful spatial capabilities as well. It didn't have 2D faceting. And we already knew... it was also driven by David Smiley. We knew him, we knew he was great,
25:01
and he was interested in this challenge. We are not choosing Solr over Elasticsearch; we'll let other people worry about that. We ended up going with Solr. I think Boundless has rewritten that stuff in Elasticsearch.
25:28
So, on one of your slides you showed various workers or tasks in your Solr workers. The index layer task, is that actually reading a layer
25:41
and looking at all the features? Yes. So the question is how we handle this interaction between the relational database and the Solr database. We are indexing one document for each layer, and we also plan to index one document
26:03
for each feature at some point. This way it will be possible to do a full-text search of features.