
Processing and publishing big data with GeoServer and Azure in the cloud


Formal Metadata

Title
Processing and publishing big data with GeoServer and Azure in the cloud
Number of Parts
351
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Year
2022

Content Metadata

Abstract
The amount of data we have to process and publish keeps growing every day; fortunately, the infrastructure, technologies, and methodologies to handle such streams of data keep improving and maturing. GeoServer is a web service for publishing your geospatial data using industry standards for vector, raster, and mapping. It powers a number of open source projects like GeoNode and geOrchestra and is widely used throughout the world by organizations to manage and disseminate data at scale. We integrated GeoServer with some well-known big data technologies like Kafka and Databricks, and deployed the systems in the Azure cloud, to handle use cases that required near-real-time display of the latest received data on a map as well as background batch processing of historical data. This presentation describes the architecture put in place, and the challenges that GeoSolutions had to overcome to publish big data through GeoServer OGC services (WMS, WFS, and WPS), finding the correct balance that maximized ingestion performance and visualization performance. We had to integrate with a stream-processing platform that took care of most of the processing and of storing the data in an Azure data lake that allows GeoServer to efficiently query for the latest available features, respecting all the authorization policies that were put in place. A few custom GeoServer extensions were implemented to handle the authorization complexity, the advanced styling needs, and the big data integration needs.
Transcript: English (auto-generated)
Hello, good morning everyone. This presentation is about publishing and processing big data with GeoServer and Databricks in Azure. I'm Nuno Oliveira, a software engineer at GeoSolutions. As you may know, we deal with a couple of open source projects, some of the main ones being GeoServer, MapStore, GeoNode, and GeoNetwork, and we embrace open standards in everything we do; not only do we embrace them, we also participate in the testbeds to improve them as well. Okay, so I will start this presentation by discussing a bit what big data actually is.
Why? Because usually the definition we hear for big data is the typical three V's: a lot of data (volume), coming in very fast (velocity), with a lot of variation (variety). But in practical terms, this doesn't necessarily mean we actually need big data technologies to handle our use case, and that is what you are going to see during this presentation. I much prefer the practical definition from Wikipedia: big data is when the current system we have basically cannot handle it anymore. That's when we need to think about the big guns. And of course we also need to take into account the functionality:
what do we need to display to the user? Do we really need to keep all the data we are receiving in the system, or only a portion of it? That's what decides which type of technology we should use. Let me now present the use case that I will use for all the demonstrations during this presentation. It's about maritime data; long story short: vessels at sea, aids to navigation, anything that is maritime-related, is positioned on the seas and oceans, and emits a position or transmits any kind of information. We get all of those events, and we have to process them and display them in several types of scenarios.
In terms of numbers, in 24 hours we get around 50 million positions, actually around 60 million as of today, and we have to deal with half a million ships per day. And of course we have peaks of activity during certain times: typically during the night in certain areas, during the day near the ports. It really depends; for a fishing vessel, when they go out fishing and when they come back to port. And of course we have the historical data, which is quite substantial: seven years of data, around 125 billion positions. Okay, that's the data set we have to deal with.
Okay, and well, this is the interesting bit. We have all of this information, which is quite big and quite varied, and it allows us to handle a lot of different use cases. It can go from maritime traffic monitoring, which is the most obvious one I would say, where you basically want to see the vessels at sea and what they are doing. We have search and rescue: we'll see that even the aircraft going out to sea for a search and rescue operation transmit their positions. There are the aids to navigation, things that sit on the sea, where we typically want to notice if they are moving around, if they are staying in the same place, if there is some kind of issue with them. And of course we have to enrich all of this data: if I have a fishing vessel, knowing only that the fishing vessel is at that location is not enough; I want to know what kind of permit it has, what it is fishing, what its port is, where it is going, what it is doing, that kind of information. And of course all of this needs to be interoperable with a couple of other systems, and we implemented all of it with GeoServer and OGC services.
So we implemented several scenarios with this data. Visualizing the positions in real time, which means we want to know the very last known position of a vessel, where it is at sea, what it is doing; this is useful for tracking a particular vessel or for understanding the activity in a port or a busy area. We want density maps to understand what the typical routes are that vessels are taking. Visualizing in real time the aids to navigation systems: is there an issue with them, are they deviating from their usual position, that kind of thing. And of course the detected ship positions, to understand if someone is doing something bad and just turned off their sensor; we detect that with a satellite, and then we have to correlate all the data. And then the electronic navigational charts and the historical ship positions visualization, where we basically say: look, I want to know what these vessels were doing three years ago, on the 5th of January, in the middle of the Indian Ocean. Of all these use cases, I will show two of them.
The first one, why? Because there we deal with real time, so we actually don't care about historical data: we want to handle the data that is coming in and display it very efficiently. While for the last one, we actually have to deal with all the historical data: seven years of data, 125 billion positions.
So, I was forgetting about this: in all these use cases, we have to deal with authorization. There is a very extensive authorization system where, say, someone from Portugal cannot see the terrestrial positions reported by France, or cannot see the positions in the area of Gibraltar; it's really a huge variety of authorization rights. Why am I explicitly having slides for this? Because this means that we cannot pre-compute anything. Depending on the user, the image that will be displayed will be completely different, so no pre-computation can be done at all. If you pre-compute the tracks, well, guess what? The user doesn't have the authorization rights for them, so we have to rebuild them again.
So the first use case is visualization in real time: the latest position for each vessel within the last 24 hours. Every time a vessel reports its position, we have to update the system, and whenever someone does a WFS or WMS request, display the position for that vessel. Of course, this depends on the authorization rights: if a user cannot see the very last position, they should see the latest one they are authorized to see, which in practical terms means you have to store multiple positions per vessel, matching the cardinality of the authorization rights; just that optimization would be a presentation on its own (see the sketch below). Anyway, this system has been deployed in Azure. It is designed to receive 5K positions per second, so around 432 million per day (5,000 × 86,400 seconds). And of course, positions are enriched with several data sets, of which the most significant is the fisheries one.
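To make that cardinality point concrete, here is a minimal sketch of what a latest-position table keyed by vessel and authorization tier could look like; the actual schema is not shown in the talk, so every name below is hypothetical.

```python
# Hypothetical sketch: keep one "latest position" row per vessel and per
# authorization tier, so a user who may not see the newest position still
# gets the newest one they are allowed to see. Names are illustrative only.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS latest_positions (
    mmsi       BIGINT           NOT NULL,  -- vessel identifier
    auth_tier  SMALLINT         NOT NULL,  -- one row per authorization tier
    ts         TIMESTAMPTZ      NOT NULL,
    lat        DOUBLE PRECISION NOT NULL,
    lon        DOUBLE PRECISION NOT NULL,
    PRIMARY KEY (mmsi, auth_tier)
);
"""

UPSERT = """
INSERT INTO latest_positions (mmsi, auth_tier, ts, lat, lon)
VALUES (%s, %s, %s, %s, %s)
ON CONFLICT (mmsi, auth_tier)
DO UPDATE SET ts = EXCLUDED.ts, lat = EXCLUDED.lat, lon = EXCLUDED.lon
WHERE EXCLUDED.ts > latest_positions.ts;  -- only ever move forward in time
"""

with psycopg2.connect("dbname=maritime") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
        cur.execute(UPSERT, (235098765, 1, "2022-08-25T10:00:00Z", 48.1, -5.2))
```

A query for a given user then simply reads the row for that user's tier, so no per-request recomputation of visibility is needed.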
And of course, positions are enriched with several data sets, which the most significant one is the fisheries one. So this is the deployment on Azure. I have put there, let's say, the Azure VMs. I still want to make the math about the cost. Depends on your subscription, on your, let's say,
whatever deal you have with Microsoft. Long story short, we have a Kafka cluster, we have an ingestion cluster, a Postgres database, and then we have GeoServer that is deployed with Kubernetes. Okay? Interesting bits. The Postgres database is the core of the system
because it's under a lot of stress. So it gets a lot of writes per second and a lot of reads per second. Okay? So we can't use this, like, for example, using spatial indexes because it will be very good to read the data, but it will be very inefficient to write the data. And you may be wondering why we have 8 terabytes of data.
So do we have 8 terabytes for 24 hours? Not really. It's like 200 gigabytes. But to have the necessary IOPS in Azure, we need to push us that huge amount of disk so the machine gives us the necessary networking, the necessary IOPS we need to be super-efficient.
And, of course, we have the ingestion cluster, which deals with Kafka. Basically, the positions land on Kafka, they are processed, they go on Kafka again, and then another ingestion component reads from Kafka and stores them in Postgres. Why? So we can have buffers and a proper back-pressure mechanism between all these components.
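A minimal sketch of that last hop, assuming a kafka-python consumer that drains processed positions into Postgres in bounded batches; topic, table, and connection details are made up for illustration.

```python
# Sketch of the ingestion step that moves processed positions from Kafka
# into Postgres; all names are hypothetical.
import json
import psycopg2
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "positions.processed",          # hypothetical topic of processed positions
    bootstrap_servers="kafka:9092",
    group_id="postgres-writer",
    enable_auto_commit=False,       # commit offsets only after Postgres commit
    max_poll_records=500,           # bounded batches give natural back-pressure
)
conn = psycopg2.connect("dbname=maritime user=ingest")

while True:
    batch = consumer.poll(timeout_ms=1000)
    rows = [json.loads(rec.value) for recs in batch.values() for rec in recs]
    if not rows:
        continue
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO vessel_positions (mmsi, ts, lat, lon) "
            "VALUES (%(mmsi)s, %(ts)s, %(lat)s, %(lon)s)",
            rows,
        )
    conn.commit()
    consumer.commit()  # offsets advance only once the data is safely stored
```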
Okay? So this improves the stability of the system, its monitoring, and so on. Well, that's basically what I said: the key here was finding the balance between writing and reading, so spatial indexes were not an option; I will explain in a moment what we used instead. And then we have an extensible mechanism of processing rules: computing positions, filtering positions, that kind of thing. Okay, so for the indexing, this is something we still have to discuss with the GeoServer community; it's a new extension we would like to contribute. It basically allows telling GeoServer on the fly: look, I don't have geometries in the database, but I have latitude and longitude columns, so build the geometry for me. All the spatial operators available in GeoServer work transparently, but behind the scenes we are sending queries based on latitude and longitude. This way we can use a numerical index, which is super-efficient, and we got top performance in both writing and reading.
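To give an idea of what that translation looks like, the sketch below rewrites a bounding-box filter into plain numeric range predicates over latitude/longitude columns, which an ordinary b-tree index can serve; table, column, and index names are hypothetical.

```python
import psycopg2

# A composite b-tree index is cheap to maintain under heavy writes, unlike a
# spatial index, while still serving bounding-box style queries:
#   CREATE INDEX idx_positions_lat_lon ON vessel_positions (lat, lon);

def positions_in_bbox(conn, min_lon, min_lat, max_lon, max_lat):
    """What a spatial BBOX filter becomes once rewritten to lat/lon predicates."""
    sql = """
        SELECT mmsi, ts, lat, lon
        FROM vessel_positions
        WHERE lat BETWEEN %s AND %s
          AND lon BETWEEN %s AND %s
    """
    with conn.cursor() as cur:
        cur.execute(sql, (min_lat, max_lat, min_lon, max_lon))
        return cur.fetchall()

conn = psycopg2.connect("dbname=maritime")
rows = positions_in_bbox(conn, -10.0, 35.0, 5.0, 45.0)  # Bay of Biscay-ish
```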
Yeah, advanced authorization, as I said: we had to build several extensions, SQL views; we had to really craft the scales very carefully, that kind of thing. So this is basically what the final product looks like. This is the maritime picture for Europe, where we can see the vessels drawn according to their type. As you can see, even at this zoom level we are drawing the orientation of each vessel, which is a very costly operation in terms of styling. This is more of the same, but now based on age. This is basically displaying only the fishing vessels, depending on their gear type. This is a real-time aids to navigation system. This is an aircraft doing a search and rescue operation; we can see that it was circling around the place of the accident. These are aids to navigation, the stuff we typically see in ports; it's basically useful to detect if they are moving, meaning they have an issue and are not at the place they should be.
Advanced projections, in this case the polar projection. And advanced styling: this is tricky because we have moving objects, so we cannot just reuse a couple of requests; each request we make needs to apply the styling we want. In this case, it was about highlighting things; for example, here I think we are highlighting all the cargo vessels that have not reported in the last 10 minutes. Okay, this is a video; normally it should work. There we go. So, of course, we cover the whole world; this is the real-life performance. Initially, when we built the system, it was only meant to have such an advanced style around the coasts. But since the system was very fast, they wanted to see it at the world level, because for someone with experience looking at maritime data, just knowing the grouping of vessels and the orientation of the vessels already gives a lot of insight.
So here we are basically just navigating around. And now we are going to apply a couple of filters, just to show all the enrichment that was performed behind the scenes. We are going to get, if I'm not wrong, all the cargo vessels around Europe. As we can see, there are quite a lot, and we can definitely see the routes: most of them come from China and go to Rotterdam. And now we are going to look at the ones that have not reported in more than 10 minutes; that's a bit concerning, they should be reporting more often. We are now going to fishing vessels. As we can expect, most of them are around the coast, and we have the big ones in the middle of the ocean. And we can now filter by the type of permit they have; ah no, by the type of vessel, the FAO type, so these are the longliners, if I'm not wrong. There are really a lot of vessels that fall into a category, especially the ones that can go to the middle of the ocean. And here are the ones that are permitted to fish swordfish. So typically, someone monitoring fisheries, with the proper layers set up, will be able to understand if the vessels are in the right place or not.
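As an illustration, a filter like the silent-cargo-vessels one can be expressed as an ordinary WMS GetMap request with a CQL filter; the layer, style, and attribute names below are invented for the sketch.

```python
import requests

params = {
    "service": "WMS",
    "version": "1.3.0",
    "request": "GetMap",
    "layers": "maritime:latest_positions",  # hypothetical layer name
    "styles": "vessel_orientation",         # hypothetical style name
    "crs": "EPSG:4326",
    "bbox": "30,-15,60,10",                 # minlat,minlon,maxlat,maxlon in WMS 1.3.0
    "width": "1024",
    "height": "768",
    "format": "image/png",
    # "cargo vessels that have not reported in the last 10 minutes";
    # the client computes the cutoff instant as now minus 10 minutes
    "CQL_FILTER": "ship_type = 'CARGO' AND last_report BEFORE 2022-08-25T09:50:00Z",
}
resp = requests.get("https://example.org/geoserver/wms", params=params)
open("silent_cargo.png", "wb").write(resp.content)
```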
So this was the first use case. The takeaway is that we deal with a lot of data, we are receiving a huge amount of data, but we don't really need big data technologies for it, because our use case is to display only the last 24 hours, and Postgres is perfectly able to give us the necessary writing capabilities and the necessary reading capabilities. Now for historical vessel tracks. Historical vessel tracks are another story, because we have seven years of data, and we may want to go anywhere in time to look for the data we need.
So we implemented this system, once again available through OGC services. The typical use cases: a user wants to see what a vessel or a group of vessels did in the last month or five years ago, or what vessels were in a particular area. For, let's say, the last seven days, it needs to be blazing fast, sub-second, and we need the full resolution. For anything older, we can use down-sampling, or we can wait around one minute to get the data; but once we get the data, after that initial load time, it needs to be super-efficient. And taking into account that one typical vessel in six months will report around 400K positions, it's definitely something we can keep in memory and browse interactively. This is still a prototype, something that will eventually be contributed to GeoServer; we still have to discuss it with the community. What we have here is basically an Azure data lake with Databricks in front of it. We have a cluster of GeoServers, a Postgres database (we'll see later what we are using it for), and Hazelcast for coordination in our cluster. We are using the Databricks SQL endpoint at this stage; we don't have jobs running there (well, actually we do, but that's not relevant for this presentation). And we use that SQL endpoint to query the Azure data lake: we send SQL, Databricks translates that into an Apache Spark job, and that reads the data from the data lake.
So, Databricks in one minute, because I'm running out of time. Long story short, it now provides two SQL endpoints; the new one is the Photon engine, which is quite fast but also a lot more expensive, and it's compatible with Apache Spark, which I think is the main engine used behind the scenes. As I said, we send an SQL query, and it gets translated into an Apache Spark job that reads the data it needs from the data lake. On top we have Apache Sedona, which allows us to use in our SQL spatial operations very similar to the PostGIS ones, so we can have the intersects, that kind of thing.
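To give a feel for it, here is a sketch of sending such a spatial query through the Databricks SQL endpoint from Python, assuming Sedona's SQL functions are registered on the cluster; hostname, warehouse path, token, and table name are placeholders.

```python
from databricks import sql  # pip install databricks-sql-connector

# Six months of positions intersecting an area of interest; ST_Point and
# ST_Intersects are Apache Sedona SQL functions, assumed available.
QUERY = """
SELECT mmsi, ts, lat, lon
FROM historical_positions
WHERE ts BETWEEN '2021-01-01' AND '2021-06-30'
  AND ST_Intersects(
        ST_Point(lon, lat),
        ST_GeomFromWKT('POLYGON ((-10 35, 5 35, 5 45, -10 45, -10 35))'))
"""

conn = sql.connect(
    server_hostname="adb-1234567890.azuredatabricks.net",  # placeholder
    http_path="/sql/1.0/warehouses/abc123",                # placeholder
    access_token="dapi-...",                               # placeholder
)
cursor = conn.cursor()
cursor.execute(QUERY)     # Databricks turns this into a Spark job
rows = cursor.fetchall()  # results stream back once the job completes
cursor.close()
conn.close()
```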
And of course it supports raster and vector; I could do a whole presentation about this alone. So the concept is that we use Databricks to let us send SQL, and it reads the data for us. OK, now another important aspect. We already saw it: does all big data require big data technologies to be handled? Not really, but some of it does, and we need to take this into consideration too. Technologies like Databricks, with Hadoop at a lower level, are super fast: they have a lot of machines, a lot of memory, so things are super fast. But the scale is completely different from a relational database. While in a relational database a query that takes one second is already too long, in such a system taking a minute is not that long. And we also need to take the infrastructure into account: you can have such a big cluster running 24 hours per day, always ready for you,
so things are super fast. Can you do it? Yes, but it will cost you an absolute fortune. Well, typically these clusters are on demand. You run a query that is quite expensive, and the platform says: OK, I need to instantiate 10 more VMs, I need to allocate resources, and this takes time. That's why it's a completely different time scale: a database is sub-second, such a system is sub-minute. And of course, it then depends on how the data was laid out. Which indexes were in place? Well, an index in big data typically means partitioning: the way the data was partitioned and stored.
Did we use a columnar format or a tabular format? All of that has a huge impact on the performance of our queries. Our use case in particular was quite interesting, because this data lake supports so many use cases that we cannot just go there and find the one perfect partition schema that will make all our queries fast. We can't, because they are just too different: we have use cases that query by vessel ID, some by spatial area, some by time. And again, we have seven years of data, 125 billion positions, so we can't just do whatever we want.
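For example, partitioning by time lets temporal queries skip most of the files but does nothing for a lookup by vessel ID, which is exactly the trade-off just described. A minimal Spark sketch, with hypothetical paths:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("positions-layout").getOrCreate()

df = spark.read.parquet("abfss://raw@lake.dfs.core.windows.net/positions/")

# "Index" here means physical layout: a columnar format (Delta/Parquet)
# partitioned by year and month, so time-filtered queries prune partitions.
(df.withColumn("year", F.year("ts"))
   .withColumn("month", F.month("ts"))
   .write
   .partitionBy("year", "month")
   .format("delta")
   .mode("overwrite")
   .save("abfss://curated@lake.dfs.core.windows.net/positions/"))
```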
Now for a quick demo. In this use case, we are going to see the vessels for basically six months of data. We are going to ask GeoServer to transparently load six months of data from Databricks and, behind the scenes, automatically cache it for us in Postgres. So the first time, the request takes time; after that, it's just served by GeoServer from Postgres behind the scenes. I will skip this slide. OK, looks like we missed the video. That's good. OK, here we go.
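What is described here is essentially a cache-aside pattern in front of Databricks. A simplified sketch, with every name and connection string hypothetical (real code would also bind the Databricks parameters properly instead of string formatting):

```python
import psycopg2
from databricks import sql as dbsql

pg = psycopg2.connect("dbname=trackcache")

def vessel_track(mmsi, start, end):
    # 1. Try the Postgres cache first: the fast path after the initial load.
    with pg.cursor() as cur:
        cur.execute(
            "SELECT ts, lat, lon FROM track_cache "
            "WHERE mmsi = %s AND ts BETWEEN %s AND %s",
            (mmsi, start, end),
        )
        rows = cur.fetchall()
    if rows:
        return rows
    # 2. Cache miss: ask Databricks (slow, possibly sub-minute) ...
    conn = dbsql.connect(server_hostname="adb-123.azuredatabricks.net",
                         http_path="/sql/1.0/warehouses/abc123",
                         access_token="dapi-...")
    cur = conn.cursor()
    cur.execute(f"SELECT ts, lat, lon FROM historical_positions "
                f"WHERE mmsi = '{mmsi}' AND ts BETWEEN '{start}' AND '{end}'")
    rows = cur.fetchall()
    conn.close()
    # 3. ... then warm the cache so later requests are served by Postgres.
    with pg.cursor() as cur:
        cur.executemany(
            "INSERT INTO track_cache (mmsi, ts, lat, lon) "
            "VALUES (%s, %s, %s, %s)",
            [(mmsi, ts, lat, lon) for ts, lat, lon in rows],
        )
    pg.commit()
    return rows
```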
So let me try to play it. Here we go. This is for two vessels, six months of data, around half a million positions: GeoServer got the data from Databricks and loaded it into Postgres. And we can see here how fast it is: we can see all the data, and the performance is really good. That's basically it. If someone wants to see a more advanced demo, feel free to pass by the GeoSolutions booth.
I have some other interesting use cases to show there. I guess that's it; I have run out of time. We also have a use case where we select the positions for, say, six months of data in a particular area, so we can see all the vessels that were there; the amount of information is insane.
OK. The next step is basically to contribute this to GeoServer. And that's it. Thank you so much.