
Vector Mosaicking with GeoServer


Formal Metadata

Title
Vector Mosaicking with GeoServer
Number of Parts
156
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
The vector mosaic data store is a new feature in GeoServer that allows indexing many smaller vector stores (e.g., shapefiles, FlatGeobuf, GeoParquet) and serving them as a single, seamless data source. This has the advantage of cost savings when dealing with very large amounts of data in the cloud, as blob storage bills at a fraction of an equivalent database. It is also faster for specific use cases, e.g., when extracting a single file from a large collection and rendering it fully (e.g., tractor tracks in a precision farming application). Attend this presentation to learn more about vector mosaic setup, tuning, migration from large relational databases, and real-world experiences.
Transcript: English (auto-generated)
So, yeah, I'm going to talk about vector til... sorry, not vector tiles at all, but vector mosaics. It's all server-side; there are no vector tiles involved. First of all, a quick shout-out to my company, GeoSolutions. We are based in Italy, with offices in the United States, and we provide support, custom development, and core development for a number of open source projects.

We are open at our core: we are part of OSGeo, and we participate actively in OGC through testbeds and through standards which are important to GEOINT, which would be the federal government in the United States. Okay, so let's have an introduction to the problem. Let me tell you a couple of little stories.
Example one: a use case at EUMETSAT. EUMETSAT is the European organization that manages some of the satellites that we funded, like Copernicus and the like. They have this data set called ASCAT, which comes from NOAA.
ASCAT provides a set of wind vectors which are collected every 90 minutes, so a high collection rate, and it's a time series: we have one collect every 90 minutes for the past 20 months, which means right now we have almost 10,000 times in the series. Each collect averages more or less 200,000 points; we have some that have 500,000 and some that only have a hundred thousand, but they are kind of big.
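To sanity-check those figures, a quick back-of-the-envelope in Python (the 20-month span, 90-minute cadence, and average collect size are the numbers just mentioned):

```python
# one collect every 90 minutes, kept for roughly 20 months
minutes = 20 * 30.44 * 24 * 60        # ~20 average-length months, in minutes
collects = minutes / 90
print(round(collects))                     # -> 9741, i.e. "almost 10,000" times
print(round(collects * 200_000 / 1e9, 1))  # -> 1.9 (billion points overall)
```

Close to two billion rows is why a single table eventually struggles.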
We have time navigation in the client, powered by some of our protocol extensions that allow us to say: okay, there's time data here, you can drill into it and figure out which times you have in a particular area, and so on. The points here are nicely classified by level of detail, so you don't really need to render all those 200,000 points: you have 10 classification levels, so you start by drawing just a few of them, then more and more, and eventually you reach the level at which we render wind barbs. Well, only at that level do we render every single point, but over a small area.
Now, we are storing this as a PostgreSQL/PostGIS table, and if you run the numbers you will find that we are storing lots and lots of points in a single table. That became a performance problem. Eventually we started partitioning the table over time, so that the table is basically split physically into sub-tables, and that improved things a lot. But in terms of growth we are stumbling into issues: there is a limited number of partitions that you can efficiently manage, especially on older versions of PostgreSQL (PostgreSQL got better at partitioning over time).
The cost is growing; it just keeps going up. As I said, the client provides time navigation, so we need to quickly grab the range of available times, and eventually the actual times. We use an extension in GeoServer called WMTS Multidimensional, where we have a request which is not standard, called DescribeDomains. It basically says: okay, I have this bounding box, and maybe these other variable values; can you tell me the list of times that apply in this situation, the list of times that we actually have?
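As a sketch, such a request could be assembled like this; the endpoint, layer name, and bounding box are invented, and the parameter names follow my reading of the WMTS Multidimensional community module, so check the extension's documentation for the exact ones:

```python
from urllib.parse import urlencode

# sketch of a DescribeDomains request to the WMTS multidimensional extension;
# endpoint, layer and bbox values are invented for illustration
params = {
    "service": "WMTS",
    "version": "1.0.0",
    "request": "DescribeDomains",
    "layer": "ascat:wind_vectors",
    "tileMatrixSet": "EPSG:4326",
    "bbox": "-10,35,5,45",
    "domains": "time",
}
url = "https://example.com/geoserver/gwc/service/wmts?" + urlencode(params)
print(url)
```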
It responds, and we use that to power a time navigation bar. So we need very quick time aggregates, and doing the aggregates over that partitioned table takes forever; so we have a secondary lookup table.

Example two: precision farming. Tractors moving in the field, collecting data at high frequency, like every minute: how many seeds they sowed, how much product they deployed,
information about the engine, like RPM, how much fuel, and so on; you get a ton of attributes which are collected once per minute. One collect can grow pretty big: I asked the customer in question to send me the largest that they had on file, and they threw back this thing that has four million little rectangles, for one collect of one tractor on one field. Now multiply this by the number of trips for that tractor, by the number of tractors that a farm might have, and by the number of customers that you might be following that have all these tractors, and again the numbers pile up very quickly. To give you an idea: that thing that seems to be drawn with chalk, as I zoom in on that little corner, I actually start seeing the polygons. So it's literally millions of tiny rectangles put together to form something that I might print out and hang on my wall. So, as I said: multiple tractors, multiple collects, multiple customers, and so on. And in this case we don't even have a good classification of the points.
So we are basically supposed to render them all every time, and when you have four million, well, that becomes a little bit of a problem.

In summary: we have one massive vector data set, organized in slices, sub-data-sets by time, by customer, by tractor; there is an organization principle behind it, but generally speaking I can take one sub-file out of the massive set of data. And we want stable performance for data extraction: give me quickly all the data from that slice. We want quick aggregates: give me all the times that are available in the data set, or all the times that are available in a certain bounding box,
and so on. And, well, we wouldn't like to pay a fortune for it, so we want to contain costs. Just to give you an idea, here is some reference that I found on the internet about how much it costs to use Amazon Aurora for Postgres: it's 0.10 dollars per gigabyte per month, plus you pay for the requests, plus you pay for backups, plus you pay for IOPS, and so on; it piles up really quickly. If we were able to store the same data in S3, the storage cost alone would go down five times. So there's an interesting incentive there to store the data in S3, if I can, rather than in Aurora.

And so, enter the vector mosaic plug-in. At GeoSolutions we have a long history of using the image mosaic plug-in, the one that allows you to take several raster images and put them together.
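Putting rough numbers on the Aurora-versus-S3 storage comparison above: the $0.10/GB-month Aurora figure is from the talk, while the S3 Standard price of about $0.023/GB-month is my assumption based on public list prices.

```python
aurora_gb_month = 0.10   # $/GB-month, Aurora storage (figure quoted in the talk)
s3_gb_month = 0.023      # $/GB-month, assumed S3 Standard list price
print(round(aurora_gb_month / s3_gb_month, 1))  # -> 4.3
```

Storage alone comes out four to five times cheaper, before even counting Aurora's per-request, backup, and IOPS charges.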
Now, we don't use it only to put images together in space, but also to filter images in time. If you have a time series of satellite collects, for example, you can store references to all the images in a table, which is the index of the mosaic: you reference each file, and then you can say, I would like to get the Sentinel-2 collect of this particular date. You scan the index, which is fast even if there are millions of entries, you get one file reference, and boom, you render it. It's really quick.

So we designed the vector mosaic store just after that concept. We have one, or eventually more, index tables that point to external files, which can be stored on your file system, on the cloud, or wherever you decide to put them; I'll show you some examples. The index has a reference to the file and its location in space, sure, but also whatever extra attributes you want.
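A toy sketch of such an index in pure Python (all names and values invented; the real index lives in a spatially indexed table in an actual store):

```python
# each row of the index: granule URL, footprint bbox, plus filter attributes
index = [
    {"url": "s3://farm/t1_2024-05-01.gpkg",
     "bbox": (8.0, 44.0, 8.5, 44.5), "tractor": "T1", "day": "2024-05-01"},
    {"url": "s3://farm/t2_2024-05-01.gpkg",
     "bbox": (9.0, 45.0, 9.5, 45.5), "tractor": "T2", "day": "2024-05-01"},
]

def intersects(a, b):
    """True if two (minx, miny, maxx, maxy) boxes overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def lookup(index, bbox, **attrs):
    """Find granules whose footprint intersects bbox and whose filter
    attributes match; only those files would then be opened and rendered."""
    return [r["url"] for r in index
            if intersects(r["bbox"], bbox)
            and all(r.get(k) == v for k, v in attrs.items())]

print(lookup(index, (7.9, 43.9, 8.6, 44.6), tractor="T1"))
# -> ['s3://farm/t1_2024-05-01.gpkg']
```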
Time, tractor ID, customer ID, and so on: that identifies the one slice of vector data that you want to take out of the large pile. And we are mixing the attributes. Here is an example of an index table that points to various GeoPackages: it has the URL and the file footprint, but then also a time, a tractor ID, and so on; and each GeoPackage has the attributes of each and every tiny rectangle. When we put them together, we basically merge the filter attributes of the index with the attributes of the GeoPackage and come up with a summarized view of the data.

Now, which stores can you use for the index and for the satellite files? Well, the mosaic store is actually designed in a very generic way, so it really doesn't care.
In GeoServer there is this abstraction of a store: something that can tell you which tables you have and fetch data out of them, and it hides the fact that that something is a GeoPackage, or a FlatGeobuf, or PostgreSQL, or Elasticsearch, and so on. So you can pretty much do whatever you want with the vector mosaic plug-in: you can have whatever index store you want and whatever satellite store you want. But there are some combinations that work better than others, so let me tell you about them. When we configure a vector mosaic, we tell it which store is the index store, and we go and pick another store which is already configured in GeoServer.
What is a good index store? Well, it can index the search fields, so I can quickly locate the files to open based on time, bounding box, customer ID, whatever my search field is; I need to be able to do that very quickly, so indexing is important. And I need to be able to compute aggregates quickly: give me the list of all the times available in the current bounding box, or all the times available for this customer, and so on. Quick aggregates: that's the other characteristic that I want out of
an index store. A DBMS would do both of them, so I'm not trying to kick PostgreSQL completely out of the picture; I'm still going to use it most of the time, but just to store the index, rather than storing hundreds of millions of records. And then there is the target, quote-unquote, file store. What format would be a good choice for it? Well, it can really be anything, and we played with shapefile, GeoPackage, FlatGeobuf, and also your old database, the one that you are trying to get rid of. It depends on your use case.
A couple of those lines might make you think that I'm flipping mad, so I need to explain myself a little bit. Shapefiles, in 2024? Really? Well, the original funder for this activity is already generating shapefiles out of their tractors: the data comes out of the tractor as shapefiles, and they had no reason to convert them into anything else. So yeah, they are using a PostgreSQL index that points to shapefiles. And shapefile is not such a bad format in the end: it has been a reliable workhorse for the past 27 years. In GeoServer, if you want to render a ton of lines and points, it is the second fastest format anyway, because we have optimized the decoder that reads it to the bone, and it's way faster than, for example, PostgreSQL.
Well, it has downsides: you cannot store it on S3, it has to be on a file system, local or network; and it doesn't have alphanumeric indexes, so you cannot sub-filter inside the shapefile efficiently. But if it fits your use case, why not?

GeoPackage. GeoPackage is an OGC standard: a simple, self-contained database with all your vector data. Can you store your data in GeoPackage? Sure you can, and it's very good if you need to sub-filter inside the GeoPackage, because it has internal indexes. However, it's not as fast as shapefile is, and your storage still needs to be a local disk or a network disk: no S3, no blob storage.

FlatGeobuf. FlatGeobuf is a performant binary encoding for geographic data, and it's actually the fastest GeoServer format if you want to render everything at once: it has an internal spatial index, and the format is designed to be very quick to go through. In addition, you can store FlatGeobufs on your local network, but also over HTTP, S3, whatever. So this one allows you to get rid of all the costs and move to blob storage.
And the old database. You might say: wait, didn't you start by saying you want to get rid of it? Sure, I want to get rid of it, but maybe I cannot do it overnight, because I already have two terabytes of data in it. So, if I'm managing a moving time window, maybe I want to keep the old database around and just chop away at the old time slices in the moving-window management, while adding the new slices in the new format, rather than doing one massive migration. If I want to flip over to the vector mosaic in a week rather than in a month, I can play this kind of game and just reference the old tables, with some filters, from the index of the mosaic.
A performance note. Remember that insane four-million-polygon test data set? These are the rendering times: PostGIS, 113 seconds; shapefile, 41 seconds; FlatGeobuf, 36 seconds. Sure, it's still too much for any practical use (you'd still use some tile caching), but FlatGeobuf is about three times faster than PostGIS.
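For the record, the speedups implied by those benchmark numbers:

```python
postgis, shapefile, flatgeobuf = 113, 41, 36   # seconds, from the talk's test
print(round(postgis / flatgeobuf, 1))   # -> 3.1, "about three times faster"
print(round(postgis / shapefile, 1))    # -> 2.8
```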
Where to go from here? Well, go and grab the vector mosaic store. It's a community module, available since version 2.23, and usage is growing, so hopefully it's going to become an extension, where you can find releases, soon. We have been doing a bunch of optimizations, both in the vector mosaic module itself and in the FlatGeobuf format, so it's best if you try it out on 2.25, where all the optimizations we made are available.

What about GeoParquet for granules? GeoParquet is becoming popular for storing files: it's column-oriented, it's very well compressed, and it's cloud native as well. There's only one problem: it doesn't have a spatial index. So you might not want to use it to store everything. But you don't have to, because with the vector mosaic you can take the large file, split it into zones (spatial zones, in this case), and let the index do the spatial search for you. And that's it: now you have a cloud-native format with very good compression. You just have to design your own spatial partitioning algorithm here.
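A minimal sketch of one such scheme in pure Python: a recursive quadtree split that stops once a tile holds few enough points. All names and thresholds are invented; this mirrors the idea, not any shipped GeoServer code.

```python
def quadtree_split(points, bbox, max_points=100):
    """Recursively split bbox into four quadrants until each tile
    holds at most max_points points; returns (bbox, points) tiles."""
    minx, miny, maxx, maxy = bbox
    inside = [(x, y) for x, y in points
              if minx <= x < maxx and miny <= y < maxy]
    if len(inside) <= max_points:
        return [(bbox, inside)] if inside else []
    midx, midy = (minx + maxx) / 2, (miny + maxy) / 2
    tiles = []
    for quad in ((minx, miny, midx, midy), (midx, miny, maxx, midy),
                 (minx, midy, midx, maxy), (midx, midy, maxx, maxy)):
        tiles.extend(quadtree_split(inside, quad, max_points))
    return tiles

# two far-apart points with max_points=1 end up in two separate tiles
print(len(quadtree_split([(1, 1), (9, 9)], (0, 0, 10, 10), max_points=1)))  # -> 2
```

Each resulting tile would become one granule file, with its bbox recorded as the footprint in the mosaic index.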
I imagine one based on quadtrees that stops whenever the tile has few enough points, stuff like that. This is not available in GeoServer; it's something looking for sponsors.

Question: Do I understand from that that, had I said I want to draw all the red squares from that four-million data set, PostGIS would have won?

Yes and no, it depends. Good question. For example, in the case of ASCAT, I have the LODs: I have nine different zoom... sorry, nine different types, which I want to display.
Like all kinds of cloud data management endeavors, you have to pre-process your data to make it available and fast for one particular use case. So if you wake up in the morning and you just want to render all the reds, that's a problem, because you don't have fast search: you have to read the whole file. But if I know in advance that I want to do that kind of operation, then I split the file and have the reds on one side and the yellows on the other, or, in the case of ASCAT, have nine different files. So when I start opening up a particular file and I just want the arrows that I would see when looking at the entire globe, I only open LOD 0, which is a very tiny file, and then it's fast. What you don't want to do is search within a FlatGeobuf; but if you can predict your use case, then you slice your files to optimize for those use cases. That's very common of all
cloud data organizations: you make them fast for one particular use case, rather than being generic, all-encompassing stuff like databases. Any other questions?

Question: How does the reference from the index to the granule in the Postgres database look?

Well, it depends. It can be as easy as just a URL, so you just point to the location of your file. But if it's a database, or a reference to a table, or any store whose configuration is more complex, it's a tiny property file: key-value pairs with the host, port, username, password, or a JNDI reference, or whatever GeoServer accepts as a set of configuration parameters to talk to one particular store. For example, in the case of ASCAT, we do have one JNDI reference to a connection pool, which is the same for all of them, and then I have a filter on the side saying: of that table, I want to take only the records that match this time range.
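For illustration, such a granule-reference property file might look like the following; the key names mirror GeoServer's usual PostGIS store connection parameters, but treat them as an assumption and check the store's documentation:

```properties
# hypothetical granule reference: instead of a plain file URL, the index row
# points at a small property file describing how to reach the database
dbtype=postgis
host=db.example.com
port=5432
database=ascat
user=geoserver
passwd=secret
# alternatively, a JNDI reference to a shared connection pool, e.g.
# jndiReferenceName=java:comp/env/jdbc/ascat
```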
Question: Okay, so what a granule is, a file or a database table, is not fixed?

Right. The image mosaic, sorry, the vector mosaic, is designed so that it's completely agnostic to the nature of the stores that it's going to open, and it can be a mix of many things. I could have a mix of GeoPackages and FlatGeobufs and other sources. Think about having a hierarchy of storage: you keep the old data on cold storage, which is very cheap, and there you want a very well compressed data source, because transferring stuff out is slow; and maybe for the fresh stuff you use a local disk with whatever format is efficient for that.
Question: How close is this to being an open standard? I mean, could it be reimplemented in QGIS Server or something?

That's a good question. I don't know; it's something so simple, in fact, that I never thought about making it a standard, to be honest. But, for example, the image mosaic, which has been working off the same principles for many years, now has in GDAL sort of an equivalent, which I think is called GTI, a generalized tile index or something like that. It's like a VRT on steroids, which is basically doing whatever GeoServer is doing: adding multiple columns, filtering, and so on. So I can see something like that happening in OGR for vector data as well. There is already a tile index notion for both
rasters and vectors, but I don't think it uses attributes for filtering. I think that Yuka has a question, or an observation, probably.

Comment: Not a question but a comment: MapServer has been supporting this kind of system, this OGR tile index, for ages, and actually I used it something like 15 years ago. It's a very good idea.