
Serverless Planet-scale Geospatial with Protomaps and PMTiles


Formal Metadata

Title
Serverless Planet-scale Geospatial with Protomaps and PMTiles
Title of Series
Number of Parts
266
Author
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
FOSS4G 2023 Prizren

Protomaps is a simple, self-hostable system for tiled vector datasets. In the year since last FOSS4G, we've rolled out a new compressed specification (V3), added support for tile generation tools, and open sourced key integrations with content delivery networks. This talk will give an overview of:
- Why you might want to, or not want to, deploy Protomaps for your application
- PMTiles write support in the popular Tippecanoe and Planetiler tools
- The new open source integrations of Protomaps with AWS Lambda and Cloudflare
- Overview of real-world deployments for users in web GIS, journalism and the public sector
Transcript: English (auto-generated)
So, what is Protomaps? It is a free and open source map of the world. And some of the unique things about this project are that it is a trivially self-hostable vector tile project based on OSM.
It enables use cases like a vector-based map that is customizable on the client, for web mapping and for mobile mapping. And at the core of Protomaps is an open and serverless file format called PMTiles.
As a very quick example of how this works, if you have a tileset of just Prizren, maybe for a cartographic base map, that could be as small as 100 tiles, maybe one megabyte. But the same file format also works at the planet scale. So I provide a free download of a planet scale base map.
It is 1.4 billion tiles that are addressed. So it covers the entire planet from zoom 0 to zoom 15. And it is a little over 100 gigs. And the header part that Vincent was talking about is about 300 megs. So the same library and the same techniques work from city scale all the way to
the planet. So PMTiles is just one of many cloud native or cloud optimized or serverless formats. But what do those things mean? It's a little bit buzzwordy. And on cloudnativegeo.org, there is a nice description, which is that a cloud native
data format uses HTTP range requests to access just one part of the file, instead of having to download the entire thing to disk first. You can read a small header and then access a tile at the beginning or in the middle or at the end.
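As a rough illustration of the range-request idea described above, the Python sketch below builds the `Range` header a client would send and simulates the storage side against an in-memory byte string. The function names and the tiny fake archive are assumptions made up for this example; they are not part of any PMTiles library.

```python
def range_header(offset: int, length: int) -> dict[str, str]:
    """Build an HTTP Range header for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def serve_range(archive: bytes, header: dict[str, str]) -> bytes:
    """Simulate what an object store returns for a single byte range."""
    spec = header["Range"].removeprefix("bytes=")
    first, last = (int(part) for part in spec.split("-"))
    return archive[first:last + 1]


# Hypothetical archive: a small header followed by three "tiles".
archive = b"HEADER" + b"tile-A" + b"tile-B" + b"tile-C"

header = range_header(offset=12, length=6)
print(header)                        # {'Range': 'bytes=12-17'}
print(serve_range(archive, header))  # b'tile-B'
```

Only the six requested bytes come back, whether they sit at the beginning, the middle, or the end of the file.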
So what's the benefit of doing things in that way? One good reason is that you can host these archives as a single file on these commodity object stores. The most popular one is a service like Amazon S3. There are also numerous providers on public clouds like Azure and Cloudflare.
And this is all built using commodity tech, which is just HTTP. And as a diagram for how this works right now, so on the top is your PMTiles archive on S3 as a single file.
And when you access a single tile in the web browser through a client for which plugins exist, like OpenLayers, MapLibre GL or Leaflet, asking for a single square tile will correspond to a range request in this archive. So it's only returning that small chunk of the entire file.
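The mapping from one tile to one byte range can be sketched as a lookup in a sorted directory. This is a deliberately simplified model: the entries, offsets and the `tile_to_range` helper are hypothetical, and the real PMTiles directory is a more compact binary structure than the plain list used here.

```python
from bisect import bisect_left
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    tile: tuple[int, int, int]  # (z, x, y)
    offset: int                 # byte offset of the tile in the archive
    length: int                 # size of the tile in bytes


# Hypothetical directory, sorted by (z, x, y) so it can be binary-searched.
DIRECTORY = [
    Entry((0, 0, 0), offset=1000, length=400),
    Entry((1, 0, 0), offset=1400, length=250),
    Entry((1, 0, 1), offset=1650, length=300),
    Entry((1, 1, 0), offset=1950, length=275),
    Entry((1, 1, 1), offset=2225, length=310),
]
KEYS = [entry.tile for entry in DIRECTORY]


def tile_to_range(z: int, x: int, y: int) -> Optional[str]:
    """Return the Range header value for a tile, or None if it is absent."""
    i = bisect_left(KEYS, (z, x, y))
    if i == len(KEYS) or KEYS[i] != (z, x, y):
        return None
    entry = DIRECTORY[i]
    return f"bytes={entry.offset}-{entry.offset + entry.length - 1}"


print(tile_to_range(1, 0, 1))  # bytes=1650-1949
print(tile_to_range(5, 3, 3))  # None
```

The client only needs the directory and one range request per tile; the object store itself knows nothing about tiles.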
So this is sort of the baseline mode for how to use PMTiles. The architecture is very simple. There's only an object store, a single file, and then a client involved. There's also an optional enhancement to this, which is to add a serverless function in the middle.
What this does is it will accelerate the application by caching these intermediate tiles. And it will also make those tiles backwards compatible. So you can serve a ZXY endpoint that's compatible with things like mobile clients while inside
of something like Amazon Lambda or Cloudflare Workers, you are accessing just one range of the planet file. And this is a new open source ecosystem that has been developed in the past few months. So my central point of this talk is to talk about why you might not want to do
things this way, because just calling it cloud optimized or cloud native does not mean it is the cheapest, fastest, or even easiest way to do this. And I'm going to talk a little bit about some alternatives to using PMTiles or to using a sort of cloud native design and how those might influence your decision to use
a certain stack in your project. So the first point is that these cloud native formats are never going to be the cheapest way to do things. Because the cheapest way to do things is actually to go buy a server.
If you have a fairly high traffic public application, the egress bandwidth of serving all this data is going to be a significant cost in the long term. And most cloud providers actually make their money based on egress. So if you buy from Amazon, they will charge you about $0.02 per gigabyte that's
outgoing. And beyond some certain amount of data, it is going to be more expensive than just buying a server that is in a data center. You can buy these from OVH, you can buy these from Hetzner for a flat fee. Let's say something like 30 euros a month will give you an unlimited bandwidth server
that is unmetered. It is a flat amount to serve an unlimited amount of data. Now that is limited in terms of the rate at which you can serve the data. But if you are only factoring cost as a consideration, this is the cheapest way to do it.
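The break-even point implied by those two figures can be worked out directly. The sketch below uses only the numbers quoted in the talk (about $0.02 per gigabyte of egress, about 30 euros a month for a flat-rate server) and, as a simplifying assumption, treats the euro and the dollar as the same unit. It also ignores storage and request charges and the rate limit of the flat server.

```python
# All amounts are in cents to keep the arithmetic exact.
EGRESS_CENTS_PER_GB = 2        # about $0.02 per outgoing gigabyte (from the talk)
FLAT_SERVER_CENTS = 30 * 100   # about 30 per month, flat (from the talk);
                               # assumption: 1 EUR is treated as 1 USD


def metered_cost_cents(gb_per_month: int) -> int:
    """Monthly egress bill on a metered cloud for the given traffic."""
    return gb_per_month * EGRESS_CENTS_PER_GB


def break_even_gb() -> int:
    """Traffic at which metered egress costs as much as the flat server."""
    return FLAT_SERVER_CENTS // EGRESS_CENTS_PER_GB


def cheaper_option(gb_per_month: int) -> str:
    """Name the cheaper option, considering bandwidth cost only."""
    if metered_cost_cents(gb_per_month) > FLAT_SERVER_CENTS:
        return "flat server"
    return "metered egress"


print(break_even_gb())        # 1500
print(cheaper_option(500))    # metered egress
print(cheaper_option(5000))   # flat server
```

Under these assumptions the flat server wins once traffic passes roughly 1.5 terabytes a month.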
And one other myth is that cloud optimized means that it is the fastest way to do things. In fact, it is competitive, but it's usually not close to a more traditional stack. If you run a server process and you access a SQLite database like MBTiles on disk, or
you access pre-prepared data in a PostGIS database, or even files on disk through a file system like ext4 or ZFS, those will generally get you back your data in single digit milliseconds, even on something cheap like an EC2 instance.
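For comparison, the traditional path is a single indexed query against a local database. The sketch below builds a tiny in-memory SQLite database with the `tiles` table layout that MBTiles uses (`zoom_level`, `tile_column`, `tile_row`, `tile_data`) and reads one tile from it. The sample tile and helper names are made up for illustration. MBTiles stores rows in the TMS scheme, so the usual XYZ `y` has to be flipped.

```python
import sqlite3
from typing import Optional


def open_demo_mbtiles() -> sqlite3.Connection:
    """Create an in-memory database shaped like an MBTiles file."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE tiles (zoom_level INTEGER, tile_column INTEGER, "
        "tile_row INTEGER, tile_data BLOB)"
    )
    db.execute(
        "CREATE UNIQUE INDEX tile_index "
        "ON tiles (zoom_level, tile_column, tile_row)"
    )
    # Hypothetical content: one tile at zoom 1, stored with a TMS row of 1.
    db.execute("INSERT INTO tiles VALUES (1, 0, 1, ?)", (b"north-west tile",))
    return db


def read_tile(db: sqlite3.Connection, z: int, x: int, y: int) -> Optional[bytes]:
    """Look up a tile by XYZ coordinates, flipping y to the TMS row."""
    tms_row = (1 << z) - 1 - y
    row = db.execute(
        "SELECT tile_data FROM tiles "
        "WHERE zoom_level = ? AND tile_column = ? AND tile_row = ?",
        (z, x, tms_row),
    ).fetchone()
    return row[0] if row else None


db = open_demo_mbtiles()
print(read_tile(db, 1, 0, 0))  # b'north-west tile'
print(read_tile(db, 1, 1, 1))  # None
```

There is no network hop between the server process and the data, which is where the latency advantage comes from.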
While if you interact with a cloud object store, 50 milliseconds is not even a pessimistic estimate for how long something takes to retrieve one piece of data. And in a lot of these cloud optimized designs like PMTiles or COG, you might need to make
multiple round trips to the object store just to get one tile. So that's reading the header and then reading the actual tile data, or even reading multiple headers. So you're looking at easily 100 or more milliseconds to return things. So that's an order of magnitude difference from if you just run an EC2 instance and
you are just serving back a response within a couple of milliseconds. So finally, cloud optimized is definitely not the easiest way to do things either. The easiest, as many projects encounter, is just to outsource this problem.
You buy a SaaS from a vendor, and there is no setup. You simply get an API key, you upload your data to an external service, they handle all of that for you, and you just pay them a very predictable cost. Maybe they have some sort of tiered usage or metered usage.
You pay for what you use. And in a lot of cases, it's even better because that data will be globally distributed. So you might be reading from a COG or PMTiles in a bucket, but the vendor will also integrate with something like Amazon CloudFront. So your globally distributed users are able to access that data with very low latency.
So I feel like I've hopefully done a good job of talking you out of doing things this way. So why bother doing this? Doesn't this sound like a waste or some kind of fad? And I'm going to talk about some of the interesting reasons why you should do things
this way. One of the ways I describe this is that it gives you all the benefits of a SaaS, in that the scaling of the system and the uptime are someone else's job. S3 goes down sometimes, but it's generally somebody's on-call job at
Amazon to make sure S3 doesn't go down. And instead of having to provision like a horizontal scaling system of tile servers, this system that uses static storage is well suited to serving a high traffic
global audience. And another reason it's a lot like SaaS is you pay only for what you use. And this matters a lot whatever the scale of the project, small or large. For a small scale project, maybe something that is for internal usage, it doesn't make sense to provision an entire server, to go buy an EC2 instance or a Hetzner box, just to serve tiles. So you only pay for what you use. And if that usage is small, then those operational costs are very low. Just for comparison to some other popular solutions, if you are serving something like
an OpenStreetMap base map for an application and you do, let's say, five million tile requests per month, and you give some ballpark estimate of what that would cost on a tile provider like Google Maps, it would be [inaudible]. In
a system like Protomaps, for the same thing, the cost of bandwidth and Lambda and object storage comes out to more like $60 a month. On Cloudflare, it can be as low as $10. So that's just a very rough estimate. Obviously, the cost model here
is more complex than just buying a vendor solution. You have to factor in how much storage do you have in bytes, how much bandwidth are you using, how many invocations. But generally, the economics of using a cloud platform this way are
quite good. And I think one of the really important things about these sort of cloud-native ecosystems is the alternative might be using a vendor, but
usually the vendor has some sort of secret sauce in how their API works. Maybe they have some open-source components, but it's fundamentally not an open-source system that you can run yourself. So one key benefit of a system like Protomaps or a cloud-native system like COG is you can rely on an
all FOSS geostack. And that runs all the way from the source data to the applications or programs that serve that data. And that has really important benefits for organizations like the public sector that want to be able to
rely on that software being reliable over long periods of time. And one really important point I'm addressing with Protomaps is to combat vendor lock-in. I mentioned Amazon Lambda and Cloudflare Workers as some of the primitives that the system is built on. But even those
pieces of infrastructure like Lambda are not open source. And one important thing about Protomaps is that right now it has first-class support for both Amazon and Cloudflare. And it works in almost exactly the same way on both platforms. So this is an important choice for users of this
project: if you adopt PMTiles and you are serving geodata through this system, it is portable from one cloud to another. And one thing I'm going to be working more on is supporting Azure as well as Google Cloud. And just to
emphasize the point about a FOSS ecosystem, what Protomaps has done in the past few months is build out the ecosystem around creating and also deploying PMTiles. So the Tippecanoe project, which is for
generating vector tilesets from your own geodata, now supports PMTiles natively. So you can go straight from, let's say, GeoJSON or anything that GDAL can read into vector tiles. That's an open source project that was funded by Felt very generously. The Planetiler project, which is a Java
program for creating OpenStreetMap base maps, also can now create PMTiles natively. There is also a conversion utility I published called go-pmtiles, which is a single binary command line tool to generate PMTiles from other formats, to interact with cloud storage, and to do some very
basic read operations. And this is a self-contained Go program that is available on GitHub for Windows, for Linux, for Mac. Now, some case studies about the kinds of use cases for Protomaps and a cloud-optimized format. So
Mesonet.org is a university project that visualizes the weather and thanks to the ease of deploying vector data, they've been able to focus on the parts of their application that are unique and building features and building the important parts of their projects instead of worrying about paying
for an external vendor, uploading their data to a third party, or worrying about who's going to maintain that in the long term. The Washington Post has data journalism projects involving interactive visualization. They
use client libraries such as MapLibre. And for them, they might be working with a data set of the entire US. They can use PMTiles to create a data set as a single archive instead of millions of little loose tiles and serve it from S3 and CloudFront. And that is a really compelling sort
of workflow for an organization like a newspaper. Maybe their staff are not just focused on maintaining tile servers, but this works really well for their kinds of projects. So to recap, some good
reasons to use Protomaps: one is the idea of a totally static deployment of your application. You might be working in a front end mapping library like Leaflet or MapLibre GL. And instead of having to
maintain a backend, the backend is simply S3. And many organizations are paying for cloud storage already. That's not something that you need to go have approved, as you would to add a vendor, for example. You can simply put tiles on S3 and serve them through Amazon or Cloudflare. So using the infrastructure you already have is a really compelling reason
to use this over adding an external dependency. Some other reasons to use this are if you are a services business that has client projects, maybe your client project needs, for example, a base map or other geodataset. But when you are negotiating how to maintain
this project in the long term, there's some uncertainty around, well, what is the bill for, let's say, Google Maps going to look like in one year, in two years, in five years? Is that third party API still going to exist? Is the pricing going to change? So having
a self-contained solution is really useful for a lot of these projects. Some other reasons people might want to use Protomaps for this kind of thing are privacy and GDPR, in order to comply with regulations, or even air-gapped usage. If you are a business that provides an appliance, for example,
that runs on-premises, then this will eliminate the need to call a vendor through the web. Some reasons not to use Protomaps are if you want the fastest, the cheapest or the easiest system possible. If provisioning servers or containers or managing Kubernetes is the fun part of your job and that's
what you enjoy the most, then maybe this isn't a great fit. And if even managing a file on storage for some small-scale project might be dedicating too many resources, for OpenStreetMap base
maps I also have a free hosted API supported by GitHub Sponsors. So that is a good alternative to something like Google Maps for a small-scale project.