Best Practice for Serving Imagery using MapServer/GDAL on Amazon Web Services
Formal Metadata
Title: Best Practice for Serving Imagery using MapServer/GDAL on Amazon Web Services
Number of Parts: 208
License: CC Attribution 4.0 International: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/40930 (DOI)
FOSS4G Boston 2017 (176 / 208)
Transcript: English (auto-generated)
00:00
So today, since Paul Ramsey gave his talk yesterday, I'm going to try to strike a balance between old-school infrastructure, very familiar tools that have been around for like 20 years on our platform, and a little bit of what he was talking about, the more cutting-edge side, speaking to how we can simply leverage things like AWS Lambda
00:27
and serverless on AWS to do things that we've been doing for many years, but maybe a little bit faster, right? So what I'd like to cover today, topic-wise, is building cloud-optimized GeoTIFFs,
00:44
speaking a little bit to Amazon S3 storage, eventing on that storage system, how that links to AWS Lambda, and how you can do really simple things with Lambda. So I have a script, a Node script that simply wraps the good old binaries, the good old GDAL utilities. In this case, it's just
01:07
gdal_translate and gdaladdo, those two. And others have done a significant amount of work around creating the recipe for these cloud-optimized GeoTIFFs. And that's now
01:21
available to everybody from about, I think, GDAL version 2.1 or 2.2. In this case, we're working with the USDA NAIP data, which in its current set, so 2015 data, 2016 data for the whole country,
01:45
is about 220,000 GeoTIFF files, from like 200 megs to maybe almost 500 megs in size, depending on the state. And this is the sequence and set of tools that I actually used to prep that data. And I'll show you how that works a little bit in Leaflet
02:07
in a second. So, building the GeoTIFFs using Lambda methods, using familiar GDAL tools, and then, having those newly created GeoTIFFs, you actually build a special
02:22
index so you can get MapServer to work with that, right? How do you do that? And then, what does the map file layer definition actually look like? The good news is it's very simple, right? And then we'll deploy that at the end, and I'll show you a running
02:41
version. I just fired up 10 servers right before the talk, and it's just somebody else's Dockerized version of MapServer 7, right? With a very simple setup, very simple architecture, anybody in this room with a credit card and an AWS account can all of
03:00
a sudden serve the United States, and you can serve it as fast or as slow as you want. And say, if you wanted to tile the whole thing, you might want to fire up more servers and run some code on top of that to generate approximately 200 million tiles, I think it is. But you might not have to, because you can build these tiles anytime you want,
03:22
so I'll show how that works in a second. So I'm going to do a little mini demo, this is the hard part, this is live, right? So I'm hoping this works. So right now I'm on a Mac here, and if it looks like Windows,
03:46
that's because we're looking at a machine somewhere in Virginia, in a Virginia data center, so now we are no longer on my notebook here. And somewhere here, the font is so small it's hard for me to see, so I'm going to show you the end product,
04:06
right? So it's very simple, it's a slippy map with images, right? If you look at it, those dark lines are the tiles, the GeoTIFF tiles, right? And as we move this, it's doing what you'd expect, it's building new tiles on the fly.
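The core of building a tile on the fly is translating the slippy-map tile address into a WMS bounding box. As a hedged sketch (the function name and the web-mercator math are mine, not the speaker's actual shim code), the conversion looks like this:

```shell
# tile2bbox: convert an XYZ tile address (z x y, as Leaflet requests it)
# into the EPSG:3857 BBOX parameter a backing WMS server expects.
tile2bbox() {
  awk -v z="$1" -v x="$2" -v y="$3" 'BEGIN {
    half = 20037508.342789244           # half the web-mercator world, metres
    ts   = (2 * half) / (2 ^ z)         # tile span in metres at zoom z
    minx = -half + x * ts; maxx = minx + ts
    maxy =  half - y * ts; miny = maxy - ts
    printf "BBOX=%.3f,%.3f,%.3f,%.3f\n", minx, miny, maxx, maxy
  }'
}

# e.g. assemble the WMS query string for tile 12/654/1583
echo "...&SRS=EPSG:3857&WIDTH=256&HEIGHT=256&$(tile2bbox 12 654 1583)"
```

The rest of the WMS request (layer name, image format) is fixed, which is why the tile maker can be such a short piece of code.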
04:24
And if you look on the right-hand side, I'll clear this, let me see what layer I'm looking at, so I'm looking at the tiles-from-Docker-behind-S3 layer. So right now what's happening is Leaflet is making a request to our object store, S3, but I haven't built these tiles yet,
04:43
right? So it's looking to S3, trying to get it from S3, S3 doesn't have it, and S3 is doing a redirect to a tiling service that's running behind it, right? And I'll show you an architectural picture of that, but it's very simple: the browser application makes a request to S3, S3 throws a 403, and redirects the request
05:05
to a tile maker, right? I can't remember what it's called, I think it's called S3 tile maker or something in here, and that very short little piece of code basically transforms the TMS request
05:23
into a WMS request, and behind that very simple tile maker is any number of WMS servers running behind a load balancer, right? So it's very simple architecturally, and if you wanted to use this and never
05:42
create a tile cache ever, you could use it that way, or if for some reason you wanted to go fully serverless and build all the tile caches, pre-cache things, you'd probably want an architecture that looks something like this, okay? And so this is the end product, this is the JPEG, and this is the area of the source files, or
06:05
quarter quads. And then I'm going to go over here, there's an S3 tool called CloudBerry, let me make this a little bit larger, sorry, it's a little bit too small, it's kind of hard to see, but the point that I'm trying to get across here is that on the right-hand side is
06:24
our public sector open data, one of our public data buckets, or data accounts I should say, and in this account are many buckets, one of which, if I can find it, it's so small,
06:42
is all the NAIP data, which we're hoping, sometime this year, to work on in collaboration with Esri, to kind of conflate the efforts around parallel ETLs of the NAIP data set.
07:02
Has anybody looked at this before in the room? No? Is this brand new? Anybody played with this before? So the main thing here is, all you have to do is remember aws-naip, and there's a bucket, all the data is in there. If you compare that to traditional
07:21
methods of getting the data, you'd have to go to the federal site, or you'd have to go to the FSA APFO office and order, I don't know, 24 serial ATA drives and wait a few weeks for all the data to arrive. It's all here, it's all part of the Earth on AWS program, freely available to anybody who wants to play with it. And so if you look at this,
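As a hedged sketch of getting at that bucket from the AWS CLI: the bucket name aws-naip comes from the talk; as far as I know it is configured requester-pays, so you pass the request-payer flag and your account is billed for the transfer. The prefix layout shown is a guess, not documented here:

```shell
# List the public NAIP bucket (requester-pays: your account pays egress)
aws s3 ls --request-payer requester s3://aws-naip/

# Copy one illustrative state/year prefix down to local disk
aws s3 cp --request-payer requester --recursive \
  s3://aws-naip/ca/2016/ ./naip-ca-2016/
```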
07:44
here's, for example, California, there's three years of California, so the most recent is 0.6 meter, 60 centimeter data, and here's the original four-band data, and the data we're using today is the stuff that I ran through Lambda, right, that is the visualization
08:03
layer, the three-band RGB stuff here, so it's not RGB-IR, it's just RGB, and instead of, what is it, 400 megs, these are more in the 20-meg class, because they have been internally tiled, overviews added, and lightly JPEG-compressed, so you roughly end up
08:27
in some cases with about one-tenth the size of the source, okay? So this is the source data, a federal data set available to anybody, but what I'm hoping
08:45
I can communicate is that these tools work at this scale, and we can go through the CONUS data set in minutes, right, to prep this class of data. It's not petabytes, it's more like 130 terabytes of source data, and we can use very simple tools and
09:07
methods we've been using for a long time around GDAL to crunch this data and prepare it for this kind of visualization, using simple architecture, simple methods, okay?
09:20
So now, you saw the source, you saw the Leaflet map with the little JPEGs, right, so I'm going to try to fill in the middle, how did we get there? So that was the mini demo, and I've got a couple of screenshots of websites that
09:44
others have done. So Pete Schmidt, who's at DigitalGlobe, had done some work late last year around documenting this, and if you go to the MapServer site on GitHub and look for S3 /vsicurl/, he has a set of instructions here on how
10:06
to make this happen. That's useful; there's one problem that I've identified, which we haven't fixed in it, but I'll explain that in a second. This here is my GitHub site, where I have
10:20
something called Lambda gdal_translate that I used in a workshop at the CalGIS conference a couple of months ago, and if you're interested in this, you could actually use it as an entry-level, let's-get-started-with-GDAL-and-Lambda kind of thing. It's got a step-by-step sequence here that I used in a workshop
10:46
that most people got to the end of, right? So today I'm really just talking about a couple of the utilities under the GDAL project, gdal_translate and
11:01
gdaladdo; those are the two fundamental tools you need in order to create these things called cloud-optimized GeoTIFFs, right? And also, on the Trac site for GDAL, there is a page here that I think Even did under contract, I think it was with
11:25
Planet, so they put a significant amount of work around essentially extending /vsicurl/ so that there's /vsis3/, and it supports authenticated access to S3 buckets and also supports requester-pays, which is important for shared buckets in some cases. And what that means is
11:45
that you can do what I'm showing you today against secured buckets that nobody else on the planet can see, just you, or if you're nice, you can share it with everybody, right? It's your call, or you can do it in a very granular fashion if you
12:01
wanted to. The one thing I wanted to point out is, if you look at Even's very detailed description of how to create these things using the Sentinel data here, notice there's a translate step, and then there's an overview step, and then there's another translate step, and that last one is missing, I think, in Pete's documentation. And what's going on here is
12:26
that here we're tiling it and doing a DEFLATE compress, here we're adding the overviews, and then finally here we're doing a JPEG compress, tiled again, using a photometric setting which
12:41
drives the smallest size, that's what gives you the one-tenth size. But what happens is that this middle step here adds the overviews, but it adds them to the end of the file, and the whole point of a cloud-optimized GeoTIFF is that the overviews go to the front of the file, so you've got to do the translate one more time, okay? Just don't forget that,
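A hedged sketch of that three-pass recipe as plain GDAL command lines; the file names are illustrative, and the creation options follow the recipe just described:

```shell
# Pass 1: internally tile, lossless DEFLATE only (no lossy step yet)
gdal_translate in_ortho.tif step1.tif \
  -co TILED=YES -co COMPRESS=DEFLATE

# Pass 2: add overviews; note GDAL appends them at the END of the file
gdaladdo -r average step1.tif 2 4 8 16

# Pass 3: translate once more so the overviews land at the FRONT,
# with tiled JPEG + YCBCR photometric for the ~one-tenth final size
gdal_translate step1.tif cog.tif \
  -co TILED=YES -co COMPRESS=JPEG -co PHOTOMETRIC=YCBCR \
  -co COPY_SRC_OVERVIEWS=YES
```

Pass 3 with COPY_SRC_OVERVIEWS is the step that makes the file actually cloud-optimized; skipping it leaves the overviews at the tail.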
13:03
so you can see that in the Lambda tool I'll be explaining in a second. So generally this is the sequence of building cloud-optimized GeoTIFFs: you have some kind of source, this is what the contractor has delivered, some kind of ortho, right,
13:24
and then you use AWS Lambda, basically running translate, to do things like tile and maybe DEFLATE compress, because you don't want to use a lossy compression step first, and then you use Lambda again to add overviews, and it'll put it back to S3, and then one more
13:47
time it'll translate to get the true cloud-optimized file, so it's got the three-step pattern. I wanted to talk, how are we doing? Oh wow, okay, I've got to go really fast, okay,
14:02
so are we leaving time for questions, is that fine? Yeah, okay, maybe we won't have questions. So, data ingestion into S3: I just want to remind folks that there are multiple methods. A lot of our customers just upload their LiDAR data overnight; we have a lot of customers in the state and local space that are uploading like two terabytes
14:24
overnight of freshly acquired LiDAR, and that's the simplest thing to do. One other pattern is to order a Snowball device; all the NAIP data was moved via Snowball, that's where we send you a device and you securely copy onto it, and then it ends up in your S3 bucket magically, so that's the easiest way if you've got a lot of data. We also
14:43
have the Snowmobile, which is a truck, if you have petabytes of data. This is just a screen showing how Netflix uses it; I just wanted to make the point here that S3 isn't just a static data store, it can be very dynamic depending on what you're up to, and Netflix is a good example of that,
15:01
and why is this important? We're talking a lot about data lakes nowadays; this is a picture of how the object store fits in. I think doing some of this work around S3 and learning how to use tools like Lambda around the object store is important, because there's a lot of interest and activity in not just building these data lakes and using
15:24
them internally at the enterprise level, but also being able to, for example, share at the education level globally, right? So one person does the work, and everybody in the world can benefit from it without having to reinvent the wheel 10,000 times, right? So I won't spend time on this, I think there are other
15:45
talks covering Lambda. No servers to manage; the main thing from a batch computing perspective is that once you're finished, there are no servers to manage and there's no cost, right? You don't even have to turn anything off, it's just done, so it doesn't get any simpler. So I'm going to skip this and get right to this,
16:07
so, if I had more time, wait one second, how much time do we have now? Three more minutes, okay, I'm going to do my deck then, sorry. Okay, so I wanted to go to the console and show you the three steps in the console,
16:26
right, so with the Node code wrapping GDAL in the console, you hit a test event on the first Lambda; in production, you'd invoke it from the command line. So all it really is, is you list what you want to work on in the S3 bucket, you feed the list
16:44
via the command line right to Lambda invoke, Lambda invoke does the first step and puts the result in an S3 bucket, the S3 bucket pushes it to the next Lambda because it's evented, and it does that three times until it finally goes into your prod bucket, so
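From the command line, each kickoff is roughly one aws lambda invoke per source object; the function name, bucket, and payload fields below are hypothetical, since the real script's interface isn't shown:

```shell
# Invoke the first-step Lambda asynchronously for every key in a list
# file. Function name, bucket, and payload shape are illustrative.
while read -r key; do
  aws lambda invoke \
    --function-name gdal-translate-step1 \
    --invocation-type Event \
    --payload "{\"bucket\":\"naip-src\",\"key\":\"$key\"}" \
    /dev/null
done < keys.txt
# (AWS CLI v2 may also need: --cli-binary-format raw-in-base64-out)
```

The S3 event notifications then chain the remaining two steps without any further invocation from your side.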
17:01
it looks like a very batchy thing that we've been doing for many years, but you're using servers out there in the cloud right next to the data. So at the end, now that you have the GeoTIFFs, you need to create the spatial index. Once again, you just create a list, and you just need to prepend it with a /vsicurl/
17:22
prefix here, right, because that's where the data is now, and that's what the code requires. And then, because tools like gdaltindex, which is what we normally use to build the index of these GeoTIFFs, are also now able to go and look at the object store directly, here's an example
17:48
of basically giving this utility the key pair that it needs and a couple of GDAL setup things to act on data that's in the cloud, right? So you can do it from your notebook, or you can do it,
18:02
preferably, in the cloud because it's a little bit faster, and essentially act on data without moving data, and build the spatial index that you can then use in your map file in the next step. I'll skip this, but if you look at it in QGIS, it's this monstrous CONUS shapefile, all black until you zoom in, and it's about 100 megs in size,
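That index-building step might look something like this; the bucket and file names are made up, but the /vsis3/ prefix, the credential environment variables ("the key pair"), and gdaltindex itself are the real pieces:

```shell
# Credentials plus one GDAL setting to cut wasted HTTP requests
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export CPL_VSIL_CURL_ALLOWED_EXTENSIONS=.tif

# Build the file list, prepending the /vsis3/ prefix to each key
aws s3 ls --recursive s3://my-naip-cogs/ \
  | awk '{print "/vsis3/my-naip-cogs/" $4}' > cogs.txt

# Build the shapefile tile index without moving any imagery
gdaltindex naip_index.shp --optfile cogs.txt
```

--optfile is GDAL's standard way to pass a long file list, which matters when the index covers 220,000 objects.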
18:25
but that's all it is, right? If there's any trick of the trade there, it's that it takes some time to create a spatial index of 220,000 files, so the simple trick is, you run four, five, six threads, right, the data's in the cloud, you don't have to move the data,
18:42
you do it a few times, and then you merge those shapefiles at the end, and put that in your map file. I'm sorry, it looks like I ran out of time, but if you want to ask me questions about this, I'll get the deck out later. So, questions? Yeah, so if I have time, yeah, the last part is,
19:04
it's ridiculously easy now, because you can just run a Dockerized MapServer instance. In my case, I was just using somebody else's Docker image, right, and just feeding it the map file, and I can get to market with a national mapping server,
19:22
so that's radically easier than it used to be. In this case, I'm just running on EC2, but you can obviously run it on ECS, which is our Docker platform; you have a number of choices there.
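To round out the deployment, here is a sketch under stated assumptions: the mapfile below is a minimal tile-indexed raster layer (all names invented, not the talk's actual mapfile), and camptocamp/mapserver stands in for "somebody else's Docker image":

```shell
# Minimal mapfile: one raster layer driven by the shapefile tile index
cat > naip.map <<'EOF'
MAP
  NAME "naip"
  PROJECTION "init=epsg:3857" END
  WEB
    METADATA
      "wms_title"          "NAIP"
      "wms_srs"            "EPSG:3857"
      "wms_enable_request" "*"
    END
  END
  LAYER
    NAME "naip"
    TYPE RASTER
    STATUS ON
    TILEINDEX "naip_index.shp"  # records hold /vsis3/ paths to the COGs
    TILEITEM  "location"
  END
END
EOF

# Run a stock Dockerized MapServer and hand it the mapfile
docker run -d -p 80:80 -v "$PWD:/maps:ro" camptocamp/mapserver
# WMS endpoint (exact path depends on the image), e.g.:
#   http://<host>/?map=/maps/naip.map&SERVICE=WMS&REQUEST=GetCapabilities
```

Run several of these behind a load balancer and you have the "any number of WMS servers" tier the architecture slide described.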