
Don't Copy Data! Instead, Share it at Web-Scale


Formal Metadata

Title: Don't Copy Data! Instead, Share it at Web-Scale
Number of Parts: 188
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Year: 2014
Production Place: Portland, Oregon, United States of America

Content Metadata

Abstract: Since its start in 2006, Amazon Web Services has grown to over 40 different services. S3, our object store and one of our first services, is now home to trillions of objects and regularly peaks at 1.5 million requests/second. S3 is used to store many data types, including map tiles, genome data, video, and database backups. This presentation's primary goal is to illustrate best practice around open data sets on AWS. To do so, it showcases a simple map tiling architecture, built using just a few of those services, CloudFront (CDN), S3 (object store), and Elastic Beanstalk (application management), in combination with FOSS tools: Leaflet, MapServer/GDAL, and yas3fs. My demo will use USDA's NAIP dataset (48 TB), plus other higher-resolution data at the city level, and show how you can deliver images derived from over 219,000 GeoTIFFs to both TMS and OGC WMS clients for the 48 states, without pre-caching tiles, while keeping your server environment appropriately sized via auto-scaling. Because the NAIP data sits in a requester-pays bucket that allows authenticated read access, anyone with an AWS account has immediate access to the source GeoTIFFs and can copy the data in bulk anywhere they desire. However, I will show that the pay-for-use model of the cloud allows for open-data architectures that are not possible with on-prem environments, and that for certain kinds of data, especially big data, rather than move the data, it makes more sense to use it in situ in an environment that can support demanding SLAs.
Transcript: English (auto-generated)
My name is Mark Korver. I'm with Amazon Web Services. I'm part of the public sector team at Amazon. That's why if you look up there you'll see that I'm a solution architect. I work largely with state local government. And because our customers in the education space are
so active, there's so much going on, especially in higher ed. I spend a lot of time working with our higher ed customers also. So I'm also the mapping guy on the team. My specialty and a good chunk of my background has to do with, you know, geo apps on the web.
I've been doing that for some years now. And, you know, it's interesting because this is the one conference that I've been wanting to come to for like 10 years, and so I'm very, very happy to be here, to be invited to speak here. This is, you know, this is the preeminent conference
about open source and mapping. And, you know, I have a real sense of gratitude for the whole open source community because that allowed me to run a business. Most of my business was actually in Tokyo, Japan, but we did a lot of projects that had
at their core open source components. And so we were doing things like the, you know, the first double-byte implementation of MapServer back in, I don't know, 2000 or something. So I'm very, very happy to be here and get to talk about something that's really true from our heart.
And so anyway, so much for the history. I want to kind of try to go a little bit forward into what I hope is the future. And as you can see from my title, I want to, this presentation was actually prepared to help explain best practice around open data.
And I gave a kind of a, the first version of this at our symposium in Washington, D.C. I think that was in June. Since then I've been, you know, given this for smaller groups a few times, but this is really the second time I'm doing this. But I know I have a, you know,
I guess I can go more technical, which is always more fun. But the core idea here is, and you know, this is the one thing that I shouldn't have to explain too much here, but it shouldn't be about copying data anymore. Right, we live in a linked world. Especially in mapping, we've known
for many years what REST endpoints are, right? We know about web services. And so, you know, today I'm representing a company that provides you IT on the fly as a result of a REST call, right? And on top of that, of course you're building systems
that, for example, are RESTful and do all kinds of interesting things on top of an infrastructure that you can build and destroy within minutes via code of your choice. And so, so I work with a lot of different use cases.
So, you know, one day it might be like genome analysis, another day it might be Alzheimer's research, a large universities having to share data at scale. So kind of big data for scientific analysis. And at the core of many of those use cases,
and I would say that the larger they get, the more our object store becomes a central feature of that system. And so it's a kind of, it's best practice around not working in terms of traditional file systems but working in terms of object endpoints. In this case, I'm specifically talking about
what's called our simple storage service, S3, which I'm sure many of you are aware of and I've heard other people at this conference, you know, talking about it yesterday, right? So I want to focus on that. And what I'm going to show you is a test set that I need to do a call out to the folks at Mapbox.
Let me give you a little bit of background. Mapbox asked the USDA for the most recent set of the NAIP data set. So this is one-meter-per-pixel data, coast to coast, I think it's 48 states. And it was delivered to Mapbox on 24 serial ATA disks,
each one two terabytes, and then they contacted me, and I said, hey, this is a customer, they run on AWS, and the idea is, Mark, this is essentially a public data set, can you help us out here? Can we get this into your public data set program? So it's not quite in our public data set program yet.
It might be, we would rather it be the USDA's data set in their bucket. I'll speak a little bit more to that later. But so what you're going to see is best practice around building a tiling system that's focused
on delivery of aerial image data in this case. So it's not open street map data. It's one meter class aerial imagery. And if you have that kind of data sitting in the AWS cloud in a region, one of our regions,
how can you leverage our services with the least amount of custom code, leveraging open source projects, to get to market most quickly? So you'll see what we call the auto-scaling application, very little code that essentially allows you
to give any number of people out there that you want access to, 48 terabytes of data that you don't have to pre-cache, right? So I'm going to go ahead, and there's a couple of slides. I think I only have like four or five slides. I don't intend to slide deck you today.
It's mostly going to be a real-time demo. But I want to make a couple points clear before we start. And so in a sense we're trying to correct for the problem of what in a mapping world we call clip and ship. So typically you go to some website,
maybe it's a federal website or a national data website, you go find the data, and you go clip whatever portion of the world that you want out of that, and you somehow download it. And if you've done this before, you essentially go to a site, you might have a map, you create a bounding box, and then you might get an email saying that zip file will be available
to you now, right? There's this whole kind of manual process with clip and ship, and that's been around for many years. One of the earliest projects that I worked on was actually for Japan Space Imaging. We built a shopping cart for satellite imagery. I think it's the same idea though.
So you go through this manual process, and if you're lucky you get email after a few hours saying it's available via FTP or something. And so that's the norm, and you still see that out there. So in the world of mapping, especially, when you're talking about things like compressed image data
or things like LiDAR data, typically there might be some data out there, but there's this whole exercise around going and getting it, and then essentially making another copy of that and putting it somewhere on-prem, and then if it's a large data set, typically you have to worry, you have the same
storage problem, where are you gonna put it, and you have to keep it close to a performant server so you can actually use it after you download it. So there's copies all over the world, everywhere. So when the USDA comes out with the new 2015 NAIP data, what happens?
Copies proliferate. Everybody now has a storage problem. But in the interconnected world of the cloud, web-based services, theoretically it should just be one copy. Why should we need to have more than one copy? We'd rather have one copy that's a definitive source that's well-maintained, well-curated, it's got all the metadata, and nobody's
moving that thing around. When we received the 24 two terabyte disks from USDA, of course there were errors, right? There's a whole, you know, another week there trying to figure out where the errors are, and correcting for the errors, so just that, even shipping the whole thing is problematic, because there's a lot of files,
there's close to half a million files, if you include the metadata for this particular set. So there's storage cost, there's network cost, there's a computational cost, and then, you know, since you're distributing, every time there's any kind of minor update, you have this huge cost around updating those distributed copies.
I mean, we bear that every day in our world of NAIP. We're all used to doing this, and we think this is some kind of normative pattern, right, it shouldn't be anymore. So, what makes cloud storage different? Well, one is, because it's available as an endpoint that you can either make completely public,
or secure in a very granular way, it's up to you, it's not siloed in some data center behind some firewall, right? That's one, and then number two is you can provision a real-time granular access to it or not, right? So you can have, so, probably the best way
to think about it is many of you have smartphones in your pocket right now, if you've got, for example, an application that allows you to take pictures or gather some kind of sensor data and upload it, there's a very good chance that you're uploading that via what we call a signed link to an object store, right, not through a server.
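The signed-link idea can be sketched in plain Python. To be clear, this is not AWS's real presigning (that uses Signature Version 4, which boto3's generate_presigned_url implements); it is a toy illustration, with a made-up secret, bucket name, and URL shape, of how a backend can hand a client a time-limited, HMAC-signed upload URL that the storage endpoint can verify without any database lookup:

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

# Shared only between the app backend and the storage endpoint.
# (Invented for this sketch; real S3 presigning uses SigV4 credentials.)
SECRET = b"server-side-secret"

def sign_upload_url(bucket: str, key: str, expires_in: int = 900) -> str:
    """Build a toy signed upload URL the backend hands to the phone app."""
    expires = int(time.time()) + expires_in
    payload = f"PUT\n{bucket}\n{key}\n{expires}".encode()
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    query = urlencode({"Expires": expires, "Signature": signature})
    return f"https://{bucket}.example-store.com/{key}?{query}"

def verify_upload(bucket: str, key: str, expires: int, signature: str) -> bool:
    """What the storage endpoint does: check expiry, recompute, compare."""
    if expires < int(time.time()):
        return False  # link has expired
    payload = f"PUT\n{bucket}\n{key}\n{expires}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

The app then PUTs the file straight to the returned URL; the vendor's server never proxies the bytes.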
You're loading it directly to a storage system that, for example, Amazon Web Services provides that particular application vendor. The third thing is, and this is now on the cost side of the equation, it's not simply a technical thing,
right, you can offload the network egress cost. So, and this is probably the most important point, so when you store data in the cloud, and I'm assuming that whether it's our object store or another service provider's object store,
you're basically paying for how much data you have in there, right, now, or for the last two, three days, and then typically you're charged for how much network bandwidth you use for that data going out the door. That's the network, so that's a variable cost. So if you've got, for example, right now with us,
I think it's less than three cents a gigabyte a month, so one gig costs something on the order of three pennies a month to store, but depending on how often that data goes out the door, and technically that means it goes out of one of our regions, there's a variable cost
associated with that, right. And so, when I say offload the egress cost, what that means is that you can have somebody else pay for the network charge. You continue to pay for storage, but you can set it up such that somebody else pays for data going out the door, and that's very important because that allows you
to actually release really large public data sets without getting your network hammered, without having your FTP servers all of a sudden go down, because it's not your problem anymore. You've given us the heavy lifting. You've given us a job of doing the heavy lifting around allowing access to that data.
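To make the storage-versus-egress split concrete, here is a back-of-envelope estimator. The storage rate is the ballpark figure from the talk (about $0.03 per GB-month); the egress rate is an assumed illustrative figure, not a quoted AWS price — check current pricing. With requester pays, the owner keeps only the storage term and the downloader pays the egress term:

```python
def monthly_cost(stored_gb: float, egress_gb: float,
                 storage_rate: float = 0.03,   # $/GB-month, ballpark from the talk
                 egress_rate: float = 0.09):   # $/GB, assumed illustrative figure
    """Split a bucket's monthly bill into its storage and egress parts."""
    storage = stored_gb * storage_rate
    egress = egress_gb * egress_rate
    return storage, egress

# The 48 TB NAIP set, downloaded in full once during the month:
storage, egress = monthly_cost(stored_gb=48_000, egress_gb=48_000)
# Under requester pays, the owner's bill is the storage term only;
# whoever pulled the data out of the region pays the egress term.
```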
So, in the cloud, you still have to pay for storage. It's your data. You control that data, but there should just be one copy of that data, right. Just by storing it in S3, you have 11 nines of durability,
so it looks like one endpoint, and I'll be showing this to you, for that geotiff file, right, but in the background, we're of course making multiple copies of it for you, but you can't see it. It's not your problem. It's our problem. We need to make sure that we satisfy the SLA around that data, okay,
and because you can offload the network cost, you don't have to worry about network. You don't have to worry about provisioning network on your end just because somebody might come today to go get, come and get the data. You don't have to worry about your network getting maxed out because somebody decides to download the whole thing in one hour, right.
That's our problem. You don't have to worry about compute costs because you're not standing up, for example, traditional FTP servers or putting it on some website anymore, right, so that's not, again, not your problem, and then because you're maintaining just one definitive copy of that data, right, you don't have to, you don't have the cost of updating
all those distributed copies because they're not out there anymore. There's no need for them, right. So, you know, I shouldn't have to tell this group, but, you know, we need to think in terms of URLs, right, rather than in terms of copies. It's not about copying the data.
So today I'll speak, you know, we have, I think, we have over 40 services now, so part of my role is to act as a guide to our 40 services, but I'll be very frank with you, they grow at such a phenomenal rate
that even the solutions architects on our teams, the technical people on our team can barely keep up, so oftentimes, you know, if it's something that just came out two weeks ago, then I have to refer customers to the actual product team kind of thing, but I have many years of experience about building tiling systems on S3, and S3 is one of the original three services on AWS.
It was S3, EC2, which is a virtual machine service, and SQS, the queuing system, so there were just three in the beginning, but those three, at least back when I was a customer of Amazon Web Services, allowed us to pretty much build anything, because those are the most core parts, right,
queuing, storage, and compute. Now we have 40. So I'll spend a little time explaining some lesser known features of S3. The main key item here, if you walk out the door with anything, I just want you to remember two words, and that's requester pays. Requester pays is a feature of S3
that allows you to offload the network egress charge, and that's key to today's talk, and that's key to our best practice around government open data strategy using cloud wisely. S3 has many clients. S3's been around for a long time, so there's, you know, command line,
like there's, you know, Perl clients, there's, you know, basically every language you can imagine, there's a client. We have a full set of SDKs that we support on our site, from, you know, Ruby to PHP to Node that natively support S3,
and then today, because I'm running Windows, I'll be showing you a client called CloudBerry, but there are Mac clients, and for example, in the Java world, there's a project that's been around for a long time called JetS3t, which is used in a lot of projects. So it's very mature, it's been around a long time.
It helps our customers, and so you can imagine our larger customers like Netflix, or, you know, Shell Oil Company, are all using S3 somewhere at the core of their architecture.
And then, so this is what I really want you to remember, this idea of requester pays on a bucket, and I will show you how this works. And here's a little picture here that kind of maps to not just how it technically gets set up, but really the most salient thing here, the most important thing here is that,
so this is a bucket. A bucket is just a top-level name for an S3 container, so every AWS account is allowed to create 100 S3 buckets, and the reason it's limited to 100 is because that's a global name space. If you go to, for example,
well, I need to be careful, I think, about what customers I name, but there are, for example, some large university customers out there that have their whole www.some-university.edu site sitting in S3, right? There are some emergency sites out there that use WordPress, for example, to generate HTML,
the HTML gets pushed to S3, and S3 then handles the emergency traffic, right, because it scales massively. So it's a very, very simple architecture. But here, what I wanted to show was, this area here is one particular AWS account owner, right?
And that owner has a bucket. You can have 100 buckets, and you can actually have more than 100 buckets because you can have a number of accounts, but in that bucket, you can have the data of the world. The bucket has no limits. It's just objects, it's key-value, so the object store allows you
to keep pushing data into that bucket, right? So you can have as many tiles as you want. You can have as many Oracle backups as you want. The only limit is on individual object size. Each object is limited to, you know, five terabytes, right, but you can keep pumping five-terabyte objects into that bucket as fast as you want,
as long as you want, and you won't run out of space. So from a map tiling cache perspective, it's perfect that way, and it comes up, I think, in many talks, right? And here, so this is one account, and so here's the bucket, and there's a virtual machine. In our environment, we call it EC2; it's an EC2 server.
And you notice right here that when this virtual machine that's living in a region moves data from one of its buckets, this transaction here is free. I need to back up a little bit, right? So I'll show this to you in the console in a second.
But when you have an AWS account, you get access to now eight regions. So we have a global footprint. You can fire up systems, so for example, if you're a government customer, you can be running them in our Virginia region, or Oregon region, or in our GovCloud region, which is actually in the Portland area.
But just as easily, you can go to Tokyo, or Singapore, or to the EU, and do the same thing. It's just a drop-down list. And so the point I'm trying to make here is, this colored area here is one of those regions. So if you go and get data from a bucket,
you have a virtual machine in one of those regions, and it's doing get operations on the bucket, that's free. Putting data in is free all the time, right? Puts to S3 are always free. Now if you take the data out of the bucket, out of the region, and basically pull it out
to somewhere on the internet, then there is a charge. That is the data egress component, right? And when you turn the requester pays flag on, in combination with marking whatever objects you want as public or authenticated access, what it does is,
make it such that this other account, so account A over here, and account B, pays for the data egress charge. Okay? And that's the key point. So why does that matter?
That matters because, you know, the web has made it possible for everybody in the room to publish, right? So you know, I can write a paper, I can link it to some other paper, or I can make some data set available, but with just that, I might not be able to operate at web scale, right, I might not be able to scale,
and even if I was, I might have to pay the cost for network there. With requester pays, you can actually offload that to whoever wants to get the data. So today I'm gonna show you that there's many views to the same data, so going back to my point,
it's just a bunch of GeoTIFFs that are essentially gridded and prepped, and these are the GeoTIFFs from the prime contractors that flew the planes, that flew the Leica sensors, you know, the ADS80, or whatever it was. They did a bunch of QA work, and finally at the end of the day, the files get copied to some hard disks,
and the USDA probably receives those, copies those again, does a bunch more QA work, and then after many weeks, we can get access to it, right? But the way it should be is that there's one definitive copy of the data, it lives in the cloud, it might be in multiple regions, it might be in multiple
cloud vendors, right, but there should be a lot fewer copies out there. And so, I'm gonna go ahead and jump into the demo. So what we're looking at here is Leaflet.
I'm kind of an OpenLayers guy, but I took this chance to learn a little bit of Leaflet, not that I'm doing anything complex here. So it's just Leaflet, and the important point here is this system, if you back off a little bit, so you can tell, for those of you who are from Oakland,
you'll be able to see that this is actually the city of Oakland's data set. If you back off a little bit, now we're looking at the USDA's NAIP data. And what's important here is, I'm using one client
to look at data that I can look at, so these are tiles obviously, that's why, here we got the slippy map thing going on, right? So I'm pretty sure that these tiles have been built already before, so they're already on S3. But in a second you'll see that if I move to another part of the USA, the tiles will take a second
to come up, because they're being generated on the fly. But the important thing here is these images, which are just tiles, 256 by 256 tiles, as we're all familiar with, are based on content that's living in another account bucket.
So in this case, it's Amazon Web Services' public data set account, not my account, not my working account. And if you look in that account, there's a whole bunch of stuff including microbiome data and genomic data, all kinds of cool public data sets. But up here towards the top, there's aws-naip.
And you'll see, right, so all those mappers here will go, oh, okay, I get it, these are states, and there's California. And I rearranged this a little bit to simplify kind of the, it's not actually a directory system. These are all object keys.
So it'll look like a directory system so that we can then, next year we can receive the 2015 data and just slide it in here and maintain that one copy aspect of it. So this is a little bit rearranged, but basically the same data. Here's the one meter resolution data. I think Idaho has half a meter. It's the only state that has half a meter. So if you go look at Idaho, it'll say 0.5 meters.
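Since S3 keys are flat strings that merely look like directory paths, a layout like the one above is just a key-naming convention. A minimal sketch of the idea — the exact prefix order here is illustrative, patterned on the state/year/resolution arrangement described, not the literal aws-naip layout:

```python
def naip_source_key(state: str, year: int, resolution_m, fips: str,
                    filename: str) -> str:
    """Key for a source GeoTIFF; a new year's data slides in alongside,
    preserving the one-copy layout (prefix order is illustrative)."""
    return f"{state}/{year}/{resolution_m}m/{fips}/{filename}"

def tile_cache_key(z: int, x: int, y: int) -> str:
    """Key for a rendered 256x256 tile in a TMS-style cache bucket."""
    return f"tiles/{z}/{x}/{y}.png"
```

Because keys are just strings, "receiving the 2015 data" is nothing more than writing objects under a new `{year}` prefix; no directories are created or moved.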
And the original data was delivered with a group of shapefiles that define the tile boundaries, the index, right, as you'd expect. And then the original data is all, nowadays it's all four-band, right? So it's RGB plus infrared, IR.
And then there's a bunch of metadata here. And if you go look at the original data, these are FIPS codes. And then there are a bunch of almost-200-megabyte files in here. And if you have an account with us,
and if you use a tool, for example CloudBerry, or any tool that can make requester-pays requests, you all have access to this data. All it is is aws-naip. You have access to 48 terabytes of data. Now, I need to caution you: this is a test data set.
So I can't give you an SLA that it's going to be there next week, right? But if you fired up a client right now that could make requester-pays requests, you could go and see this data. You could go and download this data as quickly as you want to. And that's, I think, a very important point.
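Since the bucket is requester-pays, the only twist over a normal S3 read is that the caller must explicitly opt in to paying for the transfer. A sketch in boto3 terms; the bucket name aws-naip is from the talk, but the key is made up:

```python
# Requester-pays buckets bill the *caller* for transfer, and S3 rejects
# requests that don't opt in, so every call must carry
# RequestPayer="requester" (the x-amz-request-payer header under the hood).

def requester_pays_get_kwargs(bucket, key):
    """Arguments for a requester-pays GetObject call (boto3-style)."""
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}

kwargs = requester_pays_get_kwargs("aws-naip", "ca/2014/1m/rgbir/example.tif")

# With AWS credentials configured, the actual fetch would look like:
#   import boto3
#   body = boto3.client("s3").get_object(**kwargs)["Body"].read()
```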
So at this point, people who have done the exercise of going through some clip-and-ship system are probably realizing, oh, all I need is a client, and it could be on my notebook here, or it could be in my workspace VDI, you know,
a container in the cloud, and I can quickly copy all this data into my own account before Mark stops talking, because I want this stuff, right? And if you've ever ordered data, for example NAIP data from the USDA, you can see that this would be a much faster method
for you to get access to the data. So, you know, that being said, you can do all that, but I'm suggesting that would be a mistake. You don't need to do that, okay? You shouldn't have to copy, going back to what I was saying earlier. So on the right-hand side is somebody else's account, right?
Preferably this would be the account of whoever owns the data. From my perspective, best case, it would be a national agency, right, or state and local governments that have maybe banded together with other counties for a group buy of aerial data, right?
And they're exposing high-resolution aerial imagery because it's public data anyway, right? And as long as they don't incur a cost in disseminating the information, why not do this? There's no cost associated
with making it public if you take this route. On the left-hand side, you can see this is my working account, and I've got a bunch of stuff in here. I apologize, I have a whole bunch of badly named buckets, but down here there's one called naip-tms, and so this is the, you know,
you can think of this as a level-one cache, right? So rather than the cache being on the server that generated the tile, or in memcached, or whatever you want to use for your caching layer, it's just in S3, okay? I'm using S3 as a cache,
and how real-time that cache is, or how long its duration is, is up to you, right? Whether it lasts for one day or stays in this S3 bucket for a year, that's all tweakable, and you don't actually have to write code. You just change the lifecycle policy,
and I can show that to you in a second. So this is just a TMS cache, just Mercator data, and it's exactly what you'd expect, right? Layers, and you drill down, and eventually you see some JPEGs. And these JPEGs here are
exactly these guys, okay? So technically I could just delete all of these, and they would get built again, right? So going back to here, I'll show you a little bit more about how this works. So here,
right now I'm looking at the NAPE data. I'm gonna turn Firebug on. I thought I just turned it on. Here we go, and as we all, you know, like to do with somebody else's mapping systems, right, we can kind of explore how this works.
Let's see, I'm gonna go all, and you can kind of see what's going on here. As I move this thing around, it throws a 403 when it can't find a tile, and right here I have this DNS name, a domain name that I own.
These are all using our content distribution network, so these are DNS names pointing to a CloudFront distribution that I've created, which again leverages the AWS infrastructure. So this is another layer of cache that's closer to us, right? That's the way to think of it.
And if I go over here, you can see I've got a couple of test layers. I'm borrowing the MapQuest OSM tiles here, and here I have a direct link to the S3 bucket, so right now I'm looking directly at our object store, nothing in between, just the object store.
For most use cases, that's just fine, a simple architecture. And over here I'm getting access to exactly the same data, but via our CloudFront content distribution network, so it goes to the CDN, and the CDN then goes to S3, right?
You'll notice, though, that as I move this thing around, it's going to this thing called the Tiler, right? Because neither the CDN nor the object store has the data. So S3 has an interesting feature where, if you throw a certain kind of error,
you can essentially provide a filter and do a redirect. In this case, I'm doing a redirect to the system that makes tiles, and that's called the Tiler. So if you click on one of these Tiler requests and open it up in a new tab, you can see it just made one for us, right? And you can see that all it's really doing
is taking this tile name, so this is just the typical TMS naming scheme, and under the hood, what it's doing is exercising an auto-scaling WMS service that's running on EC2, right? A completely separate system
that is the definitive source for the tiles in this case. You can see this in practice by just chopping this part out, and it'll go into a check mode, and you can see the actual WMS request down here, right,
there you go. So if I copy this guy, all of a sudden you're looking at a WMS server, okay? So this here is load-balanced, using our Elastic Load Balancer, and right now I have it set up
with two University of Minnesota MapServer instances. I'm familiar with MapServer, so I tend to use it all the time, especially for imagery-type stuff. So I'm running two EC2 instances that know how to deliver WMS content, right? And this little piece of code here, all it's doing is asking, okay, what's the tile this person wants,
and translating that into the appropriate WMS request. And behind the scenes, within the region (typically this wouldn't be exposed publicly like this), there's an auto-scaling system that's easily tweakable, so you can say, it's two now, but I need it to be 20 tomorrow,
it's just a simple change, and it's coughing up one of these tiles, right? But it does a couple of things. It services the request, so it makes sure the client is happy, of course, so the client gets the data. But as soon as it delivers the data,
it fires off another thread and copies the same data to the object store, so that any subsequent request is satisfied from S3 rather than the Tiler. It's a very simple architecture. The core idea is that you're not trying to do the caching on the server itself;
you're leveraging services available in the cloud, such as S3, to do the caching for you. And so if, for example, we go to the management console over here, so I got a tab open,
so I'm just curious, how many people have seen the management console here? Could I just, you see some hands? Oh, quite a few, okay. So for all of you, I don't have to explain what this is, but for those who haven't seen this, so this is a GUI that allows you to use any of our 40 services, right?
Right now, we're looking at one called EC2 that allows you to control our virtual machine service, and it's actually, there's another tab that allows you to do our VPC, our Virtual Private Cloud, which is actually a subset of EC2, but anyway, this allows you to spin up virtual machines
anytime you want, but more importantly, turn them off and not pay for them as soon as you turn them off, right? And it's very easy, you hit launch, and I'm not gonna do this because I don't want this to turn into some kind of sales event here, but you hit launch, you make a couple selections, is it Windows, Linux?
In this case, these are all Linux machines, and then you can go and fire up whatever you want. But within the EC2 tab, one thing I want to show you is that I have MapServer running, and down here there's this thing called auto-scaling groups.
So I have an auto-scaling group; it's just two machines right now building those tiles. And which one was it? This one here, I think. If you come here, you'll see it's got a couple of twos, a min and a max of two, and all I have to do is come in here and, if I wanted, I could just make this,
for example, max four, min four, and if I save that, the EC2 system will just go ahead and clone a couple of copies and fire them up. And typically, when you did something like that, and you wanted to scale from two to 20 or 200 or whatever you desire,
and if you're working in a world of 48 terabytes, for example (which, from our perspective, is actually not that large), you'd have to worry about all the traditional things around making some kind of traditional file system available to MapServer or GeoServer
or whatever tiling server you're using. In this architecture, I don't have to worry about that, because I'm using yet another open-source package, and I'll show you what that looks like by actually SSHing into one of these machines,
so now I'm going back to another part of the console that shows me my virtual machines. I just want to look at the ones that are actually running. Some of them are probably going into startup mode. So here I have one, and I can get the DNS name down here.
Oops. This is the hardest part of my demos, copying this. And then I've got PuTTY running here somewhere. I need to make sure my key is correct, because the key depends on what part of the world I'm in.
Looks okay, so I'm gonna go ahead and open it. And this is, I'm pretty sure, Ubuntu, and I'm in the door, right? So right now, all I did was set up an SSH session to one of these virtual machines that's running University of Minnesota MapServer.
Well, it's actually a MapServer/GDAL combination. So if I hit this, you can see I've got a couple of mount points down here, and you can see the open-source tool that I'm using right there, yas3fs, and it's basically making S3 look like a drive.
So, as you'd expect, if I cd into data/naip and do an ls, it'll take a second, but there are all the states. Now, remember, this is 48 terabytes, right?
And this is a virtual machine that has a couple of, I think, 160-gig-or-so SSDs that look local to it, okay? What this system basically does is give it access to all 48 terabytes of data. It can go get any of those GeoTIFFs we were looking at before,
because it's looking at a shapefile index, right? But it'll only go and get the ones it needs right now to do the tiling it needs. So it's essentially acting as a cache for the 48 terabytes, caching on SSDs
that are local to the host this particular virtual machine is running on, okay? So, all of a sudden, as long as I have S3 and the layout is correct and the data is good to go, I don't have to maintain, for my 20 WMS servers, the data store that they can see, right?
It's all one, right? It's one copy, right? Now, that might be interesting from a Mark's-got-his-system-up-and-running perspective, but what's more interesting is that because those GeoTIFFs are marked requester-pays, and because every object in that bucket
is marked for authenticated access, if you have an account, you can do the same thing, right? You can run your system and not have to copy the data. You do have to fire up your virtual machines in the region where this data resides; otherwise you'll have a latency effect
and another cost factor to deal with. But everyone in this room, as soon as you have an account, and I'm talking about one page of code here, and almost all of the code is open-source, FOSS code, you can have a NAIP server, a tiling server
that will deliver the United States. The other aspect, remember I said the S3 bucket has no limits; you can keep pumping data in there. So, for example, when next year all the states go from one meter to half a meter, what does that mean? You have four times the amount of data, right?
You have much more work around processing that data to create, for example, internally optimized files. Typically you'd take these uncompressed files that are delivered by the USDA, internally tile them, and probably JPEG-compress them; a bunch of batch-processing work
that you'd have to do in order to get that back into your working stack. You don't have to do that anymore. Or even if you did, you now have HPC resources you can use for a day to do that batch processing. Why? Because you're in the cloud, right? I'm talking really generically about the cloud,
I mean, those are the design patterns, those are the kinds of strategies you can take, because we're going to publicly available endpoints now, not an on-prem environment, okay? So, here you see the data,
and I'm gonna jump back now into the console, because there are a couple of other things I wanted to point out that have to do with S3. So over here, we're looking at the source data, right? And that source data is delivered as four-band,
but if you're just doing a base layer, for example for a public site, you don't need four bands. Typically what you'll do is create an RGB-only derivative of that, right? So that's what this is. This was not delivered by the USDA. This I built using, actually, Beanstalk
and GDAL. So over here, instead of those almost-200-meg files, the 100-percent originals, these are the same files now compressed,
internally tiled, only three bands, and JPEG-compressed at, I think it was, quality 90, and they're much smaller, right? The point I'm trying to make here is that if one person does this, nobody else should ever have to do it again, right? So it's another aspect of one copy should do, right?
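That derivative step can be sketched as a gdal_translate invocation: keep bands 1 through 3, tile internally, JPEG-compress. The flags are standard GDAL creation options; the file names and the quality value are illustrative:

```python
# Sketch of the RGB-derivative build. PHOTOMETRIC=YCBCR is a common
# companion to JPEG compression for RGB imagery; running this for real
# requires GDAL to be installed.
def rgb_derivative_cmd(src, dst, quality=90):
    return [
        "gdal_translate",
        "-b", "1", "-b", "2", "-b", "3",   # keep RGB, drop band 4 (NIR)
        "-co", "TILED=YES",                # internal tiling for fast reads
        "-co", "COMPRESS=JPEG",
        "-co", f"JPEG_QUALITY={quality}",
        "-co", "PHOTOMETRIC=YCBCR",
        src, dst,
    ]

cmd = rgb_derivative_cmd("m_3411861_ne_11_1_20140505.tif", "rgb.tif")
# subprocess.run(cmd, check=True)
```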
So this is also in the bucket, right? You don't want to go look at the uncompressed originals, because they're going to be slow, right? And you probably don't need that fourth band, unless you're doing, you know, crop-related analytics or something. You might, but typically you probably want this, and this data is now just in the same bucket, right?
It doesn't have to be on some different volume because you ran out of space on the original. I just added it and made it obvious in the bucket, and it's just part of the package. So this is the kind of thing the content owners could do, right, because everybody on the planet is probably doing it anyway,
and in order to reduce the amount of heavy lifting around actually using this kind of content over time, right? So that's what that is. And let's see, over here I'm back in the console. I'm sorry I keep jumping back and forth. So I'm looking at this bucket called naip-tms.
So this is essentially the cache, right? And if you look at this thing, there's a bunch of stuff that S3 can do that a lot of people don't know about. The one thing we're using right now is static hosting mode. So S3 can act as just your website, right? Earlier I was talking about
universities doing things like using WordPress to push to S3, or you can have your personal website on S3, simple to do. But one of the things you can also do is not just enable index.html, but also handle redirection.
So for example, in this case, if you get a 403, right, what do you do? And you saw this a little while ago, right? You send it to the Tiler. The Tiler gets the incoming request for the tile, decodes it, generates a WMS request,
creates a tile, serves the tile, but more importantly, puts it into S3 for the next request. Very simple. Another thing here, let me scroll up a second. So there's static web; I'm just leveraging static website hosting, right? An existing feature of S3.
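The whole redirect-and-fill pattern just recapped (403 on a cache miss, redirect to the Tiler, translate the TMS path into a WMS GetMap, serve the tile, write it back to S3) can be sketched as follows. The host names, bucket name, and layer name are made up; the WebsiteConfiguration shape is what S3's put-bucket-website call expects, and the tile math is standard TMS / Web Mercator arithmetic:

```python
import threading
from urllib.parse import urlencode

# 1) S3 static-website routing rule: on a 403 (tile not cached yet),
#    redirect the client to the Tiler.
website_config = {
    "IndexDocument": {"Suffix": "index.html"},
    "RoutingRules": [{
        "Condition": {"HttpErrorCodeReturnedEquals": "403"},
        "Redirect": {"HostName": "tiler.example.com",
                     "Protocol": "http",
                     "HttpRedirectCode": "302"},
    }],
}
# boto3.client("s3").put_bucket_website(
#     Bucket="naip-tms", WebsiteConfiguration=website_config)

# 2) The Tiler decodes /z/x/y and turns it into a WMS GetMap bounding box.
WORLD = 20037508.342789244          # Web Mercator half-extent, in meters

def tms_to_wms_url(z, x, y, layer="naip"):
    size = 2 * WORLD / 2 ** z       # tile edge length in meters at zoom z
    minx = -WORLD + x * size        # TMS convention: y counts up from south
    miny = -WORLD + y * size
    bbox = f"{minx},{miny},{minx + size},{miny + size}"
    q = {"SERVICE": "WMS", "VERSION": "1.1.1", "REQUEST": "GetMap",
         "LAYERS": layer, "SRS": "EPSG:3857", "BBOX": bbox,
         "WIDTH": "256", "HEIGHT": "256", "FORMAT": "image/jpeg"}
    return "http://wms.example.com/wms?" + urlencode(q)

# 3) Serve the freshly rendered tile, and write it back to the cache on a
#    side thread so the next request is an S3 hit.
def serve_tile(key, render, cache_put):
    data = render(key)
    t = threading.Thread(target=cache_put, args=(key, data))
    t.start()
    return data, t
```

In the demo, `cache_put` would be an S3 PutObject into the naip-tms bucket; here it is left as a plain callable.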
Down here I have something called lifecycle. You'll see that for zoom levels 16 through 19, I have a lifecycle policy that just deletes the tiles, right? So, you know, this is in test/dev mode, so I just delete them after, like, 24 hours, right?
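That deletion rule is just data. A sketch of a lifecycle configuration expiring zoom levels 16 through 19 after a day; the bucket name and prefix layout are illustrative, while the Rules shape is what S3's lifecycle API expects:

```python
# One expiration rule per high-zoom prefix: tiles simply age out, and the
# Tiler regenerates them on the next miss. No cache-eviction code to write.
def tile_expiry_rule(prefix, days=1):
    return {"ID": f"expire-{prefix.rstrip('/').replace('/', '-')}",
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
            "Expiration": {"Days": days}}

lifecycle = {"Rules": [tile_expiry_rule(f"{z}/") for z in range(16, 20)]}
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="naip-tms", LifecycleConfiguration=lifecycle)
```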
In production, you could probably keep them live much longer than that, right? But the point here is that typically there's all kinds of heavy lifting around even maintaining this aspect of a cache this large. Because this is on S3, in an object store,
it's just a lifecycle policy, right? It's the same model that, for example, folks in the world of video use to process classroom video. They're taking all kinds of classroom video, or they're taking video off of roadways or something (that's a little bit scarier),
and they'll pump it into S3. They'll probably encode it into another, more mobile-friendly format, right? And having done that, they'll use exactly the same lifecycle feature to pump it into Glacier, which is our archival store, which again drops the price down.
It's simply a matter of coming down here and adding a rule, right? Very little, basically no coding; it's just setup. If you want to write the code, of course, you can automate all of this via whatever language you like to work in. This is just a GUI implementation of a bunch of RESTful endpoints
that allow you to do things like, you know, change the lifecycle policy. So the last thing I wanted to speak to, since we're almost out of time.
I think I have one more slide, if I can find it. So the last idea here is, you know, typically we think about S3 in terms of, you know, static content, right?
So things like, you know, our web page, our HTML base, you know, front end or something that basically doesn't change that frequently, right? But if you look at, you know, more interesting and high scale use cases in the Amazon cloud, what you find is that,
yes, of course it works for, you know, relatively static stuff like a database backup. For example, a backup of an Oracle database using RMAN or something. And that might be done, you know, once a night or something, right? But we have a lot of customers out there
that are increasingly using S3 as more of a, you know, short term data store, right? You can do that because all you're doing is, you know, changing up the lifecycle policy. And so, and this is, I think, important, especially in, for example, government use cases
where we're talking about things like open data. So it maps to this idea of, you know, we have a lot of, for example, government customers that are interested in, for example, providing API endpoints to be more open. But the fact of the matter is, it might be a lot easier for them to have a system
that just pumps data frequently, or more frequently, into the object store, and lets the end user figure out how they want to use the data, right? It's a different model. So rather than a WMS or WMTS endpoint for, let's say, some kind of open government data,
it might make more sense for the government customer to be pumping, you know, CSV files into the object store, basically because the government doesn't know what the customer's use case may be, and the customer may rather have something
that resides in something that doesn't require a government SLA. It's just an S3 bucket, because then it becomes our SLA, right? There's a big difference there. So rather than focusing on providing open data via APIs that are run and controlled by the government, it might make more sense,
for example, for government use cases, whether it's, you know, geo data or some PDF file or something, to actually just pump that into an object store and let the customer, whether that's an individual citizen or whether that's a, you know, private sector entity that's building, you know, like a traffic application on top of that, access to the raw data
from which they can then build an API endpoint or a RESTful endpoint kind of thing. So I don't want people leaving thinking that it's just good for some long-duration cache; it's actually good for very short-duration content too. It really is a cache, not just a static data store.
So that's it for my presentation. Thank you very much for listening. I'll be here until sometime tomorrow afternoon, I think, and I'm at the booth kind of back in the corner over there. So if you're interested in hearing more, I'm happy to help you. I apologize, I'm the only one here.
It's kind of a last minute thing. I knew I was coming, but I, you know, I'm like solo. So my one thing I want to ask you is if you can leave me with your business card, I would very much appreciate it because I've been told to come back with data because we're a data-driven company
and I want to make sure that we sponsor more of these events. So thank you very much. Appreciate it. Good job. Thanks.