
Don't Copy Data! Instead, Share it at Web-Scale


Formal Metadata

Title
Don't Copy Data! Instead, Share it at Web-Scale
Title of Series
Number of Parts
183
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor, and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers
Publisher
Release Date
Language
Producer
Production Year: 2015
Production Place: Seoul, South Korea

Content Metadata

Subject Area
Genre
Abstract
Since its start in 2006, Amazon Web Services has grown to over 40 different services. Amazon Simple Storage Service (S3), our object store and one of our first services, is now home to trillions of objects and core to many enterprise applications. S3 is used to store many kinds of data, including geo, genomic, and video data, and facilitates parallel access to big data. Netflix considers S3 the source of truth for all its data warehousing. The goal of this presentation is to illustrate best practice for open or shared geo-data in the cloud. To do so, it showcases a simple map tiling architecture, running on top of data stored in S3, and uses CloudFront (CDN), Elastic Beanstalk (Application Management), and EC2 (Compute) in combination with FOSS4G tools. The demo uses the USDA's NAIP dataset (48TB), plus other higher resolution city data, to show how you can build global mapping services without pre-rendering tiles. Because the GeoTIFFs are stored in a requester-pays S3 bucket, anyone with an AWS account has immediate access to the source GeoTIFFs at the infrastructure level, allowing for parallel access by other systems and, if necessary, bulk export. However, I will show that the cloud, because it supports both highly available and flexible compute, makes it unnecessary to move data, pointing to a new paradigm, made possible by cloud computing, where one set of GeoTIFFs can act as an authoritative source for any number of users.
Transcript: English (auto-generated)
Can everybody see this? We're having some small non-cloud-related technical difficulty. This is all on-prem, does that make sense?
So I'm going to go ahead and get started, it's 11 o'clock. My name is Mark Korver, I'm with Amazon Web Services, I'm part of the Solution Architecture team on the public sector side of Amazon, that means I work with our government customers and our education customers, and I am the geospatial lead on the Solution Architecture
specialist team. So I'll talk for about 20 minutes today, or maybe 15 minutes, I have Kevin here who's going to keep an eye on the clock, and then Kevin Bullock from DigitalGlobe will
talk after me right on schedule, we're going to keep going here. So I hope you can see this, I'm sorry it's a little bit small. My message today is very simple, and as you can see in the title, it's about how we shouldn't be copying data, and instead we should be sharing data, especially if it's open data,
and because of the cloud we can share it at any scale we want. And that's what I call one of the very different architectural possibilities that the cloud affords versus on-prem deployment, especially of big geo data, okay, I don't know what
happened there, I'm going to close that. So I just want to cover, I will spend most of my time just quickly showing a demo, I'm going to race through this, and so I won't bore you with slides too much.
So just a couple of review points, so data copy is expensive, storage costs, we have network costs, compute costs, and then if we follow kind of the old world principles of what we call, at least in the U.S., clip and ship model, which means you go to some portal, you look at some catalog, you discover the data, having discovered
the data, you typically download the data, and then work with the data, then the cost to distribute, update the distributed copies becomes expensive. So there's all these costs that we have to deal with on a day-to-day basis in the kind of traditional clip and ship model of big geo data, and generally the idea is
if the data gets large enough, then you can't get it all, so you have to have some way to go get some small piece of it, download it to your on-prem workstation or to your server or to your notebook, and then work with it there. Hopefully when you're working with it, we're using something like QGIS and some
open source tools, but you still have to go through this kind of mini ETL process about getting that data. So, as you all know, we live in a, or we have been living in a world of silos. This is a slide that I borrowed from Stanford University's library website.
And it's a very simple idea, right? We have all these data centers run by vendors, run by our government customers, and they all have all kinds of interesting data at the bottom level, but they're generally siloed. So they're generally siloed for security reasons, they're siloed for economic reasons,
and the larger that data gets at the bottom of that map there, or that image, or the diagram, the harder it is to get the data out from that silo for a variety of reasons. And one of the key points here is, as one of the largest providers of cloud services
in the world now, we're seeing a huge migration of customers moving from on-prem facility to the cloud. And generally, what's happening is that we see silos moving to cloud, but they maintain siloed architecture.
And so, especially in geospatial, we have a lot of customers that are running core systems on us now, and increasingly a number of customers that I see, that I'm talking to every day, that have exactly the same data stored in their cloud, right next to somebody else's
cloud architecture. So we see that as a bug, if it's not, you know, if there aren't license considerations and if it's open data, then those customers should probably be sharing one copy of the data. And that's generally my talk today, and I'll show you a practical example of how you can do that.
So what makes cloud storage different? Well, it's not siloed in a data center. And you can provision, in real time, very, very granular access to exactly that one GeoTIFF, exactly that one LAS file, whatever you want, via, you know, simple kind of static
methods or federated methods. It's your choice. There's a lot of flexibility there. And then the last point, which I can't emphasize enough, is because you're in the cloud, because you're not on-prem, you can offload the variable component of cost, which is network out. So network egress, you can offload to whoever is making the request for the data.
So what remains? Well, somebody still has to pay for storage, but for example, with the particular storage service that I'll be showing today, which is called Simple Storage Service, you only pay for what you actually store. So you don't have to do things like, you know, I might use two terabytes this year,
so I'm gonna get two terabytes of NAS storage. You just store what you need today, we charge you for what you have stored this month. We actually prorate it on a daily basis. So what's possible with cloud architecture is that you can now store what you typically
would have on POSIX file systems, on some file system deep down in your network, deep down in your data center. You can share that storage to any number of actors that you want, because it's not
your problem. It's our problem to make that data available via the network. So all you have to worry about is allowing network access at the object level. And if you had to pay for network out, then it would be a problem, but you can actually offload the network egress portion to the requester.
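The egress offload described here is what S3 calls Requester Pays. A minimal sketch of reading from such a bucket with boto3 (assumed installed) — the bucket and key names are illustrative placeholders, not the talk's exact ones:

```python
# Sketch: reading from a Requester Pays bucket with boto3.
# Passing RequestPayer="requester" acknowledges that the *caller's*
# account is billed for the request and the data egress.

REQUESTER_PAYS_ARGS = {"RequestPayer": "requester"}

def fetch_geotiff(bucket: str, key: str) -> bytes:
    """Download one object; egress charges go to the requesting account."""
    import boto3  # imported here so the sketch stays importable without the SDK
    s3 = boto3.client("s3")
    resp = s3.get_object(Bucket=bucket, Key=key, **REQUESTER_PAYS_ARGS)
    return resp["Body"].read()
```

Without that parameter, S3 refuses the request on a Requester Pays bucket, which is exactly the protection against "somebody else's bill" the talk describes.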
So here we have something called Simple Storage Service. Here we have many actors. So these could be virtual machines, this could be our Lambda service, this could be our managed Hadoop cluster, anything, right?
Whatever you want to run, whatever code you want to run. And this is your account. You pay for storage, but for example if you want to let other actors from other accounts access the storage, it's just a matter of setting access control lists for whatever data objects you want in here. So generally the idea is you have infinite network, horizontal network access here.
So there can be any number of actors horizontally on top of your data, and that you could not do if it was in your data center.
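Setting the access control list on an object, as just described, is a one-call operation. A hedged sketch with boto3 — the canned ACL shown matches the authenticated-read setting demonstrated later in the talk; bucket and key names are placeholders:

```python
# Sketch: opening one object to other AWS accounts by setting its ACL.
# "authenticated-read" lets any authenticated AWS account read the object,
# while the bucket owner keeps full control.

CANNED_ACL = "authenticated-read"

def share_object(bucket: str, key: str) -> None:
    """Grant read access on a single object to all authenticated AWS users."""
    import boto3  # kept inside the function so the sketch imports cleanly
    s3 = boto3.client("s3")
    s3.put_object_acl(Bucket=bucket, Key=key, ACL=CANNED_ACL)
```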
So it's a very simple concept. It's not a file system. All it is, and it's not FTP, all it is is HTTP. That's all I'm talking about, right? So in a sense we're going back to kind of HTML 1.0 days and talking about using object stores rather than file systems.
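To underline the "it's just HTTP" point: a public S3 object is reachable with nothing but a standard-library HTTP client. A sketch, with a placeholder bucket name:

```python
# S3 is "just HTTP": a public object answers a plain HTTP GET.
# The helper builds the virtual-hosted-style URL; no SDK, no FTP.

from urllib.request import urlopen

def object_url(bucket: str, key: str) -> str:
    """Virtual-hosted-style URL for an S3 object."""
    return f"https://{bucket}.s3.amazonaws.com/{key}"

def fetch_public(bucket: str, key: str) -> bytes:
    """Plain HTTP GET of a publicly readable object."""
    with urlopen(object_url(bucket, key)) as resp:
        return resp.read()
```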
And the one comment I like to make here is that, so I've worked with a lot of customers, not just geospatial customers, but customers in an education space, customers doing genomic studies, customers doing Alzheimer's brain research, customers doing pharmaceutical research,
et cetera, et cetera. The larger the system is, the more kind of embarrassingly parallel compute the system is, the more the core infrastructure relies on simple storage service S3. In fact, Netflix has a famous comment where they say, they see the object store as their
source of truth. They actually treat the object store more like a database than just an object store. So I'm going to stop there. I will turn this thing off, and I'm going to jump over to a browser, and I need to
make this smaller, move it over a bit, and I'm going to show you a very simple demo. So here, excuse me, let me reload this. So all this is is Leaflet. All it is is base layers, right? So open source JavaScript library, I'm not doing anything here other than image tiles.
And the idea here, and it operates just like you'd expect, I'm going to go from, this is the city of Oakland's data, and I'm going to the NAIP data, which is the United States Department of Agriculture data, USDA NAIP data.
So this is a coast-to-coast, what we call a CONUS set, a one-meter-per-pixel data set, very well known in the United States. There are no copyright restrictions. I can download it, play with it, do anything I want with it. I could try to sell it to you, but you shouldn't buy it because it's free, that
kind of thing. And also, if I move this thing, you'll see that my demo's actually working, and if I really do have a network connection, you see the gray tiles coming in. All that's going on is that, based off of, right now it's a 75 terabyte set, it's
in real time, re-projecting a bunch of GeoTIFFs and creating JPEGs on the fly. So this is a real-time map tiling architecture that's using MapServer and GDAL on some Ubuntu instances in the background, and then tiling this data in real time.
Whoops, I didn't mean to zoom in. And I'll show you what's happening on the back end by opening up Firebug, and move this over a bit, and you can see that as I move this thing, it's first going to this thing called naptms.s3, so it's going to the S3 bucket to see whether the JPEG
exists or not. If the JPEG does not exist, you can see that it's doing a redirect to the tiler, which is running on one of our platform services called Elastic Beanstalk. And we can open this guy up and pop it into a new tab, and it does exactly what
you'd expect. It creates a little JPEG. But it's doing this by sourcing somewhere between 218,000 and 219,000 GeoTIFF files that are sitting in S3. They are not on EBS, they are not on our new Elastic File System, they are shared
across n number of virtual machines on S3. And so I'm using parts that I've actually been using for many years, open source parts. I'm doing maybe a couple paragraphs of code to deploy this in what I call a cloudy fashion.
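The check-then-redirect tile flow just described — ask the S3 bucket for the cached JPEG, and fall back to the tiler on a miss — can be sketched roughly like this; the host names are invented placeholders, not the demo's real endpoints:

```python
# Sketch of the tile routing the demo performs: a hit is served straight
# from the S3 bucket, a miss is redirected to the tiler app that renders
# (and caches) the JPEG from the source GeoTIFFs.

def tile_key(z: int, x: int, y: int) -> str:
    """Standard z/x/y key under which a rendered tile is cached."""
    return f"{z}/{x}/{y}.jpg"

def route_tile(z: int, x: int, y: int, cached: set) -> str:
    """Return the URL a tile request should resolve to."""
    key = tile_key(z, x, y)
    if key in cached:  # cache hit: serve the static JPEG from the bucket
        return f"https://tiles.example.s3.amazonaws.com/{key}"
    # cache miss: redirect to the tiler running behind the load balancer
    return f"https://tiler.example.com/render/{key}"
```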
For example, if I can get this to work, I'm just putting it into debug mode, and
you can see all that it's doing is taking this TMS name, JPEG, and then rearranging that into a WMS request right here. You can see, for example, that it's running in US East, it's on Amazon Web Services elastic load balancer, behind which I can have any number of EC2 instances I want.
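The TMS-to-WMS rewrite described here amounts to turning a z/x/y tile address into an EPSG:3857 bounding box. A sketch of that math, with a placeholder WMS endpoint:

```python
# Sketch: converting a TMS tile address into the WMS GetMap bounding box
# the tiler requests. EPSG:3857 (Web Mercator) spans +/- ORIGIN metres.

ORIGIN = 20037508.342789244  # half the Web Mercator world width, in metres

def tms_to_bbox(z: int, x: int, y: int):
    """TMS tile (y counted from the south) -> (minx, miny, maxx, maxy)."""
    tile = 2 * ORIGIN / (2 ** z)  # side length of one tile at this zoom
    minx = -ORIGIN + x * tile
    miny = -ORIGIN + y * tile
    return (minx, miny, minx + tile, miny + tile)

def wms_url(z: int, x: int, y: int) -> str:
    """Build the WMS GetMap request for one 256x256 tile."""
    bbox = ",".join(f"{v:.6f}" for v in tms_to_bbox(z, x, y))
    return ("https://tiler.example.com/wms?SERVICE=WMS&REQUEST=GetMap"
            f"&SRS=EPSG:3857&WIDTH=256&HEIGHT=256&BBOX={bbox}")
```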
And if I show you that part, I can go to the console, and you can see I have a couple of, I think four, three, three, four extra larges. These are virtual machines running in the cloud, and I can modulate the scale of that
just by going to the auto-scaling part of the console, finding my MapServer group, and hitting the edit button, and for example, I change these to 10, et cetera, and
if I remember to hit the save button, then within a few minutes, about two, three minutes, I'll have 10 Ubuntu instances running MapServer and GDAL, and I didn't have to do any ETL work around the 50 terabytes of data. That's all been, it's all embedded in the Amazon Machine Image that I'm running,
which I'm, by the way, happy to share with anybody in the room, okay? Now, remember, this is what we're pretty used to seeing with the number of large commercial search engine portals now; this kind of slippy map concept has been around since, what, 2005?
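The console steps just described — find the Auto Scaling group, edit the desired count, save — can also be scripted. A hedged sketch with boto3; the group name and bounds are placeholders:

```python
# Sketch: resizing the tiler fleet programmatically instead of through
# the console. Changing DesiredCapacity launches or terminates instances
# within the group's min/max bounds.

def clamp_capacity(desired: int, lo: int, hi: int) -> int:
    """Keep the requested fleet size inside the group's min/max bounds."""
    return max(lo, min(hi, desired))

def scale_tilers(group: str, desired: int, lo: int = 1, hi: int = 20) -> None:
    """Ask the Auto Scaling group for `desired` MapServer instances."""
    import boto3  # kept inside the function so the sketch imports cleanly
    autoscaling = boto3.client("autoscaling")
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=group,
        DesiredCapacity=clamp_capacity(desired, lo, hi),
    )
```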
The idea here is anybody with a credit card can deploy this national, if not global, back end, maybe not for the whole year with a lot of machines, but you can most definitely do it for a few hours just to play with this, right? That's within your individual researcher's scope now, which is very different from
if you did this in an on-prem environment, and the reason is, very simple, it's not your data, it's shared, and so now I'm putting up, let me make this smaller, this is a vendor-provided tool, the vendor is CloudBerry, so I'm running on Windows here, so that's
probably one of the better Windows tools for this. It's an S3 client, so now I'm looking, not the browser, but using a client dedicated to S3 and a couple of other things, and I'm looking, I'm gonna go look for the data that I'm using under the hood for those JPEG images. So that, on the right-hand side, is our public data account, and in the public data
account, there's all kinds of data, right? Behavioral sciences data, genomic data, Alzheimer's research data, et cetera, et cetera, and part of that, the data I put in here is somewhere, here we go, aws-naip, so NAIP, and if you have this client, you can go look at this data, you can see
this is US data, so the state abbreviations come up right away, so let's go look at California, here's the 2002, 2014 data, remember, this container is endless, so I can have 2016, 2018, on and on and on, and I never had to worry about running out of storage,
right, it keeps going forever, and the other more important part is, if you know the name of this bucket, everybody in the room has access to the bucket, you do need an AWS account, which is free, but you have access to, right now there's 75 terabytes
of the data, and it follows our open data best practice pattern, and what that is, is two things, one is very simple, I have to give you access to the data, so I'm gonna drill down into the data, here's the four-band original data, these are US FIPS codes,
and here's the data, in one set, there's about a quarter of a million of these files, just under 200 megabytes, and so these are from the prime contractor, this is the original TIF data, if I go open the ACL for this, you'll see that it allows read by authenticated users, okay, so that means that if you have an AWS
account, and you can remember aws-naip, you can gain access to all the US data, okay, and it's as simple as that there, but remember I mentioned that as the owner of this data, I might not want to pay for your
taking the data out, right, downloading the data out, especially if you wanted to DDoS my bucket, right, because you didn't like me, let's say, right, and you had some, you know, machine process that kept downloading petabytes of data, I would cry, because it would be my bill, right, now I can take care of that by a
feature that's been available in S3 from the beginning, it's called Requester Pays, and if I right-click this, go on properties, and hit the Requester Pays tab, you'll see that it's turned on, that means that if you are another account,
and you request this data, and you download it, for example, you know, to my notebook here, right, then the requesting account pays for the request and the data egress, and that allows me, if I was a data owner, so for example, if I was the United
States Department of Agriculture, which owns the USDA data, that allows me to share it with the world without expense, so I can have petabytes of data, you know, I could be NOAA with petabytes of data, and I could make it available to everybody on the planet, they could even DDoS me via their own Requester Pays requests, but
then they would pay, so I don't care, right, it's as simple as that. How are we doing, about seven minutes? Okay, so I'm giving you a couple of views, right, one is the very familiar slippy map, which is a JPEG, 256 by 256, like we use every day,
right, which is a derivative of the geotiffs that you were looking at just a second ago, which is available, you can build those as quickly as you want, as a function of your auto scaling size, min-max limit, right, so you have all the flexibility, so you can be very embarrassingly parallel, or just a
little bit embarrassingly parallel about that process, right, and then the other view is, so I'm using a client, and there's, you know, there's open source, there's command line, there's Python tools, there's Java tools, there's all kinds of tools for you to gain access to S3, S3 has been around since 2006, so
you can choose any, you know, any client or any language you want to get access to S3, and then the last thing I want to show is, so how about these machines, how's Mark doing the machines, so here is PuTTY, so this is SSH
into one of the Ubuntu instances that are running MapServer and GDAL, and I think I have to restart this, okay, so that's where my demo died, but you can see that when I, before I did this, I'm using another open
source project called yas3fs, which is a Python project, you can find it on GitHub, it uses our botocore, and it allows you to mount any bucket you want,
and just make it available now to GDAL, right, so I can run Map Server GDAL on this instance, this instance does not have, you know, 75 terabytes of data immediately on it, but this package will go get it, put it in, put it on some SSDs that look local to the machine, and then manage
your cache intelligently in the background, so that allows me to spin up one instance, or spin up a hundred instances, within minutes, and then basically provide a slippy map for the United States, Korea, or the whole globe, if I wanted, or I should say if I had the data, I probably have to talk to DigitalGlobe
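A minimal sketch of invoking yas3fs as described here: the basic form is `yas3fs <s3-url> <mountpoint>` (consult the project's README for the exact cache options), and the bucket and mount paths below are illustrative:

```python
# Sketch: mounting an S3 bucket with yas3fs so MapServer/GDAL can read
# GeoTIFFs through an ordinary local path. yas3fs fetches objects on
# demand and caches them on local disk.

import subprocess

def yas3fs_cmd(bucket: str, mountpoint: str) -> list:
    """Command line for the basic yas3fs invocation."""
    return ["yas3fs", f"s3://{bucket}", mountpoint]

def mount(bucket: str, mountpoint: str) -> None:
    """Run yas3fs; raises if the mount command fails."""
    subprocess.run(yas3fs_cmd(bucket, mountpoint), check=True)
```

After the mount, GDAL and MapServer see the bucket contents as files, which is how one machine image can serve 75 TB it never copied.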
for all the global data, but I could now if I wanted to do that. So I'm going to stop now, excuse me, and I want to leave a couple minutes for questions, any questions, I know I ran through a bunch of different things, please feel free to grab me afterwards, happy to share the data,
happy to share the machine image, and the specific techniques that I'm using here, it should all be very familiar to many of you in the room, any questions?
Yeah, generally yes, so we do have a public data program; what I showed you just now is not part of it, but an example that is, is Landsat 8, that's our latest large open data, public data project that I
assisted a little bit on, that's where we're piping data from a bunch of USGS FTP servers, we're putting it in the same S3 bucket, and then making that available; you can use the same tool to go look at all the Landsat 8 data. Now, kind of the caveat here with the public data program is that we're
very, so we're interested in a type of data, and it has to be the kind of data that obviously facilitates interest in using our virtual machines, right, but more importantly, we want to make sure that it's properly curated and maintained over time, so generally that needs to be whoever the source is, right, not somebody in between, and they need to have
a good, you know, a relevant business model that makes sense for, you know, we want to be able to trust that organization to maintain that public data set over time so that it doesn't become stale and old.
Ordnance Survey, yes, so you know, so we're happy to, you know, if you want to, I don't know if it's public or not, but Ordnance Survey is
actually a customer already, and so they've been a user, I've helped them a little bit on that, but yeah, that's a, you know, I think a good example, right, so we're looking for, you know, other projects that look like Landsat 8, or for example, the NAIP data, where we
can work with the data owners to make it more easily available. That might be public, public-public, or that might be just the data owner's S3 bucket with requester pays turned on. The two general patterns, I would suggest looking at S3 with requester pays turned on, see whether your business model or the data owner's
business model makes sense there, and then as a next stage, there would be, you know, potentially consideration for public data, okay, any other questions? We've got one minute left, so one last question.
Hi, so far you've talked about the raster data, how about the vector data, how about sharing vector data in large scale, and how about spatial indices for that? Yeah, so we have a couple of projects going on, we actually have one, and you know, maps might be talking about it, but the idea there, for example, was getting the OSM data in a tiled pattern on S3,
so instead of a large, I can't remember what it's called, the global blob, the binary file, you know, there's techniques where we can tile that, put it in S3, so you don't have the ETL, you don't have to have the database to do a web-scale vector-based service,
so happy to talk to you more about that after too, so the same general idea applies for both vector imagery and things like point cloud or LiDAR data. Thank you very much.