
Don't Copy Data! Instead, Share it at Web-Scale

Video in TIB AV-Portal: Don't Copy Data! Instead, Share it at Web-Scale

Formal Metadata

Title: Don't Copy Data! Instead, Share it at Web-Scale
License: CC Attribution - NonCommercial - ShareAlike 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Production Place: Seoul, South Korea

Content Metadata

Subject Area
Since its start in 2006, Amazon Web Services has grown to over 40 different services. Amazon Simple Storage Service (S3), our object store, and one of our first services, is now home to trillions of objects and core to many enterprise applications. S3 is used to store many kinds of data, including geo, genomic, and video data and facilitates parallel access to big data. Netflix considers S3 the source of truth for all its data warehousing. The goal of this presentation is to illustrate best practice for open or shared geo-data in the cloud. To do so, it showcases a simple map tiling architecture, running on top of data stored in S3 and uses CloudFront (CDN), Elastic Beanstalk (Application Management), and EC2 (Compute) in combination with FOSS4G tools. The demo uses the USDA's NAIP dataset (48TB), plus other higher resolution city data, to show how you can build global mapping services without pre-rendering tiles. Because the GeoTIFFs are stored in a requester-pays S3 bucket, anyone with an AWS account has immediate access to the source GeoTIFFs at the infrastructure level, allowing for parallel access by other systems and if necessary, bulk export. However, I will show that the cloud, because it supports both highly available and flexible compute, makes it unnecessary to move data, pointing to a new paradigm, made possible by cloud computing, where one set of GeoTIFFs can act as an authoritative source for any number of users.
[Setup, partly inaudible] ... some non-cloud-related technical difficulty, a result of being on-prem, if that makes sense. OK, let's get going; it's 11 o'clock.

My name is Mark Korver and I'm with Amazon Web Services. I'm part of the solutions architecture team on the public sector side, so I work with our government customers and our education customers, and I am the geospatial lead on the solutions architecture specialist team. I'll talk for about 20 minutes today, or maybe 15; Kevin here is going to keep an eye on the clock, and [name unclear] from DigitalGlobe will talk after me, so we'll stay right on schedule. I hope you can see this; it's a little too small.

My message today is very simple, and you can see it in the title: we shouldn't be copying data. Instead we should be sharing data, especially open data, and because of the cloud we can share it at any scale we want. That is a very different architectural possibility that the cloud affords versus an on-prem deployment, especially for big geodata. OK, I don't know what happened there; let me close that.

I just want to cover a few points. I will spend most of my time quickly showing a demo, and I'll race through this because I won't bore
you with too many slides. So just a couple of review points. Copies are expensive: there is file storage cost, network cost and compute cost. If we follow the old-world principles, what we call the clip-and-ship model, you go to some portal, you look at some catalog, you discover the data, and having discovered the data you typically download it and then work with it. The cost of distributing copies, and of updating the distributed copies, becomes expensive. These are all costs that you have to deal with on a day-to-day basis in the traditional clip-and-ship model of handling geodata. If the data gets large enough, you can't get it all, so you have to have some way to go get some small piece of it, download it to your own workstation or your server tier, and then work with it there. Even when you're working with it using open-source tools, you still have to go through this kind of ETL process to get that data.

As you all know, we have been living in a world of silos. This is a slide that I borrowed from Stanford University's library website, and it's a simple idea: we have all these data centers, run by vendors or run by our government customers, and locked in them are all kinds of interesting data at the bottom level. But they're generally siloed, for security reasons or for economic reasons, and the larger the data gets at the bottom of that diagram, the harder it is to get the data out of that silo, for a variety of reasons.

One of the key points here is that, as one of the largest providers of cloud services in the world, we now see a huge migration of customers moving from on-prem facilities to the cloud. Generally we're happy about that, but we see silos moving to the cloud while maintaining a siloed architecture. Especially in geospatial, we
have a lot of customers that are running their core systems on us now, and an increasing number of the customers I talk to every day have exactly the same data stored in their cloud architecture, right next to somebody else's cloud architecture. We see that as a bug. If there are no licensing considerations, if it's open data, then those customers should probably be sharing one copy of it. That is generally my talk today, and I'll show you a practical example of how you can do that.

So what makes the cloud different? The data is not siloed in a data center. You can provision, in real time, very granular access to exactly that one GeoTIFF, or that one LAS file, or whatever you want, using simple static methods or federated methods, at your choice. There is a lot of flexibility. And the last point, which I can't emphasize enough: because you're in the cloud, because you're not on-prem, you can offload the variable component of cost, which is network out, network egress, to whoever is making the request for the data.

So what remains? Somebody has to pay for storage. With the particular storage service I'll be showing today, which is called Simple Storage Service, you only pay for what you actually store. You don't have to think, "I might use two terabytes this year, so let me get two terabytes of storage now." You just store what you need today, and we charge you for what you have stored this month, prorated on a daily basis.

What is possible with a cloud architecture is that you can now take what you typically would have on your POSIX file systems, deep down in your network, deep down in your data center, and share that storage with any number of actors you want, because it's not your problem to make that data available via the network. All you have to worry about is allowing network access at the object level.
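The "pay only for what you store, prorated daily" point can be illustrated with a little arithmetic. This is a minimal sketch that follows the speaker's description rather than any actual billing formula; the function name, the 30-day month and the price are all made up for illustration.

```python
def prorated_storage_cost(daily_tb_stored, price_per_tb_month, days_in_month=30):
    """Bill for the terabytes actually stored each day, not for reserved capacity."""
    tb_days = sum(daily_tb_stored)  # one list entry per day of the month
    return tb_days * price_per_tb_month / days_in_month


# Hypothetical month: 1 TB stored for 10 days, then 2 TB for the remaining 20 days.
usage = [1.0] * 10 + [2.0] * 20
PRICE = 30.0  # made-up price per TB-month, chosen only to keep the arithmetic easy

pay_as_you_go = prorated_storage_cost(usage, PRICE)
reserved_up_front = 2.0 * PRICE  # buying 2 TB of capacity for the whole month

print(pay_as_you_go, reserved_up_front)
```

Under these assumed numbers the prorated bill is lower than reserving two terabytes for the whole month, which is the contrast the talk draws with provisioning capacity up front.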
If you had to pay for network out, that would be a problem, but you can actually offload the network egress portion to the requester. So here we have Simple Storage Service, and here we have many actors. These can be virtual machines, this could be a Lambda service, this could be our managed Hadoop cluster, anything where you want to run whatever code you want to run. This is your account, and you pay for storage. If you want to let actors from other accounts access it, it's just a matter of setting an access control list on whatever data object you want. You effectively have unlimited horizontal network access: there can be any number of actors, horizontally, on top of your data, and that you could not do if it was in your data center.

It's a very simple concept. It's not a file system, it's not FTP; all it is is HTTP. In a sense we are going back to Web 1.0 days, and talking about using object stores rather than a file system.

The one comment I'd like to make here is that I work with a lot of customers, not just geospatial customers: customers in the education space, customers doing genomics studies, customers doing brain research, customers doing pharmaceutical research, et cetera. The larger the system is, and the more embarrassingly parallel the compute of the system is, the more the core infrastructure relies on Simple Storage Service, S3. In fact, Netflix has a famous comment where they say they treat the object store as a source of truth; they treat the object store more like a database than just an object store.

So I'll stop there. I will turn this thing off, jump over to the browser, and make this smaller over there, and I'll show you a very simple demo.
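Because the object store is "just HTTP", reading an object comes down to building a URL and issuing a GET. The sketch below only constructs the request and does not send it. The bucket and key are hypothetical, and a bucket that is not public would additionally need a signed request, which is omitted here.

```python
from urllib.parse import quote
from urllib.request import Request


def object_url(bucket, key):
    """Virtual-hosted-style URL for one object; the key is percent-encoded."""
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"


def build_get(bucket, key):
    """Prepare (but do not send) a plain HTTP GET for one object."""
    return Request(object_url(bucket, key), method="GET")


# Hypothetical bucket and key, for illustration only.
req = build_get("example-open-geodata", "ca/2014/tile 01.tif")
print(req.get_method(), req.full_url)
```

Any number of machines can build and issue requests like this independently, which is what makes the horizontal, parallel access described above possible.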
So here, let me reload this. This is Leaflet; all it is is basically layers. It's an open-source JavaScript library, and I'm not doing anything here other than image tiles. The idea here being that it operates just like you'd expect. This is the City of Oakland's data, and I'm going to go to the NAIP data, which is the United States Department of Agriculture data, USDA NAIP data. This is a coast-to-coast, what we call CONUS, set at one meter per pixel. It's a very well-known dataset in the United States, and there are no copyright restrictions: I can download it, play with it, do anything I want with it. I could even try to sell it to you, but you shouldn't buy it, because it's free.

If I move this, you'll see that my demo is actually working and that I really do have a network connection: you can see the gray tiles coming in. What's going on is that, based off of what is right now a 75-terabyte set of GeoTIFFs in S3, it is reprojecting in real time and creating JPEGs on the fly. So this is a real-time tiling architecture that is using MapServer and GDAL on some EC2 instances in the background, tiling the United States in real time. I'll show you what's happening on the back end by opening up Firebug over here.

You can see that as I move this thing, it first goes to a TMS endpoint [name unclear], which checks the S3 bucket to see whether the JPEG exists or not. If the JPEG does not exist, you can see that it does a redirect to something [name unclear] that is running on one of our platform services, called Elastic Beanstalk. We can open this one up and pop it into a new tab, and it does exactly what you'd expect.
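The request flow just described (look for the rendered JPEG in the bucket, otherwise redirect to the renderer) can be sketched as a small routing function. This is an illustration under assumptions, not the speaker's code: the function name, the key layout and the renderer URL are hypothetical, and a plain set stands in for the S3 bucket.

```python
def route_tile_request(tile_key, rendered_tiles, renderer_url):
    """Decide how to answer a request for one map tile.

    Returns (status, location): 200 and the stored object when the JPEG has
    already been rendered, otherwise a 302 redirect to the on-the-fly renderer.
    """
    if tile_key in rendered_tiles:
        return 200, "/" + tile_key
    return 302, renderer_url + "/" + tile_key


# Hypothetical state: one tile is already in the bucket, the other is not.
rendered = {"naip/12/654/2511.jpg"}
hit = route_tile_request("naip/12/654/2511.jpg", rendered, "https://tiler.example.com")
miss = route_tile_request("naip/12/655/2511.jpg", rendered, "https://tiler.example.com")
print(hit)
print(miss)
```

Whether the renderer writes its output back to the bucket so that the next request becomes a hit is not stated in the talk; the existence check suggests it, but that remains an assumption.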
It creates a little JPEG. It does this by sourcing from somewhere between 218,000 and 219,000 GeoTIFF files that are sitting in S3. They're not on EBS, and they're not on an elastic file system; they're shared across a number of virtual machines on S3. I'm using open-source parts that I have actually been using for many years, plus maybe a couple of paragraphs of code, and I deploy this with a service of ours called CloudFormation. Let me put it into debug mode, and you can see that all it is doing is taking this TMS request and rearranging it into a WMS request. Right here we can see, for example, that it is running in US East, on an Amazon Web Services Elastic Load Balancer, behind which I can have any number of EC2 instances I want. Let me show you that in the management console.
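Rearranging a TMS tile request into a WMS request is mostly bounding-box arithmetic. The sketch below assumes Web Mercator (EPSG:3857) tiles, TMS row numbering counted from the south, 256-pixel tiles and WMS 1.1.1 parameter names; the talk does not say which projection or WMS version the demo uses, and the layer name is hypothetical.

```python
import math
from urllib.parse import urlencode

HALF_WORLD = math.pi * 6378137.0  # half the width of the Web Mercator world, in metres


def tms_tile_bounds(z, x, y):
    """(minx, miny, maxx, maxy) in EPSG:3857 for TMS tile z/x/y (y counted from the south)."""
    size = 2 * HALF_WORLD / (2 ** z)  # width of one tile at zoom z
    minx = -HALF_WORLD + x * size
    miny = -HALF_WORLD + y * size
    return (minx, miny, minx + size, miny + size)


def tms_to_wms_query(z, x, y, layer, tile_px=256):
    """Build the query string of a WMS 1.1.1 GetMap request covering one TMS tile."""
    minx, miny, maxx, maxy = tms_tile_bounds(z, x, y)
    params = {
        "SERVICE": "WMS", "VERSION": "1.1.1", "REQUEST": "GetMap",
        "LAYERS": layer, "STYLES": "", "SRS": "EPSG:3857",
        "BBOX": f"{minx},{miny},{maxx},{maxy}",
        "WIDTH": str(tile_px), "HEIGHT": str(tile_px),
        "FORMAT": "image/jpeg",
    }
    return urlencode(params)


query = tms_to_wms_query(12, 654, 2511, "naip")  # hypothetical tile and layer name
print(query)
```

Tiles addressed in the XYZ scheme used by many web maps count rows from the north instead; in that case the row would first be flipped with `y_tms = 2 ** z - 1 - y_xyz`.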
You can see I have a couple of 4xlarges; these are virtual machines running in the cloud. I can modulate the scale of that just by going to the Auto Scaling part of the console, finding the MapServer Auto Scaling group and clicking the Edit button. For example, I can change these to 10, and if I remembered to hit the Save button, then within two or three minutes I would have 10 EC2 instances running MapServer. That's all. I didn't have to do any ETL work around the 50 terabytes of data; that's all been handled, and it's all embedded in the Amazon Machine Image that I'm running, which, by the way, I'm happy to share with anybody in the room.

Now remember, this is what we are used to seeing with a number of large commercial search engine portals; it's not a new concept, it has been around since 2005. The idea here is that anybody with a credit card can deploy this national, if not global, back end. Maybe not for the whole year with a lot of machines, but you can most definitely do it for a few hours, just to play with it. That is within an individual researcher's scope now, which is very different from doing this in an on-prem environment. The reason is very simple: the data is shared. So now let me bring this up and make it smaller. This is a vendor-provided tool.
It's CloudBerry, here on my Windows machine; it's one of the better Windows tools for this. It's an S3 client, so now I'm looking not through the browser but through a client dedicated to S3 and a couple of other things. I'm going to look for the data that I'm using under the hood for these image tiles. On the right-hand side is the public data account, and in the public data account there are all kinds of data, genomic data and so on. Part of the data I put in here about a year ago is NAIP. So if you remember the bucket name, aws-naip, and you have this client, you can go look at this data. You can see these are US states; the state abbreviations come up right away. Let's go look at California: 2012, 2014. And remember, these are just containers, so I can add 2016 and 2018 and on and on, and I never have to worry about running out of storage; it keeps going forever.

The other, more important part: if you know the name of this bucket, everybody in the room has access to the bucket. You do need an AWS account, which is free, but you have access right now to the 75 terabytes of data. And it follows our open data best practice, which is two things. One is very simple: I have to give you access to the data. Let me drill down into the data here. This is the four-band original data, these are US FIPS codes, and here is one set; there are about a quarter of a million of these files, each just under 200 megabytes. These are from the prime contractor; this is the original data. If I open the ACL for this, you'll see that it allows read by authenticated users. That means that if you have an authenticated AWS account and you remember the bucket name, aws-naip, you can gain access to all of this data.
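The "read by authenticated users" grant can be modelled in a few lines. This is a toy evaluation of an ACL, not the S3 implementation: the tuple layout and the account identifiers are hypothetical, while the group URI is the one S3 uses to denote any authenticated AWS account.

```python
AUTHENTICATED_USERS = "http://acs.amazonaws.com/groups/global/AuthenticatedUsers"


def can_read(acl, caller_account_id):
    """Return True if the ACL lets this caller read the object.

    acl is a list of (grantee, permission) pairs; caller_account_id is None
    for an anonymous, unauthenticated request.
    """
    if caller_account_id is None:
        return False  # this sketch models no public (anonymous) grant
    for grantee, permission in acl:
        if permission not in ("READ", "FULL_CONTROL"):
            continue
        if grantee in (caller_account_id, AUTHENTICATED_USERS):
            return True
    return False


# Hypothetical ACL: the owner has full control, any signed-in account may read.
acl = [("owner-account", "FULL_CONTROL"), (AUTHENTICATED_USERS, "READ")]
print(can_read(acl, "some-other-account"), can_read(acl, None))
```

With such a grant, knowing the bucket name and holding any AWS account is enough to read, which matches what the speaker shows in the client.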
It's as simple as that. But remember what I mentioned: as the owner of this data, I might not want to pay for you taking the data out, downloading it, especially if you wanted to DoS my bucket because you didn't like me. If you had some machine process downloading petabytes of data, I would cry, because it would be on my bill. I can take care of that with a feature that has been available in S3 from the beginning, called Requester Pays. If I right-click, go to Properties and look at Requester Pays, you'll see that it's turned on. That means that if you are another account and you request this data, and you download it, for example, to my notebook here, then your account pays for the requests and the download. If I were a data owner, for example the United States Department of Agriculture, which owns NAIP, that allows me to share with the world without the expense. I can have petabytes of data and make it available to everybody on the planet. They could even try to DoS me, but with Requester Pays turned on they would pay, so I don't get the bill. Those are the two simple things.

How much time do I have left? Seven minutes, OK. So I have given you a couple of views. One is the very familiar one, the map tile, a 256 by 256 JPEG that we use every day, which is a derivative of the GeoTIFFs that you were looking at just a second ago. You can produce those as quickly as you want, as a function of your Auto Scaling group size, min and max, so you have all the flexibility: the process can be very embarrassingly parallel or just a little bit embarrassingly parallel. The other view is through a client, and there are open-source command-line tools, Python tools, Java tools; there are all kinds of tools that let you access the service, which has been around since 2006.
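The effect of Requester Pays on who gets which bill can be sketched as follows. The cost figures are made up and the function names are hypothetical; the `x-amz-request-payer: requester` header is the one a requester sends to acknowledge being charged.

```python
def request_headers(requester_pays):
    """Extra HTTP headers a requester adds when reading from a Requester Pays bucket."""
    return {"x-amz-request-payer": "requester"} if requester_pays else {}


def split_bill(storage_cost, transfer_cost, requester_pays):
    """Split charges between the bucket owner and whoever downloads the data."""
    if requester_pays:
        return {"owner": storage_cost, "requester": transfer_cost}
    return {"owner": storage_cost + transfer_cost, "requester": 0.0}


# Made-up monthly figures: storage is fixed, transfer grows with every download.
with_feature = split_bill(storage_cost=100.0, transfer_cost=900.0, requester_pays=True)
without_feature = split_bill(storage_cost=100.0, transfer_cost=900.0, requester_pays=False)
print(with_feature)
print(without_feature)
```

With the feature on, the owner's bill no longer depends on how much others download, which is why a heavy or even hostile downloader costs the owner nothing extra.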
So you can choose any client or any language you want to get access to S3. The last thing I want to show is: how about these machines, how is Mark doing this on the machines? This is an SSH session into one of the EC2 instances that are running MapServer and GDAL, and I think that after we start this ... [partly inaudible]. So that's where my demo died. As you can see, I am using another open-source project called yas3fs, which is a Python project; you can find it on GitHub, and it uses boto. It allows you to mount any bucket you
want, and just make it available to GDAL. So I can run MapServer and GDAL on this instance. The instance does not have 75 terabytes of data on it, but this package will go get the data, put it in storage local to the machine, and manage the cache intelligently in the background. That allows me to spin up one instance, or spin up 100 instances within minutes, and basically provide the base map for the United States, or Korea, or the whole world if I wanted to, if I had the data. I might have to talk to DigitalGlobe for global data, but I could do that now if I wanted to.

So that's it. I have a couple of minutes for questions. Any questions? I ran through a bunch of different things; feel free to talk to me afterwards. I'm happy to share the data, happy to share the machine image, and the specific techniques I'm using here should all be very familiar to many of you.

[Audience question, inaudible]

Yes. We do have a public data program. What I showed you just now is not part of it, but one example is Landsat data, which is our latest large open public data project, and which I'm involved in a little bit. We are taking data from the USGS FTP servers, putting it in the same kind of S3 bucket and making it available online; you can use the same tool to take a look at it. With the public data program we are interested in data that facilitates interesting uses of our virtual machines, but more importantly we want to make sure that it is properly curated and maintained over time. So generally the provider needs to be whoever the source is, not somebody in between, and there needs to be a relevant business model that makes sense, because you really want to trust that organization to maintain that public dataset over time, so that it doesn't become stale.

[Audience question, partly inaudible, about the Ordnance Survey]

Yes. The Ordnance Survey is actually a customer already, and we have helped them a little bit on that. I think that is a good example. We are looking for other projects, Landsat data for example, where we can work with the data owners to make the data more easily available. That might be in the public data program, or it might be just the data owner's bucket with Requester Pays. Those are the two general patterns. I would suggest starting with Requester Pays turned on, seeing whether the business model makes sense for the owner of the data, and then, as the next stage, potentially considering the public data program. Any other questions? One last question.

[Audience question] So far we've talked about raster data. How about vector data? How about sharing vector data at large scale, and spatial indexes for that?

There are a couple of projects going on. We actually have one, and I know Mapzen might be talking about it. The idea there, for example, was putting the OSM data into a tiled vector form, a large binary file, and there are techniques where you can put that in S3, so you don't have the ETL and you don't have to have the database to run a web-scale base-map service. I'm happy to talk more about that too. The same general idea applies to vector, imagery and things like point clouds and lidar. Thank you very much.