Running Geospatial Workloads on AWS
Formal Metadata

Title: Running Geospatial Workloads on AWS
Series: FOSS4G SotM Oceania 2019 (part 8 of 52)
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI: 10.5446/44722
Language: English
Transcript: English (auto-generated)
00:01
Hey everyone. What I wanted to talk to you about first of all is the difference between running a geospatial workload on the cloud as opposed to on-premises or any other type of infrastructure. You're probably already aware that if you were running a workload on any cloud, you could stand up a bunch of virtual machines, provision a bunch of block storage, and everything would look exactly the same as any other environment.
00:26
But there's one picture I'd like you to take away with you when you leave here, and that's that the cloud gives you the option of flipping that whole infrastructure on its head. So instead of actually bringing the data to the compute infrastructure, you basically bring the compute infrastructure to the data.
00:44
And in the AWS environment that data is represented using S3 object stores. The concept there is that S3 is a highly durable, highly scalable object storage layer that pretty much any AWS service can integrate with natively.
01:00
And so I've got Amazon Athena there. You can run SQL queries directly against CSV files or JSON files or Parquet files sitting on S3; you don't need to load them into a database. You can run Apache Spark jobs on them. You can do MapReduce-type tasks on S3 directly, without moving the data from S3 into a Hadoop file system.
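To make that concrete, here's a minimal sketch of running an Athena query against a table backed by files on S3, using boto3. The database, table, column and bucket names are all made up for illustration; only the API calls are real.

```python
# A minimal sketch: run an Athena SQL query against CSV/Parquet files on S3.
# Database, table, column and result-bucket names below are hypothetical.
import time
import boto3

athena = boto3.client("athena", region_name="ap-southeast-2")

qid = athena.start_query_execution(
    QueryString=(
        "SELECT station_id, AVG(wind_speed) AS mean_speed "
        "FROM observations GROUP BY station_id LIMIT 10"
    ),
    QueryExecutionContext={"Database": "weather_demo"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query completes; Athena writes result files back to S3.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```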
01:22
You can have Amazon EC2 instances directly referencing the S3 data. And I'll show you an example a little bit later where the NREL organisation has HDF5 data sitting on S3 and is accessing it directly from EC2 instances, again without moving it to block storage.
01:44
So that's probably the key difference: because you're bringing the compute to the storage, it takes a lot of the plumbing and a lot of the hassle out of dealing with data. The key phrase there is share, don't copy. By sharing the data, by publishing it once from the
02:02
producers and consuming it n times globally across your choice of infrastructure, whether that's WorkSpaces for running, say, a heavy-duty visualisation software stack that needs multiple GPUs, 4K monitors and hundreds of gigabytes of RAM (you can run that on a virtual desktop), or EC2 instances, or MapReduce as I
02:23
mentioned. You can do serverless compute with Lambda, where you don't stand up servers at all: you write a piece of Python code, pull in pandas or your choice of geospatial libraries, and directly reference the S3 data. Or you can do machine learning training based on the S3 data as well.
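Here's a minimal sketch of that Lambda pattern, assuming pandas is available via a Lambda layer or container image; the bucket, key, column and event shape are hypothetical.

```python
# A minimal sketch of serverless processing: a Lambda handler that reads a CSV
# directly from S3 with pandas. Bucket, key and column names are hypothetical,
# and pandas is assumed to be packaged in a layer or container image.
import io
import boto3
import pandas as pd

s3 = boto3.client("s3")

def handler(event, context):
    # The invoking event is assumed to name the object to process.
    bucket = event.get("bucket", "open-weather-obs")
    key = event.get("key", "obs/2019/10/hourly.csv")

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    df = pd.read_csv(io.BytesIO(body))

    # Reduce in place and return a small summary; no servers to manage.
    return {"rows": len(df), "mean_wind": float(df["wind_speed"].mean())}
```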
02:44
And there are a bunch of AWS customers in New Zealand, in Wellington in fact, doing a whole lot of machine learning work with large image files and large sets of personally identifiable information as well. What I want to talk about specifically, though, is open data. Taking that a step further, once we're talking about sharing data and bringing the compute to it, it
03:03
makes it very attractive for organisations to publish their data and share it using that S3 layer. And what we've been hearing from customers like MetService and DigitalGlobe is that it would be really great if someone could take away the problem of managing big FTP servers or file servers or web
03:20
servers for distributing gigabytes or even terabytes of data a day, and of having to ensure availability. Likewise, the consumers of that data are telling us that what they really want is to avoid the hassle of regular downloads. When I was at MetService, a big part of our job was downloading global meteorological files from ECMWF,
03:41
from the UK Met Office, from GFS, from the Japanese Meteorological Agency, satellite data, all over the network infrastructure. That was almost our number one headache. Compared to that, running the weather models is relatively simple; I'm being a bit flippant, it's a big job. So AWS has responded by saying: why don't you let us manage that storage for you?
04:04
And if the data is genuinely useful to a large number of people, we'll take on the storage costs and the data access costs as well. So the only thing you need to do as a producer is provide the content. And probably the biggest advantage is that it opens up your data to a larger community of users.
04:27
We have a saying in AWS that 99% of the smartest people in the world do not work for AWS, they work for someone else. And that's probably true of your organizations as well, maybe not as much. But it's certainly the case that most of the brains actually don't work for you, and you can leverage that by publishing your data.
04:44
And I'll go a little bit further than that: it's a lot more than just dumping the data somewhere on S3. By publishing that data, you open it up to that community. It lowers the cost of research as well, because if you're not having to worry about network pipes and that sort of thing, and you're able to run Jupyter notebooks or write your own Python code directly against the S3 data,
05:06
it means you're running projects in a matter of minutes, and you're not having to invest in infrastructure either. So I'll talk a little bit about the public data sets that are already in our registry.
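As a taste of how little setup that takes, here's a sketch that anonymously lists part of one of those public data sets, using the Landsat 8 archive bucket as it existed at the time; the prefix is just an illustrative path/row.

```python
# A minimal sketch of working against a public data set in place on S3:
# list objects anonymously, with no account or infrastructure needed.
# The bucket reflects the Landsat public archive of the time; the prefix
# (WRS path/row) is illustrative.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", region_name="us-west-2",
                  config=Config(signature_version=UNSIGNED))

resp = s3.list_objects_v2(Bucket="landsat-pds", Prefix="c1/L8/139/045/",
                          MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```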
05:20
These are some of the major providers of those data sets. As you'd imagine, NASA publishes a lot of satellite imagery, and you can see the Hubble Space Telescope data there if you're interested in it as well. But there are some surprising organisations there which probably aren't geospatial in nature: the National Institutes of Health, for example, publish de-identified patient records into this repository.
05:43
There's financial data, and there's a stack of meteorological data sources ranging from radar, satellite imagery and global weather models to local-area models like WRF, that sort of thing. Now, in terms of how you go about finding it, there's a URL there,
06:02
open.aws, and you'll see there's a registry where you can basically just put in a search. That will give you all the information: a link to the data set, and a full page about how to access that data, which is usually just a URL. But more importantly, there's robust documentation that sits behind it.
06:24
There are tutorials, there are Jupyter notebooks, and there are GitHub links where you can access tools that make it easy to consume the data as well. So in terms of how to share data, one of the things we've learnt through this open data program is that
06:41
it's more than just dropping the data somewhere and saying, go for your lives and make use of this. There are a few things you need to do. The first is the non-technical considerations associated with sharing. Number one is trust. If someone's going to invest time in building a solution or a data processing pipeline against your data, they need to be able to trust your data.
07:01
And that means that if you've got any sort of remote sensing network with mechanical components or cameras or whatever, those things go wrong. When they go wrong, you need to be able to tell people, so they can adapt to it. Landsat's a really good example of that: they had some sensors fail on a satellite a few years ago,
07:21
and they bent over backwards to make sure that everyone knew about it and was factoring those sensor failures into their software stacks. We used to have that at MetService as well with some of the polar orbiters: over time they would break down, sensors would get irradiated and stop working or lose sensitivity. And as providers they were obliged to tell us, and we were able to incorporate that.
07:45
So that really builds trust. The second thing is exactly the same as with software: you need good documentation, use cases, tutorials, and sample code that people can get started with. And lastly, accessibility. Again, if people are going to put a lot of effort into tooling, they want to know they can get that data when they need it, particularly if they're doing operational processing.
08:05
Some of these meteorological agencies are publishing hourly observations into this repository. It's not just static data, it's live data; some data sets are being updated every minute. So if you've got consumers, or if you are a consumer of that data, you want to know that the data is always there and you can get to it.
08:22
And that it's not going to grind to a halt when lots of people are downloading it. So that little curve there is suggesting that as the trustworthiness, documentation and accessibility of your data resource improve, your user base starts to grow pretty steeply.
08:40
If the data is really useful, people will persist, get onto it and figure it out somehow. But if you want to reach a large user base, particularly busy people, you need to invest a bit of time. The second thing is the technical considerations. And this is a bit of a balancing act: you can overthink as well as underthink your data representation formats.
09:04
Underthinking is pretty obvious: it's when you just take a binary database dump or something like that, drop it onto S3 and say, go for your lives. That's no good to anyone. Some people will access it, but it's going to be very difficult for them. You can also overthink it: if you design and handcraft a really fine-tuned API for a specific use case,
09:22
you might think you're doing people a favour, and for some of the user base you are. But you'll also alienate the people in the middle, who need something a bit more flexible than a rigidly defined API but don't want to go and read binary database dumps. So ideally you're trying to cater for multiple use cases.
09:44
By all means build your API, but also make the data available in a more raw format, so people can download it, do their own manipulations, and slice and dice it in the ways that make sense for them. You can also optimise your data for some common use cases.
10:01
I'll show you an example with wind farm modelling in a couple of minutes where the provider has gone through a techno-economic analysis of the weather data: they converted wind speeds into actual shaft power outputs for the turbines, so that local agencies or power authorities could make decisions about how useful it would be
10:22
to site a wind turbine in that location. So that's moving from the meteorological domain to the economic domain; a toy version of that kind of conversion is sketched below.
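This is only a hedged illustration of the idea, not NREL's actual method: a simplified turbine power curve mapping wind speed to shaft power, with made-up cut-in, rated and cut-out speeds.

```python
# A toy version of the techno-economic conversion described above: map wind
# speed (m/s) to turbine output power (kW) with a simplified power curve.
# Cut-in, rated and cut-out speeds are illustrative, not any real turbine's.
def turbine_power_kw(v, cut_in=3.0, rated_speed=12.0, cut_out=25.0,
                     rated_kw=2000.0):
    """Piecewise power curve with a cubic ramp between cut-in and rated."""
    if v < cut_in or v >= cut_out:
        return 0.0          # below cut-in or above cut-out: no output
    if v >= rated_speed:
        return rated_kw     # rated region: output capped at nameplate
    # Partial-load region: power scales roughly with the cube of wind speed.
    frac = (v**3 - cut_in**3) / (rated_speed**3 - cut_in**3)
    return rated_kw * frac

for v in (2.0, 6.0, 12.0, 30.0):
    print(f"{v:>4} m/s -> {turbine_power_kw(v):7.1f} kW")
```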
10:41
Build a map and an index. Don't just have a directory and trust that people will understand your file naming conventions; build a searchable index. And provide a cherry-picking mechanism: if you've built an index and people can target a slice of data, or build a query that selects a subset, give them a mechanism to extract just that data using range-type selections, because it'll save them a lot of processing time and a lot of download time as well. And lastly, track usage: measure how people are using your data.
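On S3, the simplest cherry-picking mechanism is an HTTP byte-range read. A minimal sketch, with made-up bucket, key and offsets standing in for what a published index would give you:

```python
# A minimal sketch of cherry-picking: fetch only a byte range of a large
# object instead of downloading the whole file. The offsets would normally
# come from a published index; the ones here, and the bucket/key, are made up.
import boto3

s3 = boto3.client("s3")

# Suppose an index maps (variable, timestep) to byte offsets within the file.
offset, length = 1_048_576, 65_536  # hypothetical slice from that index

resp = s3.get_object(
    Bucket="open-weather-obs",
    Key="grids/2019/10/model-run.bin",
    Range=f"bytes={offset}-{offset + length - 1}",
)
chunk = resp["Body"].read()  # only ~64 KB travels over the network
print(len(chunk), "bytes retrieved")
```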
11:03
So I'll just finish up with a couple of examples from our open data repository. One of them is called the BlueDot Water Observatory, a consortium set up to analyse Sentinel-2 satellite observation data.
11:21
Their objective is to look at patterns of water consumption and the depletion of water resources throughout the world, particularly in the developing world. And there's a pretty scary map showing, well, that could be Sydney; if anyone's from Sydney, that could well be Western Sydney, with the reservoirs.
11:40
You can see there, over space and time, how those water resources are drying up. One of the things the BlueDot Observatory told us is that with this model of accessing the data on the open data repository, they can process a month of data covering 7,000 water bodies for only €6.
12:01
So it can be super cheap to do some serious science on these data sets. Another example: LIDAR data. The US Geological Survey has published a LIDAR point cloud consisting of 10 trillion points. As you might expect, it's over the US. But it's potentially got the ability to provide mechanisms for flood researchers
12:27
and agricultural researchers to do some serious science on very high resolution elevation data. So that's available on the open data repository as well. And one I'll drill down into a little bit is from NREL,
12:40
the National Renewable Energy Laboratory: the Wind Integration National Dataset, or WIND. What NREL have done is produce a weather model which has been corrected using observations, surface observations and scatterometer observations from satellites, and they've produced a two kilometre resolution grid of the US
13:03
with five minute temporal resolution, spanning seven years, at all the vertical elevations that are meaningful to commercial wind farm operators. That's around 500 terabytes of HDF5 data, so it's a reasonably chunky set of data.
13:21
Now, that's published on the open data repository, and if you've got a good pipe you can download it. But what most consumers of this data are doing (and they're basically people planning or operating wind farms) is using a thing called the WIND Toolkit. That's an open source set of tools that have been developed
13:43
specifically to consume this data, as well as reusing some other HDF5 formats. There are a couple of components there. There's PyWTK, which provides an API for accessing the HDF5 data. It's designed to be able to work in a download mode,
14:03
where you actually download the files, or in a cloud-native mode, where it's loaded up as Lambda functions in AWS that directly query the S3 data. We're seeing more and more of this, where open source projects have an option to process data using cloud-native services
14:21
as well as downloading type services. There's a tool called WinVis, which sits on top of PyWTK, and another service I'll mention in a moment called HSDS. And that image there, which should be moving, but there's no player on this machine,
14:41
would show the wind barbs moving through time on that map. It's a visualisation tool where you can zoom in on particular wind farm sites at different elevations and work out what the economic suitability of that site might be. There's another library which isn't specific to NREL,
15:01
but applies if you've got HDF5 data, and there's an awful lot of that: it's the dominant format in meteorology, and it's used by a lot of satellite formats. I imagine you've come across it as NetCDF or NetCDF4, that sort of thing. It's basically a data format that's optimised for large-scale compute jobs. It more or less presents a database, a multidimensional data grid,
15:23
represented in a single file, to a piece of code, so the code can do multidimensional array processing directly on the file.
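For instance, here's a minimal local sketch of that with h5py; the file and dataset are created on the spot so the example is self-contained.

```python
# A minimal sketch of the point above: h5py presents an HDF5 file to code as
# sliceable arrays, so you can read a subset without loading the whole file.
# The file and dataset here are created on the spot, purely for illustration.
import h5py
import numpy as np

# Build a small demo file: one year of hourly values on a 10 x 10 grid.
with h5py.File("demo.h5", "w") as f:
    f.create_dataset("windspeed", data=np.random.rand(8760, 10, 10))

with h5py.File("demo.h5", "r") as f:
    ws = f["windspeed"]          # no bulk data read yet, just metadata
    january = ws[0:744, :, :]    # only these chunks are read from disk
    print(january.shape, january.mean())
```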
15:43
There's a library called h5pyd that's been designed to optimise access to these big data sets. The way it does that, and I'll just show you a little diagram here, is that it can directly query the S3 objects. So you've got HDF5 data sitting in S3 as individual, probably multi-gigabyte files, and it does parallel queries against that data.
16:03
There's a little graphic there showing how that works. It chunks up the data sets, works out which blocks, files or partitions in the data set it needs to retrieve, goes out and gets them all from S3 in parallel, and then brings the data back and presents it to the overlying application. The benefit of that is that
16:23
this library, or rather the HSDS service behind it, can scale to hundreds of nodes if need be, working in parallel against S3. So it lets algorithms run very quickly, directly against HDF5 data, even though the data is just sitting in place on S3. No downloads required.
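Here's a minimal sketch of that with h5pyd, whose API mirrors h5py. The domain path and dataset name follow NREL's published WIND Toolkit examples, and the HSDS endpoint and credentials are assumed to be configured in ~/.hscfg; check NREL's current docs before relying on them.

```python
# A minimal sketch of reading the WIND Toolkit through HSDS with h5pyd, whose
# API mirrors h5py. Domain and dataset names follow NREL's published examples;
# the HSDS endpoint and API key are assumed to be configured in ~/.hscfg.
import h5pyd

with h5pyd.File("/nrel/wtk-us.h5", "r") as f:
    wind = f["windspeed_100m"]   # a large (time, y, x) dataset
    print(wind.shape, wind.dtype)

    # Slicing pulls back only the chunks covering this window; HSDS fetches
    # them from S3 in parallel on the server side, so nothing is downloaded
    # in full and the data stays in place.
    subset = wind[0:24, 1000:1010, 2000:2010]
    print(subset.mean())
```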
16:43
So those are a couple of use cases I've covered now. We've got a guide here (there's a URL) which has also got a lot of guidance about the sharing and consuming of data that I've mentioned. There are some additional use cases: Transport for London, the Africa Regional Data Cube, and NOAA's NEXRAD repository,
17:04
which is their composite Doppler radar for the whole continental US. I wish we had that here. That's available on the repository as well. So have a look at that guide; it's a really useful resource. And lastly, and this is usually the first slide in the deck,
17:22
but I hate leading with all sorts of cloudy economic stuff: aside from what we've been talking about with cloud infrastructure, one of the other key differences to bear in mind, and this applies to any cloud, not just AWS, is that there's no upfront expense; you pay for what you use.
17:41
Most of these science projects I've been talking about have cost literally 10 or 15 cents, in some cases a couple of dollars, to stand up and get running, and they can be provisioned in minutes. So you only pay for what you use. It improves your agility, because you don't need to provision resources or sign contracts; you can scale up or down, and the whole thing's self-service.
18:02
Thank you. Thanks Craig, really good talk. So the nearest AWS region to here is Sydney, right? So that's another win for Australia, right?
18:22
Sucked in, New Zealand. So how many New Zealand organisations are part of the public dataset program? How many organisations actually have a bucket sharing data? DOC would probably be the most prominent in that area.
18:41
DOC? DOC, the Department of Conservation. They've got ecological datasets that they're going through the onboarding process for. MetService has some, and we're talking to MetService about data as well. But DOC have also got a bunch of APIs for things like campsites and national parks. And are they part of the public dataset program?
19:01
No, they're actually conventional APIs. Cool, exciting. Questions from the audience? Alright, hold on.
19:20
Obviously a lot of those examples that you used were from North America. What do you see as the biggest roadblock in this particular region, so Australia, New Zealand, the Pacific Islands? It seems like different governments are deploying point solutions from different vendors.
19:41
How do we bring this all together? Well, to answer your first question, the answer, to be honest, is funding. The US organisations do this because they're funded to do it. NOAA, for example, is funded to publish their weather models to the world.
20:02
They've got a mandate to publish it to the world. The same with the Japan Meteorological Agency; they've got a mandate to do this. So this just provides them with a channel for people to access the data. In New Zealand, for various reasons, and because of the way most of the scientific organisations have been set up,
20:22
they don't really have access to those sorts of funding resources, and they've had to rely on commercialising a lot of their data. So we're talking to them all the time, saying, let's see what we can actually share, but it's a long journey at the moment. On the other question, about fragmentation: the examples I've shown are US-based,
20:47
but there are a lot of European organisations too; ESA, the European Space Agency, for example, has been very prolific in its use of this repository. And actually some of the Australian agencies are in the process of sharing data through this repository as well.
21:05
We've actually got quite a few agencies talking to our open data program at the moment to try and get their data sets on board. Any other questions? I have heard that an agency, Geoscience Australia, may share some data through a cloud provider.
21:29
Is that it? Cool. All right. Thank you very much. No worries.