
The Open Data Cube Sandbox


Formal Metadata

Title: The Open Data Cube Sandbox
Number of Parts: 295
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
FrontierSI have been working with the Open Data Cube Community to develop simpler ways of running the ODC, and one of these is the ‘Sandbox’ project, where we have built a pre-configured ODC instance as a reference deployment. The Sandbox aims to provide developers, data scientists, decision makers and everyone else a way to learn about remote sensing, the ODC generally, or the various applications that Earth observation data can be used in. The use of Docker, Kubernetes, the ODC and Dask enables a scalable environment that can run non-trivial workloads. The ODC Sandbox is as easy to use as visiting a website, it’s powered by vast quantities of open data, it’s available now, and the architecture is open, so you can build one yourself too. This presentation will briefly introduce the ODC project and will then discuss the Sandbox, how it is structured and what is required to get started. The presentation concludes with demonstrations of a number of use cases and applications of the ODC that are available in the Sandbox, in addition to future plans.
Transcript: English (auto-generated)
Alright folks, I think we'll get started. My name is Alex Leith and I'm the chair for this session, one of the final stream sessions of the day. Thank you all for hanging on and coming along to this late thing.
I hope you're recovered enough from last night to have a bit more space left in the brain for a few more things. I'm your first presenter as well. So, a bit of background on me. I work currently for Geoscience Australia. I work in a group called Digital Earth Australia, leading a team of developers. We do work on Earth observation data.
We also have a project working in Africa called Digital Earth Africa. Within Australia and in Africa we seek to make Earth observation data easier to use and make it accessible for government, for industry and generally easier.
Today I'm going to talk about the Open Data Cube, the Open Data Cube sandbox and a few other ways that you can use this piece of software in order to work more easily with large amounts of raster data. So, what is the Open Data Cube and what am I doing here talking to you all about it?
So there's a definition of the project here from the website. I'll read that out. The Open Data Cube seeks to increase the value and impact of global Earth observation satellite data by providing an open and freely accessible exploitation architecture. The ODC project seeks to foster a community to develop,
sustain and grow the technology and the breadth and depth of its applications for societal benefits. But we're technical people and we really want to just have the nutshell answer. And in a nutshell, the Open Data Cube is an open source Python library that facilitates working with large volumes of raster data.
So, which data? Generally, it's Earth observation data. It can work with any regularly gridded data. The trend around Earth observation data is that there's more and more satellites going up in the air and more and more data coming down to the Earth. I think there's a funny little blip there in 2018
where there weren't so many launches, but if you have a look at that, it's one of those exponential trends where there's just going to be more and more of this stuff. So we need better ways to be able to manage huge volumes of data. And more and more, when we talk about data, we talk about analysis-ready data, which is a word that means a lot of different things
to a lot of different people. But this image up here demonstrates the need for analysis-ready data. So on the left here we have a Landsat Level 1 product, and we have a Level 2, or analysis-ready, product on the right. The Level 2 data has had things like atmospheric effects removed,
which means that you can compare a single pixel over time and it's going to be consistent. What we've got is the same scale in both of these images, from minus one to one, and it's the NDWI, an index of wetness. And you can see over here that the clouds are getting highlighted as being water,
which they're not, they're clouds. And the lake really stands out in the analysis-ready data as being water. So we're going to get a better identification of water when we've got good data to be doing it on. So there's a need for good quality data, useful data.
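(For a flavour of what that wetness index looks like in code, here is a minimal sketch using the Open Data Cube's Python API; the product name, extent and band names are placeholders rather than the exact ones shown on the slide.)

    import datacube

    dc = datacube.Datacube(app="ndwi-example")

    # Hypothetical Landsat surface reflectance product and extent; substitute
    # whatever analysis-ready product is indexed in your own Data Cube.
    ds = dc.load(
        product="ls8_sr_example",
        x=(146.2, 146.4), y=(-34.1, -33.9),
        time=("2018-01-01", "2018-03-31"),
        measurements=["green", "nir"],
        output_crs="EPSG:3577", resolution=(-30, 30),
    )

    # NDWI: values near +1 suggest water, values near -1 suggest dry land.
    ndwi = (ds.green - ds.nir) / (ds.green + ds.nir)

    # Plot the first time step on the same -1 to 1 scale used on the slide.
    ndwi.isel(time=0).plot(vmin=-1, vmax=1, cmap="RdBu")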
The other side of things is usable data, and more often we're seeing that people are storing data as Cloud Optimised GeoTIFFs (COGs), which are a backwards-compatible extension to the GeoTIFF format. That means you can store a file on an object store or on a web server and just read small pieces of it
rather than having to download the whole thing onto your computer in order to use it. And this means that you can store all these vast volumes of data somewhere nice and convenient, like on the cloud. And as we all know, the cloud is somebody else's problem, right? It makes it easier. We don't need to manage loads of data storage.
Thanks, Kurt. The good news for us is that COGs are supported natively in the Open Data Cube, which means that we can stand on the shoulders of all these people that are organising large amounts of data.
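(To make the "read small pieces over the network" point concrete, here is a rough sketch using rasterio against a hypothetical COG URL; the URL and window are illustrative only.)

    import rasterio
    from rasterio.windows import Window

    # Hypothetical COG sitting on an object store or web server.
    url = "https://example-bucket.s3.amazonaws.com/landsat/scene_B04.tif"

    with rasterio.open(url) as src:
        # Read only a 512 x 512 window; rasterio fetches just the internal
        # tiles it needs via HTTP range requests rather than the whole file.
        window = Window(col_off=2048, row_off=2048, width=512, height=512)
        chunk = src.read(1, window=window)

    print(chunk.shape)  # (512, 512)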
So to get back to the original question, what is the Open Data Cube, I'm going to go through a little brief history of the Open Data Cube. So some time ago there were Landsat satellites put up and they weren't able to store data on them and so they had to push it down to the ground as soon as they got it, just about.
So in the middle of Australia we have a big dish where we received the data, and it would go through some cables to Canberra. We'd put it on some tapes and ship it off to the USGS, where it was stored in some kind of underground bunker. Eventually there was a project called Unlocking the Landsat Archive, where this data was pulled off those tapes and put into digital storage
and a piece of software was written to make it easier to work with that data, to make an index for that data. And that was called the Australian Geoscience Data Cube. Later that was rewritten so that it worked with data that wasn't just Landsat, it worked with any dataset, and it worked in other coordinate reference systems, not just the one, so it was made much more general purpose
and they came up with the really innovative name of the AGDC version 2. This was later renamed to something a little bit more exciting, which is the Open Data Cube. So that's where it comes from. To get into a bit more of the technical architecture, an installation of the Open Data Cube looks something like this.
So we have data, whether that's on a native file system, so NetCDFs or GeoTIFFs or whatever it might be, stored on a local system or on S3 or some other kind of object store. We have the Open Data Cube Python library itself, and you ask the library to index that data
and it stores a record in a Postgres database of where that data is, what the bands are, what the footprint is, that kind of thing. So then on top of that you write a number of different applications and these can be all kinds of things. There is an OGC web services application, which is quite mature and stable that you can stand up.
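(As a very rough sketch of that layering, with illustrative names only: the index tells you where the data lives and what it looks like, and the library turns it into an xarray for whatever application sits on top.)

    import datacube

    dc = datacube.Datacube(app="architecture-example")

    # The Postgres index holds a record per dataset: where the files are,
    # the measurements (bands), the footprint, the CRS, and so on.
    datasets = dc.find_datasets(product="ls8_sr_example",
                                time=("2019-01-01", "2019-02-01"))
    for d in datasets[:3]:
        print(d.uris, d.crs)

    # An application then asks the library to load that data as an xarray
    # Dataset, regridded to a common CRS and resolution.
    data = dc.load(product="ls8_sr_example",
                   x=(146.2, 146.4), y=(-34.1, -33.9),
                   time=("2019-01-01", "2019-02-01"),
                   output_crs="EPSG:3577", resolution=(-30, 30))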
Data science folks like using data science tools like Jupyter and you can just use Flask or whatever to build some kind of web application up on top of there. So in what ways can we use the Open Data Cube? So one of the first projects that I worked on
was something that we called the Cube in a Box and this is using Docker and Docker Compose to have a kind of reference local install of the Open Data Cube. So on the right-hand side we have a little video there of starting up the Cube in a Box. And for those of you that are familiar with Docker,
we're using a YAML configuration file to start up a Postgres database and a Jupyter environment which has the Open Data Cube. And you can run some scripts in there, which are all organised for you, to go and index the Landsat Level 1 data from the USGS. So you don't need to go and download the data; it's all there, it's all publicly accessible. And you have an environment with the Data Cube up and running
within a few seconds. We also have a CloudFormation template for the Cube in a Box, which means you can click the button, deploy it into your AWS account and it does the same thing. You end up with a Jupyter environment where you can use the Open Data Cube, and you can specify an extent to index, which takes minutes, and you can do it anywhere in the world.
The only downside is that the Landsat data there is Level 1 data, which means you can't do really meaningful analyses with it. So personally I'm looking forward to when the USGS publishes their analysis-ready data and we can do this kind of thing for anywhere in the world.
So another initiative we're working on is a sandbox. So within Geoscience Australia we're working on the Digital Earth Australia Sandbox, and yesterday we ran a training course using the DEA Sandbox. And this is using JupyterHub, Kubernetes and some of the sort of standard Helm templates for deploying JupyterHub,
with the Open Data Cube supported in there. So this environment has an index of all of the data that we have available through Digital Earth Australia. So you can turn up in this space and you get your own little container with Jupyter in it, and it's already got the data ready so you can start doing some data science applications
without having to install anything. So this has been great for training and for our scientists within Geoscience Australia to be exploring some of the data holdings that we have. And after that you can start doing some cool things with it. So within the OGC web services we have some visualisations here
of what's called Water Observations from Space, which is a water classifier. And here you have an annual summary of this, which shows in brighter shades of blue those areas that are always wet, and then, as it moves into the reds, those areas that are wet less of the time. And so this dataset is available for all of Australia
and we're working on building this for all of Africa as well. So we have a sort of first stage of the project over Africa where we have generated this data for about 10% of the continent for the 30 years of Landsat. Or you can use the Open Data Cube to be doing some analyses like this one.
So on the right here we have the Peppa Pig polygon which is a mine site that's being rehabilitated. And on the left we have a little area of native vegetation. And if we have a look at one of... This thing doesn't work, I keep pushing it.
So we have another product called Fractional Cover which looks at the amount of area in a pixel that is covered in either bare earth or healthy vegetation or dry vegetation. And here if we have a look at the bare earth component of that fractional cover for those two polygons, the blue area is the Peppa Pig mine site
and so it's got more bare earth than the native vegetation. And if you have a look at the trend over time, so we've got like four or five years here, we can see that the mine site is getting closer to that area of native vegetation which demonstrates that the mine site actually is being rehabilitated as it should be.
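(A hedged sketch of how that kind of trend can be pulled out with the Open Data Cube, assuming a fractional cover product with a bare-soil band; the product and band names here are placeholders.)

    import datacube

    dc = datacube.Datacube(app="fractional-cover-trend")

    # Hypothetical fractional cover product; 'bs' stands in for the
    # bare-soil percentage band.
    fc = dc.load(product="fc_example",
                 x=(147.10, 147.14), y=(-32.52, -32.48),
                 time=("2014-01-01", "2019-01-01"),
                 measurements=["bs"])

    # Average bare-soil fraction across the area (here a bounding box)
    # for each time step, giving a rehabilitation trend line.
    bare_soil_trend = fc.bs.mean(dim=["x", "y"])
    bare_soil_trend.plot()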
I have another example here using Sentinel-1 synthetic aperture radar data. And this is an example of using Dask to do an analysis. Now the visualisation is pretty cool, but it's not really that useful. But over here are ships kind of floating around in the water,
parked off a place called Newcastle in New South Wales in Australia. And what we've done here is get an entire year of Sentinel-1. So we've got two and a half thousand by three thousand pixels and 85 time steps. And through those we're getting the maximum brightness of those pixels. And in doing that you get a glimpse of the ships.
Dask is a distributed computing environment. So the processing for this is done over a few nodes on AWS, and it runs over all of these pixels but only returns the one maximum value, which means that you're only using a really small amount of memory
in your actual Jupyter environment, and all the heavy lifting, heavy processing and data handling is done off in the cluster, which means you can do some pretty cool stuff. And a lot of what we do with this kind of work is we get data and we turn it into information. So here we've got a comparison between a Landsat scene here
and what is our water classifier on the other side. I quite like it, it's like an 8-bit version of the world. You've got your clouds and your cloud shadows and areas identified as water and areas identified as land. But really it demonstrates that you can now say that this is water, this is probably water, this is probably not water.
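(Here is a minimal sketch of the Dask-backed maximum-over-time computation described for the Sentinel-1 ship example, assuming a Dask cluster is already running and using placeholder product and band names.)

    import datacube
    from dask.distributed import Client

    # Connect to an existing Dask scheduler (e.g. one deployed on AWS).
    client = Client("tcp://dask-scheduler:8786")

    dc = datacube.Datacube(app="s1-max-brightness")

    # Lazy, Dask-chunked load of a year of Sentinel-1 backscatter
    # (placeholder product/band names).
    s1 = dc.load(product="s1_gamma0_example",
                 x=(151.75, 151.85), y=(-32.95, -32.85),
                 time=("2018-01-01", "2019-01-01"),
                 measurements=["vv"],
                 dask_chunks={"time": 1, "x": 1000, "y": 1000})

    # The reduction runs on the cluster; only the single 2D maximum-brightness
    # composite comes back into the notebook's memory.
    max_brightness = s1.vv.max(dim="time").compute()
    max_brightness.plot(robust=True)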
And whilst a human can do it in the other one, it's not that useful. So the other priority that we're working on in our team is improving the developer experience around the Open Data Cube. And this is a concept that I've found somewhere on the web sometime.
And the idea is that you're trying to make onboarding into a piece of technology easier. And so things like the cube in a box where you can deploy this on your laptop and get started without having to ramp up the learning curve so heavily. And then once you've got it running, you can open it up and see how it's been implemented and have a look at the scripts that tie it together and have a look at the infrastructure that's needed.
And that way you've got a bit of a flatter learning curve. And the Open Data Cube is a collaborative project. We've got some pretty good interaction going on in GitHub. We have people from NASA, from the Satellite Applications Catapult in the UK, from CSIRO,
Australia's federal research organisation, and from our organisation, Geoscience Australia. And more and more there's people, there's quite a few folks here. We heard Steve Ramage talk about the Open Data Cube and GEO adopting it as one of their technology platforms. And Digital Earth Africa.
So I'm going to briefly mention some of the work that we've done recently for Digital Earth Africa where we use the Open Data Cube and some related technologies, firstly using Kubernetes, which is fantastic fun if you're one of the DevOps kind of folks, to process about 60,000 scenes, 30 terabytes, about 10% of Africa. And using Spot Instances on EC2 it cost us about $150, took about 12 hours.
And then what we do is we do derivative products based on that. So we get a listing of these 60,000 scenes and we can run the water classifier. And so we use the Open Data Cube to do the load of this data to run the classification, to write out a GeoTIFF and then we push that back up to S3 where we index it back into the Open Data Cube.
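(In sketch form, that write-out-and-reuse loop might look something like the following; the classifier, bucket and product names are placeholders, and the final re-indexing is a separate step.)

    import boto3
    import datacube
    from datacube.utils.cog import write_cog

    dc = datacube.Datacube(app="derivative-product")

    # Load one scene's worth of data (placeholder product/extent).
    ds = dc.load(product="ls8_sr_example",
                 x=(30.0, 30.2), y=(-1.2, -1.0),
                 time=("2018-06-01", "2018-06-30"))

    # Run a water classifier over it; `classify_water` is a stand-in for
    # whatever classification function is actually used.
    water = classify_water(ds)

    # Write the result out as a Cloud Optimised GeoTIFF...
    write_cog(water, "water_20180615.tif", overwrite=True)

    # ...push it back up to S3...
    boto3.client("s3").upload_file("water_20180615.tif",
                                   "example-bucket",
                                   "wofs/water_20180615.tif")

    # ...and then index it back into the Open Data Cube with the usual
    # dataset-document indexing step (not shown here).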
And so we're really completing the circle of doing these large-scale analyses using this tool. We're about to work on some continental data products in a similar way. So we're going to get some ALOS data, which is synthetic aperture radar, and bring that together so we can start looking at things like forest cover and the change in that.
We're hopefully getting a preliminary run of the USGS Collection 2 data, which is a big deal and will be coming later this year. And hopefully we'll process all of that and do the same kind of visualisation,
or product generation and visualisations, on top of that. And we're looking to do Sentinel-1 and Sentinel-2 for all of Africa as well. And the way that people access that, really, with this model, is that we provide these data as Cloud Optimised GeoTIFFs, with the best intentions to have STAC metadata next to them; in any case there'll be some form of standard metadata.
So all the data will be publicly accessible. Then we run a level of services on top of that, so that'll be the data science environment with JupyterHub, and also some OGC web services so that you can visualise this kind of work. It's a pretty exciting project, it's a pretty big scale. I think I've been working with some reasonably large data in the past,
but I think that I'm getting to the point now that we can say that we're working with big data. I don't know, whenever you make that claim, someone else comes along and says they've got something that's orders of magnitude bigger. But it feels pretty big, it's inconvenient. So, in conclusion, I think that we're trying to flatten the learning curve here with the Open Data Cube,
trying to make it more accessible and easier to get started. We've been writing infrastructure-as-code templates, which means the architecture is documented, and so the deployments that we have, you can pick up and use. We've got these sandbox environments where users can just turn up and start doing data science and don't even have to worry about the infrastructure.
Developers and implementers can open up those infrastructure-as-code templates, have a look at the Cube in a Box and see how it works, pull all the pieces apart and start putting them back together. And the templating means that we can replicate this stuff: we can build something for Australia that we build for Africa, that we maybe build for somewhere else, and it sort of gets stamped out. It gets easier, at least.
So, that's it for me. I've finished a bit early, but I'm happy to answer any questions that you have. Thank you very much.
I have a question on the overhead, if there is any, and how big it is, of preparing a Data Cube. So, suppose you have tiled data in the original format and you want to be able to get a Data Cube from it.
What are the steps to be taken, and how long does it take? Sure, that's a really good question. So, the old model of the Open Data Cube was that you would find all the data that you want in the Data Cube and do something that we call ingesting the data. So, you transform it into cubes, NetCDF files, and then you'd index that.
And so, when you use the Open Data Cube, you're getting the NetCDF files. But more and more, what we're doing is just indexing data in place. And so, if it's a data format that, say, rasterio can read, we have a Python script that does the indexing. So, you get the scene and you add that record into the Postgres database
and then the file just stays in place. And so, really, what we aim to do for things like Digital Earth Africa, where we store all the data on S3 in a standard format which anybody can use, is that we can point the Esri folks, or, I don't know, any organisation, at it and they can use it. And we consume that data in place. And so, it's kind of like dogfooding.
I don't know if you know that term. You know, we want to have an opinionated front door but let people go in the back door. But the answer to your question, though, is that indexing data in place without ingesting it is, I think, a stronger model. It certainly works better with vast scales of data.
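(For reference, the index-in-place pattern described in that answer usually looks something like this sketch, where a small dataset document pointing at the files on S3 is resolved and added to the Postgres index; the document path and URI here are placeholders.)

    import json
    import datacube
    from datacube.index.hl import Doc2Dataset

    dc = datacube.Datacube(app="index-in-place")

    # A dataset document describing a scene that stays where it is (on S3);
    # the file name and URI are placeholders.
    with open("LC08_example_dataset.json") as f:
        doc = json.load(f)
    uri = "s3://example-bucket/landsat/LC08_example/"

    # Resolve the document against the products already known to the index,
    # then add the record; the underlying files are never copied or converted.
    resolver = Doc2Dataset(dc.index)
    dataset, err = resolver(doc, uri)
    if err is not None:
        raise RuntimeError(err)
    dc.index.datasets.add(dataset)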
Thank you. Question: how can you get access to the Open Data Cube? I guess it's hosted by you and you can sign up to it or request access. How does that work? So, the Open Data Cube is more a software project, or a family of software projects.
And so, you can get the code and build your own whenever you want. We have an implementation on our supercomputer, the NCI, in Australia. And to get access to that, you kind of need to be government or research, but you can get access as a private organisation. But then, what I'm talking about here with the Digital Earth Australia Sandbox,
that's actually public facing. And so, you can just go turn up at that website, register and start using it. And if you'd like to do that, get in touch with me and I'll... That's what I meant, yeah. Not the Open DataCube Sandbox. Okay, the DEA Sandbox. Yeah, sure. Yeah, get in touch after and I'll show you. I can send you a link.
Actually, I can send you the training manual that we ran through yesterday. So, you can pick that up and have a go. Hi, thank you. Very impressive, I have to say. My question is about the STAC support.
So, you mentioned on one slide the STAC support. So, I wanted to know, is this like a standard feature of the Open Data Cube, and is it a static one or is it dynamic? So, SpatioTemporal Asset Catalogue support. So, there's no reason why it can't be supported.
We have... So, with the Open Data Cube, we do an indexing process. And in Geoscience Australia, we have similar metadata to STAC, but it's in a YAML format. And so, it's very easy to pick up an existing indexing script and point it at files on S3, YAML files on S3.
Now, we could and should rewrite that to be able to just pick up a STAC file, or point it at the root object in a STAC catalogue and climb the tree and index that. There's not necessarily the scripting written yet. The other side is that we have all this data in the Postgres index,
and we could build a REST API that serves out STAC too. But it hasn't been done yet. Great question. Hi. Great work. Do you have... Sorry.
The... I'm trying to understand how this relates to the OGC web standards or the kind of ongoing efforts for those. Do you have kind of a sense from your use history and user experience history how you'd like to see that work?
Are you talking about future standards or current standards? Yeah, no, future standards here. I don't think I can comment on that, sorry. So, we implement the current standards, so WMS, WMTS and WCS. But yeah, I'm not sure further than that, sorry.
Is that the alert thing? It's the alert thing, yeah. Yeah, weird. It'd be worse. Yeah. Any other questions?
No? Well, thank you very much. We have to... You can applaud.