An overview of Cloud-Native Geospatial
Formal Metadata
Title |
| |
Series Title | ||
Number of Parts | 266 | |
Author | ||
License | CC Attribution 3.0 Germany: You may use, modify, and reproduce the work or its content for any legal purpose, and distribute it and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner they have specified. | |
Identifiers | 10.5446/66349 (DOI) | |
Publisher | ||
Year of Publication | ||
Language |
Content Metadata
Subject Area | ||
Genre | ||
Abstract |
|
Transcript: English (automatically generated)
00:09
So, this is a good birthday present to be all done. So, I will be talking about Cloud Native Geospatial. Some of this is sort of intro, high level. And it's not just an overview, it's a current state of Cloud Native Geo.
00:24
So, what exactly is Cloud Native Geospatial? We hear the term a lot. And Google Earth Engine is what it is. So, back in 2010, Google came out with Earth Engine. And they ingested a whole lot of data; there's a whole bunch of remote sensing data and geospatial data on it.
00:42
And it's in a browser: you can put code in, you can visualize this, and really operate on things at a planetary scale. It's fantastic. Worked really well. It's still a service that Google provides. And it really showed what could be done
01:01
with processing of big data in the cloud. It's got a few different components. There's a unified data discovery in Google Earth Engine. So, you can search for data across multiple different data sets. There's an orchestration pipeline, presumably, somewhere in all of that that they use to ingest data and process it.
01:22
There's scalable file access. We don't really know how that works on Google Earth Engine. There's analytics that are scalable to the planetary scale, and you can visualize it. So, these are all the components that go into Google Earth Engine. And this is really what makes up cloud native geospatial.
01:41
We have these complete workflows, they're performed in the cloud from data discovery, processing, analytics, visualization. There's some sort of direct data access to efficiently get to the stored data. And through this system, you can scale this. It's easy to reproduce analysis because you don't have to worry about downloading the data
02:01
or not having access to the data. And it's a way you can publish results online. But the issue with Google Earth Engine is that it's not open source. It's not based on any sort of standards. You can't deploy Google Earth Engine into your own account. You can't really easily integrate with commercial data providers
02:22
or build things on top of it. So, cloud native geospatial is really a collection of APIs and technologies for the programmatic access and exploitation of geospatial data in the cloud. So now, enter standards and open source. This was 2010.
02:40
It took some time to catch up, but over the years, there wasn't necessarily a single concerted effort to reproduce what Google Earth Engine did, but that's essentially where we got to, by replicating basically each one of these components in the open source world. So, first, unified data discovery.
03:03
Google Earth Engine had some sort of metadata model. There was no real standard around it. STAC rose up. Matthias gave a talk this morning; you could watch the recording for an overview of STAC. So, there's STAC, and that's based on GeoJSON
03:20
and there's also OGC API - Features. So, these things together really represent APIs for serving up raster and vector data. And I wanted to bring up this question because I get it a lot: should I use STAC or should I use a features API for my data? So, for STAC items, it's GeoJSON
03:42
and the STAC item defines the region where there's data, and that region could have some additional properties that are in the STAC metadata, but the data's in some sort of separate file in what we call assets. So, here in the illustration, I'm looking at footprints of different scenes
04:00
and I'm rendering one of them, but that's actually coming from a different file. Well, with OGC API - Features, it's also GeoJSON that describes regions of uniform properties, but there are no actual separate assets. And so, the question you should ask yourself when you have data and you think that maybe it fits in STAC, because people do try and fit things in STAC
04:22
that maybe don't belong there, is this: if your geometry is the data itself, then use OGC API - Features. If your geometry is really not the data and it's really just metadata about the data, about where it's located, that's when you use STAC. So, like training data labels for machine learning algorithms,
04:41
let's say, that's not really STAC. We went down that road. If you go to the STAC extensions, you'll see there's a label extension, and people have done that, and that is something that you can do, but I think that's a good example of something that actually does belong in a features API. So now, data orchestration.
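The distinction drawn above between a STAC item and a plain feature can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the identifiers, coordinates, URL, and the helper name `suggested_api` are all hypothetical, and the rule it encodes is just the rule of thumb from the talk, not part of either specification.

```python
# Hypothetical STAC item: the geometry is only a footprint, and the pixels
# live in a separate file referenced under "assets".
stac_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-scene-001",  # hypothetical identifier
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[10.0, 50.0], [11.0, 50.0], [11.0, 51.0],
                         [10.0, 51.0], [10.0, 50.0]]],
    },
    "bbox": [10.0, 50.0, 11.0, 51.0],
    "properties": {"datetime": "2023-06-28T10:00:00Z"},
    "links": [],
    "assets": {
        "visual": {
            "href": "https://example.com/data/example-scene-001.tif",  # hypothetical URL
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        }
    },
}

# Hypothetical training label: the geometry and its properties are the data,
# and there is no separate asset to point at.
training_label = {
    "type": "Feature",
    "id": "label-42",
    "geometry": {"type": "Point", "coordinates": [10.5, 50.5]},
    "properties": {"class": "building"},
}


def suggested_api(feature: dict) -> str:
    """Rule of thumb from the talk: separate assets suggest STAC,
    otherwise the geometry is the data and a features API fits better."""
    return "STAC" if feature.get("assets") else "OGC API - Features"


print(suggested_api(stac_item))
print(suggested_api(training_label))
```

In practice the decision is a judgment about what the geometry means, so a check on `assets` like this one is a reminder of the question to ask rather than a rule to automate.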
05:00
Now, there really isn't a single solution here for any of this. At E84, we do a lot of work with small satellite companies and help them build these orchestration pipelines. I think there's a lot of different solutions out there. You're probably all familiar with things like Airflow and Prefect
05:21
and there are generalized orchestration platforms that people will use for geospatial. The Planetary Computer uses something called pctasks for the whole ingest pipeline when they ingest new data, but it's really specific to the Planetary Computer. It is open source; the link's in the slide.
05:42
At E84, we use something called Cirrus, which is also open source. It's completely based on AWS. It uses AWS Step Functions and DynamoDB and managed services. And it's what we use for Earth Search, our STAC API, for indexing geospatial data on AWS. There's another open source platform
06:01
that we're working on now. All the development is happening completely out in the open on our GitHub. It's called SWOOP, which stands for STAC Workflow Open Orchestration Platform. And it'll be exposing an API based on OGC API - Processes. So there are options. And I know, having talked with some people here, there are other options.
06:21
People often have their own implementations of orchestration, but there are lots of options out there for doing this at a high scale. Now, I just wanna take a digression here: I wanna implore people to stop thinking in terms of files. This is related to the workflow issue. So, I used to work with a lot of scientists
06:42
and this here is your typical working directory of a scientist, right? It's like they vomited on a page. There's no understanding of what actually the final version of a product is. Like this might've been used to publish a scientific paper. Now look at it. Which one is the final version?
07:01
Is it final version class or is it final version final because final's in there twice? Like you don't know. So if the scientist, like something happens to him, quits in a rage or an accident happens, who's gonna actually be able to replicate this data because they think in terms of input of a file
07:23
and output of a file when they're processing. Instead, think of workflows: cloud-native geospatial workflows are this idea of metadata in and metadata out, or STAC in and STAC out, rather than writing an algorithm or a workflow that takes in a file name and then generates one or more files with names
07:42
where often people use file naming for metadata. Terrible idea. Don't try and embed date times into your file name thinking that that's gonna be persistent. Instead, write workflows taking in metadata and you perhaps modify that metadata,
08:00
perhaps you add assets, perhaps you're generating new STAC items altogether, and that's what you return. And if you're generating assets, you can put those in the cloud and you put the location of those things in the metadata that you return. This is a great way to do things like track data provenance. In the STAC metadata, we can put a "derived from" field here
08:20
that says that this item was derived from another item, and you have a link. You can put in version information for the processing that you use. It's really powerful to think in terms of metadata: it becomes your atomic unit of processing rather than the data files.
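The "metadata in, metadata out" pattern described above might look roughly like the following sketch. It is not taken from Cirrus, SWOOP, or any other tool mentioned in the talk: the function name `process_item`, the asset key, the URLs, and the software name and version are all hypothetical. The `derived_from` link relation and the `processing:software` field are used here on the assumption that they match the STAC best practices and the STAC processing extension.

```python
import copy


def process_item(item: dict, source_href: str, output_href: str) -> dict:
    """Take a STAC item in and return a new STAC item out.

    The actual pixel processing is omitted; the point is that the output
    location and provenance travel in the metadata, not in a file name.
    """
    out = copy.deepcopy(item)  # never mutate the input metadata
    out["id"] = f"{item['id']}-derived"
    # Record where the newly written data lives as an asset.
    out["assets"] = {
        "data": {
            "href": output_href,
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        }
    }
    # Provenance: link back to the item this one was derived from.
    out["links"] = [
        {"rel": "derived_from", "href": source_href, "type": "application/geo+json"}
    ]
    # Version information for the processing (hypothetical software name).
    out.setdefault("properties", {})["processing:software"] = {
        "hypothetical-workflow": "0.1.0"
    }
    return out


source = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "scene-001",
    "geometry": None,
    "properties": {"datetime": "2023-06-28T10:00:00Z"},
    "links": [],
    "assets": {},
}

result = process_item(
    source,
    source_href="https://example.com/items/scene-001.json",   # hypothetical
    output_href="s3://example-bucket/derived/scene-001.tif",  # hypothetical
)
print(result["id"], result["links"][0]["rel"])
```

Because the function only consumes and returns items, the acquisition date and the lineage stay in the metadata where they can be queried, instead of being encoded in a file name.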
08:40
So now we get to scalable file access, and we've seen lots of development here over the last several years. I'm gonna talk briefly about four different formats. We've got Cloud Optimized GeoTIFF, which a lot of people are familiar with. Zarr and the new GeoZarr effort.
09:01
COPC, "copick" or "copsy", depending on how you like to pronounce it, and GeoParquet. So these are all sort of for different use cases, vectors, point clouds. The use of COG versus Zarr isn't as straightforward as using some of these other file formats. But first let's just talk briefly
09:20
about what a Cloud Optimized GeoTIFF is, because the concepts here really translate well to everything else. There's essentially three things that make up a COG. One is that you have a table of contents; really, in any file format you have a table of contents. And the table of contents is either at the beginning of the file or maybe it's a sidecar file
09:42
in the case of some other format, but it's easy to get to, okay? You can access this and you can tell where all of the different pieces are within your file, all the different chunks, these tiled segments. And we also have overviews. So easy to read table of contents, tiles,
10:01
internal tiling and overviews are really the three things that make up most Cloud Optimized formats. So to the left here, we see a standard TIFF file. By default, things are in stripes like that. And that's really inefficient because each one of those stripes is compressed individually.
10:22
This is why you can't read just like a few pixels out of a single stripe. You have to read the entire thing and decompress it. And then you can pull out the pixels that you want. So in this case, that's our AOI. We have to read in both of those stripes. So the better way to do it is tiles,
10:40
like things that are the tile size. There's no easy answer here either about what that tile size is, but the idea is that it's more closely aligned with the actual areas of interest that you might have or with how you might divide up an image in order to process it with multiple workers.
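The cost difference between strips and tiles described above can be made concrete with a small calculation. This is a sketch under simplifying assumptions: the image size, strip height, tile size, and area of interest are made-up numbers, every strip and tile is assumed to be full size, and the function names are hypothetical. The last helper shows how an offset and byte count from a table of contents turn into an HTTP range request for a single tile.

```python
def strips_read(window, rows_per_strip, image_width):
    """Return (strip count, pixels decompressed) for a pixel window.

    window is (col_off, row_off, width, height). Every strip spans the
    full image width, so a narrow window still pays for whole rows.
    """
    col_off, row_off, width, height = window
    first = row_off // rows_per_strip
    last = (row_off + height - 1) // rows_per_strip
    count = last - first + 1
    return count, count * rows_per_strip * image_width


def tiles_read(window, tile_size):
    """Return (tile count, pixels decompressed) for the same window
    when the image is stored as square internal tiles."""
    col_off, row_off, width, height = window
    cols = (col_off + width - 1) // tile_size - col_off // tile_size + 1
    rows = (row_off + height - 1) // tile_size - row_off // tile_size + 1
    count = cols * rows
    return count, count * tile_size * tile_size


def range_header(offset, byte_count):
    """HTTP Range header value for one chunk listed in the table of
    contents; the end byte of a range is inclusive."""
    return f"bytes={offset}-{offset + byte_count - 1}"


# Hypothetical 300 x 300 pixel area of interest in a 10240-pixel-wide image.
aoi = (1000, 1000, 300, 300)
print(strips_read(aoi, rows_per_strip=512, image_width=10240))
print(tiles_read(aoi, tile_size=512))
print(range_header(4096, 1000))
```

With these made-up numbers the tiled layout touches four tiles, while the stripped layout has to decompress two full-width strips, roughly ten times as many pixels for the same area of interest.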
11:01
So a cloud native data set is really one with small addressable chunks, okay? This is the key piece here. Those could be files, they could be internal tiles, or some combination of them. So when we talk about Zarr, Zarr works in very much the same way. It's chunked, compressed, multidimensional data,
11:22
really similar to how HDF5 stores data, how that data model works, but it's an exploded file format. And so rather than having all of the tiles in a single file, all of those tiles are actually individual files on blob storage. And so with a Zarr data set, rather than one file that has all these tiles,
11:41
you just have a bunch of files, one per tile. And this is great because now you can read it very easily in a similar fashion with multiple workers. A drawback here, and I don't have a bullet for this, is that in cloud storage accessing each blob in blob storage
12:00
is an API call. And so it does become very difficult to move that data or to delete that data without making a ton of API calls. So one should think twice about really adopting Zarr in that way, if you're gonna wanna move the data around. The Zarr developers are aware of this
12:21
and there's a change possibly in the works to combine these tiles, actually putting them back in a file, which kind of brings us back to more like the HDF model. So that's why I said it could be a combination of both of those things.
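To get a feel for how many objects an "exploded" layout produces, the sketch below enumerates chunk keys for a made-up array. It assumes the dot-separated chunk key convention of the Zarr version 2 storage layout (for example `0.0.0`); the array shape, the chunk shape, and the function name `chunk_keys` are hypothetical, and no Zarr library is involved.

```python
from itertools import product
from math import ceil


def chunk_keys(shape, chunks, separator="."):
    """List one key per chunk, assuming each chunk is stored as its own object.

    shape and chunks are tuples of the same length; the number of chunks
    along each axis is rounded up so partial edge chunks are counted.
    """
    grid = [ceil(size / chunk) for size, chunk in zip(shape, chunks)]
    return [
        separator.join(str(i) for i in index)
        for index in product(*(range(n) for n in grid))
    ]


# Hypothetical daily global grid: (time, lat, lon) with one day per chunk.
keys = chunk_keys(shape=(365, 1800, 3600), chunks=(1, 900, 900))
print(len(keys), keys[0], keys[-1])
```

Each key is a separate object on blob storage, so copying or deleting this one small variable already involves thousands of objects, and a real data set with many variables and years multiplies that. Combining several chunks into one object, as mentioned above, reduces the object count at the price of needing a table of contents inside each object.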
12:42
Recently, there was an effort for GeoZarr. Now, the issue with Zarr: Zarr came out of climatology work, really, where climatologists are dealing with homogeneous global model data.
13:01
Like that was the primary use case, where you'd have lots of variables. And in those cases, you have these lat/long arrays. You don't have what you have in the GDAL model, if you're familiar with it, where you have an affine transformation and a projection; that doesn't exist in Zarr. It's just a lat/long array.
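The contrast between the two georeferencing styles can be sketched as follows. The six-coefficient geotransform follows the GDAL convention (origin, pixel size, and rotation terms); the grid itself is a made-up 0.25-degree global grid and the function name `pixel_to_coords` is hypothetical. The explicit coordinate lists stand in for the lat/long arrays that a climate-style data set stores alongside the data.

```python
def pixel_to_coords(gt, col, row):
    """Apply a GDAL-style affine geotransform to a pixel position.

    gt = (x_origin, pixel_width, row_rotation,
          y_origin, column_rotation, pixel_height)
    """
    x = gt[0] + col * gt[1] + row * gt[2]
    y = gt[3] + col * gt[4] + row * gt[5]
    return x, y


# Hypothetical north-up global grid at 0.25 degrees: six numbers in total.
geotransform = (-180.0, 0.25, 0.0, 90.0, 0.0, -0.25)
width, height = 1440, 720

# The coordinate-array style stores a value for every column and row
# (pixel centres here) instead of the six coefficients.
lons = [pixel_to_coords(geotransform, col + 0.5, 0.5)[0] for col in range(width)]
lats = [pixel_to_coords(geotransform, 0.5, row + 0.5)[1] for row in range(height)]

print(len(geotransform), len(lons) + len(lats))
print(lons[0], lons[-1], lats[0], lats[-1])
```

For a regular grid the two carry the same information, but the affine form also works in a projected coordinate system when paired with a projection definition, which is the part the coordinate arrays alone do not express.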
13:21
So you could kind of think of it as essentially having a GCP for every pixel. And so there clearly was a need for having some sort of geospatial awareness within Zarr. And so the GeoZarr effort recently took off, which is based on using CF conventions,
13:43
that's Climate and Forecast conventions, which are common in the climatology community. And just recently a working group has been started through OGC for driving GeoZarr forward. So in the future, we hope that GeoZarr will enable
14:04
something more like the GDAL data model as well for data that has been projected. COPC, so COPC, Cloud Optimized Point Cloud, is the same type of cloud-optimized format,
14:23
except for point clouds. I got these slides from Howard Butler and I forgot to put Michael Smith's talk on here. I think it's later this week. So if you have any questions, talk to him; don't talk to me about COPC. The big announcement here is that there's a point cloud feature in QGIS now.
14:44
And so QGIS is really now the way to visualize point clouds. And this is a super cool feature from what I've seen about it. Lastly is Parquet. So GeoParquet, you might have heard the term GeoParquet,
15:02
and the hope here is that GeoParquet isn't actually really another standard. It's just that the way that geospatial data is encoded within Parquet files just becomes another data type. And internally, the GeoParquet standard is fairly simple.
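To illustrate what "just another data type" can mean, the sketch below encodes a point as well-known binary (WKB), which is the kind of byte string a geometry column holds, and builds a small JSON document of the sort stored under a `geo` key in the file metadata. It does not write a Parquet file, and the metadata field names (`version`, `primary_column`, `columns`, `encoding`, `geometry_types`) reflect my reading of the 1.0 beta specification, so treat them as assumptions to check against geoparquet.org. The function names are hypothetical.

```python
import json
import struct


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2-D point as little-endian WKB:
    1 byte order flag, a 4-byte geometry type (1 = Point), two 8-byte doubles."""
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(wkb: bytes):
    """Decode the WKB produced by point_to_wkb back into (x, y)."""
    byte_order, geometry_type, x, y = struct.unpack("<BIdd", wkb)
    if byte_order != 1 or geometry_type != 1:
        raise ValueError("this sketch only handles little-endian WKB points")
    return x, y


# Assumed shape of the file-level "geo" metadata (see lead-in).
geo_metadata = json.dumps(
    {
        "version": "1.0.0-beta.1",
        "primary_column": "geometry",
        "columns": {
            "geometry": {"encoding": "WKB", "geometry_types": ["Point"]}
        },
    }
)

wkb = point_to_wkb(13.4, 52.5)
print(len(wkb), wkb_to_point(wkb))
```

A reader that knows nothing about geospatial data still sees an ordinary binary column, while a geospatial reader uses the metadata to find and decode the geometry.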
15:22
And it is on a 1.0 beta version now, and you can read more about it at geoparquet.org there. There's also cloud optimized tarballs. We've seen a library called kerchunk that allows you to essentially make cloud optimized files out of really any legacy format,
15:41
because all you need is that table of contents for those internal tiles. So you could do that with something like a tarball. Scalable analytics. So in Google Earth Engine, I said there's a way to scale up, and it's a bit of a black box. Well, in the open source world, the Pangeo effort came about
16:01
and promoted a similar way to do things, except using the Python data science ecosystem. So we can do similar things to what we do in Google Earth Engine using Dask and xarray, and interacting with those in a thin client through something like a Jupyter notebook.
16:21
And on that same note, we've got Open Data Cube; odc-stac enables exactly this, but links to STAC. So you can search for a bunch of STAC items, load them up in odc-stac, get an xarray, and then process that data at scale. For visualization, also a whole bunch of great work,
16:42
largely done by Vincent over here in dynamic tiling. I mean, there's been lots of work by other folks as well, but he's sitting right here, so I'm gonna point at him. So scalable visualization in order to hit lots of files
17:01
at the same time that are on blob storage and serve up the tiles. So what is cloud native geospatial? It is the Planetary Computer. The Planetary Computer is essentially Google Earth Engine except built on top of open source components, all these open source components that I just talked about.
17:22
It's also, because I work for a company, we have something called FilmDrop, and FilmDrop is what we use. It's completely open source. We reuse it for different clients, and it's essentially a way for folks to have their own Google Earth Engine, their own Planetary Computer, in their own account that they can then build on top of.
17:42
Lastly, I wanna mention the Cloud Native Geospatial Foundation, which is a new initiative from Radiant Earth to support all things cloud native geo. Michelle Roby will be giving a talk, sort of related talk on some of the stuff that Radiant Earth is doing tomorrow. And you'll see that there's a website here,
18:02
cloudnativegeo.org, which I implore you to go to and fill out the survey that is there. And that is it. It looks like I'm right on time. Thank you.