
Architecture of OGC Services Deployment on Kubernetes Cluster based on CREODIAS Cloud Computing Platform


Formal Metadata

Title
Architecture of OGC Services Deployment on Kubernetes Cluster based on CREODIAS Cloud Computing Platform
Title of Series
Number of Parts
156
Author
Contributors
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
The Copernicus Data Space Ecosystem provides open access to the petabyte-scale EO data repository and to a wide range of tools and services, limited by some predefined quotas. For users who would like to develop commercial services, or who need larger quotas or unlimited access to services, the CREODIAS platform is the solution. In this study an example of such a (pre)commercial service is presented, which publishes Copernicus Sentinel-1 and Sentinel-2 products (and selected assets) in the form of a WMS (Web Map Service) and WCS (Web Coverage Service). The architecture of the services, based on a Kubernetes cluster, allows horizontal scaling of a service with the number of user requests. The WMS/WCS services presented combine data discovery, access, (pre-)processing, publishing (rendering) and dissemination capabilities within a single RESTful (Representational State Transfer) query. This gives a user great flexibility in terms of on-the-fly data extraction over a specific AOI (Area Of Interest), mosaicking, reprojection, simple band processing (cloud masking, normalized difference vegetation index) and rendering. The performance of the Copernicus Data Space Ecosystem and the CREODIAS platform, combined with efficient software (Postgres 16 with the PostGIS extension, MapServer with a GDAL backend), makes it possible to achieve WMS/WCS service response times below 1 second on average. This, in turn, gives potential for massive parallelization of the computations given the horizontal scaling of the Kubernetes cluster. The work demonstrates the capabilities of European data processed using open software deployed on a European cloud-based ecosystem in the form of the CDSE.
Keywords
Transcript: English (auto-generated)
So hello everyone, my name is Marcin, this is Michał, and today we will talk to you, on behalf of the entire consortium, about the architecture of OGC services deployment on a Kubernetes cluster based on the CREODIAS cloud computing platform. A few words about us: we work at CloudFerro, in the data science department, and as you can see we are die-hard users of FOSS4G. Because of the nature of our company, which serves as the storage for the Copernicus programme, we focus mainly on rasters, and we are trying to bring this amazing FOSS4G software to our repository and our infrastructure.
The ecosystem idea, in my personal opinion, is a kind of system of connected vessels: on ESA's initiative, the biggest players in spatial data gathered and created complementary solutions. So there is not one solution, there are many, and the user can select whichever fits them best. It's amazing, and as you can see, there is CloudFerro as an infrastructure and storage provider, there is Sinergise as a provider of processing services, there is T-Systems, which also provides cloud infrastructure, and many more.
In this work the ecosystem has four roles. First of all, the cloud environment — you've got to do these things somewhere. Another aspect is the S3-mounted data repository: it now holds almost 80 petabytes of data in immediate access, with various types of access for both commercial and non-commercial users. For example, if you are a non-commercial user, you can download the data or access it through the API, for instance using a Jupyter notebook or various graphical interfaces. If you are a commercial user, the data is already on your machine, mounted via S3: after starting the machine there is a folder called eodata, and within it there are almost 80 petabytes of data. And if we are discussing a volume of this size, we've got to have dedicated tools that enable efficient data discovery and access — these are the APIs and the graphical user interfaces in front of those APIs.
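As a minimal sketch of what such a discovery-API call can look like, here is a Python query against the CDSE STAC catalogue using the requests library; the endpoint URL, collection name and search parameters are assumptions for illustration and should be checked against the current CDSE documentation.

    import requests

    # Assumed CDSE STAC search endpoint -- verify against the documentation.
    STAC_SEARCH = "https://catalogue.dataspace.copernicus.eu/stac/search"

    payload = {
        "collections": ["SENTINEL-2"],                        # assumed collection id
        "bbox": [26.6, 58.3, 26.8, 58.4],                     # lon/lat box around Tartu
        "datetime": "2024-06-01T00:00:00Z/2024-06-07T23:59:59Z",
        "limit": 10,
    }

    resp = requests.post(STAC_SEARCH, json=payload, timeout=30)
    resp.raise_for_status()

    for item in resp.json().get("features", []):
        # Each STAC item carries the product id, acquisition time and asset paths.
        print(item["id"], item["properties"].get("datetime"))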
We also have to do something with the data, so we provide our users with tools in the form of, for example, openEO, Jupyter notebooks, or open-source FOSS4G software on their machines, so they can process the data.
This is an example of running the infrastructure as a non-commercial user. After logging in on the CDSE website, you can run a Jupyter notebook, select its size, and select the kernel, ranging from plain Python 3, through Python with geoscience libraries like GeoPandas, to openEO, or the recently added R kernels. This works based on predefined quota limits, which reset every month. And if the quota limits we provide are not enough for you, then you should really consider requesting a higher-tier account. If you work at a public or scientific institution, you can elevate your quota up to the Copernicus services provider level — a crazy amount of computing power, totally for free, and it resets every month. If you want to do this, please visit this link; it is very well described how to do it.
And today we will discuss the CREODIAS implementation. Everything I described so far is great, but if you would like to run clusters, run actual virtual machines, or expose your own services, then you should try out the commercial side. CREODIAS is a product of the entire consortium, but it started at CloudFerro. Through a really easy graphical user interface you can just run through the form and provide the information, starting from basics like the virtual machine name, then select the flavor — the flavor is the actual size of the machine — and then select the image, which describes the operating system, but not only that. For example, one of the images is OSGeo Live 16, which is really nice because you can jump into a known environment and start your computations and work right away. Like I said, this is a really easy process, and the startup time of a machine is usually less than 60 seconds.
So why OGC services? This one is really easy to answer: they are popular, everyone uses them, and they are official standards, which is very important for public institutions. They work with almost all types of spatial data formats, and the output of OGC services can be further processed by virtually every piece of GIS software, from QGIS through GDAL to PostGIS. So let's get technical. As the framework implementing the OGC standards we selected MapServer, because it is written in C, so it is fast. It works like this: the happy user sends a GetMap request, it hits the MapServer instance, and MapServer checks the mapfile configuration. The mapfile is the heart of MapServer — it stores the information on what data the user wants and how that data should be rendered. Based on this, MapServer reaches out to the supported spatial data formats; as you can see, there are Postgres, GeoPackage, COGs (my dream is that all data would be stored in COGs, but unfortunately it is not), and JPEG 2000. Then it renders the data back to the user, and the user gets either a visualization of the data or the analytical pixel values, if we are talking about WCS.
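To make the request flow concrete, here is a minimal sketch of a GetMap call with OWSLib; the service URL, layer name and TIME value are placeholders made up for illustration, not the project's actual endpoint.

    from owslib.wms import WebMapService

    # Hypothetical WMS endpoint exposed by the MapServer deployment.
    wms = WebMapService("https://example.org/mapserver/wms", version="1.3.0")

    # A GetMap request: MapServer resolves the layer through the mapfile,
    # reads the matching rasters and renders the requested bounding box.
    img = wms.getmap(
        layers=["sentinel2_rgb"],            # assumed layer name
        srs="EPSG:4326",
        bbox=(26.6, 58.3, 26.8, 58.4),
        size=(512, 512),
        format="image/png",
        time="2024-06-03",                   # WMS-T dimension, if the layer exposes one
    )

    with open("getmap.png", "wb") as f:
        f.write(img.read())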
Another advantage of MapServer is that it is GDAL-based — GDAL is my favourite geospatial software — and we use two very cool features of GDAL. First of all, VRTs, virtual rasters: you can provide GDAL with instructions on how it should process the data, and it happens on the fly. For example, we use it for on-the-fly cloud masking of Sentinel-2. In Sentinel-2 L2A there is a file called SCL, which represents the scene classification map; we simply reference it in the virtual raster and the clouds are masked out. So if you request data from more than one day, a cloudless mosaic is created.
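In production this masking lives entirely in the VRT that MapServer evaluates through GDAL; purely as a rough sketch of the same logic in plain Python (the file names and the resampling step are assumptions, and the SCL class codes should be checked against the L2A product specification):

    import numpy as np
    from osgeo import gdal

    # Hypothetical local inputs; the real setup only references them in a VRT.
    band = gdal.Open("T35VLF_20240603_B04_10m.jp2").ReadAsArray()
    scl_ds = gdal.Open("T35VLF_20240603_SCL_20m.jp2")

    # Resample the 20 m scene classification layer onto the 10 m band grid.
    scl = gdal.Translate(
        "/vsimem/scl_10m.tif", scl_ds,
        width=band.shape[1], height=band.shape[0],
        resampleAlg="nearest",
    ).ReadAsArray()

    # Assumed cloud classes: 8/9 = cloud medium/high probability, 10 = thin cirrus.
    cloud = np.isin(scl, [8, 9, 10])
    masked = np.where(cloud, 0, band)   # zeroed pixels act as nodata in the mosaic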
There is another great utility of GDAL: the virtual file systems. They work as a prefix to the actual path of the data. For example, if the data is stored in a zipped archive, you can use the prefix to read the data out of the ZIP without actually unpacking the archive. We use it to reach the data stored on S3: if you have ever used CREODIAS, there is this eodata folder I mentioned, and in fact it is a massive S3 bucket. With this prefix we are able to reach remotely all of the data gathered by the Copernicus programme (and not only that) and provide custom instructions on how GDAL — or rather MapServer using GDAL — should process it.
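A small sketch of those virtual file systems with the GDAL Python bindings; the S3 endpoint, bucket name and product path below only mimic the eodata layout and are assumptions — the real values come from your CREODIAS/CDSE account.

    from osgeo import gdal

    # /vsizip/: read a raster straight out of a ZIP archive without unpacking it.
    ds_zip = gdal.Open("/vsizip/product.zip/GRANULE/IMG_DATA/B04.jp2")

    # /vsis3/: reach the eodata repository remotely over S3 (values are assumed).
    gdal.SetConfigOption("AWS_S3_ENDPOINT", "eodata.dataspace.copernicus.eu")
    gdal.SetConfigOption("AWS_VIRTUAL_HOSTING", "FALSE")
    gdal.SetConfigOption("AWS_ACCESS_KEY_ID", "<access-key>")
    gdal.SetConfigOption("AWS_SECRET_ACCESS_KEY", "<secret-key>")

    ds_s3 = gdal.Open("/vsis3/eodata/Sentinel-2/MSI/L2A/2024/06/03/example_product/B04_10m.jp2")
    if ds_s3 is not None:
        print(ds_s3.RasterXSize, ds_s3.RasterYSize)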
Another nice place where we use GDAL: if you have ever used the Copernicus Browser, which works really nicely and is an amazing tool, you know that if you zoom out too much you are greeted with a message that the data will be displayed once you zoom in. We are trying not to do that in this solution, so what we do is create daily overviews — for now, for Europe — in the form of COGs; the actual data from eodata is requested only once you zoom past those overviews. From the user's perspective there is no problem with this, because the overview is used only at low zoom levels and there is no visible difference between it and the original data — but there is a difference in rendering time. And here we use an amazing creation option of GDAL called SPARSE_OK=TRUE: nodata blocks are simply not written to the file, so thanks to this the overview files are really small and there is no problem with storage.
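A sketch of how such a sparse daily overview could be produced with the GDAL Python bindings, assuming the COG driver and its SPARSE_OK creation option; the input VRT name is made up.

    from osgeo import gdal

    # Build a coarse daily overview as a Cloud Optimized GeoTIFF. SPARSE_OK=TRUE
    # keeps empty (nodata) blocks out of the file, so a Europe-wide mosaic with
    # large gaps stays very small on disk.
    gdal.Translate(
        "overview_2024-06-03.tif",
        "daily_mosaic_2024-06-03.vrt",        # hypothetical VRT over that day's scenes
        format="COG",
        xRes=300, yRes=300,                   # coarse target resolution (units of the source SRS)
        creationOptions=["COMPRESS=DEFLATE", "SPARSE_OK=TRUE", "OVERVIEWS=AUTO"],
    )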
And now let's discuss the mapfile, the heart of MapServer — the instruction on how to render the data. In my dream scenario all of the data would be stored in cloud-optimized GeoTIFFs, with their internal structure of overviews and tiling, which would be amazing, but unfortunately this is not the case: we store the data in JPEG 2000. However, Sentinel-2 L2A products have directories representing the different spatial resolutions of the rasters, and you can use them to create your own pyramid within the mapfile. Using the layer definitions, I first reach out to the 60-metre rasters, then to the 20-metre ones, and the best resolution, 10 metres, is displayed only at high zoom levels. This is great, and thanks to it we can make the service work very fast.
Another nice utility of MapServer is tile indexing. For Sentinel-2 we are talking about almost 40 petabytes of data, I think, so we cannot store the paths simply in a text file — that would not be usable by any software. But we can store them in Postgres and define a connection from MapServer to Postgres, so the paths to the data are kept in a table with at least three columns: the location, i.e. the path on S3; the datetime, i.e. the timestamp of the data; and the geometry, which is the footprint of the actual scene. If you are familiar with the GetMap request, it provides the service with the bounding box of the requested data; that bounding box is transferred into the SQL query, and only the intersecting polygons — well, rasters — are chosen.
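As a hypothetical sketch of the kind of spatio-temporal query the tile index answers (the table and column names — location, datetime, geom — follow the description above but are assumptions):

    import psycopg2

    # Connect to the Postgres/PostGIS instance holding the tile index.
    conn = psycopg2.connect("dbname=tileindex user=mapserver host=localhost")

    # The GetMap bounding box and time window become a spatio-temporal filter:
    # only scenes whose footprint intersects the box are handed to MapServer.
    bbox = (26.6, 58.3, 26.8, 58.4)   # minx, miny, maxx, maxy in EPSG:4326
    sql = """
        SELECT location, datetime
        FROM sentinel2_tileindex
        WHERE geom && ST_MakeEnvelope(%s, %s, %s, %s, 4326)
          AND datetime BETWEEN %s AND %s
    """
    with conn, conn.cursor() as cur:
        cur.execute(sql, (*bbox, "2024-06-01", "2024-06-08"))
        for location, dt in cur.fetchall():
            print(dt, location)        # e.g. /vsis3/eodata/... paths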
So, one more time: this is the table that represents the spatio-temporal index of the rasters, which is pretty much STAC — the SpatioTemporal Asset Catalog. As the backend of the tile index we use pgSTAC, so it is just a Postgres table, and thanks to the SQL query in MapServer we can transform the pgSTAC table into a table that MapServer recognizes. Thanks to this, while we are loading the STAC with Sentinel-2 items, we are at the same time creating the WMS service out of it, because the query stays the same while the table keeps growing. And STAC has this amazing extension called web-map-links, which makes it possible to add the service URLs to the STAC items, so the STAC items are also associated with the OGC services, which in my opinion is great. The whole magic here is in writing the correct SQL query.
So yeah, the Kubernetes deployment: as I said, we use the CREODIAS dashboard to deploy the Kubernetes cluster, which is super easy. The idea here is that we wanted to be protected from potential bottlenecks, in the form of MapServer not being able to scale up enough, so there are many pods — workers — deployed on the Kubernetes cluster, and in front of them there is a load balancer. If there are many users, each request is directed to a different pod, so there is no problem on the side of MapServer efficiency.
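As a rough, hypothetical illustration of what scaling the MapServer workers looks like with the official Kubernetes Python client (the deployment and namespace names are made up; in practice the replica count can also be managed from the dashboard or by an autoscaler):

    from kubernetes import client, config

    # Use the kubeconfig downloaded for the cluster (e.g. from the CREODIAS dashboard).
    config.load_kube_config()

    apps = client.AppsV1Api()

    # Scale the hypothetical MapServer deployment to ten replicas; the load
    # balancer in front of the pods spreads incoming WMS/WCS requests.
    apps.patch_namespaced_deployment_scale(
        name="mapserver",
        namespace="ogc-services",
        body={"spec": {"replicas": 10}},
    )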
Okay, so one more thing about why we are using GDAL. As Marcin said before, we are working on cloudless mosaics of Sentinel images, but the big thing is that we are not creating new images — we use one and the same dataset, and it is all defined inside the VRT file, which is something like XML. We define that some pixels are cloud and some are not, and those that are cloud should be transparent; again, we are not creating new images or new data. If we wanted to, for example, create a mosaic of NDVI or NDWI, we would just create a new definition inside those XML files, so there is no big storage behind this — we only define how the data should be visualized.
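In the actual solution this band arithmetic is expressed inside the VRT and nothing new is written to storage; purely as an illustration of the computation itself, the NDVI from red and near-infrared arrays is just:

    import numpy as np

    def ndvi(red: np.ndarray, nir: np.ndarray) -> np.ndarray:
        """Normalized difference vegetation index: (NIR - RED) / (NIR + RED)."""
        red = red.astype("float32")
        nir = nir.astype("float32")
        return np.where(nir + red == 0, 0, (nir - red) / (nir + red))

    # With Sentinel-2, red would be band B04 and NIR band B08; in the VRT-based
    # setup the same formula is written into the VRT XML instead.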
The final step in our pipeline, in our solution, is a QGIS plugin, which sits between the user and our OGC services. The main goal of this plugin is to provide an easy tool that handles all the requests and gives users easy visualization of, and easy access to, the data through WCS and WMS. Here you can see how it looks in the plugin: you can visualize data just by choosing a datetime, and by hitting one button you can add a whole dataset — for example one week of data — as a new layer in QGIS. Also, as I said, there is download via WCS, where, just like in the WMS scenario, you define the time. Here you must also define the area of interest: you draw a bounding box in QGIS, hit the download button, and you get a GeoTIFF at the maximum resolution, ten by ten metres, which you can then process on your machine as you like.
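A minimal, hypothetical sketch of the WCS download the plugin performs under the hood, using OWSLib; the endpoint, coverage name and resolution values are placeholders.

    from owslib.wcs import WebCoverageService

    # Hypothetical WCS endpoint exposed by the MapServer deployment.
    wcs = WebCoverageService("https://example.org/mapserver/wcs", version="1.0.0")

    # GetCoverage: a GeoTIFF clipped to the drawn area of interest and time.
    response = wcs.getCoverage(
        identifier="sentinel2_l2a",            # assumed coverage name
        bbox=(26.6, 58.3, 26.8, 58.4),
        crs="EPSG:4326",
        format="GTiff",
        resx=0.0001, resy=0.0001,              # roughly 10 m at this latitude
        time=["2024-06-03"],
    )

    with open("aoi_10m.tif", "wb") as f:
        f.write(response.read())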
Okay, so what we are missing right now. We should really consider moving to the OGC APIs, since the old standards are by now, I don't know, about 20 years old — and pygeoapi seems promising here. Another thing is that we should rebuild our QGIS plugin (the demo of it) so that it uses STAC: the data discovery, the download of separate products, and the download of single assets within products should be based on STAC; the visualization should be based on WMS and OGC API Maps; and the analytics should be based on WCS and OGC API Coverages.
Regarding the processing, we are about to add the possibility of creating your own band composition through the URL. We also want to give users the ability to write their own scripts: for example, you would like to create a script which scales the raster by its minimal value — you can write it in Python or C, and we will implement it for you, and it will also be possible to trigger it through, I guess, some kind of URL; this is not actually figured out yet. Custom index computation is also on the table, and we are pretty much ready to go with it.
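To make the "scale by the minimal value" example concrete, such a user-supplied function could be as small as this (a hypothetical sketch; the service-side wiring is exactly what is still being figured out):

    import numpy as np

    def scale_by_min(raster: np.ndarray) -> np.ndarray:
        """Shift the raster so that its smallest valid value becomes zero."""
        valid = raster[~np.isnan(raster)]
        return raster - valid.min() if valid.size else raster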
Another aspect: pgSTAC is great, and I love it with all my heart, but currently we are running it in a single Docker container, and if we are aiming to build the whole catalogue out of it, we should deploy it the cloud-native way. Here, eoAPI is about to release the Crunchy Data Postgres operator setup for Kubernetes, and this seems really promising. If not, we will probably go with our own implementation based on the CloudNativePG operator. And then it will be super easy, because you can use pypgstac to set up pgSTAC in the selected database — all you need are the database credentials.
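A hedged sketch of that loading step, assuming a recent pypgstac release and a database already migrated to the pgSTAC schema; the DSN and file names are placeholders.

    from pypgstac.db import PgstacDB
    from pypgstac.load import Loader, Methods

    # All you really need are the credentials of the pgSTAC database.
    db = PgstacDB(dsn="postgresql://pgstac:password@db.example.org:5432/postgis")

    loader = Loader(db=db)
    loader.load_collections("sentinel-2-l2a-collection.json", insert_mode=Methods.upsert)
    loader.load_items("sentinel-2-l2a-items.ndjson", insert_mode=Methods.insert_ignore)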
Also, the whole idea of replicating the whole catalogue as STAC will be easier after ESA's transition to the GeoZarr format, which supports internal overviews and an internal tiling structure and is great for analytics, being native to data cubes. And finally, authentication of users with the S3 credentials from the CDSE. And that's it, thank you. Thank you, Marcin and Michał. It's always nice to hear so many enthusiastic speakers.
I'll start out with the first question, since I know the rest of your slides: do you have any input on the future of the system? Maybe you have an extra slide you could show us with the future. Ah, great, there's the future slide. Yeah. So, maybe I will just discuss the slide. Starting from the beginning, we will extract the metadata using STAC tools, or our own tool, based on the actual metadata stored in the repository with the associated SAFE catalogues.
Then we will load this JSON into pgSTAC, and pgSTAC will serve as a tile index for both MapServer and pygeoapi, which uses MapScript. Then we will use the GDAL VRT to provide the analytical values — the processing on the fly — for WMS-T, WCS, OGC API Maps and OGC API Coverages. And this will feed the QGIS plugin as well as stac-fastapi. So yeah, that's it, that's the future, and I hope we will be able to present it at the next FOSS4G.
You mentioned that you are using JPEG 2000 right now, and that in a perfect world it would be cloud-optimized GeoTIFF, and now I also saw GeoZarr. Could you elaborate on the benefits or downsides of the different formats?
Okay, so the main difference between COG and JPEG 2000 is that the COG has this internal structure of overviews and tiling. So if a user wants to, for example, get the data over Tartu from a Sentinel-2 scene, MapServer reads only that little part of the tile. In the case of JPEG 2000, the whole set of overviews must be read to get to the actual data. That's the main difference. But, hey, JPEG 2000 has better compression, from what I understand. As for GeoZarr — I'm not an expert on GeoZarr, but I know it has amazing analytical value because it is a format native to data cubes, and at the same time it also supports an internal structure of tiling and overviews. So it is a kind of supreme format when it comes to data streaming and data analytics. That's all, thank you.
Yeah, I also want to add that Element 84, if you know them, provide cloud-optimized GeoTIFFs with STAC, up to date, for Sentinel-2, so you can use them directly from public S3 buckets — the open datasets on AWS. Okay, but here's the thing: the CDSE and the company I represent are the official storage of the Copernicus programme, so I guess it's not a good idea to fetch the original data that we already have from other providers.
Thank you for your question.