Open Data Analytics API in GeoNetwork
Formal Metadata
Number of Parts: 266
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifier: 10.5446/66496 (DOI)
Transcript: English(auto-generated)
00:08
So, I'm going to talk about the idea and the work we've done so far towards an open data analytics API. And I mention GeoNetwork here because we are part of GeoNetwork and we see GeoNetwork
00:22
as an entry point to any SDI. It's a metadata catalog. And we try to enlarge its scope to open data. And the open data world brings new ideas and new use cases. And one is the reuse of the data.
00:43
And that's why I'm going to talk about this work today. So, first, I'm Florent Gravin, head of technology at Camptocamp. And I'm also PSC chair of GeoNetwork. And I was supposed to present with my colleague, Olivia, but she has a flight issue, so she
01:02
will arrive a bit later. So, just a very brief introduction, but if you want to know more about GeoNetwork, you can attend the presentation this afternoon at 2 on the outer stage.
01:22
It's just the OSGeo solution for metadata cataloging. So, you can edit your metadata, link it to datasets, and search for metadata. And the point here is that GeoNetwork is really focused on metadata. So, it's identified as a metadata catalog.
01:42
And what we want nowadays is not to look for specific metadata, but to look for data, to value data and reuse the data. So, the data API is part of that move.
02:02
So, first, I will introduce briefly how the data appears and is linked in GeoNetwork and in the metadata overall. Because it's the entry point to any search for data. Then I will talk about our plans within a customer project to enlarge the possibilities
02:20
of GeoNetwork towards this data platform. I will talk about some technical considerations, about what solutions we thought about for achieving that. Then I'll go through a quick example and give a conclusion about this work.
02:41
Okay. So, let's look at how the data works in GeoNetwork so far. So, actually, GeoNetwork really focuses on the metadata more than on the data. There are some options around the data. Like the GeoPublisher, where you can push a shapefile into GeoServer, but
03:02
it's quite limited. There are also WFS harvesters where you can index feature collections into Elasticsearch. But that's mostly used internally for the legacy UI features. It's not meant to deliver output for external organizations or external services.
03:30
Then there are links. In the metadata, there are links to the data. But the problem is they often rely on third-party services like GeoServer, for
03:41
instance, which makes them not reliable all the time. Then the links to the data themselves can be quite complex. And there is no standard, actually, in the ISO format, for instance, to define a data
04:04
or a data link. There are some protocols. But sometimes you have to guess the format of the file when you want to download a CSV or a GeoJSON or something like that. So, it's quite hard. The picture here illustrates that we are looking more at the information about
04:27
the object than the object itself. And that's how GeoNetwork works. And it's really what we want to change: becoming a data catalog instead of a metadata
04:41
catalog. It's all good so far. Because GeoNetwork works with the metadata and it works well with that. But it doesn't really do things about hosting files, hosting data, providing backend services
05:03
to update the data. So, it's really separated from a data service such as MapServer or GeoServer. And it works well because there are kind of best practices for how to define the data links. And thanks to OGC standards and protocols, which allow us to visualize the data in
05:27
some cases. But mostly, if you look at the whole ecosystem, when you harvest many different metadata catalogs, you really see that it's not generic.
05:40
And a lot of metadata doesn't provide the data. And you have to implement specific processing to extract the data from the metadata and see what the links are, what the formats are, et cetera. So, this is basically how it looks in GeoNetwork. So, there is a link, there is a protocol, but you have to guess things.
06:03
What is the MIME type? What is the format? What is the size? What is the number of items? All this information is really what you want when you look at metadata. You want to know these kinds of things. But you don't have them.
06:20
And one thing which is important as well is that it really focuses on geodata, and it doesn't really cover all the use cases that modern catalogs, such as open data catalogs, provide. So, all these OGC standards don't really work well outside the geo ecosystem.
06:43
And then, the API to provide the data. So, when we talk about that data API, it's an API to fetch the data. And mostly in the geo world, it's WFS or OGC API Features. But it's limited. So, in the new Features specification, you have full-text search, but it wasn't
07:01
the case in WFS. You have no aggregation possibilities. You have no processing queries. It's very limited to fetching the data, not analyzing the data. And when we look at the open data ecosystem, they have some kinds of tools, some kinds of
07:20
APIs to make analyses out of the data. So, to address that, we think that GeoNetwork, this catalog, has to provide its own data API on top of that to embrace both geospatial data and open data.
07:43
So, our current plans. We have a big customer in France who is actually working with both geospatial data and open data.
08:03
And it's a very common use case where people have an open data catalog and a metadata catalog. Yes, a geo metadata catalog. So, they want to move away from this system because there are high license costs and they don't like having two separate catalogs and APIs.
08:23
So, they want to rely on GeoNetwork because GeoNetwork is free and open source software and it deals great with indexing and searching. And we just want to extend it to be kind of a data platform instead of just metadata.
08:41
And we would like to have the GeoNetwork UI project as the front end to address a new user experience for the new use cases like data visualization, search, et cetera. So, this is the plan. This project has already started.
09:01
So, what are the target use cases? To provide fast and interactive data visualization. So, we would like to have an API where we are able to dynamically and instantly derive value from the data. Like charts, pie charts, et cetera.
09:21
If you use WFS, you have to fetch all the data before processing it to deliver a chart. But if you go beyond that and deliver a new API, you can just ask the server to process the data and send the result back to the client to serve your charts.
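As a rough illustration of that difference, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for the backing store; the table, field names, and values are invented for the example:

```python
import sqlite3

# Hypothetical mini feature collection standing in for a real dataset.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE features (sensor TEXT, value REAL)")
con.executemany(
    "INSERT INTO features VALUES (?, ?)",
    [("a", 10.0), ("a", 5.0), ("b", 2.5)],
)

# WFS-style: fetch every feature, then aggregate on the client.
rows = con.execute("SELECT sensor, value FROM features").fetchall()
client_side = {}
for sensor, value in rows:
    client_side[sensor] = client_side.get(sensor, 0.0) + value

# Data-API-style: ask the server for the aggregation only; the client
# receives one row per group instead of the whole collection.
server_side = dict(
    con.execute("SELECT sensor, SUM(value) FROM features GROUP BY sensor")
)

assert client_side == server_side  # same result, far less data transferred
```

The same aggregation arrives either way, but the server-side variant moves one row per group over the wire instead of the full feature set.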
09:45
So, we want to have aggregations as well, to be able to have, yes, an up-to-date dashboard, different queries like search, aggregation. We want to have functions, processes like sum, average, et cetera.
10:04
Map things as well. We would like to be able to do some joins. If you have open data with a zip code, to be able to join it to geodata. And then do aggregation, clustering, heat maps.
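The zip-code join could look something like this sketch, again with sqlite3 as a stand-in and entirely invented table names and coordinates:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Open dataset: records keyed by zip code, no geometry at all.
con.execute("CREATE TABLE incidents (zip TEXT, count INTEGER)")
con.executemany("INSERT INTO incidents VALUES (?, ?)",
                [("59000", 12), ("59160", 3), ("59000", 4)])
# Geo dataset: zip code mapped to a (simplified) centroid.
con.execute("CREATE TABLE zones (zip TEXT, lon REAL, lat REAL)")
con.executemany("INSERT INTO zones VALUES (?, ?, ?)",
                [("59000", 3.06, 50.63), ("59160", 3.01, 50.65)])

# Join + aggregate: total incidents per zone, now carrying coordinates,
# so the result is ready for clustering or a heat map.
joined = con.execute("""
    SELECT z.zip, z.lon, z.lat, SUM(i.count)
    FROM incidents i JOIN zones z ON i.zip = z.zip
    GROUP BY z.zip
""").fetchall()
```

In a real deployment the geometry side would be actual polygons in PostGIS rather than centroid columns, but the join-then-aggregate shape is the same.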
10:21
So, this is what we target as use cases. The search as well. When you have a platform, usually you have a search input to search for data. And if you just query the metadata, you won't find everything. So, the point is also to be able to search in the data through this API.
10:45
It's a bit technical here, but this is about how we design such an API. It's hard to tell because actually in the geospatial world, you have OGC, with standards like WFS and the OGC API Features.
11:02
In the open data ecosystem, there is nothing. No real standards. They have their own APIs, their own formats. And it's hard to imagine what an API which unifies them all could look like. For sure, it has to be compatible with OGC API Features.
11:23
So, there will be an implementation. But it's limited. So, we think of proposing an extension to provide the missing parts like the aggregation, the analytics, the clustering, the extraction.
11:43
So, we would like to propose a query extension to the API to be able to achieve all these use cases. But actually, it's quite limited as well. Because for the open data world, people will find it weird that we use this kind
12:03
of API to fetch non-geodata, for instance. So, I would say that our vision is more to abstract the REST entry points. Provide OGC API Features compatibility, but maybe provide other compatibilities.
12:21
We like, for instance, the OpenDataSoft API, which works quite well. And then there will be an abstraction, and processes will be done under the hood. So, we have other ideas to extend the capabilities beyond this API.
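The "abstraction with processing under the hood" idea could be sketched roughly as follows; every class, function, and query-syntax detail here is hypothetical, not part of GeoNetwork:

```python
from abc import ABC, abstractmethod

class DataBackend(ABC):
    """One warehouse hiding behind several public API flavours."""
    @abstractmethod
    def aggregate(self, dataset: str, field: str, func: str) -> float: ...

class InMemoryBackend(DataBackend):
    """Toy backend; a real one could be PostGIS, DuckDB, Elasticsearch..."""
    def __init__(self, datasets):
        self.datasets = datasets

    def aggregate(self, dataset, field, func):
        values = [r[field] for r in self.datasets[dataset]]
        return {"sum": sum, "avg": lambda v: sum(v) / len(v)}[func](values)

# Each public entry point (an OGC API Features extension, an
# OpenDataSoft-like endpoint, ...) translates its own query syntax
# into the same backend calls.
def odsoft_style(backend, dataset, expr):
    func, field = expr.split("(")       # e.g. "sum(count)" -> sum, count)
    return backend.aggregate(dataset, field.rstrip(")"), func)

backend = InMemoryBackend({"traffic": [{"count": 3}, {"count": 7}]})
total = odsoft_style(backend, "traffic", "sum(count)")
```

The point of the design is that swapping the warehouse, or adding another API dialect on top, touches only one side of the abstraction.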
12:42
So, for instance, the API will just return features. And we know that if everything is indexed or stored in a storage, we could imagine other use cases like providing vector tiles of the data based on filtering, analyzing,
13:03
et cetera. Styling for WMS. A streaming API to stream the data instead of fetching it all at once. And so on. So, what we are focusing on in our approach here is, yeah, we don't see the data platform as
13:23
an end, but as part of a bigger chain, which aims to provide more information and applications based on the data. And we want to focus on the reuse as well. Because when you land on a metadata catalog or a catalog, it's always hard to know what
13:44
you can do with the data and how you can really value it. Some technical aspects about the work we have done so far. So, how do we want to set up this kind of API? It's a recap.
14:02
In the geo world, you have WFS. And if you want to process, you have WPS. But I don't know how many of you have run your own WPS or have integrated WPS in your chains; very few customers have this kind of tool available. That's why we want to embed that in a better API.
14:21
In the open data world, it's more of a mess. Custom APIs, custom formats. So, the idea is to extend the data API for non-geodata to do analytics, facets, and geospatial processes. And one goal behind that is to provide a unification of both of these worlds.
14:44
So, these are the use cases we want to address. Browse the data. So, for instance, for a table where you want to have the data information, the attributes. So, full-text search, pagination, sorts, filters, and facets.
15:02
For the data visualization, we want to do operations like data extraction, aggregation on dates, histograms, functions. And on the map, we want to do spatial filters, clustering, heat maps. Because if you want to display features on maps with WFS, you have to fetch them all.
15:24
And if the file is very big, it won't appear. Then there is the question of how we are going to store the data. So, I talked about the API design. And now, on the left part, there is the data itself.
15:40
So, there are different solutions for that. The most obvious could be a transactional database like PostGIS, where you can store open data from CSV, from whatever. And you can store geodata as well. And you will benefit from all the PostGIS functions.
16:03
But if you have big files, it can be slow. So, we have to identify other solutions depending on the use case. Like what kind of data do we want to provide? So, there are columnar formats. You know Parquet and GeoParquet.
16:22
Citus, which is a Postgres extension for columnar storage. OLAP, which is online analytical processing, like DuckDB or ClickHouse. So, it's based on columnar formats, but it provides some APIs, not REST APIs, but APIs to be able to get value from the data.
16:42
And there is another option, actually, which is an index like Elasticsearch, which is already in GeoNetwork and which addresses all the use cases. So, we have to identify what would be the best solution. In the end, it's the same as the API. We need to have an abstraction. And we need to be able to plug in any kind of warehouse that we would like to have.
17:07
In this API, we think that SQL should influence the API design. Because SQL is a query language which makes it possible to get anything from the data, okay?
17:23
And actually, it could work quite well. So, for instance, here is the list of the use cases that we want to address. And there is always, yes, a function or a keyword or something in SQL that addresses that.
17:43
For the facets, for instance, we would have to make one request per field that you want facets for, with some operation and an aggregation, to know how many records and features we have for each value of this field.
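That one-aggregation-per-faceted-field pattern can be sketched like this, with sqlite3 standing in for the warehouse and made-up field names; the field name is interpolated directly into the SQL here purely for illustration, so it must never come from user input:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE records (format TEXT, theme TEXT)")
con.executemany("INSERT INTO records VALUES (?, ?)", [
    ("CSV", "transport"), ("CSV", "energy"), ("GeoJSON", "transport"),
])

def facet(con, field):
    # One GROUP BY aggregation per faceted field: value -> record count.
    return dict(con.execute(
        f"SELECT {field}, COUNT(*) FROM records GROUP BY {field}"))

# Building all facets means one such query per field.
facets = {f: facet(con, f) for f in ("format", "theme")}
```

So a facet panel over N fields costs N aggregation queries, each of which a columnar or indexed store can answer without scanning full rows.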
18:03
So, actually, this could orient the API design. Some optimization. I talked about columnar formats. So, here is a transactional database. So, you see that if you want to do some analytics on column 1,
18:21
you have to browse all the columns of all rows, which is quite slow. Whereas the columnar format just reverses the positions of the columns and the rows. And if you want to perform an analysis on column 1, you can just go straight through it and have an instant result.
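A toy illustration of the two access patterns, in plain Python with invented column names (real engines add compression and vectorized scans on top of this layout difference):

```python
# Row-oriented: each record is stored together; an analytic query on one
# column still walks every field of every row.
rows = [{"col1": i, "col2": "x" * 10, "col3": i * 2.0} for i in range(1000)]
total_row_store = sum(r["col1"] for r in rows)

# Column-oriented: each column is stored contiguously; the same query
# touches only the one array it needs.
columns = {
    "col1": [r["col1"] for r in rows],
    "col2": [r["col2"] for r in rows],
    "col3": [r["col3"] for r in rows],
}
total_col_store = sum(columns["col1"])

assert total_row_store == total_col_store  # same answer, different I/O
```

Both layouts return the same aggregate; the columnar one simply reads a fraction of the data to get there, which is where the DuckDB numbers later in the talk come from.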
18:43
So, this has to be considered. But actually, it depends on your use case. This is a ClickHouse OLAP picture that shows what we want to address. So, we would like to be able to plug in to any kind of data source. And then provide an API on top of that which just makes it possible to reuse the data.
19:07
So, there is an ingestion part that I won't talk about today. One use case to illustrate what we are doing.
19:20
From the open data platform of Metropole de Lille, I took a dataset. And here is what we can do with the open data catalog, which is an OpenDataSoft instance. And I tried to see how we could address that. And actually, it was quite easy.
19:42
So, I took two use cases. So, the sensor name and the sum. And instantly, after importing the data into PostGIS, we can have a SQL query which fetches exactly the data to provide such a chart.
20:02
A more complex one where we break down series by year. So, it means that there are two aggregations. One for the sum and one for the date, where we extract the year of the date. Because it's a datetime. And it's the same. It's quite easy to do it with SQL.
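The year-breakdown query described above can be sketched like this; sqlite3 and its strftime function stand in for PostGIS here, and the sensor names, timestamps, and values are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (sensor TEXT, ts TEXT, value REAL)")
con.executemany("INSERT INTO readings VALUES (?, ?, ?)", [
    ("s1", "2022-03-01T10:00:00", 4.0),
    ("s1", "2022-07-01T10:00:00", 6.0),
    ("s1", "2023-01-15T10:00:00", 3.0),
    ("s2", "2023-02-02T10:00:00", 8.0),
])

# Two aggregations at once: sum the values, grouped by sensor AND by the
# year extracted from the datetime -> one row per (sensor, year) series.
series = con.execute("""
    SELECT sensor, strftime('%Y', ts) AS year, SUM(value)
    FROM readings
    GROUP BY sensor, year
    ORDER BY sensor, year
""").fetchall()
```

In PostGIS the year extraction would be `EXTRACT(YEAR FROM ts)` or `date_trunc`, but the query shape is identical, which is exactly the SQL-oriented-API argument.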
20:25
And what is great with SQL is that, for instance, all the OLAP engines which work on columnar formats use SQL as well. So, actually, if you define a SQL-oriented API design, you are able to play with PostGIS,
20:44
but you are also able to play with DuckDB, which is already able to plug into any kind of data source. So, here I transformed my data from Postgres to Parquet. And then I used Python to just test DuckDB and the columnar format performance.
21:07
So, it's the same queries. It's just that the FROM clause takes the Parquet file. So, it's an implementation detail, I would say. And some metrics about that. The file is 1.5 million rows.
21:22
So, it's big, but it's not huge. The GeoJSON, when I extracted it, is 500 megabytes. In the database, 300. And the Parquet file is 20 megabytes. And for the performance, for the first use case, yes, 2,050 milliseconds for Postgres
21:43
and, for DuckDB, 70 milliseconds. And for the breakdown series, a bit longer. But we see that DuckDB is much faster. So, it was just storytelling about this work, which is in progress.
22:01
The idea is really to make GeoNetwork provide both the data and the metadata, to make the move and the turn to the open data world. Because open data catalogs are more data platforms. And we think that GeoNetwork should also be a data platform.
22:23
So, we are doing this under the umbrella of GeoNetwork. But as you can see, all this analysis can be separated, because it's just about how to design a warehouse, how to design an API. And it could be independent from any solution.
22:47
The conclusion is that, yes, we're still working on it. We hope that we can have a release next year. Our goal is clearly to be a reference in the free and open source ecosystem
23:04
for both open data catalogs and geodata catalogs. And for organizations to be able to have just one catalog instead of managing two things and having to do harvesting
23:22
and synchronizing back and forth all the time. So, yes, we really think that there is potential. And hopefully for the Metropole de Lille, it's going to be in production next year. Thank you.