Geocint: Open Source Geospatial Data Processing Pipeline
Formal Metadata

Number of Parts | 156
License | CC Attribution 3.0 Unported: You may use, change, and reproduce the work or its content for any legal purpose, and distribute and make it publicly available in unchanged or changed form, provided you credit the author/rights holder in the manner specified by them.
Identifiers | 10.5446/68469 (DOI)
FOSS4G Europe 2024 Tartu (51 / 156)
Transcript: English (automatically generated)
00:00
Hi guys. My name is Andrei, and I'm a geospatial data engineer at Kontur. I have a bit of a background in remote sensing, especially for forest management, but now I'm working as a data engineer.
00:24
I like memes, I like open data and open software, and today I'm going to mix some of these topics to make my presentation a bit more fun.
00:41
Today I'm going to present our open source framework for building data pipelines. We named it Geocint. The name plays on the word for the hyacinth flower, mixed with "geo" because it's a pipeline for producing geodata,
01:06
and it's pronounced like "geo-cint". The world around us is a world of data, I think we all know that, and this is why we love data and work with it so much, especially open
01:23
data and open software for it. And we also love memes; it's a world of memes. But let me say a little bit about my company and about the main topics that
01:41
we focus on in our work. Kontur specializes in disaster management and geoinformation. We develop custom solutions for different companies around the world. We create platforms for disaster management and rapid decision making. We also develop systems for
02:03
coordination in disaster management. We also help our clients track events in real time. We have a disaster feed with disasters across the world from very different sources, both private, like the Pacific Disaster Center in Hawaii, and open, like GDACS or smaller
02:28
providers. We also help estimate risk and impact. Two hours ago my colleague from Kontur presented Disaster Ninja.
02:41
It's our critical event management platform, so he already told you a bit about our work. Some things I present will duplicate that, but I will try to go a little deeper. We also help our clients get notified about changes in the course of an event,
03:02
keep situational awareness at a good level, and take action based on multiple criteria. So what is Geocint? Geocint is an open source, Makefile-based geodata ETL pipeline: extract, transform, load.
03:24
It was initially designed for internal Kontur needs and we used it for a few years. After a few years of usage inside our team, and after a few projects, some of them public, like the geospatial data pipeline we built for MapAction, you might be familiar
03:43
with this organization, we decided to publish it under the MIT license. At Kontur we run the data pipeline on multiple servers on a schedule every day, to download and process data from OpenStreetMap,
04:03
like a full dump of OpenStreetMap, and various other datasets like HRSL, GHSL, and Wikidata, and to rebuild the datasets that we produce: the Kontur Population dataset, Kontur Boundaries, and data for Disaster Ninja,
04:22
our disaster management platform. We use Geocint especially to build datasets. We have two open datasets that we publish for free: a population dataset and administrative boundaries.
04:42
Our population dataset, currently at version 5, is built on the H3 grid system and is available for free at resolution 8 for the entire world on the Humanitarian Data Exchange. We also upload downsized versions at 3 kilometers and 22 kilometers.
05:03
You can also download those for free. And from version 3 on, we publish small files with extracts for each country, so you no longer need to download the big dataset for the entire world and then extract the data for a particular country; you can just download the extract.
05:24
Disaster Ninja, as I mentioned before, is our critical event management solution. We developed it together with the Humanitarian OpenStreetMap Team. It's open source and it's free.
05:42
It provides information about recent natural disasters and visualizes mapping gaps. We have a lot of layers that show the quality of coverage in OpenStreetMap, like OpenStreetMap building completeness, road completeness, and so on. And
06:02
the mapping coordinators from the Humanitarian OpenStreetMap Team can connect with local mappers to eliminate those gaps, because we have layers with local mappers. We collect data about OpenStreetMap users and create a layer where the most active users can be found.
06:26
If you are an activator from the Humanitarian OpenStreetMap Team and you want to find someone who can update the local map because they know the local situation, you can go and check this layer and directly find the people who are local
06:44
and who can map this for you if you connect with them. In a few words, it streamlines the process of mapping and of Humanitarian OpenStreetMap Team activation when disaster strikes. And the last tool for which we use Geocint:
07:05
we have on our platform, the varied layers. It's a special kind of layers where you can join our four indicators. Uh, the good example is a building quantity or the building completeness. Uh, when on one axis we have a relation between a number of buildings that
07:24
mapped in OpenStreetMap and the buildings from AI estimations, very different AI estimations. We collect a lot of different datasets, merge them and run quality checks, and we compare them with the data in OpenStreetMap to find places where not all
07:43
buildings are mapped. And on the other axis there is population, the population density.
08:00
We did it to highlight places where there are a lot of people but very low completeness, because for disaster management goals those are the most important places to map.
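To make the bivariate idea concrete, here is a rough sketch of how such an indicator could be assembled per H3 cell as a Geocint-style target. The table and column names (osm_building_count, estimated_building_count, population_h3) are hypothetical and not Kontur's real schema, and psql is assumed to take its connection settings from the usual PG* environment variables.

```make
db/building_completeness_h3: db/osm_building_count db/estimated_building_count db/population_h3 | db  ## hypothetical tables
	# One row per H3 cell: ratio of OSM buildings to AI-estimated buildings on one
	# axis, population on the other axis.
	psql -c "drop table if exists building_completeness_h3;"
	psql -c "create table building_completeness_h3 as \
	         select e.h3, \
	                coalesce(o.osm_count, 0)::float / nullif(e.estimated_count, 0) as completeness, \
	                coalesce(p.population, 0) as population \
	         from estimated_building_count e \
	         left join osm_building_count o using (h3) \
	         left join population_h3 p using (h3);"
	touch $@
```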
08:21
Those were our solutions for which we produce data using Geocint; now let's talk about Geocint directly. Our products as a whole have defined a set of key requirements for the software that we develop: simplicity,
08:42
reproducibility, a low entry barrier, easy and rapid prototyping from the starting point to a ready-made map, and high load performance. The main priority during system design was
09:01
explicit reproducibility. Losing your database shouldn't be a problem: you just recreate the database and the pipeline will rebuild all the data you had. We keep following this approach, so every layer that we have in Disaster Ninja,
09:23
any data that we have in our system, can be rebuilt without any losses. About the low entry barrier: most GIS engineers know SQL well, and Geocint gives you the ability to easily run powerful PostGIS queries.
09:51
We also did deep Git integration, to collaborate effectively inside the whole team, to be able to use AI tools to
10:05
check the quality of our code, and to be able to quickly switch the version of our pipeline to a different branch.
10:22
And a few words about the stack. We collected the open source packages that we use for processing geospatial data. It is based on GNU Make, plus make_profiler, which is
10:42
a linter and preprocessor for Makefiles. We use GNU Parallel for parallel operations, GDAL/OGR, osmctools and Osmium for OSM data processing, and PostgreSQL plus PostGIS and h3-pg to do the data transformations
11:07
and all the calculations inside the database. As you see, all of these packages are open source.
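As a small illustration of how two of these tools combine in a target, the sketch below fans file conversion out over CPU cores with GNU Parallel and GDAL's ogr2ogr. The directory layout and file names are made up, not taken from the real pipeline.

```make
data/mid/admin_gpkg_converted: data/in/admin_geojson | data/mid  ## hypothetical paths
	# Convert every downloaded GeoJSON to GeoPackage, 8 files at a time.
	# {} is the input path, {/.} its basename without extension (GNU Parallel syntax).
	ls data/in/admin_geojson/*.geojson | parallel --jobs 8 \
		'ogr2ogr -f GPKG data/mid/{/.}.gpkg {}'
	touch $@

data/mid:
	mkdir -p $@
```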
11:21
We are mainly focusing on PostgreSQL, PostGIS and h3-pg in our team. We have contributors who write to PostgreSQL and PostGIS, and we also have contributions to h3-pg, so we actively support the software that we use in our work.
11:42
These packages allow you to combine the powerful data processing features of PostgreSQL with efficient geometric operations from PostGIS, and to keep the benefits of using the H3 grid system. I think most of us are familiar with H3.
12:02
It minimizes distortion towards the poles, gives a compact format to store, and provides high-performance lookups. And the main reason why we use it: we use it to join very different datasets on one common base.
12:20
We transform them to the H3 grid system and after that you can compare them. This grid system is very useful because it's neither a vector format nor a raster format:
12:40
when you want a vector, you have a vector; when you want a raster, you use the H3 grid system as a raster. This is the reason why we use this hexagonal system for Disaster Ninja and why we built our population dataset on it.
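To illustrate the "vector when you want a vector" point: with the h3-pg PostGIS bindings you can turn cells back into polygon geometries and export them with ogr2ogr. Function names follow recent h3-pg (v4) releases and may differ in older versions; the kontur_population table stands in for any H3-indexed table, and PGDATABASE is an assumed environment variable.

```make
data/out/population_hexagons.gpkg: db/kontur_population | data/out
	# Materialize H3 cells as polygon geometries and export them as a GeoPackage.
	ogr2ogr -f GPKG $@ \
		PG:"dbname=$$PGDATABASE" \
		-sql "select h3::text as h3, population, h3_cell_to_boundary_geometry(h3) as geom from kontur_population"

data/out:
	mkdir -p $@
```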
13:00
Geocint consists of two open source repositories: geocint-runner and geocint-openstreetmap. Geocint-runner is the core part with the installation scripts. Geocint-openstreetmap is an open repository where we give an example of a small pipeline that downloads the OpenStreetMap planet dump
13:25
and then processes it and loads it into the database. After that you are able to process OpenStreetMap data inside Postgres. So, as was mentioned before,
13:45
Geocint is based on Makefiles, so the main logical block of the ETL process is a target. Why is that easy? Because you can migrate your bash scripts, your shell commands, to a Makefile almost without changes.
14:05
Only in very special cases, for example when you have a dollar sign in your command, do you need to duplicate it. In all other cases you can just take it, split it into small logical blocks,
14:21
transform each block into a target, and you will have a make pipeline that produces your data or performs any other operation you want.
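A minimal sketch of such a migration, with made-up file names and URL rather than anything from the Kontur pipeline. Recipe lines must be indented with tabs, and the one real gotcha is the doubled dollar sign: make expands a single $ itself, so shell variables and $(...) substitutions inside a recipe are written as $$.

```make
# Original one-off shell script:
#   wget -O cities.csv https://example.com/cities.csv
#   echo "rows: $(wc -l < cities.csv)"

data/in/cities.csv: | data/in   ## the download becomes its own target
	wget -q -O $@ https://example.com/cities.csv

data/mid/cities_rowcount: data/in/cities.csv | data/mid   ## the report depends on the download
	echo "rows: $$(wc -l < $<)" > $@   # note $$ - a single $ would be eaten by make

data/in data/mid:
	mkdir -p $@
```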
14:45
And as was mentioned before, you can easily use the magic of PostGIS without the operator plugins of the big data-pipeline-building tools: you can just take your SQL query and put it into your target, and it will work. It works fine in our production system, where we have nearly five thousand lines of pipeline code, and it runs stably and efficiently; we checked.
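For instance, a PostGIS query pasted straight into a recipe might look like the sketch below. The osm_polygons table and its columns are illustrative only, the db/osm prerequisite stands for an upstream target that loads OSM data, and psql is assumed to read its connection settings from the PG* environment variables.

```make
db/populated_places_area: db/osm | db   ## hypothetical table derived from loaded OSM data
	psql -c "drop table if exists populated_places_area;"
	psql -c "create table populated_places_area as \
	         select osm_id, name, ST_Area(geom::geography) as area_m2 \
	         from osm_polygons where place in ('city', 'town');"
	touch $@   # marker file so make knows the table has been built

db:
	mkdir -p $@
```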
15:03
And a very logical question that I've heard a lot: why not Python-based? Almost everybody in the GIS community has run into this situation:
15:21
when you try to do something, to produce data that you have never produced before, you look for a guide, and finally you find this guide and it looks pretty nice. But when you load your Jupyter notebook and start to follow it, you run into the situation where
15:43
some function is outdated, or the version of the libraries you have doesn't match what it requires, or the installation needs something like a virtual environment. This is the reason why we try to avoid using a lot of Python and base everything
16:02
on Makefiles. A key advantage is that make allows you to easily handle intermediate states: when you have a very long data processing pipeline and something falls over in the middle, you can restart from the place where it failed.
16:24
You know exactly where and why, and you don't have to start again from the very first step.
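This works because each target is a file on disk: if the file exists and is newer than its prerequisites, make simply skips it. Below is a generic sketch of the pattern with marker files; the loader command and file layout are simplified and hypothetical, and Geocint's real conventions differ in detail.

```make
data/in/planet.osm.pbf: | data/in   ## multi-hour download
	wget -q -O $@ https://planet.openstreetmap.org/pbf/planet-latest.osm.pbf

db/osm: data/in/planet.osm.pbf | db   ## multi-hour load into PostgreSQL
	osm2pgsql --create -d $$PGDATABASE $<
	touch $@   # marker file: the data itself lives in the database

db/osm_roads: db/osm | db   ## fast derived step
	psql -c "create table if not exists osm_roads as select * from planet_osm_line where highway is not null;"
	touch $@

data/in db:
	mkdir -p $@
```

If db/osm_roads fails, you fix it and run make again; the download and the load are not repeated, because their marker files are already in place and up to date.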
16:45
And there are some not very obvious advantages too. When you have a pipeline that runs for one or two days, and in the middle of its run you decide to change something, for example the name of a variable, or you want to try a different approach to building the data,
17:02
you can stop the pipeline right there, all the targets that were already built will stay built, you can switch to a new branch with new code, and you can resume your pipeline from that point,
17:20
not from the start. It's very useful, because at the start of our production pipeline we load OSM data into the Postgres database, which takes a few hours, so we don't want to repeat it. We also have a dashboard,
17:44
which allows you to track pipeline execution, progress, and time statistics. The key tool that we use to manage all this is make_profiler. It's a Python-based
18:02
linter and preprocessor for Makefiles that outputs a network diagram like the one you see here. It's a real diagram from make_profiler; in production ours is much bigger, but this is just for demonstration. The output chart allows you to see what went wrong and to quickly get to the logs:
18:23
you can just click on a target and see its logs with the errors. We also have an option for Slack integration. The integration allows you to send messages about key pipeline execution steps or bug reports directly to a Slack channel,
18:43
so you don't need to check the logs on the server; you can go to your Slack channel and just read them. Here is an example of the messages. Slack integration is an optional feature,
19:01
so you can use a Geocint pipeline without it, or integrate any other messenger. But from our experience I can say that the most useful part of this kind of integration is that you can receive a report with key dataset metrics without going to the database or any management tools. As I mentioned before,
19:21
we produce the Kontur Population dataset for the entire world, and we have a lot of quality checks, and some of these quality checks send messages to our Slack channel. And sometimes it sends us a message that Bender's dream came true and the world population is now zero.
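As a sketch of how such a check can be wired in, a notification can be just one more target at the end of the chain. This example posts to a Slack incoming webhook with curl; the SLACK_WEBHOOK_URL variable and the kontur_population table name are assumptions, and this is not Kontur's actual notification code.

```make
data/notify_population_total: db/kontur_population | data   ## quality-check message
	# Compute a key metric and push it to the Slack channel.
	total=$$(psql -qtAc "select round(sum(population)) from kontur_population"); \
	curl -sf -X POST -H 'Content-type: application/json' \
		--data "{\"text\": \"Kontur Population build finished, world total: $$total\"}" \
		"$$SLACK_WEBHOOK_URL"
	touch $@

data:
	mkdir -p $@
```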
19:45
And the final step: let's say hello world. For this, you just check whether FOSS4G Europe 2024 is true, and yes,
20:00
it's true, so hello world. This is the link and QR code directly to geocint-runner. It's the core part, where you can find the guideline on how to install it. You can also make your own pull request if you want. These are the sources, and that's the final part.
20:27
I brought my memes. I'm ready for your questions. All right, so who has questions about kitten memes?
20:43
More kittens, more geospatial. The sessions after lunch are always a bit tricky. Maybe one important point that I missed: there is an open source, ready-made pipeline,
21:01
the one that was built for MapAction, so you can take it and try out how it works without writing your own pipeline. I have questions, but they are mostly about the data, not so much about the pipeline, if you don't mind.
21:22
Are these datasets, the spatial ones based on H3, publicly available? The Kontur Population dataset at resolution 8 for the entire world is publicly available, and so is the Kontur Boundaries dataset; that one is not on H3, it's the geometry of the boundaries.
21:42
Any other datasets are available on Disaster Ninja; you cannot download them, but you can use them. Okay, folks, another opportunity. And in a few months we are going to launch our commercial version of Disaster Ninja, without a focus on disasters,
22:01
and we will have an additional announcement about that. Okay, let's see, we have time. So another question about H3, if you don't mind: how approachable did you find it? Can you do all your work based only on the H3 API, or did you have to code
22:21
something extra to get that data into your system, or to get data into the H3 grid? It very much depends on the nature of your data, which kind of data you use. For some datasets we just convert a point geometry to an H3 index;
22:44
for some of them we use intersection with the geometry of the hexagons. It really depends on the data and the meaning you have in this data.
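For context, with the h3-pg bindings both paths are a single function call; the function names follow recent h3-pg (v4) releases (older versions use names like h3_geo_to_h3), and the source tables here are hypothetical.

```make
db/facilities_h3: db/facilities | db   ## point source: one cell per point
	psql -c "create table if not exists facilities_h3 as \
	         select h3_lat_lng_to_cell(geom, 8) as h3, count(*) as facility_count \
	         from facilities group by 1;"
	touch $@

db/landuse_h3: db/landuse | db   ## polygon source: the cells covering each polygon
	psql -c "create table if not exists landuse_h3 as \
	         select h3_polygon_to_cells(geom, 8) as h3, landuse_type from landuse;"
	touch $@
```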
23:02
Okay, anybody else? So now a question more about the pipeline. I like make, but there is, let's call it, a learning curve. It's not so much of a learning curve for someone who, for instance, is used to Linux, or someone who started working on Linux
23:24
over 20 years ago like I did. Do you get feedback from users about this particular setup with make? Are your users happy with it, if you don't mind the question? They are happy,
23:44
but as it was published quite recently, they still have some questions; those are just working moments. Make is very useful on Unix-based systems, and
24:02
you will have some pain if you try to do it on Windows, unfortunately. But what I can say about that: I come from Windows originally. I did my thesis on Windows and I used Python libraries to process remote sensing data.
24:20
It was painful, especially for me; that's my experience from the past. So now, after migrating to Linux, I feel much better. I can use anything that you guys write, and process data with it.
24:40
Okay, I think that's a good moment to end on, those words about migrating from Windows to Linux, unless there is something more to say from the audience. Okay, let's leave it here. Thank you, Andrei.