How to deal with a massive geographic database when surrounded by data scientists?
Formal Metadata
Title: How to deal with a massive geographic database when surrounded by data scientists?
Number of Parts: 351
Author: Régis Haubourg
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/69029 (DOI)
Production Year: 2022
Transcript: English (auto-generated)
00:00
Thank you, Giorno. Hi, everybody. I'm really happy to be here and to speak to such a crowded audience about sort of a journey through my last year, learning a new project that was really exciting to me. I ended up dealing with a bunch of data scientist
00:21
teammates, whom I had never encountered before in my life. And we faced the need to build a giant geographic database, handle it, and reproduce it very often. So this is my journey. I hope you will find some learnings
00:41
in it and transmit them to students, or whatever, or yourself. So, I am Régis Haubourg. I was a GIS administrator. Before that, I was an agronomist working on soils, rivers and pollutants. And I worked for some years
01:01
at Oslandia as a project manager and SQL developer. And for one year now, I have been a research engineer at CSTB, the French scientific and technical institute for buildings. I've been involved in QGIS as a funder, then a contributor, an OSGeo charter member,
01:20
and president of the French local chapter for two years. And I also love PostgreSQL. So this is the team of brilliant guys I joined one year ago. They are in three remote locations in France.
01:42
They are all experts on buildings: data analysts, data scientists, all of them DevOps. And I joined a project that had been running for three years. What is the CSTB, this French institute? They write the norms and construction standards.
02:02
They monitor and build models for housing and housing performance. They also do a lot of environmental stuff, like life-cycle assessment of buildings, to know how much carbon is emitted when building a house or retrofitting it, how much water is needed, and all this. And we are starting to work on the biodiversity of buildings.
02:23
That's strange. We are also a service provider, a commercial one, testing construction products: anything, a window, a roof, a tile. We do training for builders to learn how to install windows correctly.
02:43
We help the French government and local authorities deal with their data and their policies. There is a division that does some 3D and BIM stuff. And we are now doing digital strategies. And GIS has been coming to this institute
03:02
for one year, since I arrived, in fact. The project I'm working on, I talked about this morning at 9, so please go and see the whole dedicated talk about it. It's about building a reference data set
03:20
to fight against climate change. The French have all the data on buildings in France. So we build this kind of fun stuff for anyone to just check their house or building and become aware that they have to do something to improve the situation: insulate their houses and all this,
03:41
and then find funding and people to help them do it. Well, the rationale is that one third of carbon emissions are probably linked to buildings. So that's a huge part to address if we want to attain our goals. Most of the carbon is emitted during the building process,
04:04
not when you heat or cool your house. So we have to stop focusing on new, shiny, passive houses and focus on the existing buildings. And this is the path that CSTB took three years ago
04:21
with this project: stop focusing on new things, focus on existing things. And we don't know what the buildings are. So this is where I landed one year ago, in a team that had already finished the work of building this database.
04:41
And I'm a GIS guy, and more of a database admin guy. And I learned a few things. This is my background, maybe like all of you: OSGeo stuff. This is theirs: Pandas DataFrames before anything else, rather than the database.
05:01
They are really killer guys in science, machine learning, statistics. What they do is crazy; I'm really impressed. And I was shy. The database, when I entered the project, was also frightening: 7 terabytes, 23 million buildings,
05:23
of which we make maybe 10 versions each year, so it grows. 41 million street addresses, 70 million houses, all the cadastral parcels, and all the external data sets we put on top of it, like the energy performance diagnoses.
05:42
Well, it's very, very large. So I fell in love with the challenge. Then I started to think, oh my god, me alone with this team. But I couldn't help myself.
06:02
And I jumped into the project. And then I hit a wall. I didn't understand anything of the pipeline. I'm not that much of a Python guy, and it's full Python. I'm used to doing some QGIS plugins and small stuff; they are making huge pipelines with parallel computing
06:25
in web environments. So, lots of stuff to learn. Conda, do you use Conda, Anaconda? Who likes Conda? OK. I come from a developer team where
06:42
we stayed as close as possible to the system packages. And Conda is for Windows users; it's gigantic. It's easy to use, but when it comes to industrial use, it's big. I discovered gigantic YAML files, because the team had fallen in love with YAML three years before.
07:04
Most importantly, they were each developing alone on their own pipeline, with different philosophies and no cross code reviews. And it started to be a problem. Very few unit tests; it depended on the person.
07:21
Some had none; others were very well tested. Continuous integration was just starting. And we had a lot of complex dependencies, with web applications, like you saw, and machine learning prediction pipelines, and a data model changing all the time at each version, which
07:42
is an issue. So I didn't know how to jump into the project. Analyzing the code was too big a task. So I started by just trying to use it in QGIS, use case first. Be yourself the next user of the database.
08:00
Try to just load data, list the tables in the Postgres database, then try to configure the web app to build a new application. And it was a bit hard.
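If you want to play the next user of the database yourself, a first check is to ask the PostGIS catalog what clients will actually see; a minimal sketch:

    -- List the spatial tables as clients see them (PostGIS catalog view).
    -- Untyped geometry columns show up as generic GEOMETRY with SRID 0,
    -- which is what makes QGIS and other clients struggle.
    SELECT f_table_schema, f_table_name, f_geometry_column, type, srid
    FROM geometry_columns
    ORDER BY f_table_schema, f_table_name;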
08:22
First learning: QGIS and 400-column tables are not a love story. Huge columns; data scientists make arrays, arrays, arrays, sometimes multi-dimensional arrays. And they put all this in Postgres tables,
08:42
because it's easy: they are using data frames. But as users, we just can't use it. So I had multi-geometries and all this. And it's all denormalized data sets: the building table was carrying the parcel geometries and the list of walls of each building and each house, with a lot of redundancy.
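A hypothetical sketch of what normalizing that looks like, with made-up building and wall tables rather than the project's real schema:

    -- Requires PostGIS for the typed geometry column.
    CREATE TABLE building (
        id        bigint PRIMARY KEY,
        city_code text,
        geom      geometry(MultiPolygon, 4326)
    );

    -- Instead of arrays of wall attributes stored inside the building row,
    -- one row per wall, joined back only when needed:
    CREATE TABLE wall (
        id          bigint PRIMARY KEY,
        building_id bigint NOT NULL REFERENCES building(id),
        height      numeric
    );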
09:05
It's almost impossible to use those data as is, in QGIS at least, which is the main GIS used in France. From the database administrator's point of view,
09:22
I started to monitor things, because there was nothing before. There was no server monitoring, you know, the kind that alerts you before your disk is full, the database server crashes and you lose your data. So there was no alerting, and no monitoring of the frequent queries and all this.
09:40
So there were a lot of data copies and missing indices. And strangely, that made me appear like a savior. Just try it at home: just add an index and it's a thousand times faster, right? People really like this.
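The savior move is usually no more than this kind of statement; a sketch on the hypothetical building table above:

    -- A GiST index, the classic fix for slow spatial filters and joins:
    CREATE INDEX building_geom_idx ON building USING gist (geom);

    -- And a plain B-tree index for frequent non-spatial lookups:
    CREATE INDEX building_city_idx ON building (city_code);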
10:03
Next, the database listing. If you want to do GIS, teach people, when they create views or tables, to type the geometry: it's a MultiPolygon, with EPSG:4326 as the SRID, for example. If you don't do this, all the clients struggle to list the data.
10:20
They have to read the entire data set to find out what the SRID is and what the geometry type is, and that kills not only the clients but also the database. So type your data.
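Typing a geometry column, and keeping views typed, looks roughly like this (hypothetical names again):

    -- Declare the geometry type and SRID so clients read them from the
    -- catalog instead of scanning the whole table:
    ALTER TABLE building
        ALTER COLUMN geom TYPE geometry(MultiPolygon, 4326);

    -- In views, cast the expression, otherwise the type is lost:
    CREATE VIEW building_outline AS
    SELECT id,
           ST_Boundary(geom)::geometry(MultiLineString, 4326) AS geom
    FROM building;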
10:41
Then, to monitor, I slowly started with a nice tool, pg_activity, which I really recommend: you see live all the queries running. Then you have other tools, like the pg_stat_statements extension, which records the frequent queries so you can analyze the history. And on top of it, you can use a lot of different tools, like temBoard, PoWA or pgBadger,
11:00
to analyze, over the long run, the bottlenecks of the people querying the database. So you really catch the missing indexes, the wrong models, the normalization errors.
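For example, with pg_stat_statements loaded, the heaviest queries are one SELECT away (column names as in PostgreSQL 13 and later):

    -- pg_stat_statements must be listed in shared_preload_libraries, then:
    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

    -- The ten queries costing the most time overall:
    SELECT calls,
           round(total_exec_time) AS total_ms,
           round(mean_exec_time)  AS mean_ms,
           query
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;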
11:20
But doing this as a database admin, doing it for the others, is a bottleneck too. So you have to teach the others to do it. Fortunately, data scientists learn fast; that's really cool. So I taught them to EXPLAIN every query they were running.
11:41
Some of their queries run for two weeks on a massive machine. So you should use this tool from Dalibo, explain.dalibo.com: you just paste your explain plan and you get all the costs and times, with explanations, and you can share it on the internet and discuss it with people.
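The habit to teach is simple; a sketch on the hypothetical tables from earlier, whose plan you can paste into explain.dalibo.com:

    -- ANALYZE actually executes the query; BUFFERS adds I/O details.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT b.id, count(*) AS wall_count
    FROM building b
    JOIN wall w ON w.building_id = b.id
    GROUP BY b.id;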
12:01
And you learn how the internals of the database work, and really quickly you see what you can improve. So just do it. Another thing surprised me: they love enums, enumerations, because in Python, that's how you do it. You have value lists for tables.
12:21
Let's say the value list of the insulation types or window types: you have enums. And they were not versioned, so all the enums were shared by all the versions of our data. So I had to say, sorry, we are going back to pure tables, relation tables,
12:41
what we call domains in the relational world, or lists of values. That's not fancy, it's just a table, but it works: you can export it in your package, you can export it to anyone, and it works. Enums are not interoperable between the different worlds.
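The move away from enums looks roughly like this; hypothetical names, not the project's actual schema:

    -- Instead of: CREATE TYPE insulation_kind AS ENUM ('internal', 'external');
    -- use a plain reference table, versionable and exportable like any data:
    CREATE TABLE insulation_kind (
        code  text PRIMARY KEY,
        label text NOT NULL
    );
    INSERT INTO insulation_kind VALUES
        ('internal', 'Internal insulation'),
        ('external', 'External insulation');

    -- A foreign key gives the same integrity guarantee as the enum did:
    ALTER TABLE wall
        ADD COLUMN insulation text REFERENCES insulation_kind(code);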
13:05
With hundreds of column properties, it's just not possible. So we started this. It was new to them, which surprised me: data scientists are the best of the best, and they had never done this at school.
13:21
So if you're a teacher, please teach it. Buildings are not simple. Oh, there's a missing slide; it's not an issue. Buildings: try to define what a building is. They hadn't done it before. So we took some time to find all the possible cases and to build a realistic model for all the use cases,
13:42
but still a usable one. After this, there were the coding practices. As I said, they were different between pipelines. And code review, they didn't want to go there. Before joining the project, I asked them if they wanted,
14:02
if they were ready to do code review, to submit their code to the others. They said, no, we don't want it, it's too slow, we are in a hurry. And in fact, that was not the real reason. They were afraid, because it's hard to be humble and show all the ugly things we do;
14:24
we all do this, we write ugly code when we are alone. And watching the practices in the QGIS project for years had taught me that going through code review, being humble, is the only key to good,
14:41
maintainable code. And it goes faster in the end. That's a learning. Conda, maybe that's it, that's the hell of Python. But yeah, it's better now.
15:02
We have Mamba; at least it's faster. So Mamba will save a lot of energy. It's all about climate change, and we are building huge CIs consuming so much processor time. So let's try to be, I don't know,
15:21
performant, and not use that much electricity. Mamba is cool, it's a C++ implementation and it goes maybe a thousand times faster than before, but it's still not that fast, I think. So when you build images on the fly, every time you commit something and launch the CI, it counts.
15:42
Another point: meta-generation hell. Who has a clue what that means? No? Have you ever, once in your life, thought: oh, I have been copy-pasting this code 10 times, let's build a library
16:01
to generate the SQL for you. It's the start of hell, really. And in my team, they love meta-generation of SQL for all kinds of things. So they reinvented SQL, but in YAML. And that YAML file is so big that GitLab refuses
16:25
to display it. It's a configuration file, but maybe we have not only configuration but also data in this YAML file. And in fact, all the table names, the columns, the column types, the metadata of the columns, they are all in this YAML file.
16:42
So when you merge two branches, what could go wrong? So let's split it into separate parts, separate files; maybe the Linux philosophy, where everything is a file and a small file is dedicated to one thing, is the right philosophy.
17:01
And they didn't want to start, because it looked really nice from a theoretical point of view to have one YAML file. No, it's not a good idea. And as it was a 30,000-line, yeah, 30,000-line config file,
17:21
they couldn't edit it by hand. So they made their own Jupyter notebook to edit the file on the fly using the Postgres table definitions. Hell. Same thing, in the end, for the Mapbox / MapLibre configuration. There was a JSON file.
17:41
I said, hey, we have it all in the YAML file, let's make a wrapper to generate all the styles. But then you have to give up all the complexity, all the powerful features. So in the end, you are stuck going back to manual editing.
18:04
This is the commit message of my colleague: "I did the best I could, sorry for the broken code." Meta-generation... the human mind is not made for this. It works for the person who coded it the first time; maintainers can't come in.
18:22
So let's remove layers. That's what happened to me: oh, I'd like to use a window function to generate that view. OK, it's not in your YAML wrapper. What can I do? Crap. I don't just have to learn Postgres; I also have to learn the Python code
18:42
and the ORM code and the YAML code that produce the query in the end. And it's probably three to four times longer and harder than doing the addition by hand in raw SQL sometimes. So do templating only if you really need it
19:04
and you have the skills to maintain it and to make a really nice library out of it. And don't forget that someone who is maybe not a coder will have to change the content, the name of a column, for example. He won't go and git clone, compile, conda and all this.
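To illustrate, the kind of one-off view that is easy in plain SQL but painful through a YAML generator; a sketch on the same hypothetical tables:

    -- Rank buildings by wall count within each city: one window function
    -- on top of an aggregate, hard to express through a generator.
    CREATE VIEW building_wall_rank AS
    SELECT b.id,
           b.city_code,
           count(w.id) AS wall_count,
           rank() OVER (PARTITION BY b.city_code
                        ORDER BY count(w.id) DESC) AS rank_in_city
    FROM building b
    LEFT JOIN wall w ON w.building_id = b.id
    GROUP BY b.id, b.city_code;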
19:21
There is a lot of cool stuff like this, but what could go wrong when you run it on the same server as PostgreSQL? The OOM killer!
19:41
We started to split Postgres from the libraries and clients. Then there is a lot on the cultural side. It's all about teaching: how powerful SQL is, how far we can go with joins, good indices, EXPLAIN and all this.
20:02
I already told you about it. The code review process, I told you; I'm going fast. Go lean, try to learn to work together, and don't multitask too much. That's a change for researchers. And for techies, I will finish with some numbers.
20:24
One thing: I am slowly getting older, and the caveat was that the junior developers just thought I was telling the truth with a big T, even when I was wrong. That's a learning for me.
20:40
So use SQL to store data and to process and handle as much data as possible, and then, at the very end, you finish in Pandas, Julia, whatever it is.
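In practice that means pushing the heavy lifting into the database and fetching only small results; a hypothetical sketch:

    -- Aggregate millions of rows server-side; the client receives one
    -- small row per city to finish in Pandas, Julia, or whatever:
    SELECT city_code,
           count(*) AS building_count,
           avg(ST_Area(geom::geography)) AS avg_footprint_m2
    FROM building
    GROUP BY city_code;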
21:01
Some unpopular opinions: in the end, you will have a traditional relational database, and it will be Postgres and PostGIS. And if you really need it, you can then denormalize, do NoSQL downstream, produce massive GeoParquet files on the data lake, data warehouses if you like.
21:21
But at the start, you have to query the data, and the others are not so easy to query. Try sobriety: do not burn electricity into thin air, please. And if you have a shiny new cool tool that is one more layer, it adds a combinatorial complexity that you should avoid.
21:43
I'll finish with a massive heart for my team. I've been criticizing them for the whole talk; I'm sorry for that. They surprised me: each time I told them something, two days later it was done and learned. They learn so fast. So what I liked in this project
22:01
is that we mixed two really different cultures. We all went out of our own comfort zones, and this culture mix made us progress. So that's the story of FOSS4G, I think. Thank you.