The Scientific and Technical Center for Building (CSTB) built the first French database of buildings and houses to address the climate change challenge, supporting knowledge and decision making for massive retrofit programs. The pipeline factory intersects massive datasets (21 million buildings, more than 400 descriptors) and keeps adding new predictions and external datasets all the time. It allows analyses and predictions to be run for all climate change related indicators, such as the relation between housing prices and energy performance, heat wave impact, solar potential, and so on. While the first versions were a direct image of the classical data scientist's approach (a massive dataframe driven by massive YAML config files and cryptic meta-templated scripts), ease of use and access performance soon became limiting factors. This is a major concern, since this dataset will be a long-term foundation for derived information systems. Between the brute-force approach of scaling resources up and the old-fashioned "data diet" of normalization and optimization, the truth is not easy to find. With a touch of cartoonish humor, this talk will explore the benefits of normalizing back hugely redundant geographic datasets and exposing public interfaces (a public SQL model, APIs, vector tiles, OGC APIs), so that end users can analyze the dataset efficiently and the data management team can rely on the added stability of those good old database constraints.
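To make the "normalizing back" idea concrete, here is a minimal sketch, not CSTB's actual schema: the table and column names are invented, and SQLite (through Python's standard library) is used only to keep the example self-contained. It shows a wide, redundant building table split into small reference tables plus foreign keys, so the database itself, rather than a pipeline script, rejects inconsistent rows.

```python
import sqlite3

# Toy example: names and values are illustrative, not the real CSTB model.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce FK constraints in SQLite

# Denormalized "one big dataframe" shape: every row repeats free-text
# descriptors for millions of buildings, and nothing enforces their values.
conn.execute("""
    CREATE TABLE building_wide (
        building_id INTEGER PRIMARY KEY,
        municipality_name TEXT,
        energy_label TEXT,          -- 'A'..'G', but only by convention
        roof_area_m2 REAL
    )
""")

# Normalized shape: redundant descriptors move to reference tables,
# and constraints make invalid rows impossible instead of merely unlikely.
conn.executescript("""
    CREATE TABLE energy_label (
        code TEXT PRIMARY KEY CHECK (code IN ('A','B','C','D','E','F','G'))
    );
    CREATE TABLE municipality (
        insee_code TEXT PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE building (
        building_id INTEGER PRIMARY KEY,
        insee_code TEXT NOT NULL REFERENCES municipality(insee_code),
        energy_label TEXT REFERENCES energy_label(code),
        roof_area_m2 REAL CHECK (roof_area_m2 >= 0)
    );
""")

conn.executemany("INSERT INTO energy_label VALUES (?)",
                 [(c,) for c in "ABCDEFG"])
conn.execute("INSERT INTO municipality VALUES ('75056', 'Paris')")
conn.execute("INSERT INTO building VALUES (1, '75056', 'C', 42.0)")

# The constraint, not a cleanup script, rejects the bad row.
try:
    conn.execute("INSERT INTO building VALUES (2, '75056', 'Z', 10.0)")
except sqlite3.IntegrityError as err:
    print("rejected by the database:", err)
```

The same principle presumably extends to the geographic layers discussed in the talk: the public SQL model exposes the normalized tables to end users and to the API, vector-tile, and OGC API layers, while the constraints give the data management team a stable contract as new predictions and external datasets keep arriving.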