
Crazy data: Using PostGIS to fix errors and handle difficult datasets

Transcript
Hello everybody, good morning. This is "Crazy data: using PostGIS to fix errors and handle difficult datasets". My name is Daniel Miranda, I'm with the Brazilian Federal Police as a forensics examiner, and I'm working with GIS right now. I'd like to talk to you about what crazy data is in our context, the context of the Federal Police, and the bulk of this presentation is SQL recipes for fixing stuff. Actually, I'd like to apologize: I left three items from the abstract out, because I couldn't fit them into twenty minutes, so sorry about that. I'll also talk about how crazy data comes into existence, from our experience, and why it is important for us. And there is a treat at the end: we have a few bounties, but they're going to be awarded at the code sprint, not now.
So what is crazy data? This is our very objective definition of something that's absolutely subjective: it lacks metadata, it contains too many or too grave errors, it is too big, it is presented in an awkward format, or it has more than one source. For example, the same river in the Amazon may have been mapped by more than one institution, and when you get data from two or three different sources you're bound to have inconsistencies, which makes it useless for us. Some of these issues can be approached with the help of PostGIS, and that's what the quick recipes are about.
The recipes are what our team used while loading 850+ shapefiles, DXFs and the like into the database, building a lot of views of that data and publishing it on our internal network. We do this because forensic experts, especially in the environmental area, have to have a good idea of the place they're going to examine. This first recipe is my favorite; I actually saw a similar recipe on a PostGIS users website, but I think this one is simpler. Here's how it goes: you make the polygon just a little bit bigger, then you make it a little bit smaller, and at the end it's the same size again. Why do this? It's an algorithm borrowed from image processing; depending on the order of the operations it's called closing or opening. Has any one of you heard of that? OK. So what happens is: if you have a spike pointing to the inside of the polygon, it will go away when you expand the polygon. But what about a spike pointing to the outside? You shrink the polygon past the size you started with, then make it bigger again, and all the spikes go away. It just works. And what is that 'join=mitre' magic? We'll see.
join=mitre is a PostGIS buffer option. With join=mitre, shown on the left, the buffer creates a sharp corner with just one vertex, so the sides of the polygon stay straight and you get a sharp corner. The one on the right is the regular buffer. What happens if you do all this shrinking and growing with the regular buffer? The colors may look different from my screen, but the light green is what you get if you use regular buffers; the dark green is what you get when you use join=mitre. It preserves all the vertices — they stay in the same place — and all the spikes go away. It's pretty simple. If you actually do it on your database, it's going to look something like this, with a lot more testing and so on.
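The grow/shrink/grow spike removal described above might be sketched like this. The table name `parcels`, the column `geom`, the buffer distances and the mitre limit are all illustrative; the distances are in the layer's units and must be tuned so that 2× the distance is wider than the spikes you want to remove.

```sql
-- Morphological closing/opening with mitred joins: grow a little,
-- shrink past the original size, grow back. Spikes narrower than
-- twice the buffer distance disappear, and join=mitre keeps the
-- surviving vertices in their original positions.
UPDATE parcels
SET geom = ST_Buffer(
             ST_Buffer(
               ST_Buffer(geom, 0.001, 'join=mitre mitre_limit=5.0'),
               -0.002, 'join=mitre mitre_limit=5.0'),
             0.001, 'join=mitre mitre_limit=5.0');
```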
The next recipe is about invalid geometries. Up on the left we have a polygon that's actually invalid, because the vertices were supposed to wind counter-clockwise and the sides aren't supposed to cross. So what happens if you run ST_Buffer on it? Does everybody here know the PostGIS basics? OK. If you run it through ST_Buffer, that's what you get. For me that was completely unexpected; I knew something would come out of it, but that's what came out.
OK, so what if you run it through ST_MakeValid? This function is available since PostGIS 2.0, and that's what it will do. That doesn't mean it has fixed the polygon, because you're not sure that's what the user meant: maybe the user meant a square, a tilted square, but put the vertices in the wrong order and it came out like this. But this is what the computer can do automatically; it's a sort of best guess. It creates a multipolygon — a collection, actually — from the polygon, and then the buffer comes out right. So that's a very simple recipe: ST_MakeValid doesn't guarantee your polygon is what your user meant, but it will be valid.
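A minimal sketch of that repair step (the `parcels` table and `geom` column are illustrative). Because ST_MakeValid can return a MultiPolygon or a GeometryCollection, ST_CollectionExtract with type 3 pulls out just the polygonal parts:

```sql
-- Repair invalid geometries in place. ST_MakeValid keeps every input
-- vertex; ST_CollectionExtract(..., 3) keeps only polygons from the
-- possibly mixed result.
UPDATE parcels
SET geom = ST_CollectionExtract(ST_MakeValid(geom), 3)
WHERE NOT ST_IsValid(geom);
```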
But what about diagnosing validity? There's another very simple recipe that will have your database tell you where the mistake is. If you run ST_IsValidDetail, you don't get a single value, you get a composite. If you extract the location from it, you get a point: the specific spot where you have, say, a line crossing. And if you take out the other member, which is the reason why the geometry is invalid, in this case it will tell you there is a ring self-intersection. OK, next recipe.
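The diagnosis query could look like this (again, `parcels` and `geom` are illustrative names; `ST_IsValidDetail` returns a composite with `valid`, `reason` and `location` fields):

```sql
-- Ask the database where and why each geometry is invalid.
SELECT id,
       (d).reason,              -- e.g. 'Ring Self-intersection'
       ST_AsText((d).location)  -- a POINT at the offending spot
FROM (SELECT id, ST_IsValidDetail(geom) AS d FROM parcels) sub
WHERE NOT (d).valid;
```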
Recipe three: holes in a union. If you have, for example, several parcels and you make a union of them to build a bigger polygon, sometimes it doesn't come out quite right and has holes inside, because the sides of the inner polygons don't match exactly. So what do you do? This is really simple, and it actually speeds up the computation, because in between the buffers the union gets a lot faster: you buffer each one of the parcels, then you do ST_Union, then you buffer them back. This has the same kind of caveat as the first recipe — we grow everything, then shrink everything back — and the native buffer does not preserve boundaries, so, as we saw before, join=mitre is better.
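A sketch of the buffered union, under the same assumptions as before (illustrative table and column names; buffer distance in the layer's units, tuned to swallow the slivers between parcels):

```sql
-- Grow each parcel slightly, union, then shrink the result back.
-- The intermediate expansion closes the slivers between parcels;
-- join=mitre keeps the outer boundary vertices in place.
SELECT ST_Buffer(
         ST_Union(ST_Buffer(geom, 0.001, 'join=mitre')),
         -0.001, 'join=mitre') AS merged
FROM parcels;
```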
What if it's not a union — you have a raw polygon and you want to fill in the holes? There is a simple way to do it. If it's a polygon it works; it doesn't work on a multipolygon. You extract the exterior ring and make a polygon out of it, so all the interior rings, which are the holes, are thrown away, and you get a solid polygon. If it's a multipolygon it gets hairy, because you have to have a query inside your query: you dump the geometry, breaking it apart, and you apply the recipe to each polygon. That's the recipe. I don't know yet where this presentation will be hosted, but I will try to publish it, either on the conference website or somewhere you can reach it.
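Both variants of the hole-filling recipe might look like this (illustrative `parcels`/`geom` names and `id` filters; `ST_Dump` breaks a multipolygon into its parts so each exterior ring can be rebuilt separately):

```sql
-- Single polygon: rebuild from the exterior ring, discarding holes.
SELECT ST_MakePolygon(ST_ExteriorRing(geom))
FROM parcels WHERE id = 1;

-- MultiPolygon: dump the parts, fill each one, reassemble.
SELECT ST_Collect(ST_MakePolygon(ST_ExteriorRing((dp).geom)))
FROM (SELECT ST_Dump(geom) AS dp FROM parcels WHERE id = 2) sub;
```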
If you actually write it in your code, that's roughly what it should look like. The other recipe I'd like to speak about is speeding up large data. Suppose you have a polygon with a lot of vertices and you want to render it. PostgreSQL 9.3 has a new feature called materialized views. What is that? You create a query, a view, and PostgreSQL will internally create another table, as if it were CREATE TABLE AS SELECT whatever you want. But it's not a simple table: it's a materialized view, so you can create indexes on it, and you can refresh it. It won't update automatically, but the query is not re-run every time you query the materialized table. That's really good for our high-density maps at overview scales. So you create a materialized view of a very heavy map using ST_Simplify. You have to tune the simplification tolerance: if you zoom in you'll see where it breaks down, but at overview scales you get a lot of nicely closed lines, and you have done the simplification just once, on the server. It works well for data that is more or less static, and it renders much faster. You could improve on ST_Simplify by using the variant that preserves topology, but since you're going to render it at a different scale, it doesn't matter much.
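The materialized-view recipe could be sketched like this (the view name, table, column and tolerance are illustrative; the tolerance is in the layer's units and has to be tuned for the rendering scale):

```sql
-- Pre-simplify a heavy layer once, on the server.
CREATE MATERIALIZED VIEW parcels_overview AS
SELECT id, ST_Simplify(geom, 50.0) AS geom
FROM parcels;

-- A materialized view can be indexed like a table.
CREATE INDEX parcels_overview_gix
  ON parcels_overview USING gist (geom);

-- It does not update automatically; re-run when the base data changes.
REFRESH MATERIALIZED VIEW parcels_overview;
```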
OK, so this is the second root of all evil — the first root of all evil is lack of metadata, in my experience at work, but this is the second one: people don't check the data before committing it to the database. The leanest possible check I could think of is this one: you add a CHECK constraint. PostgreSQL has this structure: you add CHECK (ST_IsValid(geom)) on the geometry column, and if it fails, the record won't be committed to the table. We have a lot of other algorithms for checking data from our production; they are too complicated to show here, but they use triggers, so they actually modify the data as you put it in. For example, if somebody modifies just the center point of a property, a trigger will edit the polygon and move the point to the new center the person entered. We do funky stuff with the data, but it's too custom and complicated to show here; that's the idea. This mostly doesn't concern PostGIS, it concerns PostgreSQL itself. This is the holy grail: if everybody had checks on the data, it would greatly reduce the amount of errors that creep into the databases.
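The leanest check mentioned above, as a sketch (illustrative table, constraint and column names):

```sql
-- Reject invalid geometries at commit time: any INSERT or UPDATE
-- that would store an invalid geometry fails the constraint.
ALTER TABLE parcels
  ADD CONSTRAINT parcels_geom_valid CHECK (ST_IsValid(geom));
```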
OK, so we've got all this crazy data and we have these recipes to deal with it, but we would like it to not exist in the first place. Actually, we would like the data to exist, but to be sane, not crazy. Large datasets: why are they crazy? Because they're big; that's one of our definitions — when it's too big for us, it's crazy. You can't expect to eliminate that, but you can mitigate the problems: you can reduce, simplify, deduplicate and so on. It's always a problem, though. Lack of validation: what does it generate? Lots of topological errors, and also bad georeferencing. For us that is mostly legacy data; public institutions in Brazil now have very strict regulations on how to produce data, so this doesn't happen much with new data, but when you get legacy data, it's a problem. What does reprojection have to do with crazy data? Reprojection just translates coordinates, but it's a problem when you have topological relationships. I should have a drawing here, but I'll just mimic it: say you have two adjacent square parcels sharing an edge, so the map is topologically correct, and then you add an extra vertex on the shared edge of one parcel that doesn't have a counterpart in the other parcel. When you reproject, that point in the middle of the edge will no longer lie exactly on the other parcel's edge, so the reprojection breaks the shared boundary. The same goes for geometric operations: when people simplify their polygons, the map looks OK — as you can see here, the map was pretty well closed, all the borders were matching — but when I do a simple ST_Simplify, everything opens up, because this is not a topological representation. PostGIS has topology support, but I don't have time to show it to you right now; if you don't have topology in mind, that's what will happen if you just simplify. Then there's diversity of sources: if you have something like 850 data sources, they come in very different formats with different conventions, and that's very difficult for us to handle, but it doesn't concern PostGIS, so I don't have a recipe for that. Legacy databases have awkward data formats like DXF, and different DXFs use layers differently: river data is on one layer in one DXF and on another layer in another one, so you can't just put them together. And imprecise definitions: for example — we actually have this in our legacy production data — a form asks "what are the coordinates of your forensic report?" The person doesn't know if it means the coordinates of the place that was examined or the coordinates of their desk, so sometimes people put the coordinates of their desk, sometimes the coordinates of the unit they went out from — very crazy stuff. This had to be made more precise. OK, why does this matter to us? Our production data is shared on this web portal; it's internal, it's not available on the internet. The dots over there are forensic reports georeferenced on the map. There is support data and there is intelligence data. Support data is, for example, the deforestation imagery we have from the Amazon and from the less populated areas of Brazil, where most of the deforestation, pollution, illegal mining and that kind of thing happens. We keep it for reference: it's old data, and because we have it we can measure the amount of damage that was done. There was an older image; we fly a UAV on top of the area, we measure again, and then we have a reference and can see how much was damaged. That's support data, and we have a lot of it to process. The real nitty-gritty is the intelligence data, because the forensic expert has to know a lot about the place they're going to. We get the data from several sources — unofficial roads in the state of Pará, mining licenses and so on; it's 850 different sources and more than 950 views of them. In this specific map we're showing the mining licenses, the indigenous reservations and the environmental preservation areas, and it's a challenge to have all that together, because for every source you have a different way to connect and assimilate it. So we ran all those recipes on top of the data to make it sane, and we publish it like this.
OK, so what about the treat? The last time I went to a FOSS4G was Denver 2011, and I had been putting some bug reports into the PostGIS tracker. Whoever got to those bug reports and solved them — unaware that a souvenir from Brazil was coming — so let's go. This is the shirt; you can see I'm wearing one of these, and he got one last time too. That's for Frank, that's for a contributor from OSM Japan, and that's for the former president of OSGeo Portugal. These guys get the souvenirs — some of the reports are old, but they did get the souvenirs. If you're interested, it's mostly PostGIS stuff; I haven't gotten around to filing all the feature requests yet, but if you are, tune into that Twitter channel. Thank you.
[Audience] Hi, I have a question about how you handle what might be called version control. Do you have naming conventions for all of your intermediate steps? Do you keep the old, bad data? [Miranda] I had never thought of it that way, but what I do is keep all the steps, numbered 0, 1, 2, 3, 4, 5, and I don't use version control because the data is spread across servers and such; you just know that the highest number is the latest version. [Audience] Are these all tables in PostGIS? [Miranda] No, actually my scripts are Python — the connectors are all Python — and we run these recipes on the database. I have a staging database where I load everything and run my processing, and then I upload just the difference to the main production database. Thank you.

Metadata

Formal Metadata

Title Crazy data: Using PostGIS to fix errors and handle difficult datasets
Series Title FOSS4G 2014 Portland
Author Miranda, Daniel
License CC Attribution 3.0 Germany:
You may use, modify and reproduce the work or its content in unchanged or changed form for any legal purpose, and distribute it and make it publicly accessible, provided that you credit the author/rights holder in the manner specified by them.
DOI 10.5446/31600
Publisher FOSS4G, Open Source Geospatial Foundation (OSGeo)
Publication Year 2014
Language English
Producer FOSS4G
Open Source Geospatial Foundation (OSGeo)
Production Year 2014
Production Place Portland, Oregon, United States of America

Content Metadata

Subject Area Computer Science
Abstract Inteligeo is a system that stores a lot of information used by the Brazilian Federal Police Forensics to fight crime, initially in the environmental arena with a later expansion to other types of crime. During the construction of the database a lot of problems appeared for which PostGIS was the key to the solution. This presentation describes problems encountered by the team while loading 850+ shapefiles into the database, linking with external databases and building 950+ views of the data. Although the content of the recipes is very technical, the general concepts will be explained in an accessible language and correlated to real world cases.
Topics:
* Definition of crazy data in our context
* Quick recipes
- Spike removal
- Invalid geometry detection and fixing
- Filling holes
- Raster image footprints
- Hammering data into correct topologies
- Speeding data visualization with ST_Simplify and PostgreSQL 9.3's materialized views
- Rough georeferencing using an auxiliary table
- Creating constraints
* How crazy data is generated and our experience in handling each case
- Large datasets
- Lack of validation
- Reprojection
- Geometric operations
- Topological errors
- Imprecise definitions
- Legacy databases
- Bad georeferencing
We will also discuss why handling crazy data is important for the Brazilian Federal Police, our efforts in cleaning up data at the source and the implications of geographical data in general for fighting crime.
Keywords PostGIS
PostgreSQL
invalid geometries
spike removal
simplification
topology
