Taming Rich GML With stETL, A Lightweight Python Framework For Geospatial ETL
Formal Metadata
Number of Parts | 95
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers | 10.5446/15578 (DOI)
Production Place | Nottingham
FOSS4G Nottingham 2013, 77 / 95
Transcript: English (auto-generated)
00:00
Oh yeah, taming rich GML. I thought I would be stealing from the rich GML and giving it to the poor; that could be one interpretation. Okay, but I'll be telling you something about Stetl, which is a lightweight Python framework for geospatial ETL. A little bit about myself: I'm an independent open-source geospatial professional, and
00:23
in daily life I'm also the secretary of the OSGeo local chapter in the Netherlands, and a member of the Dutch Open Geo Group, a cooperative of independent professionals providing open-source geospatial support. And this is some of the things I like to do
00:44
in my spare time: playing with mobile and GPS. But of course you always start any project because you want to solve a problem. In this case, I guess we have a problem, and the problem is the rich GML problem, and
01:06
we say rich GML, and it's a term I coined together with Markus Schneider, the lead developer of deegree, because people always talked about complex GML, and that sounded a bit negative, so: rich GML. So probably you have guessed what rich GML is about.
01:23
Rich GML: a complex mess, you could say. So think of application schemas; of course they are designed very neatly in tools like Enterprise Architect, and then with the push of a button some schema is generated.
01:42
So probably you are aware of several of these schemas, and mostly they deal, for instance, with INSPIRE. You probably know the annex schemas, but also many of the Dutch or other national data sets in several countries use application schemas, if you're lucky, or some other form
02:03
of complex XML. So in the Netherlands we have national data sets, in Germany there are national data sets, and I just learned that the UK OS MasterMap is also a form of XML, GML, and apparently
02:20
more complex than I thought. So to give an impression of what I'm talking about: this is Dutch addresses, and this isn't even an application schema; this is what I would call semi-GML. It's XML with some GML namespaces and lots of overhead, and what you see also is a sort of arbitrary
02:42
XML: you could have multiple occurrences of the same element, nested elements, implicit XLinks. And if you look at INSPIRE, for instance, this is part of INSPIRE. I won't
03:02
explain every element here, but this is only the street name of an INSPIRE address. The INSPIRE address model is one of the more complex models. I mean, an address would be like a street, a number, and a place, but somewhere over here is the actual street
03:22
name, and the rest is all overhead, I could say; well, it's part of the model. But we want to do something useful with this, let's say make a map or make a geocoder of the addresses, so we have to deal with complex model transformations. And not only
03:47
are the models complex, but there are huge files: we're talking about gigabytes of GML files. So this is part of the Dutch address data; when you download it you get all these XML files. That means there are millions of objects, and maybe tens of millions
04:09
of elements. So to transform this, to do something useful like putting it in a database and making a map, we need spatial ETL. So what are the options? How can we do this?
04:29
One approach is of course to write a program for each dataset and try to do that. In some cases I've seen that working; maybe it works. But of course if we look at the open
04:42
source geospatial world, we have several high-level tools. By high-level I mean tools with a GUI where you can sort of lay out the transformation. So GeoKettle may be known to some of you, Talend Geospatial, and this week I also learned about HALE.
05:01
I knew the project from a couple of years ago and it was a little bit shaky, but it seems to be very much improved. So if you're sort of on a search, also try these; I've also tried these. But I'm a sort of old Unix command-line hacker, and I like to stay close to the iron,
05:24
and then these are some of my favorite tools. So let's say if you have to transform a shapefile to PostGIS, I would use, let's say, ogr2ogr, or maybe shp2pgsql. But each of
05:41
these tools... XSLT, some of you are familiar with XSL, so I won't have to explain; that's to transform XML to another XML schema or anything else. And PostGIS. But the problem is that each of these tools is very powerful but cannot do the whole thing. If you have random XML you
06:01
cannot just use ogr2ogr. That's why I said that you need multiple transformation steps, and this also came out of some years of dealing with this. So the question is how to combine these individual, very powerful tools.
06:22
Actually, Frank Warmerdam will be talking somewhere in the next room just now. And this came out of earlier research in some of the INSPIRE projects I did in the ESDIN EuroGeographics context, and several people are even here in the room, like Frank
06:43
Anser. So what we did there was this multi-step approach: let's say we took cadastral data that was exported into a shapefile or MapInfo, we used ogr2ogr to produce a simple feature GML file, and a simple feature GML file could be translated via XSLT,
07:02
and then we could generate INSPIRE Annex I GML, and then we used FSLoader, which is part of the deegree toolset, to load it into, let's say, an INSPIRE database. But that was sort of ad hoc with a little bit of scripting; it was a bit hacky and it didn't scale up. So from that I thought:
07:26
how to combine these tools? And the answer is basically: add Python to the equation. I had done a lot with shell scripting, but then I said, well, Python is ideal. And Python makes a lot of sense in the geospatial world because it integrates with all of the existing libraries that are
07:45
there. So this is really what Stetl is about, in the sense that it combines the basic tools. And the abbreviation, so now it's written like this; I think in the abstract
08:03
it's still with capitals, but it's about simple, streaming, spatial and speedy ETL; that's what it tries to stand for. So it's basically from barrels and buckets of GML, for instance loading into PostGIS and then using QGIS to make beautiful maps. Because I should show
08:24
a map, it's a geospatial conference, or a geocoder. But Stetl is not just about loading GML into PostGIS; that's one of the scenarios. So I should show a map here. This guy is
08:41
amazing, he's also here in the map contest, Jan-Willem van Aalst. He uses some of that tooling to produce topographic maps of cities in the Netherlands, combining topographic and address data, building data, and with QGIS of course. So what are the Stetl concepts?
09:05
Actually it's quite simple. If you have multiple transformation steps, you go from a source to a destination, a target. So it's set up like this: you need some input and then several filters to process the data. But this is still quite abstract, so for instance an input could be
09:28
a GML file and the output PostGIS, and then it would go through one or more filters and something to produce output. So this is one sort of trivial example:
09:41
you could have some kind of GML reader module, and then it would send its output to an ogr2ogr output module, and that would output to PostGIS. So let's take this INSPIRE model transform
10:00
I showed earlier. For instance, the data could also already be in a PostGIS database. ogr2ogr is just a module, a command line which is integrated, which takes out the data and produces simple features as a continuous
10:24
XML stream, and then XSLT would be used to create complex features, so that would produce, for instance, GML files. So it's not just about reading stuff into PostGIS. And I will get more specific; it's still a little bit abstract. So instead of, and that's the,
10:46
it's a little bit like Lego: you can just connect anything to anything, as long as the inputs and the outputs are compatible, of course. So this writer to a GML file could, for instance, be replaced by a deegree writer, which is a sort of specific module which
11:04
writes into, in this case, a deegree blob store. Or there's also an output writer for WFS-T to publish directly to deegree, or to GeoServer, I just learned. So how does this work? Input, filters, output.
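Editorial sketch: the input, filters, output idea can be illustrated in a few lines of plain Python. This is not Stetl's code. The class names (`Packet`, `Component`, `ListInput`, `UpperCaseFilter`, `CollectOutput`) and the `run_chain` helper are hypothetical, chosen only to show how components with compatible inputs and outputs can be connected like Lego blocks.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Packet:
    """Hypothetical carrier passed along the chain: a payload plus a status flag."""
    data: Any = None
    end_of_stream: bool = False


class Component:
    """Base class: every component receives a packet and returns a packet."""

    def invoke(self, packet: Packet) -> Packet:
        return packet


class ListInput(Component):
    """Stand-in input: emits one item per call from an in-memory list."""

    def __init__(self, items):
        self._items = list(items)

    def invoke(self, packet):
        if self._items:
            packet.data = self._items.pop(0)
        # Signal the end once nothing is left to emit.
        packet.end_of_stream = not self._items
        return packet


class UpperCaseFilter(Component):
    """Stand-in filter: transforms the payload, leaves the status untouched."""

    def invoke(self, packet):
        if packet.data is not None:
            packet.data = packet.data.upper()
        return packet


class CollectOutput(Component):
    """Stand-in output: stores whatever arrives instead of writing to PostGIS."""

    def __init__(self):
        self.results = []

    def invoke(self, packet):
        if packet.data is not None:
            self.results.append(packet.data)
        return packet


def run_chain(components):
    """Push packets through all components until the input reports the end."""
    while True:
        packet = Packet()
        for component in components:
            packet = component.invoke(packet)
        if packet.end_of_stream:
            break


output = CollectOutput()
run_chain([ListInput(["amsterdam", "otterlo"]), UpperCaseFilter(), output])
print(output.results)  # ['AMSTERDAM', 'OTTERLO']
```

Swapping `CollectOutput` for another output, or adding a second filter, does not require touching the other components, which is the point of the Lego comparison.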
11:25
Let's take an example step by step. So we have some random XML file here, we apply an XSLT filter, and we use ogr2ogr output to produce a shapefile. So sometimes you get this kind of random XML; it's not, let's say, a feature type,
11:46
it's just XML. It has some names and some coordinates, so you couldn't run ogr2ogr on it, maybe with some very clever command-line fiddling, but you need some way to convert that.
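Editorial sketch: the conversion from "random XML with names and coordinates" to simple feature GML can be illustrated as follows. The talk does this step with an XSLT script; the sketch below swaps in Python's standard `xml.etree.ElementTree` so that it runs without extra libraries. The input element names (`cities`, `city`, `name`, `lat`, `lon`), the coordinate values, and the `ogr:City` feature name are assumptions made for illustration, not the data shown on the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical "random XML": names and coordinates, but no feature types.
RAW_XML = """<cities>
  <city><name>Amsterdam</name><lat>52.4</lat><lon>4.9</lon></city>
  <city><name>Otterlo</name><lat>52.1</lat><lon>5.77</lon></city>
</cities>"""

GML = "http://www.opengis.net/gml"
OGR = "http://ogr.maptools.org/"
ET.register_namespace("gml", GML)
ET.register_namespace("ogr", OGR)


def to_simple_features(raw_xml):
    """Build a GML 2 style feature collection with one point feature per city."""
    root = ET.fromstring(raw_xml)
    collection = ET.Element(f"{{{OGR}}}FeatureCollection")
    for city in root.iter("city"):
        member = ET.SubElement(collection, f"{{{GML}}}featureMember")
        feature = ET.SubElement(member, f"{{{OGR}}}City")
        ET.SubElement(feature, f"{{{OGR}}}name").text = city.findtext("name")
        geom = ET.SubElement(feature, f"{{{OGR}}}geometryProperty")
        point = ET.SubElement(geom, f"{{{GML}}}Point", srsName="EPSG:4326")
        coords = ET.SubElement(point, f"{{{GML}}}coordinates")
        # GML 2 coordinates are written as "x,y", here lon,lat.
        coords.text = f"{city.findtext('lon')},{city.findtext('lat')}"
    return collection


gml_doc = to_simple_features(RAW_XML)
print(ET.tostring(gml_doc, encoding="unicode"))
```

An XSLT stylesheet would express the same mapping declaratively; the result in both cases is a flat feature collection that a tool like ogr2ogr can read.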
12:01
We have an XML input module, or I should say component instead, and then we apply an XSLT filter to transform this to simple feature GML. I know there's some criticism of XSLT, but it's very, very powerful when you have to
12:25
transform one XML schema to another schema. So this simple XSLT script will just take the points out of it and produce
12:42
an OGR feature collection, basically a simple feature collection. We find the same places, Amsterdam, but then as GML; so this turns into this. That basically comes out of the XSLT filter. Once we have simple feature GML,
13:08
we can apply ogr2ogr, and then you could produce basically any output, and in our case it's a shapefile. So how is this all glued together? Stetl is based on configuration files, so you basically don't have to program anything;
13:25
you only have to configure the transformation. So this whole chain is specified in a configuration file, and it's a simple file format,
13:41
the INI file. You can find it in Windows, but it's also used a lot in Python. So basically the whole chain: there's a special section called ETL, and you can have multiple chains. In this case there's one chain: it's an input XML file,
14:00
going through a transformer XSLT, and then an output OGR shape, and these are sort of identifiers which point to sections further up in the file. So input XML file points to this section, and as you can see the specific component, the processing unit, is identified by a class name,
14:25
and then there are specific parameters for that class, and the class is really a component. So a block like XML input is a component, and it's specified here as input XML file, and you see the file path pointing to the specific input file.
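Editorial sketch: the configuration layout described here (an ETL section naming a chain, with each identifier pointing to a section that carries a class name and parameters) can be read with Python's standard `configparser`. The key names (`chains`, `class`, `file_path`, `script`), the `|` and `,` separators, the class paths, and the file names below are assumptions modelled on the description in the talk, not copied from the Stetl documentation.

```python
import configparser

# Hypothetical configuration in INI format, following the talk's description.
CONFIG_TEXT = """
[etl]
chains = input_xml_file|transformer_xslt|output_ogr_shape

[input_xml_file]
class = inputs.fileinput.XmlFileInput
file_path = input/cities.xml

[transformer_xslt]
class = filters.xsltfilter.XsltFilter
script = cities2gml.xsl

[output_ogr_shape]
class = outputs.ogroutput.Ogr2OgrOutput
"""


def load_chains(text):
    """Return a list of chains; each chain is a list of (section, class, params)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    chains = []
    # Assumption: several chains are comma-separated, steps are pipe-separated.
    for chain in parser.get("etl", "chains").split(","):
        steps = []
        for section in chain.strip().split("|"):
            params = dict(parser[section])
            class_name = params.pop("class")
            steps.append((section, class_name, params))
        chains.append(steps)
    return chains


for section, class_name, params in load_chains(CONFIG_TEXT)[0]:
    print(section, "->", class_name, params)
```

A real runner would import each class by its dotted name and instantiate it with the remaining parameters; the sketch stops at parsing so that it stays self-contained.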
14:46
Later on you'll see how we can parameterize that as well. The transformer XSLT is another component here, and it needs a script, and that script is in an XSL file. And for ogr2ogr you can just apply your regular ogr2ogr command,
15:03
which is nice because the syntax of this is all known. So basically you glue together these different tools in this simple configuration file. Configuration files could be more extensive, of course; this is a very simple example. And to run this thing there's a command-line tool called stetl,
15:25
and then you specify the configuration with the -c option, and then it will produce, and of course we use QGIS to view it, the result as a shapefile. So Stetl is in Python, so it can be installed via the standard Python
15:47
way, the Python Package Index, so that's sudo pip install stetl. There's not yet a Debian package or other packages, and there are of course some dependencies. On Linux it is very trivial to install these dependencies. I know it's
16:05
somewhat harder on Windows. So I talked about speed as well. What is also part of Stetl is this whole streaming thing, because, let's say, if you have a few hundred gig, or a few hundred megabyte,
16:24
XML file, you just cannot parse that in memory and then pass it on as a document. So Stetl is based on streaming without intermediate storage, and Stetl also calls upon all the native libraries, the C libraries, so libxslt, libxml2;
16:44
those are standard libraries on Linux. So it's speed-optimized, going native. And for each of these input, filter and output components there are several options now,
17:00
but you're also able to write your own filter. So this is a little bit of Python, maybe I should show some code. If you want to add your own component here, let's say a filter, you can specify your class name in the configuration
17:20
file, and then there's a trivial example that just prints to standard output. And there are of course standard APIs, so your filter always needs to implement an invoke method, and then you get a packet, and the packet contains the data and the status. So okay, what is exchanged between these components? What Stetl doesn't do is make
17:49
its own internal feature model. It stays very close to the feature information that comes out of these different tools, and where necessary, one format is
18:04
translated to another format. So for instance, an etree doc is a Python version of an XML document, but a stream can split a very large document into an XML stream,
18:22
and then you can specify at which point and after how many features you make a document. So you can split a huge XML file into multiple smaller documents, or you could use an etree element array so you get individual features. And there are several components to deal with
18:45
deegree integration, especially to write to deegree, for instance blob store output, or FSLoader, that's a tool in the deegree toolset, or very standard, just WFS-T. So there are two sort of main cases where Stetl is applied. It's for
19:03
INSPIRE transformation, so to generate harmonized data, and the other is to read national GML data sets, mostly into PostGIS. This is a more extensive example, for instance Top10NL,
19:22
which is the national Dutch topo data set, and this is a more extensive file, and you also have multiple chains, for instance for initialization: for setting up a database you can also use Stetl. And all these parameters can be substituted on the command line with the stetl command, so
19:41
it's not hard-coded. And again, of course we should show maps. And also recently we did the BGT; it took maybe a couple of hours, and then we could read the BGT into PostGIS. And this is more extensive; I won't go into the details here, but
20:06
this actually has been used in PDOK. I should say this: recently they tried to switch over to FME, and I say recently, that was one and a half years ago, and they're still struggling. So
20:23
the status: I mean, it's not yet a full-fledged product, it's still in development of course, and that's why I'm also presenting here, to show some of the results and get some feedback. But you can already install it via PyPI, and there's documentation, hosted on Read the Docs,
20:44
and yeah, several real-world transformations have been done. So is it solved? I can't give a definite answer, but I hope to have helped a little bit in solving this problem,
21:01
so thank you very much. Questions or feedback? Stunned. Totally stunned. Well, you said it's tricky to install in the Windows environment; would you try?
21:26
The point is not so much Stetl but the supporting libraries; we always find this problem when installing something with Python on Windows. But there are several options there. I think in the documentation I've pointed to Portable GIS, I don't know, and that's actually
21:47
made by Jo Cook, who's around here, she's in the organization. It's a USB stick with the Windows versions of basically the whole OSGeo stack, and you can run that without installing it,
22:03
and that's very, very powerful. So you can copy that to a directory, for instance, and then initialize it once, and that's the first step, and then you already have GDAL and the GDAL Python bindings, and even Apache, I think even Postgres and PostGIS, so you don't have to install
22:24
it. It's called Portable GIS. And of course there are several options: you could use OSGeo4W, and there's something called USB GIS as well, but
22:42
with Portable GIS we have had some good experiences. Anyone spotted more than two animals?
23:04
Two animals. The train has all the elephants, and the python is the trainer. And I'll hear him. I'm just wondering if you have any XSLT from GML to WFS-T? From GML to? WFS-T?
23:22
XSLT, you mean? I didn't see that. WFS-T is basically a container. Yeah, I know. What?
23:41
Okay. No, actually the WFS-T output module is two or three lines of code, of course it's Python, but it's basically a template. In the other transformation steps you produce regular GML, just like you would send to a file, but then the last step, the WFS-T writer,
24:06
will take those GML features and put them in a template, and that's just basically a container for the WFS-T insert of features. It just puts them all into one insert? One insert.
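Editorial sketch: the "template as container" idea from this answer can be illustrated with `string.Template`. The talk does not show the actual template used by the WFS-T output module, so the one below is a guess at its general shape, using the WFS 1.1.0 `Transaction` and `Insert` elements. The `app` namespace URI, the feature markup, and the function name `wrap_in_transaction` are hypothetical.

```python
from string import Template
import xml.etree.ElementTree as ET

# Hypothetical container template: one Transaction holding a single Insert.
WFST_TEMPLATE = Template(
    '<wfs:Transaction service="WFS" version="1.1.0" '
    'xmlns:wfs="http://www.opengis.net/wfs" '
    'xmlns:gml="http://www.opengis.net/gml" '
    'xmlns:app="http://example.org/app">'
    "<wfs:Insert>$features</wfs:Insert>"
    "</wfs:Transaction>"
)


def wrap_in_transaction(feature_xml_list):
    """Place already-produced GML features, unchanged, inside one Insert."""
    return WFST_TEMPLATE.substitute(features="".join(feature_xml_list))


features = [
    "<app:City><app:name>Amsterdam</app:name></app:City>",
    "<app:City><app:name>Otterlo</app:name></app:City>",
]
transaction = wrap_in_transaction(features)

# Parsing the result confirms that the container is well-formed XML.
root = ET.fromstring(transaction)
insert = root.find("{http://www.opengis.net/wfs}Insert")
print(len(list(insert)), "features in one insert")
```

The writer never inspects the features, which matches the point made in the answer: the output step only needs GML documents, not knowledge of how they were produced. Posting the transaction to a server is left out of the sketch.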
24:22
That's one of the things with Stetl: okay, you have a stream, but you can hack the stream, partition the stream into manageable elements. So you could say, I do a WFS-T insert for every ten features.
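Editorial sketch: partitioning a stream into batches of ten features can be illustrated with an incremental parser. Stetl, according to the talk, relies on native libraries such as libxml2; the sketch below instead uses `xml.etree.ElementTree.iterparse` from the standard library, and the function name `feature_batches`, the `feature` tag, and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def feature_batches(source, feature_tag, batch_size):
    """Yield lists of serialized features without building the whole tree first."""
    batch = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == feature_tag:
            batch.append(ET.tostring(elem, encoding="unicode"))
            # Free the feature's own content; a production version would
            # also detach the element from its parent to bound memory use.
            elem.clear()
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        # Emit the remaining features that did not fill a whole batch.
        yield batch


# Hypothetical input: 25 small features in one document.
sample = "<features>" + "".join(
    f"<feature id='{i}'/>" for i in range(25)
) + "</features>"

sizes = [
    len(batch)
    for batch in feature_batches(io.BytesIO(sample.encode("utf-8")), "feature", 10)
]
print(sizes)  # [10, 10, 5]
```

Each yielded batch could then be handed to an output step, for example one WFS-T insert per batch, so the batch size becomes a tuning parameter rather than a property of the input file.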
24:40
But you could also set up a stream of, and that's very powerful in deegree, let's say a gigabyte of GML. So there are several options here, but it's maybe another approach than HALE; it's totally dedicated to streaming, and the output is basically independent of the input:
25:06
the WFS-T module is unaware of how the other modules have produced the GML. It gets a document, remember, but a document could be arbitrarily large or small.
25:25
But maybe we can talk offline if I didn't understand the entire question. I think it's best if you go for coffee. Not right now, of course. Last question.
25:41
If you wanted to incorporate validation steps in your pipeline, where would you suggest doing that? Usually there's actually an extra... Oh sorry, the question is: if you would like to do validation, where would you put that step in this chain of processing?
26:03
Actually there's an XML validator component, and it's usually placed after the XSLT step, since that produces full documents in the target schema, for instance.
26:20
The nice thing about the validator is that it initializes once to get all the XSDs from the internet, and then it validates each... But that's usually only done in testing, because it costs some performance to validate each and every document. That's actually how I tested a lot, to see if it all worked.
26:45
Validation is very important. Okay, thank you. Thank you, yes.