Kart: Practical Data Versioning for rasters, vectors, tables, and point clouds
Formal Metadata
Number of Parts | 266
License | CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/66453 (DOI)
FOSS4G Prizren Kosovo 2023 | 146 / 266
Transcript: English (auto-generated)
00:08
All right, I am Rob. I'm from Koordinates. We're a geospatial data management platform. And we're trying to crack GIS data out of vendor silos.
00:20
So you can host, manage, share, publish, access, and build apps on top of data. And we really make it easy for professional map makers to find and access geospatial data and get on with their jobs. And that's making decisions, building maps, creating models, developing applications. And on the flip side, we help facilitate publishing data.
00:44
What I want to talk about today, though, is versioning. So how many people kind of consider themselves a developer or spend some time developing software? So quite a lot. And besides having to try and make computers
01:00
do what you want all day, you're pretty lucky. You can actively work across multiple tasks and projects and switch between them. You can have different development and release branches. You can do things like code reviews and pull requests so that your colleagues can collaborate across it.
01:23
And this is what makes open source work as well, right? That we can collaborate efficiently with people who aren't in larger groups and who aren't necessarily all together. And a developer can take for granted that they can always see who changed what and when
01:42
and if they're really lucky, why. And developers use these sorts of things every day. But in respect to data and geospatial data, you just can't do this. And we talk to our users, and they say we want this. And so that's what we're trying to do. And we have some more opportunities as well, right?
02:00
And so working between different ecosystems is really important in geospatial. So we have our open source ecosystem with projects we love, like PostGIS and QGIS. We have people and colleagues and customers and suppliers who work in the Esri ecosystem, or maybe in the application world, like the web developers,
02:22
and they don't really do geospatial data. And so in the engineering world, they work in CAD. All these different ecosystems are a little bit disjointed. And when we go between them, we often have to convert data, and our expectations are different. If you're a government agency publishing data,
02:41
you will do it in your national grid because that's what you do. But whoever's using the data might want it in something completely different. And every time we have to do these conversions, it adds friction to getting updates. Data integrity matters too: being able to verify that I have the same thing as you
03:03
is really important. And the other thing I've written here is supply chains. I get data from somebody else, and maybe other people get data from me, so how do we see where this data has come from?
03:21
So what are we trying to do with Kart? So Kart is built on Git. We decided to focus our effort on data and geospatial and we can leverage other people who are in the software world who focus on the underlying building blocks that we're using as part of Git.
03:41
We wanna maintain compatibility, so Kart should be familiar to anyone who's a developer, but it won't necessarily be identical. We wanna make it easy to install, so we include all the batteries. And it works on Windows and Linux and macOS. Coordinate system handling works,
04:01
database connections work. We try and make it work out of the box. And we wanna make it for practical day-to-day use. So this isn't really a solution for people who have their own satellite clusters producing terabytes of data every few hours. They have software teams and they can develop tools specific to them.
04:22
This is for the rest of us. So I'm gonna do a very quick demo now. I'm gonna look at what a Kart repo actually is and then we're gonna have a look at making some changes. So what we've got here is QGIS.
04:41
And if we look up on the left-hand side, then we can see our data browser and we can see some layers that have been added. And from QGIS's perspective, a Kart repository is just a GeoPackage or just a directory with some images in. But, my mouse cursor, okay.
05:04
But what we can do is make some changes to these things and then we can look at history. So, over here we've found a problem with our data. And what we're gonna do is select it. I'm gonna say, who added this weird island thing?
05:21
Select it over here. And then we're gonna delete it. See if the delete key works. I'm gonna hit it first. This is really hard on the giant screen. Wrong layer, of course.
05:42
There we go, yay. You can do it. Yeah, delete it. Okay, so we can save our changes in QGIS. And so we'll see what happens next.
06:01
From QGIS's perspective, this is just a layer, right? So, over here we can go to our repository. Let's have a look at what our repository is. And so I've cloned this from a workshop that we did earlier in the week. And so we've got a few files here. So, we've just got our GeoPackage, the .gpkg file.
06:21
We've got our terrain directory which has some TIFFs in. And we have a VRT file that Kart automatically generates. And what we can do now is just do kart status. And Kart can tell us that, da da da da da da,
06:44
one delete has happened and we can commit it. Okay, so we've made a change to our repository. And that's great. So, we can also see what the history is.
07:01
So, we can see that I've deleted a feature. So, that was my most recent change. And then Hamish maybe made some changes a few days ago which might have been adding that. And we can go back through time, kind of see what changed. And we can view differences as well. So, I'm gonna go on and talk about our Kart plugin,
07:23
our QGIS plugin a little bit later on. But that's a really easy introduction to some of the underlying building blocks. So, we try and build on top of existing file formats. So, in this case, what we're working with here is a GeoPackage and some TIFF files. And if you're in a different ecosystem, if you work in the Esri world,
07:40
it would just be a file geodatabase. Or if you work in a CAD world, not that we have it yet, but DWG files and stuff. So, I talked about the vector and table datasets we support. And so, we handle kind of zero to 100 gig, which is pretty big. And we try and follow a SQL model.
08:01
So, we have a schema, and you can change your schema over time, and that's okay. But you kind of have to assign data types and columns. And we pull in all this stuff. And we import from many OGR formats. We know about coordinate systems. And one of the cool things that we can do is re-import from a snapshot.
08:21
So, if somebody sends you files every few weeks or every few months, you can keep loading into the same repository and you build up a history of versions that you can then compare and see what's changed. We support point cloud data. And we build our point cloud data support on cloud optimized point cloud.
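To make the snapshot re-import idea concrete, here is a minimal conceptual sketch in Python. It is not Kart's implementation; the function name `diff_snapshots`, the feature layout (a dict keyed by primary key), and the sample road data are all assumptions made for illustration. It shows how comparing two successive deliveries of the same dataset yields the inserts, updates, and deletes that a version history would record.

```python
from typing import Any, Dict, List

# Hypothetical layout: each snapshot maps a primary key to a feature's attributes.
Snapshot = Dict[int, Dict[str, Any]]


def diff_snapshots(old: Snapshot, new: Snapshot) -> Dict[str, List[int]]:
    """Compare two snapshots and report which primary keys changed."""
    old_keys, new_keys = set(old), set(new)
    return {
        # Keys present only in the newer delivery.
        "inserts": sorted(new_keys - old_keys),
        # Keys present in both deliveries whose attributes differ.
        "updates": sorted(k for k in old_keys & new_keys if old[k] != new[k]),
        # Keys that disappeared from the newer delivery.
        "deletes": sorted(old_keys - new_keys),
    }


# Two made-up deliveries of the same roads layer, a few months apart.
march = {
    1: {"name": "Main Rd", "lanes": 2},
    2: {"name": "Old Bridge", "lanes": 1},
    3: {"name": "Weird Island Track", "lanes": 1},
}
june = {
    1: {"name": "Main Rd", "lanes": 4},
    2: {"name": "Old Bridge", "lanes": 1},
    4: {"name": "Bypass", "lanes": 2},
}

changes = diff_snapshots(march, june)
print(changes)
```

Loading each delivery on top of the previous one and recording only these differences is what lets a history of versions build up from plain file drops.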
08:41
And I'll talk a little bit about that later on. And again, it's kind of zero to terabyte-sized datasets. And we do things like use the brand new support for creating virtual point cloud files. So that from QGIS's perspective, you can just open the dataset
09:01
and all the tiles are treated as one layer for styling and performance purposes. We now support rasters as well. And that's built on top of Cloud Optimized GeoTIFFs. And for both point clouds and rasters, we don't store the data itself in the repository.
09:23
It lives in an object storage system, like S3 or something else. And we do that so that what's stored in the repository itself is just the information about the tiles. So where they are, what the coordinate system is,
09:40
how many bands they have, maybe some statistics. And then it allows Kart to selectively alter stuff without having to make big duplicate copies. And one of the things that we're going to be able to do soon is to be able to point Kart repositories at object stores that already exist.
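The split described here, with small tile descriptions in the repository and the large tile data in object storage, can be sketched as follows. This is an illustrative Python sketch only; the `TilePointer` record, its field names, and the `s3://example-bucket/...` URL are hypothetical and are not Kart's actual on-disk format. A content hash in the record also gives a simple way to verify that two people hold the same tile.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TilePointer:
    """Hypothetical small record kept in the repository in place of the tile."""
    name: str
    url: str        # where the real data lives (object storage)
    sha256: str     # content hash, used to check integrity
    size: int       # size of the tile in bytes
    crs: str        # coordinate reference system of the tile
    bands: int      # number of raster bands


def make_pointer(name: str, url: str, payload: bytes, crs: str, bands: int) -> TilePointer:
    """Describe a tile without keeping its bytes in the repository."""
    digest = hashlib.sha256(payload).hexdigest()
    return TilePointer(name, url, digest, len(payload), crs, bands)


def verify(pointer: TilePointer, payload: bytes) -> bool:
    """Check that downloaded bytes match what the repository recorded."""
    return (
        len(payload) == pointer.size
        and hashlib.sha256(payload).hexdigest() == pointer.sha256
    )


# Stand-in bytes for a real tile; the URL is a made-up example location.
tile_bytes = b"pretend this is a cloud optimized GeoTIFF"
pointer = make_pointer(
    "terrain_01.tif",
    "s3://example-bucket/terrain/terrain_01.tif",
    tile_bytes,
    crs="EPSG:2193",
    bands=1,
)

print(pointer.size, verify(pointer, tile_bytes))
```

Because only the pointer is versioned, switching revisions means swapping small records and fetching a tile only when its bytes are actually needed.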
10:03
So you don't have to copy your data into Kart. So we have this concept of working copies. And that's where you work and edit your data. So before, when we were in QGIS, we saw that we had a GeoPackage file,
10:20
we had a directory, and we have different working copy formats for different places. So you can put data from your Kart repository straight into PostGIS, and we'll keep updating it as you switch around between branches and revisions. You can do the same thing with Microsoft SQL Server,
10:41
MySQL, I already talked about Esri. And we've started using the cloud-optimized formats for point clouds and rasters. And the idea is we can start serving tiles and STAC and other things from the repositories. But being able to do any revision back
11:00
through the history of all the changes. We've got a Kart QGIS plugin. And the QGIS plugin is the panel you can see on the right and that allows you to navigate the history of the repository. It allows you to make commits from within QGIS. You can roll back, you can switch branches.
11:23
So you can do this sort of stuff natively from within QGIS without having to use the command line. Something else we support is spatially filtered clones.
11:41
And what we're doing here is working only with your area of interest. And so if I'm a data publisher, I have a national data set, that's how I wanna publish my data, right? But if I'm a local user working in a specific city or town,
12:00
that's probably the only area that I'm interested in. And I shouldn't have to work with a much larger data set that has lots of information that's not relevant to me. But at the same time, I shouldn't have to divorce myself from that data set. So what we try and do with spatially filtered clones and working with spatial filters
12:22
is to be able to stay part of the larger repository, including all its history, editing, updates. But what you're actually working with locally on a day-to-day basis is just a filtered view of that. And so when I make changes,
12:42
I'm pushing and pulling to the full data sets. But when I open it up in my software, in my PostGIS or in my QGIS, then I only see the area that's kind of relevant for me. And so this is obviously something that we've built
13:03
that isn't really relevant to Git or something else. And so this is what I mean about we can focus on features like this that are important to our users rather than kind of reinventing parts of Git that other people have already done.
13:20
And so I've got a small demo of that. We'll see if we can make this work. So what I'm gonna do here is clone a data set.
13:50
If you're a Git user, you kind of recognize what I'm doing. This is the point where you're like, I didn't realize I was gonna be holding a microphone,
14:00
so I'm like one finger typing a really long repository name. Now this is a USGS point cloud data set. I'm gonna hit the button and it's gonna clone down. And the first thing it does
14:20
is collect all the information about all the tiles in the point cloud data set. And then it kind of figures out what it needs. And so what it's gonna do is go away and figure out that it needs to transfer quite a lot. I think it's about 10 gig, 146 files. This is never gonna work. It's okay. We have spatial filtering to the rescue.
14:42
So I can specify a spatial filter when I'm pulling a clone down. And we can do it in different coordinate systems. We can do it as WKT. We can do it as files.
15:00
So in this case, I've got some WKT in a file that defines a little box. And whoops, they're back up there. So I added a spatial filter
15:21
and it's doing exactly the same thing it did before. But instead of pulling down 146 files or whatever I decided, it's going to just grab two, maybe three. And this is gonna go a bit quicker.
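The effect of the spatial filter in this demo can be sketched with a simple bounding-box test. This is a conceptual Python sketch, not Kart's code: the `BBox` type, the `tiles_to_fetch` helper, and the synthetic 10 by 10 grid of tile extents are assumptions for illustration, and a real filter may be an arbitrary polygon in a chosen coordinate system. The idea is that the tile metadata is small enough to fetch in full, and only tiles whose extent intersects the area of interest get downloaded.

```python
from typing import Dict, List, NamedTuple


class BBox(NamedTuple):
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def intersects(a: BBox, b: BBox) -> bool:
    """True when two axis-aligned boxes overlap or touch."""
    return (
        a.min_x <= b.max_x and b.min_x <= a.max_x
        and a.min_y <= b.max_y and b.min_y <= a.max_y
    )


def tiles_to_fetch(tile_extents: Dict[str, BBox], spatial_filter: BBox) -> List[str]:
    """Pick only the tiles whose extent intersects the area of interest."""
    return sorted(
        name for name, extent in tile_extents.items()
        if intersects(extent, spatial_filter)
    )


# Synthetic dataset: a 10 x 10 grid of unit-square tiles (hypothetical names).
tiles = {
    f"tile_{col}_{row}": BBox(col, row, col + 1, row + 1)
    for col in range(10)
    for row in range(10)
}

# A small area of interest that straddles two neighbouring tiles.
area_of_interest = BBox(2.2, 3.1, 3.8, 3.9)

selected = tiles_to_fetch(tiles, area_of_interest)
print(f"{len(selected)} of {len(tiles)} tiles needed: {selected}")
```

The full history and metadata stay available for every tile; only the heavy downloads are skipped for tiles outside the filter.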
15:41
And the WiFi's working, this is great. And so you can imagine for like a national or continental size data set, this is gonna make a huge difference. So let's go, go into here. So each data set for point clouds and rasters is a directory.
16:02
And in this case, it's the Agua Blanca Fault in Mexico. And we can see that we have three COPC LAS files and we have a VPC. And so given that spatial filter, it's decided that that's the only thing that's relevant to the area I want.
16:22
And so Kart's just pulled down the data it needs. And we can go into QGIS. And we will go into our folder. Way.
16:44
And we've got a, I'm gonna create a new project cause it's gonna be, and we can,
17:06
knows about CRS, we're gonna set the project CRS from that. We can create a 3D view.
17:20
Here we go. And as we zoom in, QGIS will now quite happily load the tiles incrementally. We can do all our QGIS stuff and we're working with a smaller subset of the larger data set.
17:42
The really cool thing about VPC, and the same with VRT (and really good work by Hobu and the Lutra guys for adding it), is that from the desktop perspective, we're working with one layer in QGIS and Kart can update it in the background. If you change your filter, it will pull down some more data
18:00
or throw away some data. But from QGIS's perspective, we can treat it as one thing for styling and just get back here. So, what's coming up? In the last year, we've been steadily trucking away. As I said, we added raster support. Point clouds were very, very new last year.
18:22
We finished off a bunch of work around documentation. We now have much better help on the command line. The tool is faster generally. We fixed a lot of bugs and we're kind of making steady releases with new capabilities and just general improvements.
18:40
So as I talked about before, we want to reference data from existing S3 buckets without copying. So if you have an S3 bucket with lots of cloud GeoTIFFs in it, what we want to be able to do soon is just point your repository at those tiles and you can treat it as a Kart repository. We already support multiple datasets in a repository.
19:02
So these are for basically different layers and different tables. And you can obviously add in rasters and point clouds as well and keep them together as a project. But what we really want to get to is looking at interlinking datasets from projects
19:21
because often the data you're getting is coming from different suppliers. And so we want to interlink it so that you can have a nice compact project repository with the layers you're interested in regardless of where they came from and to be able to pull in updates from other repositories really simply and easily.
19:43
We know how to do that and we just need to build it up. We want to be able to blend local and remote raster and point cloud datasets. So if you've got data that's, as I said, like a national raster layer, you might want to have some tiles locally
20:00
because it makes your day-to-day life a lot quicker and easier. You can run analysis or do models on them for the relevant areas. But maybe you want to be able to at least see the other datasets by pulling them directly from the cloud as well. And we want to be able to serve tiles and APIs like STAC directly from repositories
20:21
and supporting all the history so that we can look at different tags, we can look at different branches and different commits in this history and treat it all sensibly via STAC and tiles.