
Kart: Practical Data Versioning for rasters, vectors, tables, and point clouds

Formal Metadata

Title
Kart: Practical Data Versioning for rasters, vectors, tables, and point clouds
Title of Series
Number of Parts
266
Author
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Kart is addressing the lack of versioning tools in the geospatial field, providing an open and practical solution for managing datasets efficiently and improving collaboration. This tool offers features such as a QGIS plugin and supports various data types, including raster and point cloud datasets. By demonstrating Kart's capabilities, including versioning, spatial filtering, and data access techniques, users can understand how it streamlines dataset management, eliminates data duplication, and ensures compatibility across different formats and ecosystems. Ultimately, Kart enhances collaboration within teams and organizations, enabling easy tracking of changes and optimizing time utilization in geospatial projects.
Transcript: English (auto-generated)
All right, I am Rob. I'm from Koordinates. We're a geospatial data management platform. And we're trying to crack GIS data out of vendor silos.
So you can host, manage, share, publish, access, and build apps on top of data. And we really make it easy for professional map makers to find and access geospatial data and get on with their jobs. And that's making decisions, building maps, creating models, developing applications. And on the flip side, we help facilitate publishing data.
What I want to talk about today, though, is versioning. So how many people kind of consider themselves a developer or spend some time developing software? So quite a lot. And besides having to try and make computers
do what you want all day, you're pretty lucky. You can actively work across multiple tasks and projects and switch between them. You can have different development and release branches. You can do things like code reviews and pull requests so that your colleagues can collaborate across it.
And this is what makes open source work as well, right? That we can collaborate efficiently with people who aren't in larger groups and who aren't necessarily all together. And a developer can take for granted that they can always see who changed what and when
and if they're really lucky, why. And developers use these sorts of things every day. But in respect to data and geospatial data, you just can't do this. And we talk to our users, and they say we want this. And so that's what we're trying to do. And we have some more opportunities as well, right?
And so working between different ecosystems is really important in geospatial. So we have our open source ecosystem with projects we love, like PostGIS and QGIS. We have people and colleagues and customers and suppliers who work in the Esri ecosystem or maybe the application, like the web developers,
and they don't really do geospatial data. And so in the engineering world, they work in CAD. All these different ecosystems are a little bit disjointed. And when we go between them, we often have to convert data, and our expectations are different. If you're a government agency publishing data,
you will do it in your national grid because that's what you do. But whoever's using the data might want it in something completely different. And every time we have to do these conversions, it adds friction to getting updates. We can also do things like data integrity and being able to verify that I have the same thing as you
is really important. And looking at what I've written here is like supply chains. And so I get data from somebody else and maybe other people get data from me and how do we see where this data has come from?
So what are we trying to do with Kart? So Kart is built on Git. We decided to focus our effort on data and geospatial and we can leverage other people who are in the software world who focus on the underlying building blocks that we're using as part of Git.
We wanna maintain compatibility, so Kart should be familiar to anyone who's a developer, but it won't necessarily be identical. We wanna make it easy to install, so we include all the batteries. And it works on Windows and Linux and macOS. Coordinate system handling works,
database connections work. We try and make it work out of the box. And we wanna make it for practical day-to-day use. So this isn't really a solution for people who have their own satellite clusters producing terabytes of data every few hours. They have software teams and they can develop tools specific to them.
This is for the rest of us. So I'm gonna do a very quick demo now. I'm gonna look at what a Kart repo actually is and then we're gonna have a look at making some changes. So what we've got here is QGIS.
And if we look up on the left-hand side, then we can see our data browser and we can see some layers that have been added. And from QGIS's perspective, a Kart repository is just a GeoPackage or just a directory with some images in it. But, my mouse cursor, okay.
But what we can do is make some changes to these things and then we can look at history. So, over here we've found a problem with our data. And what we're gonna do is select it. I'm gonna say, who added this weird island thing?
Select it over here. And then we're gonna delete it. See if the delete key works. I'm gonna hit it first. This is really hard on the giant screen. Wrong layer, of course.
There we go, yay. You can do it. Yeah, delete it. Okay, so we can save our changes in QGIS. And so we'll see what happens next.
From QGIS's perspective, this is just a layer, right? So, over here we can go to our repository. Let's have a look at what our repository is. And so I've cloned this from a workshop that we did earlier in the week. And so we've got a few files here. So, we've just got our GeoPackage, the .gpkg.
We've got our terrain directory which has some TIFFs in. And we have a VRT file that Kart automatically generates. And what we can do now is just do kart status. And Kart can tell us that, da da da da da da,
one delete has happened and we can commit it. Okay, so we've made a change to our repository. And that's great. So, we can also see what the history is.
So, we can see that I've deleted a feature. So, that was my most recent change. And then Hamish maybe made some changes a few days ago which might have been adding that. And we can go back through time, kind of see what changed. And we can view differences as well. So, I'm gonna go on and talk about our Kart plugin,
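The command-line part of the demo (check status, commit, look at history) can be sketched as a small Python helper. This is a minimal sketch, not taken from the talk: it only builds the command lines, the commit message is made up, and it assumes the `kart status`, `kart commit -m` and `kart log` subcommands mirror their Git counterparts.

```python
import shlex


def kart_commands(message):
    """Return the argv lists for the status, commit and log steps of the demo."""
    return [
        ["kart", "status"],                 # what changed in the working copy
        ["kart", "commit", "-m", message],  # record the change
        ["kart", "log"],                    # browse the history
    ]


# Print the commands as they would be typed in a shell.
for argv in kart_commands("Remove stray island"):
    print(shlex.join(argv))
```

To actually run them, each list could be passed to `subprocess.run(argv, cwd=repo_path, check=True)` inside a Kart repository, where `repo_path` is whatever directory the repository was cloned into.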
our QGIS plugin a little bit later on. But that's a really easy introduction to some of the underlying building blocks. So, we try and build on top of existing file formats. So, in this case, what we're working with here is a GeoPackage and some TIFF files. And if you're in a different ecosystem, if you work in the Esri world,
it would just be a file geodatabase. Or if you work in a CAD world, not that we have it yet, but DWG files and stuff. So, I talked about vector and table datasets we support. And so, we have kind of zero to 100 gig, which is pretty big. And we try and follow a SQL model.
So, we have a schema, and you can change your schema over time, and that's okay. But you kind of have to assign data types and columns. And we pull in all this stuff. And we import from many OGR formats. We know about coordinate systems. And one of the cool things that we can do is re-import from a snapshot.
So, if somebody sends you files every few weeks or every few months, you can keep loading into the same repository and you build up a history of versions that you can then compare and see what's changed. We support point cloud data. And we build our point cloud data support on Cloud Optimized Point Cloud (COPC).
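To make the snapshot idea concrete, here is a minimal sketch of comparing two successive snapshots of a table keyed by primary key. It is an illustration only, not Kart's implementation; the function name and the sample parcels are hypothetical.

```python
def diff_snapshots(old, new):
    """Compare two snapshots, each a dict mapping primary key -> row.

    Returns (inserts, updates, deletes): rows only in `new`, rows whose
    values changed (as (old_row, new_row) pairs), and rows only in `old`.
    """
    inserts = {pk: new[pk] for pk in new.keys() - old.keys()}
    deletes = {pk: old[pk] for pk in old.keys() - new.keys()}
    updates = {
        pk: (old[pk], new[pk])
        for pk in old.keys() & new.keys()
        if old[pk] != new[pk]
    }
    return inserts, updates, deletes


# Two hypothetical deliveries of the same dataset, a few weeks apart.
march = {1: {"name": "Parcel A"}, 2: {"name": "Parcel B"}}
april = {1: {"name": "Parcel A"}, 2: {"name": "Parcel B2"}, 3: {"name": "Parcel C"}}
inserts, updates, deletes = diff_snapshots(march, april)
```

Loading each delivery as a new commit is what turns a series of unrelated files into a history where differences like these can be inspected.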
And I'll talk a little bit about that later on. And again, it's kind of zero to terabyte-sized datasets. And we do things like use the brand new support for creating virtual point cloud files. So that from QGIS's perspective, you can just open the dataset
and all the tiles are treated as one layer for styling and performance purposes. We now support rasters as well. And that's built on top of Cloud Optimized GeoTIFFs. And for both point clouds and rasters, we don't store the data itself in the repository.
It lives in an object storage system, like S3 or something else. What's stored in the repository itself is the information about the tiles. So where they are, what the coordinate system is,
how many bands they have, maybe some statistics. And that allows Kart to selectively alter stuff without having to make big duplicate copies. And one of the things that we're going to be able to do soon is to be able to point Kart repositories at object stores that already exist.
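The split between small per-tile metadata in the repository and large tile files in object storage can be sketched as follows. This is an assumption-laden illustration: the `TileRecord` fields, the `oid` content hash and the example URLs are hypothetical and are not Kart's actual on-disk format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileRecord:
    """Small, versionable description of one tile; the data lives elsewhere."""
    name: str
    url: str    # hypothetical object-store location
    oid: str    # hypothetical content hash of the tile file
    crs: str
    bands: int


def changed_tiles(old_rev, new_rev):
    """Names of tiles that are new or whose content differs between revisions.

    Each revision is a dict mapping tile name -> TileRecord. Unchanged tiles
    keep pointing at the same stored object, so nothing is duplicated.
    """
    return sorted(
        name
        for name, rec in new_rev.items()
        if name not in old_rev or old_rev[name].oid != rec.oid
    )


a = TileRecord("a.tif", "s3://example-bucket/a.tif", "hash-a1", "EPSG:2193", 1)
b1 = TileRecord("b.tif", "s3://example-bucket/b.tif", "hash-b1", "EPSG:2193", 1)
b2 = TileRecord("b.tif", "s3://example-bucket/b-v2.tif", "hash-b2", "EPSG:2193", 1)
rev1 = {"a.tif": a, "b.tif": b1}
rev2 = {"a.tif": a, "b.tif": b2}
```

Here `changed_tiles(rev1, rev2)` names only `b.tif`, so a new revision of a large dataset costs one new tile rather than a full copy.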
So you don't have to copy your data into Kart. So we have this concept of working copies. And that's where you work and edit your data. So before, when we were in QGIS, we saw that we had a GeoPackage file,
we had a directory, and we have different working copy formats for different places. So you can put data from your Kart repository straight into PostGIS, and we'll keep updating it as you switch around between branches and revisions. You can do the same thing with Microsoft SQL Server,
MySQL, I already talked about Esri. And we've started using the cloud-optimized formats for point clouds and rasters. And the idea is we can start serving tiles and STAC and other things from the repositories. But being able to do any revision back
through the history of all the changes. We've got a Kart QGIS plugin. And the QGIS plugin is the panel you can see on the right and that allows you to navigate the history of the repository. It allows you to make commits from within QGIS. You can roll back, you can switch branches.
So you can do this sort of stuff natively from within QGIS without having to use the command line. Something else we support is spatially filtered clones.
And what we're doing here is working only with your area of interest. And so if I'm a data publisher, I have a national data set, that's how I wanna publish my data, right? But if I'm a local user working in a specific city or town,
that's probably the only area that I'm interested in. And I shouldn't have to work with a much larger data set that has lots of information that's not relevant to me. But at the same time, I shouldn't have to divorce myself from that data set. So what we try and do with spatially filtered clones and working with spatial filters
is to be able to stay part of the larger repository, including all its history, editing, updates. But what you're actually working with locally on a day-to-day basis is just a filtered view of that. And so when I make changes,
I'm pushing and pulling to the full data sets. But when I open it up in my software, in my PostGIS or in my QGIS, then I only see the area that's kind of relevant for me. And so this is obviously something that we've built
that isn't really relevant to Git or something else. And so this is what I mean about we can focus on features like this that are important to our users rather than kind of reinventing parts of Git that other people have already done.
And so I've got a small demo of that. We'll see if we can make this work. So what I'm gonna do here is clone a data set.
If you're a Git user, you kind of recognize what I'm doing. This is the point where you're like, I didn't realize I was gonna be holding a microphone,
so I'm like one finger typing a really long repository name. Now this is a USGS point cloud data set. I'm gonna hit the button and it's gonna clone down. And the first thing it does
is collect all the information about all the tiles in the point cloud data set. And then it kind of figures out what it needs. And so what it's gonna do is go away and figure out that it needs to transfer quite a lot. I think it's about 10 gig, 146 files. This is never gonna work. It's okay. We have spatial filtering to the rescue.
So I can specify a spatial filter when I'm pulling a clone down. And we can do it in different coordinate systems. We can do it as WKT. We can do it as files.
So in this case, I've got some WKT in a file that defines a little box. And whoops, they're back up there. So I added a spatial filter
and it's doing exactly the same thing it did before. But instead of pulling down 146 files or whatever I decided, it's going to just grab two, maybe three. And this is gonna go a bit quicker.
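A spatially filtered clone of this kind can be sketched in Python as building a WKT box and passing it to the clone command. This is a hedged sketch: the repository URL and the coordinates are placeholders, and the `--spatial-filter=CRS;WKT` form is an assumption about Kart's command line that should be checked against the Kart documentation before use.

```python
def box_wkt(min_x, min_y, max_x, max_y):
    """Return a closed rectangular WKT polygon for the given bounds."""
    ring = [
        (min_x, min_y),
        (max_x, min_y),
        (max_x, max_y),
        (min_x, max_y),
        (min_x, min_y),  # repeat the first point to close the ring
    ]
    coords = ", ".join(f"{x} {y}" for x, y in ring)
    return f"POLYGON(({coords}))"


def clone_argv(url, crs, wkt):
    """Build the argv for a filtered clone (flag syntax is an assumption)."""
    return ["kart", "clone", url, f"--spatial-filter={crs};{wkt}"]


# Placeholder URL and extent, purely for illustration.
argv = clone_argv("https://example.com/some-repo", "EPSG:4326", box_wkt(0, 0, 1, 2))
```

The list can then be handed to `subprocess.run(argv, check=True)`; keeping the filter as data makes it easy to reuse the same box for several datasets.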
And the WiFi's working, this is great. And so you can imagine for like a national or continental size data set, this is gonna make a huge difference. So let's go, go into here. So each data set for point clouds and rasters is a directory.
And in this case, it's the Agua Blanca Fault in Mexico. And we can see that we have three COPC LAS files and we have a VPC. And so given that spatial filter, it's decided that that's the only thing that's relevant to the area I want.
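How a filter narrows 146 tiles down to a handful can be illustrated with a simple bounding-box test against each tile's extent. This is a minimal sketch of the general idea, not Kart's actual selection logic; the tile names and extents are invented.

```python
def intersects(a, b):
    """True if two (min_x, min_y, max_x, max_y) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def select_tiles(tile_extents, filter_box):
    """Return the names of tiles whose extent intersects the filter box."""
    return [
        name
        for name, extent in tile_extents.items()
        if intersects(extent, filter_box)
    ]


# A hypothetical 2x2 grid of tiles, each 10 units wide.
tiles = {
    "tile_0_0": (0, 0, 10, 10),
    "tile_1_0": (10, 0, 20, 10),
    "tile_0_1": (0, 10, 10, 20),
    "tile_1_1": (10, 10, 20, 20),
}
wanted = select_tiles(tiles, (12, 2, 18, 8))
```

Because only the extents are needed for this test, the decision can be made from the tile metadata alone, before any tile data is transferred.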
And so Kart's just pulled down the data it needs. And we can go into QGIS. And we will go into our folder. Way.
And we've got a, I'm gonna create a new project cause it's gonna be, and we can,
knows about CRS, we're gonna set the project CRS from that. We can create a 3D view.
Here we go. And as we zoom in, we can, QGIS will now quite happily load the tiles. Incrementally, we can do all our QGIS stuff and we're working with a smaller subset of the larger data set.
The really cool thing about VPC and the same with VRT, and really good work to Hobu and the Lutra guys for adding it, is that from the desktop perspective, we're working with one layer in QGIS and Kart can update it in the background. If you change your filter, it will pull down some more data
or throw away some data. But from QGIS's perspective, we can treat it as one thing for styling and just get back here. So I guess what's coming up in the last year, we've been steadily trucking away. I said we added raster support. Point clouds were very, very new last year.
We finished off a bunch of work around documentation. We now have much better help on the command line. The tool is faster generally. We fixed a lot of bugs and we're kind of making steady releases with new capabilities and just general improvements.
So I talked about before, we want to reference data from existing S3 buckets without copying. So if you have an S3 bucket with lots of Cloud Optimized GeoTIFFs in, what we want to be able to do soon is just point your repository at those tiles and you can treat it as a Kart repository. We already support multiple datasets in a repository.
So these are for basically different layers and different tables. And you can add in obviously rasters and point clouds as well and keeping them together as a project. But what we really want to get to is looking at interlinking datasets from projects
because often the data you're getting is coming from different suppliers. And so we want to interlink it so that you can have a nice compact project repository with the layers you're interested in regardless of where they came from and to be able to pull in updates from other repositories really simply and easily.
We know how to do that and we just need to build it up. We want to be able to blend local and remote raster and point cloud datasets. So if you've got data that's, as I said, like a national raster layer, you might want to have some tiles locally
because it makes your day-to-day life a lot quicker and easier. You can run analysis or do models on them for the relevant areas. But maybe you want to be able to at least see the other datasets by pulling them directly from the cloud as well. And we want to be able to serve tiles and APIs like STAC directly from repositories
and supporting all the history so that we can look at different tags, we can look at different branches and different commits in this history and treat it all sensibly via STAC and tiles.