500+ Billion Points: Organizing Point Clouds as Infrastructure
Formal Metadata

Part | 161 of 193
License | CC Attribution 3.0 Germany: You may use, modify, reproduce, distribute, and make the work or its content publicly accessible, in unchanged or modified form, for any legal purpose, provided you credit the author/rights holder in the manner they specify.
Identifiers | 10.5446/20366 (DOI)
Transcript: English (auto-generated)
00:09
and we're ready for the second talk of this session. Connor Manning, who's gonna tell us about point clouds as infrastructure. Yes. Okay, I'm Connor.
00:20
I'm gonna talk about indexing very large data sets, point cloud data sets for use as infrastructure. I'm gonna focus mostly on the reorganizing portion and the indexing, and I'll get to some of the infrastructure and future work towards the end. So the initial problem that drove what I'll be talking about today
00:40
was can we put Iowa's LiDAR in a web browser? And a little bit about the data set. It's 37,000 files, so this is a statewide LiDAR collection. About 170 billion points, uncompressed, it's four and a half terabytes or so. Compressed, it's about 400 gigabytes.
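Those figures hang together, which a little back-of-the-envelope arithmetic confirms (a rough check, treating TB/GB as decimal units):

```python
# Rough per-point arithmetic on the Iowa figures above (a back-of-the-
# envelope check, treating TB/GB as decimal units).
points = 170e9           # ~170 billion points
uncompressed = 4.5e12    # ~4.5 TB raw
compressed = 0.4e12      # ~400 GB as LAZ

assert 25 < uncompressed / points < 28       # ~26 bytes per raw point
assert 2 < compressed / points < 3           # ~2.4 bytes per point in LAZ
assert 11 < uncompressed / compressed < 12   # ~11x compression ratio
```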
01:00
Very good compression, it's all LAZ. And the data is organized as full-resolution tiles. This is actually the data set. Yes, Iowa's very flat. So these are the point clouds colored by which file is which. And some problems with this layout
01:22
is that it's difficult to access, especially in the ways that clients want to access the data for such a large data set. With tiles, there's really no way to get something like a low resolution overview. There's no hierarchical structure or level of detail. Everything is full res all the time.
01:40
It's difficult to manage because it is thousands and thousands of files, so it's difficult to treat the set logically as a single unit, which is how we would like to think about it. So we'd like some sort of abstraction where we can deal with this data set as a contiguous unit rather than constantly being reminded that we're dealing with
02:01
the tiles that make up the set. And as far as visualization, unless you reorganize the data somehow, you can really only do it in pieces. You can download these tiles, then maybe you can sparsify them that way, but you still have to fetch full tiles from somewhere to do anything.
02:22
So what we really want is a point cloud service to do all these things I just said. So we want to provide hierarchical access to the data so you can get different zoom levels, you can get dynamic resolution, we want to be able to go anywhere very quickly, and we want it to be flexible. We aimed for visualization first,
02:41
but it's not visualization only. The visualization-first mindset requires a quick response; we have to be able to respond quickly to queries, on the order of milliseconds. And that drove our solution a little bit, so I'll talk about that a little bit later. And the service aspect is that we want to be able to deal with it as a single unit,
03:01
so that's why we want to create some sort of service to abstract whatever organization we come up with. So we need to reorganize the data, essentially. Obviously, uncompressed four and a half terabytes is a lot of data to chew through; we'd rather not touch it. We could build some sort of meta index on top of it, and there's all sorts of those: since they're tiles,
03:22
they fit nicely, you can keep their bounds, you can go to the bounds that you're interested in and select overlapping tiles, but you can't do level of detail queries, and again you're always, you always need to consider the fact that you're dealing with tiles when you want to get to the data.
03:40
So some constraints: memory. You can't keep four terabytes in RAM; you probably can't keep four terabytes in RAM and swap. You could provision lots of EBS volumes, but we didn't really want to go there. One of the important constraints we put on the solution is that we want it to be lossless,
04:03
and that means a couple of things. We don't want to throw away any points, and a lot of indexes, especially visualization-oriented ones, will do that, because you don't necessarily need the very fine resolution points when it's good enough to stop at some fixed resolution. We didn't want to do that because we don't want
04:20
to require you to keep the original data around and have to refer back to it for the full detail. So we wanted to theoretically be able to reconstruct the original set from whatever we do in this transformation. We also want to be able to add data, so we don't want a finalization step, because these big sets might come in pieces over time, and we want to be able to add to them.
04:41
And of course, as I said, visualization, we're focused on a fast response time, so we want things on the order of milliseconds. Some assumptions, we did things the cloud way. So we assumed that we had scalable cloud computing, like EC2 or Google, there's all sorts of things you can use, but we assumed that we could use that.
05:02
And the third one there, a distributed file system. The main point there is that we're assuming that the IO can scale with the processing. We want to be able to scale across multiple machines, but if there's some bottleneck in the IO, that's not gonna help; that's not gonna be feasible.
05:21
And we're gonna assume that the problem is parallelizable. And obviously, most problems are parallelizable, but practically, for example, if those files weren't tiles, they were just files of randomly spread points throughout the entire state, that would make it more difficult to parallelize. Luckily, tiles are very common, and they're quite natural for parallelization.
05:43
So what we came up with is that we want a massive octree. And there are a few reasons for that, and I'm gonna go over why this is what we want. And it's not necessarily that we need the traditional octree that you see in CS101, it's that we want something that acts as an octree, and the acting part is what we'll get to
06:01
when I talk a little bit about the implementation. So one reason that we want something that acts like an octree is because there's a natural mechanism for visualization. So you can see there, these are tiles of increasing depths of an octree layered on top of each other. So if you go down a depth, you increase the resolution. So there's a natural, I mean, it fits with
06:23
general 2D tiling schemes, and it's very natural. Another thing about octrees, as opposed to other types of trees, is that they're spatially distinct. This is an example of how a quadtree splits, which is the two-dimensional version. Just looking at it, it just screams parallelization, especially when you consider the tile datasets that will go into it, in that
06:42
you can split things geo-spatially and work with confined sets and maybe merge them later, or what we do is we just merge some metadata so we don't actually have to touch the data again. Here, by contrast to that last diagram, is a KD tree, where the order of insertion drives how things end up being shaped up.
07:03
And the reasons we don't want that are we want to avoid expensive things like rebalancing a tree. An octree is very stable, that kind of grid structure you see that naturally maps to level of detail is very natural, whereas this doesn't map so well. And in general, most other trees don't necessarily
07:22
do the things we want it to do, although they have many other benefits. So here is an octree, what you'd see in an intro to octrees: there's a node at the top, it represents the whole bounds, and each node has up to eight child nodes, bisected into octants
07:42
for an octree, quadrants for a quadtree. So to make this structure, it would take 10 terabytes, actually 12 or 13, just to hold the pointers for the amount of data that we're dealing with in Iowa. And the reason I mention this is that any waste we introduce in the algorithm
08:01
has really large effects at this scale, so we're specifically targeting these hundreds of billions of point sets, and this is not even including the data, so obviously we want to keep things thin wherever we can. So the first thing we do is linearize the tree, and many of you are probably familiar with
08:22
a Z-ordered curve, you can see a space filling curve down at the bottom, and for a fixed depth of a quadtree, it lets you map out those nodes into a one-dimensional structure, it makes it a lot easier to work with, it gets rid of some pointers, basically it removes the edges from the graph
08:42
of your algorithm. So typically that's done per depth, so you go to a fixed resolution depth, and you turn it into that nice grid. What we wanted to do was something different, because we wanted a global linearization, and typically the way you do that is,
09:00
you see these multiple depths here, is that you do zero, one, two, three, and then you do it again at this depth, starting with zero, and you end up with some nice encoding structures, so you can encode these as strings, and you can figure out things about the depth, but the problem is, zero is not equal to zero, zero,
09:20
in that structure, so you can't do math across levels, so these are encodings, not numbers, and what we wanted is numbers, because you can do math on numbers, and that'll come into play when I talk about the querying later. So what we want to do is, we want to attach all these lines together as one, and treat it as a single, infinite array.
09:44
So first, we're gonna assign the numbering, so this is just what I mentioned before, all we're doing is assigning numbering, so we haven't done anything really yet, and this would be an example of how you would traverse this theoretically infinite array, mapping the single dimension to two dimensions,
10:03
so this is a quadtree, so we're talking two dimensions here. So you can see these arrows down at the bottom, and that's the key thing, so these can be considered the edges of that graph, and what we want to do is remove that from our data structure, and the obvious question now is,
10:21
okay, well how do I figure out what that arrow is? How do I figure out where to go next? And it turns out to be quite simple, so with a bit shift and two integer additions, you can figure out the next spot in the tree, so if you're checking a position of that array, and there's already a point there, you can figure out what to check next very quickly.
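One numbering with exactly that property can be sketched as follows: a heap-style, globally breadth-first numbering of a quadtree, where stepping to a child is one bit shift plus two integer additions. This is a minimal illustration of the idea, not necessarily Entwine's exact layout:

```python
# A sketch of a globally numbered, linearized quadtree (an illustration,
# not necessarily Entwine's exact scheme).  Nodes across all depths share
# one numbering: root = 0, and the children of node n are 4n+1 .. 4n+4.
# Stepping to a child is one bit shift plus two integer additions, so an
# occupied slot can be resolved to the next candidate without pointers.

def child(n: int, quadrant: int) -> int:
    """Global index of the given child (quadrant in 0..3) of node n."""
    return (n << 2) + 1 + quadrant          # one shift, two additions

def depth_of(n: int) -> int:
    """How many bisections deep node n sits in the global numbering."""
    d = 0
    while n >= (4 ** (d + 1) - 1) // 3:     # depth d starts at (4^d - 1) / 3
        d += 1
    return d

# The root's children occupy 1..4, their children 5..20, and so on --
# unlike per-depth Morton encodings, these are plain numbers you can do
# math on across depths.
assert [child(0, q) for q in range(4)] == [1, 2, 3, 4]
assert depth_of(0) == 0 and depth_of(4) == 1 and depth_of(5) == 2
```

For an octree the same idea uses a 3-bit shift and eight children per node.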
10:43
And some properties of using this method: here you can see the layered method, and what we're doing is expanding that out into a single one-dimensional structure, and that's great because you get a lot of concurrency benefits; operating on single positions in a long data structure
11:03
is very easy to throw more threads at, for example, whereas structures like this are not quite as easy. It simplifies the data structures, so all you ever have to do is look at a single point in that long data structure.
11:20
On the next slide I'll show you something about point spacing and a lot of things you can calculate from this structure; it goes back to the millisecond response time, and I'll show you some things you can calculate to make that happen. So here's an example of point spacing, hopefully you can kind of see it. On the left we just use
11:41
the first point that goes into a node, and on the right, when a new point comes in, instead of just asking is there a point here already, we swap them if one is closer to the center. And the reason I say this is because a lot of other structures trying to mimic an octree will use buckets of points, and it makes it very difficult to do something
12:02
that needs single-point resolution, whereas with our linearization we can do that trivially. So I'm gonna run through this really quickly, it's just an example of what you can do with this math that I've been talking about. So for example, how do I query this tile at some depth? And you can see that from the global bounds you can bisect twice, so at depth two
12:23
you can figure out those bounds at the nominal depth, which is what I call how many times you have to bisect to get to the query. With that formula I showed you earlier, you end up with 17; the number doesn't really matter, but the cool part is that
12:42
you can trivially calculate all of this at each depth of that hierarchy we saw earlier, that pyramid. So we're querying a 16th of the square in that pyramid, and by continuing to apply that simple two-step formula to the numbers,
13:02
you can get the span of numbers at each depth that are gonna match that query. You can also get the max number of points that'll come back. And the cool thing about that is, well, obviously we don't have infinite arrays, unfortunately, so we have to split the array up at some point, and if you do that splitting intelligently,
13:21
and you use these IDs in your key storage system, you can respond to a query of give me that tile that we saw in the last slide, at depth 10, by using a range query. So select all IDs that match this range, and you magically get only the points that match your query. And that's one example of the things you can do.
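Here is a sketch of why that range query works, assuming a heap-style global quadtree numbering in which the children of node n are 4n+1..4n+4 (an illustration, not necessarily Entwine's exact scheme). The descendants of any node occupy one contiguous index span per level, so a key-value store keyed by these IDs can answer a tile-at-depth query with a plain range select and no tree traversal:

```python
# "Give me this tile at depth D" as an ID range query, under a heap-style
# global quadtree numbering (children of node n are 4n+1 .. 4n+4).  The
# descendants of any node form one contiguous index span at every deeper
# level, so the query reduces to a range select per level of interest.

def descendant_span(n: int, k: int) -> tuple[int, int]:
    """Inclusive span of global indices covered by node n, k levels deeper."""
    lo, hi = n, n
    for _ in range(k):
        lo = 4 * lo + 1      # leftmost child, each level down
        hi = 4 * hi + 4      # rightmost child, each level down
    return lo, hi

# Node 17 here just echoes the number from the talk.  Three levels below
# it there are at most 4**3 = 64 points, one per occupied node:
lo, hi = descendant_span(17, 3)
assert (lo, hi) == (1109, 1172)
assert hi - lo + 1 == 4 ** 3
```

A store supporting range scans can then serve the tile with something like `SELECT ... WHERE id BETWEEN lo AND hi`, repeated per depth.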
13:43
So I mentioned, we don't have infinite arrays, so we have to split stuff. A couple things we learned about that: it's trivial to bound the chunks spatially by using special points along the edge of a Z-order curve, but one thing that we learned is,
14:01
whenever you're doing chunking for an octree, and you're doing the splits, traditionally you bisect into quadrants every level, or into octants every level, and at some point they start to get really sparse, and what you end up with in these giant data sets is that you get hundreds of millions
14:20
of one- or two-point chunks if you keep splitting. But if you can guess the number of points within a factor of four or 16, and you stop splitting there, that's a huge optimization, because you get an order of magnitude fewer files, less IO, better compression, et cetera.
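A toy sketch of that stop-splitting rule shows the effect on chunk counts. The threshold and the perfectly even point distribution here are made-up assumptions purely for illustration:

```python
# A toy model of the stop-splitting-when-sparse optimization.  The 4096
# threshold and the perfectly even distribution are made-up assumptions,
# just to show how chunk counts explode without an early stop.

def num_chunks(count: int, dims: int = 2, min_points: int = 4096,
               stop_when_sparse: bool = True) -> int:
    """Chunks produced by recursively bisecting a region of `count` points."""
    if stop_when_sparse and count <= min_points:
        return 1                             # sparse enough: keep as one chunk
    if count <= 1:
        return 1                             # can't split any further
    children = 1 << dims                     # 4 for a quadtree, 8 for an octree
    return children * num_chunks(count // children, dims, min_points,
                                 stop_when_sparse)

# Stopping early yields orders of magnitude fewer, fuller chunks:
assert num_chunks(1_000_000) == 256                              # ~3,900 pts each
assert num_chunks(1_000_000, stop_when_sparse=False) == 4 ** 10  # tiny chunks
```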
14:41
We have a bunch of other tricks that aren't too important. We ran a lot of data, so it's very heuristically tuned. We have a couple custom libraries for memory pooling. But the important thing is the results. So what do you get from this? Did it work? I'm gonna talk about a bigger set than Iowa
15:00
for the results. So here's the Netherlands set, the AHN2 data. 41,000 files, 640 billion points, so about four times as large, and that one's one and a half terabytes on disk, and almost eight if you decompress it. And here's a shot of the indexing rate. So we got up to 74 billion points per hour.
15:23
Here's some stats for it. We used EC2 and S3, and we reprojected the data, using 28 of Amazon's big instances, and we were done in about nine and a half hours. So each of those instances does about 2.6 billion points per hour.
15:40
And the scaling of that is very linear. So if your desktop is 15 cores and 30 gigs, which would be pretty good, you could expect to get about half of that pace on your desktop machine. So that's just how quickly you can do it.
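The stated figures are mutually consistent, which a few lines of arithmetic confirm (taking the talk's numbers at face value):

```python
# Sanity-checking the stated AHN2 throughput numbers (figures from the
# talk; decimal units assumed).
total_points = 640e9     # points indexed
hours = 9.5              # wall-clock time for the whole run
instances = 28           # EC2 instances used

per_instance = total_points / hours / instances
peak_per_instance = 74e9 / instances        # from the 74 billion/hour peak

assert 2.3e9 < per_instance < 2.7e9         # ~2.4 billion points/instance-hour
assert 2.5e9 < peak_per_instance < 2.8e9    # ~2.6 billion at peak, as stated
```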
16:01
So what can you do with what you get? I'll see if I can play this video. So this is the data set I was just talking about. Obviously we're starting zoomed in very far, so you can see the lines of the laser scanner. So I'll play the video.
16:23
So those points don't have colors, they're XYZ only. These colors right now are being painted on in real time by this renderer. Because we reprojected to Web Mercator, and this is why we reprojected to Web Mercator,
16:42
the renderer is fetching the point cloud tiles and Mapbox tiles, and then overlaying the Mapbox tiles on the points. That was a famous museum we started at; I can't really see it anymore. And just to give you an idea of the magnitude of the amount of data we have here,
17:00
and the different coloring there is because Mapbox coloring isn't consistent across all resolutions. And now we're finally completely zoomed out. So that's 640 billion points in your browser.
17:25
So that's an example of what you can do with this. What you end up with is a bunch of files in that structure, so we built an abstraction on top of it. We mentioned a point cloud service: we built something called Greyhound, which is a simple HTTP server.
17:42
And these are the kinds of queries you can make. So you can say, give me the metadata for this point cloud, for Iowa, for example. And you can get things like the number of points, what point attributes it has, does it have RGB, what coordinate system it's in. And you can make these read queries, and this is what the renderer you were just looking at
18:01
is doing. So it's asking for a bounds, a depth, and that's pretty much it. So it's a simple tiling system that the client can control. So the project that does this building is called Entwine. There's just an example of how you'd run it.
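The two kinds of Greyhound queries just described can be sketched like this. The host, resource name, endpoint paths, and parameter names below are assumptions for the sake of the sketch, not Greyhound's documented API:

```python
# Illustrative construction of the two query kinds described above: an
# "info" request for a resource's metadata, and a "read" request for
# points within some bounds at some depth.  Endpoint paths, parameter
# names, and the bounds values are hypothetical.
import json
from urllib.parse import urlencode

base = "http://example.com/resource/iowa"     # hypothetical server + resource

info_url = f"{base}/info"                     # point count, schema, SRS, ...

params = {
    "bounds": json.dumps([-10796255, 4902360, -10096255, 5602360]),  # made-up box
    "depth": 10,                              # level of detail to fetch
}
read_url = f"{base}/read?{urlencode(params)}"

print(info_url)
print(read_url)
```

The client (the renderer shown in the video) drives the tiling itself by varying the bounds and depth per request.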
18:20
You give it an input, that could be a directory, or you could just give it a file. You give it an output. You can reproject, and you can set the threads, and a few other things that I'm not gonna show. So these are some of the related works. So Entwine is up there. The renderer I just showed the video of is the dynamic version of Plasio. Some of you may be familiar with Plasio,
18:41
where you can view LAS and LAZ files in the browser. So the dynamic version of that is speck.ly, which is what I was just showing you. Entwine uses PDAL for its point data, so it doesn't care about, and doesn't know about, all the types of point clouds,
19:01
but for free we can read them all and write them all, because it's abstracted away; you can watch the PDAL talk for that. Also, some of you may be familiar with Potree, and at the Code Sprint in Paris a few months back, we made a fork of Potree that communicates with Greyhound.
19:22
So when we build one of these indexes on these data sets, you can view it in Potree, you can view it through Greyhound, and that's all I have. So thank you, and I'll take any questions.
19:42
So thank you very much, Connor. Are there any questions or reactions? You don't actually need cloud computing to build an index, do you? You can just run this on your desktop if you want to. Right, that was the example of the CLI invocation I showed.
20:06
Any other questions? Nothing, which proves how clear your presentation was. Okay, that leaves us about 10 minutes to relax, to switch rooms if you want to,
20:22
before we begin our last talk.