LERC, an innovative compression algorithm for raster data
Formal Metadata

Title: LERC, an innovative compression algorithm for raster data
Number of Parts: 351
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/69208 (DOI)
Production Year: 2022
Transcript: English (auto-generated)
00:00
Awesome. All right. My slides are 4:3, that's my issue, that's why they are smaller. It's not a technical issue, everything's fine. Yeah, hi. It's nice to be speaking in front of people again. I'm also quite excited. So I will talk to you about LERC. Oh, my slides moved. I can take over now. Let's see. Yes, that works.
00:21
Hi, I'm Hannes, Hannes Kröger from Hamburg, Germany. I work at WhereGroup. We are a medium-sized GIS shop in Germany, doing FOSS GIS development, consulting, training, and so on. And yeah, I'm a developer, expert, and trainer as well. I'm also super professional. So I'm on GitHub as Hannes and cartocalypse on Twitter.
00:42
And that's it for the introduction and advertisement for us. The rough agenda for today: well, I want to talk about some compression algorithms, so it's quite nerdy. I will briefly talk about what raster data is again, how and why we want to compress it,
01:02
and then go into this LERC, or L-E-R-C, algorithm. I'll try to explain how it works, how it performs, and then try to give you some conclusions. And we will see how that works out in the end. Right, so raster data. I guess you all know raster data, right? It's usually some kind of an image.
01:22
Or you could also think of a table as some kind of raster data. Here I put some random values between 0 and 50. Just put them in some square, five by five. It's kind of a matrix. So yeah, that's a raster image. That's a raster data set, right? We could add some geo-referencing. So we could turn it into a geo-spatial raster
01:42
by adding some kind of geo-referenced point to that raster: where is this located, what's the cell size, the rotation, and so on. And then we have a normal geo-spatial raster data set. You know them, for example, from terrain elevations, or wind speeds,
02:01
or aerial imagery. These are all raster data sets usually, or in many cases at least. And you probably know they get quite big, so it's always a good idea to compress them. Right, they might be GeoTIFFs, JPEGs, whatever. I made a funny German flow chart some time ago that says: yes, you should always compress your rasters.
02:22
Let's just quickly go through it. So if you have data where the actual values are important — imagine a zip file: you zip a text and it comes out with missing text, that's not good, right? For geodata you also sometimes need the exact values. If you still need them, you should compress your data losslessly, right?
02:42
You don't want any information loss. But if your data is something that's usually looked at more visually, so the really exact values are not important, just use some lossy compression. What I want to say is: always use compression. It makes sense, that's the main gist of this.
03:04
If you want to compress, there are several choices of algorithms, and now it gets nerdier and nerdier as time goes on. I won't go into too much detail here, don't worry. Pretty much the most common algorithm family
03:22
are the LZ-based algorithms, whatever that means. The idea is that the algorithm looks through your data and tries to find recurring values. So if the sequence one, two, three, four, five appears here and here and here and here, it will say: oh, I found it four times, let's just store it once and make a reference to the other occurrences,
03:41
and by that it compresses your data. There are several algorithms for that. Some are old, some are new, some are fast, some are slow, some compress better, some worse. But they are lossless algorithms. You can use them in GDAL, for example. And they have different kinds of applications.
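As a rough illustration of that LZ idea (not from the talk, just a sketch): Python's standard zlib module implements Deflate, an LZ77-based lossless codec, and a repetitive sequence like the one described shrinks dramatically:

```python
import zlib

# A very repetitive byte sequence, a bit like a raster full of recurring values.
data = bytes([1, 2, 3, 4, 5] * 1000)

compressed = zlib.compress(data)        # Deflate: LZ77 back-references + Huffman coding
restored = zlib.decompress(compressed)  # lossless: we get exactly the same bytes back

print(len(data), "->", len(compressed))  # 5000 bytes shrink to a few dozen
assert restored == data
```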
04:02
In the end, these are nice algorithms. They're fast, quite effective. But they have existed for a long time already. There are other algorithms like RLE-based, so run-length-encoding-based algorithms. I'll come back to that later. There the idea is: if you have a chain of characters like one, one, one, one, one, one,
04:21
you say: hey, it's ten times a one, right? You don't write ten ones, you just say ten times one. Much easier — that's immediate compression. These algorithms usually look at the data really naively. They don't know about spatial relationships. They don't know whether it's a 2D or even 4D or whatever data set it may be.
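A minimal run-length encoder, just as a sketch of that idea (this is not how any particular raster codec implements it):

```python
def rle_encode(values):
    """Collapse consecutive repeats into (count, value) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1        # extend the current run
        else:
            runs.append([1, v])     # start a new run
    return [tuple(run) for run in runs]

print(rle_encode([1] * 10 + [7, 7, 3]))  # [(10, 1), (2, 7), (1, 3)]
```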
04:43
So they just look at the linear row, the sequence of values, which works well, but it might work better if they were smart about spatial relationships. Of course, there are lots of other algorithms like JPEG, WebP, some wavelet-based ones and so on,
05:02
which work completely differently, but these are what I want to compare LERC to — these other algorithms, or rather the competitors of this LERC algorithm. That's enough for a rough and basic jump into raster data, right? It's some values in some kind of matrix. We want to compress it,
05:21
and there are already several algorithms that exist, and now there's a new algorithm. LERC stands for Limited Error Raster Compression. So raster compression: it's compression for raster data. And this limited error means that there's a controlled loss, so we as users can decide:
05:41
oh yes, please compress this; you're also allowed to lose some data, but please keep it accurate up to this decimal. So like: please keep the centimeters but throw away the nanometers. That's the idea of LERC. And if you say: no, you may not introduce any error, it's lossless in the end.
06:02
LERC is fast. LERC works for any data type: you can use bytes, you can use integers, you can use float values. And the main idea was to use it for floating point data, high bit-depth floating point data. And it processes the data in 2D chunks, right? So it knows about the neighbors in the other lines,
06:22
if you look at it like a book. LERC was developed by the well-known and beloved Environmental Systems Research Institute. You might know them as ESRI. I'm not from ESRI. Luckily, they recently open-sourced it; it's patented, but it's under the Apache license,
06:42
so we can use it in our open source projects perfectly fine. It's not incubated in OSGeo, but it's on the website as well. And thanks to GDAL, it's also available in all our tools, at least the ones that use GDAL, which many do, of course — QGIS and so on. And you can use it in GeoTIFF, for example,
07:01
or in the Meta Raster Format (MRF). It's an older algorithm: I think it has been available in ArcMap for 10 years, but it only came into GDAL in the last, I think, two years, so it's quite new there, and I thought it's interesting, let's make a presentation on it. I'm not affiliated with and not paid by ESRI. I'm just here because I like this kind of stuff.
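A minimal sketch of what that looks like with the GDAL Python bindings, assuming a GDAL build with LERC support (roughly GDAL 3.3 or newer for the GeoTIFF driver) and purely hypothetical file names; COMPRESS=LERC and MAX_Z_ERROR are the creation options GDAL documents for this:

```python
from osgeo import gdal

gdal.UseExceptions()

# "dem.tif" / "dem_lerc.tif" are placeholder names for an input DEM and its compressed copy.
gdal.Translate(
    "dem_lerc.tif",
    "dem.tif",
    format="GTiff",
    creationOptions=[
        "COMPRESS=LERC",     # LERC codec inside the GeoTIFF container
        "MAX_Z_ERROR=0.01",  # allow up to +/- 1 cm per value; 0 would make it lossless
        "TILED=YES",
    ],
)
```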
07:24
There are lots of sources, it's complicated, and the last line is important, right? I just tried to figure out how it works. I read the patent, I read some documentation, which is so-so. I'm not a super programmer or anything, so I might not have gotten some things correctly. If in doubt, just look into it yourself,
07:42
or if there are additions or corrections, you're welcome to mention them after the talk. But let's look into how it works, because I found it quite interesting. It's a new kind of idea for compressing this kind of data. So in the first step, LERC looks at the raster file
08:02
and tries to use the spatial relationships. It tries to find areas with low variance — if it were elevation data, say a lake or a flat plain, where all the values are pretty similar. You might also have mountains and ridges, and there the variance would be quite high.
08:21
LERC also uses RLE later on, by the way. So it tries to look at the raster and divide it into areas whose values are already kind of similar to each other, and then store, for example, a big lake in one big block. These blocks are then looked at independently, and LERC tries to find out how it can compress the bytes
08:41
in this block perfectly. If you know GeoTIFF: these are not GeoTIFF tiles, that's one level above. For LERC, each tile is kind of a raster dataset of its own. So let's look at one of these blocks. This is a four by four block, right?
09:01
Four by four pixels, or values rather, and they are all around 1,200 units. Let's just say it's terrain elevation, so it's meters, and let's say we don't care about the millimeters, we just want centimeter accuracy.
09:23
So we actually have two decimals too many here, right? We could throw them away and it would be fine, we could still work with it. You also see some empty cells. We might look at them later if I manage the time, maybe not. So in the first step, as I said, LERC looks at one of these blocks. It has already tried to find blocks that are pretty similar,
09:42
and now it tries to make the storage needed for these values as small as possible. In the first step, it looks at the minimum value, or rather tries to find a value that it can subtract from all the others, which we'll see in a moment. Just remember here: it's like four digits before the dot,
10:02
four behind the dot. We find the minimum value, and now we can subtract that value from all the other values. So the minimum itself becomes zero, and the other values become quite small already, now that they are relative to each other. And we already get some kind of compression,
10:20
if you want to think about it that way. We can just restore the values by adding the minimum again, so it's a reversible operation. This is nothing special, this is done in many other algorithms as well, but now comes the magic of LERC. Let's say we decided that there may be an error
10:40
of plus or minus one centimeter, so a maximum error of 0.01. LERC now takes these values, applies a formula and divides them by two times the maximum error — there's some math behind it. You see that the values change again;
11:00
they become quite big, but it's just an intermediate step, so it doesn't matter. In the next step, LERC takes these values and rounds them to the nearest integer. And this is the magic inside this algorithm. I can't explain it well, you just have to believe me. This is the step where the loss happens: because we first divided by this maximum error
11:23
and now round to an integer, we introduce this loss, and now we have integer values, which is important. Before, we had floating point numbers, and if you know computers, those can take a lot of space. Integers are easier to handle, and that's what the next step builds on.
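A small sketch of that quantize-and-restore step with made-up elevation values around 1,200 m and a 1 cm maximum error; it follows the description in the talk (subtract the block minimum, divide by two times the max error, round), not ESRI's actual reference implementation:

```python
import numpy as np

block = np.array([1200.523, 1200.48, 1200.719, 1201.041])  # made-up elevations in metres
max_error = 0.01                                            # allow +/- 1 cm

# Quantize: shift relative to the block minimum, scale by 2 * max_error, round to integers.
minimum = block.min()
quantized = np.round((block - minimum) / (2 * max_error)).astype(np.uint32)

# Dequantize: undo the scaling and the shift.
restored = quantized * (2 * max_error) + minimum

print(quantized)                        # small non-negative integers, e.g. [ 2  0 12 28]
print(np.abs(restored - block).max())   # the worst deviation stays within the allowed 0.01
assert np.all(np.abs(restored - block) <= max_error)
```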
11:42
The numbers are big, but that's not an issue. Now we think like computers. So we look at the numbers as if they were binary, zeros and ones. And maybe you see it already, right: there are lots of zero, zero, zero, zero, zero and then come some ones.
12:00
If we talk like Germans, I could say: hey, you owe me zero, zero, zero, zero, one, two euros, and you would give me 12 euros, right? The zeros are redundant, we don't need them. But computers are kind of weird. They store the data in bytes, and there's, well, byte alignment and so on. So in this case, we have 32 bits per value.
12:21
So it's stored like that. Maybe we can throw away the zeros. And that's now possible because we turned the values into integers. We can now look at these values and try to find out
12:41
how many zeros we can throw away from each of them — how many leading zeros are really redundant and unnecessary for us. And for that, we check the biggest number, right? The biggest number needs the most bits of space. So we find out it's the one in the left column. We can also check it in the binary view again. So you see it's like 12 characters,
13:02
if you want to look at it that way, or 12 bits. And all the zeros in front of that, we don't need them. We can just clip them away, and you see the slide got much emptier, less data on it, and now we are happy. So first, we made the values relative.
13:20
Then we introduced this magic lossy function. Then we got integers and then we could throw away stuff that wasn't needed anymore. And that is now just put together into one big string of bits and then we are done basically. If we look at the step before we threw away the zeros, this could be like the representation of this block.
13:42
So almost half the slide. After this magic step, we just have way fewer ones and zeros. I tried to make it a bit visual. And we got about 40% of the original, just by applying this step-by-step process to this.
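A sketch of that bit-width idea with made-up quantized values: the largest value in a block determines how many bits every value needs, and packing only those bits is what shrinks the block (the real LERC bit-stuffing is more involved than this):

```python
import numpy as np

quantized = np.array([2, 0, 12, 28, 3071], dtype=np.uint32)  # made-up quantized block values

# The biggest value decides the bit width for the whole block.
bits_needed = int(quantized.max()).bit_length()   # 3071 needs 12 bits

naive_size = quantized.size * 32                  # stored as plain 32-bit integers
packed_size = quantized.size * bits_needed        # stored with only the bits that matter

print(bits_needed, naive_size, packed_size)       # 12, 160, 60 bits
```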
14:01
Yeah, so yay, we have compressed our data. That's basically it. There's lots more to it, of course, but there's this nice idea to divide by the max error, then round, and then throw away the unnecessary zeros.
14:20
That's basically the idea used here. There's lots on top: LERC has like three or four versions, I'm not sure. So they do lots of other stuff as well to make it more optimal, to find the optimal blocks and so on, but basically that's it.
14:42
So what does it look like? Let's skip this one because I don't have the time here. I applied a very, very aggressive compression to this example. So the left side is some elevation data uncompressed, and on the right side I made a very, very
15:01
compressed version of it. I said: you can go to 10 meters of accuracy, I don't care what it looks like, I don't care how bad it is, just to see what actually happens. And you see there are some kind of plateaus, right? There are fewer values, fewer distinct values, and you might also see some kind of small blocks
15:21
in some corners, like here — some square areas. Basically, LERC tried to look at these blocks and, within these blocks, adjust the values so they are kind of similar to each other. And if I say: yeah, you can throw away a lot of information, it makes them very similar, so you get artifacts like these.
15:41
As I said, this is an extreme example, you shouldn't do that; LERC is perfect if you use it with an error that matches your data. We can also look at this in an elevation profile. So here I used a DEM of Germany with one meter cell size but centimeter elevation accuracy, and looked at a dike
16:02
on the northern coast. There is a green line, but you can't see it because there's also a blue line in front of it. The blue line is at an accuracy of one centimeter, so it matches the original pretty well. There's a pink line, which you probably cannot see — nah, maybe a bit — at something like 10 centimeters. And I also made a line,
16:22
or rather a compression, at one meter of accuracy, and you see these plateaus again, right? So you need to be really careful with your data if you want to do some hydrology or something like that. You should of course use a value that matches the needed precision of your data.
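If you want to check this on your own data, one simple sanity check (a sketch with hypothetical file names) is to read the original and the LERC-compressed raster back with GDAL and look at the largest deviation:

```python
import numpy as np
from osgeo import gdal

gdal.UseExceptions()

max_z_error = 0.01  # the MAX_Z_ERROR that was used when writing the compressed file

# Hypothetical file names: the original DEM and its LERC-compressed copy.
original = gdal.Open("dem.tif").ReadAsArray().astype(np.float64)
compressed = gdal.Open("dem_lerc.tif").ReadAsArray().astype(np.float64)

# Every pixel should stay within the error bound we asked for.
worst = np.abs(original - compressed).max()
print(f"largest deviation: {worst:.4f}")
assert worst <= max_z_error
```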
16:42
Again, don't look at this and think LERC is bad. I just made these to really show you what happens in the background. Right, you can't talk about algorithms without looking at benchmarks. I managed to not do many benchmarks myself; I just looked up other people's benchmarks
17:02
and did one small one. Esri of course did some to show how great LERC is; theirs are six years old, and they only compared against LZW, a very basic algorithm. LERC is very fast — even faster than LZW in some cases — and it can produce smaller files than even this basic algorithm in some examples.
17:24
Koko Alberti did some nice benchmarks in his article on GeoTIFF compression, and mostly he said: yeah, it's nice, try it, use it, it's fast and it makes your files very small. And then there's a very nice blog post on gpxz.io, which is useful not only as a LERC example.
17:43
It's very useful. He also did a very recent benchmark and likewise came to the conclusion that LERC is a very good competitor to the other algorithms. So yeah, check it out, try it, and definitely use it if you have data with very high precision,
18:01
like float32, float64. So you have, I don't know, femtometers if it were elevation. LERC works very well in that case, if you say: it's okay if I keep the millimeters, I don't need the rest. Yeah, my tests were a bit varied, I just included one here. So I just used GDAL, standard levels, predictors,
18:23
all the usual stuff. And I took two data sets and used LERC at the actual precision of the data. I had a DEM of Hamburg, which has a precision of about one centimeter. And you can see at the bottom here I used a maximum error of half a centimeter,
18:42
so it was allowed to move within that precision. And I just listed the sizes; I didn't care about speed here, I just wanted to get the smallest files possible. And you can see that in this case LERC easily won. It made the file very small — smaller than Deflate, smaller than Zstandard. So a very nice example here.
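A sketch of how such a size comparison could be reproduced with the GDAL Python bindings — hypothetical file names, GeoTIFF creation options as documented by GDAL (ZSTD and LERC both need a GDAL build with the respective libraries), and the half-centimeter MAX_Z_ERROR mirroring the setting above:

```python
import os
from osgeo import gdal

gdal.UseExceptions()

# Compare a few GeoTIFF compression settings by output file size.
candidates = {
    "deflate": ["COMPRESS=DEFLATE", "PREDICTOR=3"],
    "zstd": ["COMPRESS=ZSTD", "PREDICTOR=3"],
    "lerc": ["COMPRESS=LERC", "MAX_Z_ERROR=0.005"],  # allow +/- half a centimetre
}

for name, options in candidates.items():
    out_path = f"dem_{name}.tif"
    gdal.Translate(out_path, "dem.tif", format="GTiff", creationOptions=options)
    print(name, os.path.getsize(out_path), "bytes")
```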
19:02
You can also apply extra compression on top of that if you want, but that's a detail. The other example is some Landsat scenes, so byte data, which has a precision of one, if you want to look at it that way. It's just RGB bands with values between zero and 255.
19:20
And even there, LERC is not far away from the others. There's not that much compression going on, or that much difference between the algorithms, but LERC is right there with the others. So yeah, it's not something you should ignore. You should try it and have fun with it. Right, that's basically it.
19:41
A quote from Esri, right: you need to do your own benchmarks, you need to test your own data. I tell you: try it. It's a nice thing. Right, this was quick and it was complicated, there were lots of numbers in it.
20:00
I hope it was interesting. My conclusion is: it's an interesting algorithm, it's worth trying, you can play around with it. It would be great if the documentation were better. And yeah, if someone's sitting here thinking, oh, that was totally wrong, please correct me right now. I'm not here to be right, I want to share some ideas and stuff, so feel free not only to ask questions
20:22
but also to add corrections. And that's it. Thank you very much.