
Compress Me, Stupid!



Formal Metadata

Title Compress Me, Stupid!
Title of Series EuroPython 2014
Part Number 77
Number of Parts 120
Author Haenel, Valentin
License CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI 10.5446/20045
Publisher EuroPython
Release Date 2014
Language English
Production Place Berlin

Content Metadata

Subject Area Computer Science
Abstract Compression is a technique to reduce the number of bits needed to represent a given dataset. A very common use case in the distributed digital age is to reduce the size of files in order to reduce the time and bandwidth requirements of sending a file from one location to another. There is a large variety of algorithms and implementations of so-called "codecs", a term derived from the fact that programs implementing a compression algorithm commonly consist of both a compressor and a corresponding decompressor. There are many special-purpose compressors that exploit specifics in the structure of the input data, for example: MP3, Ogg and FLAC for audio data such as music; GIF, JPEG and PNG for images; and MPEG for encoding video. There are also many general-purpose codecs that make no assumptions about the structure of the data, for example: Zlib (DEFLATE), Bzip2 (BWT) and LZMA. However, due to the ever-growing divide between memory access latency and CPU clock speed, a new use case beyond faster file transfers and more efficient use of disk space has emerged: "in-memory compression". Keeping data compressed in RAM means that the CPU has to do more work in order to make use of it; but if the compressor is fast enough, this decompression overhead can pay off, and applications can work with compressed data transparently, without even noticing the slowdown caused by the extra effort of compression/decompression. This technique can be very beneficial in a variety of scenarios where RAM availability is critical. For example, in-memory caching systems like Memcached or Redis could store more data using the same resources, thereby optimizing resource usage. Another use case is compression for in-memory data containers, à la NumPy's ndarray or pandas' DataFrame, allowing for improved memory usage and potentially accelerated computations.
In our talk, we will first explain why we are at a moment in computing history when in-memory compression can be beneficial for modern applications. Then we will introduce Blosc, a high-speed meta-compressor that allows other existing compressors (BloscLZ, LZ4, Snappy or even Zlib) to leverage the SIMD and multithreading framework it provides, helping them achieve extremely fast operation (frequently faster than a plain memcpy call).
Keywords EuroPython Conference
EP 2014
EuroPython 2014
So, welcome to my talk. I'm Valentin Haenel, and I'm going to be talking to you today about a topic called "Compress Me, Stupid!". It's actually about the compressor codec Blosc, which Francesc Alted, who is sitting in the audience, wrote. It's a very fast compressor, and today I'm going to explain a little bit about why it is so fast. Here is all the info: this is my Twitter handle, with three underscores, there is a GitHub URL on here as well, and all of this is licensed under a Creative Commons license.

A historical perspective: let's have a look at the memory hierarchy towards the end of the eighties. At the top we have slow mechanical disks with a very high capacity. In the middle we have the main memory, or RAM, which has a medium capacity and is a bit quicker. And at the bottom we have the CPU registers, which have a very small capacity but are extremely fast, because they sit right next to the central processing units inside the chip, so access is really quick.

How did the memory hierarchy develop throughout the nineties and two-thousands? We get this increasing problem that we cannot access memory quickly enough, so what the chip designers do is build caches into the CPU to accelerate memory access. As the CPUs get quicker and quicker, memory access latency does not improve at the same rate. So we are currently in this situation: a mechanical disk at the top, which is really slow; a solid-state disk in front of that mechanical disk, a caching layer that accelerates access; then the main memory; and then three levels of caches, which as you can see keep increasing in size. This is the problem we have right now, in 2014: we see "starving CPUs". Memory latency improves much more slowly than the processors, so the processors spend a lot of their time just waiting for data; they want to do computation, but they cannot be fed data quickly enough. Memory bandwidth is also improving, at a better rate than memory latency, but still more slowly than the processors.

Somebody actually predicted that this was going to happen: Richard Sites, in his famous 1996 article "It's the Memory, Stupid!". What he basically said is that in the years to come, computational speed will always be limited by memory access. And one of the solutions we are proposing is: if it's the memory, stupid, well then, compress it, stupid! So, now about Blosc: Blosc itself is actually designed for in-memory compression.
If we go back and look at the memory hierarchy, Blosc sits at the interface between the memory and the caches. The idea is that you keep the data compressed in main memory, and you only decompress it inside the caches, inside the CPU. That is how it addresses the starving-CPU problem: if you can move the data to the CPU in a compressed state, you can effectively move more data, because it is compressed, and that accelerates things quite a lot. The great thing about Blosc is that it also works in general-purpose scenarios: it is a really fast codec and it works really well for compressing to disk too, so if you have data sitting in memory and you want to serialize it out to disk, it works for that as well. Blosc itself is written in C, but since this is a Python conference I'm going to be talking a little bit about the Python interface to Blosc.

One of the things that Blosc claims is that it is faster than memcpy. This is a little plot from, I think, 2012; I don't remember exactly how many cores or CPUs this machine had. What you can see is that the flat line in the middle is the memcpy speed, at a few GB/s. When we use Blosc with a modest compression level, like 3 or 4, we can get a compression speed of about 12 GB/s: it is faster to take something and compress it than it is to just copy it in memory. And it is even better for decompression: the decompression speed with six threads can reach up to 35 GB/s, which is actually pretty cool. So, what is Blosc, exactly?
Recently we have been debating quite a lot about what Blosc actually is, and we reached the conclusion that Blosc is a "meta-compressor". That is the correct term to describe it, because Blosc doesn't actually compress anything itself. What Blosc does is take care of cutting the data into blocks, so you have a big memory buffer and it is chopped into individual blocks; it applies certain filters, like the shuffle filter, to precondition the data so that it is easier to compress, reducing its complexity; and it manages the threads. Blosc is a multithreaded library: the data is cut into chunks and each of those chunks is compressed individually. This multithreading and the application of the filters are the two features that allow Blosc to be so fast. Under the hood it uses real codecs; I'll talk about the real codecs that we support a bit later.

The filters and the codecs are applied to each block: this is the blocking technique. Instead of applying the filter to the entire buffer and then compressing the whole buffer, it is actually much quicker to do this block-wise, because a block of data is loaded into the CPU cache, as many operations as possible are performed on that data, and only then is it shifted back out to memory. The thread-level parallelism obviously works on blocks as well.

The shuffle filter, which is the one filter implemented in Blosc at the moment, is basically a filter that reorganizes the bytes within a block: it reorders them by byte significance. What you see on the slide is four 8-byte elements, each drawn from light to dark. Within a block, Blosc reshuffles these bytes: it takes the first byte of each element and puts it at the front of the block, then it takes the second byte of each element and puts those next in the block, and so on and so forth until you get to the last bytes. This is a technique that allows you to reduce the complexity of your dataset.

For example, imagine you have the following array of 8-byte unsigned integers: 0, 1, 2, 3. If you reinterpret that as unsigned 1-byte integers, this is what it looks like in memory: eight zeros, then a 1 and seven zeros, then a 2 and another seven zeros, then a 3 and another seven zeros. That is what it looks like at the byte level. What Blosc does is reshuffle everything, and the result of the shuffle filter is that the bytes 0, 1, 2 and 3 are now at the beginning of the block, and the rest of it is just zeros. One of the things compressors are really good at is recognizing sequences within a data buffer that are very similar, so the compressor will go through here and say: oh, there is just a bunch of zeros here, I can compress that really efficiently. So applying the shuffle filter can work really well for certain kinds of datasets; if you reinterpret the shuffled result as uint64 again, you can see that all of the actual information, all of the complexity, is now in the first element. It works really well for any kind of multi-byte data with small differences between elements, for example time series like financial data, or potentially some neural spike data: anything where you can exploit the similarity between the individual elements, because the corresponding bytes are laid next to each other, creating longer streams of similar bytes that can be compressed really well.
The shuffle filter, and this is very important for speed, is actually implemented with SSE2 instructions, so it takes advantage of the modern vector instructions inside the CPU for the shuffling. But obviously it doesn't work well on all datasets. Have a look at this dataset: again uint64, and you have one very large element, actually the integer maximum. If you reinterpret this as uint8 you can see that there are a lot of high bytes at the beginning: eight bytes of 255. And if you shuffle them, you get 255 and then three zeros, 255 and three zeros, and so on. This is not so good for compression, because you have basically taken a low-complexity dataset and increased its complexity. One of the things that is in the pipeline is to do some auto-tuning, but I'll talk about that at the end of the talk. So much for the shuffle filter.
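The byte-level effect of the shuffle filter on both examples can be illustrated with a few lines of plain Python (an illustrative reimplementation only; Blosc's real filter is the SSE2 C version described above):

```python
import struct

def shuffle(buf: bytes, typesize: int) -> bytes:
    """Byte-shuffle: gather byte k of every element together, for each k."""
    n = len(buf) // typesize
    return bytes(buf[e * typesize + k] for k in range(typesize) for e in range(n))

# The friendly case: [0, 1, 2, 3] as little-endian uint64.
good = struct.pack('<4Q', 0, 1, 2, 3)
print(shuffle(good, 8))   # bytes 0,1,2,3 up front, then 28 zero bytes

# The unfriendly case: one maximal element. A single run of eight 0xff
# bytes becomes eight interleaved 0xff/0x00 groups -- higher complexity.
bad = struct.pack('<4Q', 2**64 - 1, 0, 0, 0)
print(shuffle(bad, 8))    # 0xff followed by three zeros, repeated 8 times
```

The inverse operation scatters the bytes back; Blosc applies the SSE2 equivalent of this per block, not to the whole buffer.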
Now let's have a look at some of the codecs that are under the hood. By default there is a codec called BloscLZ, which is derived from FastLZ. We also have alternative, third-party codecs. There is LZ4, a really fast codec that has been gaining a lot of popularity recently; it is used, for example, in Hadoop and other systems that deal with big data, and it is extremely quick, extremely well designed and extremely simple. There is a high-compression version of LZ4, LZ4HC, that we support as well. We also support Snappy, an LZ77-style codec that has been developed by Google. And lastly we support the venerable Zlib: this was one of the first codecs that gained a lot of popularity; it is a fairly old codec, it is really very prevalent, it exists on essentially any Linux system that you have today, and it is really one of the most popular codecs out there. It has a pretty high compression ratio, but it is fairly slow; as you'll see in a minute, we can actually accelerate Zlib quite a bit with Blosc. Potentially we could also support other codecs; the support just needs to be implemented. Some other ones that I'm thinking about are, for example, LZF, QuickLZ and LZMA; these are just some of the codecs out there that might be possible, but someone needs to go ahead and implement the support for them.

So, Blosc + X, where X is one of the codecs I just mentioned: the first thing it can do is yield higher compression ratios, using the shuffle filter, and it can do faster compression and decompression, using multithreading. That's pretty neat, because you don't have to pay for this: you just activate Blosc, and immediately you potentially get better compression ratios, depending on your dataset, and faster compression and decompression times.

Now about the Python side. Since Blosc is a codec, naturally we have `compress` and `decompress` calls that operate on byte strings, and depending on which call you use, we can also operate on pointers. Some of you might think this is a bit risky or ugly, but certain containers, like for example NumPy arrays, actually expose the underlying pointer to their memory buffer as an integer, and we can use the function `compress_ptr`: we hand it an integer that represents a pointer into memory, and Blosc can do the compression without having to do any memory copies. You basically just say: there is a contiguous block of memory here, please compress it. We have some nice tutorials and some great API documentation, and it is implemented as a C extension using the Python C API.
Let's set up some data. We import numpy and blosc, and then we create a linspace, which basically gives an array of floating-point numbers; we convert it to a string of bytes, and this is the length of that byte string. This is the dataset we are going to operate on. As you can see, it is quite simple in terms of its complexity; I specifically chose this dataset because the shuffle filter works well on it.

So what happens when we use it? We compress with zlib at compression level 9, and it takes 14.7 seconds. Then we compress with Blosc: we say the typesize is 8, because we have 8-byte floating-point numbers; under the hood we use the Zlib codec, so we are actually trying to accelerate Zlib; and we use the same compression level as plain zlib. We can see that Blosc takes 317 ms, which is an acceleration of something like 30 or 40 times. What about the compression ratio? We run it again: the compression ratio of plain zlib is about 1.5, and the compression ratio of Blosc is actually almost 80, so it reduces the data to a tiny fraction of the original size, which is almost 50 times better than plain zlib. As for decompression, Blosc is not that much faster: zlib has to work really hard when compressing, because it goes and looks for redundancies in the data, but decompression is pretty straightforward, so there is no need to do a lot of computation there. Blosc is still faster, though, mainly due to the multithreading.

Now, to demystify this example: obviously Blosc works really well on the linspace dataset, which is why I chose it, and the shuffle filter and the multithreading are essentially what brings these great results. So what happens if we only use a single thread? With Blosc we can set the number of threads, so we say we want just one thread, and we also want to deactivate the shuffle filter; this is the way the Python API for Blosc works. With the same typesize, the same Zlib codec and the same compression level, but with the shuffle filter deactivated, we can see that where it originally took 14-point-something seconds, it now takes 12.9 seconds. So it really was the multithreading and the shuffling that accelerated this. For the compression ratio, we run it again with the shuffle filter off and the number of threads still set to 1, and we can see that what Blosc produces now is actually even a bit longer than what zlib produces. This is essentially due to a certain overhead: each block has to contain a very small header, and these headers add up, so in terms of compression ratio Blosc actually does a bit worse if you run it without the shuffle filter.
What about the other codecs? I said that LZ4 is really fast and that Zlib is slower but has a good compression ratio, so let's take a look at LZ4. There is actually a typo on this slide: this is the original Zlib-based run that I used, at a few hundred milliseconds, and if I run it with LZ4 instead, I get 20.9 milliseconds, so yet another acceleration; LZ4 is really quick. But this comes at the expense of the compression ratio: the Blosc-compressed buffer is now noticeably larger, roughly five times worse than before. The compression ratio really suffers, but that is because LZ4 doesn't aim for a high compression ratio, it aims for high compression speed; that is the trade-off we make there. For decompression it is a similar situation: LZ4 is faster than plain Zlib, to some extent.
Some notes on the C extension, for those of you who are perhaps a little bit familiar with the Python C API. There is this great function, `_PyBytes_Resize`, and what it allows you to do is resize the bytes object after compressing into it. Normally, strings in Python are immutable, but if you have just created one with the CPython C API and haven't yet exposed it to the Python level or returned it, then you can actually resize it after creating it. That is one of the tricks we use inside the Blosc bindings. Also, since Blosc is a multithreaded library, we release the Global Interpreter Lock before compression and decompression. We can do that quite safely, because these are pure C operations and we are not modifying the state of the interpreter during compression and decompression.

For installation and compilation: you can install it using pip, inside a virtualenv say, with `pip install blosc`. The important thing here is that you need a C++ compiler, not just a C compiler. Why does it need to be a C++ compiler? Essentially, Snappy is C++ code, and if you want to compile Snappy you need a C++ compiler. We also have some experimental binary packages, hosted on Binstar, which can be installed with conda; you specify my channel, and currently I have a package built against a recent NumPy on Python 2.7. Perhaps in the future, when Anaconda and Binstar support some kind of build environment where you can dump your package file and it spits out a lot of packages, we might have prepackaged binaries for different versions of NumPy, different versions of Python, and also for Windows and Mac OS X.

Blosc actually uses all of these third-party libraries, so packaging and compilation is actually a little bit troublesome. When I say `pip install blosc`, as a user, I don't want to have to install and compile the Blosc C library myself, or download some git repository and run make and all sorts of crazy stuff; I just want to type `pip install blosc`. So python-blosc ships the sources and compiles them: in order to do that, we actually ship all the sources for LZ4, Snappy and Zlib together with Blosc. In a way this is a bit evil, because it is not really modular, and every time LZ4 changes, which has been happening quite frequently recently, we need to download the sources and embed them into our own repository. So it is quite a lot of effort for the maintainers, but it is really great for the users, because you can just type `pip install blosc` and you are up and running. However, there are certain people who are not going to be happy about this situation, in particular packagers.
What packagers usually want, since Blosc is potentially used by multiple applications on your computer, is some kind of precompiled shared library. So we actually provide the maximum amount of flexibility here. One of the things you can do is compile Blosc using CMake and create a shared library, and that shared library can even link against the existing codecs: for example, if you have a shared library for LZ4 or for Zlib, then it can link against those as well, and if they are not found, Blosc compiles with the shipped sources. For python-blosc we have a similar situation: there is an environment variable you can set that says, there is a precompiled Blosc shared library here, compile against that; or you can just compile everything into the Python module. This is the amount of flexibility that we give people, and I hope this will be quite beneficial for packagers.
There are a couple of projects out there that use Blosc already. The first one is PyTables, which is an HDF5 interface with lots of nifty features; this is where Blosc originates from, the first place where Blosc was used, namely to compress HDF5 files, so PyTables is a little bit of Blosc's home. There is also an application called Bloscpack, which is a simple file format, much simpler than HDF5, with a Python reference implementation. The cool thing about Bloscpack is that it supports NumPy arrays out of the box, so you can compress an entire array using Bloscpack's really neat Python API. It also has a command-line interface, so if you want to use Blosc from the command line, Bloscpack provides this feature: you can use all of the features of Blosc, all the codecs, from the command line. And lastly, there is a very nice library called bcolz, which is essentially an in-memory and out-of-core compressed array-like structure; there is actually a talk about that, which I'll announce in a second.

As for the future, what might be coming? Potentially more codecs.
Also alternative filters: one idea is to have a filter that, for elements of, say, 8 bytes, only retains 4 bytes; in other words, a filter that reduces precision, because you might not need that much precision. Then there is this auto-tuning idea, at runtime: we now have a bunch of different codecs, a bunch of different compression levels, and a shuffle filter that we can turn on or off. These, in a way, are hyper-parameters for the compressor, and the question is: how do I choose the correct codec for my data? One of the ideas we have is that you take the first block, you essentially optimize the settings on that first block, and then you use those settings for the next thousand blocks; then you optimize again, use the result for the next thousand blocks, and so on. This is one of the things we need to deal with, now that we have all this flexibility: we need to figure out what the best choice is. Also, multi-shuffling is an interesting idea: it actually turns out that if you apply the shuffle filter more than once, you don't immediately get the original back, so potentially it might be worth applying the shuffle filter once, twice, three, four or five times. Again, this depends on the dataset and on its complexity, and I'm not really sure yet how that is going to work or how often to apply the shuffle filter, but it is definitely an idea worth looking at. And we would also like to have a Go implementation, among other languages.

So how can you help? If any of you are interested in Blosc, or curious about whether it might accelerate compressing your own datasets, you can run the benchmarks on your hardware: if you have very new hardware, big compute nodes, old compute nodes, or high-performance machines.
Recently we received some benchmarks from a Blue Gene, for example. Please do run the benchmarks on your hardware and report the results; there is a page on the website, synthetic-benchmarks.html, which has instructions on how to run these benchmarks. Also, if you are interested, please do try to incorporate Blosc into your applications, and if there are any features you need or any questions you have, please do get in touch.

A quick advertisement: Francesc is going to be talking about out-of-core columnar datasets, and that is on Friday, in two days, in C01, the big room upstairs. In his talk he is going to be showing the recent advances in bcolz, this array-like, compressed, in-memory structure, and he is going to show that we can actually do queries on compressed in-memory data faster than pandas, which is pretty neat. Also, there is the PyData Berlin satellite event: there is going to be a talk about data-oriented programming from Francesc on Saturday, and I'll be talking about how to serialize NumPy arrays with Bloscpack on Sunday at 11.

For getting in touch, this is the last slide: we have our main website at blosc.org, there is a GitHub organization that contains all the projects we are currently working on and the benchmarks that we have, and there is a Google group that you can use to get in touch. If you have any questions or ideas, please do get in touch; the slides for this talk are available from GitHub. Thank you.

[Q&A]

Q: Is it possible to decompress only part of a compressed buffer?
A: So the question was whether it is possible to decompress part of a compressed buffer; the example was that there are compressed cat pictures and you want to extract exactly the white furry cat with the nice look on its face. The way Blosc works, as I explained, is that it has these blocks under the hood, and if you know where your element is, what you can do is go in and decompress just that particular block. That is not exactly the element you want, but it is a very small part which does contain the element, so the overhead is fairly small: you don't need to decompress the entire thing, you can just decompress the block to retrieve the element you are interested in. Blosc actually has an API for doing that.

Q: Isn't the `_PyBytes_Resize` trick an implementation detail? Have you tried running it on PyPy?

A: I haven't tried to run it on PyPy. As for `_PyBytes_Resize`: it starts with an underscore, so that should answer the question.

Q: With regard to storing the data on disk, how stable is the compression format?

A: To repeat the question: with regard to storing data on disk, how stable is the compression format? First, Blosc itself is stable and will continue to remain stable: we don't have any plans for introducing backwards-incompatible changes anywhere in the near future, so in terms of the public API it is pretty much feature-complete. As for the compression format, it depends on how you are going to serialize your data to disk. For example, for NumPy arrays, not only do you need to store the compressed data buffer, you also need to store the metadata, like the shape of the array, the typesize, and whether it is in C or Fortran ordering. Bloscpack does this: it has a metadata section where you can embed the NumPy metadata, and that format is fairly stable, although I wouldn't guarantee it. If you want to use Blosc from, for example, PyTables in HDF5 files, I would say it is pretty much stable.

Q: What would need to be done to support PyPy?

A: Specifically, what would need to be done is to write CFFI bindings for it, from what I understand. Yes, this is something that I've thought about in the past, and I would say that is something we would be interested in.

Q: Does the Python API have the ability to decompress individual blocks?

A: No, the Python API doesn't have this capability at the moment.
Q: Can NumPy functions consume Blosc-compressed data directly?

A: The question is whether NumPy functions can consume compressed buffers, so basically whether you have some kind of evaluation capability on Blosc-compressed buffers, which is perhaps a way to rephrase it. I don't think so, not right now. For that, the answer is to look at bcolz, which has an evaluation engine: the idea there is that you have in-memory compressed blocks and then you can operate on those blocks with numerical functions, for example.

Q: Blosc clearly improves throughput, but does it also improve latency?

A: No, because the latency is defined as the time from requesting some kind of memory access until the data arrives at the CPU, and Blosc cannot change that.

Q: Is there a minimum data size below which there is no point in compressing?

A: The answer is that it depends on the way you access your data: if you access it sequentially, then it makes sense to have big blocks; if you access your data in a random fashion, then maybe not. Having said that, I think there actually is a minimum size for compressed buffers, which is an implementation-dependent thing. I am not sure about the exact lower bound, but if it is below something like 8 kB, then I think you may not gain from compression; we would have to look into the specifics.

Q: HDF5 also has a shuffle filter; is it comparable to the one in Blosc?

A: I don't know exactly how the HDF5 shuffle filter works, but my guess would be yes. [From the audience, repeated for the video stream:] Conceptually the two shuffle filters are the same, but the one in Blosc is faster because it uses SSE2 instructions under the hood.



