
New Trends In Storing Large Data Silos With Python


Formal Metadata

Title
New Trends In Storing Large Data Silos With Python
Part Number
38
Number of Parts
173
Author
Francesc Alted
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor, and the work or content is shared, also in adapted form, only under the conditions of this license.
Production Place
Bilbao, Euskadi, Spain

Content Metadata

Abstract
Francesc Alted - New Trends In Storing Large Data Silos With Python. My talk is meant to provide an overview of our current set of tools for storing data and how we arrived at them. Then, in light of the current bottlenecks and of how hardware and software are evolving, it provides a brief overview of the emerging technologies that will be important for handling Big Data within Python. Although I expect my talk to be a bit prospective, I certainly won't be trying to predict the future, but rather offering a glimpse of what I expect we will be doing in the next couple of years to properly leverage modern architectures (barring unexpected revolutions ;). As an example of a library adapting to recent trends in hardware, I will be showing bcolz, which implements a couple of data containers (and especially a chunked, columnar 'ctable') meant for storing large datasets efficiently.
Transcript: English (auto-generated)
Good evening. As Mirene said, I would like to share my views with you on how to store and analyze large data silos, taking into account how modern computer architectures are evolving. So, a few words about me: I am a physicist by training, a computer scientist by passion, and I do believe in open source.
The proof is that I have spent a long part of my life doing open source development. Probably the project I invested the most in is PyTables, where I spent almost 10 years. My current pet projects, though, are Blosc and bcolz, and I am going to talk quite extensively about those during this talk. So, why open source? Well, in my opinion there is a big duality between dreams and reality.
Many times we, the programmers, think that some things can be improved, right? The challenge is to find the time to actually implement them. I am of the opinion, or rather I share the opinion of Manuel Ultra, for example, that the art is in the execution of an idea and not in the idea itself, because there is not much in an idea alone. So open source allowed me to implement my own ideas, and it is a nice way to fulfill yourself while helping others as well.
So, first of all, I am going to introduce the need for speed, because the goal is to analyze as much data as possible using the resources that you have. Then I will talk about new trends in computer hardware, because I think that watching the evolution of computer hardware is very important when designing your data structures and your data containers. And I will finish by showing you bcolz, which is one example, just an example, of data containers for large datasets that follow the principles of these forthcoming computer architectures.
Okay, so why do we need speed? Let me remind you of the main strengths of Python. I think one of the most important things about Python is its rich ecosystem of data-oriented libraries; most of you will know NumPy, Pandas, scikit-learn and a lot of other libraries. Python also has a reputation for being slow, but probably most of you know as well that it is very easy, or at least feasible, to identify the bottlenecks in your programs and then use or write C extensions in order to reach C performance, using excellent tools like Cython, SWIG, f2py or others.
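To illustrate that first step, identifying the bottlenecks, a minimal profiling sketch with the standard library might look like this (the function name is just a made-up example, not from the talk):

    import cProfile
    import pstats

    def slow_filter(data):
        # A pure-Python loop: a typical candidate for rewriting as a C extension
        return [x for x in data if x % 7 == 0]

    cProfile.run("slow_filter(range(2_000_000))", "profile.out")
    stats = pstats.Stats("profile.out")
    stats.sort_stats("cumulative").print_stats(5)  # show the top 5 hotspots

Once the hotspot is known, a tool like Cython can compile just that function to C speed.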
But for me in particular, the most important thing about Python is interactivity: the ability to interact with your data and see the results of your filters and queries almost in real time. This is the key thing about Python, for me. But of course, if you want to handle big amounts of data and you want to do that interactively, you need speed, right? Otherwise it is a no-go.
But designing code for storage performance depends very much on the computer architecture, and that will be the main point of my talk today. In my opinion, existing Python libraries need to invest more effort in getting the most out of existing and future computer architectures.
Also, let me be clear about the scope of my talk: I am not going to talk about how to store and analyze data in big clusters, or farms of clusters. In my opinion, that is not exactly the niche of Python. The real workhorse of Python is big servers, maybe, but especially and mostly laptops. A lot of people are using Python on their own laptops, and my goal is to help them work with more data on laptops or big servers.
But optimizing for laptops or servers is not a trivial task, because modern laptops and modern servers are very, very complex beasts. We have to understand how the architecture is designed, how memory is accessed, how the different caches work, a lot of things. So let's have a look at the current architectures and see how they should be leveraged in order to design new data structures.
The new trends in computer architecture are mainly driven by nanotechnology. I think it is very interesting to see how Richard Feynman predicted the nanotechnology explosion almost 50 years ago, so I think it is worth checking out that talk.
Anyway, I think the most important thing about memory architecture nowadays is the difference in speed, the evolution of memory access time versus CPU cycle time. We know that CPUs are getting faster and faster; in fact their speed has grown almost exponentially, following Moore's law. But in contrast, memory speed is increasing very, very slowly. This is creating a big gap, a big mismatch, between CPU speed and memory speed, and it is a key driver in the evolution of computer architectures. If we look at that evolution, we can see that in the 80s, for example, the memory architecture of computers was very simple: just a couple of layers, memory and the mechanical disk. Then, in the 90s and 2000s, vendors realized this mismatch between memory and CPU speed, and they started to introduce two additional levels of cache in the CPUs. And nowadays, in this decade, it is usual to have up to six layers of memory. So this is a big change of paradigm, and programming for a machine from the 2010s is not the same thing as programming for a machine from the 80s. So, in order to understand how we can better adapt to the new architectures, it is important to know the difference between reference time and transmission time. Let me explain. When the CPU asks for a block of data in memory, the time from the CPU request until the memory starts to transmit the data is called the reference time; others call it latency as well. And the time from when the transmission starts until it ends is called the transmission time. The thing is, if you have a big mismatch between reference time and transmission time, you are not accessing your data in an optimized way. So the interesting idea is that the reference time and the transmission time should be more or less of the same order.
But of course, not all storage layers are created equal. For example, memory has a reference time of typically 100 nanoseconds, and we can transfer up to one kilobyte in that amount of time. For solid state disks, where the reference time is around 10 microseconds, we can transfer up to four kilobytes in the same time. And for mechanical disks, where the reference time is typically around 10 milliseconds, the transmission time allows you to transfer up to one megabyte. So the thing is that the slower the media, the larger the block that should be transmitted in order to optimize the access.
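As a back-of-the-envelope check of those figures, here is the effective bandwidth you get when the transmission time matches the reference time (the numbers are the ones quoted in the talk, not measurements):

    # Block sizes that roughly balance reference time (latency) and
    # transmission time, using the figures quoted above.
    layers = {
        # name: (reference time in seconds, balanced block size in bytes)
        "RAM": (100e-9, 1024),
        "SSD": (10e-6, 4 * 1024),
        "HDD": (10e-3, 1024 * 1024),
    }

    for name, (t_ref, block) in layers.items():
        bandwidth = block / t_ref  # bytes/s when transmission time ~ reference time
        print(f"{name}: ~{bandwidth / 1e6:.0f} MB/s with {block}-byte blocks")

That gives roughly 10 GB/s for RAM, 400 MB/s for SSDs and 100 MB/s for mechanical disks, which is why slower media want bigger blocks.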
And again, this has profound implications on how you access storage, as we will see soon. Let me finish this part with some trends in storage. The clear thing is that, as we have seen, the gap between memory and permanent storage, hard disks, is large and still growing. That means that, in order to fill this gap, vendors are not just creating SSD devices with the same interfaces as typical hard disks; they are starting to put solid state memory on buses like PCI, and new protocols and specifications are being introduced in order to bring all this solid state memory into laptops. So on your own laptop you will be able to access solid state memory at PCI speeds, which is very different from accessing solid state disks via the traditional SATA bus. The trends on CPUs are that we are going to see more cores, of course, and wider vectors for single instruction, multiple data (SIMD) processing. And we are already seeing the integration of GPUs and CPUs on the same die. These are the trends that we should keep in mind when designing our new data containers.
So what I am going to do now is show you an example implementation of data containers that leverages these new computer architectures. bcolz is a library that provides data containers that can be used in a similar way to NumPy, Pandas, DyND or others. In bcolz, data storage is chunked, and every chunk can be compressed. And there are two flavors: one is carray, which is meant to host homogeneous types and multidimensional data, and the other is ctable, for heterogeneous types stored in a columnar way.
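To make this concrete, here is a minimal sketch of creating both flavors; the sizes and compression parameters are illustrative only:

    import numpy as np
    import bcolz

    # carray: a chunked, compressed container for homogeneous data
    a = bcolz.carray(np.arange(10_000_000),
                     cparams=bcolz.cparams(clevel=5, cname="lz4"))
    print(a.nbytes / a.cbytes)  # compression ratio actually achieved

    # ctable: a columnar container for heterogeneous records
    t = bcolz.ctable(columns=[np.arange(1000), np.linspace(0.0, 1.0, 1000)],
                     names=["i", "x"])
    print(t["x"][:5])  # fetching one column touches only that column's chunks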
I am going to skip some slides because I am a little short on time; don't worry, the important thing I want to convey is the consequences of using these containers.
So I am not going to explain in great detail the difference between contiguous and chunked storage. The only important thing is that chunking is nice because it allows efficient enlarging and shrinking, it makes compression feasible, and, in addition, the chunk size can be adapted to the storage layer. Do you remember that, depending on the storage layer you are going to use, the chunk size should be different, right? So chunked storage allows you to fine-tune the chunk size for your own needs. It has other advantages, like appending being much faster: you don't need a copy when you do an append operation on a bcolz container, and less memory travels to the CPU. Also, the table container implemented in bcolz is columnar, and columnar means that the data in a column are adjacent in memory.
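For example, here is a sketch of tuning the chunk length and appending in place; the chunklen value is an arbitrary choice for illustration:

    import numpy as np
    import bcolz

    # Pick a chunk length suited to the target storage layer
    a = bcolz.carray(np.zeros(0, dtype="i8"), chunklen=64 * 1024)

    for _ in range(100):
        a.append(np.arange(10_000))  # appends fill or add chunks; no full copy
    print(len(a))  # 1000000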
So when you want to fetch a record or a column, the only information you need to transfer is the relevant data. In a table stored in a row-wise fashion, if you are interested in one column, the int32 one for example, you are going to transfer much more data to the CPU, just for architectural reasons; this is how a computer works right now. In a column-wise table, if you are interested in just one column, you grab only that column and transfer it into the cache. So that, again, means less memory travels to the CPU.
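As a sketch of what that enables in bcolz (the column names are invented for the example), a query only scans the columns it actually references:

    import numpy as np
    import bcolz

    t = bcolz.ctable(columns=[np.arange(1_000_000), np.random.rand(1_000_000)],
                     names=["id", "price"])

    # Only the "price" and "id" columns are read; other columns stay untouched
    hits = [row.id for row in t.where("price > 0.999")]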
Also, why compression? Well, the first thing is that it allows you to store more data, either in memory or on disk. But there is another goal: if your data is compressed in memory or on disk, you can transfer the compressed data into the caches and decompress it there, and the sum of the transmission time and the decompression time can, in some situations, be smaller than the time it takes to transmit the original, uncompressed dataset into the caches. That is the goal of Blosc, which is the compressor that bcolz uses behind the scenes.
Blosc's goal is to be faster than memcpy. It uses a series of techniques that I am not going to describe, but basically it leverages new architectures; in this case, for example, we can see Blosc decompressing up to five times faster than memcpy.
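For reference, this is roughly what using Blosc from Python looks like through the python-blosc bindings; the codec and compression level are illustrative choices:

    import numpy as np
    import blosc

    arr = np.linspace(0.0, 1.0, 10_000_000)
    packed = blosc.compress(arr.tobytes(), typesize=arr.itemsize,
                            clevel=5, shuffle=blosc.SHUFFLE, cname="lz4")
    print(len(packed) / arr.nbytes)  # fraction of the original size

    restored = np.frombuffer(blosc.decompress(packed), dtype=arr.dtype)
    assert np.array_equal(arr, restored)

The shuffle filter reorders bytes by significance across elements, which is one of the techniques that makes these speeds possible on typed data.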
I am not going to describe how Blosc works, there are other talks about that. The main place to use Blosc is to accelerate input/output, not only for mechanical disks, but especially for solid state disks and main memory.
Blosc is a library written in C, and it is widely used; for example, it is used in OpenVDB, a library for producing 3D animation movies that is maintained by DreamWorks and used by Houdini.
There is a series of projects using bcolz already. For example, Visualfabriq's bquery, which is meant to provide out-of-core group-bys on disk, not only in memory, because bcolz supports containers both on disk and in memory. Also, Continuum's Blaze is using bcolz, and Quantopian are very excited about using it too. I am going to skip these plots, where people show how bcolz can beat MongoDB or HDF5, for their own use cases, of course.
I am going to close the talk by saying that chances are there is a data container that fits your needs, and that these containers are already out there. My advice is always to check the existing libraries and choose the one that fits your needs. Sometimes you can be surprised: depending on the container you are using, you can get much more performance, not because of the algorithm, but because of the data structure or data container. Also, you should pay attention to hardware and software trends and make informed decisions about your current developments, which, by the way, will be deployed in the future. So it is important to be conscious of the new computer architectures, because you, or your application, are going to use them. Finally, in my opinion, and I think many people have seen this already, compression is a useful feature, not only to store more data, but also to process data faster under the right conditions. Okay, so let me conclude my talk with my own version of a quote by Isaac Asimov, of whom I was a huge fan when I was a teenager.
It is change, continuing change, inevitable change, that is the dominant factor in computer science. And in my opinion, no sensible decision can be made any longer without taking into account not only the computer as it is now, but the computer as it will be. Okay, so thank you very much.
Questions? Okay, I also have an announcement, I will be talking about continuing… We have a question. Yeah, just maybe, because there are some graphs about this: there was some comparison with Mongo, and there are some compressors, like for example Blosc and Snappy, so I don't know which reference it is, just because I haven't heard so much about it before. What do you know, or what do you see, as the difference for similar approaches tried by other people? Just if you have some comparisons, if there are some advantages to the technologies you presented. So you mean that Mongo is using Snappy, right? Yeah, for one storage engine, for example, they are using Snappy or zlib, but Snappy is faster for compressing, for example; but there are other things, there is also RocksDB, which is used in many things. That is a good question. As I said before, bcolz uses Blosc behind the scenes. I said that Blosc is a compressor, but that was an oversimplification: Blosc is actually a meta-compressor. So Blosc can use different compressors: in particular, it can use Snappy behind the scenes, it can use zlib, and LZ4, which is the kind of new trend in compression, because it is very fast and compresses very well too; and it also has support for BloscLZ. So you have a range of compressors that you can use in order to tailor or fine-tune for your applications.
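In the python-blosc bindings that codec choice is just a parameter; a small sketch (the available list depends on how Blosc was built):

    import blosc

    print(blosc.compressor_list())  # e.g. ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']

    data = bytes(range(256)) * 4096
    for cname in blosc.compressor_list():
        packed = blosc.compress(data, typesize=1, cname=cname)
        print(cname, len(packed))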
Maybe a silly question: I have just been to a talk on Numba, and they claim to speed up NumPy and things like that. Does bcolz work with Numba?
Yes. I mean, bcolz is only providing the data layer, the data structure, right? On top of the data structure it provides very little machinery; it provides a sum function, for example, but very little else. So the idea is to use bcolz and, on top of that, you can put Numba, for example, for doing computations; but you can also put Dask, for example, which is a way to do computations in parallel as well. bcolz provides a generator interface so that other layers on top can leverage that. You are not bound to use bcolz's infrastructure, bcolz's machinery, for doing computations; it only provides the storage layer, so to say.
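A hedged sketch of that layering, feeding bcolz blocks into a Numba-compiled kernel through bcolz's block iterator (the kernel is a trivial sum, just to show the shape of the pattern):

    import numpy as np
    import bcolz
    from numba import njit

    @njit
    def block_sum(block):
        # Numba compiles this loop to machine code
        s = 0.0
        for x in block:
            s += x
        return s

    a = bcolz.carray(np.random.rand(5_000_000))
    total = sum(block_sum(b) for b in bcolz.iterblocks(a))
    print(total)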
I have got a related question. Can you use Pandas with bcolz as the storage engine? Sorry, can you repeat? Can you use Pandas, with all the Pandas API, and still have bcolz as the storage engine? As the storage engine, yes, exactly. That is another application. For example, I have seen some references by Jeff, I don't remember his last name, the current maintainer of Pandas. He is trying to see whether Pandas can support different backends, like SQL databases or HDF5, and bcolz could be another backend for Pandas itself. So it can be done, but it isn't there now? No, to my knowledge there is no bcolz backend for Pandas yet, but it could be done.
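So not a Pandas backend, but round-tripping between the two is already possible; a minimal sketch:

    import pandas as pd
    import bcolz

    df = pd.DataFrame({"a": range(5), "b": [x * 0.5 for x in range(5)]})
    ct = bcolz.ctable.fromdataframe(df)  # store the DataFrame as a compressed ctable
    df2 = ct.todataframe()               # and convert it back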
Any other questions? No? Okay, this was everything, thank you very much. Just to let you know, this was the last talk in this room; there will be lightning talks at a quarter past five, and that is everything. Please go to the Guidebook app on your mobile phones and rate the talks you attended. Before leaving, a quick reminder: I will be giving a tutorial on Wednesday, where I will be talking more about all these data containers and doing comparisons between bcolz, Pandas, storage layers, NumPy, things like that. If you are interested, please join us. Okay, thank you.