
New Trends In Storing Large Data Silos With Python


Formal Metadata

Title New Trends In Storing Large Data Silos With Python
Title of Series EuroPython 2015
Part Number 38
Number of Parts 173
Author Alted, Francesc
License CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
DOI 10.5446/20114
Publisher EuroPython
Release Date 2015
Language English
Production Place Bilbao, Euskadi, Spain

Content Metadata

Subject Area Computer Science
Abstract Francesc Alted - New Trends In Storing Large Data Silos With Python My talk is meant to provide an overview of our current set of tools for storing data and how we arrived at them. Then, in the light of the current bottlenecks and of how hardware and software are evolving, it provides a brief overview of the emerging technologies that will be important for handling Big Data within Python. Although I expect my talk to be a bit prospective, I certainly won't be trying to predict the future, but rather offering a glimpse of what I expect we will be doing in the next couple of years to properly leverage modern architectures (bar unexpected revolutions ;). As an example of a library adapting to recent trends in hardware, I will be showing bcolz, which implements a couple of data containers (and especially a chunked, columnar 'ctable') meant for storing large datasets efficiently.
Keywords EuroPython Conference
EP 2015
EuroPython 2015
So the main goal of my talk is to [unclear], which might be useful if you want to store and analyze large data silos, taking into account how modern computer architectures are evolving. First, a few words about me. I'm a physicist by training, computer science is my passion, and I do believe in open source; the proof is that I have spent a long part of my life doing open source development. The project I invested the most in is PyTables, where I spent almost 10 years, although my current pet projects are Blosc and bcolz, and I am going to talk quite extensively about the last one.
Why open source? Well, in my opinion there is a [unclear] between dreams and reality. Many times we think that something could be improved, but then you have to find the time in order to implement it. I share the opinion of Manuel Oltra, for example, that the art is in the execution of an idea and not in the idea itself, because there is not much left from just an idea. So open source allows me to implement my own ideas, and it's a nice way to [unclear] yourself.
First of all I'm going to introduce the need for speed, because the goal is always to analyze as much data as possible using the existing resources that you have. Then I will talk about new trends in computer hardware, because I think that seeing the evolution of computer architecture is very important in order to design your data structures and your algorithms. And I will finish by showing you bcolz, which is an example, just an example, of a data container for large datasets that follows the principles of these newer and forthcoming computer architectures.
OK, so why do we need speed? But first let me remind you of the main strengths of Python. I think one of the most important things about Python is that there is a rich ecosystem of data-oriented libraries; most of you will know about NumPy, pandas, scikit-learn and a lot of different libraries. Also, Python has a reputation of being slow, but probably most of you also know that it is very easy to identify the bottlenecks of your programs and then make use of C extensions in order to get good performance, using tools like Cython, SWIG or f2py. But for me, the most important thing about Python is interactivity: the ability to interact with your data and your filters, so that you see the result of your queries almost in real time. This is the key thing about Python. But of course, if you want to handle big amounts of data and you want to do that interactively, you need speed, because if not, this is not interactive any more.
But designing code for data storage performance depends very much on the computer architecture, and that will be the main point of my talk: in my opinion, existing Python libraries need to invest more effort in getting the most out of existing and future computer architectures.
Also a note about the scope of my talk: I am not going to talk about how to store and analyze data in clusters, because in my opinion that is not exactly the nature of Python. The reality is that Python is used mostly on laptops and servers. A lot of people are using Python on their own laptops, and my goal is to help them get more work done using those laptops. But trying to optimize for laptops and servers doesn't mean that this is going to be an easy task, because modern laptops and servers are very, very complex beasts, and we have to understand the architecture in order to design for it: the memory layers, the different caches and other things. So it is absolutely [unclear] the current architecture [unclear] in order to design your data structures.
The new trends in computer architecture are mainly driven by nanotechnology, and I think it's very interesting to see how Richard Feynman predicted the nanotechnology explosion almost 50 years ago, so I think it's nice for you to check out that talk.
I think that the most important thing to know about modern architectures is the difference in the evolution of speed between memory access time and CPU cycle time. We know that CPUs are getting faster and faster; in fact, their speed has been growing almost exponentially [unclear]. In contrast, memory speed is increasing slowly, very little, and this is creating a big gap, a mismatch, between CPU speed and memory speed. And this is a very important key to the evolution of architectures.
If we look at the evolution of the architecture, we can see that in the eighties, for example, the architecture of computers was very simple: there were just a couple of memory layers, the main memory and the mechanical disk. Then, in the nineties and 2000s, vendors realized this problem with the mismatch between memory and CPU, and they started to introduce two additional levels, the caches, in the CPUs. And nowadays it is very usual to have up to six layers of memory. OK, this is a change in the paradigm: it is not the same thing to program for a machine from the 2000s onwards as for a machine from the eighties.
In order to understand how we can adapt to the current architecture, it's important to know the difference between reference time and transmission time. When the CPU asks for a block of data in memory, the time that it takes from when the CPU makes the request until the memory starts to transmit the data is called the reference time; others call it latency as well. And then the time that it takes from when the transmission starts until it ends is called the transmission time. The thing is, if you have a big mismatch between the reference time and the transmission time, you are not doing an optimized access to data. So the idea is that the reference time and the transmission time should be more or less of the same order.
But not all the storage layers are created equal. That means, for example, that in memory, which has a reference time of typically 100 nanoseconds, we can transfer up to 1 kilobyte in this amount of time. For solid-state disks, where the reference time is 10 microseconds, we can transfer up to 4 kilobytes in the same time. And for mechanical disks, where the reference time is around 10 milliseconds, the transmission time allows you to transfer up to 1 megabyte. So the thing is that the slower the media, the larger the block that should be transmitted in order to optimize access. And again, this has profound implications for how you access storage, as we will see.
To finish this part, trends in storage. The key thing is that, as we have seen, the gap between memory and storage is large [unclear]. That means that, in order to fill this gap, vendors started by creating solid-state devices that have the same interfaces as [unclear], and then they started to put solid-state memory on buses like PCIe. Newer protocols and specifications have also been introduced in order to put solid-state memory [unclear], so we will be able to access this memory at PCIe speeds, which is very different from accessing it through [unclear], as it used to be. And then the trends in CPUs: we are going to see more cores, of course, wider vector units for processing multiple data, and, as we have seen already, the integration of the GPU with the CPU [unclear]. These are the trends that you should keep in mind in order to design new data containers [unclear].
So what I'm going to do now is show you an example of an implementation of a data container that leverages this new computer architecture. bcolz is a library that provides data containers that can be used in a similar way to NumPy, pandas or others. In bcolz the data storage is chunked, not contiguous, and the contents can be compressed. There are two flavors: carray, which is meant to host homogeneous types and n-dimensional data, and ctable, which is meant for heterogeneous types, in a columnar way.
I am going to skip some of the slides, because I am a little bit short of time, and the important thing that I want to transmit is the consequences of using these containers. So I'm not going to explain in detail the difference between contiguous and chunked. The only thing that's important is that chunking is nice because it allows for efficient enlarging and shrinking, compression is possible, and in addition the chunk size can be adapted to the storage layer. Remember that, depending on the storage layer that you are going to use, the chunk size should be different, right? So chunking the storage allows you to fine-tune the chunk size for your storage layer. And other operations, like appending, are much, much faster: you don't need a copy when you're doing an append operation on a bcolz container, so less memory travels to the CPU.
Also, the ctable object is implemented in a column-wise fashion. That means that the data in a column is contiguous in memory, so when you want to fetch a column, you transfer only the information that you need. This is the case of an in-memory table that is stored row-wise: if you're interested in this int32 column, for example, you're going to fetch much more data than you need, just because of architectural reasons; this is how computers work right now. With an in-memory column-wise table, if you are interested in just one column, you are going to fetch only that column and transfer it to the CPU. So there is also less memory traveling for this operation.
Why compression? The first thing is that it allows you to store more data, either in memory or on disk. But another goal is this: if your data is compressed, maybe it would be better to keep it compressed in memory, transmit just the compressed data into the cache and decompress it there; maybe the sum of the transmission time and the decompression time would be smaller, in some situations, than the time it takes to transmit the original dataset. And that is the goal of Blosc, which is the compressor that is used in bcolz. The goal of Blosc is to be faster than memcpy. It uses a series of techniques that I'm not going to describe, but basically they leverage the new architectures. In this case, for example, we can see Blosc decompressing up to 5 times faster than memcpy. I'm not going to describe Blosc here; there are other talks about this. The main goal of Blosc is basically to accelerate input/output, not only on mechanical disks but especially on solid-state disks and main memory. Blosc is a library written in C, and it is quite widely used. It has been used, for example, in OpenVDB [unclear], which is a library used in 3D animation movies by DreamWorks.
There is a series of projects using bcolz already. For example, Visualfabriq's bquery, which is meant to do out-of-core groupbys, on disk, not in memory, because bcolz supports both in-memory and on-disk containers. Also Continuum's Blaze [unclear] and Quantopian [unclear]; they have been excited about using bcolz. I'm going to skip this plot; it was only showing how bcolz [unclear] for their use cases.
I'm going to close the talk by saying that [unclear] there are data containers out there that fit your needs. [unclear] It is always good to check the existing libraries and choose the one that fits your needs, and sometimes you could be surprised: depending on the data structure that you use, you can gain much more performance, not because of the algorithm but because of the data structure, the data container. Also, pay attention to hardware and software trends and make informed decisions about your current developments, which, by the way, will be deployed in the future. So it's important that you are conscious of the new computer architectures, because you are going to use them; your applications are going to use them. And finally, in my opinion, compression, and I think many people have seen that already, is a useful feature not only to store more data but also to process it faster, under the right conditions.
OK, so I will conclude my talk with my own version of a quote from [unclear], of whom I was a big fan when I was a teenager: it is change, continuing change, inevitable change, that is the dominant factor in computer science today. No sensible decision can be made any longer without taking into account not only the computer as it is now, but the computer as it will be. So thank you very much. Any questions?
[Audience] [Largely unclear. The question appears to ask how bcolz compares with other approaches, and about the compressors it can use.]
[Speaker] [unclear] As I said before, bcolz uses Blosc behind the scenes. I said that Blosc is a compressor, but that was an oversimplification: Blosc is actually a meta-compressor, so Blosc can use different compressors. In particular, you can use LZ4, which is kind of a new trend in compression because it is very fast and compresses really well. Blosc also has support for [unclear] BloscLZ. So you have a range of compressors that you can use in order to fine-tune for your applications.
[Audience] I was just in a talk on Numba, and they claim to speed up NumPy and things like that. So does bcolz use or integrate with Numba?
[Speaker] I mean, no, because bcolz only provides the data layer, the data structure, right? It provides very little computational machinery, just a sum and some functions, but that is very little. So the idea is to use bcolz and, on top of that, put Numba, for example, for doing computations. But you can also put Dask, for example, which is a way to do operations in parallel as well. bcolz provides a generator interface so that all the layers above can leverage it. You are not bound to use the bcolz infrastructure or machinery for doing computations; it only provides the storage layer.
[Audience] A related question: can bcolz be used as a storage engine for pandas? [unclear]
[Speaker] Yes, exactly, that's another application. For example, I've seen [unclear], I don't remember his name, [unclear] he's trying to support different backends, like SQL databases or HDF5, and bcolz can be another backend [unclear].
[Audience] So it can be, but it isn't yet?
[Speaker] No. To my knowledge there is no bcolz backend for pandas, but it could be done, of course.
[Session chair] OK, so this is everything [unclear]. This was the last talk [unclear]; now there will be the lightning talks [unclear].
[Speaker] Before leaving, let me just remind you that I will be giving a tutorial [unclear] where I will be talking more about all this and doing comparisons between bcolz, pandas, [unclear]. So if you are interested, come along.
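As a rough illustration of the chunked, column-wise, compressed storage described in the talk, here is a minimal Python sketch. It is not the bcolz API: the class names `ChunkedColumn` and `ColumnTable`, the `chunk_len` parameter and the sample data are hypothetical, and the standard-library `zlib` module stands in for Blosc, so the sketch shows the layout idea only, not the performance.

```python
import zlib
from array import array


class ChunkedColumn:
    """Toy append-only column stored as independently compressed chunks."""

    def __init__(self, typecode="i", chunk_len=4096):
        self.typecode = typecode
        self.chunk_len = chunk_len    # items per chunk (hypothetical tuning knob)
        self.chunks = []              # compressed, full chunks
        self.tail = array(typecode)   # uncompressed leftover items

    def append(self, values):
        # Only the small tail is touched; chunks that are already
        # compressed are never copied or rewritten.
        self.tail.extend(values)
        while len(self.tail) >= self.chunk_len:
            full = self.tail[:self.chunk_len]
            # zlib is an assumption standing in for Blosc.
            self.chunks.append(zlib.compress(full.tobytes(), 1))
            self.tail = self.tail[self.chunk_len:]

    def __len__(self):
        return len(self.chunks) * self.chunk_len + len(self.tail)

    def __iter__(self):
        # Decompress one chunk at a time, so only one chunk is expanded
        # in memory at any moment.
        for blob in self.chunks:
            chunk = array(self.typecode)
            chunk.frombytes(zlib.decompress(blob))
            yield from chunk
        yield from self.tail

    def compressed_bytes(self):
        stored = sum(len(blob) for blob in self.chunks)
        return stored + len(self.tail) * self.tail.itemsize


class ColumnTable:
    """Toy column-wise table: one ChunkedColumn per named column."""

    def __init__(self, columns, chunk_len=4096):
        # columns maps a column name to an array typecode, e.g. {"id": "i"}.
        self.columns = {
            name: ChunkedColumn(typecode, chunk_len)
            for name, typecode in columns.items()
        }

    def append(self, rows):
        # rows is an iterable of tuples given in column order.
        rows = list(rows)
        for i, column in enumerate(self.columns.values()):
            column.append(row[i] for row in rows)

    def column_sum(self, name):
        # Only the requested column is decompressed and scanned.
        return sum(self.columns[name])


table = ColumnTable({"id": "i", "price": "d"}, chunk_len=1000)
table.append((i, i * 0.5) for i in range(10_000))
total_ids = table.column_sum("id")
raw_bytes = 10_000 * array("i").itemsize
```

Appending only compresses newly filled chunks and never rewrites earlier ones, and `column_sum("id")` decompresses the `id` column alone. A real container would also pick the chunk size to suit the storage layer it targets.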
Complex (psychology)
Suite (music)
Source code
Real-time operating system
Water vapor
Solid geometry
Medical imaging
Sign (mathematics)
Physical system
Programming paradigm
Product (category theory)
Block (periodic table)
Software developer
Interface (computing)
Arithmetic mean
Data storage device
Order (biology)
Pattern language
Arithmetic progression
Point (geometry)
Open source
Dynamical system
Graph (mathematics)
Rule of inference
Latent heat
Computer hardware
Energy level
Data structure
Contrast (vision)
Metropolitan area network
Computer architecture
Form (programming)
Series (mathematics)
Pairwise comparison
Key (cryptography)
Content (media)
Line (geometry)
Cartesian coordinate system
Limit (category theory)
System call
Table (information)
Query language
Library (computing)
Musical ensemble
State of matter
View (database)
Decision theory
1 (number)
Insertion loss
Function (mathematics)
Parameter (computer programming)
Complete metric space
Data transmission
Video game
Data compression
Central processing unit
Data conversion
Extension (kinesiology)
Position operator
Boss Corporation
Process (computing)
Moment (mathematics)
Fitness function
Entire function
Proof theory
Exterior algebra
Vector space
Computer science
Normal (geometry)
Right angle
Data type
Data management
Sinc function
Read-only memory
Computer programming
Server (computing)
Service (economics)
Observational study
Transport Layer Security
Virtual machine
Gene cluster
Heat transfer
Field (computer science)
Wave packet
Revision control
Natural number
Operator (mathematics)
Analytic continuation
Hydraulic jump
Task (computing)
Condition number
Inheritance (object-oriented programming)
Projective plane
Mathematical analysis
Plot (narrative)
Cache (computing)
Communications protocol
Local ring

