
Scale your data, not your process: Welcome to the Blaze ecosystem



Formal Metadata

Title Scale your data, not your process: Welcome to the Blaze ecosystem
Title of Series EuroPython 2015
Part Number 163
Number of Parts 173
Author Doig, Christine
License CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
DOI 10.5446/20096
Publisher EuroPython
Release Date 2015
Language English
Production Place Bilbao, Euskadi, Spain

Content Metadata

Subject Area Computer Science
Abstract Christine Doig - Scale your data, not your process: Welcome to the Blaze ecosystem
NumPy and Pandas have revolutionized data processing and munging in the Python ecosystem. As data and systems grow more complex, moving and querying becomes more difficult. Python already has excellent tools for in-memory datasets, but we inevitably want to scale this processing and take advantage of additional hardware. This is where Blaze comes in handy by providing a uniform interface to a variety of technologies and abstractions for migrating and analyzing data. Supported backends include databases like Postgres or MongoDB, disk storage systems like PyTables, BColz, and HDF5, or distributed systems like Hadoop and Spark.
This talk will introduce the Blaze ecosystem, which includes:
- Blaze (data querying)
- Odo (data migration)
- Dask (task scheduler)
- DyND (dynamic, multidimensional arrays)
- Datashape (data description)
Attendees will get the most out of this talk if they are familiar with NumPy and Pandas, have intermediate Python programming skills, and/or experience with large datasets.
Keywords EuroPython Conference
EP 2015
EuroPython 2015
Good afternoon, and welcome to my talk, "Scale your data, not your process: Welcome to the Blaze ecosystem". My name is Christine Doig and I'm a data scientist at Continuum Analytics. We have a booth out there, so anytime in the week you want to contact me, I'll be there most of the time. I am from Barcelona, but I'm currently living in Austin, Texas, so if you're from Spain you can talk to me in Spanish or Catalan, or in English or German. This is my website; I have a couple of talks that I've given at other conferences, and you can check them out too.
Just a little bit about Continuum Analytics: the company was mentioned in the keynote today. We offer a free Python distribution called Anaconda, which is very popular in the SciPy community: for libraries that have C and Fortran bindings, it makes it very easy to install them. We are also very integrated in the open source community. We have sponsored several projects (conda, Blaze, Dask, Bokeh, Numba), and we're proud sponsors of Python conferences in Europe: EuroPython, EuroSciPy, PyData. We're hiring, so we're going to be at the hiring event tomorrow if anyone's interested.
A little bit about this talk: I've organized it into three different areas. First, a little bit about what data science is and what that means. Second, what Blaze brings to the data science community, what I call the Blaze ecosystem; I'll talk about that later. And then, inside Blaze there are many projects, and I'm going to be mainly talking about Blaze, DataShape, Odo and Dask, and how each one relates to the others. You can follow the slides online if for some reason you're not able to see them from back there, and there's also a repository with the examples that I'm showing in my slides, so you can try them too.
So, the first area: data science. Many people have their own definition of what data science means. For me, data science is more than just machine learning and statistics; it's actually the integration of fields coming together to solve data problems. A lot of people in the scientific computing community have already been solving large-scale analytic problems: scientists deal with large amounts of data, so they have already worked with it. Then there's the group of machine learning and statistics people; the analytics community, with databases; the web, where we find a lot of the data nowadays; and the distributed systems community, with Hadoop and Spark, which are trying to scale out these problems.
If we try to find which personas are working in each of these fields, there are some terms: the people called data scientists, in machine learning and statistics, who are more concerned with modeling; data and business analysts in analytics; web developers; data engineers in distributed systems; and researchers and computational scientists in scientific computing. If you had to find one word for what each of these personas cares about, maybe these are a proposal. Machine learning people care about models, finding the right model for solving a problem. People in the analytics community are more concerned with reporting: building reports and metrics. In web development it's building an application. In distributed systems you're concerned with the architecture that you're building, and in the scientific computing world, with algorithms.
If we use more than one word, what is the vocabulary of the people in those areas? The data scientist uses words like model, supervised, unsupervised, clustering, dimensionality reduction, cross-validation. In analytics, people are concerned with joins, with databases, with filtering, with getting summary statistics. In the web we have scraping and crawling to gather data, and things like interactive data visualizations. In distributed systems we have the whole Hadoop and Spark ecosystem, working with clusters, stream processing, etc. In scientific computing, people work with GPUs, with graphs, with problems that need computational power.
And what are the tools that each of these personas works with? In machine learning you find Python and scikit-learn. In the analytics community it's the whole database and SQL community, and people working with Excel too. In the web we have the web frameworks, we have D3, we have Bokeh for visualizations, and we have a way to share results with notebooks. In distributed systems, as I mentioned, we have Hadoop, Spark, and all the tools being built around them. In scientific computing we have the core of a lot of the libraries that are used by the machine learning and statistics side: libraries like NumPy, SciPy, xray, PyTables, Cython, Numba. This gives a general picture of the state of the data science ecosystem right now.
If we take a look at those tools, what they bring together is three things: data, the computational engine behind it, and the expression, which is how you ask for what you want. Data is about metadata, the information on that data, and about containers, meaning how the data is stored, in memory or on disk. Then we have the engine: the computation, what gets executed. And then we have the expression, meaning the API, the syntax, the language in which you express what you want to do. What are we looking for in each of these? In data we're looking for semantics and, in storage, for containers with compression and accessibility to your data. In the engine we're looking for performance, being able to compute as fast as possible. And in expressions we want simplicity: we want to be able to express what we want to do in a language that's very close to ours.
Just to give an example of what all those things mean: for data we have file formats, like HDF5, NetCDF, CSV and so on, but we also have in-memory containers like the DataFrame or NumPy arrays. In terms of semantics we have types and we have fields; mainly, we have a description of the data and of the relationships within your data. In terms of computation we have different computation engines, like Spark, like Cython, Fortran, and Python itself, or the libraries that are built on top of them. In terms of the API, syntax and language, I'm talking about things like the pandas API and the bindings to other libraries that allow us to express things in an easy way; we also have the many SQL dialects, etc. All those libraries, pandas, databases, Spark, have each of those three pieces.
Let's look at a simple example. Imagine NumPy. NumPy has the NumPy dtypes, which allow us to express the types of the fields. We have a way to contain the data, the NumPy ndarray. NumPy by itself needs to compute what the user is asking for, and it has bindings to C and Fortran, and also Python. And in terms of the API we have things like slicing: that's how you express the part of the data that you want to read. Every system has its happy and sad faces. NumPy and pandas are mainly limited by the memory of your laptop or your device. But scientists like to represent things with arrays, and pandas has attracted a lot of attention in the scientific community; data scientists and analysts also really depend on it, and they like to work with DataFrames. In the database world we have a lot of SQL dialects as well, and there's a lot of overhead to set things up. Spark has expanded its reach beyond data scientists to people on the data engineering side, but it doesn't quite yet help you in all cases: for data at smaller scales you still have a lot of overhead.
So, what does the Blaze ecosystem bring to the data science ecosystem that I just mentioned? Blaze started out as a question: how do we expand NumPy- and pandas-like tools to out-of-core data, so that we're not limited just by the RAM that your laptop has? From there, several spin-off projects have come along the way, with the things that we learned. First, we needed to get past some of the NumPy and pandas limitations in expressing all the metadata, and that's where DataShape comes from: it's a data description language that's more general than NumPy dtypes. Then we have DyND, which is a dynamic multidimensional array library written in C++ that has Python bindings. We also found that there was a big need to move data around: data scientists were working with different file formats, and there wasn't always an easy way to move from one format to another, from one place to another. So Odo is a spin-off project that came out of the Blaze repository. We have Numba, which is an optimizing just-in-time compiler. Then there's what's called Blaze as a project, which has been kind of the core, and which is just an interface to query data from different backends. We have Dask, which allows us to do parallel computing. And we have bcolz, for data stored on disk with compression, which is also a column store and also has a query language.
If we place all these projects in the table that we had before: DataShape is the metadata abstraction for expressing your data in different formats. DyND stores data in multidimensional arrays, and Odo allows us to switch from one container to another. Numba allows us to optimize code, Dask is for parallel computing, and Blaze is the common interface that allows us to query everything in a much more unified manner, without having to learn each of the different APIs. If we put those packages in our triangle, we find DataShape as the metadata in the middle; on the storage side, containers, and Odo, which allows you to switch from one to another; on the engine side, the power of parallelizing with Dask and of optimizing code with Numba; and expressing everything with Blaze. Then there are DyND and bcolz, which are partly data containers and also have the computational power to resolve whatever you need to compute.
Now, mapping those projects onto the fields and giving you the overall picture: analytics people are interested in tabular data formats, like dask.dataframe, and in Blaze as a unified query interface. We have Odo, which can be used by pretty much everyone; it's like a utility function to move data around. We have support in the scientific computing world, in that some of these projects solve underlying problems for many of the libraries in the Python ecosystem for machine learning. We are also engaging with the distributed systems world with what's called dask.distributed. And then we have Blaze server, which allows us to serve data in different formats through a unified API. So I like the idea of this as a connector to all these different fields in data science, bringing everything together in a unified manner.
If we remove some of the projects to focus on what I'm going to talk about in this talk, that's mainly the core projects: Blaze, Odo, Dask and DataShape, and what each of them does.
The first one is Blaze: an interface to query data on different storage systems. From Blaze you import Data, and you load data the same way whether it's a CSV file, a SQL database, MongoDB, or whatever: you just call Data and pass it the URI. Then you can run all the queries against all those different backends: select columns, filter, operate, reduce, split-apply-combine operations like group-by, add columns, relabel columns, text matching.
One of the features that we just added to Blaze is the Blaze server: a uniform interface that allows you to host data from all of these backends through a web API that's the same for all of them. You write a YAML file specifying all the data that you want and where it's located, you pass that file to the server, and it launches a Blaze server with all of that data and an endpoint for performing all the computations that I just mentioned through the API. It looks something like this: we have the data available through the API, and from there we can get the fields, we can get all the different datasets, and inside each of the datasets we can compute in the same way that I just mentioned. You can use things like curl: we have an expression language to compute something and return the result. There's also the option to use JSON with something like requests, but I'm going to explain DataShape first.
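The idea described here, one expression evaluated by different backends, can be sketched with the standard library alone. This is not Blaze's API or implementation: MeanBy, compute_rows and compute_sql are hypothetical names made up for this illustration, and the three-row dataset is invented. The sketch only shows the separation between what is asked (the expression) and who computes it (the engine).

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class MeanBy:
    """Hypothetical mini-expression: mean of `column`, grouped by `key`."""
    key: str
    column: str


def compute_rows(expr, rows):
    """Engine 1: evaluate the expression over in-memory dict records."""
    totals = {}
    for row in rows:
        running_sum, count = totals.get(row[expr.key], (0.0, 0))
        totals[row[expr.key]] = (running_sum + row[expr.column], count + 1)
    return {group: s / n for group, (s, n) in totals.items()}


def compute_sql(expr, conn, table):
    """Engine 2: translate the same expression to SQL and let SQLite run it."""
    sql = (
        f'SELECT "{expr.key}", AVG("{expr.column}") '
        f'FROM "{table}" GROUP BY "{expr.key}"'
    )
    return dict(conn.execute(sql).fetchall())


# Invented sample data, stored once in memory and once in a database.
rows = [
    {"species": "setosa", "petal_length": 1.5},
    {"species": "setosa", "petal_length": 2.5},
    {"species": "virginica", "petal_length": 6.0},
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE iris (species TEXT, petal_length REAL)")
conn.executemany("INSERT INTO iris VALUES (:species, :petal_length)", rows)

# The expression is written once and handed to either engine.
expr = MeanBy(key="species", column="petal_length")
in_memory = compute_rows(expr, rows)
in_sqlite = compute_sql(expr, conn, "iris")
```

Both calls give the same per-species means; only the engine doing the work differs, which is the property the talk attributes to Blaze's query interface.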
DataShape is the way Blaze structures and understands data. Basically there are what are called unit types, which are just dimensions and dtypes, and those are what form a datashape. We can also combine those into what's called a structured type, which is a record. You can combine them to describe unstructured or semi-structured data, and to express nested fields and things like that. For example, for a tabular dataset, the datashape holds the length of the table, which can be known or unknown, the fields, and then the types of those fields. In our previous case, where we had several iris datasets in different formats, we can get the datashape and it looks something like this: we have a database, the database has a table, and the table has its length with the different fields and types, and the same for all of them. So what's the connection between Blaze and DataShape? Blaze basically uses DataShape for its types. When we call Data on the iris CSV we have access to the datashape and we can explore it. So now I can go back to the server and get the datashape of whatever I put in the Blaze server, as we just saw.
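To make the notation concrete: a datashape for a table is written as a length (a number, or var when unknown), a star, and a record of field names and types. The sketch below builds such a string from a list of records using only the standard library. It is not the datashape library; dshape_of and TYPE_NAMES are hypothetical helpers for this illustration, and the sample rows are invented.

```python
# Map Python value types to datashape-style type names (illustrative subset).
TYPE_NAMES = {bool: "bool", int: "int64", float: "float64", str: "string"}


def dshape_of(records, known_length=True):
    """Describe a list of flat dict records in datashape-style notation.

    Field types are taken from the first record only, which is an
    assumption that holds for uniform tabular data.
    """
    first = records[0]
    fields = ", ".join(
        f"{name}: {TYPE_NAMES[type(value)]}" for name, value in first.items()
    )
    length = str(len(records)) if known_length else "var"
    return f"{length} * {{{fields}}}"


sample = [
    {"sepal_length": 5.1, "petal_length": 1.4, "species": "setosa"},
    {"sepal_length": 7.0, "petal_length": 4.7, "species": "versicolor"},
]
fixed = dshape_of(sample)
streaming = dshape_of(sample, known_length=False)
print(fixed)      # 2 * {sepal_length: float64, petal_length: float64, species: string}
print(streaming)  # var * {sepal_length: float64, petal_length: float64, species: string}
```

The var form corresponds to the "length can be known or unknown" case mentioned in the talk, for example a stream or a file whose row count has not been read yet.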
With the datashape in hand, I can then express my query in JSON and send it with requests, and what comes back is JSON with whatever I asked it to compute; every time, it returns the data, the datashape and the names of the fields.
Odo is data migration; it's like cp with types, for data. It has a very simple API: from odo I import odo, and I just have to pass my source and my target. So if I want to turn a CSV file into a DataFrame, I pass it the iris CSV and DataFrame, and it creates the DataFrame. That's a very simple case, but you can get more complex ones, like, for instance, moving things from one system to another. The way Odo does that under the hood is with a network of different formats and the conversions between them.
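The network of formats and conversions mentioned here can be pictured as a weighted graph in which a conversion request becomes a cheapest-path search. The sketch below uses Dijkstra's algorithm from the standard library to show the idea. The formats, edges and costs are invented for illustration and are not Odo's actual conversion table; cheapest_path is a hypothetical helper, not part of Odo.

```python
import heapq

# Hypothetical conversion network: CONVERSIONS[a][b] is the relative cost
# of converting format a directly into format b. Values are made up.
CONVERSIONS = {
    "csv": {"dataframe": 4.0, "sql": 10.0},
    "dataframe": {"csv": 3.0, "hdf5": 2.0, "sql": 5.0},
    "hdf5": {"dataframe": 2.0},
    "sql": {"dataframe": 5.0},
}


def cheapest_path(source, target, graph=CONVERSIONS):
    """Return (total_cost, [formats...]) for the cheapest conversion chain."""
    queue = [(0.0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, weight in graph.get(node, {}).items():
            if neighbour not in visited:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    raise ValueError(f"no conversion path from {source!r} to {target!r}")


to_hdf5 = cheapest_path("csv", "hdf5")
to_sql = cheapest_path("csv", "sql")
print(to_hdf5)  # (6.0, ['csv', 'dataframe', 'hdf5'])
print(to_sql)   # (9.0, ['csv', 'dataframe', 'sql'])
```

In this made-up table the direct csv-to-sql edge costs more than going through an in-memory DataFrame, so the search picks the two-step route. A real converter would then execute each hop along the chosen path.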
So if I want to go from X to Y, Odo computes the most efficient path, executes it for you, and returns the final thing in the type that you asked for. Imagine you want to put something somewhere and there's no direct route: you may have to go through intermediate steps, maybe writing a file and loading it from there. Odo does this by understanding URIs, where you can also specify things like your S3 location; the same URIs that are valid for Blaze are also the ones that work for Odo.
Next is Dask, which, as we mentioned, enables parallel computing. In data science we have different sizes of data, right? We have things around the gigabyte scale that can fit in memory on your laptop. Then we move to the scale of terabytes, which doesn't fit in memory but can fit on your disk, and you still want to be able to compute on it. And then we have things at the petabyte scale, where the data fits on many disks. So with single-core computing we're at the gigabyte scale; at the terabyte scale we use shared memory; and for the largest data we use a distributed cluster. Right now NumPy and pandas are single core, and Dask is bringing parallel computation power to the likes of NumPy and pandas. We have Dask for shared memory and dask.distributed for distributed clusters, and inside shared memory we have two ways of scheduling: multithreading and multiprocessing.
What does that look like for an end user? We have the NumPy-like dask.array, which looks like what you see on the left. We create an array of ones and perform some kind of computation. Dask uses lazy evaluation, so you have to call compute on it to get your result, and you have to specify the chunk size of the array, that is, how you want to partition it. There's more information in the documentation on what the right numbers are, in terms of the megabyte size of the chunks that you should target. So in this case there are those two changes: you need to specify the chunks, and you need to call compute to actually perform the computation.
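The two changes described for dask.array, choosing a chunk size and calling compute, can be illustrated without Dask itself. The sketch below is a standard-library stand-in, not Dask's API or scheduler: LazySum and chunked are hypothetical names. It records the work when constructed and only runs it, chunk by chunk on a thread pool, when compute() is called.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(seq, size):
    """Yield consecutive slices of `seq`, each with at most `size` items."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


class LazySum:
    """Deferred sum over a sequence; nothing is evaluated until compute()."""

    def __init__(self, data, chunks):
        self.data = data
        self.chunks = chunks      # chunk size chosen by the user
        self.evaluated = False    # flips to True once compute() has run

    def compute(self, workers=4):
        # Each chunk is reduced independently, then the partial sums are
        # combined. Only one chunk per worker needs to be handled at a time.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            partials = list(pool.map(sum, chunked(self.data, self.chunks)))
        self.evaluated = True
        return sum(partials)


numbers = range(1_000_000)
lazy = LazySum(numbers, chunks=100_000)
ran_before_compute = lazy.evaluated   # still False: building is lazy
total = lazy.compute()
print(total)  # 499999500000
```

Threads are used here only to show the shape of the idea; summing Python integers is not sped up by threads, whereas array libraries that release the interpreter lock can benefit from this kind of scheduling.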
Then you have to output the result. If it fits in memory, you can just call np.array and keep it in memory like that; if your result doesn't fit in memory, you can store it to disk with HDF5. If you're more of a pandas user, dask.dataframe looks a lot like pandas, but it allows you to compute things that don't fit in memory without having to change much of the flow that you're already used to. Here we load the iris dataset, we use head, and we run a query. Now imagine you have a bunch of CSV files that don't fit in your memory: you can still do head, you can still do the queries, but you also have to call compute because of the lazy evaluation. We also have another Dask collection, called dask.bag, which allows you to work with semi-structured data like JSON blobs or log files. Imagine we have a set of compressed JSON files: you can just call from_filenames on the compressed files and map a JSON loader over them, and then do things like take the first records, compute the user location frequencies, and turn that into a DataFrame, because you know the result fits in memory. So it feels like what users are already used to, and it parallelizes under the hood without you having to worry about it.
Now, at the scale of things that don't fit on your disk, you actually use a cluster of computers, and we have dask.distributed for that. The only difference between using Dask with threads or multiprocessing on a single node and using dask.distributed is that you have to tell the client where the cluster is located, and then when you call compute you pass it the client's get, and that performs the computations on your cluster of computers. The relationship between Dask and Blaze is that Dask can actually be a backend engine for Blaze, so you can use Blaze as the query language and have Dask drive the computation. You make a Dask collection, wrap it in the same Data object that we mentioned in the Blaze section, perform the computation, and get the result.
My talk has been mainly focused on users. What about developers: what if you don't want to use those tools as an end user, but actually want to develop with them? There are some good resources. On Blaze there's a good talk, Blaze in the real world, given at PyData by Phil Cloud, which goes into how things work under the hood. If you want to know how to build your own converter for a backend that isn't already built into the system, that is explained in a recent talk from SciPy. There are also a lot of talks on Dask, one of the youngest projects that spun off from Blaze, and there are very good resources from James Crist at SciPy and Matt Rocklin at PyData; those talks go more in depth into the implementation details of those libraries. There are many libraries in this ecosystem that I have mentioned but haven't gone into the details of, and the developers of those libraries have given good talks. If you're interested in Numba, a talk was given at SciPy, also in Austin, two weeks ago, by Stan Seibert, on accelerating Python with the Numba JIT compiler. Here at EuroPython we have people from the Numba team, and they're here all week, so if you have questions on how to use Numba they'll be happy to help you. For bcolz, Francesc Alted is also here; he gave a talk yesterday and is going to give a tutorial tomorrow, so if you're interested in storing data in compressed containers, he's going to be giving this tutorial and you can check it out.
Just to summarize, the goals of this talk were rethinking the term data science, so that instead of being just machine learning models it's actually about building the connections between those five areas and bringing everything together; and thinking in terms of not just one library, but of each library having data, an engine and an expression. I encourage you to start trying out any of those Blaze projects if it's something that you can benefit from. And this is a team effort: it's possible thanks to a very talented team working on all of these projects, with Mark Wiebe on DyND and DataShape, others working on Blaze and Odo, Phil Cloud, who is also one of the core developers, the Dask team, and we also have connections with the bcolz team and the Numba team. So reach out to any of them if you're interested in any particular library. We have a few minutes for questions.
Question: A question on the relationship between these projects and the other projects in the scientific Python community: do you see these as replacing them, merging with them, or complementing them?
Answer: There's a lot of work on connecting these libraries into the other ones. We already have several success stories, with scikit-image for instance: there was a pull request in scikit-image to speed up some of the computations they were doing. So there are different layers, right? There's the end-user layer, which is about extending the use cases past the limitations that some of the end-user-facing libraries have, like NumPy: right now NumPy and pandas cannot solve some problems because of how they're built. In that case, if you're at the scale of terabytes, then dask.dataframe and dask.array are going to be an alternative. On the other side, the developer side, it's about improving computations that already exist in other libraries, so there are intentions of merging in, making Dask, for example, a dependency of those libraries and improving their performance. Does that answer your question? There's also a section in the documentation that compares Dask to things in the distributed systems world and discusses whether it's an alternative to them. I would say it's targeting different users. With the benefit of a low overhead for performing distributed computation, it's a good alternative to some of the things that exist in that world, but not for everyone; I would encourage you to read that section.
Question: A related question: Dask distributed looks like Spark, which also has DataFrames. What's the advantage of using one over the other, of using Dask distributed over Spark?
Answer: OK, so that one has been asked a lot. Actually, Matt Rocklin extended the Dask documentation because we were asked so many times how it compares to Spark. Of course, Spark is a more mature project. It uses the JVM, and it has a higher overhead of setting up. The fact that Dask is a Python library, which you can install with conda or pip, brings some benefits to the core Python scientific and machine learning libraries, which can use it. For the end user I would say it brings much lower overhead, especially for people in the Python world who don't want to mess with setting up Spark and the JVM. Having said that, you can also integrate with Spark through Blaze: Spark is one of the backends used by Blaze. So if you want to do a performance comparison between Dask and Spark for your use case, it's very easy to do that with Blaze, because your calls look pretty much identical; you just change the string that you pass to your Data object. There's an extended section in the Dask documentation that goes into all the details of that comparison. Any more questions? No? So, thanks. Thank you.

