Things I wish I knew before starting using Python for Data Processing

Video in TIB AV-Portal: Things I wish I knew before starting using Python for Data Processing

Formal Metadata

Things I wish I knew before starting using Python for Data Processing
Title of Series
Part Number
Number of Parts
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Release Date

Content Metadata

Subject Area
Miguel Cabrera - Things I wish I knew before starting using Python for Data Processing
In recent years, one of the ways people get introduced to Python is through its scientific stack. Although this is not bad, it may lead them to learn only one aspect of the language while overlooking other idioms and functionality included in Python, as well as some basic good practices of software development. I will share some useful tricks, tools and techniques, and some software design and development principles, that I find beneficial when working on a data processing / science project.
-----
In recent years, one of the ways people get introduced to Python is through its scientific stack. Most people who learned Python this way are not trained software developers, and many times it is their first contact with a programming language. Although this is not bad, it may lead them to learn only one aspect of the language while overlooking other idioms and the standard and common libraries included in Python, as well as some basic good practices of software development. This may become a problem when a data science project moves from an experimentation phase to integration with a technical environment. In this talk I share some useful tricks, tools and techniques, as well as some software design and development principles, that I find beneficial when working on a data processing / science project. The talk is divided into two parts. One is Python centered, where I will talk about some powerful Python constructs that are useful in data processing tasks. These include parts of the collections module, generators and iterators, among others. In the other, I will describe some general software development concepts, including SOLID, DRY and KISS, that are important for understanding the rationale behind software design decisions.
The next talk of the session is "Things I wish I knew before starting using Python for Data Processing". Please welcome our next speaker, Miguel.
Hi, I'm Miguel. My talk has a really long title, but I'm going to talk about some things I have learned in the last few years working with Python that I would have preferred to know before I started using it. A quick introduction: I'm from Colombia and I work for a company in Munich where we do data processing for hotels. I have only been doing Python for work for about two years, so it is maybe a bit early for me to give a talk. However, I think that if you are starting with Python in the data science area you can get some good stuff from this talk. That's my contact information.
So, the priors for this talk, who I assume you are: you are relatively new to Python; you mostly use the scientific stack; you work, or have decided to work, as a data scientist or machine learning engineer; and you are not necessarily a trained software engineer. If you already have experience, you are probably going to get bored, so you are free to walk away.
Who here is a data scientist? Please raise your hand. OK. Data analyst? Data engineer? Machine learning developer? Good. This is going to be really basic and really high level, so if you already have experience you might be bored.
The agenda: we're going to talk about some basic concepts and practices, then about object-oriented programming, then about some good uses of the collections module, and then about iterators and iterables. This started as a collection of things; I wanted to show many more, but because of time I had to pick the ones I like the most. They are different things at different levels of abstraction, so we are going to stay at a really high level, with some code, and we are not going to go deep into any of them. However, I will give you some pointers to where you can get more information.
Let's start with a story. This talk is based on my experience, but also on the experience of my colleagues and my interaction with them. Let's talk about someone who just graduated from university with, say, a math Ph.D.
He used R and MATLAB at university, and he comes to work for a company where he has to use mostly Python. He is asked to write a classifier, some code to classify documents, and for that he uses, for example, scikit-learn, which I assume you are somehow familiar with. He writes a really nice IPython (Jupyter) notebook with the code and graphs. (This is a random image, don't try to get anything from it.) But then he is told that he has to integrate the code. So he has to go from the IPython notebook to a really big project with a lot of modules, classes and so on, and of course he is lost. He tries his best without knowing what is going on, and he ends up writing what some people call spaghetti data science code, which is the same as spaghetti code.
If he has to integrate the code, it is going to be really bad for him, and if someone else has to integrate it, it is going to be bad for that person as well. So how do we prevent that from happening when we start with data science and want to actually integrate data science and machine learning code? I think the answer is going back to the basics. In a nutshell, I think data scientists have to become more like software engineers, getting to a middle point.
How do you do that? First we have to make a distinction between code and software. I took this from a talk I saw this year. Code is something that runs on a computer. When you write a script or a notebook you are probably writing code. It does not have to follow any conventions or recommendations; its job is just to run. Software, on the other hand: some people think it is just the deliverable, but some people think it is the whole thing, including tests, deployment scripts and documentation. Even customer support and technical support are part of the software. And you want your software to be maintainable, deployable, and all the other "-ables" you can put in there.
So the question is how to make the jump from code to software. I think the way is to go back to the basics, the Python basics.
One quite important thing, and I got this from the documentation, is that Python is an object-oriented programming language. So as a data scientist you should know what object-oriented programming (OOP) is and how to use it. That is the first thing I'm going to give you: as a Python data scientist, learn to use OOP. For that I'm going to give you a really quick and dirty introduction. The main concepts are these: objects have data, called attributes, and they have operations on that data, called methods.
Before going further, I want to illustrate this with cookies. A class is like the template that you use to create objects. In the case of cookies, it is the cookie cutter (sorry for the quality of the screen). You take the cookie cutter, you create many, many cookies, and you eat them, hopefully after they come out of the oven.
In Python, this is how a class looks: it is the template for creating cookies. It has __init__, the constructor function that is called every time you create an instance, and it has some attributes and methods. So right now you are already an expert in object orientation.
One of the key concepts in object-oriented programming is inheritance: you can extend a class to make something more specialized. In this case it is a type of cookie that is typical in Spain and South America. In this example I extend the cookie class and just add something on top of it.
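The cookie-cutter analogy can be sketched in a few lines. This is a minimal illustration rather than the code from the slides, which are not part of this transcript; the class names `Cookie` and `FilledCookie` and all attribute names are assumptions made for the example.

```python
class Cookie:
    """Template (the "cookie cutter") for creating cookie objects."""

    def __init__(self, shape, flavor="plain"):
        # The constructor runs every time a new instance is created.
        self.shape = shape      # attributes: data stored on the object
        self.flavor = flavor
        self.eaten = False

    def eat(self):
        # Method: an operation on the object's own data.
        self.eaten = True
        return f"Eating a {self.flavor} {self.shape} cookie"


class FilledCookie(Cookie):
    """A more specialized cookie that extends the base class (hypothetical)."""

    def __init__(self, shape, filling):
        super().__init__(shape, flavor="filled")
        self.filling = filling

    def eat(self):
        # Reuse the parent behavior and add something on top.
        return super().eat() + f" with {self.filling}"


heart = Cookie("heart")
special = FilledCookie("round", "caramel")
print(heart.eat())
print(special.eat())
```

The subclass only states what is different from the base class; everything else is inherited.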
Now scikit-learn: raise your hand if you have used it. Not so many. When you are working in IPython or Jupyter and you are copying scikit-learn examples, you are writing statements, and many of those statements look like this. What you are actually doing there is creating objects and calling methods on them. So you have to be aware of what you are doing and of how you can use that in your own code.
So how do I write good object-oriented code? That is a really tough question that I don't plan to answer today, but I will give you some basic ideas, from object orientation and actually from programming in general, that you want to know.
One is DRY: don't repeat yourself. If you are writing scripts and you feel you are repeating code, copying it from one file to another, you might want to create an abstraction, and that is one of the key features of object orientation.
The second is KISS: always keep it simple. Don't try to put a lot of things into your objects.
Also use the SOLID principles. They are really abstract principles and I'm not going to go into the details, but basically the first one says that a class has to do only one job. I'm going to skip the rest because of time, but my recommendation is to check them out. It is really important to know these.
So the first thing that I think is important if you are going to start doing serious data science is to learn the object-oriented programming basics.
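As a rough illustration of "don't repeat yourself" and "one class, one job", the sketch below keeps text cleaning and word counting in separate, reusable pieces. The names `clean_text` and `WordCounter` are hypothetical and not taken from the talk.

```python
def clean_text(text):
    """One job: normalize a piece of raw text."""
    return " ".join(text.lower().split())


class WordCounter:
    """One job: count words in text that is already clean."""

    def __init__(self):
        self.counts = {}

    def add(self, text):
        for word in text.split():
            self.counts[word] = self.counts.get(word, 0) + 1


# The cleaning step is written once and reused, not copied between scripts.
counter = WordCounter()
for raw in ["  Nice   hotel ", "NICE pool"]:
    counter.add(clean_text(raw))
print(counter.counts)
```

If the cleaning rules change later, there is a single place to edit, and the counter does not need to know about it.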
The next thing you have to learn, once you know how to organize your code, is coding conventions. Conventions are like table manners for developers: you don't want to annoy other people when they are reading your program. One of the things that annoys people most when data scientists try to integrate code is that they have no idea about code formatting, so you always have to return the code to them to fix it, or you fix it yourself.
Why conventions? They matter even though they cover small things, like tabs versus spaces, the indentation rules, or how you organize the code in a file. In Python the de facto standard is PEP 8. There are some resources online that you can check; this one is a nice, user-friendly way of learning it, with examples of the right and the wrong way to do things.
There are many details in these conventions and you might get overwhelmed by so many things to learn, but you can get help from your editor. In this case it is Emacs, the one I use. Other editors and IDEs can probably be configured to help you as well, not only by checking that you are following the conventions of the company or project you are working on, but also by detecting things that might go wrong. For example, in this case a variable is never used, and your editor can help you detect such things.
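To make the "right way, wrong way" idea concrete, here is a small sketch of the same function written twice. It is an illustration of common PEP 8 advice, not the example shown on the slide; both function names are made up.

```python
# Harder to read: cryptic camelCase name, cramped operators and an unused variable.
def calcAvg(l):
    tmp = 0  # never used; a linter would typically flag this line
    return sum(l)/len(l)


# Closer to PEP 8: snake_case name, descriptive argument, spaces around operators.
def average(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


print(calcAvg([1, 2, 3]), average([1, 2, 3]))
```

Both versions compute the same result; the difference is only in how much effort the next reader has to spend.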
There are other things I would have loved to go into in more detail, but I don't have the time because I want to show you more cool stuff: project structure; testing, because data science code generally does not come with tests; version control and branching, namely, learn how to use your source control; code reviews; and software in general.
There are some books that I recommend you read. They are really general and not tied to one language, but if you want to get closer to the software side, those are good books to start with.
Also, I was reading the talk descriptions on the EuroPython website and there are some talks that I think are relevant and probably cover these topics. If you go to one of them, please tell the speaker that I sent you there.
Until now this has been really theoretical and boring, and you want to see some code, so let's get to it.
The second thing I would have loved to know before I started writing code is, in a nutshell, the collections module. It is incredible how, when you start using Python from a data science perspective, you don't know about other things that are in the standard library, and one of those things is the collections module.
Let's start with counting. Counting is the basic building block of many statistical algorithms; if you start from naive Bayes, for example, they are based on counting. However, I don't think data scientists know how to count properly in Python.
At first I used dictionaries. Who has written something like this? There are two versions. In one you check first whether the key is already in the dictionary. The other, which I learned later, is actually the more Pythonic way: it is easier to ask for forgiveness than permission, I think it goes something like that, which means you just try the operation and handle the error.
Then I was writing something like this, which is defaultdict from collections. Who is familiar with defaultdict? Some of you. It is basically the same: you use a dictionary to which you pass a default factory function, in this case int, so the value by default will be zero and I don't have to check.
But then there is Counter. Who is familiar with Counter? A few of you. Counter is really cool, but it is just a dictionary that is already prepared for counting, and it comes for free. This is how you use it: I just pass a list of items, or any iterable, and I get the counts back. And it has some nice functionality: you can get the most common values, sum counters, and do other operations on them. I found that really cool.
And remember that Counter is a class, and as I just mentioned, you can take classes, extend them and add your own behavior.
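The three ways of counting described above can be put side by side. This is a sketch with made-up data; only the standard-library names `defaultdict`, `Counter` and `most_common` are real.

```python
from collections import Counter, defaultdict

words = ["spa", "pool", "spa", "bar", "pool", "spa"]

# 1. Plain dictionary, asking forgiveness instead of permission.
plain = {}
for word in words:
    try:
        plain[word] += 1
    except KeyError:
        plain[word] = 1

# 2. defaultdict: int() returns 0, so missing keys start at zero.
with_default = defaultdict(int)
for word in words:
    with_default[word] += 1

# 3. Counter: a dict subclass that is already prepared for counting.
counts = Counter(words)

print(counts.most_common(1))
```

All three produce the same mapping; `Counter` simply needs the least code and adds helpers such as `most_common`.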
You can take classes, extend them and add your own behavior. For example, if I want to calculate probabilities for some items, I can extend the Counter class and add a normalize function, and then I already have a probability mass function for those items. If I wish, I could also override the initializer and normalize as soon as I have all the items in the counter. So whenever you are counting things in Python, whether for statistics or for extracting features for something like scikit-learn, Counter is what you should end up using.
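A minimal sketch of extending Counter with a normalize method, as described above. The class name `ProbabilityCounter` is hypothetical, and the sketch assumes the counter is not empty (an empty one would divide by zero).

```python
from collections import Counter


class ProbabilityCounter(Counter):
    """Counter with one extra method (hypothetical name)."""

    def normalize(self):
        # Divide every count by the total to get relative frequencies.
        # Assumes at least one item has been counted.
        total = sum(self.values())
        return {item: count / total for item, count in self.items()}


pmf = ProbabilityCounter("aabbbc").normalize()
print(pmf)
```

Because the subclass is still a Counter, everything else (updating, `most_common`, and so on) keeps working unchanged.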
There is a really nice article about how the way of counting things has developed in Python, and it is really good.
The third thing is named tuples, which I discovered only recently. Who is familiar with named tuples? Some of you. The thing is that when you are writing code, people use a lot of dictionaries and regular tuples, and when you start integrating that into a larger code base you see tuple indices and dictionary keys and you don't know what to expect. That really makes the code hard to read. In the example, if I remove this, you have no idea what I'm talking about. You might guess it from the context, maybe. By using named tuples you make it clear.
A named tuple is basically sort of a lightweight class generator, with the particularity that the attributes are read-only. So they are basically like a struct, if you are familiar with C. You can create classes on the fly, and they have some helpful methods. If you really need to, you can also transform one into a dictionary.
I think it is a nice way, when you are writing code, to organize it and to create sort of domain classes that represent the things in your problem, in your ontology. These days we work a lot with hotels, so I created a hotel base class out of a named tuple, and I actually inherit from it and add a method to do something. Then I pass instances of this class around my code, and that makes it, in my opinion, more readable.
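The hotel example can be sketched as follows. The field names, the method `is_luxury` and the sample values are assumptions made for illustration; the transcript does not preserve the original slide.

```python
from collections import namedtuple

# Field names are hypothetical; the talk does not list them.
HotelBase = namedtuple("HotelBase", ["name", "city", "stars"])


class Hotel(HotelBase):
    """Read-only record with a small piece of behavior added."""

    __slots__ = ()  # keep instances as lightweight as a plain tuple

    def is_luxury(self):
        return self.stars >= 5


hotel = Hotel(name="Example Inn", city="Lisbon", stars=4)
print(hotel.city)        # clearer than hotel[1]
print(hotel._asdict())   # convert to a dictionary when needed
```

Compared with a bare tuple, `hotel.city` documents itself, while indexing and unpacking still work as before.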
Now to the meatier part, which is really interesting. Actually, I remember that in my first interview for the company I'm working at right now I was asked something about iterators and iterables. I think I answered correctly, but it was out of luck: I got the answer right without knowing why. So let's talk about them.
When you see something like this, you are probably familiar with how to iterate through a list. But what is happening under the hood? Why does it work, and how can you write your own classes that have the same behavior? I was confused, and I was looking for the difference between iterables, iterators, generators and so on. I found a nice article, and I'm going to use the graph from it; I just took it. We're going to explore the concepts using this graph.
The part that matters for you right now is iterables and iterators. An iterable is something that you can call iter() on, and it will return an iterator. An iterator is something that produces the next value when you call next() on it. That is abstract, so let's get more concrete.
Take containers, for example. A container can be a list, a dictionary or a tuple. A container is something where you can check whether something is inside it, using the in keyword. In this case I check that a number is in a list, and in this case in a set. A container is typically also an iterable, so you can go through all the elements one by one.
In this case this is a list, and I call iter() on it. I can print the types of both: one is a list and the other is an iterator. I can call next() on the iterator and obtain the items from the list. When you write a for loop, that is what happens: Python gets the iterator from the list and starts obtaining the values. So the for loop is like syntactic sugar in some ways.
In a nutshell, an iterable is any object that can return an iterator. That includes containers like lists and dictionaries, and also files. If you want an object to behave like that, you have to implement the __iter__ method.
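The relationship between containers, iterables and iterators described above can be checked interactively. This sketch uses only built-ins; the variable names are arbitrary.

```python
numbers = [1, 2, 3]

print(2 in numbers)            # containers support membership tests

iterator = iter(numbers)       # an iterable hands out an iterator
print(type(numbers).__name__, type(iterator).__name__)
first = next(iterator)         # an iterator produces the next value

# Roughly what a for loop does behind the scenes:
collected = []
it = iter(numbers)
while True:
    try:
        item = next(it)
    except StopIteration:
        break                  # the iterator is exhausted
    collected.append(item)
```

Note that each call to `iter(numbers)` returns a fresh, independent iterator, which is why the second loop starts again from the first element.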
Some of these iterators might not be finite: they can generate values forever. As an example of that, there is a module in Python called itertools that has a lot of functionality for working with iterables and iterators.
So how do I implement my own iterator? For pragmatic reasons you can implement both the iterable and the iterator in one class: you have __iter__ return self, and then you implement the next method, which in Python 3 is the dunder method __next__. In this case it is just a toy that takes a file and then iterates over it in inverse order, starting from the last line. When there are no more lines to return, it raises StopIteration, which is the exception that the for loop catches. You can then use it in a for loop, as you see here, just as if it were a list.
So now we go over to the green part of the graph. We know that we can get an iterable from things like lists, dictionaries and files, and from those we can get an iterator.
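The reverse-order iterator described above could look roughly like this. To keep the sketch self-contained it takes a list of lines instead of a file, which is an assumption; the class name `ReverseLines` is also made up.

```python
class ReverseLines:
    """Iterable and iterator in one class (a toy, as in the talk)."""

    def __init__(self, lines):
        # Assumption: the lines are already in memory as a sequence of strings.
        self._lines = list(lines)
        self._index = len(self._lines)

    def __iter__(self):
        return self            # the object is its own iterator

    def __next__(self):        # named next() in Python 2
        if self._index == 0:
            raise StopIteration  # tells the for loop to stop
        self._index -= 1
        return self._lines[self._index]


for line in ReverseLines(["first", "second", "third"]):
    print(line)
```

One consequence of returning `self` from `__iter__` is that such an object can be consumed only once; a second loop over the same instance yields nothing.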
But there is another way to get an iterator, and that is by using generators. Who knows what a generator is? A few of you. You get a generator either from a generator function or from a generator expression.
Let's start with something that is not a generator: it is basically a list comprehension. I create a list of numbers, and then I create a list of the squares of those numbers. If the list is small, that is fine. But what if it is a big list of numbers? Would I have enough memory to store them in my RAM?
You can do the same with generator expressions. This is not a tuple comprehension, although it looks like one. It creates a generator object that will produce the squares of the numbers in the list in a lazy way, so each time I call next() it computes one square and returns it. I think about it as a factory of items, and what the factory uses in this case is the square function.
So if I want to do the same as I did with the list, I can get a generator of the squares and then, in a lazy way, print the items. They are generated as the for loop internally calls the next function, as we saw before. A generator function, which I'll get to in a moment, is the same idea but uses a magic word.
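The difference between the list comprehension and the generator expression described above comes down to a pair of brackets. A short sketch with arbitrary numbers:

```python
numbers = [1, 2, 3, 4]

squares_list = [n * n for n in numbers]   # built in memory all at once
squares_gen = (n * n for n in numbers)    # generator object, nothing computed yet

first = next(squares_gen)                 # computes only the first square
rest = list(squares_gen)                  # consumes whatever is left
```

After `rest` has been built the generator is exhausted, which is the price of laziness: the values are produced once and not stored.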
The magic word is yield, and it works in a nice way. When you call the function you also obtain a generator. Then, when you call next(), the code is executed up to the yield, which returns the value back to the program, and execution continues only when next() is called again. This example calculates the Fibonacci sequence, which I'm sure you are familiar with, and I can just call next, next, next.
Be aware that this is an infinite generator, using a while True. If I put it into a for loop, it will go on forever, generating the numbers of the sequence. Or I can use the wonderful islice function from itertools to obtain just a subset of it; in this case I just get the first few.
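A generator function for the Fibonacci sequence, combined with `itertools.islice`, might look like this. The starting values and the slice length of ten are choices made for this sketch, since the slide itself is not preserved.

```python
from itertools import islice


def fibonacci():
    """Infinite generator: each next() runs the body up to the next yield."""
    previous, current = 0, 1
    while True:
        yield current
        previous, current = current, previous + current


gen = fibonacci()
print(next(gen), next(gen), next(gen))   # values are produced on demand

# islice takes a finite slice of an infinite stream.
first_ten = list(islice(fibonacci(), 10))
```

Calling `list(fibonacci())` directly would never finish, so an infinite generator always needs something like `islice` or an explicit `break` on the consuming side.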
You can also implement your own iterables using yield. You are mainly replacing the __iter__ function: instead of returning self, you just turn it into a generator function. In this case I'm reading a file, for example from HDFS, a distributed file system, or it might just be on a server located somewhere. Imagine that I have a source that has an open method. I open it and start iterating through it; this open method might itself even return a generator or an iterator. For example, I do something with the lines and then I yield them back. And I can just consume the data with a for loop. So that's more or less the idea with iterables and iterators.
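An iterable whose `__iter__` is a generator function can be sketched as below. The remote source is replaced by an in-memory `io.StringIO`, and the `opener` argument and the class name `LineSource` are hypothetical stand-ins for whatever client would really open the remote file.

```python
import io


class LineSource:
    """Iterable whose __iter__ is a generator function."""

    def __init__(self, opener):
        # `opener` is a hypothetical callable returning a file-like object;
        # in the talk this would be something that reads from a remote system.
        self._opener = opener

    def __iter__(self):
        with self._opener() as handle:
            for line in handle:
                yield line.rstrip("\n")   # hand back one cleaned line at a time


source = LineSource(lambda: io.StringIO("alpha\nbeta\n"))
for line in source:
    print(line)
```

Unlike a class that returns `self` from `__iter__`, this object can be iterated more than once, because every loop opens the source again and gets a fresh generator.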
You might think: this talk is supposed to be related to data science and data processing, so what does it have to do with iteration? Well, sometimes you cannot load all your data into memory. If you are working in the big data field, that is probably your situation: you might not have enough memory to store all the data you want. In such cases you use data streaming. You can get data streaming through lazy evaluation, which is what I just showed: generating or processing things only as they become available or are needed. As a result you can build memory-efficient data processing pipelines by chaining iterators.
As an example, take the class I just showed you, which obtains lines from the server. I can do some processing on it, maybe split something. I can create another iterable based on it that checks whether a line is a Python comment or some random comment and passes it on. So it is kind of like a stream: you get the data, process it, and send it forward. And you can chain them: first you create the source, then you create the next object and pass the source to it, and so on. Then you just call it in a for loop, and inside, the processing happens in a streaming fashion. Think about Inception, the movie: you go through the various levels; the first level does something, the second level does something, and they send the data along.
But you don't have to write it with objects. You can write a generator function and replace the whole class with a function. This was just an example of how I like to do it. There is a talk at EuroPython that will go into more detail, so if this went really fast and you didn't get the whole idea, you can for sure get more information there.
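The chained pipeline described above can also be written with plain generator functions. The stage names and the sample lines below are invented for this sketch.

```python
def read_lines(lines):
    """Source stage: yield lines one by one."""
    for line in lines:
        yield line


def strip_whitespace(lines):
    """Processing stage: clean each line as it passes through."""
    for line in lines:
        yield line.strip()


def only_comments(lines):
    """Filter stage: keep lines that look like Python comments."""
    for line in lines:
        if line.startswith("#"):
            yield line


raw = ["  # setup  ", "x = 1", "# done"]
pipeline = only_comments(strip_whitespace(read_lines(raw)))

for comment in pipeline:      # nothing runs until the loop asks for items
    print(comment)
```

Each item travels through all stages before the next one is read, so memory use stays flat no matter how long the input stream is.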
So, to finish, some closing remarks for data scientists, engineers and developers, to name a few.
Start with the collections module and the basic data structures; they are your best friends.
Iterables and iterators: build your data processing pipelines using them.
Use object-oriented programming to organize your code. That not only makes your code more maintainable; when you go to integration and you are working in a larger code base, you will have a better time getting your code into it.
And finally, you are going to have to start moving toward being more of a software engineer instead of being just a data scientist. Data science is already a job where you will have to become more of a software engineer whenever you want to get your solutions into production.
Credits: these are the images that I used. I based most of my talk on the articles listed here, and the main ideas come from them. There is also an article I really like, which I'm linking here, about how to build data processing pipelines using iterators and iterables.
We are hiring. If you want to know more about what it is like working for us, we have a table where you can get some goodies, so just drop by and talk to us. And if you want to talk to me about the talk, find me after the Q&A session.
So: questions, comments, remarks? Thank you.
Everyone is hungry, so I don't think there are many questions. Any questions? If you really don't have questions, we thank you again.