Scientist meets web dev: how Python became the language of data

Formal Metadata

CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Gaël Varoquaux - Scientist meets web dev: how Python became the language of data

Data science is a hot topic and Python has emerged as an ideal language for it. Its strength for data analysis comes from the cultural mix between the scientific Python community and more conventional software usage, such as web development or system administration. I'll show how and why Python is an easy and powerful tool for data science.

-----

Python started as a scripting language, but now it is the new trend everywhere and in particular for data science, the latest rage of computing. It didn't get there by chance: tools and concepts built by nerdy scientists and geek sysadmins provide foundations for what is said to be the sexiest job: data scientist. In my talk I'll give a personal perspective, historical and technical, on the progress of the scientific Python ecosystem, from numerical physics to data mining. What made Python suitable for science; how could scipy grow to challenge commercial giants such as Matlab; why the cultural gap between scientific Python and the broader Python community turned out to be a gold mine; how scikit-learn was born, what technical decisions enabled it to grow; and last but not least, how we are addressing a wider and wider public, lowering the bar and empowering people. The talk will discuss low-level technical aspects, such as how the Python world makes it easy to move large chunks of numbers across code. It will touch upon current exciting developments in scikit-learn and joblib. But it will also talk about softer topics, such as project dynamics or documentation, as software's success is determined by people.
Yeah, so, our keynote speaker for the scientific Python, or PyData, track today. I think most of you know him already: he is one of the core, or rather one of the main, contributors to the scientific stack. So please welcome Gaël Varoquaux.
Good, the screen is working, the mic is working, the slides are working. OK. So thank you everybody for coming here, thank you to the organisers, and to Alex for the introduction. So I think we all agree that EuroPython is pretty cool, right? That's the idea, right?
So yesterday's event was really cool. I hope you all got coffee this morning; I did. So what I'd like to do in this talk is to address the very diverse community that we have here, and so what this talk tries to be is a reflection on what we have in common, which is Python. So I'll be talking about things you don't understand, which is my science, and things that I don't understand, which is web development. I don't know how I get into these horrible situations.
Anyhow, I did at some point a PhD in quantum physics, so I think I'm qualified as a scientist. But these days I do computer science for neuroscience. What we try to do is link the neural activity, the firing of the neurons basically, to thoughts and cognition, like what you would do when you drive a car. The way we do this is we use brain imaging, and specifically we frame this as a machine learning problem. This is what I do, and we develop Python software to do this, of course. So if you want to try this, you can actually do prediction of things like visual stimuli based on recordings of brain activity, using this open source software and open data.
You can go online, it's there. But I won't be talking about this today. On the way, we created a machine learning library, which is known as scikit-learn. If I say "we" here, it is with many other people, of course. And this was a huge success. We suddenly became cool, because data science is a fairly cool thing these days. So these days Python is the go-to language for data science, and I'd like to think a bit about how that happened. We did build scikit-learn, and others built other tools, but these were built on a solid foundation, and Python is really giving us that foundation. So, to set up the picture:
scientists do have a reputation of being a bit different in the Python community. Historically, you may say that they come from Jupiter. But then web developers are very different too. I mean, most scientists do not know what those boxes are; I saw this kind of discussion and wondered, what is that? OK, so we're different. For instance, web developers worry about strings, while we worry about numbers, in arrays of course. Web developers care about databases, while we think in terms of arrays of numbers, of course. You might think of object-oriented programming, but no. We don't even use flow control: we get to do it with arrays, right? So there's a bit of a culture gap.
Alright, so let's do something together. How about we sort the EuroPython website? I mean, there are too many abstracts, 205, I can't read them all. And, you know, they're hugely varied: they go from OpenStack to making 10 million dollars with a startup. So the way we'll do this is that we'll do a bit of web scraping to get the data from the website. I could have asked the conference organizers, but this was more fun, right?
And then we'll do a bit of text analysis, and then we'll do data science, and it will give us topics. The nice thing about this example is that it walks us through a good part of the whole Python stack; that's why I like it. So I'm using things like urllib and Beautiful Soup, but also scikit-learn and matplotlib, and wordcloud, which I'd forgotten.
The first thing that we're going to do is crawl the website. Our goal here is to get the schedule, and from the schedule to retrieve the list of titles and URLs, and then we're going to crawl those pages and retrieve the abstracts. We'll be doing this using Beautiful Soup. If you've never used it, it's a library that allows you to basically do some matching on the document object model tree of an HTML page. It's really awesome; scientists would never have developed it.
Then we're going to vectorize the text. The idea is that a text is a bunch of words, right, or characters. So for each document we're going to count how many times each word appears, and we're going to put this in a table. We call this the term frequency: the frequency of each term. So here we have a term frequency vector that's describing my document, and you can see that the most common word is "a", and then "Python" is very common.
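A minimal sketch of the two steps just described: pulling the abstract text out of an HTML page, then counting terms. The page fragment and the `abstract` class name are made up for illustration (the real EuroPython markup is not shown in the talk), and the standard library's `html.parser` is swapped in for Beautiful Soup so that the sketch runs without third-party packages.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Hypothetical page fragment, standing in for one crawled talk page.
PAGE = """
<html><body>
  <h1 class="title">Scientist meets web dev</h1>
  <div class="abstract">Python is a language for data. Python is a tool.</div>
  <div class="footer">Contact the organizers</div>
</body></html>
"""


class AbstractExtractor(HTMLParser):
    """Collect the text found inside <div class="abstract"> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while we are inside an abstract <div>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1             # nested <div> inside the abstract
        elif ("class", "abstract") in attrs:
            self.depth = 1              # entering an abstract

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def term_frequencies(text):
    """Count how many times each lowercase word appears in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


parser = AbstractExtractor()
parser.feed(PAGE)
abstract = " ".join(parser.chunks)
tf = term_frequencies(abstract)
print(tf)
```

With Beautiful Soup the extraction class would collapse to a one-line tree query; the counting step stays the same.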
So maybe that's not a very good description, because some of these terms are all over the documents. What we can do is take the ratio between the frequency of the term in the document and the frequency of the term over the whole database. We call this TF-IDF, the term frequency-inverse document frequency, and you can do this with scikit-learn using what's called the TfidfVectorizer. OK, so now I feel a bit more in my comfort zone: I've gone from text, which I don't understand, to vectors of numbers. Feels better.
So now, if we do this for all the documents, then we have a matrix, right, a 2D array that gives us the terms in the documents: the term-document matrix. This can be represented as a sparse matrix, because most of the terms are present in very few documents. So we can use the SciPy stack to work with sparse matrices, and the good news is that the scientific community, and not only the scientific Python community, has developed lots of fast operations for sparse matrices. So we're doing text mining with things that have been developed by people who solve partial differential equations and things like this. Then we can extract topics.
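The TF-IDF weighting described above can be sketched by hand on a tiny dense term-document matrix. The vocabulary and counts below are invented, and the formula is the plain textbook variant with a logarithm; scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers would differ. A real term-document matrix would also be stored in a sparse format rather than as a dense array.

```python
import numpy as np

# Toy term-document count matrix: one row per document, one column per term.
terms = ["a", "python", "array", "django"]
counts = np.array([
    [3, 2, 1, 0],   # document 0
    [2, 1, 0, 2],   # document 1
    [4, 0, 0, 0],   # document 2
], dtype=float)

n_docs = counts.shape[0]

# Term frequency: share of each term within each document.
tf = counts / counts.sum(axis=1, keepdims=True)

# Document frequency: number of documents in which each term occurs.
df = (counts > 0).sum(axis=0)

# Inverse document frequency: a term present in every document gets log(1) = 0.
idf = np.log(n_docs / df)

# Broadcasting multiplies every row of tf by the idf vector.
tfidf = tf * idf

for term, column in zip(terms, tfidf.T):
    print(term, column)
```

The word "a" occurs in every document, so its whole column is weighted down to zero, which is exactly the effect the ratio is meant to have.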
What we're going to do here is a matrix factorization: we take this term-document matrix and we factorize it into two matrices, one that gives the loadings of documents on various terms... no, sorry, the loadings of documents on what we're going to call topics, and the other the loadings of topics on terms. So the first matrix tells me what documents are in the different topics, and the second matrix tells me what terms are in the different topics. So this is a matrix factorization, and once again I'm back to things I know as a computer scientist. Often we do this with non-negativity constraints in text mining, because the fact that a term is negatively loaded on a topic might or might not mean something. We can do this in scikit-learn with sklearn.decomposition.NMF, for non-negative matrix factorization. That's where the magic happens. So we run this and we get word clouds.
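A small sketch of the factorization step, assuming scikit-learn is installed. The matrix is a made-up stand-in for a TF-IDF term-document matrix, and the choices of two topics, the `nndsvda` initialization and the iteration limit are illustrative assumptions rather than the settings used in the talk.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy stand-in for a TF-IDF matrix (rows: documents, columns: terms).
terms = ["python", "language", "data", "learning"]
X = np.array([
    [1.0, 0.8, 0.0, 0.0],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.0, 0.0, 0.8, 1.0],
])

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)

# Loadings of documents on topics: shape (n_documents, n_topics).
doc_topics = model.fit_transform(X)

# Loadings of topics on terms: shape (n_topics, n_terms).
topic_terms = model.components_

# List the two highest-loaded terms of each topic.
for k, row in enumerate(topic_terms):
    top = [terms[j] for j in np.argsort(row)[::-1][:2]]
    print(f"topic {k}: {top}")
```

Both factors contain only non-negative entries, which is what makes each topic readable as a weighted list of terms, for instance as a word cloud.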
That's a representation of the first topic, and what is it about? It's about the Python language. Good news. The second topic is about, well, data science and machine learning. Then the third topic is something like testing. And then we can look at all the topics, and there's a bunch of different things: you have asynchronous, you've got a topic about the community, one about basically conference organization, Internet of Things, best practices, and one that is not shown here, which is talks in Spanish or Basque. And as Python is not only a numerical language, we can also output a website from this using a template engine, and with a bit of work you can get a reasonably usable website. So it's on the web, you can have a look at it, and there's a link to the code that actually generates all this, so you can run it if you're interested.
Say you want to try it. OK: pip install scikit-learn, and it complains that it needs a C compiler. Now you're starting to get angry at me, right? This goes back to the fact that we're different. Historically, we've had a lot of problems with people who don't have Fortran compilers in their development environments. Well, Fortran has given us really, really fast libraries. I mean, between a naive implementation of matrix operations and a Fortran-optimized one, you can get a factor of 70 of difference. A factor of 70, that's something, right?
So packaging has historically been a major roadblock for scientific Python, and the reason is that we rely on a lot of compiled code and shared libraries. So we've been hitting problems like libraries not being there, or ABI compatibility issues. But the good news is that there is a huge amount of progress, for two reasons. The first one is wheels, and specifically, recently, manylinux wheels, the idea being that you rely only on a conservative core set of libraries. So that basically solves it: the problem I showed shouldn't happen, it should work; you can try it and tell me. And the other reason is this thing called OpenBLAS, which is linear algebra not using Fortran. So that's good news.
By the way, Fortran is a very modern language that is super performant, because it allows automatic vectorization, which C cannot do because it's got different semantics. So don't think that Fortran is something from the seventies. Well, yes, but it's different. But if we work together we can get really awesome things. So for instance, I hope that you can take this example to do text mining on any of your websites; it should be easy to do.
Right? So it's magic, but you can use it.
All right, so now let me help you think a bit more like a scientist, and about how we code. You know what it's mostly about: we really love NumPy. NumPy, right, it's numerical Python: matrix operations, array operations. The reason we really love NumPy is because it's fast. Let's try, for instance, to compute the product of term frequencies and inverse document frequencies on 100 thousand terms. We can do this with a list comprehension, and it takes 6 milliseconds. 6 ms may not sound like a lot, but when I run, say, a non-negative matrix factorization algorithm, I do these things many, many, many times, and actually 100 thousand terms is not that big, so this actually counts. Now if we do this with NumPy, the code is slightly different, and we get 70 microseconds, so that's almost a factor of a thousand speed-up.
Another thing that we really like is that, if you're used to it, it's actually very much more readable. Array computing requires learning it, but once you've learned it, it is extremely readable: compare "tf * idf" to the list comprehension.
It's important to realize that arrays are actually, to us, nothing but pointers. What defines a NumPy array is a memory address, a data type, a shape and strides. The shape and strides are things that tell you how you can move through the array, and basically you're moving through the array by pointer arithmetic: you're moving from one point to another by computing offsets. So it represents regular data in a structured way. This is really important because it matches the memory model of just about every numerical library, whether it's in C, C++, Fortran or, I actually believe, other languages, most languages.
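The comparison and the four ingredients of an array mentioned above can be sketched as follows. The data is random and only the array size follows the talk; no timings are claimed here, since they depend on the machine.

```python
import numpy as np

rng = np.random.default_rng(0)
n_terms = 100_000
tf = rng.random(n_terms)
idf = rng.random(n_terms)

# Pure-Python version: one interpreted multiplication per element.
tf_list, idf_list = tf.tolist(), idf.tolist()
tfidf_list = [t * i for t, i in zip(tf_list, idf_list)]

# Array version: a single expression, the loop runs in compiled code.
tfidf = tf * idf

# What defines the array: a memory address, a data type, a shape and strides.
address = tfidf.__array_interface__["data"][0]
print(hex(address), tfidf.dtype, tfidf.shape, tfidf.strides)

# Strides are byte offsets. Viewing the same memory as 1000 rows of 100
# float64 values, moving one row down skips 100 * 8 = 800 bytes and moving
# one column right skips 8 bytes.
m = tfidf.reshape(1000, 100)
print(m.strides)            # (800, 8)

# Selecting a column copies nothing: it is the same memory, walked with a
# stride of 800 bytes.
column = m[:, 0]
print(column.strides)       # (800,)
```

To time the two versions on your own machine, the standard library's `timeit` module can be pointed at each expression.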
So it enables copyless interactions across this compiled-language border. For me, the value of NumPy is really that it is a memory model. So let's look a bit at why it's faster.
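As a small illustration of such a copyless interaction, the sketch below lets NumPy wrap memory owned by another object. The standard library's `array.array` is only a stand-in for "some other library" holding a buffer of C doubles; the talk does not show this particular example.

```python
import array

import numpy as np

# A buffer of C doubles owned by something that is not NumPy.
buf = array.array("d", [1.0, 2.0, 3.0, 4.0])

# np.frombuffer wraps the existing memory instead of copying it.
view = np.frombuffer(buf, dtype=np.float64)

# Writing through the NumPy view...
view[0] = 10.0

# ...changes the original object, because both share the same memory.
print(buf[0])
print(view.sum())
```

Both objects describe the same bytes with an address, a data type and a length, which is why no copy is needed to pass the data across.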
To compute tf times idf, one thing is that you're not doing any type checking during the operation. At first you're resolving all the dynamic types of the computation, to know what tf times idf will do, but then it's compiled code that runs the operation. But then, maybe most importantly, you're using direct, regular, sequential memory access. You're just grabbing your data; there's no pointer dereferencing. Well, there's one, but after that you're done: you're just grabbing chunks of data from the RAM or from the cache, and that's really fast. And so then your CPU or your kernel library can implement things like vector operations, using for instance SIMD operations. So that's what really makes NumPy fast: bypassing the type checking is part of it, but it's not the only thing.
Right, that's much faster than this, that's cool. Let's look at that with bigger arrays, and then suddenly we get a factor of 2 in compute time per element. Do you have an idea what this may be due to? It's cache. 10 to the 5 elements is approximately the size of the CPU cache; you can do the computation: these are float64, so 8 bytes each. The problem is that memory is much slower than the CPU, so your goal when you want fast calculations is to get things into the CPU as fast as possible, and here you're starting to fall out of cache. So that's bad news for array computing. But it gets even worse.
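The back-of-the-envelope computation mentioned above can be checked directly. The sketch only counts bytes; whether they fit in cache depends on the particular CPU, which is not specified in the talk.

```python
import numpy as np

n = 10**5
tf = np.ones(n)      # NumPy defaults to float64
idf = np.ones(n)

print(tf.itemsize)   # 8 bytes per element
print(tf.nbytes)     # 800000 bytes per array

# Computing tf * idf touches two inputs and one output of the same size.
working_set = tf.nbytes + idf.nbytes + (tf * idf).nbytes
print(working_set / 1e6, "MB")
```

So around 10 to the 5 elements the data touched by a single multiplication is already in the megabyte range, which is the order of magnitude at which arrays stop fitting in cache on many machines.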
If we do a slightly more complex operation, 2 * a - 1, then the cost really starts increasing. So what's going on here? Well, if we look at what's happening, NumPy is computing 2 * a and creating an array that we don't see, which I'm going to call a temporary array, and then it's removing 1 from this temporary. So what we're doing here is really moving things in and out of the cache, hugely, and we get pretty bad cache utilization. This is because of the Python computing model; it's just the
way Python works. So we can time this, and we see that there is a
huge cost to adding this "minus 1" to the computation. But we can play a trick, if we know how arrays work, and do things slightly better by using an in-place operation for the second step. The idea is that we're reusing the allocation of the temporary array; we're not allocating arrays twice. If we do this, it gets much, much faster, and the reason is that we've become much better at cache locality. So if we look at our
graph, we can add the in-place version. It's still going up with the number of elements,
but because it's an in-place operation, it's faster. So what we have
here is really a compilation problem: I want to go from this expression to this expression.
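The two expressions being compared can be sketched as follows. This is only an illustration: the array size and variable names are assumptions, and recent NumPy versions may already avoid some temporaries on their own, so any measured gap depends on the version and the machine.

```python
import numpy as np

# Illustrative input; the size is an arbitrary choice, not taken from the talk.
a = np.arange(1_000_000, dtype=np.float64)

# Naive form: "2 * a" is first materialized as a hidden temporary array,
# and "- 1" is then applied to that temporary.
b_naive = 2 * a - 1

# In-place form: allocate the result once, then update that same buffer.
b_inplace = 2 * a   # single allocation
b_inplace -= 1      # second step reuses the allocation

# Same idea spelled with explicit output buffers.
b_out = np.empty_like(a)
np.multiply(a, 2, out=b_out)
np.subtract(b_out, 1, out=b_out)
```

Timing the naive and the in-place variants with `timeit` for growing array sizes is one way to reproduce the kind of graph discussed here.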
So we want to do things like removing or reducing temporaries, or we might want to do chunked operations: if I can loop over the data in blocks of the right size, then each block fits in the cache. numexpr, which is mostly developed by Francesc Alted, can do this using string expressions. So here's an example:
numexpr.evaluate("2 * a - 1").
Without us being clever, numexpr was clever for us, and you get the speed. OK, so you get the same speed-up as with the in-place version.
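The chunking idea can be written by hand in plain NumPy. This is a sketch of the principle only, not of how numexpr is implemented; the function name and the default `chunk_size` are assumptions made for illustration.

```python
import numpy as np

def chunked_2a_minus_1(a, chunk_size=4096):
    """Evaluate 2 * a - 1 block by block, so that each block can stay in cache."""
    out = np.empty_like(a)
    for start in range(0, a.shape[0], chunk_size):
        block = slice(start, start + chunk_size)
        # Both steps write into the same view of the output: no full-size temporary.
        np.multiply(a[block], 2, out=out[block])
        np.subtract(out[block], 1, out=out[block])
    return out

a = np.linspace(0.0, 1.0, 10_001)
result = chunked_2a_minus_1(a)
```

With numexpr installed, the call shown in the talk, `numexpr.evaluate("2 * a - 1")`, asks the library to make this kind of decision instead of hard-coding a block size.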
That is basically a just-in-time compiler, well, a compiler that does these kinds of things, with
bytecode inspection. Another approach is a nice package called lazyarray, which basically builds an expression but doesn't evaluate it, and then evaluates it when you call it. Again, it's basically going around the Python evaluation. And I'd like to
point out that this is actually not a problem that is specific to scientific computing. It's a similar problem to things like grouping and paginating SQL
queries, if I'm talking about things like the Django ORM. Right. So just
to summarize, this is the kind of thing you could give to your boss, your CEO: if the data is too small, you get overhead, the overhead of Python and the overhead of array operations; if it's too big, you fall out of cache. The optimum lies in the middle. We'd probably want to be sitting over here, because that's where big data is, that's where the
magic is, and the money. Some people want to take a picture of this part. Right. Now, what if we
need flow control? For instance, we
don't want to divide by a if a is 0. I told you we don't use flow control, so what we're going to do is write an expression. This expression is basically saying: where a is zero, which returns an array of booleans, I put the result to 0. OK, so that's how we do flow control. Now suppose we're looking at ages in a population and I want to compute the mean age of males versus females. Then I can index the age array with the gender array and say: where gender is equal to
male, I compute the mean, and I subtract the mean where gender is equal to female. Now this is really starting to
look like a database, right? We're really starting to do selections. So what came about in parallel to NumPy is a library called pandas, which is really something in between arrays and an in-memory database. It's been huge in the community because it's fantastic for these queries and this data massaging. For numerical algorithms it's maybe less fantastic, because anyhow we're falling
back to NumPy.
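Both tricks, flow control through a boolean array and selection with a mask, might look like this in NumPy. The sample values are made up for illustration and are not data from the talk.

```python
import numpy as np

# Flow control without a loop: divide only where the divisor is non-zero.
a = np.array([0.0, 1.0, 2.0, 0.0, 4.0])
is_zero = (a == 0)                      # array of booleans
inverse = np.zeros_like(a)              # entries where a == 0 stay at 0
np.divide(1.0, a, out=inverse, where=~is_zero)

# Selection with a mask: mean age of males minus mean age of females.
age = np.array([23, 35, 41, 29, 52, 60])
gender = np.array(["male", "female", "female", "male", "male", "female"])
mean_age_gap = age[gender == "male"].mean() - age[gender == "female"].mean()
```

The expression `age[gender == "male"]` already reads like a `WHERE` clause, which is the resemblance to database queries that the talk points at.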
OK, so what does this tell me? You're not really doing Python, right? You're writing beautiful Python code that sits on top of lots of ugly Fortran and C++ routines. That gives you scalability, but it also gives you installation problems. But then I realized that most web development is actually some beautiful Python code sitting on services, like a database, that could be in C++, in Java, in Erlang, in God knows what, in Node.js. And that actually gives deployment problems. We get compilation problems, you get deployment problems; they
are not that different, right? We're just struggling with similar things, instantiated in a different manner. So these days I
like to think that NumPy is the scientist's equivalent of an ORM, sort of. So numerics, as we've seen, are
really efficient because we apply them to regular data. But array operations create cache misses for bigger arrays, so we need to fight to remove temporaries and maybe to chunk data. And if we do
queries, then they're going to be really efficient only if we can use indexes or trees. Typically databases do that, but we're also going to need to group queries. So all of these are compilation problems, but
compilation is not very Python-like. So what can we do? For instance, we can think of a computation and query language; that's a bit what numexpr does. But I
really hate domain-specific languages. Each time I try to use SQL, because I'm not a web developer, I get it wrong and I get annoyed. The other problem is that NumPy is actually extremely expressive: the things that you can do with NumPy, or with related tools, are extremely varied. So I don't think that's a
good way to go. And anyhow, I like Python and I want to be doing Python. So
one approach is to hack, and a really cool example is Pony ORM. That's web development, so you should know it better than me. What Pony ORM does is compile Python generators to optimized SQL queries. You can write something that looks like a Python generator, and it's going to do bytecode inspection, well, AST-based inspection I believe, then grab the AST and build a SQL query on top of this, and optimize it by compilation and grouping. So that's really gross, and really surprising, but it's really cool. So I'd like to
draw your attention to something that's happening a lot in the
big data world, which is something that's known as Spark. It's a rising star, and it sits basically on top of the JVM, in the Java world. It combines two things. People don't usually realize this, but it combines a distributed store, which is some form of
database-like store, and a computing model, and puts them together, which allows it to do distributed computing in a reasonably efficient way. Now the thing is that we, in the Python world, are actually much faster when the data fits in RAM, and the reason is that we're really representing data as regularly spaced arrays, so we can stream through them, whereas
the Java world has a lot of references. So if we want to
scale up, maybe we're going to have to do operations on chunks, right? Maybe we need to chunk the data and then, in parallel or in series, it doesn't matter, compute things on arrays that fit in RAM, or in
cache. Now this is great for certain computing patterns,
things that are for instance known as extract-transform-load. But if you're doing multivariate statistics, which is what machine learning is about, you're really combining information from all over the data that you're reading, and you're
not learning only this or that, but the interaction: between the term "machine" and the term "learning", say, where those two together make a topic. So the kinds of compute graph that you get are horrible, and it means that things like out-of-
core operations, which is basically what we're doing when we're chunking data, are not efficient: there is no data locality. So one approach is to do algorithm development, which is what I do, so I'm happy. The idea is that you use online algorithms: basically you don't use the same algorithm, you use an algorithm that works on a stream, and then you start chunking the data and changing the algorithms. If you've heard of
deep learning, then the number one algorithm used in deep learning is stochastic gradient descent, and that's how it works. That's how people can apply deep learning, which is extremely computationally expensive, to huge datasets. So, back to data
science.
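As a rough sketch of what "an algorithm that works on a stream" can mean, here is mini-batch stochastic gradient descent for a least-squares linear model, fed one chunk at a time. Everything in it (the synthetic data, `TRUE_W`, the learning rate, the function names) is an assumption made for illustration; it is not code from the talk or from scikit-learn.

```python
import numpy as np

TRUE_W = np.array([1.0, -2.0, 0.5])  # hypothetical coefficients to recover

def make_chunks(n_chunks=2000, chunk_size=50, seed=0):
    """Yield (X, y) chunks one at a time, as if streamed from disk."""
    rng = np.random.default_rng(seed)
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, TRUE_W.size))
        yield X, X @ TRUE_W

def sgd_linear_regression(chunks, n_features, learning_rate=0.05):
    """One gradient step per chunk: only the current chunk is ever in memory."""
    w = np.zeros(n_features)
    for X_chunk, y_chunk in chunks:
        residual = X_chunk @ w - y_chunk
        gradient = X_chunk.T @ residual / len(y_chunk)
        w -= learning_rate * gradient
    return w

w = sgd_linear_regression(make_chunks(), n_features=TRUE_W.size)
```

The full dataset never has to fit in RAM: the model sees each chunk once and keeps only the weight vector between chunks.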
We have shown you how we can go from the term-document matrix to a factorization. Then there's magic, right? There is an algorithm, and I did not discuss how it works, which is imported from
scikit-learn. What the scikit-learn developers do is take horrible papers full of mathematical expressions and, drinking a lot of coffee, turn them into this code. It's really hard, by the way.
People were asking me yesterday why we still use code that was written 40 years ago, or 20 years ago, in Fortran. It's because writing stable numerical code is extremely hard, and
no better code has been written so far. So the reason that we, in scikit-learn and in SciPy, have been able to do this is
thanks to the high-level syntax of Python and everything I've presented here. The reason all this is important is that it reduces our cognitive load and allows us to do this. All right, let's talk a
bit about something other than numerics. Let's talk about the future, and about what's going to make Python great again. So I think that we've
been seeing recently that data flow and computation flow are crucial. You can have the simple data-parallel problems, you can have the messy compute graphs, you can have online algorithms. And so data-flow engines are actually
popping up everywhere. For instance, maybe you've heard of dask. dask is a pure-Python static graph compiler: it represents a set of function calls on the data as a graph, compiles it, and then uses a dynamic scheduler on this to do parallel and distributed computing. So it's
really nice, except that it's basically static, which means I can't add things to my graph as I go. Another tool that people use in deep learning is Theano, and people probably don't realize that it has expression analysis inside it: it builds a graph of operations and optimizes it. TensorFlow is a C++ library by Google
to do deep learning, and it also builds a graph of operations. So graphs of operations are there in many, many different libraries.
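To make "function calls represented as a graph, then handed to a scheduler" concrete, here is a deliberately tiny sketch. The dictionary format and the `execute` function are hypothetical and written only for this illustration; they are not the API or the implementation of dask, Theano or TensorFlow, and this scheduler is sequential where a real one would run independent tasks in parallel.

```python
from operator import add, mul

# Hypothetical toy format: key -> literal value, or (function, *arguments),
# where an argument that is itself a key of the graph is a dependency.
graph = {
    "a": 1,
    "b": 2,
    "sum": (add, "a", "b"),
    "scaled": (mul, "sum", 10),
    "unused": (mul, "a", 1000),   # never requested, so never computed
}

def execute(graph, key, cache):
    """Compute `key`, recursively evaluating only the tasks it depends on."""
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        resolved = [
            execute(graph, arg, cache) if isinstance(arg, str) and arg in graph else arg
            for arg in args
        ]
        result = func(*resolved)
    else:
        result = task  # plain data
    cache[key] = result
    return result

cache = {}
result = execute(graph, "scaled", cache)
```

Because the whole computation is available as data before anything runs, a scheduler is free to skip unused branches, reorder tasks, or ship them to other workers.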
I believe that Python should really shine here, because it is reflective, because it can do some form of metaprogramming, and because of the recent async developments, since I think the future is parallel and distributed computing. So as
Nathaniel Smith, who is a NumPy developer, said: Python is the best numerical language because it's not a numerical language. And I believe this is
extremely true. But we have a bit of a problem here, which is that the API is really challenging, because we're doing algorithm design, and we can't really do what you guys have been doing with something like Django, where there is basically an inversion of control and you're no longer writing imperative code as you would
normally do; you're buying into a framework. I don't believe we can write really complex algorithms like this: it's just too much cognitive overload.
But it's just an API design problem, and we'll solve it. So in terms of
ingredients for our future data flows, I think distributed computation and runtime analysis are important things. And for this I think reflexivity is central. It's really useful for debugging, by the way. For me, in Python the number one thing is the ability to debug, and to debug in a high-level way, which means I can debug things like numerical instability in my algorithm. That's really hard to do: you've got something that blows
up somewhere in terms of numerical precision. Python is fantastic for this. I can do
interactive work, which is how most data scientists work. This already enables us, and will enable us more and more, to do code analysis, which is going to be really important for being efficient. Then it gives us persistence, which is extremely important for parallel computing, because when
you're doing parallel and distributed computing you need to move data, you need to move objects around between different computers, and you need to move code. For this you need to serialize things. So
I realize that we've been relying on pickle; distributed computing has been relying hugely on pickle. The idea is that it's used to distribute the code and the data between the different workers. But we can also use it to serialize intermediate results. OK, so that's one way of doing computation on data where all the intermediate results might not fit in RAM. It can be made
very easy with Python. Another thing that we do is that we actually use pickle to get a deep hash, in the sense of a cryptographic hash, of any data structure. It's really nice because it allows you to see if things have changed or not, and so to avoid recomputation. But the problem is that pickle is actually very limited, the way it's implemented in the core library:
there's no support for closures or for lambdas. These things are not fundamental limitations; they're trade-offs, basically. And so there are variants of pickle, like dill or cloudpickle, and I must say that I'd really like one of those two, or maybe ideas from one of those two, to go into the standard library, because it's actually hugely limiting for parallel computing not
to be able to pickle everything. I realize we're never going to be able to pickle absolutely everything, and I also realize that I can write code that always pickles, because that's what I do, but when I give this to a not very advanced user, he will at some point
write code that doesn't pickle. So for me, by the way, this is more important than the GIL. That may be surprising, but when you get to know distributed computing well, these things are the problem: data exchange, basically.
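Both points, hashing through pickle and the lambda limitation, can be tried with the standard library alone. This is a minimal sketch with hypothetical helper names (`deep_hash`, `can_pickle`); it is not joblib's implementation, and pickle byte streams are not guaranteed to be canonical, so equal objects built in different ways may hash differently.

```python
import hashlib
import pickle

def deep_hash(obj):
    """Hash a picklable object through its pickle byte stream."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return hashlib.sha256(payload).hexdigest()

def can_pickle(obj):
    """Return True if the standard pickle module can serialize `obj`."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True

unchanged = deep_hash({"a": [1, 2, 3]}) == deep_hash({"a": [1, 2, 3]})
changed = deep_hash({"a": [1, 2, 3]}) != deep_hash({"a": [1, 2, 4]})
builtin_ok = can_pickle(len)            # pickled by reference to its name
lambda_ok = can_pickle(lambda x: x)     # no importable name to refer to
```

Comparing hashes of inputs is what lets a caching layer decide whether a result on disk can be reused instead of being recomputed.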
What we have is a small library that we call joblib, which gives us ingredients for data-flow computing. One thing it offers is a very simple parallel computing syntax, which is basically syntactic sugar for parallel for loops. Under the hood it uses threading or multiprocessing, or just about any backend: you can plug in your own backend there. It does fast persistence: it's basically a subclass of pickle, but it does clever things for NumPy arrays. And it gives primitives for out-of-core computation. The reason I'm pointing this out is that it uses very non-invasive syntax and paradigms. So with a library like joblib we can write algorithms, and it's actually used inside scikit-learn, even though you may not know it. It has been
designed to be fast on NumPy arrays, and it's getting more and more of an extensible backend system. So I'm looking forward to a world where we can use things like Celery to basically distribute computation from scikit-learn in more of a web-development environment. I don't know if it's a good or a bad idea, but I'd
like to try. So I think CPython
is great, and one of the reasons it's great is that it's simple, which is what a lot of people have been criticizing it for. For instance, the Java world tells us that they have software transactional memory, and it's really cool; it would be nice for Python. But personally, I really need to use foreign memory; I need it. Interestingly, Java has recently gained jemalloc to allocate, basically, foreign memory. We'd like better garbage collection, we really would, but just about every C extension relies on reference counting, and the reason is that it's actually very easy to
manipulate the reference count if you're not sitting in the VM. So basically CPython is something that I can manipulate without being inside it, which means that it's really great for connecting to compiled languages. Talking to people at the conference, many people actually use this: many people use libraries that have been developed in another language. I'd like to draw your
attention to Cython. Who knows
Cython? Good. Who uses Cython? Good. It really gives us the best of C and Python. You can add types for speed, and it does things so well that when you type NumPy arrays, it basically becomes a float*, as you would have written in C, so it's super fast. But you can also use it to bind external libraries, and it's surprisingly easy. The
good thing is that suddenly you're working with these libraries, you're working with C-like code, without any malloc, free or pointer arithmetic, which is for me the number one problem of these languages. So I see this as an adaptation
layer between the Python VM and C, and it's a really fantastic tool. By the way, I think everybody should be writing extensions using Cython, because it's an abstraction over the CPython library, the CPython API. So for instance
you can write code that's very readable and that compiles with Python 3 and Python 2, even though there have been a lot of changes in the CPython API. It's also good for the NumPy developers, because they'd like to change things in the C API, and if everybody writes Cython they will be able to, because Cython will do the impedance matching. OK, so we need
scientists to be able to work with web developers, and we really need to love each other. I'm really serious here. I really enjoy the people
who are not doing science in the Python community. First, they teach me things, and second, they
make the tools that I use. And so I'd like our tools to be
useful for others. I'd like to point out that scikit-learn is actually really easy machine learning. It's really a very simple syntax: basically you import an object, and it's the magic object that will do classification, the recognition of things. You instantiate it, and then you give it data. So it's
basically matrices, right? We only do matrices, and so you have to figure out how you convert your own data to matrices. Then you call fit, and then you call predict. For people, one of the
successes of scikit-learn is this encapsulation: people have really loved the fact that the classifier is somewhat of a black box, so they can use it without fully understanding it.
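The instantiate, fit, predict pattern can be illustrated with a toy classifier written from scratch. The class below is a hypothetical stand-in that only mimics the calling convention described here; it is not scikit-learn code, and the training data is invented for the example.

```python
import numpy as np

class NearestCentroidToy:
    """Toy classifier following the fit/predict convention (not scikit-learn code)."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # One centroid per class: the mean of the training rows with that label.
        self.centroids_ = np.array([X[y == label].mean(axis=0) for label in self.classes_])
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Distance from every sample to every centroid: shape (n_samples, n_classes).
        diffs = X[:, np.newaxis, :] - self.centroids_[np.newaxis, :, :]
        distances = np.linalg.norm(diffs, axis=2)
        return self.classes_[distances.argmin(axis=1)]

# The data is just matrices: one row per sample, one column per feature.
X_train = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array(["cat", "cat", "dog", "dog"])

clf = NearestCentroidToy().fit(X_train, y_train)
predictions = clf.predict(np.array([[0.1, 0.0], [1.0, 0.9]]))
```

The caller only needs to know the two method names and the matrix layout, which is the encapsulation being praised: the inside of the object can be swapped for any other classifier.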
That's another thing that Python is giving us: object orientation, in a really, really cool model that allows us to do object-oriented programming without crazy class diagrams. And another thing that we've used hugely is what people
call documentation-driven development; there was a talk about this. So we try to make this API as simple as possible. What I'm trying to get at here is that we're trying to give you a high-level, simple API to reduce your
cognitive load, just like Python and NumPy reduce our cognitive load when we're implementing these algorithms. So we all
have different things to bring here, and we can all benefit from each other. But we can do this only if we're really careful to reduce each other's cognitive load on what the other does not understand. I think it's extremely important. So it's important to
be didactic outside of one's own community, and actually Python is really good at this. The Django documentation is known as being really excellent, and Python worries about syntax being beautiful. To do this we need to do things like avoiding jargon. Machine learning is really bad at that, it's full of jargon; we in scikit-learn try not to have too much. We need to prioritize information, and so
for instance, students who are applied-math students and are learning about numerics, I hate to tell you, they don't care about Unicode, even the French ones who have accents in their
first names. One recommendation I have for people who do API design is: build
documentation upon very simple examples, and examples that run. One thing that we do is that we have this thing called sphinx-gallery, which basically uses Sphinx to build our documentation, running all the examples. That means that the examples must run, and they must run fast, which means they must be small enough to run. I think this has helped a lot with the documentation, but also with the design. All right, so
I think it's pretty great, and it's because of the interaction between people like scientists and people who are not scientists, whether they're web developers or DevOps or anything.
Have I been censored?
What was I saying? Well, anyhow: the Python language and its VM are the perfect tool to manipulate low-level concepts, whether they are arrays or things like trees written in C, with high-level wording. I personally think, and it's a personal opinion, that this has contributed hugely to the recent success of Python. When you look at how
people are using it, at some point they're very often pointing to something low-level. Dynamism and reflexivity are crucial because they enable metaprogramming and debugging. But we also find that we need
compilation for speed. So then there's this tension between dynamism and compilation, and I have the feeling it's everywhere: it's also in web development, where you are, say, compiling SQL queries. And I'm extremely excited about the PEPs that Victor Stinner is pushing forward, like the guards on internal
structures to allow checking at runtime for modifications, which will allow any kind of hack that we do on the code to be invalidated if the environment changes, or the PEP for function specialization. Finally, I
think that PyData has gained, and will gain, hugely from the database world and from the concurrency tools that are developed a lot in the web and DevOps world. But I think it can also give back other things, like knowledge engineering and AI, which are really growing hugely. And just in case you haven't noticed, data science is disrupting just about every job that you're doing, so it's cool that there is data science in Python, right? That's all I have. Thank you.
Thank you very much, Gaël, for the talk, with pretty nice insights into
a slightly different world. So, questions? Raise your hands.
One thing, not specifically a question but a statement: the scientific world was a very early adopter of Python 3. I think that already several years ago most of the scientific stack was on Python 3, which is a very good thing, and you can use pretty much any good scientific package on Python 3.
The biggest cost of Python 3 for us was the change of the CPython C API, and so actually people in niche applications still have code that doesn't run on Python 3 because of the C API. But all the main libraries, by far, run on Python 3, and everything I do runs on 3 and 2.
More questions?
OK, probably somebody has to ask it anyway: what about PyPy?
I did a bit of trolling about it in my talk. Yeah, I know
a lot about it. To give a little background, my brother studied language theory, so we get crazy discussions all the time. So yeah, I know a lot about these things. Part of what I wanted to say in my talk was that it's not only about the JIT; it's about the memory model. I think, by the way, that PyPy has progressed hugely in this sense, in that it is no longer trying to say "I'm going to control the memory for everything", which historically was a big roadblock for us. I could not believe that PyPy would be useful for scientific computing, because for a long time I heard that the end goal of PyPy was things like software transactional memory, which is really cool by the way, but would cost us a lot. And the other thing is that we're not going to get rid of the compiled code, because there is so much history in making those algorithms really good, and it's extremely hard. But I do believe that what the PyPy world is doing, which is a lot of analysis of the code, is extremely, extremely useful. Thank you.
Any more questions? Already, in the back.
Sorry, but you keep referring to "our world" and "your Python world". Is the division that clear?
Well, for me, no. For me it's all Python: I've got personal friends in all the communities, and I use all kinds of different tools. But I'm afraid there is a division, and I'd like to think that it's fueled by the different trade-offs. I'd like to fight it, by the way: I don't want it, and I don't think it's useful. But you hear about things like conda, which is sort of a package manager for Python and other things, and the reason it was created was, basically,
pip. The way I see it, the reason it was created was that the scientific crowd was unable to explain the struggles they were having with the packaging
tools in Python, and just went on and did their own stuff. Well, the good thing is that some people then stepped in and worked on it, and now I believe it should be able to work fine. But that's one example of the division. I think it exists, and I think we need to fight it, because of our values. I really believe in our values: the fact that we're diverse and that we're able to work together. Yeah, great question.
How do you see the scenario in five to seven years, relative to other languages like R, or to newer things?
So you're talking
about the scientific world? Yes. R has an extremely strong community; I don't think R will die. To give you some background: when we started scikit-learn, which was seven years ago, everybody would walk up to us and say "you guys are crazy, everybody does machine learning in R, everybody does it in Matlab". Seven years down the line, nobody mentions this. So why R? R as a language is a horrible language, but in terms of libraries, and I told you numerical algorithms are really hard, R has a crazy amount of them, and for me, in statistics, R is the reference. But the value of data analysis is not only in numerics, it's in combining things, and I think we have an edge here. As for Matlab, I think we're eating it slowly, and they're fighting back: I'm getting e-mails on a monthly basis about trainings and the like. But the fact that they're reaching outwards, that they're pouring money into fighting this, is telling me something. Maybe it's going to take a bit of time in the scientific world. A strong contender would be Julia: it's a typed language that is able to do type inference, and it's compiled and extremely fast. I really don't like it. I mean, it's a fantastic language, but I really don't like it because it's a numerical language. They don't think of it that way, but the whole community is a numerical community.
Thanks for the fantastic talk and the
fantastic library. scikit-learn is only one of the libraries in the scikit family; there is also scikit-image and others. What is your relationship with the scikit family?
That's very historical. We used to
have, that's around 2008, scikits with an "s", which was a namespace package. Do you guys know namespace packages? They're one of my nightmares. It was tied to SciPy, and that's how we all started, as add-ons to SciPy, because SciPy was getting too big. And then we got rid of the "s", of the namespace package; it used to be called scikits.learn. A scikit is a SciPy toolkit, and "sci" means scientific. It's very historical. As for the relationship: we're friends.
OK, one last question.
Not really a question, but I want to point out one specific thing about conda: it goes beyond Python. This is where people come to struggle, with non-Python-specific stuff. So if you want a database, or some other specific non-Python dependency, as part of the environment, conda can actually do that. So it actually sits on top of Python and not inside it; it's more like apt-get than pip. And so in this case I'm not really sure why we should have something in the standard library that actually does that.
I completely
agree. So the comment is that conda is more than Python, basically. I know; it is, by the way. But historically it's not been marketed like this. I mean, I've heard too much "don't use pip, use conda", which really annoys me, by the way. And the other thing is that I haven't seen much work go from conda to pip. I'm not even talking about contributing back to pip; I'm talking about explaining what was being learned, right? It's extremely important. We need one place where we can tell everybody "go and get your stuff", and we need this place to be good, and we need to work together. In a sense conda has achieved something, because it's showing that you can do things better. But you need to go all the way back and give those improvements back to the wider Python ecosystem, because it's all going to benefit.
OK, so we
have one more thing to announce, so please don't run away after you've given a fantastic, enthusiastic round of applause for Gaël. Thank you.

