Scientist meets web dev: how Python became the language of data
Formal Metadata
Title 
Scientist meets web dev: how Python became the language of data

Title of Series  
Part Number 
168

Number of Parts 
169

Author 

License 
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and noncommercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license. 
Identifiers 

Publisher 

Release Date 
2016

Language 
English

Content Metadata
Subject Area  
Abstract 
Gaël Varoquaux - Scientist meets web dev: how Python became the language of data

Data science is a hot topic and Python has emerged as an ideal language for it. Its strength for data analysis comes from the cultural mix between the scientific Python community and more conventional software usage, such as web development or system administration. I'll show how and why Python is an easy and powerful tool for data science.

Python started as a scripting language, but now it is the new trend everywhere, and in particular for data science, the latest rage of computing. It didn't get there by chance: tools and concepts built by nerdy scientists and geek sysadmins provide foundations for what is said to be the sexiest job: data scientist. In my talk I'll give a personal perspective, historical and technical, on the progress of the scientific Python ecosystem, from numerical physics to data mining: what made Python suitable for science; how SciPy could grow to challenge commercial giants such as Matlab; why the cultural gap between scientific Python and the broader Python community turned out to be a gold mine; how scikit-learn was born and what technical decisions enabled it to grow; and, last but not least, how we are addressing a wider and wider public, lowering the bar and empowering people.

The talk will discuss low-level technical aspects, such as how the Python world makes it easy to move large chunks of numbers across code. It will touch upon current exciting developments in scikit-learn and joblib. But it will also talk about softer topics, such as project dynamics or documentation, as software's success is determined by people.

00:01
Yeah, so our keynote speaker
00:04
for the scientific Python, or PyData, track today I think most of you know already. He is one of the core, or rather one of the main, contributors to the scientific stack. So please welcome Gaël Varoquaux. Good. The screen is working, the mic is working, the slides are working. OK. So thank you everybody for
00:49
coming here, thanks to the organisers, and to Alex for the introduction. So I think we all agree that EuroPython is pretty cool, right? That's the idea, right?
00:59
So the social event was really cool yesterday, so I hope you all got coffee this morning. I did. What I'd like to do in this talk is to address the very diverse community that we have here, and so what this talk tries to be is a
01:17
reflection on what we have in common, which is Python. So I'll be talking about things you don't understand, which is science, and things that I don't understand, which is web development. I don't know how I get into these horrible situations. Anyhow, I
01:36
did at some point a PhD in quantum physics, so I think I'm qualified as
01:40
a scientist, but these
01:42
days I do computer science for neuroscience. What we try to do is link neural activity, so the firing of the neurons basically, to thoughts
01:52
and cognition, like what you would do when you drive a car. The way we do this is that we use brain imaging, and specifically we treat
02:02
this as a machine learning problem. This is what I do, and we develop Python software to do this, of course. So
02:11
if you want to try this, you can actually do prediction of things like visual stimuli based on recordings of brain activity, using this open source software and open data.
02:22
You can go online, it's there. But I won't be talking about this today.
02:27
On the way, we created a machine learning library, which is known as scikit-learn. When I say "we" here, it is with many people, of course, not only
02:37
us. And this was a huge success: we suddenly became
02:42
cool, because data science is a fairly cool thing, it seems. So these days Python is the go-to language for data science. I'd like to think a bit about how that happened, because we did build scikit-learn, and others built other tools, but these were built on a solid foundation, and Python
03:05
is really giving us that foundation. So, to set up the picture:
03:11
scientists, we do have a reputation of being a bit different in the Python community. Historically, you may say that we come from Jupiter. But then
03:21
web developers are very different too, and I have a feeling most scientists do not know what [unclear] are. I saw this kind of discussion and wondered, what is that? OK, so we're
03:35
different. For instance, web developers worry about strings, while we worry about numbers, in arrays of course. Web developers care about databases, while we think in terms of arrays of numbers. Of course, you might think of object-oriented programming, but no: good old flow control is good enough for us, and we get
03:55
to do it with arrays, right? So there's a bit of a culture gap, right? All right, so let's do something
04:02
together. How about we sort the EuroPython website? I mean, there are too many abstracts, 205, I can't read them all. And, you know, they're hugely varied: they go from OpenStack to making 10 million dollars with a startup. So that's what I'm showing on this slide. And so the way we'll
04:27
do this is that we'll do a bit of web scraping to get the data from the website. I could have asked the conference organizers, but that was more [unclear], right?
04:37
And then there will be a bit of text analysis, and then there will be data science, and it will give us topics. The nice thing about this
04:44
example is that it walks us through a good part of the whole Python stack; that's why I like it. So I'm using things like urllib and Beautiful Soup, but also scikit-learn, matplotlib and wordcloud. So the
05:02
first thing that we're going to do is crawl the website. Our goal here is to get the schedule; from the schedule we retrieve the list of titles and URLs, and then we're going to crawl those pages and
05:15
retrieve the abstracts. We'll be doing this using Beautiful Soup. If you've never used it, it's a library that allows you to basically do some matching on the document object model tree of an HTML page. It's really awesome; scientists would never have
05:32
developed it. Then we're going to vectorize the text. The idea is that a text is a bunch of words,
05:41
right? Or characters. So for each document we're going to count how many times each word appears, and we're going to put this in a table. We call this the term frequency: the frequency of each term. So here we have a term frequency vector that's describing my document, and you can see that the most common word is "a", and then "Python" is very common.
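The counting step described here can be sketched with the standard library alone. The sample sentence and the whitespace tokenization below are illustrative assumptions, not the abstract or the tokenizer used in the talk.

```python
from collections import Counter

# Hypothetical stand-in for one scraped abstract.
document = "Python is a language and a community and Python is a tool"

# Naive tokenization (an assumption): lowercase, then split on whitespace.
tokens = document.lower().split()

# Term frequency: how many times each word appears in this document.
term_frequency = Counter(tokens)

# Show the three most frequent terms with their counts.
for term, count in term_frequency.most_common(3):
    print(term, count)
```

As in the talk's slide, a function word such as "a" dominates the counts, which is the motivation for the reweighting discussed next.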
06:08
So maybe that's not a very good description, because some of these terms are all over the documents. What we can do is take the ratio between the terms all over the
06:17
documents, that is, the frequency of the term over the whole database, and the frequency of the term in the document. We call this TF-IDF, the term frequency-inverse document frequency, and you can do this with scikit-learn using what we call the TfidfVectorizer. OK, so now I feel a bit more in my comfort zone: I've
06:39
gone from text, which I don't understand, to vectors of numbers. Feels better. So when we do this for all the documents, then we have a matrix,
06:51
right, a 2D array that gives us the terms in the documents. We call this the term-document matrix. It can be represented as a sparse matrix, because most of the terms are present in very few documents. So we can use the SciPy stack to work with sparse matrices, and the good news is that the scientific community, not even the scientific Python community,
07:16
has developed lots of fast operations for sparse matrices. So we're doing text mining with things that have been developed by people who solve partial differential equations and things like this. Then we can extract topics.
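A minimal sketch of this step with scikit-learn's `TfidfVectorizer`, assuming three made-up abstracts in place of the scraped ones. Note that scikit-learn applies its own smoothing and normalization by default, so the values are not the plain ratio described in the talk.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical abstracts standing in for the scraped conference pages.
abstracts = [
    "Python is a language for data science",
    "web development with Python and databases",
    "machine learning for data science with arrays",
]

vectorizer = TfidfVectorizer()
# Rows are documents, columns are terms; the result is a SciPy sparse matrix.
X = vectorizer.fit_transform(abstracts)

print(X.shape, sparse.issparse(X))

# vocabulary_ maps each term to its column index in X.
vocabulary = vectorizer.vocabulary_
# A term found in every other document gets a lower IDF than a rarer one.
print(vectorizer.idf_[vocabulary["python"]], vectorizer.idf_[vocabulary["web"]])
```

Only the non-zero entries are stored, which is what makes the term-document matrix tractable when the vocabulary is large.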
07:31
What we're going to do here is a matrix factorization: we take this term-document matrix and we factorize it into two matrices, one that gives the loadings of documents on various terms, and the other that gives the loadings of, no sorry, loadings of documents on what we're going
07:56
to call topics, and then loadings of topics on terms. So here the first matrix tells me
08:04
which documents are on the different topics, and the second matrix tells me which terms are on the different topics. So this is a matrix
08:12
factorization, so once again I'm back in things I know as a computer scientist. Often we do this with non-negative constraints in text mining, because the fact that a
08:26
term is negatively loaded on a topic might or might not mean something. We can do this in scikit-learn with sklearn.decomposition.NMF, for non-negative matrix factorization. That's where the magic happens. So we run this and we get word clouds.
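The factorization can be sketched on a toy matrix. The data, the number of topics and the solver settings below are assumptions chosen for illustration; the talk runs the same class on the real term-document matrix.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative term-document matrix: 4 documents (rows) x 5 terms (columns).
X = np.array([
    [3, 2, 0, 0, 0],
    [2, 3, 1, 0, 0],
    [0, 0, 1, 3, 2],
    [0, 0, 0, 2, 3],
], dtype=float)

# Two topics; a deterministic initialization keeps the sketch reproducible.
model = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)

W = model.fit_transform(X)  # loadings of documents on topics, shape (4, 2)
H = model.components_       # loadings of topics on terms, shape (2, 5)

# X is approximated by the product W @ H, with every entry of W and H >= 0.
print(np.round(W @ H, 1))
```

Because both factors are non-negative, each topic can be read directly as a weighted list of terms, which is what a word cloud displays.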
08:46
That's a representation of the first topic, and what is it about? It's about the Python language. Good news. The second
08:55
topic is about, well, science and machine learning, and
08:59
then the third topic is something like testing, and
09:03
then we can look at all the topics, and there's a bunch of different things: you have async, you've got a topic about the community, one about basically conference organization, Internet of Things, best practices, and one that is not shown here, which is talks in Spanish or Basque. As Python is not only a numerical language, we can also output a website from this using a template engine, and with a bit of work I think you can get a reasonably usable
09:35
website. It's on the web, you can have a look at it, and there's a link to the code that actually generates all this, so you can run it if you're interested.
09:45
You want to try it? OK, but it uses scikit-learn, and it complains at pip install that it wants a C compiler. Now you're starting to get angry at me, right? This goes back to the fact that we're different. Historically, we've had a lot of problems with people who don't have Fortran compilers. [unclear] Well,
10:13
Fortran has given us really, really fast libraries. Between a naive implementation of matrix
10:20
operations and a Fortran-optimized one, you can get a factor of 70 of difference. A factor of 70, that's something, right? So packaging has been historically a major roadblock for scientific Python, and the reason is that we
10:35
rely on a lot of compiled code and shared libraries. So we've been hitting problems like libraries that were not there, or ABI compatibility issues. But the good news is that there is a huge amount of progress, for two reasons. The first one is wheels, and specifically, recently, manylinux wheels, the idea being that you rely only on a conservative core set of libraries. That is basically solving it, so the problem I showed shouldn't happen; it should work. You can try and tell me. And the other reason is
11:10
that there is this thing called OpenBLAS, which is linear algebra not using Fortran, so that's good news. By the way, Fortran is a very modern language that is super performant, because it allows automatic vectorization that C cannot do, because it's got different semantics. So don't think that Fortran is something from the seventies. Well, it is, yes,
11:34
but it's different. But if we work together we can get really awesome things. So for instance, I hope that you can take this example to do text mining on any of your websites. It should be easy to do,
11:46
right? It's magic, but you can use it. All right, so now let me help you think a
11:57
bit more like a scientist,
11:59
and about how we code, what it's mostly about. So we
12:05
really love NumPy. NumPy, right, it's numerical Python: matrix operations, array operations. The reason we really love NumPy is because it's fast. Let's try, for instance, to compute the product of term frequencies and inverse document frequencies on 100 thousand terms. We can do this with a list comprehension, and it takes 6 milliseconds. 6 ms may not sound like a lot, but when I run, say, a non-negative matrix factorization algorithm, I do these things many, many times, and actually 100 thousand terms is not that big. So that actually matters. Now if we do
12:49
this with NumPy, the code is slightly different and we get 70 microseconds, so that's almost a factor of a hundred speed-up. Another
13:00
thing that we really like is that, once you're used to it, it's actually very much more readable. Array computing requires learning, but once you've learned it, it is extremely readable: compare tf * idf to the list comprehension.
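The two versions being compared can be written side by side. This is a sketch with random data of the size quoted in the talk; the variable names are assumptions and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
tf = rng.random(100_000)    # term frequencies (random stand-ins)
idf = rng.random(100_000)   # inverse document frequencies (random stand-ins)
tf_list, idf_list = tf.tolist(), idf.tolist()


def with_list_comprehension():
    # Pure Python: one interpreted multiplication per element.
    return [t * i for t, i in zip(tf_list, idf_list)]


def with_numpy():
    # NumPy: a single call that loops over the data in compiled code.
    return tf * idf


t_list = timeit.timeit(with_list_comprehension, number=20) / 20
t_numpy = timeit.timeit(with_numpy, number=20) / 20
print(f"list comprehension: {t_list * 1e3:.2f} ms, NumPy: {t_numpy * 1e6:.0f} us")
```

Both functions compute the same values; only the place where the loop runs changes.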
13:20
It's important to realize that arrays are actually, to us, nothing but pointers. What defines a NumPy array is a memory address, a data type, a shape and strides. The shape and strides are the things that tell you how you can move through the array, and basically you're moving through the array by pointer arithmetic: you move from one point to another by computing offsets. So NumPy represents regular data in a structured way.
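These four ingredients can be inspected directly on any array. The small array below is an arbitrary example, and the raw-buffer lookup is only there to make the offset arithmetic visible.

```python
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)

# Data type, shape and strides (in bytes) describe how to walk the buffer.
print(a.dtype, a.shape, a.strides)

# Byte offset of element (i, j) from the start of the buffer.
i, j = 2, 1
offset = i * a.strides[0] + j * a.strides[1]

# Read the same element from the raw bytes (copied out in C order).
flat = np.frombuffer(a.tobytes(), dtype=np.float64)
value = flat[offset // a.itemsize]
print(value, a[i, j])

# Transposing or slicing only changes shape and strides; no data is copied.
t = a.T
every_other_column = a[:, ::2]
print(t.strides, every_other_column.strides, np.shares_memory(a, t))
```

Because a view is just another (address, dtype, shape, strides) tuple over the same memory, it can be handed to compiled code without copying.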
13:59
This is really important because it matches the memory model of just about every numerical library, whether it's in C, C++, Fortran, or actually, I believe, other languages, most languages.
14:15
That enables copyless interactions across this compiled-language border. So for me the value of NumPy is really that it is a memory model. So let's look a bit at why it's faster
14:33
for such a computation as tf times idf. One thing is that you're not getting any type checking during the operation: first you go through all the dynamic typing, to know what tf times idf will do, but then it's compiled code that runs the operation. But then, maybe most importantly, you're using direct, regular, sequential memory access. You're just grabbing your data; there's no pointer dereferencing, well, there's one, but after you're done you're just grabbing chunks of data from the RAM or from the cache, and
15:12
that's really fast. And then your CPU or your math kernel library can implement things like vector operations, using for instance SIMD operations. So that's what
15:24
really makes NumPy fast. Bypassing the type checking is part of it, but it's not the only thing. Right, that's much faster than this, that's cool. Let's look
15:36
at what happens when the arrays get bigger: suddenly we get a factor of 2 increase in compute time per element. Do you have an idea what this may be due to? It's the cache. 10 to the 5 elements is approximately the size of the CPU cache; you can do the computation: these are probably float64, so that's 8 bytes each. The problem is that memory is much slower than the CPU, so your goal when you want fast calculations is to get things into the CPU as fast as possible, and here you're
16:18
starting to fall out of cache. So that's bad news for array computing, but there is even worse:
16:25
if we do a slightly more complex operation, 2 times a minus 1, then the cost per element starts increasing. So what's going on here? Well, if we look at what's happening, NumPy is computing 2 times a and creating an array that we don't see, which I'm going to call a temporary array, and then it's removing 1 from this temporary. So what we're doing here is that we're really moving things in and out of the cache hugely, so we have pretty bad cache utilization here. And this is because of the Python computing model, it's just the
17:07
way that it works. So we can time this, and we see that there is a
17:11
huge cost to removing this 1 in the computation. But we can play a trick, if we know how NumPy works, and do things slightly better by using an in-place operation for the second step. So the idea is that we're reusing the allocation of the temporary array; we're not allocating arrays twice. If we do this, it gets much, much faster, and the reason is that we've become much better with cache locality. So if we look at our
17:37
graph, we can add the in-place version. It's still going up with the number of elements,
17:42
but because it's an in-place operation, it's faster. So what we have
17:48
here is really a compilation problem: we want to go from this expression to this expression.
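The two expressions shown on the slide are not part of the transcript. A minimal sketch of what they plausibly look like, assuming NumPy and an array called a (the variable names and the array size are illustrative):

```python
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)

# Naive expression: conceptually NumPy first builds a hidden temporary
# holding 2 * a, then allocates another array for "temporary - 1".
# (Recent NumPy releases can sometimes avoid the extra allocation themselves.)
b = 2 * a - 1

# In-place variant: allocate one array, then reuse its memory
# for the subtraction instead of allocating a second array.
c = 2 * a   # one allocation
c -= 1      # in-place, reuses the buffer of c
```

Both versions produce the same values; only the number of large allocations, and therefore the traffic through the cache, differs.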
17:53
So we want to do things like removing or reducing temporaries, or we might want to do chunked operations, right? So if I can do for loops on the data in chunks of the right size, then it would be efficient. And numexpr, which is something that's mostly developed by Francesc Alted, can do this using string expressions. So that's an example: numexpr
18:19
evaluate, 2 times a minus 1, and
18:24
without us being clever, numexpr was clever for us: you get the speed-up. OK, so you get the same speed-up as the in-place version, right? Then there is Numba:
18:38
that is basically a just-in-time compiler, well, a compiler that does these kinds of things with
18:47
bytecode inspection. Another approach is a nice package called lazyarray, which basically builds an expression but doesn't evaluate it, and then evaluates it when you ask for the result. Basically, it's going around the Python evaluation model. And I'd like to
19:05
point out that this is actually not a problem that is specific to scientific computing. It's a similar problem to things like grouping and aggregating SQL
19:15
queries. I'm talking about things like the Django ORM, right?
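The chunked evaluation mentioned earlier can be imitated by hand. This is a minimal sketch assuming NumPy; the function name chunked_affine and the chunk size are made up for illustration, and it is not how numexpr is implemented, only the same idea of working on cache-sized blocks:

```python
import numpy as np

def chunked_affine(a, chunk_size=32_768):
    """Compute 2 * a - 1 block by block, writing straight into the output."""
    out = np.empty_like(a)
    for start in range(0, a.shape[0], chunk_size):
        block = a[start:start + chunk_size]      # a view, no copy
        target = out[start:start + chunk_size]   # a view into the result
        np.multiply(block, 2, out=target)        # write 2 * block directly
        np.subtract(target, 1, out=target)       # then subtract in place
    return out

a = np.arange(100_000, dtype=np.float64)
result = chunked_affine(a)
```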
19:21
To summarize, the kind of thing you could give to your CEO: if it's too small, you get overhead here, the overhead of Python, the overhead of the operation; if it's too big, you fall out of cache. The optimum lies in the middle. We probably want to be lying here, because that's where Big Data is, that's where the
19:41
magic is, the money. Some people want to take a picture, so I'll pause there. Right, let's look at what happens if we
19:54
need flow control. For instance, we
19:56
don't want to divide by a if a is 0. So I told you we don't use flow control, no for loops. What we're going to do is write an expression, and this expression is basically saying: where a is zero, and that returns an array of booleans, then I will set the result to 0. OK, so that's how we do flow control. So, suppose we're looking at ages in a population and I want to compute the mean age of males versus females. Then I can select in the age array with the gender array, and say: well, where gender is equal to
20:35
male, I'll compute the mean, and subtract the mean where gender is equal to female. Now, this is really starting to
20:47
look like a database, right? We're really starting to do selections. So, kind of growing in parallel to NumPy, there's a library called pandas that is really something in between arrays and an in-memory database. It's been hugely beneficial to the community because it's fantastic for these queries and this data massaging. For numerical algorithms, it's maybe less adequate, because anyhow we're falling
21:19
back to NumPy.
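The two selections described above can be written with boolean masks. A minimal sketch assuming NumPy; the array names a, b, age and gender and the toy values are illustrative, not taken from the slides:

```python
import numpy as np

# Flow control without a for loop: avoid dividing by zero.
a = np.array([1.0, 0.0, 4.0, 0.0])
b = np.array([2.0, 3.0, 8.0, 5.0])
ratio = np.zeros_like(b)
nonzero = a != 0                            # an array of booleans
ratio[nonzero] = b[nonzero] / a[nonzero]    # entries where a == 0 stay at 0

# Selection driven by a second array: mean age of males minus mean age of females.
age = np.array([31, 25, 47, 52, 38])
gender = np.array(["male", "female", "female", "male", "male"])
age_gap = age[gender == "male"].mean() - age[gender == "female"].mean()
```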
21:24
So what does this tell me? You're not really writing Python, right? You're writing a bit of beautiful Python code that sits on top of lots of Fortran and C++ routines. That gives you scalability, but it gives installation problems. But then I realized that most web development is actually some beautiful Python code that's sitting on services like a database, which could be in C++, in Java, in Erlang, in God knows what, in Node.js. And that actually gives deployment problems. We get compilation problems, you get deployment problems; they're
22:05
not that different, right? We're just struggling with similar things, instantiated in a different manner. So, these days I
22:13
like to think that NumPy is the scientist's equivalent to the Django ORM, sort of. What am I talking about? So, numerics, as we've seen, are
22:25
really efficient because we apply them to regularly spaced data. But NumPy array operations create cache misses for bigger arrays, so we need to fight to remove temporaries and maybe chunk the data. But if we do
22:39
queries, then they're going to be really efficient only if we can use indexes or trees. Typically databases do that. And we're going to need to group queries. So all of these are compilation problems, but
22:54
compilation is not very Python-like. So what can we do? For instance, we can think of a computation and query language. That's a bit what numexpr does, but I
23:05
really hate domain-specific languages. Each time I try to use SQL, because I'm not a web developer, I get it wrong and I get annoyed. And the other problem is that NumPy is actually extremely expressive: the things that you can do with NumPy or with related tools are extremely varied. So I don't think that's a
23:24
good way to go. And anyhow, I like Python and I want to be writing Python. So
23:27
one approach is to hack. A really cool example is Pony ORM. Pony ORM is web development, so you should know it better than I do. What Pony ORM does is that it compiles Python generators to optimized SQL queries. You can write something that looks like a Python generator, but it's going to do bytecode inspection, well, some form of inspection I believe, then grab the AST and build a SQL query on top of this, and optimize it by compilation and grouping. So that's really gross, it's really surprising, but it's really cool.
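The idea of turning a generator expression into a query can be sketched with the standard ast module. This toy is not Pony ORM's implementation: it parses a source string rather than inspecting the bytecode of a live generator, it only handles one loop with simple comparisons, and the names to_sql and Book are hypothetical:

```python
import ast

# Map a few Python comparison operators to SQL (illustration only).
SQL_OPS = {ast.Gt: ">", ast.Lt: "<", ast.Eq: "=", ast.GtE: ">=", ast.LtE: "<="}

def to_sql(source):
    """Translate '(x.col for x in Table if x.col <op> constant)' to SQL text."""
    gen = ast.parse(source, mode="eval").body
    if not isinstance(gen, ast.GeneratorExp) or len(gen.generators) != 1:
        raise ValueError("expected a single-loop generator expression")
    loop = gen.generators[0]
    alias, table = loop.target.id, loop.iter.id
    sql = f"SELECT {alias}.{gen.elt.attr} FROM {table} {alias}"
    conditions = []
    for test in loop.ifs:
        # Only 'alias.column <op> constant' is supported in this sketch.
        column = test.left.attr
        op = SQL_OPS[type(test.ops[0])]
        value = test.comparators[0].value
        conditions.append(f"{alias}.{column} {op} {value!r}")
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql

query = to_sql("(b.title for b in Book if b.price > 10)")
print(query)
```

The string formatting here does no escaping, so it is only suitable as an illustration of walking the syntax tree, not for building real queries.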
24:13
I'd like to draw your attention to something that's happening a lot in the big data
24:17
world, which is something that's known as Spark. It's a rising star, and it's built basically on top of the JVM, of the Java world. And it combines two things. People don't usually realize this, but it combines a distributed store, which is some form of
24:40
database-like store, and a computing model, and puts them together, and that allows it to do distributed computing in a reasonably efficient way. Now, the thing is that we in the Python world are actually much faster when the data fits in RAM, and the reason is that we're really representing data as regularly spaced arrays, and so we're going to stream through them, whereas
25:06
the Java world has a lot of references. So if we want to
25:14
scale up, maybe we're going to have to do operations on chunks, right? Maybe we need to chunk the data and then, maybe in parallel or in series, it doesn't matter, compute things on arrays that fit in RAM or in
25:26
cache. Now this is great for certain computing patterns,
25:31
things that are, for instance, known as extract-transform-load. But if you're doing multivariate statistics, which is what machine learning is about, you're really combining information from all over the data: you're reading "machine" and you're
25:44
reading "learning", but it's the interaction between the term "machine" and the term "learning": those two together make a topic. So the kind of compute graphs that you get are horrible, and it means that things like out-of-
26:01
core operations, which is basically what we're doing when we're chunking data, are not efficient: there is no data locality. So one approach is to do algorithm development, which is what I do, so I'm happy. The idea is that you use online algorithms: basically, you don't use the same algorithm, you use an algorithm that works on a stream, and so you start changing the algorithms. So if you've heard of
26:31
deep learning, then the number one algorithm that's used in deep learning is stochastic gradient descent, and that's how it works. That's how people can apply deep learning, which is extremely computationally expensive, to huge datasets. So, back to data
26:49
science.
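The online-algorithm idea mentioned above can be sketched as mini-batch stochastic gradient descent on a least-squares problem, where the model only ever sees one chunk at a time. This assumes NumPy; the synthetic data, the generator stream_of_chunks and the learning rate are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

def stream_of_chunks(n_chunks=200, chunk_size=50):
    """Yield (X, y) blocks one at a time, as if they were read from disk."""
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, 2))
        y = X @ true_w + 0.01 * rng.normal(size=chunk_size)
        yield X, y

w = np.zeros(2)
learning_rate = 0.05
for X, y in stream_of_chunks():
    # Gradient of the mean squared error computed on this chunk only.
    gradient = 2 * X.T @ (X @ w - y) / len(y)
    w -= learning_rate * gradient
```

At no point is the full dataset held in memory: each chunk is used for one update and then discarded.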
26:51
We have shown you how we can go from the term-document matrix to a factorization, and there's magic there, right? There is an algorithm, and I did not discuss how it works, which is imported from
27:05
scikit-learn. What the scikit-learn devs do is that they take horrible papers full of mathematical expressions and, drinking a lot of coffee, they turn them into this code. It's really hard, by the way.
27:19
People were asking me yesterday: so why do we still use code that was written 40 years ago, or 20 years ago, in Fortran? Because writing stable numerical code is extremely hard and
27:31
no better code has been written so far. So the reason that we, in scikit-learn and in NumPy, have been able to do this is
27:39
thanks to the high-level syntax of Python and everything I've presented here. So the reason all this is important is because it reduces our cognitive load and allows us to do this. All right, let's talk a
27:52
bit about something else than NumPy: let's talk about the future, and about what's going to make Python great again. So I think that we've
28:03
been seeing recently that data flow and computation flow are crucial. You can have, you know, the simple data-parallel problems, you can have the messy compute graphs, or you can have online algorithms. And so data-flow engines are actually
28:18
popping up everywhere. So, for instance, maybe you've heard of dask. Dask is a pure-Python static task-graph compiler: it will represent a set of function calls in a dict, as a graph, and compile it, and then use a dynamic scheduler on this to do parallel and distributed computing. So it's
28:43
really nice, except it's basically static, which means it's hard to add things to my graph. Another tool that people use in deep learning is Theano, and people probably don't realize that it does expression analysis internally, builds a graph of operations, and optimizes it. TensorFlow is a C++ library by Google
29:10
to do deep learning, and it also builds a graph of operations. So graphs of operations are there in many, many different libraries.
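A graph of operations stored in a plain dict can be sketched in a few lines. This is loosely modeled on the description above and uses only the standard library; it evaluates tasks sequentially and recursively, whereas a real engine such as dask schedules them dynamically and in parallel. The helper name get and the node names are illustrative:

```python
from operator import add, mul

# Each key maps either to a plain value or to a tuple (function, *arguments),
# where an argument that names another key is a dependency.
graph = {
    "x": 3,
    "y": 4,
    "double_x": (mul, "x", 2),
    "total": (add, "double_x", "y"),
}

def get(graph, key, cache=None):
    """Evaluate `key`, computing its dependencies first and caching results."""
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        values = [
            get(graph, arg, cache) if isinstance(arg, str) and arg in graph else arg
            for arg in args
        ]
        result = func(*values)
    else:
        result = task
    cache[key] = result
    return result

total = get(graph, "total")
```

Because the whole computation is data, it can be inspected, optimized, or handed to a different scheduler before anything runs.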
29:25
I believe that Python should really shine here, because it is reflective, which enables some form of metaprogramming, and because of the recent async developments, because I think the future is parallel and distributed computing. So, as
29:42
Nathaniel Smith, who is a NumPy developer, said: Python is the best numerical language because it's not a numerical language. And I believe this is
29:54
extremely true. But we have a bit of a problem here, which is that the API is really challenging, because we're doing algorithm design and we can't really do what you guys have been doing with something like Django, where there is basically an inversion of control and you're no longer writing imperative code as you would
30:15
do: you're buying into a framework. I still believe we can't write really complex algorithms like this: it is just too much cognitive overload.
30:25
But it's just an API design problem: we will solve it. So, in terms of
30:29
ingredients for our future data flows, I think distributed computation and runtime analysis are important things, and for this I think reflexivity is central. It's really useful for debugging. By the way, if you're not using Python, the number one thing you're missing is the ability to debug in a high-level way, which means I can debug things like numerical instability in my algorithm. That's really hard to do: you've got something that blows
30:56
up somewhere in terms of numerical precision. Python is fantastic for this. I can do
31:02
interactive work, which is how most data scientists work. This already enables us, and will enable us more, to do code analysis, which is going to be really important for being efficient. Then it gives us persistence, which is extremely important for parallel computing, because when
31:21
you're doing parallel and distributed computing you need to move data, you need to move objects around between different computers, and you need to move code. For this you need persistence. So
31:34
I realized that we've been relying on pickle. Distributed computing has been relying hugely on pickle. The idea is that it uses it to distribute the code and the data between the different workers, but we can also use it to serialize intermediate results. OK, so that's one way of doing computation on data where all the intermediate results might not fit in RAM. It can be done
32:02
very easily with Python. Another thing that we do is that we actually use pickle to get a deep hash, in the sense of a cryptographic hash, of any data structure. It's really nice because it allows you to see if things have changed or not, and so to avoid recomputation. But the problem is that pickle is actually very limited, the way it's implemented in the core library:
32:24
for instance, there's no support for lambdas. And these things are not fundamental limitations, they're tradeoffs basically. And so there are variants of pickle, like dill or cloudpickle, and I must say that I would really like one of those two, or maybe ideas from one of those two, to go in the standard library, because it's actually limiting hugely parallel computing not
32:54
to be able to pickle everything. I realize we're never going to be able to pickle absolutely everything, and I also realize that I can write code that always pickles, because that's what I do. But when I give this to a not very advanced user, he will at some point
33:08
write code that doesn't pickle. So for me, by the way, this is more important than the GIL. That may be surprising, but when you get to know distributed computing well, these things are the problem: data exchange, basically.
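Both points, the deep hash and the lambda limitation, can be sketched with the standard library alone. The helper deep_hash is a hypothetical name, and this naive version is only an illustration: objects that compare equal can still pickle to different bytes (for example dicts built in a different insertion order), which a production implementation has to handle.

```python
import hashlib
import pickle

def deep_hash(obj):
    """Hash an arbitrary picklable object through its pickle byte stream."""
    return hashlib.sha256(pickle.dumps(obj, protocol=4)).hexdigest()

config = {"alpha": 0.1, "layers": [64, 32], "name": "model-a"}
same_config = {"alpha": 0.1, "layers": [64, 32], "name": "model-a"}
changed = {"alpha": 0.2, "layers": [64, 32], "name": "model-a"}

unchanged = deep_hash(config) == deep_hash(same_config)   # nothing to recompute
differs = deep_hash(config) != deep_hash(changed)          # input changed

# The standard-library pickler stores functions by qualified name,
# so an anonymous lambda cannot be pickled.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_pickles = True
except (pickle.PicklingError, AttributeError):
    lambda_pickles = False
```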
33:23
We have a small library that we call joblib that gives us ingredients for data-flow computing. One thing it provides is a very simple parallel computing syntax, which is basically syntactic sugar for parallel for loops, and under the hood it uses threading or multiprocessing or just about any backend: you can plug your own backend in there. It does fast persistence: it's basically a subclass of pickle that does clever things for NumPy arrays, and it gives primitives for out-of-core computation. The reason I'm pointing this out is that it has a very non-invasive syntax and paradigm. So with a library like joblib we can write algorithms, and it's actually used inside scikit-learn, even though you may not know it. Well, it's also being
34:16
designed to be fast on NumPy arrays, and it's getting more and more of an extendable backend system. So I'm looking forward to a world where we can use things like Celery to basically distribute computation from scikit-learn in more of a web-development environment. I don't know if it's a good or a bad idea, but I'd
34:36
like to try. So I think the Python
34:42
VM is great, and one of the reasons it's great is because it's simple, which is what a lot of people have been criticizing it for. For instance, the Java world tells us that they have software transactional memory, and it's really cool, it would be nice for Python, but personally I really need to use foreign memory; for NumPy I need it. Interestingly, Java has recently gained a malloc, to allocate basically foreign memory. We'd like better garbage collection, we really would, but just about every C extension relies on reference counting, and the reason is that it's actually very easy to
35:26
manipulate the reference counting if you're not sitting in the VM, right? So basically the Python VM is something that I can manipulate without being inside it, which means that it's really great to connect to compiled languages. And talking to people at the conference, many people actually use this: many people use, through Python, libraries that have been developed in another language. I'd like to draw your
35:53
attention to Cython. Who knows
35:56
Cython? Good. Who uses Cython? Good. It really gives us the best of C and Python. You can add types for speed, and it does things so well that when you type NumPy arrays, it basically becomes a float*, as you would write in C, so it's super fast. But you can also use it to bind external libraries, and it's surprisingly easy. The
36:30
good thing is that suddenly you're working with C libraries, you're working with C-like code, without any malloc, free, or pointer arithmetic, which is for me the number one problem of these languages. So I see this as an adaptation
36:44
layer between C and the Python VM, and it's a really fantastic tool. By the way, I think everybody should be writing extensions using Cython, because it's an abstraction over the CPython library, the CPython API. So, for instance,
37:02
you can write code that's very readable and that compiles with Python 3 and Python 2, even though there have been a lot of changes in the CPython API. It's also a good idea for the Python developers, because they'd like to change things in the CPython API, and if everybody writes Cython they will be able to, because Cython will do the impedance matching. OK, so we need
37:28
scientists to work with web developers, and we really should love each other. I believe it, I'm really serious here. I really enjoy people
37:39
who are not doing science in the Python community. First, they teach me things. Second, they
37:45
make fantastic tools that I like using. And so I'd like our tools to be
37:50
useful for you. And I'd like to point out that scikit-learn is actually really easy machine learning. It's really a very simple syntax: basically you import an object, and it's the magic object that will do classification, so recognition of things; you instantiate it, and then you give it data. So it's
38:10
basically matrices, right? We only do matrices, and so you have to figure out how you convert your own data to matrices. Then you call fit, and then you call predict. For many people, one of the
38:25
successes of scikit-learn is this encapsulation: people have really loved the fact that the classifier is some black box, so they can use it without fully understanding it.
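The fit and predict convention described above can be illustrated with a toy estimator. The class below is a hypothetical nearest-centroid classifier written for this sketch with NumPy; it is not scikit-learn code, it only follows the same calling pattern:

```python
import numpy as np

class ToyCentroidClassifier:
    """Toy classifier exposing the fit / predict convention."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        # One mean vector (centroid) per class.
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Distance from every sample to every centroid: (n_samples, n_classes).
        distances = np.linalg.norm(
            X[:, None, :] - self.centroids_[None, :, :], axis=2
        )
        return self.classes_[distances.argmin(axis=1)]

X_train = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8]]
y_train = ["cat", "cat", "dog", "dog"]

classifier = ToyCentroidClassifier()
classifier.fit(X_train, y_train)
predictions = classifier.predict([[0.1, 0.3], [4.9, 5.2]])
```

The caller only needs matrices in and labels out; everything about how the model works stays behind the two method names.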
38:37
That's another thing that Python is giving us: object orientation with a really, really cool model that allows us to do object-oriented programming without crazy class diagrams. Another thing that we've used hugely is what people
38:58
call documentation-driven development (there was a talk about this), to try to make this API as simple as possible. What I'm trying to get at here is that we're trying to give you a higher-level, simple API to reduce your
39:14
cognitive load, just like Python and NumPy reduce our cognitive load when we're implementing these algorithms. So we all
39:25
have different things to bring here, and we can all benefit from each other, but we can do this only if we're really careful to reduce each other's cognitive load on what the other does not understand. I think it's extremely important. So it's important to
39:42
be didactic outside of one's own community, and actually Python is really good at this: the Django documentation is known as being really excellent, and Python worries about syntax being beautiful. To do this we need to do things like avoiding jargon. Machine learning is really bad at this, it's full of jargon; we in scikit-learn try not to have too much. We need to prioritize information. And so,
40:13
for instance, students that are applied-math students and learn about numerics, I hate to tell you, they don't care about unicode, even the French ones, who have accents.
40:25
One recommendation I have for people that do API design is: build your
40:34
documentation upon very simple examples, and examples that run. One thing that we do is that we have this thing called sphinx-gallery, which basically uses Sphinx and also builds our documentation by running all the examples. That means that the examples must run, and they must run fast, which means they must be small enough to run. I think this has helped a lot with the documentation, but also with the API design. All right, so
41:04
I think it's pretty much because of the interaction between people like scientists and people who are not scientists, whether they're web developers or DevOps or anything.
41:18
have I been censored other people
41:28
What was I saying? Well, anyhow, the Python language and its VM are the perfect tool to manipulate low-level concepts, whether it's arrays or things like trees and so on, with high-level wording. And I personally think, it's a personal opinion, that this has been key to the recent success of Python, contributing hugely. And when you look at how
41:53
people are using it, at some point they're very often pointing to something low-level. Dynamism and reflexivity are crucial because they enable metaprogramming and debugging, but we also find that we need, for
42:11
speed, compilation. So then there's this tension between dynamism and compilation, and I have the feeling it's everywhere: it's also in web development, with, say, compiling SQL queries. And I'm extremely excited about the PEPs that Victor Stinner is pushing forward, like the guards on internal
42:34
structures to allow checking at runtime for modifications, so that will allow any kind of hacks that we do on the code to be invalidated if the environment changes, or the PEP for function specialization. Finally, I
42:54
think that PyData has gained and will gain hugely from the database world and the concurrency that's been developed a lot in the web and DevOps world. But I think it can also give back other things, like knowledge engineering and AI, which are really growing hugely. And just in case you haven't noticed, data science is disrupting just about every job that you're doing. So it's cool that there is data science in Python, right? That's all I have. Thank you. Thank you very much for
43:53
the outbreak you know pretty insights and
43:55
different little different world soul questions raise your hands the like the might to wide off of things registered you know that 1 thing at a specific question is a statement that centered world was a very adaptive Python straight I think the they're just several years ago the most of the sentences that wasn't prices 3 which is a very thing entity can use pretty much any is good scientific pectin presence rate of something that in theory and in
44:34
So the biggest cost of Python 3 for us was the change of the CPython API, and so actually people still, in niche applications, have code that doesn't run on Python 3 because of the CPython API. But all the main libraries, by a vast margin, run on Python 3, and everything I do runs on 3 and 2.
44:59
Questions? OK, probably to get that out of the way, I would ask: what about PyPy? A bit of trolling there. What are your thoughts? Yeah, I know
45:10
a lot about this. So, to give a little background, my brother studies language theory, so we get crazy discussions all the time. So yeah, I know a lot about these things. Part of what I wanted to say in my talk was the fact that it's not only about type checking, it's about the memory model. I think, by the way, PyPy has progressed hugely in this sense, in that it is no longer trying to say "I'm going to control the memory for everything", which historically was a big roadblock for us. I mean, I could not believe that PyPy would be useful for scientific computing, because for a long time I heard that the angle of PyPy was things like software transactional memory, which is really cool, by the way, but will cost us a lot in NumPy. And the other thing is that we're not going to get rid of the compiled code, because there is so much history in making those algorithms really good, and it's extremely hard. But I do believe that what the PyPy world is doing, which is a lot of analysis of the code, is extremely, extremely useful. Thank you.
46:40
Any more questions? All right, in the back. So, you keep referring to "our world" and "your Python world". Is the division that clear?
47:03
Well, for me, no. For me it's all Python: I've got personal friends in all the communities, I use all kinds of different tools. But I'm afraid there is a division, and I'd like to think that it's fueled by the different tradeoffs. And I'd like to fight it, by the way: I don't want it, I don't think it's useful. But when you hear of things like conda, which is sort of a package manager for Python and other things, the reason it was created was basically
47:39
pip. The way I see it, the reason it was created was because the scientific crowd was unable to explain the struggles that it was having with pip and the packaging
47:52
tools in Python, and just went on and did their own stuff. Well, the good thing is that some people went and worked on this, and now pip, I believe, should be able to work fine. But that's one example of the division, and I think it exists, and I think we need to fight it, because of our values. I really believe in our values: the fact that we're diverse and we're able to work together. Yeah, great question.
48:22
those based on the use of the scenario In 5 7 years thank possible variants to other languages like that or the whole from more things new things 2 or more find moments so you talking in
48:40
in the scientific Python world, in the scientific world? Yes. R has an extremely strong community; I don't think R will die. To give you background: when we started scikit-learn, which was 7 years ago, everybody would walk up to us and say "you guys are crazy, everybody does their machine learning in R, everybody does Matlab". We're 7 years down the line and nobody mentions this anymore. So R, you know, as a language is a horrible language, but in terms of libraries, and I told you the numerical algorithms are really hard, well, R has a crazy amount of them, and for statistics R is the reference. But the value of data analysis is not only numerics, it's in combining things, and I think we have an edge here. So, Matlab: yeah, I think we're eating it slowly, and they're fighting back. I'm getting emails on a monthly basis inviting me to trainings to see how it works. But the fact that they're pouring money to fight this is telling me something. Maybe it's going to take a bit of time in the scientific world. A strong contender would be Julia: it's a typed language that is able to do type inference and compile extremely fast code. I really don't like it. I mean, it's a fantastic language, it's almost the best language, but I really don't like it because it's a numerical language. They don't think of it that way, but the whole community is a numerical community, and a numerical world is going to stay by itself. Thank you. Thanks for the fantastic talk,
50:37
fantastic library. scikit-learn is only one of the libraries in the scikit family: there is also scikit-image and others. What is your relationship with the scikit family? So that's very historical. We used to
50:52
have, that's like 2008, there used to be "scikits" with an "s", a namespace package. Do you guys use namespace packages? They're one of my nightmares. It was inside SciPy, and that's how we all started, and then we moved out of SciPy because SciPy was getting too big, and then we got rid of the namespace package. It used to be called scikits.learn, and it means scientific toolkit; it's very historical. But what are the relations? We're friends, good friends. OK, one last question.
51:40
on value so that but 1 that's sort of question and to point out that 1 specific thing about from the dead is beyond Python beyond this is where it comes struggle people come to struggle with known specific stuff so if you wanna database or a specific uh you wanna solace spectral nodes area as an part of the candidate can actually do that so it actually sits on top of 5 and not in is more like that together then then the of and so in this case I'm not I'm not really sure why should have something the center library that actually does that what was also I completely
52:31
agree so so the comment is come there's more than 5 and basically uh know it is by the way but historically it's not been marketed like this I mean I've heard to image but don't use that use common which is linear this in mind that by the way like like the uh and the other thing is I haven't seen much work go from come now to them not even talking about contributing back to bed but I'm talking about explaining what was being heard right it's extremely important I would really like I would like to call the for each from them is statement but I would like on forged forward 4 point to either died or to push a phonetically to pushing it it but I would be also but we mean 1 place where we can tell everybody go and get your stuff and we need this place to be good and we need to work together in a sense call has achieved this because I'm only has created as it's created in maybe an inside release it showing that you can do things better uh but you need to go all the way back and get get new back in the wider Python ecosystem of improvements because it's all going to benefit OK so long we
53:45
have one more thing to announce, so please don't run away after you've given a fantastic, enthusiastic round of applause for the speaker. Thank you.