The Painless Route in Python to Fast and Scalable Machine Learning
Formal Metadata
Title: The Painless Route in Python to Fast and Scalable Machine Learning
Number of Parts: 130
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/49927 (DOI)
Transcript: English (auto-generated)
00:06
So we're coming to the next talk. We have Victoria Fedotova. She is a leading software engineer at Intel. She works on optimization of data analytics and machine learning in the Intel oneAPI
00:23
Data Analytics Library. I think we will hear more about that. And then we have Frank Schlimbach. He is a software architect at Intel with an HPC background. The title of the talk is "The Painless Route in Python to Fast and Scalable Machine Learning".
00:42
So let's see if this is really painless. So hello, my name is Victoria Fedotova. In this talk, my colleague Frank Schlimbach and I will tell you how to create fast machine learning solutions that can scale from multiple cores on a single machine
01:02
to large clusters of workstations. Python is a very popular language for data analytics and machine learning, as you know; its superior productivity makes it the preferred tool for prototyping. However, as an interpreted language,
01:20
it has performance limitations that prevent it from being used in fields with high demands for compute performance, like production machine learning. If an organization wants to improve the performance of a machine learning solution, they often hire engineers to rewrite existing Python code into more performant C++ or Java code,
01:43
or even to re-implement the solution in a distributed fashion using MPI or Apache Spark. Also, many machine learning workloads are compute limited, meaning that the performance bottleneck
02:01
is in the compute part of the workload, not in the data loading. And for interpreted languages like Python, it is hard to achieve bare metal performance on such workloads. As a result, a typical data scientist analyzes
02:20
only a small portion of the data, the portion they think has the most potential of bringing great insights. This means they may miss out on insights from the remaining, bigger portion of the data, insights that may be crucial for the business. As a response to all those challenges, Intel created the Intel Distribution for Python.
02:43
The Intel Distribution for Python is a Python interpreter and a set of libraries for numerical and scientific computing. All of those packages are linked with high-performance libraries,
03:01
which provide close-to-native-code performance. As you might already have noticed, one of the ingredients of better performance is optimized libraries. Libraries such as NumPy and scikit-learn spend a lot of time in native computations.
03:22
This helps to significantly increase the performance of the machine learning solutions. Another key ingredient of good performance is just-in-time compilation.
03:41
Just-in-time compilation reduces the overheads that are introduced by the Python interpreter. Proper use of both of those ingredients helps you get the performance of Python code almost as good as that of native C++ code.
04:09
Data analytics and machine learning workloads usually consist of multiple steps. We usually start with data input, when we load the data from some data source,
04:20
like a file, a database, or some stream of data. After that, we pass the data to a processing stage where we prepare it for further use by a machine learning algorithm. Here, we can do some filtering, data transformations, feature extraction, and so on. After that, we train the model,
04:40
and with the trained model we usually do prediction or inference, to get insights from new, previously unseen data. Those stages can repeat in the process of the machine learning task, but all those stages have different demands
05:03
to perform effectively on modern hardware. Another source of complexity is the hardware itself. Each year we get new machines with more threads, wider vectors, and so on. The hardware becomes more sophisticated each year,
05:22
and even modern machine learning packages cannot effectively use all those features of modern hardware. So it is quite a complex task to have one machine learning
05:41
solution that implements all the stages of these data analytics and machine learning pipelines effectively and uses the hardware effectively. In this talk, we will talk about the Scalable Dataframe Compiler, the optimized scikit-learn, and the daal4py packages,
06:02
which help to optimize those end-to-end machine learning pipelines and bring performance to Python workloads. Let me now describe how Intel makes machine learning in Python faster with the Intel Data Analytics
06:21
Acceleration Library. DAAL is an acronym for Intel Data Analytics Acceleration Library. This library helps to optimize machine learning algorithmic building blocks by providing algorithms for all stages
06:43
of machine learning workloads, from data input to decision making. As you may already know, the default conda installation of scikit-learn already comes built against the Intel Math Kernel Library, which speeds up linear algebra operations
07:02
within machine learning algorithms. But to get the best performance, it is not enough just to optimize linear algebra, because for some algorithms, like decision trees or gradient boosting, there is little use of linear algebra. That's why we created DAAL.
07:23
When data comes to DAAL, it splits it into blocks and processes those blocks in parallel using the Intel Threading Building Blocks library. And of course, we use the Math Kernel Library for the mathematical parts of the algorithms,
07:41
like linear algebra, random number generation, vector algebra, and so on. All this leads to great performance. Here you can see a comparison of stock scikit-learn performance with Intel scikit-learn performance.
08:01
Both of these are compared to native code performance, which means here that 100% is the performance of the C++ DAAL algorithms. You can see that the Intel Python performance,
08:21
which is shown as blue bars, is almost everywhere greater than 80%, whereas scikit-learn installed from pip rarely reaches 5% of native code performance.
08:41
So let me now describe what you need to do to get all those impressive speed-ups on your system. First, you need to install the daal4py package from conda or from pip. After that, you have two options available. Either you can use the -m daal4py
09:02
command-line option to enable the optimization for all the scikit-learn algorithms available in daal4py; no code changes are needed here. Another option is to enable monkey patching
09:22
for the algorithms case by case, programmatically. For example, here we see this for the k-means algorithm, and that's it.
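As a rough sketch of those two options (the install commands, import path, and file name are assumptions based on how daal4py was distributed around the time of the talk, not taken from the slides):

    # install, e.g.:  conda install -c intel daal4py   (or: pip install daal4py)
    #
    # Option 1: accelerate all supported scikit-learn algorithms without code changes,
    # by launching the script through daal4py:
    #
    #   python -m daal4py my_sklearn_script.py
    #
    # Option 2: opt in per algorithm by importing the daal4py drop-in estimator
    # instead of the scikit-learn one (hypothetical script, same estimator API):
    import pandas as pd
    from daal4py.sklearn.cluster import KMeans  # used like sklearn.cluster.KMeans

    data = pd.read_csv("kmeans_dense.csv")      # file name is just an example
    model = KMeans(n_clusters=20, init="k-means++", max_iter=5).fit(data)
    print(model.cluster_centers_, model.labels_)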
09:45
On the left, you can see the list of algorithms that are currently supported in daal4py and are equivalent to the scikit-learn algorithms. This means that those algorithms produce numerically the same results as the scikit-learn algorithms. They pass all the scikit-learn compatibility tests, so you can use them and get the same results as with scikit-learn, but with better performance.
10:01
On the right, you can see the algorithms that have an API identical to the scikit-learn API, but which do not pass the compatibility tests. This doesn't mean those algorithms are incorrect; it is just hard for some algorithms to pass the compatibility tests.
10:22
For example, for random forest it is hard to build trees that have the same depths as the trees built by scikit-learn, due to the random nature of the algorithm. But we constantly work on increasing the number of algorithms on the left,
10:41
either by adding new algorithms or by moving algorithms from the right part to the left, making them pass the compatibility tests, which is actually the trickiest part for us. Now, let's talk about how to scale a machine learning solution to multiple nodes, to a cluster.
11:05
daal4py is a convenient Python API to Intel DAAL, and it contains a variety of algorithms that have close-to-native-code performance. Part of these algorithms have the ability to work in a cluster environment;
11:24
they have distributed implementations. For distributed execution, we use communication-avoiding algorithms, which spend most of the time in computation on the nodes, not in data transfer.
11:42
We try to design the algorithms this way, and this results in good scaling. For the transport layer, we use the MPI library. Both the DAAL and daal4py packages are open source,
12:02
and they are available on GitHub; you can take a look or even contribute. So, let's see what the daal4py API looks like. Here is the single-node API, in comparison with scikit-learn.
12:22
For example, to implement the k-means clustering algorithm in scikit-learn, you need to do some imports: import KMeans, and import pandas, which we will use here for data loading. We read the data from a comma-separated file. After that, we need to create a KMeans algorithm object
12:42
with 20 clusters. We use k-means++ as the initialization procedure and we set the maximum number of iterations to five. After that, we do the actual clustering, and the results, the cluster centers and labels, will be available
13:02
as attributes of the result object. Here is the daal4py API that does exactly the same and produces the same results. Here we also need to do some imports and we also use pandas in the same way. The difference is that for k-means in daal4py,
13:24
k-means is split into two algorithms. First, we need to run the k-means init algorithm to compute the initial centers of the clusters. We also use the k-means++ method here.
13:41
After that, we create a k-means algorithm with 20 clusters and five iterations and set the assign flag to true. This means that we also compute labels, or assignments, for our input samples. Then we run the compute method
14:02
with our input data and the initial cluster centers from the initialization algorithm. We get the same results as with scikit-learn, in the attributes of the result object.
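As a rough sketch (not verbatim from the slides; the parameter and attribute names are my reading of the public scikit-learn and daal4py APIs, so treat the details as assumptions), the two versions look roughly like this:

    # scikit-learn version
    import pandas as pd
    from sklearn.cluster import KMeans

    data = pd.read_csv("kmeans_dense.csv")
    kmeans = KMeans(n_clusters=20, init="k-means++", max_iter=5).fit(data)
    centers, labels = kmeans.cluster_centers_, kmeans.labels_

    # daal4py version: k-means is split into an init algorithm and the clustering itself
    import daal4py as d4p

    init = d4p.kmeans_init(nClusters=20, method="plusPlusDense").compute(data)
    result = d4p.kmeans(nClusters=20, maxIterations=5, assignFlag=True).compute(
        data, init.centroids)
    centers, labels = result.centroids, result.assignments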
14:23
You can see that the daal4py API is almost as simple as scikit-learn's, maybe a little more verbose, but not overly verbose. Now let's see how this code changes if you need to do multi-node computations, what will change in this case.
14:43
Here is the comparison. I highlighted in yellow the differences between the single-node and the multi-node code. You need to do some additional imports, which you will need during multi-node computation. First, you need to initialize
15:02
the distributed environment. After that, in this particular example, we use different file names on different nodes: on the first node we have the kmeans dense 1 CSV file, on the second node kmeans dense 2, et cetera.
15:23
So we use the node ID to get the name of the file on each node. And finally, we need to provide an additional parameter to all the algorithms, which says distributed equals true, and that's it.
15:42
These are all the changes you need to make to implement the distributed machine learning algorithm. The command line to run this script changes as well: we use the mpirun utility here to run it on four cluster nodes.
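A minimal sketch of the multi-node version (helper names such as daalinit, my_procid and daalfini, and the file-name pattern, are assumptions on my part about daal4py's SPMD helpers; the per-node options are trimmed to the clustering itself):

    import pandas as pd
    import daal4py as d4p

    d4p.daalinit()  # initialize the distributed (MPI-based) environment

    # each node reads its own part, e.g. kmeans_dense_1.csv, kmeans_dense_2.csv, ...
    data = pd.read_csv("kmeans_dense_{}.csv".format(d4p.my_procid() + 1))

    # same algorithms as before, with distributed=True added
    init = d4p.kmeans_init(nClusters=20, method="plusPlusDense",
                           distributed=True).compute(data)
    result = d4p.kmeans(nClusters=20, maxIterations=5,
                        distributed=True).compute(data, init.centroids)

    d4p.daalfini()  # shut down the distributed environment

    # launched with the mpirun utility, e.g.:
    #   mpirun -n 4 python kmeans_distributed.py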
16:01
After that, we just run our script and it will run the computation on multiple nodes in the cluster. Now, let's dive a bit deeper into the daal4py implementation.
16:23
In the background, you see C++ code that runs a distributed principal component analysis algorithm using the DAAL C++ API and MPI as the transport layer. This is one of the shortest examples.
16:45
The k-means example is about twice as long. As you can see, we couldn't just port this API to Python as is; it is not very usable or user friendly.
17:03
So we use a semi-automatic process to generate the daal4py API for distributed and for single-node algorithms. First, we parse the DAAL C++ headers with our in-house developed parser to locate the different classes, functions,
17:23
enumerations, and other objects, and we generate Cython code from this C++ API. After that, we use Jinja2 to generate Python classes for all of them, which are very different from the C++ classes.
17:45
This process has two main advantages. First, it simplifies the API, so we get a simple, not very verbose Python API. The second advantage is that when we add a new algorithm
18:03
to DAAL, or some new feature, we get the API for that feature for free, just by having this API generation process. It is a cool thing. And... (Moderator: you are running out of time.)
18:24
Another thing that helps daal4py achieve great performance is effective data transfer across language boundaries. Python data types store data differently. For example, for a NumPy ndarray we use what we call a homogeneous numeric table to represent it.
18:42
In a pandas DataFrame, columns can have different data types, and usually each column is represented as a contiguous array in memory; this representation is called structure of arrays.
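To illustrate the two layouts she mentions (a toy sketch, not DAAL code):

    import numpy as np
    import pandas as pd

    # NumPy ndarray: one contiguous, single-dtype block of memory;
    # this maps naturally onto DAAL's homogeneous numeric table
    X = np.random.rand(1000, 3)

    # pandas DataFrame: each column is its own contiguous array and may have
    # its own dtype -- a "structure of arrays" layout
    df = pd.DataFrame({
        "a": np.random.rand(1000),                     # float64
        "b": np.random.randint(0, 10, size=1000),      # int64
        "c": np.random.rand(1000).astype(np.float32),  # float32
    })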
19:01
daal4py supports various basic data types and a number of representation formats to avoid data copies and perform optimally with various data layouts. And finally, to the performance. Here you can see the scaling of the k-means clustering algorithm from one up to 32 cluster nodes.
19:21
Hard, or strong, scaling means that the size of the task stays fixed as the number of nodes increases. The orange bars show hard scaling for the k-means algorithm on a 35-gigabyte dataset.
19:41
You can see that on two nodes the computation runs twice as fast as on one node, on four nodes four times as fast, and so on. The time to process the data decreases as the number of nodes increases, which shows good hard scaling.
20:01
Weak scaling measures the performance of a workload when the task size is fixed per node. It means that on one node we process 35 gigabytes of data here, on two nodes 70 gigabytes, and on 32 nodes more than one terabyte of data. daal4py shows close-to-ideal weak scaling
20:21
with this algorithm, as you can see in the yellow bars. The runtime stays the same as the size of the data increases, and this is due to the communication-avoiding design of our algorithms. That's it about the compute part
20:40
of machine learning workloads. I hand over to Frank. — Yeah, thanks, Victoria. So, the actual machine learning algorithm is often only a small part of the entire data science application.
21:03
In many cases the pre-processing, that is, data input, reading the data and doing data cleaning, data manipulation, transformation, et cetera, to make it ready for the actual algorithm, takes most of the time. That can easily happen.
21:20
So it would be good to also improve the performance of these pre-processing steps. In most cases, people use pandas. Pandas has a really very powerful and nice API. The problem is that pandas is not necessarily optimized for speed; for example, it uses only a single core.
21:42
It leaves the many cores and vector units that basically all computers today have unused. So there's a lot of hardware that goes unused if you're running pandas, even though it would be entirely available. The other observation is that the pre-processing code
22:01
is usually not just one or two lines of code; it's actually a larger chunk of code. And if you have a larger chunk of code, there's usually great potential for optimization, if you can do optimizations across different instructions, for example loop fusion, things like that.
22:21
And because pandas is actually inherently data parallel, that is something we want to utilize. That's why we decided to write a compiler, and it's actually a just-in-time compiler. What we do is simply extend Numba. Numba is an existing just-in-time compiler provided by Anaconda. It's a domain-specific compiler.
22:42
It focuses on NumPy and numeric code around that. What we do is extend Numba so that it also understands pandas and code around pandas, and by that we get very good performance. The beauty of Numba and its approach
23:00
is actually that you don't need to leave Python. It's a pure Python package, and all you need to do is annotate your function with a function decorator, the njit decorator. You don't need to leave Python, learn a different language, or compile ahead of time; everything stays within Python.
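As a minimal Numba sketch (my own example, not from the slides):

    import numpy as np
    from numba import njit

    @njit  # compiled to native code on the first call, reused afterwards
    def running_sum(a):
        out = np.empty_like(a)
        s = 0.0
        for i in range(a.size):   # a plain Python loop, compiled to a fast native loop
            s += a[i]
            out[i] = s
        return out

    running_sum(np.ones(1_000_000))  # first call compiles, later calls run the cached binary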
23:22
If you have a function that you want to compile, you annotate it with the decorator. When the decorator is applied, its argument is of course the function. So the jit machinery that is provided by Numba
23:40
can now decide whether it wants to call that function directly, or whether it enters the compilation pipeline, creates a native binary and then calls that instead. And of course, hopefully, if you compile it to native, that will be much faster. Here's a very simplified view of this compilation pipeline.
24:02
It's really, really simplified, but just to give you an idea of what needs to be done: there are two things that are very important to get from Python performance to native performance. The first one is type analysis. Python is a super flexible language; the typing is mostly not even explicit, and it's dynamic.
24:23
But if you want to compile something to native, of course you need to map that to native types in order to, for example, use vector registers and all that. So we need to do this type analysis, and as you can imagine, that can be pretty tedious and pretty challenging. Here is where the domain-specific approach
24:42
actually comes into play, because Numba natively understands NumPy: it knows how to treat all the NumPy data types. Now we are adding pandas support, so Numba also understands what pandas does and all the data types in there. It's a domain-specific optimization, and the same is true for the parallelism analysis.
25:03
You know that pandas operations are usually data parallel. You could write the parallelism yourself, but because we know from the domain how things work, we can do that for you; we can automatically extract the parallelism. That is part of our pipeline: we add the parallelism within the code
25:24
that we generate in the efficient binary, so you don't need to do that yourself. This is not claiming that the parallelization can be done in a generic way; it's really domain specific for NumPy and pandas. As an example, I just want to show
25:43
two optimizations that we can do in the compiler that cannot be done easily with other packages, or if you have just a library. Here's an example that is really simple. You see this function that is annotated with the njit decorator. All it does is read a file.
26:00
In that file you have data about employees: the employee's first name, for example, and the bonus, and maybe there are 10 other attributes. Once that's read in, we compute the mean of the bonuses, sort the first names and return the two series. It's really a toy example, so we don't need to talk much about it.
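A sketch of such a function might look like this (the file, the column names, the import name, and the exact subset of the pandas API that SDC supports are assumptions on my part; the idea is simply that an ordinary @njit-decorated function can use pandas once SDC is installed):

    import pandas as pd
    from numba import njit
    import sdc  # noqa: F401 -- assumption: importing SDC registers its pandas support in Numba

    @njit
    def analyze_employees():
        # SDC reads the file in parallel chunks and can infer that only these
        # two columns are actually used (no explicit usecols needed)
        df = pd.read_csv("employees.csv")
        bonus_mean = pd.Series(df["Bonus"]).mean()
        sorted_names = pd.Series(df["First Name"]).sort_values()
        return bonus_mean, sorted_names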
26:21
The two things I want to stress here is, first, the read CSV. So if you look at the read CSV, and what pandas, the default implementation does is, it simply takes the entire file, reads it from start to end in a serial manner. So there's no parallel in the minute. While our implementation that is generated by SDC,
26:43
divides the entire file into multiple chunks, and then each thread does the reading of its chunk. So we have a fully parallel read even for read_csv, and that of course brings quite a performance boost. And of course, this parallelism
27:01
is not only applied to read_csv, but also to things like mean and sort_values. All these operations on data frames and series are parallelized by SDC. The second interesting thing here is a memory optimization. If you look at the comments, there is one argument
27:21
that you could pass to read_csv, usecols, which means that the listed columns are the only ones you're interested in. If you provide that to read_csv, you will read in only these two columns. With SDC, you don't need to do that, because SDC can do the analysis of your function body
27:41
and see that this function actually only uses the bonus and the first name. So it can do that for you automatically; you don't have to. It basically takes away some of the optimization burden from you, because it can do the analysis, and because it knows the domain, it knows what to do. These are just two examples; of course, there are many more optimizations
28:01
that are applied, like loop fusion and things like that. But this just gives you an impression of what is possible if you use a compiler compared to just using a library or a package that is optimized under the hood. Of course, there are also caveats or limitations to an approach like this. The most fundamental one here
28:21
is that we're doing a static compilation: even if it happens at runtime, at some point you have to fix the code in some state and then compile it. So it's a static compilation in that sense, and the code needs to be type stable. Type stable means that from
28:42
the set of input types, which are deduced from the input arguments, all the other types in your function, like the variables and other functions, et cetera, need to be statically determinable. That is what type stable means.
29:02
Basically, Python allows you to write expressions where that's not the case, and you cannot use those. Those expressions usually look like the example below, where you have a condition that is not a constant expression, and then the if and the else branches
29:22
result in different types. That applies to functions as well as variables, but in our domain with data frames it also applies to the data frame schema: the column names need to be statically determinable as well. That is the core limitation.
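A small illustration of code that is not type stable (my own example, not from the slides):

    from numba import njit

    @njit
    def not_type_stable(flag):
        # the type of x depends on a value known only at runtime, so it cannot
        # be deduced statically -- Numba/SDC will reject this with a typing error
        if flag:
            x = 1        # int
        else:
            x = "one"    # str
        return x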
29:41
It's conceivable to do it differently, but if you require this to be static, we can do certain optimizations which we couldn't do otherwise. In most cases, this is nothing that data science applications actually do, but in some cases they do. In some of those cases you can work around it by just changing the code, but of course that's not always possible.
30:01
So this is really a fundamental limitation, but hopefully in most cases you can work around it. The other thing is, of course, that compilation takes time; it doesn't come for free. And you can't always assume that the compilation takes less time than what you gain by the compilation.
30:22
The first time Numba kicks in, it takes the code, compiles it, and then runs it, so there is some overhead for the compilation. But it caches the result of the compilation, so the native function will be cached. The next time you call it, it will not compile it again; it will use the cached binary and run that.
30:42
So before you annotate your function with the jit decorator, you should think a little about whether it actually makes sense. If you have a very small function with no data parallelism, no computation in it, and it's called only once, it doesn't make sense to jit it. But if you have a function that has parallelism in it,
31:00
has a lot of computation, and is called, for example, in a loop, that's your target. That's your candidate, and you will get very nice performance just out of the box. The other thing to think about, which is not that obvious, is that currently the compile time of SDC depends on the number of columns.
31:20
If you have very wide data frames, SDC might take a very long time to compile. We have one example where SDC actually takes 10 minutes to do the compilation. In that particular case, we were still able to reduce the total runtime from hours to tens of minutes, but that is of course nothing you can assume in general.
31:41
So think before you apply the jit decorator, but if it's a good candidate, it basically gives you the performance for free. Let's look very quickly at some performance data just to prove that it actually works. Here is read_csv; we're just reading a CSV file,
32:02
and on the x-axis you see different numbers of threads, from one to 56, and on the y-axis you see the speed-up of the SDC-compiled pandas over stock pandas, the non-compiled pandas. You see that with one thread it's actually slower,
32:20
and that's what I just mentioned: we have compile time, so that makes it slower. But as soon as you have more than one thread, it's faster. With four threads it's already two and a half times faster, and you get up to more than 10x. You cannot expect something like a linear speed-up here, of course, because we're doing disk I/O and so we are limited by the disk speed.
32:42
So this is already a very, very good number. This slide provides two charts comparing operations on entire data frames, that is, a data frame with multiple series, multiple columns, in it. Again, we're showing the speed-up of our SDC-compiled operations
33:01
compared to stock pandas on multiple threads. On the left-hand side, we're doing that on the entire data frame: we're counting the number of rows, we're dropping rows, and then we are computing the max for each column. You see the speed-up is quite impressive.
33:22
For the max, for example, which has more computation than the other two, we get more than 100x. On the right-hand side, we do a similar thing: we compute the quantiles, the mean and the standard deviation, but this time not over the entire columns but over rolling windows.
33:42
This is a problem that is much harder to parallelize, but you see the performance is still very nice and it scales, in particular if you have operations that are a little more compute intensive, like the standard deviation. This is the last performance slide here,
34:01
and we do similar comparisons, but on a single series, not the entire data frame. Again, on the left-hand side we do that for the entire series; on the right-hand side we do it with rolling windows. Again, we see nice speed-ups and in some cases good scalability.
34:20
Let me just highlight the apply case, the green one on the left-hand side. We get more than 400x speed-up, and that is because when you do an apply with a lambda, a compiler can compile not only the apply into native code, but also the lambda. A library or package, even if it's very efficient
34:41
in its implementation, would always need to go back to Python, call that lambda, and come back. We can eliminate all of that Python overhead, and that's why we get this fantastic speed-up of more than 400x with a simple apply. Here's just a quick summary of the functionality that is supported right now in SDC.
35:01
It's an evolving project, of course not finished, and the pandas API is super wide, so there are many, many functions in it. Of course, in our initial version we can't support everything. We already support more than 170 functions, for example the statistical operations that we have seen in the charts,
35:20
but we also do relational operations like filter, group-by, join, things like that. We saw the rolling windows. And let me just add something on the data side of things: we not only support the data containers, like data frames and series, but we also added support for dates and strings, which is something that Numba did not support before.
35:42
That is a major source of performance optimization. If you have code that uses strings or dates or times and you compile it with SDC, you get a super speed-up. We had some cases with 1,000x or more, just because we are optimizing how strings and dates are handled.
36:03
This is a slide about the current status of SDC. It's an open-source project, so it's not closed, it's free; you don't have to pay anything for it. You can get it from GitHub and compile it yourself, or we also provide binary packages.
36:21
We provide conda packages and pip packages, pip wheels, whatever you like more. SDC will be part of our new oneAPI product that will be released at the end of this year. This will not change anything: it will still be free, it will still be open. It just means that, if you want,
36:42
you can get support for it. Until then it will be in beta; in product terms, that's beta quality. Last but not least, you see this second link here, which is the documentation for SDC. The documentation not only lists all the things that SDC can do,
37:02
but, probably more importantly, for the pandas things that it supports it also lists what it does not support. So if there's some argument that it does not support yet, you will see that. What's next? We are working on a few things, of course adding more pandas features,
37:21
but more interestingly, we want to extend the parallelism to not only scale up within a node, but also scale out across nodes, so scale it out to a cluster. Again, because we know what you're working on, it's domain specific, we can do all this automatically for you. We can scale that, do all the communication,
37:42
all of that internally. We know how to do this; the user doesn't need to do anything, just run mpirun, and the rest of the code is identical. The other thing we are currently actively working on is running things on a GPU, so that the jit machinery can generate code not only for the CPU
38:00
but also for the GPU. If you have done something like that in Python already, you might know what you see on the left: that's what you usually do with CUDA, for example, this CUDA-style kernel launch programming that we think is too low level for a Pythonista. We are aiming for much more: we want to apply the same style
38:22
of jit programming that we have right now for the CPU also for the GPU. So there is no change in the function that you want to run on the GPU; only when you execute it, when you call it, you call it in a so-called device context. That, in a nutshell, is what we are trying to do. We are quite far along with that,
38:41
but it's not yet ready. Certain things belong to that, like two more packages: one is a NumPy equivalent for the GPU and the other does this device-context handling. That's all in preparation, so maybe next year we'll be back talking about that. That basically concludes our presentation. We hope it was interesting to you, and if it is, here are some links, check them out.
39:02
And one last word: I want to thank the organizers for the really incredibly good preparation of this conference. It's fantastic. Thank you so much.
39:21
Thank you very much, Victoria and Frank. You convinced me it's really painless. It's really nice work, and I like it when things go faster. We do have some questions; the first two can be combined. The question is:
39:40
do you need any specific Intel hardware to use the Intel Data Analytics Acceleration Library? And the second part is, good question, are there any benefits on AMD processors, or is it strictly Intel?
40:01
You can assume that everything also works on AMD. Of course it will work best on Intel; that's clear, we are Intel, so we optimize for our hardware. But you can also assume that it will perform really well on AMD. We are not supporting something like ARM, for example.
40:21
But, for example, for the SDC part: SDC has different back ends and we were mostly working on the front end. So, for example for the GPU work that we're doing, it should be fairly simple for NVIDIA or someone else to write a CUDA back end for the same thing, if that's wanted.
40:41
Yeah, and a question from Antus. — Okay, sorry, I wanted to add: not about Python, but on the C++ side, additionally to DAAL, the classical product, we currently have the oneAPI oneDAL product that can also run on GPUs,
41:00
on integrated GPUs or any Intel GPUs. And I think going further we are going to bring this to Python as well. — That's great. I fear we are running out of time, but the question from Matus: how closely does the Intel Python distribution
41:22
follow the release cycles of pandas, scikit-learn, et cetera? — We have a very close collaboration with Inria, so with scikit-learn we are usually very, very close. When they have a new release, even during their release cycle,
41:41
we are checking against our patches and our optimizations, so it should be very, very close; we are trying to be as close as possible. Pandas is usually a little further behind, but you can expect that we are no more than maybe a quarter behind. — There are more questions.
42:01
We have to take them to the Discord chat, so it's best to continue there. You will find the chat by pressing Ctrl+K, and I think it's under "scalable machine learning". I'm sure you will find it.
42:20
So let's continue in Discord. We even have three questions here, but time's up. Now it's time for the coffee break, the virtual coffee break, time to get some coffee. Thank you very much again, Victoria and Frank. It's really great work; please continue it. Thank you. Thank you.