Scaling scikit-learn: introducing new sets of computational routines
Formal metadata

Title | Scaling scikit-learn: introducing new sets of computational routines
Series | EuroPython 2022 (talk 85 of 112)
License | CC Attribution - NonCommercial - ShareAlike 4.0 International: You may use, modify, copy, distribute, and make the work or its content publicly available for any legal, non-commercial purpose, provided you credit the author/rights holder in the way they specify, and pass the work or content on, including in modified form, only under the terms of this license.
Identifiers | 10.5446/60850 (DOI)
Language | English
Transcript: English (automatically generated)
00:06
So today, I will go through some work that we are currently doing for scikit-learn. It's mainly about the introduction of new computational routines for the library.
00:21
I will rapidly go through the context of the library, and then I will describe a pattern which is used in many algorithms, which we have improved, and which we would like to port to other architectures. So before getting started, just a few words
00:43
about me and about Inria. I'm a research engineer in the SODA team, which launched the project a bit more than 10 years ago. I've been working there for a bit more than a year, and I became a scikit-learn maintainer in October 2021. My work mainly focuses on improving the library
01:03
performance and on reviewing contributions. But there are many other aspects, and we are looking for people to help, especially for reviewing pull requests. As for Inria, Inria is the French National Institute for Research in Digital Science and Technology. It's historically important for the country,
01:24
and it ranges from academic research to technological startup creation. Many other open source libraries used for scientific computing are developed there. If you are interested in knowing more about those, feel free to have a chat after the talk.
01:44
When it comes to scikit-learn, I don't know if we need to present the library once again. But in a few points: scikit-learn has become the reference for simple, efficient, and easy-to-use implementations of data analysis and machine learning algorithms.
02:02
It brought the famous estimator fit/predict API and came with nicely illustrated documentation. It's the reference for many products in industry and in academia. We also organize online and on-site sprints.
02:22
There's a sprint this afternoon; some people from the team will be participating. So if you're new to open source, or if you would just like to contribute to the library, feel free to come, and we and the other maintainers of the library will help you. These kinds of features
02:42
made it one of the most popular packages for Python. Yet there are things that can be improved, and it's mainly about improving what already exists. We don't want shiny new features in the library; we want to keep it simple and efficient. In particular, we can improve it on many points,
03:02
such as documentation, community engagement, and usability, so that it's even more user-friendly, and performance. As for performance, there is a lot of work going on; I'm only going to focus on one pattern so as to get into technical details. If we look at performance in applications,
03:24
people may be using gradient boosting methods, because those are performant. This was mainly driven by libraries such as LightGBM, XGBoost and CatBoost. scikit-learn also implements
03:40
histogram-based gradient boosting. And if we compare the performance of scikit-learn against those libraries, it's relatively similar. Here is a graph we've produced, aggregating validation scores over many sets of hyperparameters and random datasets; those results are available online and
04:01
can be reproduced if you want. But the point here is that it's not just those kinds of methods; there are many other ones, starting from really simple ones, such as the k-neighbors classifier. And if we compare our implementation to, for instance, Intel's oneAPI implementation, in scikit-learn 1.0 and previous versions,
04:24
those implementations were not the most optimal ones, so there was a performance gap that we could work on. And actually, if we look at this classifier here, most of the computations rely on two sets of methods, which are the k-neighbors search and the radius neighbors search.
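Both routines are exposed through scikit-learn's public `NearestNeighbors` API; a minimal sketch of how they are used (the toy data here is made up for illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy data: 200 points in 8 dimensions.
rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 8))

nn = NearestNeighbors(n_neighbors=3, algorithm="brute").fit(Y)

# k-neighbors search: the 3 closest points in Y for each query.
dist, ind = nn.kneighbors(Y[:5])

# radius neighbors search: every point of Y within radius 2.0 of each query.
r_dist, r_ind = nn.radius_neighbors(Y[:5], radius=2.0)
```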
04:42
But it's not just the case for this algorithm; it's the case for many other ones in the library, such as these, among others. So focusing on improving the performance of those two routines would improve the general performance of the library. The focus of this talk is to go through
05:01
how we can speed up those two routines. So, just a bit of formalism for the nearest neighbors search, without going into all the details. What you need is a distance metric, which is a function taking two vectors of a given dimension, with some other properties
05:23
that I'm not going to cover. What you generally work with is a vector x, called a query, whose neighbors you would like to find in a dataset, referred to here as Y. And you can consider the distance metric
05:41
d(x, y), defined like this. So the goal, as said, is to find the closest vectors to x in Y, and there are generally two formulations: one for the k-neighbors search, where you find the k smallest values using what we call an argkmin operation
06:01
on the matrix, or you can find the closest neighbors within a ball of radius r, as shown here. There are generally two kinds of solutions to this problem. The first kind relies on binary tree data structures, such as ball trees and k-d trees. Here is the complexity,
06:22
which is theoretically provable. The problem, in fact, is that in many cases, for many reasons, including backtracking, which I'm not going to cover, it degrades towards this complexity as p grows.
06:41
So it's not even the best option, in fact, when we have big, high-dimensional datasets; it's only best for small values of p, roughly in this range. In practice, people are using much bigger datasets, so we need to consider alternatives. The alternative is just to use brute force,
07:00
which is supposedly worse in terms of complexity, but in practice it can be way more performant, because technical optimizations can be used. Here we are only going to cover this case, which will give you an overview of the work that we are doing.
07:22
When it comes to brute-force k-neighbors search, one can see the problem like this. You consider your vector x and your matrix Y here, and you can compute the distances. In an ideal world, what you can do to get the neighbors you are looking for
07:42
is to perform what is called a reduction on this matrix. In this case, a step of the reduction can be as follows: you partition the matrix around a pivot, which we take here as the kth smallest value of this array. This allows you to extract the k smallest values,
08:04
and then you can sort them, as well as the indices, so that you have all the information about the neighbors of your vector, and then you're done: you have the neighborhood of your vector x. In Python, this can be implemented easily using argpartition and argsort.
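The single-query reduction just described can be sketched in NumPy (a minimal illustration, not scikit-learn's actual code; the data and function name are made up):

```python
import numpy as np

def knn_brute_force(x, Y, k):
    """Brute-force k-nearest neighbors of a single query x in dataset Y."""
    # Squared Euclidean distances from x to every row of Y.
    dist = np.sum((Y - x) ** 2, axis=1)
    # argpartition places the indices of the k smallest distances first (unsorted).
    idx = np.argpartition(dist, k)[:k]
    # Sort those k candidates by distance to get the final neighbor order.
    order = np.argsort(dist[idx])
    return idx[order], np.sqrt(dist[idx][order])

rng = np.random.default_rng(0)
Y = rng.normal(size=(1000, 16))
x = Y[42]  # query taken from the dataset; its nearest neighbor is itself
indices, distances = knn_brute_force(x, Y, k=5)
```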
08:22
Using NumPy, you can get the result in a few lines, as here. Actually, sometimes we are not working with just one vector; we have many. So instead of a small x, I will use a capital X to refer to the set of all query vectors. And what you are computing is then not just a vector,
08:41
but an entire pairwise distance matrix, which consists of all the distances between the vectors in X and the vectors in Y. If you try to perform those operations naively, it will just crash your RAM, because these matrices don't fit in RAM: with a dataset of a few hundred thousand observations,
09:04
it just will not fit into your memory. So what you do is chunk the matrix X and compute all the operations independently for groups of vectors.
09:22
So you consider smaller distance matrices, and this can be done in parallel, for example using joblib. This was the previous implementation of the k-neighbors search, the implementation as of scikit-learn 1.0.
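A rough sketch of that chunk-on-X strategy (the real pre-1.1 implementation used joblib; a standard-library thread pool stands in here, and the chunk size is arbitrary):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def _chunk_argkmin(X_chunk, Y, k):
    # Pairwise squared Euclidean distances for one chunk of queries only,
    # so the intermediate matrix stays small enough to fit in RAM.
    d2 = ((X_chunk[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    idx = np.argpartition(d2, k, axis=1)[:, :k]
    rows = np.arange(len(X_chunk))[:, None]
    order = np.argsort(d2[rows, idx], axis=1)
    return idx[rows, order]

def chunked_argkmin(X, Y, k, chunk_size=64, n_threads=4):
    # Each chunk of X is reduced independently, so chunks can run in parallel.
    chunks = [X[i:i + chunk_size] for i in range(0, len(X), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = pool.map(lambda c: _chunk_argkmin(c, Y, k), chunks)
    return np.vstack(list(parts))

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 5))
Y = np.vstack([X[:10], rng.normal(size=(90, 5))])  # Y contains X[:10] exactly
neighbors = chunked_argkmin(X, Y, k=3)
```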
09:40
The problem with this implementation is that it does not scale properly on big machines with many threads. Here you have a simple graph with the number of threads on the x-axis and the speedup on the y-axis. What you see is that as you add threads, as you use a bigger machine, you do not get any speedup.
10:02
In the best case, for some kinds of datasets, you get up to a 2x speedup. So why is that? The problem is that with Python there are a few reasons for sub-optimality. Python was not meant for scientific computing at first; it was mainly meant as a nicer scripting language than Bash.
10:23
The first problem is that the execution of this algorithm is bound to the CPython interpreter and to the GIL. That is, you have a lot of costly abstractions for simple operations like basic arithmetic. Moreover, execution is limited to one thread at a time due to the GIL,
10:41
which might be removed someday, we don't know, and which is a mutex on the interpreter state. This comes with performance degradation. Sometimes you may also have a bigger dataset for Y, when you are training on a lot of examples and you only have a few queries; chunking on X is then not well adapted.
11:01
You may want to chunk on Y instead. Another problem is that we use high-level operations on top of arrays. This is costly, because internally you create new data structures, so you need to allocate buffers. This calls malloc and free
11:21
under the hood, and those calls block at the level of the operating system. So you are performing a lot of operations which are long and which block. And in the context of parallelism, this comes as an extra cost on top of the thread pool or process pool setup and teardown.
11:42
Moreover, Python is nice because it is a simple language, but you only have a few primitives, and you can't really express more complex, more detailed execution schemes. So we can solve these problems
12:01
with a few things. First, we can get rid of the CPython interpreter and the GIL using a language called Cython. If you do not know Cython, I encourage you to have a look at it. It is a really nice project, and you can get really good performance just by modifying your code
12:20
with a few extra type annotations and other things. You can make sure that you are not using the GIL with the nogil clause, so you are sure that you are basically writing C code that runs efficiently. You can adapt your chunking strategy so that you are not only chunking on X but also on Y, and you can choose whether you want to prioritize computation on Y or on X.
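One way to chunk on Y is to keep, per query, a heap of the current k best candidates and merge each Y-chunk into it. A plain-Python sketch of the idea for a single query (the real code does this in Cython with C buffers; names and chunk size are illustrative):

```python
import heapq
import numpy as np

def argkmin_chunked_on_y(x, Y, k, chunk_size=50):
    # A max-heap (simulated by negating distances in heapq's min-heap)
    # keeps the k best candidates seen so far across the chunks of Y.
    heap = []  # entries: (-squared_distance, index_in_Y)
    for start in range(0, len(Y), chunk_size):
        chunk = Y[start:start + chunk_size]
        d2 = ((chunk - x) ** 2).sum(axis=1)
        for j, d in enumerate(d2):
            entry = (-d, start + j)
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:  # closer than the current worst kept candidate
                heapq.heapreplace(heap, entry)
    # Sort the kept candidates by ascending distance for the final answer.
    return [i for _, i in sorted(heap, reverse=True)]

rng = np.random.default_rng(2)
Y = rng.normal(size=(200, 4))
nearest = argkmin_chunked_on_y(Y[7], Y, k=5)  # query is Y[7] itself
```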
12:44
Instead of NumPy arrays, you can use other data structures and algorithms underneath, and even call libraries for things like sorting or linear algebra. And Cython allows using OpenMP,
13:02
which gives you thread-level parallelism with basically no overhead, so it's even nicer. There are many more directives in OpenMP if you want to perform computations, but they are not all exposed in Cython.
13:20
It is possible with some tricks, but I'm not going to cover this. Generally, the users of scikit-learn don't run data centers or anything like that; they are mainly people with laptops, such as this one. This machine consists of several things. As you can see,
13:40
it's a simple machine with about 50 gigabytes of RAM. You have four physical cores at the bottom, and in between, you have these things called L3, L2, L1d and L1i. Those are small sets of memory called caches, which are used to store intermediate results.
14:01
If you want your implementation to be performant, you need to take this into account, and this is what we've done. Firstly, we make sure that the data structures we use fit into the L3 cache and that they are only allocated once, at the beginning. That's just efficient.
14:22
You can make sure that data structures are properly allocated so that they map correctly to the cores as well. You can use, as I said, low-overhead parallelism with Cython and OpenMP, and other algorithms and data structures. In our case, instead of using NumPy arrays,
14:43
what we use are max-heaps. This allows a better complexity, and technically you can just use C buffers and pointer arithmetic. And lastly, generally, when people consider this kind of algorithm,
15:00
they are mainly using the Euclidean distance case, which can be decomposed as ||x - y||^2 = ||x||^2 + ||y||^2 - 2<x, y> for the intermediate distance matrices. In this case, you have three terms that you can compute, two of them being computable once at the very beginning and then stored. And you have this term here, which is a small rectangle,
15:22
which can be computed efficiently using BLAS level 3 operations, that is, the most efficient implementations of matrix multiplication that you can have. Using those tricks, what you now get
15:41
is linear scalability, up to a plateau at the end. So now, using bigger machines, you actually get good performance. There is just this plateau here, which can be explained by several things. First, the dataset is relatively small for the machine that we used.
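The Euclidean trick described a moment ago can be sketched in NumPy, where the cross term goes through a BLAS level 3 matrix multiplication under the hood (a simplified sketch on squared distances; the function name is made up):

```python
import numpy as np

def euclidean_argkmin(X, Y, k):
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 <x, y>; the cross term X @ Y.T is a
    # single matrix multiplication, dispatched by NumPy to BLAS (often OpenBLAS).
    X_sq = np.einsum("ij,ij->i", X, X)[:, None]  # computed once, then reused
    Y_sq = np.einsum("ij,ij->i", Y, Y)[None, :]  # computed once, then reused
    d2 = X_sq + Y_sq - 2.0 * (X @ Y.T)
    idx = np.argpartition(d2, k, axis=1)[:, :k]
    rows = np.arange(len(X))[:, None]
    return idx[rows, np.argsort(d2[rows, idx], axis=1)]

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 6))
nbrs = euclidean_argkmin(X, X, k=4)  # each point's nearest neighbor is itself
```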
16:01
And at some point the sequential section of the algorithm is no longer negligible compared to the parallel one. In that case, you do not get speedups as you add more threads, because you cannot add parallelism there. And there's a small cost to synchronizing the threads with OpenMP,
16:21
hence the small slowdown at the end. This is sometimes referred to as Amdahl's law; you need to remember it when you maintain these algorithms. So, to make sure: is it possible to see what happens on the hardware, and to know whether this is actually efficient
16:41
or whether we can speed it up even more? Coming back to this diagram, what we can look at are the cache hits and cache misses on the L3 cache, to see whether it is properly used. We want a high hit rate for the L3 cache.
17:06
We want to see whether there is a low overhead for OpenMP and CPython. And lastly, we want to know whether the implementation of the algorithm
17:20
mostly spends its time in vectorized instructions. You can look at all this using a tool called perf, which is a really simple tool to use and which runs on many kinds of programs.
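The kind of perf invocation being described looks roughly like this (`bench_knn.py` is a placeholder for whatever benchmark script you profile):

```shell
# Count cache references/misses and CPU cycles while running the benchmark.
perf stat -e cache-references,cache-misses,cycles python bench_knn.py

# Sample where the cycles are spent, then browse the annotated report.
perf record -e cycles python bench_knn.py
perf report
```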
17:42
You can even run it on Python and see what happens under the hood. I won't go into details, but basically you just need to set up a few things. Here we want to record the cache hits and misses and the CPU cycles. Then you get a report with everything. And if you look at where the cycles
18:01
of the machine are spent, they are mainly spent here, as you see, in libopenblas, the implementation of BLAS by a project called OpenBLAS. This is where we spend the time computing the matrix multiplication for the distance matrix.
18:23
Some time is spent pushing values onto the heaps, and the time spent in the CPython interpreter and in OpenMP is negligible. This is exactly what we wanted: a super low overhead for CPython and OpenMP.
18:41
So that's what we got. And like the execution is mainly spent into CPU. There's like the execution is like CPU one. There's no problem with like the memory. If you have a look at the time spent in the core of the computation, that is the small region here at the top in green,
19:02
what we see is that all the operations there are vectorized, namely SIMD instructions: you can perform operations on several floats at a time. And this is what you get
19:22
in terms of assembly code here. This shows that the implementation is efficient for this case, which is what we wanted. We haven't written this; it's OpenBLAS that wrote those routines. OpenBLAS is just a super nice project. If you do matrix multiplications,
19:42
chances are that you are using OpenBLAS under the hood without knowing about it. So have a look at OpenBLAS, a super nice project. And lastly, if you look at the L3 cache, with the cache hits at the top and the cache misses at the bottom,
20:01
what we see is that there are many more cache hits than cache misses. This shows that we have a high L3 cache hit rate; that is, we do not move data structures between the RAM and the L3 cache a lot. So this is nice. If you wrap up everything,
20:21
you can say that we have high confidence in quasi-optimal performance for this kind of algorithm. And if you go back to this pattern, we can actually extend it to many other algorithms and many other situations. Firstly, we can adapt the data structures at the end
20:42
and the operations that we use and operate on, as well as call other libraries, like C++ libraries, to support other reductions and algorithms. So you can implement the radius neighbors search like this, you can implement k-means like this, and other reductions;
21:02
you can even implement kernel methods; you can do many things with this. You can also work with sparse datasets: you may have one dataset which is sparse, the other one being dense, or any possible combination.
21:20
And you may have custom distance metrics, things other than the Euclidean distance, that you would like to support. So in our work we have moved to a modular class hierarchy, to support multiple algorithm backends. Here is a simple overview: we have the nearest neighbors search interface at the top,
21:43
which depends on two other interfaces. Cython has some restrictions, but it is possible, using some kind of templating, to dispatch calls to specialized implementations.
22:00
Here is a partial overview of the design: you can have all the main operations done in this abstract pairwise distance reduction class, and then have the reductions and the data structures specialized in subclasses.
22:21
You can factor some operations, for example the computation of the matrix multiplication term, into a dedicated component. So there is support for all distance metrics like this, for all combinations of sparse and dense dataset pairs, support for 32-bit and 64-bit floats, and so on.
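The dispatching design being described can be sketched roughly like this (the class and method names here are made up for illustration, not scikit-learn's actual private API):

```python
from abc import ABC, abstractmethod
import numpy as np

class PairwiseDistancesReduction(ABC):
    """Abstract parent: owns chunking and parallelization; subclasses
    specialize the reduction and its data structures."""

    @classmethod
    def is_usable_for(cls, X, Y, metric):
        # Dispatch criterion: callers fall back to another backend otherwise.
        return X.dtype == Y.dtype == np.float64 and metric == "sqeuclidean"

    @abstractmethod
    def compute(self, X, Y):
        ...

class ArgKmin(PairwiseDistancesReduction):
    """Reduction returning the indices of the k smallest distances per query."""

    def __init__(self, k):
        self.k = k

    def compute(self, X, Y):
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
        idx = np.argpartition(d2, self.k, axis=1)[:, :self.k]
        rows = np.arange(len(X))[:, None]
        return idx[rows, np.argsort(d2[rows, idx], axis=1)]
```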
22:41
But it does not fit in a single slide, so I won't show all of it. Okay, so this work is just a first step towards new computational routines. We now have a partnership with Intel
23:03
to work on their technology and extend this kind of pattern to their hardware. But this is not something we want to do for potentially just one vendor; we would like to have a general design, so that people can
23:20
implement their own backends if they want. This is just the start of a design project; there are references in the slides at the end. Basically, those are ongoing discussions, and this is just one part of the performance work. There are many other things and other ways to speed up implementations.
23:45
And yeah, this is just the beginning. As for Intel, they are working on new hardware which they call XPU, which is the combination of a CPU and a GPU with unified memory.
24:01
So you can have a small shared memory space between your CPU and GPU, so that you can work efficiently on the GPU for vector operations and on the CPU for simpler operations. And that's it. So the conclusion is that improving
24:21
performance is one of the next steps for the library, among many others. In this work, we focused on a core pattern for algorithms that are reductions over pairwise distances. Python is slow for this and has some performance problems.
24:41
Yet it's possible to mitigate this using technologies like Cython, OpenMP, and C++. And the next part of the work is hardware-specific computational routines. That's it for me. Thank you for your attention, and if you have any questions, feel free to ask them.
25:06
Thank you, Julien, very impressive. I'm sure we have lots of questions. Please go ahead. Yeah, thanks for the talk. Well, I just came out of the PyArrow talk. Were you there by chance?
25:21
They showed the chunked array: it supports zero-copy, no-cost conversion to NumPy arrays, and it already does this sort of chunking that you showed. Is there any chance that you might alleviate these problems with Python by basing your backend a little bit on PyArrow
25:43
for certain of those operations that are parallelized? Yeah, that's a nice question. We have not considered PyArrow as of now. There are, in fact, many projects and technologies that exist to perform these kinds of computations.
26:00
There's a project which is called PyKeOps. Oh, do we have internet access? This computer is not connected to the internet. Yeah, I can look it up later. It's a project, basically, which, yeah.
26:24
PyKeOps. So, this project here, this one, yeah. This project is really dedicated to performing these kinds of computations on GPU.
26:42
But for CPU, we haven't found any such project. We haven't looked at PyArrow yet, but it may be possible to integrate it. It's just that we don't want too many dependencies. We may want to have a plug-in system, so that if people want specialized implementations,
27:03
they can eventually have something like pip install scikit-learn, bracket pyarrow, bracket intel, or bracket whatever, so that it specializes on the machine. But for PyArrow, we haven't looked at it yet. All right, thanks. Thanks for your suggestion, by the way.
27:23
Hey, thanks for the presentation. That was really cool, especially seeing how you improved k-nearest neighbors. I think that's one of the first things that people new to machine learning work with. And I guess scikit-learn is the one thing people go to to learn and get into machine learning.
27:42
So, really impressive. I'm wondering what other estimators are on your roadmap for improvement. Okay, thanks for your question. What I've covered here are low-level performance improvements, that is, we are mainly writing them ourselves,
28:02
but there are other ways to improve the performance of algorithms and estimators in scikit-learn. One of them is trying to adopt the standard called the Array API, which is a standard for tensor libraries. This way we won't need to rewrite
28:21
all the implementations of the algorithms; we would just have a high-level API for array operations on tensors. There is actually someone working on a proof of concept to speed up linear discriminant analysis. You can get up to a 40x speedup on, let's say, a GPU
28:44
with no change to the algorithm's implementation and no data changes. So this is one part of the performance work we are doing, which is high-level, and it's a matter of settling on a design that works for all the tensor libraries, so it just takes time.
29:02
People in the community are working on many things; if you want to join the adventure, feel free to. It's just that we need to make sure the designs are correctly set so that we don't make mistakes. In this case, some libraries use APIs that are a bit different,
29:21
so this creates some friction. But I think there's room for making sure that we get the best performance, especially on GPU, because as of now, scikit-learn has mainly been relying on NumPy. People are moving towards GPU, so we are working in this direction as well. Cool stuff, thank you.
29:41
Thanks. Thanks for your talk. I have the obvious question, namely: when is it going to come out? What is your estimate of when these changes are going to be released? Which one, exactly? The pairwise distance computation.
30:00
Okay, so, yeah, I should have been clearer, I guess. For this one? Yes. It's already in scikit-learn 1.1. Not entirely: it's the case for the main use case, that is, the k-neighbors search
30:20
using Euclidean distance on dense arrays. We are currently working on extending it to float32 and to combinations of sparse and dense datasets. It will probably land in scikit-learn 1.2, maybe 1.3. We are mostly looking for people
30:41
who are interested in Cython. The bottleneck is mainly PR reviews, debugging and benchmarking. Cython is nice, but we also need people; we are working on many things. So I would say scikit-learn 1.2 and 1.3 will have much more support
31:02
for this kind of pattern. Okay, yeah, thank you. Just one follow-up question. I had problems, especially with RAM, when I used my own distance matrices, non-Euclidean ones or something. Then, with a lot of algorithms,
31:22
I had RAM issues. Are you working on that as well? I was just wondering about that; I do understand that you have a lot to do. Okay, so what was your case exactly? You had a custom distance matrix? Yes, exactly, a custom distance matrix, and I constantly run into RAM issues when I do that.
31:43
So I'm wondering. I think this is something I'm interested in, so if you want, we can discuss it. Potentially, there are patterns in scikit-learn which can be improved, not just for this case. It's often the case for agglomerative clustering, which computes a dissimilarity matrix,
32:03
something like this, and sometimes people just have their memory crash because of it. If you have a problem with a pattern like this, let's discuss it at the end. Yeah, I'm happy to, thank you.
32:21
Okay, I guess there are no more questions. So a round of applause for Julien again.