Using Pandas and Dask to work with large columnar datasets in Apache Parquet
Formal Metadata
Number of parts: 132
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You may use, change and reproduce, distribute and make the work or content publicly available in unchanged or changed form for any legal and non-commercial purpose, provided that you credit the author/rights holder in the manner specified by them and pass on the work or content, even in changed form, only under the terms of this license.
Identifiers: 10.5446/44897 (DOI)
EuroPython 2018 (108 / 132)
Transcript: English (automatically generated)
00:04
So thank you for the nice introduction. Yes, I'm a data scientist and software engineer at Blue Yonder. In my day-to-day work I work a lot with Python, I work with SQL, and I'm responsible for the databases in our company. You can find the slides of the talk and some other stuff,
00:23
as well as some software that we have open-sourced, in our GitHub repository at Blue Yonder on GitHub. When I'm not working I still like to fiddle around with Python and the Python community, because I really like the community, so I became one of the co-organizers of PyCon DE 2017 and 2018.
00:44
I'm also a co-founder and co-organizer of the PyData Karlsruhe group. We just started, so we are a small group at the moment, but we really plan to become bigger. So before I go into my talk, let me pitch the conference a little bit.
01:01
We'll have, for the second time in a row, the PyData in Karlsruhe. That's in southern Germany, near the Black Forest. It's always a combined event: a PyCon DE with a PyData track. Last year we had about
01:22
450 people there, three tracks with talks and one track with tutorials. We had a really nice venue, the Center for Art and Media in Karlsruhe. So here are some impressions. We had great keynotes, we talked about the universe and what it is made from, and we had in parallel the great art
01:46
exhibition Open Codes. And we had another great keynote speaker, which I was very proud of: the founder of Dask, Matthew Rocklin. So if you think about it, last year we had Matthew Rocklin, and my talk is about Dask and Pandas,
02:01
so who would be a fitting keynote speaker for this year? That's brand new: we have gotten Wes McKinney as a keynote speaker. He's the creator of Pandas, he is a PMC member at Apache Arrow and Parquet, so that's what I'm talking about, and he is also with the Apache Software Foundation.
02:22
I'm very proud that we have got him as a keynote speaker, because it touches lots of areas that we work on in the company and that I'm personally interested in. We have already sold 150 tickets, so if you are interested, you really should hurry up, because once we get the schedule out,
02:41
the tickets will go in the blink of an eye, and we expect to be sold out once we push the talks out. So let's come back to my talk. Today I will talk about
03:00
columnar data in Apache Parquet and how we use Pandas and Dask to work with it. API-wise it's not that much in Pandas and Dask, but I will tell you a little bit more about the challenges and the motivations we had in our software architecture
03:23
that made us change it, how we came to the conclusion that Apache Parquet is a nice storage technology for us, and then show how we use Pandas and Dask to work with it and how it improved the architecture in our company.
03:41
So, what do we do at Blue Yonder? We are a machine learning company and we work in the retail space. Our business model is that the customer, that's a big retail chain, sends us their data, so mostly sales and stocks and product information and store information.
04:03
We put all this data in a big SQL database. Then we have our machine learning algorithms, we train them on the historical data, and then we can make predictions about future sales in different stores. We then use this probability density function
04:23
to build replenishment on top of it, so we tell supply chains every day: you need to order this amount of products for this store for the next day, so that you are not going out of stock and not having too much waste.
04:41
That's our business model. We work with quite some data. We use mostly supervised regression for our models, so we get lots of data from the customer, but we also have some additional data sources like events or weather data, and
05:03
based on that we basically build a huge matrix with features for each product, location and date combination. We use the historical values to train our models and then run our machine learning pipeline on a daily basis to
05:21
generate new sales forecasts each day. I'm not going further into the machine learning today; the only thing that is important: normally we have historical data from our clients of about three to five years, and that's up to 100 billion records.
05:42
So that's quite some stuff. Let's go back one or two years in time: how was the data flow at our company? We had these huge clusters
06:00
of in-memory, columnar, distributed SQL databases. It's proprietary software called Exasol. We normally run this with four to eight node database clusters and it keeps all the data in memory. So the customer sends us data every day and we put it into the database.
06:22
Then we have a machine learning cluster based on Apache Mesos and Aurora. We run open-source software like Jupyter or Pandas on top of it. We pull the data out of Exasol, out of the database, into the machine learning cluster, and after this we insert it back into the database, and then the customer pulls it from the database.
06:45
To give you some numbers: if we calculate predictions for a typical retailer, that's about 20,000 products, about 500 stores, and we calculate a 20-day horizon. So we calculate each day 20 million decisions
07:02
for each day for each customer, and all this data has to go out of the Exasol database and back into the database again. So that's where we already see that we have a bottleneck. On the one side we can trivially parallelize our machine learning algorithms based on clustering on product groups or locations,
07:24
so we really can scale this out to 500 workers with no problem. But in front we have this database, where we can at most pull out data with 10 or 15 connections, and then the database is at full load. And even worse,
07:41
we want to insert the data back into the database, and because of full table locks we can only insert with one connection at a time. We are inserting at about 25,000 rows a second; if you calculate it up, we already use about two hours only to get the data back into the database.
08:01
If you are working in the retail space and you look at the operations clock of a typical retailer, a two-hour delay just because we need to get all the data into the database is a huge amount of time for them. So we figured out: okay, we need to do better. Our conclusion was always:
08:22
Python is a really good companion for the data scientists. They are really happy with it, they like to build data models and all that stuff, they really like it. But it has not been the best at moving large data in and out.
08:41
So why do I care, or what do we think about? Most of our data are huge time series. They do not necessarily need to be in a database, so we would really like to have these time series in a distributed file system with a non-uniform schema.
09:00
We'd like to execute queries on top of it, but most of the time we are only interested in a subset of the data and not all of it. There are lots of technologies out there that work very nicely with this kind of constraint, but they are mostly in the
09:23
Java ecosystem, so they are not available in Python, like Presto or Impala or Drill. It's amazing what's available in Java, but we needed something that's available in Python to really get our data in and out.
09:45
So the obvious solution for us two years ago was: okay, let's look into Parquet as a file format and into a blob storage as a storage technology to get the data better into our machine learning pipelines.
10:02
Two years ago there was not a really good Parquet library for Python. With some work from us and others this has changed, so we are really happy with this. Before, this was a general problem in all languages: if you want to access
10:24
data or data formats, you often have very good interoperability within your own ecosystem, so in the Python world it's mostly backed by NumPy, but you have poor integration with other systems.
10:40
In theory a memory copy is very fast; you have a little bit of deserialization on top of it, but it wouldn't be that big of an issue. If you go back two years in time, most of the time the solution for cross-language data sharing was: we need to go with CSV.
11:01
But CSV is not a file format, it's just text, and yeah, I don't like it. So the obvious solution is Parquet. Parquet is a binary, columnar data format. I'll give you a little explanation of what this means.
11:22
Typical records that we operate on always look like: on this day, in this location, we have sold this amount of products. Of course, in our real machine learning pipelines we have many more features, but it always looks a little bit the same. So if you look at traditional
11:42
database systems, or how the data is stored in a memory buffer, it's in a row-based fashion, so you write row after row into your memory. Columnar storage takes a different approach: it writes all the cells from one column one after another, then the next column, and then the next column.
12:03
This might not be the optimal approach if you are only interested in working with one row, because then the traditional memory buffer is better: you can select one row and just read the records of this row. But columnar storage is much more efficient if you want to work on huge blocks of your data,
12:25
because you can fetch all the data, it fits into the caches of your CPU, and you can work with these blocks much better. So Parquet is a columnar storage format. It was started in 2012 by Cloudera and Twitter,
12:42
so it's in the Hadoop ecosystem, or it was born in the Hadoop ecosystem. The first implementations were in Java and the first 1.0 release was in 2013. It's now an Apache top-level project, and starting in fall 2016 there was the first Python and C++ support.
13:03
At the moment, in the Hadoop or big data Java ecosystem, it's the state-of-the-art default I/O option. So why should you use Parquet? As I said earlier, it's a columnar format, it's very good for vectorized operations, and it has very efficient encodings and compression.
13:22
You can use predicate pushdown to only read the data that you need; I'll tell you a little bit more later how this works. And it's a language-independent format, so you have libraries in Java, Scala, C++ and Python. A big advantage over, for example, CSV is that it's a splittable
13:44
format, so you can read only the parts of your file that you're interested in. In CSV, for example, you can't just jump to row 100,000 and read a block from there; you always have to seek through all the data to see where you are.
14:02
There Parquet is much better. You have compression at a column level, which together with the columnar layout of your data is very efficient, and you also have rich metadata: a schema attached to your file and statistics which you can use to efficiently read your data.
14:23
At the moment there are two implementations that you can use in Python to work with Parquet. One is the fastparquet library, which comes more from the Dask people who implemented it, and the other one is from the Apache Arrow project, which is mostly driven by Wes McKinney.
14:44
What is Apache Arrow? It's a specification for an in-memory columnar data format, so it maps very well to the Parquet on-disk file format. It's a language-independent implementation of a memory format, so
15:02
the implementation is in C or C++ and you can expose the API to other programming languages. This way the support for Parquet was brought to Pandas via parquet-cpp without any
15:20
additional code to read it, and you can also use the Apache Arrow project with other technologies like Spark or Drill. So let's have a look at the Parquet file structure. As I said, it's an on-disk format. Each file
15:43
splits up into row groups. A row group is about 5 megabytes to 1 gigabyte of data, say from 10,000 to 100,000 rows. Within a row group
16:01
you have column chunks. It's a columnar file format, so for each column you write the data for each record one after another, and then you have pages within a column chunk that are used for compression and efficient storage. And
16:22
at the end of the file you have the statistics, the metadata of the file. This is a very nice feature, because this way you can stream the data into a Parquet file: just keep track of the metadata and the statistics, and once you are done, write the statistics at the end. So you don't have to write the statistics first, or even worse keep all the data in memory before you write it; you can stream it out.
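A minimal sketch of inspecting this footer metadata with PyArrow; the file name is made up, and the exact attributes may differ slightly between PyArrow versions:

```python
import pyarrow.parquet as pq

# Open only the footer of a (hypothetical) file: no column data is loaded yet.
pf = pq.ParquetFile("sales.parquet")

print(pf.metadata)                    # schema, number of rows, row groups, ...
print(pf.metadata.num_row_groups)

rg = pf.metadata.row_group(0)
print(rg.num_rows, rg.total_byte_size)
print(rg.column(0).statistics)        # per-column min/max, null count, etc.
```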
16:42
Now, if you want to work with Python and a Parquet file, how does this look? It's
17:01
very easy: you just have the Pandas top-level function read_parquet. You can pass it either a file handle or a file name. And the first thing, if you are interested in efficient reads of Parquet files, is that you should specify only the
17:23
columns that you are interested in reading. So if you have a Parquet file with, like I said earlier, five columns, and you are only interested in the day and the sales, and you pass that in, you will only read from disk the data of the columns that you have specified in the column projection.
17:45
This is also a feature that I contributed to Pandas, and it's been my first contribution to Pandas. I'm pretty proud of it, but it was basically just passing it through to the underlying libraries. And as you can see, if you pass the columns, you will only read from disk the columns that you requested.
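A minimal example of this column projection; the file and column names are made up:

```python
import pandas as pd

# Only the listed columns are read from disk; the projection is handed down
# to the underlying Parquet reader (PyArrow or fastparquet).
df = pd.read_parquet("sales.parquet", columns=["date", "n_sold"])
```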
18:09
What you should care about next, if you want to work efficiently with Parquet, is predicate pushdown. You not only want to read just the columns that you are interested in, but also
18:24
skip data that is not relevant for you. This of course saves I/O load, as the data doesn't need to be transferred, and it also saves CPU cycles, because the data doesn't need to be decoded.
18:40
An example: if you want to know which products are sold in dollars, you can select only the product column and the currency column, then filter out the euros, and if you have the statistics for the on-disk
19:04
representation, you can read only the cells that you are interested in. Predicate pushdown is at the moment only available in the fastparquet library with a Python interface. It's planned for PyArrow; it's already there in the underlying parquet C++
19:23
implementation, but it's not yet exposed in the Python API or in Pandas. So what you can do: there is a very easy filter syntax where you specify the column, an operation,
19:40
so equals, bigger, smaller, in, not in, and then the value, and this is then used to skip the reading of certain parts of your file. For example, you can switch the engine between fastparquet and
20:03
PyArrow in Pandas to read from the file. So if you want to read the data and you are only interested in certain locations, you can use this to skip reading whole row groups and just read the row groups you are interested in.
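A minimal sketch of this filter syntax using fastparquet directly; the file, column and value are made up, and how a filters keyword is passed through pandas.read_parquet can vary between versions:

```python
import fastparquet

# Row groups whose statistics cannot match the predicate are skipped entirely.
pf = fastparquet.ParquetFile("sales.parquet")
df = pf.to_pandas(
    columns=["date", "location", "n_sold"],
    filters=[("location", "==", "store_042")],
)
```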
20:23
How is this done? If you write the data to a Parquet file, it also writes statistics and metadata about the row groups. For example, it writes for every column the min and the max values, and if you read this metadata first,
20:42
you can use it to skip a whole row group if it doesn't match your predicate. This only works really well if you work on sorted data, because otherwise, with randomly distributed columns,
21:01
more or less each row group will have the min and max values of the whole distribution of your data. So if you want to use this feature, you should make sure that when you write the data to disk it's in sorted order; then this will really speed up your reads, but otherwise you won't gain that much.
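To make the mechanism concrete, here is a rough sketch of what such row-group skipping looks like when done by hand with PyArrow; the names are invented and the statistics attributes can differ between PyArrow versions:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("sales_sorted_by_location.parquet")
meta = pf.metadata

# Find the physical column index of "location" in the Parquet schema.
col_idx = next(
    i for i in range(meta.num_columns)
    if meta.row_group(0).column(i).path_in_schema == "location"
)

wanted = "store_042"
tables = []
for rg in range(meta.num_row_groups):
    stats = meta.row_group(rg).column(col_idx).statistics
    # Read the row group only if its min/max range can contain the value.
    if stats is None or stats.min <= wanted <= stats.max:
        tables.append(pf.read_row_group(rg, columns=["location", "n_sold"]))
```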
21:29
Another trick that you can use, on the one hand to minimize the data storage size on disk and on the other hand to speed up reading later, is to use dictionary
21:41
encoding. This is a very good fit for Pandas categoricals. If you have, for example, the color of your product, and it's always from the same set of categories, like blue, red, yellow, you can use dictionary encoding. You can specify this, and Parquet will then store the
22:05
data with dictionary encoding on disk, so you only store the name of each value once and then just store pointers to the values. This is much more efficient, and on top of this you could also use
22:21
run-length encoding on the values, so this will get even smaller. And then if you want to filter again based on the color, or here the example was a sales type, you can just check in the dictionary of each row group whether
22:42
your predicate matches this row group, and then read the data, and otherwise not read it. So basically you can work with the dictionary encoding like a Bloom filter, but you don't have the false positives.
23:01
The performance penalty for adding dictionaries is not that high, just about 1% of the reading time. To use it, all the values in the column need to be dictionary encoded, so all row groups and all pages really must use dictionary encoding.
23:23
This is especially helpful if you have not sorted the column, so you can still benefit from partial reading of your data: again, you look at the dictionary of each row group first and then you can decide to skip a whole row group and not read it, and this will massively speed up your applications.
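A small sketch of how a categorical column ends up dictionary-encoded on disk; the table and column names are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["p1", "p2", "p3", "p4"],
    # Few distinct values: a natural fit for dictionary encoding.
    "color": pd.Categorical(["blue", "red", "blue", "yellow"]),
})

# PyArrow dictionary-encodes columns by default when writing Parquet
# (use_dictionary=True), so each distinct string is stored only once.
df.to_parquet("products.parquet", engine="pyarrow")
```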
23:44
Besides compression and encoding on a column level, Parquet has compression of the whole Parquet file,
24:00
so you can shrink the data independent of its content. It's more CPU intensive than the encoding, but encoding plus compression works better than compression alone. Available are gzip, Snappy and Brotli; Brotli is the newest of the compression
24:21
engines. If you don't want to dig deep into which compression engine is best for your data, just stick with Snappy; that's what we have seen. Here are some size comparisons that we have done. I think it's based on the New York taxi data
24:40
records. The columns are, I think, a bool, a datetime, floats and ints, and we have tried to store it in different formats. You see that with Parquet you really have a compression factor of 10, and you still read it much faster than, for example, CSV.
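The compression codec is just a keyword at write time; a minimal sketch with an invented input file:

```python
import pandas as pd

df = pd.read_csv("taxi_sample.csv")   # hypothetical CSV input

# Snappy is a good default; Brotli or gzip usually compress smaller but
# cost more CPU to write and read.
df.to_parquet("taxi_snappy.parquet", compression="snappy")
df.to_parquet("taxi_brotli.parquet", compression="brotli")
```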
25:05
Another trick that we use very often, and this time we'll switch to Dask, is data partitioning. Until now we have always read from one file and just read parts of that file.
25:20
This is fine if you work on a local machine or work with the data from only one consumer. But as soon as you have a distributed system and work with multiple consumers, and especially with multiple producers, it makes sense to split the data into multiple files. There is, it's not really a standard, but the Hive partitioning scheme, where you basically
25:48
encode the partitioning scheme in your file name or in the path. So for example we have a directory date=2017-01-01 and another directory within it, location=location_0, and then you have multiple
26:04
parts of your Parquet files. With Dask you can create this kind of structure, but Dask can also be used to auto-detect the structure and read the data back in. So you just hand the directory to Dask and it
26:21
recognizes the structure of the directory, and you can also use predicates, predicate pushdown, in this way to only read the files where your predicate matches the file path.
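A minimal sketch of writing and reading such a Hive-style layout with Dask; the paths and column names are made up:

```python
import dask.dataframe as dd

ddf = dd.read_parquet("sales.parquet")

# One directory per partition value, e.g.
#   sales/date=2017-01-01/location=location_0/part.0.parquet
ddf.to_parquet("sales/", partition_on=["date", "location"])

# On read, Dask detects the directory structure; the filter prunes whole
# files/row groups before any data is loaded.
subset = dd.read_parquet("sales/", filters=[("location", "==", "location_0")])
```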
26:41
If you're working in a distributed system, sooner or later you have to move away from your local computer and split computation from storage. The approach we took: we didn't want to host an HDFS file system on our own, and we are running on Azure, so we decided to use the Azure Blob Storage
27:02
to store all our data and all the Parquet files. You can pass into Parquet, or to PyArrow, either a file handle, a local file, or something that behaves like a Python file object.
27:21
This is an example: you just get a stream on a file object from the blob service, that's how you access the Azure Blob Storage, then pass it into the Parquet read_table and convert to Pandas to get the data into Pandas, roughly as sketched below. This works very well if you want to work with the whole data.
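A rough sketch of that pattern, assuming the legacy azure-storage SDK of that era; the account, container and blob names are placeholders and the exact client API depends on the SDK version:

```python
import io
import pyarrow.parquet as pq
from azure.storage.blob import BlockBlobService  # legacy SDK, assumed here

# Download the blob into an in-memory buffer and hand it to PyArrow.
service = BlockBlobService(account_name="myaccount", account_key="...")
buf = io.BytesIO()
service.get_blob_to_stream("mycontainer", "sales.parquet", buf)
buf.seek(0)

table = pq.read_table(buf, columns=["date", "n_sold"])
df = table.to_pandas()
```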
27:44
But if you take this streaming approach, you lose all the benefits that I explained before, the predicate pushdown and only reading the data you're interested in. So we have also written an interface for
28:01
the Azure Blob Storage that implements the Python file interface, so we have an I/O interface where you can seek and read. We have open-sourced it as part of the simplekv library; simplekv is an abstraction layer over different storage technologies, and
28:25
this helps us to gain the benefit of predicate pushdown and also store the data on a remote file system. Another real-life improvement that we have seen with Apache Arrow:
28:44
as I told you earlier, our architecture is that we need to get the data out of the database and then analyze it in Pandas. The data in the database is stored in a columnar form. Until two years ago, if we wanted to pull the data out,
29:01
we used pyodbc as a layer to talk to the database. But pyodbc works in a row-wise fashion, so all the data was transformed from the columnar data storage to row-wise, and then when we passed it into Pandas it was converted again from row-wise to column-wise.
29:23
What we did, and probably what many people do if they want to work with data in Pandas, was to take the ugly real-life solution: export the data from the database as CSV, bypassing ODBC entirely, and then read the CSV back in.
29:40
But with CSV you lose all the type information, and it's just a hack, it's not the real way to do it. So what was our solution? One of my colleagues, Michael König, implemented Turbodbc. That's an ODBC implementation for Python that can also work with columnar data and also supports the Apache Arrow
30:06
standard, so you can get the data from the database in a columnar fashion and then expose batches of PyArrow tables to the Python world. You can either write them away as Parquet again with nearly zero memory
30:24
transformations, or you can get them into Pandas and then work with that. If you are more interested in this technology or Turbodbc in general, you can see the talk from Michael König at last year's PyCon, or you can check out our tech blog where we have some really deep dives into how he implemented it.
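A minimal sketch of that columnar fetch path, assuming a Turbodbc build with Arrow support; the DSN, table and column names are placeholders:

```python
import turbodbc

# Fetch the result set as an Apache Arrow table instead of Python row tuples.
connection = turbodbc.connect(dsn="exasol")     # hypothetical DSN
cursor = connection.cursor()
cursor.execute("SELECT sale_date, location, n_sold FROM sales")

table = cursor.fetchallarrow()   # pyarrow.Table, no row-wise intermediate
df = table.to_pandas()
```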
30:41
So, where are we now? We moved away from direct access to the database, because it was the bottleneck. We now export the data from the database into the Azure Blob Storage, and from there
31:04
we run our machine learning models. The huge benefit is now that we really can scale out our machine learning models, because the Azure Blob Storage scales much, much better than our database. And the same thing with inserts: on the Azure Blob Storage we don't have to synchronize the inserts,
31:26
every job can just insert into one blob, and then we can work on the data to get it back to the customer. Another nice benefit is that we now use a
31:40
file format that is shared by different implementations, so we can now also use Hive, Presto or Drill to do analysis of the data, and we don't have to store these large time series in our in-memory database, which is much more expensive than the Azure Blob Storage. The Azure Blob Storage
32:02
costs for terabytes, I don't know, 10 or 15 euros, so that's nothing. And the query engines on the bottom right, those are ones that you can fire up for the analysis, and once you are done you can shut them down again, and you don't have to pay, like with the database,
32:23
for the whole day to keep it running. So what we have learned is: save in one ecosystem, load in another. This really helps us. But always persist the intermediate data. If you are interested in this topic, there is still lots of stuff to do
32:45
in the Apache Parquet project, but also in the Arrow project. There are still lots of functions that are not yet exposed to Python or Pandas or Dask. Some of my colleagues are Apache
33:00
members and they work on these projects, but there's still lots of things to do. That's it, my talk on using Pandas and Dask to work with large columnar datasets.
33:20
So if you have any questions, feel free to ask me now or later, or just ping me at the conference. If you have questions, can you queue on this side of the room, please? I'm really happy I sit on the left side of the room. Thanks. I remember two years ago you were also here at EuroPython, but you were talking about Spark.
33:42
Yes, I used Spark back then. Now I see in that picture of your stack that it doesn't have PySpark anymore; it's replaced with Dask. So just a question in that sense: is Dask ready for primetime, or would you say not yet, so one should just go back to PySpark?
34:02
Also, with Arrow, PySpark has really improved; all the Python access inside Spark has really improved. Two years ago, if you went from the JVM world to the Python world, you had very expensive encoding and decoding of every value, and it was really so slow that you basically couldn't use it at all.
34:21
I think since Spark 2.3 they have also introduced Arrow for the conversion, so that now you can work in PySpark in Python and get Pandas results, and the whole serialization is done in Arrow. So that's much faster now, but back then it wasn't available, so we went in the Dask direction.
34:43
We are still not using Dask everywhere in production, so we still have our own scheduler and some of our own libraries, but we plan to do it, and we do it because it's much more lightweight and fits much better into our stack. Because we are a Python company, everything can still be Python; that's the reason for us. The footprint of Dask is much, much smaller than Spark's.
35:05
Thanks. If you have a question you should come here; we're not passing the mic around, sorry. Sorry. Great talk, thank you. I wanted to ask about Turbodbc: have you tried to use the SQLAlchemy support in Pandas to actually get the data like that?
35:26
So Turbodbc also works nicely with SQLAlchemy. You can use SQLAlchemy from Pandas, but you always need the driver below it to talk to the database, and we either use Turbodbc or pyodbc, and
35:45
that's the thing that is fast or slow. So if you want to use SQLAlchemy in Pandas, that's just a layer above the other stuff. Thank you. I would have some questions from my particular case.
36:03
In our data pipeline flow there's one step where we need to read in a file which is in nested JSON format. The first time we tried to use the Pandas read_json function, but it cannot handle the nesting, that's the problem. So the solution is
36:26
quite evil, basically: I had to write an internal, very static library myself to handle these strings, but all these dictionaries are kind of stable, so it transfers them to a Pandas data frame. So I was thinking
36:46
whether in the future you can, or would like to, think about maybe adding a feature like reading nested JSON. So maybe the solution could be to first analyze the structure of the JSON to identify
37:02
what is going on and then use some kind of string-cutting mechanism? I mean, that's my temporary solution for it. So, basically, Parquet supports nested data structures; in Parquet files you can really store nested data structures,
37:22
not as JSON, but as nested data structures, and Parquet can handle this. But I don't think this will be exposed to Python, to Pandas, because Pandas is a tabular, two-dimensional interface, so it doesn't fit well to that data. The reason is that the nested JSON
37:42
will finally turn out to be two-dimensional, okay? So that's why I think it would be good to consume it and transfer it directly to a data frame. That's why I wrote a library, I mean an internal library, to directly transfer this nested JSON file to a data frame and pass it to Pandas, and
38:01
I think that's the way to go, like you did it. Yeah. Do we have another question? Then let's go downstairs.
38:21
Okay, so I want to thank you again, Peter, for the nice presentation.