
Using Pandas and Dask to work with large columnar datasets in Apache Parquet


Formal Metadata

Title
Using Pandas and Dask to work with large columnar datasets in Apache Parquet
Series Title
Number of Parts
132
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You may use, adapt, copy, distribute and make the work or content publicly available for any legal and non-commercial purpose, in unchanged or changed form, provided that you credit the author/rights holder in the manner they specify and that you pass on the work or this content, even in changed form, only under the terms of this license.
Identifiers
Publisher
Publication Year
Language

Content Metadata

Subject Area
Genre
Abstract
Apache Parquet Data Format Apache Parquet is a binary, efficient columnar data format. It uses various techniques to store data in a CPU- and I/O-efficient way, like row groups, compression for pages in column chunks, or dictionary encoding for columns. Index hints and statistics to quickly skip over chunks of irrelevant data enable efficient queries on large amounts of data. Apache Parquet with Pandas & Dask Apache Parquet files can be read into Pandas DataFrames with the two libraries fastparquet and Apache Arrow. While Pandas is mostly used to work with data that fits into memory, Apache Dask allows us to work with data larger than memory and even larger than local disk space. Data can be split up into partitions and stored in cloud object storage systems like Amazon S3 or Azure Storage. Using metadata from the partition filenames, Parquet column statistics and dictionary filtering allows faster performance for selective queries without reading all the data. This talk will show how to use partitioning, row group skipping and general data layout to speed up queries on large amounts of data.
Transcript: English (automatically generated)
So thank you for the nice introduction. Yes, I'm a data scientist and software engineer at Blue Yonder. In my day-to-day work I work a lot with Python, I work with SQL, and I'm responsible for the databases in our company. You can find the slides of the talk and some other stuff,
some software that we have open-sourced, in our GitHub repository at Blue Yonder on GitHub. When I'm not working I still like to fiddle around with Python and the Python community, because I really like the community. So I became one of the co-organizers of PyCon DE 2017 and 2018,
and also a co-founder and co-organizer of the PyData Karlsruhe group. We just started, so we are a small group at the moment, but we really plan to become bigger. So before I go into my talk, let me pitch the conference a little bit.
We'll have, for the second time in a row, the PyData in Karlsruhe. That's in southern Germany, near the Black Forest. It's always a combined event: a PyCon DE with a PyData track. Last year we had about
450 people there, three tracks with talks and one track with tutorials. We had a really nice venue, the Center for Art and Media in Karlsruhe. So here are some impressions. We had great keynotes; we talked about the universe and what it is made of. We had in parallel the great art
exhibition Open Codes, and we had another great keynote speaker, which I was very proud of: the founder of Dask, Matthew Rocklin. So if you think, last year we had Matthew Rocklin, and my talk is about Dask and pandas,
then who would be a nice keynote speaker for this year? That's brand new: we have gotten Wes McKinney as a keynote speaker. He's the founder of pandas, he is a PMC member of Apache Arrow and Apache Parquet, so that's what I'm talking about, and he is also with the Apache Software Foundation.
So I'm very proud that we have got him as a keynote speaker, because it touches lots of areas that we work on in the company and that I'm personally interested in. We have already sold 150 tickets, so if you are interested, you really should hurry up, because once we get the schedule out,
the tickets will go in the blink of an eye, and we expect to be sold out once we push the talks out. So let's come back to my talk. Today I will talk about
columnar data in Apache Parquet and how we use pandas and Dask to work with it. API-wise it's not that much in pandas and Dask, but I will tell you a little bit more about the challenges and the motivations we had in our software architecture
that we needed to change, how we came to the conclusion that Apache Parquet is a nice storage technology for us, and show how we use pandas and Dask to work with it and how it improved the architecture in our company.
So, what do we do at Blue Yonder? We are a machine learning company, we work in the retail space. Our business model is that the customer, that's a big retail chain, sends us their data: mostly sales and stocks and product information and store information.
We put all this data in a big SQL database. Then we have our machine learning algorithms; we train them on the historical data, and then we can make predictions about future sales in different stores, and then we use this probability density function
to build replenishment on top of it. So we tell the supply chains every day: you need to order this amount of products for this store for the next day, so that you are not going out of stock and you are not having too much waste.
That's our business model. We work with quite some data. We mostly use supervised regression for our models. We get lots of data from the customer, but we also have some additional data sources like events or weather data, and
based on that we basically build a huge matrix with features for each product, location and date combination. We use the historical values to train our models and then run our machine learning pipeline on a daily basis to
generate new sales forecasts each day. I'm not going any further into the machine learning today; the only thing that is important: normally we have about three to five years of historical data from our clients, and that's up to 100 billion records.
So that's quite some stuff. Let's go back one or two years in time: how was the data flow at our company? We had these huge clusters
of in-memory, columnar, distributed SQL databases. It's proprietary software called Exasol; we normally run this with four to eight node database clusters, and it keeps all the data in memory. So the customer sends us data every day and we put it into the database.
Then we have a machine learning cluster based on Apache Mesos and Aurora. We run open-source software like Jupyter or pandas on top of it. We pull the data out of the Exasol database into the machine learning cluster, and after this we insert it back into the database, and then the customer pulls it from the database.
To give you some numbers: if we calculate predictions for a typical retailer, that's about 20,000 products, about 500 stores, and we calculate a horizon of 20 days. So we calculate each day 20 million decisions
for each customer, and all this data has to go out of the database and back into the database again. So that's where we already see that we have a bottleneck. On the one side we can trivially parallelize our machine learning algorithms, based on clustering on product groups or locations,
so we really can scale this out to 500 workers with no problems. But in front we have this database where we can pull out data with at most 10 or 15 connections, and then the database is at full load. And even worse,
we want to insert the data back into the database, and because of full table locks we can only insert with one connection at a time. We are inserting at about 25,000 rows per second; if you calculate it up, we already spend about two hours only getting the data back into the database.
If you're working in the retail space and you look at the operations clock of a typical retailer, a two-hour delay because we need to get all the data into the database is a huge amount of time for them. So we figured out, okay, we need to do better. Our conclusion was always:
Python is a really good companion for the data scientists. They are really happy with it, they like to build data models and all this stuff. But it has not been the best at moving large data in and out.
So why do we care, or what do we think about? Most of our data are huge time series. They do not necessarily need to be in a database, so we would really like to have these time series in a distributed file system with a non-uniform schema.
We like to execute queries on top of it, but most of the time we are only interested in a subset of the data and not all the data. There are lots of technologies out there that work very nicely with this kind of constraints, but they are mostly in the
Java ecosystem, so they are not available in Python, like Presto or Impala or Drill. It's nice that they are available in Java, but we needed something that's available in Python to really get our data in and out.
So the obvious solution for us, two years ago, was: okay, let's look into Parquet as a file format and into a blob storage as a storage technology to get the data into our machine learning pipelines in a better way.
Two years ago there was not a really good Parquet library for Python. With some work from us and others this has changed, so we are really happy with this. Before, this was a general problem in all languages: if you want to access
data or data formats, you often have very good interoperability within your ecosystem, so the Python world is mostly backed by NumPy, but you have poor integration with other systems.
In theory a memory copy is very fast; you have a little bit of deserialization on top of it, but it wouldn't be that big of an issue. If you go back two years in time, most of the time the solution for cross-language data sharing was: we need to go with CSV.
But CSV is not a real file format, it's just text, and yeah, I don't like it. So the obvious solution is Parquet. Parquet is a binary, columnar data format. I'll give you a little bit of explanation of what this means.
Typical records that we operate on always look like: on this day, in this location, we have sold this amount of products. Of course in our real machine learning pipelines we have many more features, but it always looks a little bit the same. So if you look at traditional
database systems, or at how the memory buffer is stored, it's in a row-based fashion: you write row after row into your memory. Columnar storage takes a different approach: it writes all the cells from one column one after another, then the next column, and then the next column.
This might not be the optimal approach if you are only interested in working with one row, because then the traditional memory buffer is better: you can select one row and just read the records of this row. But columnar storage is much more efficient if you want to work on huge blocks of your data,
because you can fetch all the data, it fits into the caches of your CPU, and you can work with these blocks much better. So Parquet is a columnar storage. It was started in 2012 by Cloudera and Twitter,
so it was born in the Hadoop ecosystem. The first implementations were in Java, and the first 1.0 release was in 2013. It's now an Apache top-level project, and starting in fall 2016 there was the first Python and C++ support.
At the moment, in the Hadoop or big data Java ecosystem, it's the state-of-the-art default I/O option. So why should you use Parquet? As I said earlier, it's a columnar format, it's very good for vectorized operations, and it has very efficient encodings and compression.
You can use predicate pushdown to only read the data that you need; I'll tell you a little bit more later about how this works. And it's a language-independent format, so you have libraries in Java, Scala, C++ and Python. A big advantage over, for example, CSV is that it's a splittable
format, so you can read only the parts of your file that you're interested in. In CSV, for example, you can't just jump to row 100,000 and read a block from there; you always have to seek through all the data to see where you are.
There Parquet is much better. You have compression at the column level, and especially with the columnar layout of your data this is very efficient. You also have rich metadata: a schema attached to your file and statistics which you can use to read your data efficiently.
At the moment there are two implementations that you can use in Python to work with Parquet. One is the fastparquet library; it comes from the Dask people, they have implemented it. The other one is from the Apache Arrow project; that's mostly driven by Wes McKinney.
What is Apache Arrow? It's a specification for an in-memory columnar data format, so it maps very well to the Parquet on-disk file format. It's a language-independent memory format;
the implementation is in C and C++, and you can expose the API through to other programming languages. So the support for Parquet was brought to pandas via parquet-cpp without any
additional code to read it, and you can also use the Apache Arrow project with other technologies like Spark or Drill. So let's have a look at the Parquet file structure. As I said, it's an on-disk format. Each file
splits up into row groups. Row groups are about 5 megabytes to 1 gigabyte of data, which is, I would say, something from 10,000 to 100,000 rows. Within a row group
you have column chunks. It's a columnar file format, so for each column you write the data for each record one after another, and then you have the page units within a column chunk that are used for compression and efficient storage.
At the end of the file you have the statistics; this is the metadata of the file. This is a very nice feature, because this way you can stream the data into a Parquet file, just keep records of the metadata and the statistics, and once you are done, write the statistics at the end.
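As a hedged sketch of that streaming property, writing row groups one at a time with pyarrow could look roughly like this (file name, schema and the chunk list are made up for illustration):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("date", pa.string()),
                    ("location", pa.string()),
                    ("sales", pa.int64())])

# Stand-in for an incoming stream of data chunks.
chunks = [pd.DataFrame({"date": ["2017-01-01"],
                        "location": ["location_0"],
                        "sales": [3]})]

writer = pq.ParquetWriter("sales.parquet", schema)
for chunk in chunks:
    # Each write_table call appends a row group to the open file.
    writer.write_table(pa.Table.from_pandas(chunk, schema=schema,
                                            preserve_index=False))
writer.close()  # footer with schema and per-row-group statistics is written here
```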
So you don't have to write the statistics first or, even worse, keep all the data in memory before you write it; you can stream it out. Now, if you want to work with Python and a Parquet file, what does this look like? It's
very easy: you just have the pandas top-level function read_parquet. You can either pass it a file handle or a file name. And the first thing, if you are interested in reading Parquet files efficiently, is that you should specify only the
columns that you are interested in reading. So if you have a Parquet file with, like I said earlier, five columns and you are only interested in the day and the sales, and you pass them in, you will only read from disk the data of the columns that you have specified in the column projection.
This is also a feature that I contributed to pandas, and it's been my first contribution to pandas; I'm pretty proud of it, but it was basically just passing it through to the underlying libraries. As you can see, if you pass the columns, you will only read from disk the columns that you requested.
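A minimal example of such a column projection (the file and column names here are just placeholders):

```python
import pandas as pd

# Only the two requested columns are read from disk; the other
# columns of the Parquet file are never touched.
df = pd.read_parquet("sales.parquet", columns=["date", "sales"])
```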
The next thing you should care about, if you want to work efficiently with Parquet, is predicate pushdown. So you not only want to read just the columns that you are interested in, but also
skip data that is not relevant for you. This of course saves I/O load, as the data doesn't need to be transferred, and it also saves CPU cycles, because the data doesn't need to be decoded.
An example: if you want to know which products are sold in dollars, you can select only the product and the currency columns, then you can filter out the euros, and if you have the statistics of the on-file
representation, you can read only the cells that you are interested in. Predicate pushdown is at the moment only available in the fastparquet library with the Python interface. It's planned for PyArrow; it's already there in the underlying parquet C++
implementation, but it's not yet exposed to the Python API or to pandas. So what you can do: there is a very easy filter syntax where you specify the column, an operation
(equals, bigger, smaller, in, not in) and then the value, and this is then used to skip the reading of certain parts of your file. For example, you can switch the engine between fastparquet
and PyArrow in pandas to read from the file; so if you want to read the data and you are only interested in certain locations, you can use this to skip reading whole row groups and just read the row group you are interested in.
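A sketch of the filter syntax described here, assuming the fastparquet engine as in the talk (column names and the value are invented; newer PyArrow versions accept filters as well):

```python
import pandas as pd

# Row groups whose statistics show they cannot contain "location_0"
# are skipped entirely instead of being read and decoded.
df = pd.read_parquet(
    "sales.parquet",
    engine="fastparquet",
    columns=["date", "sales"],
    filters=[("location", "==", "location_0")],
)
```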
How is this done? If you write the data to a Parquet file, it also writes statistics and metadata about the row groups. For example, it writes for every column the min and the max values, and if you read this metadata first,
then you can use it to skip a whole row group if it doesn't match your predicate. This only works well if you work on sorted data, because otherwise, if you have randomly distributed columns,
more or less each row group will have the min/max values of the whole distribution of your data. So if you want to use this feature, you should make sure that when you write the data to disk it's in sorted order, and then this will really speed up your reads. But otherwise you won't gain that much.
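One way to look at these per-row-group statistics is through pyarrow's metadata objects (a sketch; the file name is a placeholder, and statistics can be missing if the writer did not store them):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("sales.parquet")
meta = pf.metadata
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    col = rg.column(0)  # metadata of the first column chunk in this row group
    if col.statistics is not None:
        print(i, rg.num_rows, col.statistics.min, col.statistics.max)
```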
Another trick that you can use, on the one hand to minimize the data storage size on disk and also to speed up reading later, is to use dictionary
encoding. This is a very good fit for pandas categoricals. If you have, for example, the color of your product, and it's always from the same categories like blue, red, yellow, you can use dictionary encoding. You can specify this and Parquet will then store the
data with the dictionary encoding on disk, so you only store the name of each value once and then just store pointers to the values. This is much more efficient, and on top of this you could also use
run-length encoding on the values, so this will also get smaller. And then if you want to filter again, based on the color, or here the example was a sales type, you can just check in the dictionary of this row group whether
your predicate matches this row group, then read the data, and otherwise not read it. So basically you can work with the dictionary encoding like a Bloom filter, but you don't have the false positives.
The performance penalty for adding dictionaries is not that high; it's just about 1% of the reading time. To use it, all the values in the column need to be dictionary encoded, so all row groups and all page chunks really must use the dictionary encoding.
And this is especially helpful if you have not sorted the column, so you can still benefit from partial reading of your data: again, you look at the dictionary of each row group first and then you can decide to skip a whole row group and not read it at all, and this will massively speed up your applications.
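A small sketch of how the categorical / dictionary-encoding fit could look with pandas and pyarrow (column names and values are invented; pyarrow dictionary-encodes by default, here we restrict it to one column explicitly):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "product": ["shirt", "shirt", "shoe", "shoe"],
    "color": pd.Categorical(["blue", "red", "blue", "yellow"]),
    "sales": [3, 1, 4, 2],
})

table = pa.Table.from_pandas(df, preserve_index=False)
# Store the color column dictionary-encoded on disk: each distinct value
# is written once, and the data pages only hold small pointers into it.
pq.write_table(table, "products.parquet", use_dictionary=["color"])
```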
Besides compression and encoding at the column level, Parquet has compression on the whole Parquet file,
so you can shrink the data independent of its content. It's more CPU-intensive than the encoding, but encoding plus compression works better than compression alone. Available are gzip, Snappy and Brotli; Brotli is the newest one of the compression
engines. If you don't want to go deep into which compression engine is best for your data, just stick with Snappy; that's what we have seen. Here are some size comparisons that we have done; I think it's based on the New York taxi data
records. The columns are, I think, a bool, a datetime, floats and ints, and we have tried to store it in different formats. You see that with Parquet you really have a compression factor of 10, and you still read it much faster than, for example, CSV.
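Choosing the codec is a single keyword argument in pandas; a minimal sketch with an invented DataFrame and file names:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2017-01-01"] * 3, "sales": [1, 2, 3]})

# Snappy is the usual default and a good speed/size trade-off;
# gzip (and Brotli) compress harder but need more CPU.
df.to_parquet("sales_snappy.parquet", compression="snappy")
df.to_parquet("sales_gzip.parquet", compression="gzip")
```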
Another trick that we use very often, and this time we'll switch to Dask, is data partitioning. Until now we have always read from one file and just read parts of that file.
This is fine if you work on a local machine or work with the data with only one consumer. But as soon as you have a distributed system and work with multiple consumers, and especially with multiple producers, it makes sense to split the data into multiple files. There is, not really a standard, but the Hive partitioning scheme, where you basically
encode the partitioning scheme in your file name, or in the path. So for example, we have a directory date=2017-01-01 and another directory within it, location=location_0, and then you have multiple
parts of your Parquet files. With Dask you can create this kind of structure, but Dask can also be used to auto-detect the structure and read the data back in: you just hand Dask the directory and it
recognizes the structure of the directory, and you can also use predicates, predicate pushdown, in this way to only read the files where your predicate matches the file path.
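A sketch of this Hive-style partitioning with Dask (paths, column names and values are invented):

```python
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"date": ["2017-01-01", "2017-01-01", "2017-01-02"],
                   "location": ["location_0", "location_1", "location_0"],
                   "sales": [3, 1, 4]})
ddf = dd.from_pandas(df, npartitions=1)

# Write one directory per value, e.g. date=2017-01-01/location=location_0/...
ddf.to_parquet("sales_partitioned/", partition_on=["date", "location"])

# Reading back: Dask detects the directory structure, and the filter
# prunes whole partitions based on the paths before any data is read.
subset = dd.read_parquet("sales_partitioned/",
                         filters=[("location", "==", "location_0")])
```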
If you're working in a distributed system, sooner or later you have to move away from your local computer and split computation from storage. The approach that we took: we didn't want to host an HDFS file system on our own, and since we are running on Azure we decided to use the Azure blob storage
to store all our data and all the Parquet files. You can pass into Parquet, or into PyArrow, either a file handle to a local file or something that behaves like a Python file object.
So this is an example: you just get a stream into a file object from the blob service, this is the way you access the Azure blob storage, then pass it into the Parquet read_table and then go to pandas, to get the data into pandas. This works very well if you want to work with the whole data.
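A minimal sketch of that approach, leaving out the Azure SDK call itself (how you download the blob bytes depends on the SDK version in use):

```python
import io
import pyarrow.parquet as pq


def read_parquet_blob(blob_bytes: bytes):
    """Turn the downloaded bytes of one Parquet blob into a pandas DataFrame."""
    buf = io.BytesIO(blob_bytes)   # pyarrow accepts any file-like object
    table = pq.read_table(buf)
    return table.to_pandas()
```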
But if you take this streaming approach, you lose all the benefits that I explained before, the predicate pushdown and reading only the data you're interested in. So we have also written an interface for
the Azure blob storage that implements the Python file interface. So we have an I/O interface where you can seek and where you can read. We have open-sourced it as part of the simplekv library; simplekv is an abstraction layer over different storage technologies, and
this helps us to gain the benefit of predicate pushdown and also store the data on a remote file system. Another real-life improvement that we have seen is with Apache Arrow.
As I told you earlier, our architecture is that we need to get the data out of the database and then analyze it in pandas. The data in the database is stored in a columnar form. Until two years ago, if we wanted to pull the data out,
we used pyodbc as a layer to talk to the database. But pyodbc works in a row-wise form, so all the data was transferred from a columnar data storage to the row-wise form, and then when we passed it into pandas it was again converted from row-wise to column-wise.
What we did, and probably what many people do if they want to work with data in pandas: take the ugly real-life solution, which was to export the data from the database as CSV, bypassing ODBC entirely, and then read the CSV back in.
But with CSV you lose all the type information, and it's just a hack, it's not a real way to do it. So what was our solution? One of my colleagues, Michael König, implemented turbodbc. That's an ODBC implementation for Python that can also work with columnar data and supports the Apache Arrow
standard. So you can get the data from the database in a columnar fashion and then expose batches of PyArrow tables to the Python world, and you can then either write it away as Parquet again with nearly zero memory
transformations, or you can get it into pandas and then work with it. If you are more interested in this technology, or in turbodbc in general, you can see the talk by Michael König from last year's PyCon, or check out our tech blog, where we have a really deep dive into how he implemented this.
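Roughly, the turbodbc/Arrow path looks like this (DSN, table and column names are placeholders, and the available fetch methods may differ between turbodbc versions and builds):

```python
from turbodbc import connect
import pyarrow.parquet as pq

connection = connect(dsn="exasol")          # hypothetical ODBC data source
cursor = connection.cursor()
cursor.execute("SELECT date, location, sales FROM sales_history")

table = cursor.fetchallarrow()              # result set as a pyarrow.Table
pq.write_table(table, "sales.parquet")      # or: df = table.to_pandas()
```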
So where are we now? We moved away from direct access to the database because it was the bottleneck. We now export the data from the database into the Azure blob storage, and from there
we run our machine learning models. This is now the huge benefit: we really can scale out our machine learning models, because the Azure blob storage scales much, much better than our database. And the same thing with inserts: on the Azure blob storage we don't have to synchronize the inserts,
every job can just insert into one blob, and then we can work on the data to get it back to the customer. Also a nice benefit is that we now use a
file format that is shared by different implementations, so we can now also use Hive, Presto or Drill to do analysis of the data, and we don't have to store these large time series in our in-memory database, which is much more expensive than the Azure blob storage. The Azure blob storage,
for terabytes, costs, I don't know, 10 or 15 euros, so that's nothing. And the query engines on the bottom right, these are ones that you can fire up for the analysis, and once you are done you can shut them down again, and you don't have to pay, like for the database,
the whole day to keep it running. So what we have learned is: save in one ecosystem, load in another. This really helps us. But always persist the intermediate data. So if you are interested in this topic too, there is still lots of stuff to do
in the Apache Parquet project, but also in the Arrow project. There are still lots of functions that are not yet exposed to Python or pandas or Dask. Some of my colleagues are Apache
members and they work on these projects, but there's still lots of things to do. That's it, my talk on using pandas and Dask to work with large columnar datasets.
So if you have any questions, feel free to ask me now or later, or just ping me at the conference. If you have questions, can you queue on this side of the room, please? I'm really happy I sit on the left side of the room. Thanks. I remember two years ago you were also here at EuroPython, but you were talking about Spark.
Yes, I used Spark back then. Now I see in that picture of your stack that it doesn't have PySpark anymore; it's replaced with Dask. So just a question in that sense: is Dask ready for primetime? I would say not yet. So should we just go back to PySpark?
Also, with Arrow, PySpark has really improved; all the Python access inside Spark has really improved. Two years ago, if you went from the JVM world to the Python world, you had very expensive encoding and decoding of every value, and it was really so slow that basically you couldn't use it at all.
I think with Spark 2.3 they have also introduced Arrow for the conversion, so that now you can work in PySpark in Python and get pandas results, and the whole serialization is done in Arrow. So that's much faster now, but back then it wasn't available, so we went in the Dask direction.
We are still not using Dask everywhere in production, so we still have our own scheduler and some of our own libraries, but we plan to do it. And we do it because it's much more lightweight and fits much better into our stack, because we are a Python company, so that everything can still be Python. That's the reason for us; the footprint of Dask is much, much smaller.
Thanks. If you have a question you should come here; we're not passing the mic around, sorry. Great talk, thank you. I wanted to ask about turbodbc: have you tried to use the SQLAlchemy support in pandas to actually get the data like that?
So turbodbc also works nicely with SQLAlchemy. You can use SQLAlchemy from pandas, but you always need the driver below it to talk to the database, and we either use turbodbc or pyodbc, and
that's the thing that is fast or slow. So if you want to use SQLAlchemy in pandas, that's just a layer above the other stuff. Thank you. I have some questions from my particular case.
In our data pipeline flow there's one step where we need to read in a file which is in nested JSON format. The first time we tried to use pandas read_json, that function cannot handle the nesting. So the solution is
quite evil: basically I had to write an internal, very static library myself to handle these strings, but all these dictionaries are kind of stable, so I transfer them to a pandas data frame. So I was thinking
whether in the future you can, or would like to, think about maybe adding a feature like reading nested JSON. So maybe the solution could be: first you analyze the structure of the JSON to identify
what is going on and then you use some string-cutting mechanism? I mean, that's my temporary solution for that. So, basically Parquet supports nested data structures; in the Parquet files you can really store nested data structures,
not as JSON, but nested data structures, Parquet can handle this. But I don't think this will be exposed to Python, to pandas, because pandas is a tabular, two-dimensional interface, so it doesn't fit well to that data. The reason is that the nested JSON
will finally turn out to be two-dimensional, okay, so that's why I think it would be good to consume it and transfer it directly to a data frame. That's why I wrote a library, I mean, it's an internal library, to directly transfer this nested JSON file to a data frame and pass it to pandas.
I think that's the way to go, like you did it. Yeah. Do we have another question? Then let's go downstairs.
Okay, so I want to thank you again, Peter, for the nice presentation.