
DuckDB: Bringing analytical SQL directly to your Python shell.


Formal Metadata

Title
DuckDB: Bringing analytical SQL directly to your Python shell.
Title of Series
Number of Parts
542
Author
Contributors
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
In this talk, we will present DuckDB. DuckDB is a novel data management system that executes analytical SQL queries without requiring a server. DuckDB has a unique, in-depth integration with the existing PyData ecosystem. This integration allows DuckDB to query and output data from and to other Python libraries without copying it, which makes DuckDB an essential tool for the data scientist. In a live demo, we will showcase how DuckDB performs and integrates with the most used Python data-wrangling tool, Pandas. The talk is catered primarily towards data scientists and data engineers. It aims to familiarize users with the design differences between Pandas and DuckDB and how to combine them to solve their data-science needs. We will give an overview of five main characteristics of DuckDB: 1) Vectorized Execution Engine, 2) End-to-End Query Optimization, 3) Automatic Parallelism, 4) Beyond-Memory Execution, 5) Data Compression. In addition, users will experience a live demo of DuckDB and Pandas in a typical data science scenario, focusing on comparing their performance and usability while showcasing their cooperation. The demo is most interesting for an audience familiar with Python, the Pandas API, and SQL.
Transcript: English (auto-generated)
So let's welcome Pedro Holanda for his talk on DuckDB, and a magnificent snake duck. Yeah, you'd be surprised what you can find as a rubber duck these days, you know. All right.
So, I'm Pedro Holanda. I am one of the main contributors to the DuckDB project, which is an open-source database system, and I'm also the CEO of DuckDB Labs. Today I'm going to talk a little bit about how DuckDB can bring analytical SQL power directly into your Python shell.
To give you an idea of what this talk looks like: I'm going to start with what DuckDB is. Since I'm here talking about yet one more database system, I'm going to motivate why we actually needed to build one more database system, because the existing ones didn't solve the problems we had, and then I'm going to go over the main characteristics of DuckDB, so what actually makes it special. Then I'll go over DuckDB in Python land, so how DuckDB integrates with the Python ecosystem, and I'm going to do a little demo. The basic idea is that we're going to use the infamous New York City taxi data set and try to do some estimation of fare costs, and we're going to use DuckDB, Pandas and PySpark just to see a couple of the differences in the things I'll be talking about, and then a short summary of the talk.
So what is DuckDB? Well, DuckDB was actually born at CWI, which is the research centre for mathematics and computer science in the Netherlands.
What we had there is that a lot of the projects, the PhD student projects and the master projects, are very data-science-y, so usually you have a data science problem and you want to throw a database management system at it, because you are handling data. So initially we thought, OK, we can probably use a database server, use a database connection, and then just transfer the data from the relational database to your Python terminal, for example, which is where your analytical tools are. It turns out that's quite a bad idea, because you are transferring a lot of data, and that's pretty costly. So then you think, OK, this is really not solving our problem; can we draw inspiration from somewhere else? And then of course there's SQLite, the most famous database out there, or at least the most used one.
And it has quite a nice property, which is being an embedded database system. Being an embedded database system means it can run inside your Python process, so you can eliminate this data transfer cost. SQLite comes with one design decision, though: it's not really optimized for analytics. So we kind of wanted to do SQLite in terms of being easy to use and eliminating this data transfer cost, but focused on analytical queries.
And that's kind of how DuckDB was born, and that's also why we frame it as a SQLite for analytics. It also has a very simple installation: if you're in Python, you just do a pip install and you're good. Since it's embedded, there's no server management. Let's say you just want to query a Parquet file: that's two lines of code. There's no starting a server, there's no schema creation; the schema is inferred from the object. So it's very easy and very fast, and we also really focus on fast data transfer between DuckDB and analytical languages and tools like Python and R. DuckDB is currently in pre-release.
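(For illustration, a minimal sketch of what those two lines can look like; the file name is just a placeholder, not from the talk:)

    # pip install duckdb
    import duckdb

    # No server to start, no schema to create: the schema is inferred from the file.
    duckdb.query("SELECT * FROM 'taxi.parquet' LIMIT 5").show()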
I think the last version we released was 0.6, and 0.7 is coming up soon. On the web page there are a few more details about all the things that are in each release. All right, so I'm going to go over some of the main characteristics of DuckDB: the columnar data storage, a little bit about compression, and the vectorized execution engine. These are all core database things; talking about the vectorized execution engine is going to be a bit difficult because Professor Boncz is here, and he actually created that, so I'll try to do it correctly. Then a little bit about end-to-end query optimization, parallelism, and beyond-memory execution.
So, columnar data storage. There are basically two ways you can do it: one is a row store, the other a column store. As an example of the first one we have SQLite, and the whole idea there is that the data of a row is stored consecutively in memory. That basically means that if you want to fetch an individual row, that's very cheap, because it's contiguous in memory; however, you always have to fetch all the columns. In analytical queries you usually have very wide tables, but you really only want a couple of those columns. So what if you only want to use a few fields? In this example, what if you're just interested in the price of a product, but not the store it's sold in? In a column store the layout is such that the data of a column is consecutive in memory, so if you want to access just a couple of columns you can have immense savings on disk I/O and memory bandwidth.
That's why this type of format is really optimized for analytics. To give you a more concrete example, let's say we have a one-terabyte table with 100 columns; for simplicity, let's say all the columns have the same size, and we only require five columns of the table in our analytical query. In a row store like SQLite, reading this whole table from a disk that does around 100 megabytes per second will take you about three hours. If you use a column store model, which is what Pandas and DuckDB do, reading these five columns from disk takes you about eight minutes. So there's a huge
improvement just from picking the correct storage format for your workload.
Compression. I'm not going to go into a lot of detail about the compression algorithms we implement in DuckDB, but what I can tell you is that because you have a column store, the data of a column sits contiguously in memory, which gives you a very good opportunity to compress it, since data from the same column is usually somewhat similar. So you can apply things like RLE, FSST for strings, and Chimp for floating-point numbers, and really decrease the size of your data. In this table here, which I think is from about a year and a half ago, DuckDB 0.2.8 had no compression at all, and a year and a half later we had managed to implement all these things, which got us five times better compression on lineitem, for example, and 3.18 times better compression on the taxi data set I'm going to use later. And why is compression so important? Well, if we go back to the same example, where reading our five columns from disk cost us eight minutes because of the storage format: if we compress these columns, suddenly you don't have to read 50 gigabytes anymore, you read less. And if you apply the best case from the table I showed you, five times compression, that brings the cost down to about one minute and 40 seconds.
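(To make the arithmetic behind those numbers explicit, a back-of-the-envelope sketch using the same assumptions as the example:)

    # 1 TB table, 100 equally sized columns, a disk that reads ~100 MB/s,
    # and an analytical query that touches only 5 of the columns.
    full_scan_s  = 1_000_000 / 100     # row store reads everything: 10,000 s (~2.8 hours)
    five_cols_s  = 50_000 / 100        # column store reads 5/100 of it: 500 s (~8.3 minutes)
    compressed_s = 50_000 / 5 / 100    # with 5x compression: 100 s (1 minute 40 seconds)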
so execution well There's three ways of doing a query execution There's actually one more, but it's not in the slides But SQLite use the top of the time processing which means that you process one row at a time Pandas who's uses column at a time processing which means that you process one column at a time Inducted B uses kind of like a mix of the both which is a technique developed by Peter
The vectorized processing where you process batches of a column at a time So basically the two point of time model from SQLite It was optimized for a time where computers didn't had a lot of memory there was low memory To be used because you only need to really keep one row in memory for all your whole query plan
So the memory was expensive That's what you could do But this comes with a high CPU overhead per tuple because you constantly resetting your caches You don't have any cache conscious algorithm running that piece of data up to the production for query results If you go to the column at a time, which is what Pandas uses
There's already brings like better CPU utilization it allows for SIMD But it comes with the cost of materializing large intermediates in memory It basically means that you need the whole column in memory at that point to process for that operator And of course the intermediates can be gigabytes each So that's pretty problematic when the data sizes are large
And that's why you see for example that Pandas if your data doesn't fit in memory. What does it happen it crashes? And then if you go to the vectorized query processing is actually Optimized for CPU cache locality you can do SIM destructions pipelining and the whole idea is that your intermediates
Are actually going to fit here in a l1 cache So basically you're going to be paying this latency of one and a second to be accessing your data throughout your query plan Instead of paying the latency of a main memory, which is also the case of a database, which is a hundred nanoseconds it seems like small difference, but When you constantly execute it as this really becomes a bottleneck
End-to-end query optimization is of course something we have in DuckDB. We have things like expression rewriting, join ordering, subquery flattening, and filter and projection pushdown, which is a bit more simple but is extremely important and makes a huge difference in the cost of a query. So here's an example of projection pushdown. Let's say you have a table with five columns, A, B, C, D and E, and you want to run a query that's pretty small: it selects the minimum of column A, with a filter saying column A is bigger than zero, and you group by B. The whole point of this query is that you're only using two columns of the table, right? So what the DuckDB optimizer will do is say: OK, in the scan I know I don't need all the columns, I just need A and B, and you simply don't read the other ones. If you do the same thing in Pandas, for example, you apply your filter and then hand the filtered data to the group-by and the aggregator, but while you're doing that filter you're still filtering all the other columns you're not going to use in your query. Of course, you can make this optimization manually, but it's pretty nice that the database system can do it for you.
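(A sketch of that example; the tiny table is made up just to show the two styles:)

    import duckdb
    import pandas as pd

    # A made-up table with five columns a..e
    df = pd.DataFrame({"a": [1, -2, 3], "b": ["x", "y", "x"],
                       "c": [0, 0, 0], "d": [0, 0, 0], "e": [0, 0, 0]})

    # DuckDB: the optimizer pushes the projection down, so the scan only touches a and b
    duckdb.query("SELECT b, MIN(a) AS min_a FROM df WHERE a > 0 GROUP BY b").df()

    # Pandas: the boolean filter materializes all five columns before the group-by
    df[df["a"] > 0].groupby("b")["a"].min()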
DuckDB also has automatic parallelism and beyond-memory execution. DuckDB has parallel versions of most of its operators.
I think all of our scanners, including with insertion-order preservation, are parallelized now, as well as aggregations and joins. Pandas, for example, only supports single-threaded execution. We all have pretty good laptops these days, right? So it's a shame if you cannot really take advantage of parallelism. And DuckDB also supports executing on data that does not fit in memory. It's kind of the never-give-up, never-surrender approach: we are going to execute this query. We try to always have graceful degradation, so that performance doesn't just suddenly fall off a cliff, and the whole goal is really to never crash and always execute the query.
All right, so a little bit about DuckDB in Python land. Basically, we have an API that is DB-API 2.0 compliant, so very similar to what SQLite has, for example: you can create a connection and you can start executing queries. But we also wanted to have something similar to a data frame API, so that people who come from Pandas, for example, still have something familiar to work with. In this example you also create a connection, and then you create this relation, which kind of looks like a data frame; it's just pointing to a table. You can do a show to inspect what's inside the table, and you can apply these chaining operators, like a filter or a projection. In the end this is all lazily executed, and that also allows you to take advantage of DuckDB's optimizer even when you use the chaining operations.
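(A small sketch of the two styles, with a made-up table, assuming the standard duckdb Python package:)

    import duckdb

    con = duckdb.connect()  # in-memory database

    # DB-API style: execute SQL on the connection
    con.execute("CREATE TABLE items (name VARCHAR, price DOUBLE)")
    con.execute("INSERT INTO items VALUES ('duck', 4.2), ('goose', 1.0)")
    print(con.execute("SELECT COUNT(*) FROM items").fetchone())

    # Relational, data-frame-like style: built lazily and optimized as one query
    rel = con.table("items")
    rel.filter("price > 2").project("name, price").show()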
Of course, I talked before about memory transfer costs, so we were also very careful to be tightly integrated with the very common libraries in Python, like Pandas and PyArrow. What actually happens is that, in the end, the Pandas columns are usually NumPy arrays, which turn out to be C arrays, which turns out to be kind of what we use as well. So with a little bit of make-up on the metadata we can just read them directly, and since we're all in the same process, we have access to that piece of memory. In the end that means you can access the data from Pandas in DuckDB without paying any transfer cost, or at least only a constant transfer cost for doing the metadata make-up, let's say. And it's the same thing with PyArrow: we have what is called zero copy, so we can read Arrow objects and output Arrow objects without any extra cost. We also support NumPy and SQLAlchemy, and in Ibis we're actually their default back end, I think since about six months ago.
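(For example, something along these lines; the data frame is made up, and the .arrow() call needs pyarrow installed:)

    import duckdb
    import pandas as pd

    # An existing Pandas data frame living in this Python process
    df = pd.DataFrame({"passenger_count": [1, 2, 2], "tip_amount": [1.0, 2.0, 4.0]})

    # DuckDB scans the data frame in place, without copying the column data ...
    result = duckdb.query(
        "SELECT passenger_count, AVG(tip_amount) AS avg_tip FROM df GROUP BY passenger_count")

    # ... and can hand the result back as a Pandas data frame or an Arrow table
    print(result.df())
    print(result.arrow())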
As you can see, this is our PyPI download count. The Python library is actually our most downloaded API; we have APIs for all sorts of languages. And you can see that in the last month we had about nine hundred thousand downloads, so there are a lot of people trying out and using DuckDB in their Python scripts.
So now it's demo time. Let me get this up. All right, I hope you can see this. This is just installing DuckDB and PySpark, getting our yellow-trip-data data set, and importing the things we're going to use. And here we're just getting a connection from DuckDB and creating a relation.
The relation is over a Parquet file, since DuckDB can read Parquet files, and then you can just print it to inspect what's in there. So if we run this, we can see the columns we have: vendor ID, pickup datetime, passenger count, and you get the types of the columns. You also get a little result preview to get an idea of what your data looks like. I think this data set has about 20 columns, with information about the taxi rides in New York in 2016. And then you can also run a simple query; here I'm just doing a count to know how many tuples there are.
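(Roughly what that setup looks like; the file name is a placeholder, and the exact reader method may differ a bit between DuckDB versions:)

    import duckdb

    con = duckdb.connect()
    # Relation over the Parquet file; nothing heavy happens until we run a query on it
    trips = con.from_parquet("yellow_tripdata_2016-01.parquet")
    print(trips)                          # column names, types and a small result preview
    trips.aggregate("count(*)").show()    # how many tuples are in the data set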
We have about 10 million tuples in this data set. All right, so this function here is just to do a little bit of benchmarking; coming from academia, we do have to do something that's kind of fair, I guess, so I run everything five times and take the median time.
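(The helper is nothing fancy; a sketch of the idea, not the exact code from the demo:)

    import time
    from statistics import median

    def benchmark(fn, runs=5):
        """Run fn several times and report the median wall-clock time in seconds."""
        times = []
        for _ in range(runs):
            start = time.time()
            fn()
            times.append(time.time() - start)
        return median(times)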
And then this is where our demo actually starts. We start off with a data frame, since Pandas can also read Parquet files, and the whole thing about DuckDB, again, is that it's not here as a replacement for Pandas. That's not what I'm trying to sell, but rather something that can work together with Pandas. The cool thing is that we can, again, read and output data frames without any extra cost. So let's say that in the query here we're getting the passenger count and the average tip amount of the trips that had a short distance, and we group by the number of passengers. What we want to know is: for short trips, does the amount of tip depend on the number of passengers in the ride?
And what you can see here is that you can, again, read directly from the data frame, that's what we're doing, and we just have to use the data frame's name in our SQL query. And if you call .df() on the query result, you also get a data frame back out, which is pretty cool because data frames have plotting capabilities, like plot.bar, that DuckDB doesn't have, so you can easily get a very nice chart. You can see here that apparently there is some dirty data, because people are getting tips when there isn't anyone in the ride; I'm not sure what that is. But apparently, if you have more people, seven to nine, maybe the more expensive cars, you get a higher tip.
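(The query in question could look roughly like this; the column names follow the usual NYC yellow-taxi schema and the short-trip threshold is just illustrative:)

    import duckdb
    import pandas as pd

    df = pd.read_parquet("yellow_tripdata_2016-01.parquet")   # placeholder file name

    avg_tips = duckdb.query("""
        SELECT passenger_count, AVG(tip_amount) AS avg_tip
        FROM df                              -- the Pandas data frame, scanned in place
        WHERE trip_distance < 5              -- 'short' trips
        GROUP BY passenger_count
        ORDER BY passenger_count
    """).df()

    avg_tips.plot.bar(x="passenger_count", y="avg_tip")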
And you can do the same thing in Pandas, of course. If you don't have SQL, you have to use Pandas' own language to do the group-by and the average, and you can use the plots directly. The whole point here is to show the difference in execution time.
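(The Pandas-only version of the same thing, roughly:)

    import pandas as pd

    df = pd.read_parquet("yellow_tripdata_2016-01.parquet")   # same placeholder file
    short = df[df["trip_distance"] < 5]
    short.groupby("passenger_count")["tip_amount"].mean().plot.bar()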
So now we're waiting... OK, so that took about a second, and DuckDB took 0.2, 0.25, so that's about a 4-5x difference. And you also have to consider that we're not using a very beefy machine; this is a Colab machine. Imagine if you had more cores: this difference would be even bigger. And then I added Spark, for fun. Spark can actually also read data frames, but it crashes out of memory on my Colab machine, so I had to give up on that and read directly from the Parquet file, but it does output a data frame. I think we're going to have to wait a little bit, but it's doing its best.
So, of course, Spark is not designed for small data sets, but it turns out there are a lot of use cases where you use these small data sets. While it's going, it's warming up a little bit; that's good for the winter, it produces some energy, I think. All right. OK, so it took two seconds, 2.2 seconds for the actual execution, and that's already more than twice what Pandas took. Anyway, for the demo I of course showed you something that's fairly simple. Can you actually do very complicated things?
Maybe not very complicated, but more complicated. So here I'm not really going to go over the query, but the whole idea is that we can, for example, use DuckDB to run a linear regression. Can we estimate the fare from the trip distance? It turns out you can calculate the alpha and the beta with not such a crazy query, and then you can again export the result to Pandas and you get a very nice figure out of it.
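(One way to get the slope and intercept of fare against trip distance with plain SQL aggregates; a hedged sketch with assumed column names, not the exact query from the demo:)

    import duckdb
    import pandas as pd

    df = pd.read_parquet("yellow_tripdata_2016-01.parquet")   # placeholder file name

    # Least-squares fit of fare_amount against trip_distance:
    #   beta = cov(x, y) / var(x),  alpha = mean(y) - beta * mean(x)
    coeffs = duckdb.query("""
        WITH stats AS (
            SELECT AVG(trip_distance)                 AS mx,
                   AVG(fare_amount)                   AS my,
                   AVG(trip_distance * fare_amount)   AS mxy,
                   AVG(trip_distance * trip_distance) AS mxx
            FROM df
        )
        SELECT (mxy - mx * my) / (mxx - mx * mx)           AS beta,
               my - (mxy - mx * my) / (mxx - mx * mx) * mx AS alpha
        FROM stats
    """).df()
    print(coeffs)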
So you can really combine these two to get the best out of both. All right, that was the demo.
Summary. Oh, that's my last slide, good. So yeah, DuckDB is an embedded database system. Again, it's completely open source, under the MIT license. Coming from academia, this is something we were always worried about: giving it back to everyone, because it was initially funded by taxpayer money. So everyone can use it, and one hundred percent of what we do is actually open source.
There's nothing that's closed source. It's designed for analytical queries, so data analysis and data science. It has bindings for many languages; of course, I'm in the Python dev room talking about Python, but we also have Java, for example, and it turns out Java is one of our most downloaded APIs, so I guess that's an interesting sign, and JavaScript and a bunch of others. It has very tight integration with the Python ecosystem; again, the whole idea is that you eliminate transfer costs. It implements the DB-API and the relational API, the relational API again being the more data-frame-like one, and it has full SQL support, so anything you can imagine, like window functions or whatnot, you can just express in DuckDB. And that's it. Thank you very much for paying attention.
Happy to answer questions.
Thank you, Pedro. So we have five minutes for questions.
Thank you. Thanks for the great presentation. You mentioned beyond-memory execution and graceful degradation, that it tries not to degrade too much. Can you shine a little bit more light on what happens under the hood, and to what degree?
Yeah, of course. I think at the moment it's only the ordering operator that actually does that. We have Laurens, who is doing his thesis on this here, so there are a lot of operators that still need research and development; it's more of a goal than something that fully happens today. But the whole goal is that you really don't have these sudden spikes. There's research going on, and in the future there will be more to share, for sure.
Thank you very much for the talk; it's very exciting to see such a powerful tool. I usually work with data warehouses, and I saw on the website that you do not recommend using this with data warehouses. I would like to know why.
So, of course, there's no one solution for all problems, right? There are cases where data warehouses are a very good fit. It turns out that for data science, for example, which is kind of what we preach the most, they're usually not that good, because then you fall back to sending your data outside your database system. You're not really going to be running your Python code inside the system; you can do that, with user-defined functions for example, but they are a bit messy. So you really want to have it embedded in your Python process, so you completely eliminate the transfer cost, because usually what you do is: OK, I have a table with ten columns, I'm only going over four of them, but I'm still hitting really huge chunks of it. That's the bottleneck we try to eliminate.
How do you handle updates?
Yeah, although we are an analytical database system, we do do updates. Mark, I don't know where he is, but he's probably there, developed an MVCC algorithm for OLAP, so we have the same ACID transactional capabilities that you would expect from a transactional database. Of course, if you have a transactional workload you should still go for Postgres or SQLite or another database that handles that type of transactions, but Mark has developed a full-on algorithm to handle updates completely.
How do you compare to Vertica?
Good question. I think in terms of analytical queries, on TPC-H it's probably similar performance, but then again, the whole point is that if you go to the Python process, the data transfer cost will take most of the time there, and DuckDB is really catered for this type of scenario, the embedded scenario.
We have one minute left for one more question. Oh, yeah, I actually have a notebook somewhere with a bunch of examples as well; I'm very happy to share it. I don't know where I'll post it. Ah, the FOSDEM thing, I guess. All right. Thank you a lot, Pedro. Thanks a lot.