Formal Metadata

Title: DuckDB
Subtitle: An Embeddable Analytical Database
Author: Hannes Mühleisen
License: CC Attribution 2.0 Belgium. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
We present DuckDB, our new, open-source embedded analytical data management system. Data management systems have evolved into large monolithic database servers running as stand-alone processes. This is partly a result of the need to serve requests from many clients simultaneously and partly due to data integrity requirements. While powerful, stand-alone systems require considerable effort to set up properly, and data access is constricted by their client protocols.

There exists a completely separate use case for data management systems: those that are embedded into other processes, where the database system is a linked library that runs completely within a "host" process. The most well-known representative of this group is SQLite, the most widely deployed SQL database engine, with more than a trillion databases in active use. SQLite strongly focuses on transactional (OLTP) workloads and contains a row-major execution engine operating on a B-tree storage format. As a consequence, SQLite's performance on analytical (OLAP) workloads is very poor.

There is a clear need for embeddable analytical data management. This need stems from two main sources: interactive data analysis and edge computing. Interactive data analysis is performed using tools such as R or Python. The basic data management operators available in these environments through extensions (dplyr, Pandas, etc.) closely resemble stacked relational operators, much like in SQL queries, but lack full-query optimization and transactional storage. Embedded analytical data management is also desirable for edge computing scenarios. For example, connected power meters currently forward data to a central location for analysis. This is problematic due to bandwidth limitations, especially on radio interfaces, and also raises privacy concerns. An embeddable analytical database is very well equipped to support this use case, with data analyzed on the edge node. The two use cases of interactive analysis and edge computing appear orthogonal, but surprisingly, they yield similar requirements.

In this talk, we present our new system, DuckDB. DuckDB is a new purpose-built embeddable relational database management system created at the Database Architectures group of the CWI. DuckDB is available as open-source software under the permissive MIT license. To the best of our knowledge, there currently exists no purpose-built embeddable analytical database despite the clear need outlined above. DuckDB is no research prototype but built to be widely used, with millions of test queries run on each commit to ensure correct operation and completeness of the SQL interface.

DuckDB is built from the ground up with analytical query processing in mind. As storage, DuckDB uses a single-file format with tables partitioned into columnar segments. Data is loaded into memory using a traditional buffer manager; however, the blocks that are loaded are significantly larger than those of a traditional OLTP system to allow for efficient random seeks of blocks. Queries are processed using a vectorized query processing engine to allow for high-performance batch processing and SIMD optimizations.
Transcript: English (auto-generated)
Hello everybody, welcome to the FOSDEM Lightning Talks in Building 8. I want to introduce you to Hannes Mühleisen, who will talk about DuckDB, an Embeddable Analytical
Database, and give him a warm welcome. Thank you, welcome everybody. So a quick introduction. So I work at CWI, which is the Dutch national research lab for computer science and mathematics.
I also teach computer science students about the wonderful world of databases. But I have found out that a good way of learning about databases is building them, and therefore I also do that. And today I'd like to talk to you about one of these products, and that is DuckDB. Obviously DuckDB is not my own sort of sole creation; there are other people involved,
most notably Mark Raasveldt, who is not here today. So we're going to talk about DuckDB. DuckDB is a database management system, and it's new, it's completely new. And it's focused specifically on being embeddable, which means not embeddable as in hardware,
but embeddable as in embeddable into other software. And it's analytical, which means that it's focused on crunching through large amounts of data, as opposed to dealing with transactions like orders in your online shop. So if you want to do orders in your online shop, go up to the Postgres people next door.
If you want to crunch large amounts of data, you can use DuckDB. Now I have to find out whether my clicker works. It does. It is common to start these kinds of talks with a description of how terrible the state of the world is. This is no exception. The present is very bad.
Data management in data analytics is a huge mess. I don't know if any of you have ever tried to use things like Pandas; it's great, it works with the five examples that they have on the website, but one of the problems there that is really overwhelming is the data storage itself.
People tend to have these text files, where there's a well-known folder structure somewhere which has a bunch of CSV files in it, and then there's maybe some code on top of that that decides which CSV file should be read. Once we have loaded these files, we have these crude query processing engines, for example
the one that is in Pandas or the one that is in the R environment. Once people decide that CSV files are too slow, they start inventing their own crude hand-rolled binary formats that are on disk, maybe, and start processing those. There's been a recent push in that direction.
In general, this is sort of a zoo of one-off solutions, and that makes secondary problems like, for example, changing anything about the data that you have very difficult. This is bad. We don't want this. These things are all solved problems. We have data management systems that have been around for 50 years or so.
What we are trying to do with DuckDB is make them usable also for these data analysis tasks that are so common. So that was the con. The future is bright, obviously, with DuckDB.
Who has used SQLite? That is very many people. In fact, everybody has used SQLite, because it is in every browser, every phone, and every device that you can imagine. What we're trying to do is build something similar to SQLite, but very different in the
intended features, in the sense of what kind of data analysis questions you want to ask: you want to do data analytics, in contrast to SQLite, where you do transactional data management. How do we do this? We have built a very fast so-called vectorized data processing engine.
I will explain to you in a bit what that is. We have stolen a lot of good ideas from SQLite. For example, DuckDB does not require you to run a separate server. This idea that you have to run a daemon that is your database, that you have to set up
and configure and restart and whatever? No, it's kind of database as a library. You run the DuckDB system inside your process. This has a nice side effect that data transfer between whatever you are using to talk to DuckDB and DuckDB itself becomes very fast. For data analysis,
this is really a critical question. We've written a paper; it was quite fun measuring, for example, the client protocol speed of various popular databases. The guys next door from Postgres came off pretty badly. What we also have stolen from SQLite is the idea that you have a single-file storage format. Basically, all your database, no matter how complex it is, no matter how many tables
it has, is in a single file. And we've also stolen the idea that it should be simple to install; more on that in a bit. So this is the bright future. How do we make that work? So DuckDB is a library.
Think of just a package, a library that you embed into your application. We have zero external dependencies. This is really something that took a lot of work, but it is something that we believe is quite necessary for a library to be successful: you don't have
to install 57 other programs before you can use it. In fact, we have a special way to build DuckDB that results in two files, one header and one implementation. At the base layer, DuckDB is a C++ API. We have full SQL support, so I went through this wonderful job of implementing things
like window functions in a database system, which I can tell you are not fun. So you don't have to do it, because you can use DuckDB. We also have built a wrapper for the API that SQLite uses. So in principle, if you have an application that talks to SQLite,
you can do some library preload tricks and it will use DuckDB instead. So this is something that we have done to make it easy to switch. We've also learned from previous projects how important it is to integrate with the tools that people are using. In terms of data analysis, people use R and Python.
So there are packages for R and Python, I'll show an example in a bit, that basically include everything that you need to run DuckDB as well. And just to wrap it up, there's a command-line interface, and for the people that wanna do web stuff, we have a REST server as well.
Let's show some examples. So here is an example for Python, which by the way was also invented at CWI, so we are kind of obliged to integrate with Python. You say pip install duckdb, that's very complicated, and then you have it installed. There's no additional software required, all the batteries included, and then you can
just use this wonderful Python database API, where you connect to a database, in this case a file, and then you can run SQL queries, which is a required skill that you have to have to work with DuckDB.
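A minimal sketch of what that Python usage looks like (the file name, table, and values here are illustrative, not taken from the talk's slides):

    import duckdb

    # Connect to a database stored in a single file; DuckDB creates it if
    # it does not exist. Use ":memory:" for a transient in-memory database.
    con = duckdb.connect("mydata.db")

    # Plain SQL against the embedded engine, no server process involved
    con.execute("CREATE TABLE IF NOT EXISTS items (name VARCHAR, price DOUBLE)")
    con.execute("INSERT INTO items VALUES ('jeans', 20.0), ('hammer', 42.2)")

    # fetchall() returns ordinary Python tuples
    print(con.execute("SELECT name, price FROM items ORDER BY price").fetchall())

    con.close()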
Or maybe not, because in the R world we have a similar integration where you load up the database, you connect to your database file, and the R people have invented this wonderful dplyr system of actually programmatically expressing queries, which is quite nice.
And finally, the C++ API, which I wanted to show you for the people that are more in C land: it is really just that, the actual fully functioning minimum integration of DuckDB into C++, where again you specify which file you want your database to be stored in, and then you can merrily run SQL queries.
So that's the outside view, right? So it's not very exciting, I realize this, I mean not many people get excited about databases, I'm one of the few, but it is a tool that you can use to store your data, and you can actually, and this is the big difference, you can get it out again quickly, and you can run queries on large amounts of data
on your local computer quite quickly. Now how do we do this? Let me talk briefly about some internals. So we have something called vectorized processing. I'm not gonna talk a lot about the other things, but this is the core of the engine that makes it fast. And to understand vectorized processing, you have to understand that
database engines come in different flavors. There is traditionally tuple-at-a-time processing, which is what Postgres, MySQL, SQLite, everybody uses: basically, we look at one row of data at a time in the process of running queries. That's great; however, it's slow. Then we have the Pandas/NumPy/R way of doing things, where we look at one
column at a time, which is faster, but has issues when the data becomes bigger than memory, and then finally we have vectorized processing which is kind of the middle ground where you look at chunks of data at a time. And this is a very nice thing because that means that the data that we look
at in the query fits higher up in the CPU cache hierarchy. So here on the right you see a short overview of the CPU caches, and basically what we're trying to do with DuckDB is keep the data that is being worked on up here in these very fast L2 caches, and actually avoid going into main memory for performance reasons.
And this is very nice because it allows us to process data that is bigger than main memory. This is one of the limitations of things like Pandas: once your data becomes bigger than memory, you're screwed. With a vectorized execution engine, you actually have a reasonable chance of still completing your analysis questions.
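As a toy illustration of those three flavors (illustrative Python, not DuckDB's actual engine; the 2048-value chunk size is just a typical vector size):

    import numpy as np

    data = np.arange(1_000_000, dtype=np.int64)

    # Tuple-at-a-time: one interpreter round-trip per row; correct but slow.
    def sum_tuple_at_a_time(values):
        total = 0
        for v in values:
            total += v
        return total

    # Column-at-a-time: one big operation, but intermediates are whole
    # columns, so the working set can outgrow main memory.
    def sum_column_at_a_time(values):
        return values.sum()

    # Vectorized: operate on small cache-sized chunks ("vectors"), keeping
    # intermediates tiny while still amortizing per-operation overhead.
    def sum_vectorized(values, vector_size=2048):
        total = 0
        for start in range(0, len(values), vector_size):
            total += values[start:start + vector_size].sum()
        return total

    assert sum_vectorized(data) == sum_column_at_a_time(data)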
Yeah, and you don't get wonderful out-of-memory errors. So now I'm gonna actually skip something. So you would ask: okay, so why should I do vectorization? It's great that Hannes is excited about it,
but what kind of difference does it make? And this is like a very crude benchmark: we run a standard benchmark, TPC-H, on different systems. This is based on an old version, we have gotten faster in the meantime, but basically if you look at the bottom there, you can see the time it takes to complete these benchmark queries across
the different systems, and then there's DuckDB up here, which clearly is much faster. So generally you could say that this is 40 times faster than a traditional engine that is working in a tuple-at-a-time fashion. But then you would say: yeah, Hannes, you're an academic, and you have
a nice pet project, but you know, I'm interested in something that I can use, maybe even for serious things. And this is why I briefly wanna talk about the quality assurance that we are sort of doing with DuckDB. So basically we have continuous integration running, where we have millions of SQL queries run on every
single release. We know the correct result for every one of these queries, so whenever we get something wrong it's instantly flagged. We have verified benchmark results for large standard benchmarks that we also check against. And basically we went around and stole everyone's test cases. With SQL engines you can do this, because they all have the same sort of query language, so the only
thing you have to do is write a parser for whatever result format they have. My favorite part was writing a scraper for the SQL Server website, because they have example queries with answers, and from that we generated a bunch of test cases as well. We also do query fuzzing, where we auto-generate queries to try to break our system, which always
works if you run the fuzzer long enough, but you find very important bugs in the meantime, and we also have something that we call continuous benchmarking, where every release is subjected to benchmarking, and we can flag performance regressions quickly. So DuckDB is free and open source under the MIT
license. We are currently in pre-release, which means that you can't yell at us if we change APIs internally, but it is fully functional; you can use this to run queries and to store data, it is all there. We have a website, and there's a GitHub page where you can go file a pull request if you want;
we are very interested in hearing feedback, and if DuckDB doesn't do something that you wanted it to do, then please tell us. If you're even more database-inclined, then you can send us a pull request with new features, bug fixes, whatever. We have a long list of issues in the issue tracker that are tagged with help wanted,
or good first issue, so these are good places to start, and with that I'm happy to take questions, thank you. Can I ask two questions? You have to ask him. Do you do something for internal data compression
as you say, it's used for a big amount of data? Yeah, okay, so the question is do we do something for internal compression? What we are working on is two things. One is the storage on disk is going to be
compressed, so whatever we write to disk through the single-file format is going to be compressed. But we are also, and this is really something we're working on right now, working with compressed intermediates. For example, if you have a vector of a thousand values and they're all the same, then we have compression that will not actually move these thousand values around, but only the fact that it's the same.
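A toy sketch of that constant-vector idea (illustrative only, not DuckDB's implementation):

    from dataclasses import dataclass

    @dataclass
    class ConstantVector:
        """A vector of `length` identical values, stored as one value plus a count."""
        value: float
        length: int

        def add_constant(self, c: float) -> "ConstantVector":
            # Operators can work directly on the compressed form:
            # one addition instead of a thousand.
            return ConstantVector(self.value + c, self.length)

        def materialize(self) -> list:
            # Only expand to individual values when something really needs them.
            return [self.value] * self.length

    vec = ConstantVector(value=42.0, length=1000)
    print(vec.add_constant(1.0))  # ConstantVector(value=43.0, length=1000)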
The second question is: do you support any statistical functions, like computing percentiles and getting histograms back from the database engine? That's a good question. So our philosophy there is that because the data transfer between DuckDB and the host is so fast,
if you want things that we don't support, you're actually not going to die pulling a chunk of data into Pandas, for example, and running it there. There is support for user-defined functions if you want to add anything, and we have a fairly complete aggregation function library, so there are multiple options there. But the general idea is that we don't punish you for pulling a large chunk out of the system. We don't hold the data hostage.
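A sketch of that workflow, assuming the DuckDB Python package's fetchdf() helper (which requires pandas) and the illustrative items table from the earlier sketch:

    import duckdb

    con = duckdb.connect("mydata.db")

    # Let DuckDB do the relational part of the work...
    df = con.execute("SELECT name, price FROM items").fetchdf()

    # ...then compute anything the engine doesn't offer on the client
    # side, e.g. a percentile over the resulting pandas DataFrame.
    print(df["price"].quantile(0.95))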
Hi, I have a question, thanks for the talk. Do we have a connector for SQLAlchemy? For example, in Pandas, you have a connector for SQLite, so you can write a SQL query and then...
Yeah, people have worked on this; I'm not sure what the status on that is. I think eventually, if it's not working already, it should be pretty straightforward, because we support the exact same query language as Postgres, so I suspect it should already work. And then it's just a question of plumbing the connection.
Okay, thank you very much. I'll be outside afterwards if you want to talk to me. Okay, perfect, thank you for your talk.