
chDB: The Blazing Fast SQL Engine for Data Science


Formal Metadata

Title
chDB: The Blazing Fast SQL Engine for Data Science
Title of Series
Number of Parts
131
Author
Contributors
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
chDB, powered by the robust ClickHouse engine, is a high-performance in-process SQL engine designed for the Python data ecosystem. It supports direct interaction with databases like MySQL, PostgreSQL, SQLite, and MongoDB; streaming platforms such as Kafka; and key-value stores like Redis. With extensive support for various data sources and formats, chDB is highly adaptable to different data environments. It seamlessly integrates with structured formats like CSV and JSON as well as complex binary formats such as Parquet, Avro, and Arrow, enabling efficient data manipulation and analysis. The engine is ANSI SQL compliant, ensuring broad compatibility with existing SQL codebases and facilitating easy integration into diverse data pipelines. This comprehensive SQL support includes complex querying capabilities, enhancing the engine's utility in sophisticated data analysis tasks. chDB is particularly designed to accommodate the rapid prototyping of applications, allowing developers to transition smoothly and quickly from development to production environments with minimal adjustments. With its deep integration into Python, chDB interacts directly with popular Python libraries such as Pandas, NumPy, and others, streamlining the data science workflow. This integration enables direct data manipulation within Python scripts and notebooks, drastically reducing the need for data movement and conversion, thus accelerating the data analytics process. For organizations and developers leveraging Python, chDB offers an unmatched combination of performance, flexibility, and ease of use, making it an invaluable tool for building efficient, scalable, and robust data-driven applications.
Transcript: English (auto-generated)
Thank you. It's my first time in Prague, and it's an amazing city. I'm going to introduce a product from ClickHouse, so first I'd like to know: who here knows what ClickHouse is?
That's more than I expected. Okay, so ClickHouse is a super fast SQL engine, and chDB is a blazing fast SQL engine for data science.
A little about me: I have experience with recommendation systems and databases. I'm now a technical director on the ClickHouse core team, and I was formerly a principal engineer at Shopee, mainly building databases for recommendation systems. I'm also really into open source and have contributed to a lot of projects; the ones I'm most honored by are ClickHouse, Golang, and Kubernetes. I'm the creator of chDB, and before that another database, CovenantSQL.
You can find me at auxten.com. So what is chDB? The official description is: an in-process SQL OLAP engine powered by ClickHouse. So what does in-process mean? Let me
explain. I'm sure everyone knows Postgres; chDB is something like Postgres, but with no server. So is it just something that parses a Python dictionary? No, although we do have support for that: you can run SQL queries directly on a Python dictionary. Is it something like SQLite then? Pretty close, but chDB is a columnar database, whereas SQLite, as we know, is a row-based SQL database.
So it's something like a marriage of ClickHouse and Python, and the baby is chDB, only shorter. There's an analogy we use in our company: it's like putting a rocket engine on your bicycle.
Python is the bicycle. You know, in this city people love bicycles and ride them everywhere. But as we know, if you ride a bicycle for a long time you get tired, and sometimes it's not as fast as you wish. So why not put a rocket engine on the bicycle?
So it's serverless: you do not need to run or install a ClickHouse server or anything else to get the power of ClickHouse.
It's ClickHouse inside, so all the ClickHouse functions and formats are supported. And it's Python, so we support the Python DB-API 2.0. We can also run SQL directly on DataFrames, which I will explain later. There are stateful query sessions too: you can save your data, query it, and have it cleaned up on exit (a sketch follows below). The thing I focus on most is minimizing data copies between C++ and the library bindings.
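A minimal sketch of such a stateful session, based on the documented `chdb.session` module (the database and table names here are made up, and the exact API may vary between versions):

```python
from chdb import session

# The session keeps its data in a temporary directory and cleans it up
# automatically when the session object goes away.
sess = session.Session()
sess.query("CREATE DATABASE IF NOT EXISTS demo ENGINE = Atomic")
sess.query("CREATE TABLE demo.t (x Int64) ENGINE = MergeTree ORDER BY x")
sess.query("INSERT INTO demo.t VALUES (1), (2), (3)")
print(sess.query("SELECT sum(x) FROM demo.t", "CSV"))  # expected: 6
```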
We have a lot of language bindings. I forgot to remove Go and Rust from the slide; yes, this is a Python conference. There is also an article about the project background, mainly about the C++ side, which you may not find that interesting,
so I'll just put a link here in case you are. Of course it's Apache 2.0 licensed, so you can do almost whatever you want with it. The simplest architecture looks like this: you import chdb and call chdb.query() with some SQL. We send the SQL to ClickHouse, which does the heavy lifting; when it's done, we take the result and hand it back to you on the Python side. Quite simple.
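In code, that flow is roughly this minimal sketch (assuming `chdb` is installed, for example via `pip install chdb`):

```python
import chdb

# The SQL string is handed to the embedded ClickHouse engine, which does the
# heavy lifting; the result comes straight back into the Python process.
result = chdb.query("SELECT version(), 1 + 1 AS answer", "CSV")
print(result)
```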
Now this is one of my favorite parts: it's an in-process database, but what can you do with it? Basically, we treat almost everything as a table. Parquet files you can read and treat like a table; there is no load or insert step, you just query and pass the path. CSV of course, and TSV, they all work. JSON in all its flavors: JSONEachRow, JSON organized by columns, whatever. And this is quite amazing, I think: you can even query a SQL dump, which is just a text file full of INSERT statements, without actually inserting it anywhere. If your data lives behind HTTP or on S3, or even on HDFS, you can query it directly; you just give chDB the path. There is PyReader, which I will explain later, and it's quite interesting. PyArrow tables can be queried directly, and so can pandas DataFrames and NumPy arrays. You can even connect to MySQL and treat it like a table inside chDB, and of course Postgres; I don't know why Postgres has become more famous and popular than MySQL. And also SQLite, which is my favorite: I do almost everything with SQLite.
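As a rough sketch of querying files and remote data in place (the file names and the URL below are hypothetical placeholders):

```python
import chdb

# A local Parquet file is treated as a table: no loading, no INSERT, just a path.
print(chdb.query("SELECT count() FROM file('data.parquet', 'Parquet')", "CSV"))

# Same idea for CSV / TSV / JSONEachRow and the other supported formats.
print(chdb.query("SELECT * FROM file('events.csv', 'CSV') LIMIT 5", "Pretty"))

# Data sitting behind HTTP or on S3 can be queried directly by URL.
print(chdb.query(
    "SELECT count() FROM url('https://example.com/hits.parquet', 'Parquet')",
    "CSV",
))
```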
We have that in chDB too, plus more than 80 supported formats; you can check the link for the full list. So let's look at some code. Basically you can query Python objects directly. We import pandas as pd and pyarrow as pa as usual, then create a dictionary where key 'a' holds a list of numeric values and key 'b' holds some strings, and query it as a table. As you can see, 'a' and 'b' become the column names; for now you do have to provide column names. Then you can just run SELECT b, sum(a) FROM Python(data), where data is the variable name, and do whatever aggregation you want.
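A short sketch of that example; the column values are made up, and the `Python()` table function looks the variable up by name in the calling scope:

```python
import chdb

# Two named columns: 'a' holds numbers, 'b' holds strings.
data = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": ["tom", "jerry", "bob", "tom", "jerry", "bob"],
}

# 'data' inside the SQL refers to the Python variable above.
print(chdb.query(
    "SELECT b, sum(a) FROM Python(data) GROUP BY b ORDER BY b",
    "Pretty",
))
```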
The same works if you put the data into a PyArrow table or a pandas DataFrame; we can query it directly. One caveat: only numeric and string columns are supported for now, it's still a beta, so if you give me a dictionary-typed column or some other object column it won't work yet. Now, I think this is a quite interesting consequence: since we treat so many things as tables, you can join them all in one SQL query,
something like this. Here one table comes from the URL table engine: a URL pointing to a Parquet file, which gives a big table. You can join it with a local file, say a CSV, on the UID column, and you can also bring a pandas DataFrame into the same query as yet another table. One SQL join and it's all done; a sketch of this kind of mixed join follows below.
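A hedged sketch of that mixed join; the remote URL, the local file name, the column names, and the DataFrame contents are all hypothetical placeholders:

```python
import chdb
import pandas as pd

# A small lookup table living in a pandas DataFrame.
users = pd.DataFrame({"UserID": [1, 2, 3], "Name": ["alice", "bob", "carol"]})

# Join a remote Parquet file, a local CSV, and the DataFrame in one SQL query.
sql = """
SELECT hits.UserID, users.Name, count() AS visits
FROM url('https://example.com/hits.parquet', 'Parquet') AS hits
JOIN file('extra.csv', 'CSV') AS extra ON hits.UserID = extra.UserID
JOIN Python(users) AS users ON hits.UserID = users.UserID
GROUP BY hits.UserID, users.Name
ORDER BY visits DESC
LIMIT 10
"""
print(chdb.query(sql, "Pretty"))
```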
Don't forget the LIMIT, or you will get a disaster. Okay, this next one is my favorite part; I just added it to the slides about five minutes ago, in case you like it.
You can create your own table engine in Python. This is an oversimplified version of the example, but it really is that simple: you create a class that inherits from the PyReader class and implement what you need. You set a cursor to zero and implement the read function, which receives the column names I want and a row count, and returns a block of data in any format, like lists or a DataFrame, and we turn it into a table. Basically you can define any logic in Python code and ClickHouse will treat it like a table. It's super flexible, but not very fast,
because of the thing called the GIL. I think everyone here knows it: the GIL is pretty evil in Python. But being able to create your own table engine makes it worth it anyway.
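A condensed sketch of such a table engine, following the documented `chdb.PyReader` interface (the data and class name are made up, and method signatures may differ slightly between versions):

```python
import chdb

class MyReader(chdb.PyReader):
    def __init__(self, data):
        self.data = data
        self.cursor = 0
        super().__init__(data)

    # ClickHouse asks for up to `count` rows of the requested columns;
    # returning an empty list signals end of data.
    def read(self, col_names, count):
        if self.cursor >= len(self.data["a"]):
            return []
        block = [self.data[col][self.cursor:self.cursor + count] for col in col_names]
        self.cursor += len(block[0])
        return block

reader = MyReader({
    "a": [1, 2, 3, 4, 5, 6],
    "b": ["tom", "jerry", "bob", "tom", "jerry", "bob"],
})

# The reader object is queried like any other table.
print(chdb.query("SELECT b, sum(a) FROM Python(reader) GROUP BY b ORDER BY b", "CSV"))
```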
We have all the bindings, Python and the rest, as you can see. And you can run chDB almost anywhere. There are some stories: I actually ran chDB on a Raspberry Pi, and one user tried to run chDB on his Android phone, which is pretty crazy; I don't know what he's doing with it. You can also put chDB in a Lambda function,
and that's something really interesting. You put your data in object storage and your compute engine in Lambda, so when there is no query you pay only for the object storage; no compute resources are needed. Once you need to query something you just call the Lambda function and everything is done, and it's fast. It also runs in Jupyter notebooks and plain Python, of course.
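Going back to the Lambda idea, here is a minimal sketch of such a handler; the bucket URL and the handler wiring are hypothetical, and credentials are assumed to come from the function's role or environment:

```python
import chdb

def handler(event, context):
    # All the data stays in object storage; compute only runs while the
    # function is invoked, so idle time costs nothing beyond storage.
    sql = (
        "SELECT status, count() AS n "
        "FROM s3('https://my-bucket.s3.amazonaws.com/logs/*.parquet', 'Parquet') "
        "GROUP BY status ORDER BY n DESC"
    )
    return {"statusCode": 200, "body": str(chdb.query(sql, "JSONEachRow"))}
```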
I'm even trying to make it run inside ChatGPT. Now, there are a lot of databases out there, and whenever we see a new one the first question is: is it fast? Everyone claims to be fast, so we have benchmarks. There is a website called ClickBench where you can find every famous (and infamous) database,
with a lot of benchmark results for a lot of databases; you can check it out. I just selected all the SQL-on-Parquet solutions, so there is chDB, clickhouse-local, DuckDB, and DataFusion, and compared the numbers. At first glance we are the fastest, but I should note that DuckDB did not upload results for the c6a.metal configuration, so we do not know whether it would be faster on the large machine. For now, we are the fastest.
There is also a benchmark I did on DataFrames. I took the SQL queries and asked GPT to translate each one into DataFrame operations, pandas style, which gives you super long lambda functions. As you can see (it's a bit small here), we are much faster: on total time we are 4.6 times faster, and of the 43 queries we are faster on 33, while pandas is faster on about 10.
You can also see that we are pretty stable whatever SQL you throw at it, but pandas is not: if you write something really complex, pandas slows down. And if you make the data size larger, the difference gets bigger; I only ran this on a Colab notebook, so it couldn't be too large, because pandas costs a lot of memory. So if you have data in something like CSV or Parquet, you can query it directly; you do not need to
load it into a pandas DataFrame first. Okay, but why? I have been answering this question for about four days now. Why is it fast? Mostly because ClickHouse is fast,
like this: a rocket engine on a bicycle. And why is chDB itself fast? Mostly by making sure Python does not slow the engine down.
The thing I worked on most is keeping the GIL from slowing everything down. Maybe I need to explain: the GIL is the global interpreter lock in Python. Basically, if you run several threads in Python they do not actually run in parallel. It's like a job one man could do, but you hire four or ten men and they have to take turns, which ends up slower. So when we process data we want to hold the GIL as briefly as possible, let the Python interpreter sleep, and do all the rest in C++, in parallel. That is pretty much the whole idea.
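To see the GIL point in plain Python, a tiny illustration (the workload size is arbitrary and timings will vary by machine):

```python
import time
from threading import Thread

def busy(n=5_000_000):
    # CPU-bound pure-Python work: only the thread holding the GIL makes progress.
    s = 0
    for i in range(n):
        s += i
    return s

t0 = time.time()
busy(); busy()
print("sequential :", round(time.time() - t0, 2), "s")

t0 = time.time()
threads = [Thread(target=busy) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("two threads:", round(time.time() - t0, 2), "s")  # roughly the same, not 2x faster
```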
For example, strings: you may not know it, but Python has several string encodings inside. A string might be stored as UTF-32, UTF-16, or UTF-8. When we query a string column we have to decode all of that, but we decode it without calling any CPython API, because every time you call the CPython API you have to hold the GIL and everything becomes single-threaded, or even worse. So we do it all in C++. This is the same thing NumPy and a lot of other libraries do. Okay, the other question: why is ClickHouse fast? That is very hard to explain because there is so much detail. I asked our CTO, the creator of ClickHouse, and he just sent me a video, so I watched it for an hour. Here is my summary.
Basically, ClickHouse is column-oriented storage, so aggregations are fast. Like the animation here shows, in a row-oriented database you have to scan whole rows: even if you do not need a column, you still read it. In a column-oriented database you only touch the columns you need. There is also data compression: ClickHouse is very good at it, and we have even optimized the compression libraries themselves; for LZ4 we created our own variant. There are articles about that which you can search for and read, although they are pretty low-level. And of course there is vectorized query execution, but we take it much further than other databases do.
Some background: what is vectorized query execution? It uses SIMD instructions such as AVX, AVX2, and AVX-512. Modern CPUs, whether Intel or AMD, have plenty of these vector instructions, but even if you compile your C or C++ code with the latest LLVM compiler, most of the vectorized execution will not actually be applied; that is still true with the newest releases. So we combine a number of techniques, such as JIT compilation and vectorized execution, with something like dynamic dispatch.
In effect we compile SQL down towards native CPU instructions: every SQL query sent to the ClickHouse engine is run like a small program and heavily optimized. We do a runtime check of the CPU's capabilities, find out which instruction sets are supported, and dispatch each function to a specialized implementation for AVX, AVX2, and so on. So it's super fast; ClickHouse effectively behaves a bit like a compiler at runtime, adapting to whatever fits your CPU best. And that is only part of what we do; there is a lot more I can't explain here, because if I did, this would turn into a C++ 101 classroom and you would stay for ten years.
So basically there is no single silver bullet that makes it this fast; we have been working on it for 15 years. Benchmarks run in the CI pipeline, where everything is benchmarked, and if anything gets unexpectedly slower we fix it. It's about one million lines of C++ code. There is also a URL where you can check the details if you are interested, which I guess you are not.
So here is a recap of what you can do with chDB. First, pure performance: if you have big data and want to build an app on it, you can use chDB without having to maintain a server or a database service. And since there is no client and server, no code sending requests to a server and shipping responses back, there is no serialization and deserialization, which makes it very fast.
I think some crazy people estimated that maybe 10% of the world's electricity goes into serialization and deserialization between different languages and systems; it's pretty crazy. Next, seamless integration: it's ClickHouse, which we have been developing for 15 years, with a constant stream of requests to add this function and that function, so we have been adding all kinds of functions the whole time. Basically you can find any function you want.
So, be kind to the Earth: with chDB I think you can reduce some resource consumption. Like I said before, you can run it in a Lambda function that only runs while there is a query and stops when the query is done. ClickHouse is a real-time analytics database, so it's easy for chDB to be fast as well. We can
query all kinds of data, local or remote, and you can even connect to a ClickHouse server, so you can push the really heavy computing to ClickHouse Cloud, let it do the work, and just fetch the result and join it with other data.
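A hedged sketch of mixing a remote ClickHouse result with local data; the hostname, credentials, and table and column names below are placeholders:

```python
import chdb
import pandas as pd

# A small local DataFrame to enrich the remote result with.
labels = pd.DataFrame({"country": ["CZ", "DE"], "label": ["Czechia", "Germany"]})

# The heavy aggregation over 'daily_visits' happens on the remote server;
# only the small result is joined locally with the DataFrame.
sql = """
SELECT l.label, sum(r.visits) AS visits
FROM remoteSecure('clickhouse.example.com:9440', 'default', 'daily_visits',
                  'reader', 'secret') AS r
JOIN Python(labels) AS l ON r.country = l.country
GROUP BY l.label
ORDER BY visits DESC
"""
print(chdb.query(sql, "Pretty"))
```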
Super fast. You can also prototype with chDB: say you want to build a machine learning model, so you have to prepare some data, and you do it on your laptop while the data is small. Normally, when the data gets large you have to rewrite your code and SQL to put it online on some big cluster. But chDB shares the same SQL dialect with ClickHouse Cloud, so you do not need to modify your app to adapt it to the production environment. It will be quick.
You can just try it: there is a link and a QR code pointing to our documentation, and all the links are here. That's pretty much all from me. Thank you so much for your talk. We still have five minutes for questions; does anyone have any? Thank you, I would like to ask: what is the difference to DuckDB? Yes, I have been answering this question for several days.
What's the difference? Basically, DuckDB really is a competitor for chDB. We have done a lot of benchmarking of chDB against DuckDB, and frankly speaking the performance is pretty close; in most circumstances you cannot feel the difference. But we support a lot more data types and data sources, and I think that is the main advantage. We also arrived at the same place from different directions: ClickHouse came from large clusters and big databases, production-tested for over a decade, and moved from big to small to become embedded, while DuckDB is going from small to big, towards something like a DuckDB cloud. It's a different path to the embedded use case. I think that's pretty much the difference.
Thank you. And maybe one more question, about memory: when I run out of memory, I mean RAM, can chDB still work, for example by spilling to disk?
I have been answering this question a lot as well. Basically, a modern database designed and tested for production always builds a query pipeline and runs inside it. That means if the data is bigger than your memory, we do not need to load the whole dataset into memory and then process it; that is the pandas way, and we do not do that. We fetch some blocks, process them, merge the blocks, and emit the result, really working as a pipeline. If you have plenty of memory this will be quick, but if you do not have much memory it is also okay: we only keep the most recent blocks, it is still a pipeline, and the data does not need to fit in memory. Thank you. Thanks a lot for the talk. I wanted to ask: when you query Postgres or MySQL,
is it just some kind of translation of the SQL query, or are there further optimizations in the processing? Frankly speaking, we treat Postgres as a table: we send a SQL query to Postgres that selects just the necessary columns into the ClickHouse engine and process the data there. We have not done much push-down filtering yet, so we pretty much use Postgres as a data source and the query pulls the data block by block, as a pipeline: every block is processed as soon as it is selected from Postgres, and we do not wait for all the data before processing it. Well, I did not benchmark it; it might even be faster than Postgres itself, but who knows.
If I run it in a notebook, is there something like the %sql SQL magic?
We have optimized a lot for running chDB in notebooks; this screenshot is from a notebook. That's clear, but I meant the %sql magic in a cell. Yes, we support it: I remember we have an integration with one of the notebook SQL libraries, so the magic syntax in notebooks works through that.
All right. Thank you for your questions and thank you a lot for your talk