
ClickHouse: what is behind the fastest columnar database


Formal Metadata

Title
ClickHouse: what is behind the fastest columnar database
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
The open source columnar database ClickHouse is in many ways exceptional - it is exceptionally fast, exceptionally efficient, but also, at times, exceptionally confusing. Its approach to handling data goes against many principles and concepts that we use in other databases. To give some examples: its primary index doesn't index each row and doesn't guarantee uniqueness; a secondary index is used to skip data and doesn't point to specific rows; JOINs are a complex topic and transactions are only partially supported, not to mention that its SQL dialect holds a couple of surprises up its sleeve. But, all that said, if used correctly, ClickHouse is a superb solution for online analytical processing (OLAP). The goal of this talk is to help you get the most out of ClickHouse and avoid the pitfalls. We'll talk about OLAP and columnar databases. We'll touch on the topics of indexing, searching and disk storage. We'll look at the reasons behind the most puzzling concepts of ClickHouse, so that by the end of the talk you find them not only logical, but maybe even fascinating. If your challenge is analysing terabytes of data - this talk is for you. If you're a data scientist looking for tools to work with big data - this talk is for you. And, of course, if you are just curious about what makes ClickHouse crazy fast - this talk is for you as well.
Transcript: English (auto-generated)
So hello everyone. My name is Olena. I work at Aiven, where we support and contribute a lot to open source technologies. Today I want to talk about a technology which is so cool and in so many ways different from what we are used to. You can guess from the starting slide and the name of the talk that I'm going to talk about ClickHouse, a data warehousing solution which in my opinion is one of the most promising and exciting open source database management systems, and I would like to share some of the reasons for that with you in this session. Could you raise your hand if you have heard about ClickHouse before? Okay, not that many hands, but still. And could you raise your hand if you have actually had an opportunity to work with ClickHouse? We do have several hands, which is amazing; it's way more than I expected.
So today our journey will look like this. We'll start by talking briefly about the challenge of storing enormous magnitudes of data long term, and in particular how analytical processing of data differs from using data for transactional purposes. Next we'll look at how ClickHouse fits into that scenario, and here I will stop a bit longer on some of ClickHouse's architectural characteristics and how ClickHouse differs from other solutions you might be familiar with. After that we will move to the demo and see a couple of scenarios of ClickHouse in action. Then I'll also share with you the things you need to be aware of when using ClickHouse, so that you can avoid pitfalls and won't end up in a data swamp. And finally I will leave you with some materials you can check out to start using ClickHouse for your own projects.

Can I ask if my microphone is okay? I can hear some noise. Okay, then everything's fine.

So, data is everywhere, and our lives wouldn't look the same if not for the amount of data involved at every step of our daily activities: the online services we use every day, the measurements from the IoT devices that surround us, weather forecast data. For the projects we build, the more data we can handle, the more value it can bring, and the better predictions, analysis and understanding we can retrieve, and therefore the better business decisions we can make to move us forward. I guess we can all agree on that.

However, in the past, when dealing with data we often processed what we needed, selected bits and pieces, kept them aside in some table in a database, and sadly had to throw away all the rest, because keeping all the raw data felt unfeasible: not only because of the storage volumes, but also because of the limited processing power needed to navigate all that information later in time.
This is especially relevant for data used for analytics. Take, for example, event-driven architectures, where we look at data through the prism of continuously arriving events.

You need to help me. Maybe my earpiece is not positioned correctly. Yes, okay, it's still there. Can you actually help me with the ear part, because maybe I didn't put it on correctly? Oh, amazing, now I don't hear myself breathing; that was terrible. So, where was I?

Okay, so, for instance, in event-driven architectures we look at data through the prism of continuously arriving events, and these streams of data help us process real-time information and make fast decisions. But what do we do with this data in a month, in a year, in a decade? How can we keep it for later analytics? There is a variety of different storage we can consider, but when it comes to dealing with big volumes of data which later need to be scanned in huge chunks, aggregated and used for complex calculations, and all of that needs to be done fast, we need to be very careful to select proper storage and tools that can handle those types of requests. This is not the case where you want to rely on a Swiss-army-knife database such as Postgres, which we admittedly really love and respect; in this situation, to solve the problem well, we need specialized tools.
To understand why we need a separate solution, let's briefly talk about two different types of systems: online transactional processing (OLTP) and online analytical processing (OLAP). In particular, I would like you to pay attention to the granularity with which these systems process data and to the types of operations that are relevant in each of them.

Let's start with the OLTP scenario and imagine that you are running an online shop. Each individual user in your shop has some characteristics, some data associated with that user, and that data lives in a table, or tables, in a database: for example their address, delivery preferences, payment details and so on. When a user goes and updates their address, you want to find, read and change that information quickly, so you send a request targeting specific rows in a table, and you find those rows most probably by customer ID; technically speaking, you can ignore all the rows related to other customers. It's a very narrow and precise operation, and because the change probably spans multiple tables, you rely on transactions to keep the data consistent.

Now let's compare that to a different type of scenario, where we keep data for future analytics. We observe user actions, their interest in different products, their visits to different pages, everything that will help us understand the behaviour of our users. We no longer think in terms of individual rows; rather, we focus on computations and aggregations, and the focus shifts from updating single rows to reading and processing millions or even billions of rows at a time. What is important, only a fraction of the fields is usually necessary for those requests: not the complete rows, but only the fraction we need to aggregate. And unlike in the previous OLTP example, in analytics we usually don't update past data. These are two very different scenarios.
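To make the contrast concrete, here is a hedged sketch (the tables and column names are invented for illustration, not from the talk): the OLTP request touches one row by key, while the OLAP request aggregates a couple of columns across millions of rows.

    -- OLTP: a narrow, precise change to a single customer's row (in a transactional database such as Postgres)
    UPDATE customers
    SET address = 'New Street 1'
    WHERE customer_id = 42;

    -- OLAP: scans millions of rows, but reads only the couple of columns it needs
    SELECT product_id, count() AS visits
    FROM user_events
    GROUP BY product_id
    ORDER BY visits DESC
    LIMIT 10;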
Of course, when we are building systems for OLTP and for OLAP, we can use those characteristics and target the performance of those systems separately. In fact, building a highly performant solution for online analytical processing was the goal behind ClickHouse, and as a result it can now process billions of rows and tens of gigabytes per second, which not only beats OLTP solutions by far but also performs really well compared to other existing OLAP systems. And with all of that, ClickHouse is an open source solution. To achieve this very impressive performance, ClickHouse had to think outside of the box and literally turn around some of the characteristics and concepts we are familiar with. If you look at the most prominent characteristics of ClickHouse, which you can find in the ClickHouse documentation, you can notice that many of them focus on efficiency and performance. So, to understand a bit more how ClickHouse works, let's look at several of those architectural characteristics.
Let's start with how data is structured and stored. Traditional transactional databases such as Postgres or MySQL store data in a row-oriented approach: they store data on disk row by row. This means that even if you want to read a specific cell, you still have to scan the complete row when reading the data. As a consequence of the row-oriented approach, you also have mixed types of data in the files on disk. Most of the time that's actually pretty fine for online transactional processing, because usually we benefit from having all the information together. However, if we are talking about analysing millions of records at a time, where we rely on grouping and aggregation mechanisms, this row-oriented approach very quickly becomes inefficient and wasteful, because we have to read significantly more data than we need and go into far more granularity than is required.

That's why it's not surprising that ClickHouse uses a column-oriented approach. Columnar databases store data by columns: every column is stored in a separate file. This gives us an easy way to analyse only some columns while omitting the rest. ClickHouse is also a truly columnar database, because it not only stores the data of the same column next to each other on disk, it also keeps those values clean, with no extra data attached to them. This is quite important, because it means that each file on disk holds values of the same type, whether the column holds integers, timestamps, or maybe short strings that are almost like a huge dictionary. And that lets us apply the next ClickHouse feature very effectively: data compression. For example, if you have integer values which increase on a regular basis, then, knowing that, you can compress them really well and not only save disk space but also process the data faster.
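As a hedged illustration of per-column compression (the table and column names are invented), codecs can be declared per column in the table definition; Delta combined with a general-purpose codec suits monotonically increasing values such as timestamps:

    CREATE TABLE sensor_readings
    (
        ts        DateTime CODEC(Delta, ZSTD),   -- increasing values compress very well with Delta
        sensor_id UInt32   CODEC(ZSTD),
        value     Float64  CODEC(Gorilla, ZSTD)  -- Gorilla targets slowly changing floating-point series
    )
    ENGINE = MergeTree
    ORDER BY (sensor_id, ts);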
Once we have the data stored, the next thing is to find what is relevant for our request and what is not, and ClickHouse relies on indexing to find relevant information. You might say that indexing is nothing new, all databases do it, but in ClickHouse those indexes look quite different from what we are used to. For example, let's look at the primary index, which we also call a sparse index, because it doesn't index every row; instead it indexes every 8192nd row by default. You can change that if you want, but it's quite a big number (it makes a bit more sense if you look at it in binary). We use a sparse index because in analytics we often have to deal with millions of rows when responding to a request, and we just can't afford to think in terms of individual rows; we have to think in terms of big blocks of data, and when we search or navigate the data that way, we move much faster. But this also means that if you are looking for one particular item, say a specific event ID, you have a bit of an issue: you will be able to quickly navigate to the block containing that item, but then, to find the exact row, you will have to iterate over almost 10,000 items looking for it. We'll talk about pitfalls later, but understanding the architectural characteristics gives you the knowledge to see in which scenarios ClickHouse will shine, and in which ones maybe not so much.
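A hedged sketch of how this looks in a table definition (the events table is hypothetical): the sparse primary index is built on the sorting key, and index_granularity is the setting behind the "every 8192nd row" behaviour; it normally doesn't need to be changed.

    CREATE TABLE events
    (
        event_time DateTime,
        user_id    UInt64,
        event_type LowCardinality(String)
    )
    ENGINE = MergeTree
    ORDER BY (event_time, user_id)        -- sorting key; the sparse primary index is built on it
    SETTINGS index_granularity = 8192;    -- one index mark per 8192 rows (the default)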
Okay, so the primary key is probably the most important thing for the speed of a query. However, there will always be some queries that can't fully benefit from the primary key, because our data usually has different dimensions. For example, for our online shop we have time-series data, so the time of the event is technically a really good primary key; however, you might want to approach the data from a different angle and think in terms of individual products or individual websites. For this, in traditional transactional databases, you would usually attach one or more secondary indexes. But this classical approach won't really work in ClickHouse, because there are no individual rows on disk to add to the index. So ClickHouse has to be creative and find a different way of doing secondary indexes: it defines which data can safely be skipped when processing the request, so that we don't have to scan the columns completely. These secondary indexes are really good for popular queries where you can predict which data is retrieved frequently. For example, if you have an observability platform and you know that your users are often interested in specific types of errors, say 502, you can define a skip index that checks for those particular errors in each block of data: does this block contain information about 502, yes or no, and so on. You put that information aside, and then when a request comes in, you can quickly tell which blocks should be processed and which shouldn't. So secondary indexes are a pretty smart and very powerful way of making requests faster, even though they can be somewhat tricky to test properly.
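A hedged sketch of a data-skipping index for the 502 example (table and column names are invented): a set index records which values occur in each group of blocks, so blocks that contain no matching value can be skipped.

    ALTER TABLE http_logs
        ADD INDEX status_idx status_code TYPE set(100) GRANULARITY 4;

    -- build the index for parts that already exist on disk
    ALTER TABLE http_logs MATERIALIZE INDEX status_idx;

    -- a query filtering on status_code can now skip blocks that contain no 502s at all
    SELECT count() FROM http_logs WHERE status_code = 502;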
When it comes to computation and data processing, ClickHouse relies on vectorized execution to be fast. This means that during data processing the columns are broken into pieces and different cores take responsibility for processing the data in parallel, and this is how queries are sped up. It would be possible to talk for hours about all the logic and algorithmic work that makes ClickHouse fast, but I think we want to move on, and I want to show you ClickHouse in action. In the additional materials later I will leave links to the benchmarks, where you can see how ClickHouse performs compared to other systems. I know benchmarking is a very complex science and we should be sceptical, but the benchmarks are also open source, so you can verify how they are done, and I think they show the amazing power of ClickHouse when used for the right scenarios.
But now let's move on to the demo. I really hope you're not too hungry (lunch is in about 20 minutes), because I actually want to show you how we can use ClickHouse to look at the menus of various restaurants going back to the 18th century. This dataset, which I really like, is collected and provided by the New York Public Library, and the ClickHouse docs offer a very convenient set of instructions on how to use it for experiments. The links I'm showing, by the way, I'll share with you again in the additional materials, so you don't have to write them down. The original dataset consists of four CSV files: we have data for menus, individual menu items, individual menu pages, and the dishes that appear on those menus.
In ClickHouse we use SQL to define the corresponding tables in the database. Here we also specify which table engine we are using (not to be confused with the database engine): the table engine defines the behaviour and the features of the tables we have in ClickHouse. There is a variety of targeted engines you can use depending on your scenario, the size of the data, and the source of the data. I'm using MergeTree; actually I'm using ReplicatedMergeTree together with Aiven for ClickHouse, so that it supports data replication. There are other table engines as well, for example integration engines if you want to connect to other data sources, special engines targeting particular types of data to be faster, or the Log engine for log data. But I still find that merge trees are just like magic: the MergeTree engine allows us to bring data in very fast, but also to read the data very fast.
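A hedged sketch of such a table definition, loosely modelled on the dish table from the menus dataset (columns simplified, not the exact schema from the ClickHouse docs):

    CREATE TABLE dish
    (
        id             UInt32,
        name           String,
        times_appeared UInt32,
        first_appeared UInt16,
        last_appeared  UInt16
    )
    ENGINE = MergeTree        -- or ReplicatedMergeTree for a replicated setup
    ORDER BY id;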
To understand how exactly this happens, let's look at how data is inserted into ClickHouse. By the way, when you insert data, there is a common recommendation with ClickHouse: try not to insert individual lines. Don't go with granular inserts; go with bigger blocks, like 10k items or so, and you'll understand why in a second. With every insert, ClickHouse just takes the data and puts it in as a part. It doesn't think much; it's just very fast to store those inserts, the first insert as one part, the second as another part, so however many inserts you have, they will result in that many parts in ClickHouse. This allows us to get data into ClickHouse quickly, but of course it would be quite slow once you start reading the data.
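A hedged illustration of the batching advice above (hypothetical table; the 10k figure is a rule of thumb from the talk, not a hard limit): send fewer, larger INSERTs, since each INSERT creates a new part.

    -- avoid: many tiny inserts, each creating its own part
    INSERT INTO events VALUES (now(), 42, 'click');

    -- prefer: one insert with thousands of rows, for example loaded from a file in a single statement
    INSERT INTO events
    SELECT *
    FROM file('events_batch.csv', 'CSV', 'event_time DateTime, user_id UInt64, event_type String');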
So behind the scenes, ClickHouse merges smaller parts into a bigger one. As you add more and more data, more and more parts arrive; they are also merged behind the scenes into bigger parts, and those bigger ones are eventually merged into even bigger ones. You can see where the name MergeTree comes from: it looks like a tree, well, almost, you have to rotate it. Eventually we aim to have bigger and bigger parts. Logically speaking, if you are constantly adding data to ClickHouse, the process of making bigger parts out of smaller ones will never be finished. But worry not: ClickHouse will take care of the leftover merging in the background, or you can tell ClickHouse at query time to finish whatever was not merged yet because of the continuously added data.
What is interesting is that there are different members of the MergeTree family, and they can do some pretty cool things. For example, updating information might technically not be the best use case for ClickHouse, but if you use ReplacingMergeTree, then during the merge, when you have two parts with information about the same product, with the same product ID, the older data can be replaced with the newer one. This ReplacingMergeTree won't remove the old data right away, as far as I know, but you can also clean it up later separately. SummingMergeTree will sum the new data with the old; AggregatingMergeTree will do some aggregations on top of the data when merging the parts. So there's a lot of magic, and it's a pretty smart way to do it during the merging of the parts. And this is, more or less, how MergeTree works behind the scenes.
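A hedged sketch of the ReplacingMergeTree idea (hypothetical product table): rows with the same sorting key are deduplicated during merges, keeping the latest version.

    CREATE TABLE products
    (
        product_id UInt64,
        name       String,
        price      Decimal(10, 2),
        updated_at DateTime
    )
    ENGINE = ReplacingMergeTree(updated_at)   -- the version column decides which duplicate survives a merge
    ORDER BY product_id;

    -- FINAL collapses duplicates that have not been merged away yet, at query time
    SELECT * FROM products FINAL WHERE product_id = 42;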
But how exactly do we ingest the data into ClickHouse and bring it in from external sources? It's actually quite a common scenario for ClickHouse, and in my opinion ClickHouse does a really amazing job of being super flexible about whatever format of data you have, whether it is CSV, as we have here, or TSV, or JSON. There is a wide set of instructions you can give ClickHouse about how exactly to treat your data as it's being brought in, so whatever format you have, ClickHouse is happy to accept it.
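A hedged example of ingesting one of the CSV files (file name and table are placeholders; the exact commands are in the ClickHouse docs for this dataset, and the SQL variant assumes a recent version with schema inference for the file() function):

    -- from the shell, streaming a CSV file through clickhouse-client
    -- cat dish.csv | clickhouse-client --query "INSERT INTO dish FORMAT CSVWithNames"

    -- or from SQL, reading a file the server can access
    INSERT INTO dish
    SELECT * FROM file('dish.csv', 'CSVWithNames');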
Once we've ingested the data, we have four tables in the system, and the data is somewhat normalized. Having normalized data is quite important for OLTP systems, to avoid data redundancy, preserve consistency and use less memory. The thing is, though, that normalized data is exactly the opposite of the ideal state of data for ClickHouse. This is because ClickHouse works with data differently: it is not prone to the limitations of OLTP systems, and it is designed to work effectively with denormalized data. In ClickHouse, the more denormalized, the more raw your data is, the faster the queries will be. And there are many reasons why you shouldn't really worry about denormalizing your data. First of all, in ClickHouse there is no extra cost for having many columns; as you have seen, every column is stored in a separate file, so we can easily add or remove columns at almost no cost, and columns which are not used in a query don't affect its performance. Also, denormalized data usually repeats a lot, so it compresses really well, especially if you have properly defined the data types and the compression codecs, and if you sort the data wisely, which is also very important for compression. And since we don't usually change the old data, we don't have an issue with breaking consistency in analytics.
With all of that, we can denormalize our data about the restaurant menus, and we can use a couple of joins to do so. Technically speaking, there are also ways to avoid joins, for example external dictionaries, but joins are fully supported in ClickHouse. There are different types of joins: some general ones, but also some specific ones that target specific scenarios, so you can get the maximum performance if you do have to do a join.
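A hedged sketch of the denormalization step (simplified, invented column names, not the exact query from the docs): join the four tables once into a flat table that later queries can scan without any joins.

    CREATE TABLE menu_item_denorm
    ENGINE = MergeTree
    ORDER BY (menu_id, item_id) AS
    SELECT
        mi.id    AS item_id,
        mi.price AS price,
        d.name   AS dish_name,
        m.id     AS menu_id,
        m.date   AS menu_date
    FROM menu_item AS mi
    LEFT JOIN dish      AS d  ON mi.dish_id = d.id
    LEFT JOIN menu_page AS mp ON mi.menu_page_id = mp.id
    LEFT JOIN menu      AS m  ON mp.menu_id = m.id;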
So, once we have our data in one table, we can start sending requests, and we can do what I call a potato experiment, to show you an interesting tendency I noticed when I was playing with the dataset and deciding what exactly to show. Let's get the list of dishes, first from the year 1850, where the title contains potatoes; we'll sort it by popularity and limit it to 10 so that it fits nicely on my screen. You get a list of dishes which were apparently very popular two centuries ago, a paradise for those of you who love potatoes. But the interesting part comes when we run the same query for the year 2000: even if you don't really read the titles (I know the text is pretty tiny), you can see the change in the length of the titles from 1850 to later years. So I was wondering: is this a general tendency or some kind of potato effect? With ClickHouse we can check this tendency properly, letting ClickHouse look through all the menu items we have, across all the years, to show us how the decades correlate with the average length of the titles. If we do that, using some useful ClickHouse functions, you can indeed see that over time the titles got longer and longer.
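Hedged sketches of the two queries (column names guessed from the dataset description, not the exact queries shown on the slides):

    -- top 10 dishes containing 'potato' on menus from a given year
    SELECT dish_name, count() AS popularity
    FROM menu_item_denorm
    WHERE positionCaseInsensitive(dish_name, 'potato') > 0
      AND toYear(menu_date) = 1850
    GROUP BY dish_name
    ORDER BY popularity DESC
    LIMIT 10;

    -- average dish title length per decade, across all menu items
    SELECT
        intDiv(toYear(menu_date), 10) * 10 AS decade,
        round(avg(length(dish_name)), 1) AS avg_title_length
    FROM menu_item_denorm
    GROUP BY decade
    ORDER BY decade;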
Naturally, this is not scientific evidence; there are many reasons why it turned out like that, but it's still an interesting fact. With this dataset we can also observe the change in the number of sweet dishes over time, checking the proportion of dishes which include "sweet" or "sugar" in the title, measured across menus and aggregated per decade. Honestly, I really expected to see clear evidence; I think sugar is evil, so I was sure I would see a tendency. Not so much: it's super random. And if you are really curious what was so sweet in the 1860s, it was mainly sweet potato. For every request we also get information on how many rows were processed and how long it took ClickHouse to process that request. Admittedly I really like this dataset and it's super fun to play with, but it's really on the small side for ClickHouse's ambitions; I will leave links to other, bigger datasets if you want to run some benchmarks.
Cool. But what about ClickHouse and real-time data? ClickHouse actually works quite well in collaboration with event-driven systems, and in fact it is quite often used together with Apache Kafka. Those two systems, Apache Kafka and ClickHouse, even though they are very different, do have certain things in common: both of them work best with immutable data, and both of them scale really well. So if you connect ClickHouse and Apache Kafka, you can bring the data from Kafka topics into a ClickHouse table for long-term storage and analytics, and in ClickHouse there is already a variety of integration mechanisms you can use to connect the two.

The first one I want to talk about is the Kafka engine. It comes with ClickHouse, and it will pull the data and also batch it, so that it's most efficient for ClickHouse to consume. You will probably also need some kind of materialized view, which plays the role of a trigger: it notices new data arriving via the Kafka engine, takes that data, maybe processes it if you need to make some changes, and puts it into a destination table. That's not the only way of connecting Apache Kafka and ClickHouse; I think ClickHouse is like a transformer, it can do so many things. There is now an alternative to the Kafka engine called ClickHouse Kafka Connect, which does the same job; it's a fresher approach to connecting those systems, so you can use that as well, or JDBC, you can use that too.
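A hedged sketch of the Kafka engine plus materialized view pattern (topic, broker and column names are placeholders):

    -- 1. A Kafka engine table that consumes the topic (no data is stored here)
    CREATE TABLE events_queue
    (
        event_time DateTime,
        user_id    UInt64,
        event_type String
    )
    ENGINE = Kafka
    SETTINGS kafka_broker_list = 'broker:9092',
             kafka_topic_list = 'events',
             kafka_group_name = 'clickhouse-consumer',
             kafka_format = 'JSONEachRow';

    -- 2. A MergeTree table for long-term storage
    CREATE TABLE events_store
    (
        event_time DateTime,
        user_id    UInt64,
        event_type String
    )
    ENGINE = MergeTree
    ORDER BY (event_time, user_id);

    -- 3. A materialized view acting as the trigger that moves data from the queue into storage
    CREATE MATERIALIZED VIEW events_mv TO events_store AS
    SELECT event_time, user_id, event_type FROM events_queue;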
Sometimes you don't really want to bring the data into ClickHouse at all: the data will stay in Postgres, and you just want to send a request with ClickHouse and get information from your Postgres database. You can do that as well, and not only for Postgres but for a bunch of different databases. And this is not the whole list; I selected what is officially supported, but the full list of other integrations is really long, so whatever systems you have, they can be integrated with ClickHouse. Another thing I really like is external dictionaries, because they allow us to use ClickHouse, an OLAP system, together with OLTP systems. An external dictionary can be an external file, a MySQL database, Postgres, or some other source of data. If you define an external dictionary, ClickHouse will take the data and put it into memory, and then use it for any requests you need to run; it will also go back periodically and check for the latest updates in the data. So you can kind of get the best of both worlds, OLTP and OLAP.
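A hedged sketch of an external dictionary backed by PostgreSQL (connection details and column names are placeholders):

    CREATE DICTIONARY customer_dict
    (
        customer_id UInt64,
        name        String,
        country     String
    )
    PRIMARY KEY customer_id
    SOURCE(POSTGRESQL(host 'pg-host' port 5432 user 'reader' password 'secret' db 'shop' table 'customers'))
    LAYOUT(HASHED())
    LIFETIME(MIN 300 MAX 600);   -- refresh from the source roughly every 5-10 minutes

    -- then use it in queries without a join
    SELECT dictGet('customer_dict', 'country', toUInt64(42));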
Now that we've had a glimpse into ClickHouse, what else do you need to pay attention to? There are several things to be aware of when using ClickHouse in order to get the most performance out of it. ClickHouse is designed to be fast, super fast, but the many optimizations that made ClickHouse optimal for OLAP systems, for analytics, may have made it suboptimal for other use cases. This doesn't mean you can't use it for other use cases, but normally, for example, I wouldn't recommend using ClickHouse for OLTP solutions. ClickHouse ideally expects data to be immutable. That said, there are now features which allow you to remove old data, and lightweight deletes make that actually pretty nice: if you want to remove some data, ClickHouse will mark it as removed and later vacuum it from the table, all behind the scenes, removing the data when the parts are merged. So there are possibilities to do deletes and mutations; technically you can do everything you want with ClickHouse, but I still have a feeling that if you use it for OLTP, the performance will not be as impressive as when you use ClickHouse for analytics. And if you remember the sparse indexing: finding a specific item by ID won't be as fast as if every row were indexed.
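A hedged sketch of a lightweight delete (hypothetical table): the statement marks rows as deleted immediately, and the data is physically removed later during merges.

    -- lightweight delete: rows are masked right away, cleaned up during later merges
    DELETE FROM user_events WHERE user_id = 42;

    -- the older, heavier alternative is a mutation
    ALTER TABLE user_events DELETE WHERE user_id = 42;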
Also, ClickHouse is not a key-value database management system; again, you can try using it that way, and it might actually be quite effective, but it's not the core design. It's not designed to be file storage, and it's not a document-oriented database either: it uses a predefined schema, and you need to define the schema at table creation. There are actually ways to use a dynamically defined schema, but the better the schema, the faster the requests will be. So the structure of your data really matters for ClickHouse.
How you arrange the data, how you compress it, all of that matters if you really want a high-performance data warehouse: you need to pay attention to the structure of your data. When used properly, ClickHouse is a great solution; it has shown itself to be super fast, and in my opinion it has all the potential to become your favourite database. So I want to leave you with links to the resources I used in this presentation, plus some extra ones, so you can start your own journey with ClickHouse. Also check out aiven.io, the website of the company where I work: we are the trusted open source data platform for everyone, and in fact we now support 11 different open source solutions as part of the Aiven platform, so you have not only ClickHouse, Kafka, Postgres and the rest, but also integrations between them, so you can create end-to-end products using your favourite open source data tools. And with this, thank you so much for listening, and I'm all ears for your questions.
Thank you very much, Olena. Do we have any questions? Yes. So, you said that adding many columns doesn't come with an extra cost, but if you select star, or too many columns, doesn't that mean you need to go to all the different files to get the values for all those columns?

I think that's probably one of the first ten rules in ClickHouse: you don't do SELECT *. Also, I said you can have as many columns as you need; technically that's not entirely correct, because there is a limit, a number I don't remember right now, it's very large, maybe in the millions, where you don't want to go higher, because you are also dealing with the file system. But that number of columns can be really huge, so usually you don't reach the point where it actually starts to be inefficient. But yes, don't use SELECT *, because for aggregation you usually know exactly which data you want to aggregate; I don't usually see aggregations over more than about ten columns, though maybe it depends on the scenario, but usually you aggregate over a few columns at most. Okay, thank you.
I really enjoyed the talk. Some people I've worked with use ClickHouse, and one of the problems they've run into is using ZooKeeper, or ClickHouse Keeper I think it's called, as a backing component of this. ClickHouse itself seems really powerful, but that always seems to trip them up. Do you have any suggestions or advice about ZooKeeper or ClickHouse Keeper, if I'm getting that name correct, anything in that area?

Unfortunately I don't have that knowledge, because we run the whole infrastructure for ClickHouse, and I know that ZooKeeper is a big question right now, as is the newer ClickHouse Keeper. It's something you have to deal with, but personally I don't have the experience of fighting with those, so I'm happily delegating that.

Very well, thank you very much. Do we have any more questions from the audience? There are no questions online, so you have time for more questions if you want. If not, you can always find me somewhere around here, and I will also answer your questions later. All right, then thank you very much, Olena, thank you for the talk. Thanks for coming. Thank you.