Billion Tables Project (BTP)


Formal Metadata

Billion Tables Project (BTP)
Alternative Title
The Billion Tables Project
Title of Series
Number of Parts
Tortosa, Álvaro Hernández
Tallón, José Luis
Heroku (Sponsor)
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
PGCon - PostgreSQL Conference for Users and Developers, Andrea Ross
Release Date
Production Place
Ottawa, Canada

Content Metadata

Subject Area
Aka: how long does a "\dt" take on a 1B-tables database? Usually databases are considered "large" for the high number of records they hold, reaching billions or even more than that. But what about creating a billion... tables? Some time ago, this apparently crazy question was found on Database Soup. It may not be your day-to-day task, but attempting it exposes topics about PostgreSQL internals, performance and large databases that may be really worthwhile for your day-to-day work. Join us for this talk, where we'll be discussing topics such as catalogue structure and storage requirements, table creation speed, differences between PostgreSQL versions, and durability vs. table-creation-speed tradeoffs, among others. And, of course, how long a "\dt" takes on a 1B-tables database :) This talk will explore all the steps taken to achieve such a result, raising questions on topics such as: the catalogue structure and its storage requirements; table creation speed; durability tradeoffs made to achieve the desired goal; the strategy used to be able to create the 1B tables; the scripts and programs used; how the database behaves under such a high table count; differences in table creation speed, and other shortcuts, between PostgreSQL versions; how the storage media and database memory affect the table creation speed and the feasibility of the task; and whether it makes sense to have such a database at all. It is intended to be a fun, open talk, for a beginner-to-medium-level audience interested in large databases, performance and PostgreSQL internals.
Source code
So, first of all, thanks for being here. Please interrupt me at any time if you have questions, and save the tough ones for the end. So, here we go: one billion tables.
A little bit of background first, so you know where we're coming from. I work at a company based in Madrid, Spain, where we basically do consultancy and training in areas such as PostgreSQL, databases, architecture and development, and cloud computing. We're also enterprise partners and do a lot of training; our contact details are on the slide if you want to get in touch.
First we need to answer a question: what is a large database? Usually a database is considered large when it holds up to terabytes of data, or when the number of records runs into the billions, maybe trillions. The largest ones span multiple nodes, and there are well-known cases of databases holding hundreds of terabytes, or even petabytes. So: lots of data. But this talk is not about big data. It's about something different, for a change.
OK, here we go.
This talk is not about big data: it's about big metadata. We're probably going to be setting the worst data-to-metadata ratio ever seen, it seems. Anyway, what table counts do typical databases have? I have been researching the number of tables found in some databases. There is what I call the schema-less-like, single-table database, which some designers produce as an exercise: exactly one table. I have also found a customer of mine with an "extremely normalized" enterprise CRM that had just two tables; that was a hard exercise too, good stuff. Then we have small databases, with around 20 tables, and large ones, with around 200. Then there are the "gone wild" databases, with thousands, say 5,000 tables in the database. In multi-tenant setups you can get higher numbers: ordinary multi-tenancy is around 2,000 tables, massive multi-tenancy around 1 million tables. And then there's the Billion Tables Project, which is about one billion tables.
You can see the number of tables for these different databases on a logarithmic scale; this is just to show the spread.
Let me review PostgreSQL's limits. There is a limit on the number of attributes in a given table, depending on the number and the types of the columns, and there is a limit on the size of each attribute; as a consequence of those two there's a limit on the size of a row. There is no limit on the number of rows. There's a limit on the table size, of around 32 terabytes. And the number of tables in a single database is unlimited, as is the maximum database size. That last part is basically what we're going to be testing. The picture on the slide shows where we are; we're not there yet, this is just one million.
So where did it all start? I didn't keep the e-mail, but it doesn't matter. It was a thread on a PostgreSQL mailing list where Simon guessed that the maximum number of tables is only limited in practice, and that realistically you could get to trillions of tables. Well, I can tell you that no, you cannot get to trillions of tables today, but let's see if we can get a little closer.
There was also the blog post, on Database Soup, that inspired us to try this crazy thing: creating a billion tables in a database.
So, first of all, we did it to prove that PostgreSQL has no limit on the number of tables, with the spirit of "boldly creating what no one has created before". Also, to stress PostgreSQL in an unusual way; there are many different ways of stressing a database, and, as someone put it, you want to test this kind of thing before going to production.
And yes, to stress the disk subsystem too. Those are the official reasons. In reality, first of all we wanted to beat the existing records for creating tables fast; that was the main driver behind this. Of course, there's also "mine is bigger than yours". And finally: because we can. So we need to change a definition.
What's TPS? From Wikipedia, transactions per second refers to the number of atomic actions performed per second. From now on, it's going to mean tables per second. Now, the first attempts. Back in 2011, someone had already tried to get tables created fast; correct me if I'm wrong, since this comes from your blog post. Creating tables with a serial column and a text column ran at around 83 tables per second, until the server crashed because it ran out of disk space. Then, after dropping the sequence and using just an integer column and the text column, it got to 4.6 million tables before the server crashed again, this time because it ran out of inodes; that was around 1K TPS. After doing some math we found that it's not possible, in that kind of setup, to reach a billion tables; we'll talk about that later. So I dropped the text column and did some initial attempts myself: I was able to do 10 million tables at 2.2K tables per second and, stopping there, 100 million tables at around 1.5K TPS. This was also reported as a comment on that blog post. So we exercised the system a little bit, but there were some problems.
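The "math" mentioned a moment ago is easy to reproduce: at the rates just quoted, how long would a billion tables take? A quick sketch (the rates are the ones from the talk; everything else is simple arithmetic):

```python
# Back-of-envelope: time to create one billion tables at a given
# sustained rate, in the talk's redefined sense of TPS (tables/second).
def hours_to_billion(tps: float) -> float:
    return 1_000_000_000 / tps / 3600

# At the ~1.5K TPS observed around the 100M-table mark, a billion
# tables needs roughly a week of sustained creation.
print(round(hours_to_billion(1500)))   # 185 (hours)
print(round(hours_to_billion(2200)))   # 126 (hours)
```

And that assumes the rate does not keep degrading, which, as the talk explains, it does.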
The first problem is that we were running out of disk space. Tables are stored in pg_class, one row per table, including the table name, so while inserting, pg_class grows, and that requires storage; there's a diagram on the slide of the page structure, and you'll note how many items we're going to be using per page. So we need a lot of storage for the catalog, but we also need a lot of storage elsewhere, because each table created results in a file, or files, on disk, and there is space required for the file names and the filesystem metadata of all those files. So we need to find a large storage system just to get to 100 million tables. We also ran out of memory: on some tests the OOM killer of the Linux kernel was killing, sorry, the script creating the tables. We also need a filesystem that handles a large number of files well; we found that reiserfs was basically the only one able to cope with such a large number of files. And we need a good table creation strategy. The first attempts all worked by creating a CSV or SQL file, pre-populated, and feeding it to psql; that was not working very well. We didn't even think about using a driver; even with local connections that was not going to be fast enough. So the final approach was to basically launch psql and feed the CREATE TABLE statements via its standard input: we write a program that generates the CREATE TABLE statements and pumps them into psql's standard input. That turned out to work quite well and quite fast.
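The final strategy, generating statements and streaming them into psql's standard input, can be sketched like this. The table and column names and the psql invocation are illustrative assumptions, not the original script (which assumed a reachable local server):

```python
import subprocess
from typing import Iterator

def create_statements(first: int, count: int) -> Iterator[str]:
    # Table/column names are illustrative, not the talk's originals.
    for i in range(first, first + count):
        yield f"CREATE TABLE t_{i} (i integer);\n"

def feed_tables(dbname: str, first: int, count: int) -> None:
    """Spawn one psql and stream CREATE TABLE statements to its stdin,
    as described in the talk (requires a running local server)."""
    psql = subprocess.Popen(["psql", "-q", dbname],
                            stdin=subprocess.PIPE, text=True)
    for stmt in create_statements(first, count):
        psql.stdin.write(stmt)
    psql.stdin.close()
    psql.wait()
```

The point of the design is that psql parses and executes statements as fast as it can read them, with no per-statement client round-trip.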
And of course there are some settings, not recommended for normal use, for creating tables fast: fsync off, synchronous_commit off, full_page_writes off. We need a little bit of RAM for wal_buffers. Autovacuum off, although, depending on the number of tables, there's a catch we'll see later. We were also restarting the psql processes once in a while, after a number of tables created, because depending on the number of statements we had sent to each one, resource usage differed. Increase shared_buffers if you have some RAM free, and increase checkpoint_segments; that's basically all that's necessary. So, the first server, back in 2011: a single server, nothing special, with a few gigabytes of RAM and three 1 TB disks. That's approximately the storage requirement for doing this exercise. We were using the reiserfs filesystem, and we were driving it with this script. It's going to be hard to read from the back, I guess, but in short it was a Python script that divides the work into a number of iterations; for each iteration it generates a number of SQL statements and feeds them to a freshly spawned psql process via its standard input, and it logs information about disk usage, free memory, TPS and state. Very basic.
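Collected in one place, the speed-over-durability settings just described would look roughly like the following postgresql.conf fragment. The parameter names are the real PostgreSQL settings mentioned in the talk; the values are illustrative guesses, and these trade-offs should never be used on a database you care about:

```
# Durability traded away for table-creation speed. Never use in production.
fsync = off
synchronous_commit = off
full_page_writes = off
autovacuum = off                         # anti-wraparound vacuum still runs
wal_buffers = 16MB                       # value illustrative
shared_buffers = 8GB                     # value illustrative; box had 48 GB
checkpoint_segments = 128                # pre-9.5 setting; value illustrative
autovacuum_freeze_max_age = 2000000000   # push the wraparound vacuum far out
```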
I'm very proud of it, and it works quite well. So these are the results. In red is the TPS, remember, tables per second, and in blue is the elapsed time in seconds. To reach 100 million tables we needed around 1,000 minutes, and the tables-per-second rate started at around 2.5K, but the speed kept going down, and continuing that trend we would end up at around 600 tables per second somewhere after the 400 million mark. This was not going to work to reach the billion tables. Apart from that, we found a problem: autovacuum. But wasn't autovacuum turned off? Yes, it was, but that's not enough, because the anti-wraparound vacuum kicks in independently of that setting once the transaction counter gets high enough. Correct me if I'm wrong on the exact numbers, but that freeze age can be pushed up to around two billion, and it's quite risky to create a billion tables, one transaction each, near that limit; so we had to raise autovacuum_freeze_max_age to keep it from kicking in. I'll give autovacuum the benefit of the doubt, it's there for a good reason, but here it would basically make the whole process stop, because it starts scanning 200 or 300 million tables; things got really bad and we had to fight with it. Also, about transactions: remember that we're feeding plain commands to psql over standard input, with no explicit transactions, so every CREATE TABLE is its own transaction. We tried batching the creates into transactions, with batches of 10K and even 100K statements, but the performance was consistently worse; so far the best performance came from not using explicit transactions at all. The downside of that is the transaction counter growing by one per table. And then there was another problem, of a rather different kind, which is
not really related to PostgreSQL at all: updatedb. The nightly file indexer came along and happily tried to scan the data directory. Yes: a file-based database scanning 200 million files of our database. Not funny. So what else do we need to do to reach the one billion? A few critical points. First of all, separate the catalog from the tables. This is very important, because via tablespaces the files created for the tables can live on different filesystems than the catalog files. Creating a huge number of small files stresses a filesystem very differently from appending to a small number of large files, so we separated those concerns: we created a tablespace, or rather, as we'll see later, many of them, on a reiserfs partition for the table files, and for best performance we kept the base directory, which is where all the pg_class stuff goes, on an XFS filesystem. And finally, we put the WAL on a ramdisk. Who cares: the database is not durable already, so if the server dies during the table-creation process, that's OK, we start over. What else do we need?
A bigger server. The first server was not enough, and I wanted to go fast; fortunately we could afford one. So: 16 cores, 48 gigabytes of RAM, running Debian Linux. For reference, this machine compiles the PostgreSQL source code in seconds. And we set up a nice storage subsystem: the data partition holding the reiserfs tablespaces ran on local storage behind a RAID controller with battery-backed cache, with the disks in RAID 10; the xlog was a ramdisk of about 19 gigabytes, and during some tests we used up nearly the maximum available amount of RAM, which is why we had quite a few server crashes along the way; and the base partition was on an external storage server attached over a dedicated channel, with around six disks in RAID, so pretty fast as well. So, the next critical step:
tablespaces. Every filesystem we tried suffers a lot with a very large number of files in it; performance degrades after several millions of files. So the only reasonable way, and I don't think it's cheating the original goal of creating one billion tables, because after all they're all in the same database, was to spread the tables over many tablespaces, even if they sit on the same reiserfs partition. How many? A nicely chosen number: 1,000, so each tablespace holds just one million files. That performs quite well. We also designed our scripts to rotate over the tablespaces, so that it's very unlikely that two processes hit the same tablespace at the same time. That's not a guarantee, but in practice the concurrency worked out. Now, is disk the limit? No. Table creation is not I/O-bound: if you look at the average throughput, we were doing less than 5 megabytes per second during the 100-million-table runs, so storage bandwidth is not the problem. So what is limiting us? Our guess is that we have two limits. The first one is CPU speed: while running this with a single process, that CPU runs at 100%, so it's creating tables as fast as one CPU can parse and process the statements. And we couldn't use prepared statements, because the CREATE TABLE statements are dynamic, so there's parsing every time. A faster CPU does more tables per second. Second, there's some contention: when we use more processes we create tables faster, but per-process CPU usage drops; we'll see that later. Anyway, the way to go faster is a faster CPU, more cores, and more parallelism, so we launch several processes in the background.
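The tablespace rotation can be sketched as a simple round-robin assignment. The naming scheme and the modulo mapping are assumptions for illustration; the talk only says the scripts rotated over 1,000 tablespaces:

```python
NUM_TABLESPACES = 1000  # so each tablespace holds only ~1 million files

def tablespace_for(table_no: int) -> str:
    """Round-robin assignment so concurrent workers rarely hit the
    same tablespace at the same time (naming scheme illustrative)."""
    return f"ts_{table_no % NUM_TABLESPACES}"

def create_statement(table_no: int) -> str:
    # CREATE TABLE ... TABLESPACE is standard PostgreSQL syntax.
    return (f"CREATE TABLE t_{table_no} (i integer) "
            f"TABLESPACE {tablespace_for(table_no)};\n")
```

With one billion tables and 1,000 tablespaces, each filesystem directory tree stays at about a million files, below the point where the filesystems were seen to degrade.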
How many processes? 16, which is not surprising, since this machine has 16 real cores. We also tested 32, and performance was lower: the CPUs were spending more time switching than working. Now we have another problem. In the first version of the exercise, the single process was logging the number of tables created and some information about disk usage and free memory. If we run several processes, things get tougher: we can't have every process log independently, or how would we merge all that information? So basically we chose to run yet another process to do the logging. To make this work, the logger process receives as arguments the PID of every worker process. The main process sends a signal to the logger; the logger in turn signals each worker, and each worker writes, to a file named after its own PID, the number of tables it has created plus some information about its state, basically whether it has finished. The logger then reads those per-PID files. That's how we collect the data in a consolidated fashion.
How does the source code look? The worker is a Python script derived from the original one. It divides the number of tables assigned to the worker into iterations; for each iteration it spawns a new psql process and starts feeding CREATE TABLE statements to its standard input. When it traps the SIGUSR1 signal, it writes the number of tables created so far, and when it's done it writes a sentinel value. It doesn't simply exit when the task is finished, because the logger process may still want to read the final count from the log file, so it's the logger process that sends the signal to terminate the worker. The iterations, the spawning of the psql processes, run in their own thread rather than the main thread, so that the main thread can handle signals and isn't blocked on I/O to the child process. Then there's the logger script, which basically waits, and when it receives SIGUSR1 it starts sending that same signal to the workers and collecting their data. And there's the main script, a shell script, which basically launches all the worker processes, launches the logger process, and loops sending signals to the logger so it logs the data, basically every 10 seconds.
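A toy version of the worker-side part of that protocol might look like this. The per-PID file name and the "count state" format are assumptions standing in for the original code, which lives in the project repository:

```python
import os
import signal

class Worker:
    """Sketch of the talk's worker protocol: on SIGUSR1 (sent by the
    logger) dump the running table count and state to a file named
    after our PID. File naming and format are illustrative assumptions."""

    def __init__(self, workdir: str = "/tmp"):
        self.created = 0          # tables created so far
        self.done = False         # set when the assigned work is finished
        self.path = os.path.join(workdir, f"btp_{os.getpid()}.log")
        # Install the handler in the main thread; the psql-feeding
        # iterations would run in a separate thread, as in the talk.
        signal.signal(signal.SIGUSR1, self._report)

    def _report(self, signum=None, frame=None):
        # Overwrite the file with the latest count and state.
        with open(self.path, "w") as f:
            f.write(f"{self.created} {'done' if self.done else 'running'}\n")
```

Running the table-creation loop in a secondary thread is what lets the main thread stay responsive to the logger's signals.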
You're probably not going to be able to read this on screen, so you'll have to look at the posted slides, but this is basically a diagram of what I've been explaining: the main process launching the worker processes and the logger, the worker processes spawning psql and writing the information about the number of tables to their files, and the logger collecting that information and logging it into a single file. I'm sorry you can't see it well, but this is what the code looks like. I won't skip this: this is more or less the shell code of the main process.
OK, I don't want to get into every detail of the script, but the main idea here is: first we launch the number of worker processes, each with all the parameters it needs, like where its table numbering starts, its identification number, the number of tables per process, and the number of tablespaces. We gather their PIDs, and then we launch the logger process with the PIDs of all the worker processes. And finally,
we loop, every 10 seconds sending a SIGUSR1 signal to the logger process so it collects and logs the data. That's basically the main script.
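That supervising loop might look like the following sketch. The poke/pause hooks are added here purely so the logic can be exercised without real processes; the original simply used os.kill and time.sleep:

```python
import os
import signal
import time

def supervise(logger_pid: int, still_running, interval: float = 10.0,
              poke=lambda pid: os.kill(pid, signal.SIGUSR1),
              pause=time.sleep) -> int:
    """Main-process loop from the talk: every `interval` seconds send
    SIGUSR1 to the logger so it collects the workers' counts, until no
    worker is left. `still_running` is a callable reporting whether any
    worker remains. Returns the number of times the logger was poked."""
    pokes = 0
    while still_running():
        pause(interval)
        poke(logger_pid)
        pokes += 1
    return pokes
```

In the real script this ran until every worker had written its "finished" sentinel.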
The worker processes work more or less like this: they trap the signal, and in the handler they basically open the file and write both whether they're finished and the number of tables created so far. By the way, if you look at the logged data, the creation of tables is asynchronous with respect to the reporting: the worker writes the CREATE TABLE statements into the pipe, and psql consumes them as fast as it can, so the reported numbers will probably be multiples of the number of tables per iteration, because the program feeds statements to psql faster than psql can process them. And this is the iteration loop, where we generate the CREATE TABLE statements.
We include the tablespace in the statement, and we made the names of the tables as minimal as possible. We launch the iterations in an independent thread so that the process creating tables can be interrupted by the signal, and basically, whenever we're done, we write the sentinel value, clean up the file, and exit. That's basically the worker.
And finally the logger process, which looks something like this.
This is the part where, when we need to aggregate the count of tables, the logger sends the signal to the worker processes and reads their files, reopening them if necessary.
It then increments the number of tables. The rest is just something for the more picky programmers who want to know what's going on: some code for logging the information, nothing really interesting.
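A logger-side sketch of that consolidation step, assuming a per-PID file layout ("count running|done") as a stand-in for the original format:

```python
import os
import signal

def collect_counts(worker_pids, workdir: str = "/tmp", notify: bool = True):
    """Logger-side sketch of the talk's consolidation step: optionally
    poke each worker with SIGUSR1 so it dumps its count, then read each
    worker's per-PID file and aggregate. The file format is an
    illustrative assumption, not the original code."""
    if notify:
        for pid in worker_pids:
            os.kill(pid, signal.SIGUSR1)
    total, finished = 0, 0
    for pid in worker_pids:
        path = os.path.join(workdir, f"btp_{pid}.log")
        try:
            count, state = open(path).read().split()
        except FileNotFoundError:
            continue  # this worker has not reported yet
        total += int(count)
        finished += (state == "done")
    return total, finished
```

The main loop would keep calling this until `finished` equals the number of workers.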
And this is basically the main loop: to detect when we're finished, we keep a set in memory of the workers that have already reported as finished, and once every one is done we kill the workers and exit. So that's basically how the source code works. On to
the tests. So, the big
question: did it work? I'm sorry to say that most of the tests we ran failed; we've been doing this for quite a while. The last test was started on Wednesday this week, with the final set of parameters, and: one billion tables, without cheating. Thank you. [applause] OK, let me show you some more information on this run. The average TPS was around 2.2K tables per second. It was funny to watch the logs growing and growing, creating millions of tables in minutes. When we checked the transaction counter it was a little over a billion, which makes sense, one transaction per table. Let's see how many entries are in pg_class: the expected number was 1,000,000,288, the billion tables plus the catalog's own relations, and it took a little more than 2 hours to run that count. As for the files: the base directory used 2.6 terabytes, almost all of it pg_class. So that's 2.6 terabytes of very useful meta-information, of course. The other catalog files were far smaller, the biggest around 616 megabytes. The reiserfs filesystems, with a billion files in them, used 97 gigabytes; again, useful information. We also grabbed some statistics about the I/O operations, and you can see bands of write and data activity; I don't know exactly what was going on at each point, but it's quite funny. And of course we tried to use the database, and it works quite well: inserting into a given table, or selecting from a given table, returns in milliseconds, so point operations stay very fast. Now, the big question: how long does a "\dt" take? I waited for about 3,000 minutes, and some
of you are asking: did it finish? Of course not. If you think about it, a listing like that has to chew through that huge pg_class, so with a billion tables it can take a long, long time; and we were sending the output to a file to avoid psql using a lot of memory to buffer the list. You can try it yourself. Now, the performance of the run that created the billion tables was pretty good: the tables-per-second rate was dropping, as expected, but it stabilized at around 2.2K per second, moving around a little bit. On the memory usage graph, basically the green line is cache and the red one is used memory. And about what I said earlier on contention: with 16 worker processes, the average CPU load of the system was 11.7, and the aggregated CPU usage of the backend processes, averaged across the whole test, was 57%. That's still quite good in my opinion, because the backends are contending all the time on the same non-concurrent regions, and the system is still doing a great job. [Audience question about dropping the database] Yes, it's interesting; we tried. Even right now, while running the tests, I've sometimes had the problem that DROP DATABASE just sat there, and I ended up killing it and starting again; and a DROP DATABASE over a billion tables is probably going to be the worst case of all of them. So, finally, we have created this fantastic database, full of great testing data, one billion tables. So let's make it durable: shut down the server, and remember,
in postgresql.conf, to turn fsync back on, synchronous_commit on, and full_page_writes on. And no, sorry, don't try this at home: whatever tables you were creating this way, I guess they're not going to be durable ones, so you'll have to dump and restore the database yourself. Some acknowledgements: to Josh Berkus, who happened to be responsible, though it's not all his fault, for the post that started all this; thank you, the idea was quite inspiring. Also to José Luis, who worked with me and got us the server we ran this on. And of course to the PGCon organization and the sponsors. And one final thing:
[unintelligible]
[Q&A] [Question, partially unintelligible, about the failed runs] Yes, some runs were failing because they were running out of memory; I don't have all the details of exactly what was allocating it. [Question about real-world databases with many tables] It's interesting; I have heard stories about databases approaching a million tables, typically in multi-tenant environments, and there may be some work needed to make the tools, things like dump jobs and anything that lists tables, work better with a very large number of them. It's an interesting field. [Question about why the catalog uses so much space] Part of it is the amount of information stored per table, and I guess there would also be huge space savings possible there. [Question about hardware requirements] If you look at the average I/O throughput, it's not that much; you don't need extreme storage performance for this. One good option I was looking at: there are some machines in AWS that fit this scenario quite well. One instance type has 24 disks of 2 terabytes each, so you can spread tablespaces across them and get a lot of local I/O; it's the most expensive machine they have, though. [Further discussion, unintelligible, about repeating the test with just one million tables]
[Question about how long the exercise took] It took quite some time; we've been at it for a long while, with many runs competing for the same hardware. [Several questions, unintelligible] On lookup performance: as I said, insert and select times on a single table in this billion-table database were still fast. [Question about the statistics collector] Good point; the statistics collector relies on a single channel, so that's one of the areas that suffers. [Question] But how are you running ANALYZE: manually, or letting autovacuum do it?
I'd guess that if you're running ANALYZE manually, on selected tables at selected intervals, then the number of tables is not really going to affect your performance; it's a database-wide ANALYZE that hurts. So maybe with more tables it would even be better to partition the data and ANALYZE only the relevant partitions, which would be interesting. And a few thousand tables is not a big deal at all; it's not the number of tables that's the problem, rather the ANALYZE having to scan everything, and operations that collect all the table names in memory. [Question about multi-tenancy and resource isolation] On horizontal partitioning into different databases for a multi-tenant application: well, if you want to assign different resources to different databases, or to different schemas within a database, the choices are limited. You can use different tablespaces with different performance characteristics, but you cannot limit things like memory per database within a single cluster. If you really want to separate resources, you run different clusters, and in a multi-tenant environment it may indeed be hard to scale without using different clusters. [Question about tooling for managing that] Yes, that would be a great tool to have; it sounds very nice, but I don't know of any tool like that today. Maybe something like pgpool will grow something similar, but not that I know of. OK, we really need to wrap up; we're out of time. Thank you very much. I'll be uploading the slides, and the source code will go to the repository. Thank you.

