Video in TIB AV-Portal: CockroachDB

Formal Metadata

Towards an Open-Source Spanner
Alternative Title
Go - Cockroachdb
Title of Series
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

So, CockroachDB is a database written in Go, and this is the Go devroom, but despite
this announcement I will not be talking about any of the individual code at all. What I want to do instead is take you on a tour of about four of the ingredients and design principles that we have put together to build the system, and hopefully convince you to actually check out the source code, maybe look at it, contribute to it, do whatever you want with it. Today I'm going to talk about the design of databases, but I'm hoping not to lose any of you. If it gets too fast or something like that, I'm aiming for a leisurely pace, so ask questions at any point.

Before I talk about the details, let me just motivate why we're building just another database. There are so many of them, probably a new one every day, and we're in this race. So to motivate why we're doing this, let's look at a very, very simplified history of databases over the last 30 years. Of course there's SQL. Everyone knows it, it doesn't scale really well, and replication, keeping your data safe, is kind of an issue as well. So when the 2000s rolled along, people started realizing that they needed something else, and of course that something else goes by the umbrella term NoSQL, which can be basically anything except SQL. What you generally understand by a database in this paradigm is that it is very scalable and places high importance on availability. Usually that also means that in the process you lose consistency: in many cases you don't have transactions anymore, or you have something that tries to compensate for the absence of real transactions, and you have to work around that. So eventual consistency, manual joins, and issues with data integrity across a cluster are things that you experience in those databases.

And then there is the third generation, which has been popping up just for the last two to four years, which I would call NewSQL. That is something that tries to combine the advantages of the two previous generations: scalability, high availability, and the reintroduction of the ACID properties.
And this is Google, and those bullets apply to Google themselves. So pre-2004, a lot of projects were actually running on standard SQL databases with things like sharding, and it was not pretty. AdWords was having big issues with it, and so everyone was pretty happy when Bigtable was invented at Google, which was basically a standard NoSQL database: it was like a column store and had eventual consistency, very highly available, very scalable, and that was a big step up from the previous SQL stuff. But only two years later people had realized that you actually want data integrity at many, many points in your apps, especially for a platform database that a lot of products use. It's kind of cumbersome if each individual programmer has to think about the issues that could happen if the data they wrote here isn't there yet, and it is really complex. So what Google did was basically put another layer on top of this NoSQL data store, and that layer kind of emulated the consistency that you needed, which is good, and that I would call the second, NoSQL-ish database. But it had its downsides, in that the underlying data store was not designed to do stuff like this, so it was very slow and complex.

And then it took another six years or so to come up with the magical bullet, which is called Spanner, and that's really a simulation of a fully linearizable database. It doesn't really matter what that means; it basically behaves like a single Postgres on your single machine. Your app developer doesn't have to think about which transaction isolation to use or whether anything inconsistent is going on. In the applications you have transactions that you can use whenever you need to operate on things that belong together. That's really what they use today, and I'm pretty sure that most new stuff will try to go that way as well. And so there's one very telling quote
from the white paper on Spanner that was published in 2012, and the quote is that they believe it is better to have application programmers deal with performance issues due to overuse of transactions sometimes; that is way better than not having transactions and having to deal with that every single time.

So the reason for Cockroach's existence is that we really want someone who is not Google to also be able to have this. Right now Spanner is at Google and probably works great, and, you know, it is scalable, highly available, and transactional, but you can't have it, because it's not open source. There are a lot of auxiliary services, and it is really kind of plugged into the Google infrastructure ecosystem, so it is nothing you can use. What we do with Cockroach is build a system that gives you the same stuff but is not as tied into this infrastructure. It is also a very, very different design underneath, so it looks the same, but don't be fooled: it's not a clone of Spanner. It's actually something that is designed differently while giving you what Spanner provides too. If you were here at 9 AM, Kelsey also mentioned that CoreOS is trying to build the stuff from Google so that you can just download it. We're kind of trying to build Google's database so that you can just download it, and of course it's open source.

So let's talk about what we mean by availability in Cockroach. You know the CAP theorem, and you know that you either have to be AP or CP, or you screw up. Cockroach is trying to be a CP system, and that means that each piece of data that you write goes to a number of replicas. This choice of being consistent means that a client-facing server will not answer unless it knows that what it's telling you, or what you're asking it to do, is happening on the majority of replicas. So if you have three data centers and two of them go away, then you're really screwed, because that one data center will not be able to talk to the other two, and one out of three is not a majority. If you have one data center going down out of the three, it
doesn't matter. So that's the basic idea, and you can see that it's not as ultimately available as a NoSQL data store that basically just answers you with whatever data it finds, but it is still highly available: you need a majority to disappear.

Really the only price to pay for this is that you have to wait for the acknowledgment that the stuff you write is on a majority, so that gives a certain latency for your writes. And of course when you do reads you also have to make sure that whoever you're reading from is actually in charge of knowing what the up-to-date data is, so there's some latency that you have to take. Spanner does that too, and that's just the fundamental truth of being consistent; you can't just read without it.

As I said, if a data center goes down, the app doesn't notice at all, because the app doesn't know anything about the database behind it. It doesn't have to track where the data is, because the database is completely consistent, and no matter which entry point you choose you always get the exact same state. So if I usually access this server in the data center that's closest to me and that one doesn't work, I just take another one and get the same thing.

One word of caution here: building a system that's consistent is really difficult. Distributed systems are always complicated and very prone to errors, and if you have to do something like that in your app yourself, it's very likely that you'll run into trouble. Think about it: it's kind of the same as trying to do your own encryption in your app. You can do it, and maybe it's pretty good, but there is probably something wrong with it. So really we want the database to take care of this complexity.

And really the main point of Cockroach is that it is a transactional data store. What we want you to be able to do is run any operation that you can run against the data store inside of a transaction, and then it is completely isolated from
the rest; I will talk about this under isolation later. Having transactions of course has the obvious benefits: application code is that much easier. You just write what's happening, wrap it in a transaction, and you're done. You never have any problems with bigger updates running to, like, 50 percent, and then something crashes and you think, "oh damn, I need to clean this up". This just doesn't happen: either everything happens or nothing.

Another thing that we're trying to address with Cockroach is this image of transactions as being old-fashioned and having high overhead, and that this alone is reason not to have them. I'll talk about our implementation of transactions a lot today, and I'm hoping to convince you that it is very lightweight, that you don't incur a huge overhead just for having transactions in this database.

So what we do with Cockroach is we have a very, very high default isolation level. Anyone who doesn't know anything about isolation levels can just use Cockroach and it will be fine, because the default isolation is so high that everything works exactly like anyone would think it would. But we acknowledge that there are certain situations where you might not want that, and for that we have one isolation level that is slightly lower, snapshot isolation, which I guess is pretty well known. That can be used if you have specific use cases that actually have a lot of contention, which means a lot of transactions kind of battling for the same keys.

Next topic: architecture. So what kind of database are we actually building? What it is at the moment, at its core, is a sorted, monolithic key-value store. Everyone knows what a key-value store is. Sorted means that the keys in it actually have an order and you can scan for keys in that order, so you can say "I want all the keys from A to B". Monolithic means that it's a distributed system, but it doesn't matter which entry point you
choose: logically it represents a single key-value store. There are no different versions of anything on different servers; it is always the exact same state.

Some of the basic ingredients that go into it: at the lowest level there is RocksDB, which is for storage, so that is for actually storing key-value pairs on your hard drive. On each hard drive you have one instance of this RocksDB thing, and that's a log-structured merge-tree type of data store. It doesn't really matter what that is, but it's fast, it has very high sustained write throughput, and it's Facebook's fork of Google's LevelDB. There are nice people working on it, it's a great thing, we use it, and if you want to read more about it, just go ahead.

Then the actual design choice that we made regarding your data: we have this sorted key-value store and we basically logically encapsulate chunks of about 64 megabytes of data each. So the whole key range will basically be split into different parts, and each part will represent about 64 megabytes of data, not precisely, but more or less. Those units form consensus groups, and on that level you get your consistency: you replicate the data on different nodes. If one of the units gets too fat, it will split into two units; if several units get very small, they fuse into one unit. This is called a range, and if you read the design document it is all over the place. It's the fundamental unit of replication and encapsulation in Cockroach.

As I said, each of the ranges forms a consensus group: the data in each range is typically on three replicas, and those replicas are joined together with a consensus algorithm, which basically elects a leader among them, and that leader will then advance the state machine on each of them. This is a topic for another talk; I will just say that we're using Raft, which is a very popular
algorithm. Before, you would usually use Paxos, which is very complicated and prone to subtle implementation errors. We actually put a lot of work into writing a Raft implementation, and we found that none of those that existed were meeting our needs. CoreOS at the same time had an issue with go-raft, and so it was just natural for us to team up. So basically we contributed some code to them that made sure that the implementation they have is also OK if you have, say, 100 servers with millions and millions of consensus groups. That is what happens with Cockroach: you put 64-megabyte pieces of data into a consensus group, and then you have millions of those in order to hold a lot of data. So this is a really interesting thing, and it also needs very careful optimization, because if each consensus group talks to every other consensus group, then you have millions squared connections and you're done already. This is interesting stuff, but I just want to mention it.

We also have the concept of causality. We don't just use the machines' individual times. If you have looked at distributed systems, you know that tracking causality is one of the major issues, because any of the machines will have a clock that's not like any of the other machines' clocks. Google's approach to this is to actually put GPS and atomic clocks everywhere and then have an API that basically tells you what your maximum clock error is, and based on that they do everything. We don't want to do that, because we want you to be able to use this thing, and you probably don't want to buy atomic clocks. So what we do instead is we have a hybrid logical clock, which
is also a fairly recent paper, and I'm sure we have a link on our website. Basically, what it does is orient itself on the wall times of individual nodes, but it also causally connects all the events that somehow interact within the system. Think of it as a timestamp with an extra logical component.

And really the core part that I want to talk about today is lock-free transactions. Traditionally you would just use locking for the transactions in central places, and then you would use the time signal to get consistency, but we can't afford that and we don't want to, because we have a lock-free implementation of transactions, which I will talk about in detail. That is kind of the scope for the beta release, which we're putting out hopefully soon. We're probably about 90 percent done, so you cannot use Cockroach in production yet, sorry, but we're pushing hard to get there in due time.

Once we have our sorted, distributed key-value store, we can put the structured data layer on top, which is basically an abstraction that talks about tables, columns, and indexes. And once you have that, SQL is kind of standard stuff. That is exactly how Google built Spanner: a key-value store, then structured data on top, and then you get SQL. That is of course more of an aspiration right now; this is really the road that's planned.

So here's a schematic, to get away from the text slides for a second. A store is a hard drive in any of the servers, and each of those colored things is a range that represents about 64 megabytes of key-value pairs. We color-coded the individual consensus groups: all the red ones, and you see there's one on store one, store two, and store three, are actually a logical unit, so they will strive to hold the same data at all times, and they are a Raft consensus group. If you count the groups here, there are one, two, three, four; in reality of course you have millions. So that's basically what
the data looks like in it.

So this is the part of the presentation I actually wanted to get to. We have a replicated database, but what does it actually look like if you run transactions? What actually happens? I have one text slide and then I'll have a graphic, to hopefully iterate twice and actually get it into some heads.

What we do is a variation of a two-phase commit. If you are a transaction and you're writing, then you're not writing actual values and leaving them there; you're writing values with a special flag. The flag basically tells everyone else, "look, this is not an actual value, the transaction might still be running", and it carries a transaction ID. There's also a simple system table that every transaction registers in and which serves as a single source of truth: basically a system key space which says which transactions exist. And then you have the values that are written by transactions, which are called intents. So if a transaction starts, you write that in that table, then you do your business, you write a lot of stuff that will end up as intents in the key-value store, and then you want to commit.

You could think of two ways to do this. Either you just write "committed" to the transaction table, then go to all of those previous intents you wrote and change them all to actual values, because now they're committed, and only then return to the client. If we did that, I would completely understand if you said that's kind of a no-go, because if you have a hundred values, then just to commit you have to write another hundred values. But really what we can do is just commit the transaction, leave all those intents, and return to the client. The client is free to do whatever they want, and only then, on a best-effort basis, do we go to all of those intents we wrote and change them into values. The reason why this is correct is this: assuming that this intent cleanup failed and someone else tried to do something with the key that you wrote
your intent to, all they would see is the intent, and they will know which transaction wrote the intent, because that is saved in the intent. So they would then have to go to the transaction table and check on your transaction, because the transaction table is authoritative for the actual status of the transaction. It is updated atomically, so there's no ambiguity: if it is committed, they just upgrade the intent and read the value, or not, depending on the status, and then do their business. So if you look at what changed compared to not having transactions, the only things that you add are the writes to this transaction table and the best-effort cleanup, which is not blocking the client. So this in itself is very lightweight.

That was the text slide, so I tried to come up with a picture that says the same thing. How do we read this? Time goes from the top to the bottom, so what happens first is at the top. When you start, you're the client and you say "I want to start a transaction", and the server says OK. What happens then is that in this transaction table, which is this column right here, there will be an entry inserted which basically says: the transaction with transaction ID T1 is now pending, meaning running, and it was started at this time. Time here is the hybrid logical time and not the standard wall time. Then your client wants to do certain things in the transaction; let's assume you're just trying to write two or three keys. As I mentioned before, the transaction will write those intents: you write the key and you write an intent associating the value with the transaction that wants to write it. You do this a bunch of times, and at some point you want to commit, so you basically try to update your transaction table entry from pending to committed. Once that happens you go back to the client, and then the cluster will try to look at the intents you wrote and upgrade them to actual values. OK, does everyone understand this? Yeah? Great, so that means everyone got it.
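The write path just described can be sketched in a few lines of Python. This is only a toy model and not CockroachDB's actual code: the names `Cluster`, `Intent`, `begin`, `write`, `commit`, and `cleanup` are invented for this illustration, everything runs in one process, and replication, timestamps, multiple versions per key, and conflict handling are all left out.

```python
from dataclasses import dataclass

PENDING, COMMITTED = "PENDING", "COMMITTED"


@dataclass
class Intent:
    """A provisional value, flagged with the ID of the transaction that wrote it."""
    txn_id: str
    value: str


class Cluster:
    """Single-process stand-in for the key-value store (hypothetical names)."""

    def __init__(self):
        self.txn_table = {}   # txn_id -> status; the single source of truth
        self.kv = {}          # key -> committed value (str) or Intent
        self.intents_of = {}  # txn_id -> keys it wrote, used for cleanup

    def begin(self, txn_id):
        # Register the transaction as pending.
        self.txn_table[txn_id] = PENDING
        self.intents_of[txn_id] = []

    def write(self, txn_id, key, value):
        # Writes are stored as intents, not as plain values.
        self.kv[key] = Intent(txn_id, value)
        self.intents_of[txn_id].append(key)

    def commit(self, txn_id):
        # The commit is a single update of the transaction table entry.
        # The client can be answered right after this; intents stay in place.
        self.txn_table[txn_id] = COMMITTED

    def cleanup(self, txn_id):
        # Best-effort step that runs after the client has been answered.
        if self.txn_table[txn_id] != COMMITTED:
            return
        for key in self.intents_of[txn_id]:
            entry = self.kv.get(key)
            if isinstance(entry, Intent) and entry.txn_id == txn_id:
                self.kv[key] = entry.value  # upgrade intent to a plain value


cluster = Cluster()
cluster.begin("T1")
cluster.write("T1", "a", "1")
cluster.write("T1", "b", "2")
cluster.commit("T1")                         # one extra write, then return
assert isinstance(cluster.kv["a"], Intent)   # still an intent at this point
cluster.cleanup("T1")                        # would run in the background
assert cluster.kv == {"a": "1", "b": "2"}
```

The point of the sketch is the cost argument from the talk: compared to plain writes, the transaction adds the transaction table entry and a cleanup pass that does not sit between the commit and the reply to the client.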
[Audience question, partly inaudible.] So you're wondering about how those things get to all the other nodes. This is already on top of the monolithic key-value store, so when you write something here and it shows up, it is actually replicated. I should have mentioned that; right now I'm only talking about transactions and trying to take the Raft part out for a second. But really, when the write intent ends up on two of these replicas, the third replica will eventually get it, maybe, but it is already consistent, because otherwise you wouldn't have any chance of doing something like this.

[Audience question, inaudible.] Well, with Raft, with standard Raft, you always read from the leader, so that will not happen. There's a concept called leader leases, which basically time-shares who the leader is, but basically, morally, you read from the leader. So what you would ask for is basically a Raft consensus request.

So now hopefully it's kind of clear what happens. [Audience question.] So the question was whether the cleanup is done way later, at compaction time, or right away. It's done right away, because you don't want a lot of clients to run into those intents, because every time they do, they have to go to the transaction table and check on the intent. Of course our compaction routines also clean up intents, assuming that they're still there; that might happen if no one ever actually uses the key. But basically what you do is commit, and then immediately you go to the intents.

[Audience question about serializability and whether a transaction sees its own writes, partly inaudible.] Yes, the transaction sees its own writes. [Inaudible.] I mean, the worst thing
that happens is that you read your own intent, but then you resolve your own intent. [Audience question, inaudible.] Yeah. Anything that happens after that, some time after the commit, will see your values; otherwise it wouldn't be serializable. What happens in the worst case is that you see an intent that hasn't been upgraded to a value, but then you will be like: "Oh, this intent belongs to T1. I'll check on T1. Oh, it's committed, so this should really be a value", and then you use it. So that's the model: people who read the data actually clean it up when they do, but we obviously make an effort not to leave intents sitting around and to clean them up as fast as we can. OK, any further questions I'm happy to take at the end, but I don't want to run out of time.
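The reader's side of this model, where whoever runs into a leftover intent checks the transaction table and cleans up, might be sketched as follows. Again this is an assumption-laden toy: `read`, `Intent`, and `MustWait` are names made up for this example, there is only one version per key, and what a reader does when the other transaction is still pending is left out.

```python
from dataclasses import dataclass


@dataclass
class Intent:
    """A provisional value tagged with the writing transaction's ID."""
    txn_id: str
    value: str


class MustWait(Exception):
    """The intent's transaction is still pending; conflict handling is omitted."""


def read(kv, txn_table, key):
    """Return the value for key, resolving a leftover intent if one is found."""
    entry = kv.get(key)
    if not isinstance(entry, Intent):
        return entry                      # plain committed value (or None)
    status = txn_table[entry.txn_id]      # the table is authoritative
    if status == "COMMITTED":
        kv[key] = entry.value             # the reader upgrades the intent
        return entry.value
    if status == "ABORTED":
        del kv[key]                       # the write never took effect
        return None
    raise MustWait(entry.txn_id)


# T1 committed, but its cleanup never reached key "a".
kv = {"a": Intent("T1", "1")}
txn_table = {"T1": "COMMITTED"}
assert read(kv, txn_table, "a") == "1"
assert kv["a"] == "1"                     # the intent is gone after the read
```

Because the transaction table entry is the only thing that decides the outcome, a missed cleanup costs the next reader one extra lookup but never changes what it sees.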
I want to say two things. This is of course the case in which nothing goes wrong, but when you're doing lock-free transactions a lot of things can go wrong, because transactions will actually be writing to overlapping key ranges, and it's very interesting to see what happens. So for the first time in a talk about Cockroach I'm actually going to try to explain this to you.

Before I go to the next slide, which actually deals with the conflict resolution, I want to mention a few things that can happen. I don't want to get into the timestamp internals too much, but what is very central to a transaction is this timestamp here. Think of this timestamp as the provisional commit timestamp of the transaction. When you start, you start with the current timestamp of the node, then you do your thing, and when you come back to this timestamp, it's very possible that some of the writes that you did will have increased it, or that another transaction will have increased it. It will never decrease, but it might increase. If the timestamp did not increase, that logically means that all of your writes and reads happened at the same time, and if you commit with that you will be serializable. If you allow the timestamp to be pushed, that is, increased, and you commit, then what you will end up with is snapshot isolation. So if you know those two things, then I just want to tell you that the only difference between the two isolation levels in Cockroach is whether you allow this timestamp to be pushed. That is the only difference. So we don't have two completely separate systems of transactions; it is really a single system, and just the commit behavior is different. And of course when you have conflicts between transactions, they also take a course of action depending on what type of isolation each one has, but it is a really lightweight difference.

So let's look at things that can actually go wrong. What I want to focus on is what happens if you read
something that doesn't seem right, or where you're not sure what to do. Write conflicts are easier, because usually you'll just have to restart or abort someone. So assume we're a transaction, and in this picture here we want to read at this timestamp, and this is the key. We're using an MVCC system, so there are potentially a lot of versions of this key, and, the opposite of before, newer versions are at the top. So this version is very new and this version here is very old, and we're reading at this level.

Now you could ask why the picture goes into the future that much. I really want to point out that you can actually see future values: you can see values whose timestamp is higher than what you think "now" is. That's simply because we have a distributed system. If you have a bunch of nodes and one of the nodes' clocks is fast, so it is in the future, and it runs a transaction, then it will run the transaction at its local time when it starts. If that is sufficiently ahead, any writes it does will be at a timestamp which might well be in another node's future. So you can definitely see future values here.

Now, if you're a transaction and you want to read at this timestamp, what do you do if you see a value in the future? Do you want to read it, or not, or restart? This is one of the parts where clock uncertainty, the absolute time uncertainty, comes into play: you have to decide whether the value is close enough to your own timestamp to actually have happened before your time. If you're at time 10 and something has a timestamp of 11, and you know that your maximum clock offset is 5, then you can't really be sure whether in absolute time this write happened before or after you, and whether you should be seeing it or not. So what you have to do in this case is come again, and you will come again with a timestamp that's up here. That will be one case in which the transaction's provisional commit time
stamp actually increases, and you will have to restart the transaction. These retries are pretty lightweight; a restart is not the same as an abort.

Now you could say: OK, but what happens if one node that's in the future actually hammers a key with writes and I'm trying to read this key? Every time I come back there will be a new value on top and it will never end. That doesn't happen, because we make it so that we only restart once per node that caused the conflict. When you read a key and see something in your future, then per node that wrote it you only restart once. So normally you only do that one single restart if that happens.

The other thing that can happen is that you see something very far in the future, and you know, because in the cluster the clocks differ only by, say, 100 ms at the most, that this cannot possibly have happened before your read. So you can safely ignore the value. Those are the two things about the future: if it's in the near future you usually need to restart, and if it's far in the future you skip past that value.

What happens if you read something in the past? Well, if it's an honest value, a committed value, then in an MVCC system you're just reading the next candidate, so you really read the value. If it's not a value but an intent, then you have to check, because it could be an intent from a transaction that's still running, in which case you either restart yourself or push that intent to the future, which you can do for snapshot isolation transactions. You basically change this value so that it's actually up here, and then it's not a problem because it's in the future. Or you find that the intent is just something that hasn't been cleaned up yet: the transaction has committed, so it's a value, and you upgrade it to a value and read it. Or, the third case, the intent belongs to a transaction which was aborted, in which case you ignore the intent. So basically, certain things can happen, but in any of these situations there is a way to go about it.
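The read rule for future values described above can be sketched roughly as follows. This is a minimal illustration and not Cockroach's actual code: the names `Version`, `ReadRestart`, `mvcc_read` and `MAX_CLOCK_OFFSET` are hypothetical, timestamps are plain integers, the choice of retry timestamp is an assumption, and write intents as well as the "restart only once per node" bookkeeping are left out.

```python
from dataclasses import dataclass

# Assumed cluster-wide bound on clock skew (the talk's example uses 5).
MAX_CLOCK_OFFSET = 5


@dataclass
class Version:
    """One committed MVCC version of a key (hypothetical type)."""
    timestamp: int
    value: str


class ReadRestart(Exception):
    """Signals that the reader must come back with a higher timestamp."""

    def __init__(self, retry_timestamp):
        super().__init__(f"restart read at timestamp {retry_timestamp}")
        self.retry_timestamp = retry_timestamp


def mvcc_read(versions, read_ts, max_offset=MAX_CLOCK_OFFSET):
    """Return the value visible at read_ts, or raise ReadRestart.

    versions may be in any order; intents are not modelled here.
    """
    # Versions slightly ahead of us might have been written before our
    # read in absolute time, so we cannot decide whether to see them.
    uncertain = [v for v in versions
                 if read_ts < v.timestamp <= read_ts + max_offset]
    if uncertain:
        # Assumption: retry at the newest uncertain version's timestamp.
        raise ReadRestart(max(v.timestamp for v in uncertain))

    # Versions further ahead than max_offset cannot have happened before
    # our read, so they are ignored; read the newest version at or below us.
    visible = [v for v in versions if v.timestamp <= read_ts]
    if not visible:
        return None
    return max(visible, key=lambda v: v.timestamp).value
```

With the numbers from the talk, a reader at time 10 that finds a version at time 11 with a maximum offset of 5 gets a `ReadRestart`, and retrying at 11 returns that version; a version at time 100 would simply be skipped.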
But do ask again afterwards. If it kind of makes sense, you can go to the design document; there's a long paragraph about this, and it took a while to get that part right. Yeah, you want to read at the timestamp, and the value you get is the next value below that. That's because you're in a transaction: if you're in a transaction, you want to do everything at the transaction timestamp. If you were not in a transaction, you would just read the highest value that's actually there: if you find a committed intent or a value, that would be it. So if you're not transactional, all of this is simplified, because you always just read at the current node time, which guarantees that nothing is ahead of you. Unfortunately, despite being sleek and lightweight, it still gets complicated if you dig into it, but hopefully that gave you a small taste of it.

Let's briefly talk about the isolation that you get. As I said, we opted for serializable snapshot isolation as the default. If you don't know what this stuff is, it's really just how you would expect it to work; that's basically what it is, and that's what you want. In an ideal world everyone would just be using serializable or linearizable. There is a subtle difference which I will not explain, except that linearizable has to do with absolute time, while serializable means that transactions always appear to happen in a non-overlapping fashion, but they might switch order. Sometimes you run one transaction and then another one, but the commit timestamps would actually suggest the opposite. That would usually not happen in Cockroach, but it might happen if you have two separate clients working on completely non-interacting operations on different parts of the cluster, and if your timing is really, really weird, then that could in theory happen. I'm pretty
sure we couldn't even reproduce it in practice with the latencies involved. But theoretically, Spanner, which gives you linearizability, is doing something slightly stronger. We have a slack option built in that basically just takes the maximum clock offset and makes sure that this time passes between two critical operations. So if you add the slack, then Cockroach will be linearizable, and you just have slightly longer transactions, depending on your maximum clock offset.

That also opens up an interesting avenue, because it's quite conceivable that AWS and all these other cloud providers might at some point provide you with such an API; they can easily put atomic clocks in their data centers. Once they have a nice API for this, think of an open source TrueTime, Cockroach could just plug into it and control the system's maximum clock offset based on this. That would mean that you basically get Spanner, because someone else is doing the infrastructure part that you can't get rid of, which is time, for you.

OK, so serializable snapshot isolation: great stuff. The only thing with that is the way it works. If you're in a serializable snapshot transaction and you want to commit, then you cannot commit if your timestamp was pushed forward, because that means you might not be serializable. That could lead to a lot of restarts. Imagine having a huge number of transactions fighting over the same keys; you could conceivably see a lot of restarts, and maybe that will be a problem. In those cases you might want to downgrade to snapshot isolation. Snapshot isolation, probably pretty well known, just means that a transaction during its lifetime always sees the data that was present at the time that it started. That's already fairly high isolation, but it breaks a few things. For instance, constraints can break, because you're reading data that might be
updated by someone else, and then you're relying on data that they updated in the meantime. You can do some things about it. But you really cannot do really shitty consistency with Cockroach transactions; it just doesn't work. You can't do it at all. You can just go outside the transaction and then write individually. OK, so those are the two isolation levels.
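The commit rule from the talk (a serializable transaction has to restart if its provisional commit timestamp was pushed, while a snapshot transaction may commit anyway) can be sketched as below. This is only an illustration under stated assumptions: `Isolation`, `Txn`, `push` and `try_commit` are hypothetical names, timestamps are plain integers, and nothing here reflects Cockroach's real data structures.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Isolation(Enum):
    SERIALIZABLE = "serializable"
    SNAPSHOT = "snapshot"


@dataclass
class Txn:
    """Hypothetical transaction record with a provisional commit timestamp."""
    isolation: Isolation
    orig_timestamp: int               # timestamp the transaction started with
    timestamp: Optional[int] = None   # provisional commit timestamp

    def __post_init__(self):
        if self.timestamp is None:
            self.timestamp = self.orig_timestamp

    def push(self, new_timestamp):
        # The provisional commit timestamp never decreases, it only grows.
        self.timestamp = max(self.timestamp, new_timestamp)


def try_commit(txn):
    """Return "commit" or "restart" according to the isolation level."""
    pushed = txn.timestamp > txn.orig_timestamp
    if pushed and txn.isolation is Isolation.SERIALIZABLE:
        # Reads and writes no longer happen at one logical time.
        return "restart"
    # Snapshot isolation tolerates a pushed timestamp; an unpushed
    # transaction commits under either level.
    return "commit"
```

The point of the sketch is that both isolation levels share one code path, and the only branch that differs is the check at commit time.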
And I will actually leave you with that. So basically, in summary: Cockroach is really inspired by Spanner, no doubt, but it is very different from Spanner in the design components that go into it, and much more sleek and lightweight. We're actually at an interesting point in the project: I would say we're about 90 percent done with getting to the first beta release, beta meaning something that we would actually encourage people to try to run and see how it works for them, not in production yet, but moving there. We're a fairly active project, but we appreciate any help that we can get, especially if there are skilled people here. It doesn't matter if it's tooling or code reviews; we're really happy to have you look at the stuff. The way it usually works, once you contribute a couple of small things, you end up being dragged into thinking about the distributed systems things, which is in itself really interesting. I personally enjoy it a lot, and I would be happy if you checked it out and asked us questions.

If you want to have a human-readable description of all this, if you don't want to go through the source code and piece together what's happening, and if you also want to learn about all the other components that we have (obviously I omitted a huge bunch of features and design decisions), just click this link. You will basically get the initial version of the design document, which was written by our mastermind Spencer Kimball, who is ex-Google, where he worked on Colossus. So there's some knowledge in our group about building systems like these. And if you are now hooked on distributed systems and you don't know this blog here, then just read it, because it's fun: someone spins up a bunch of different databases and breaks them just by fiddling with the network. And that is all. Thank you.

[Inaudible exchange; an audience member asks about the garbage collector.] The good thing is that
we don't have to worry about that. So the question is whether we can have problems with the Go garbage collector, because you never know if your thread pauses for a second, and then something can happen, and the question was whether we thought about this a lot in what we're doing. Honestly, we're not actually thinking about it much, because we're not relying on the time signal for correctness. What can always happen, of course, is that the garbage collector freezes us for 5 seconds. Then it will suck, because nothing is going to happen in that time, and that will cause transactions to time out, or transactions to run into conflicts more often because they will just be running longer. But that's how it is. It would just mean that you have a little more contention and you lose a little bit of performance, but it's absolutely not critical to correctness at all. It is something that we will try to optimize as much as we can, but at the current stage we're valuing the integrity, ease of use, and ease of reading of the code over that kind of optimization. But that's of course something we want to work on.

The next question is whether we get paid to do this. The answer is slightly complicated right now, because in a couple of months' time the answer might actually be different. It's very conceivable that Cockroach will turn into a company that's centered around its open source product; think of a company with cool guys that go and do cool things. So that's something that you might want to watch out for, but as of today we're all doing this for free.

The client knows nothing about the transaction table. The question was where this transaction table is and whether the client has to know about it. The client actually doesn't know about it; the
transaction table really lives in Cockroach's key-value database itself. What we do is we separate off a small part of the key space at the beginning, which we need to do anyway, because we need an addressing scheme to map keys to the nodes that actually hold the range for the key. So just think of the key as being a \0\0 prefix followed by the transaction ID. So it's something that lives on the server, and it's managed only by so-called transaction coordinators, which is basically who you're talking to if you're a client running a transaction. So it's completely not something the client deals with. As for our clients: there's of course a Go client, but people have started writing a Scala client and, I think, a Node.js client. Really, the clients don't do any of the work; they just need some simple retry logic.

[Inaudible question about what is replicated.] Not everything. There are some pieces of data which we actually write to the local node only, but that's mostly accounting and stuff like that. The transaction table is completely replicated; it's replicated exactly the same as any other key. That's necessary, because you cannot afford to have it on a single node. [Inaudible question about whether it fits in one range.] Well, of course it can span ranges, yeah, sure. Everything can span across ranges, except for the very first level of the addressing scheme. If you want to look up a value, you always go to the beginning of the key space, and that always has to be in one place. But we actually split the addressing scheme used to find the range that holds the data into two levels, just to be able to scale out to exabytes or something like that.

[Inaudible audience question.] The question, if I understood correctly, is how Raft stores data, how we deal with the Raft log. [Partly inaudible.] So the placement of
ranges is really, I mean, Raft really only knows a range through its range ID and its node ID, and because the ranges themselves are participating in Raft, there's really never any question of where they are. What we do with the Raft log, and that's actually a big part of the work that we did together with CoreOS on their Raft implementation, is make sure that we can store it in our own underlying storage. So we made sure that the Raft implementation has interfaces that you can just implement for snapshotting, for compaction, which is log truncation, and of course for the storage of the log. So that all goes into RocksDB on those specific nodes, and that of course is not replicated, because each [inaudible exchange with the audience]. I'm sorry, I'm confused by the question. [Inaudible.] Yes. Well, each member of the group lives on one node and will just use the local storage of the RocksDB instance on that node, keyed by Raft ID, which is unique. So basically everything is just stored on the physical machine. With three replicas you just have three copies, because that's how Raft works. The question was where the Raft data is stored.

[Inaudible question about transaction restarts.] Yeah, we try to handle that. It's understood that transactions may restart, and that's not actually an issue, it's expected. So we shield as much of that away from the client as we can. Usually, if you're a client in a transaction, it might bounce a little bit but then come back, so usually you don't notice. Only some of the restarts are passed on to the client, because otherwise you would have to put all the retry logic into the client, which would be harder. [Inaudible.] Yeah, but the transaction only exists through its entry in the transaction table, so you can read and write in a transaction asynchronously
and any restarts that happen will just happen. The only point where you can actually be affected by the restarts is when you commit and you have seen too many restarts internally, so that it comes back to you as aborted. But most of the restarts are really shielded from you. There are of course a few that the client has to handle, but generally the client shouldn't need to care.

[Inaudible.] So your question is what happens if a client is writing a lot of values and there are a lot of conflicts? [Inaudible.] I mean, if the data changes in
a way that doesn't collide with your transaction, then the client doesn't care. But if you're reading keys and those keys have new intents or have changed, then you get the kind of conflict handling we saw on the slide, so that might lead to restarts. If you're on very busy keys, then you could end up with a lot of restarts. That's one of the things that you just have to deal with if you're running a system like this where you don't have locking: things will just try their best to get a slot. There's not going to be any locking. What you might want to do is use snapshot isolation, which has fewer restarts, so that might work, or you just have your application deal with this differently. Because, assuming you had locking: if you write way more data than you can push through, the locking isn't going to help you either, because your queue is just going to grow and grow and grow. So it's the same problem, really.

[Inaudible question.] Yes, the transaction table has all the information about the transaction. Not the keys that were written, but [unclear]. So if a transaction starts, I would have to dig a little bit into the timestamps, but basically each node has a... sorry, the question. [Inaudible remark about the microphone.] So the question was basically where the timestamps come from. Yes, so if you start a transaction, then you speak to a node, and that node will take its local logical clock as the starting timestamp for the transaction, and that's what's going to be used. So I already answered the question.