PostgreSQL on Hadoop


Formal Metadata

PostgreSQL on Hadoop
Alternative Title
Distributed Analytic Databases
Title of Series
Number of Parts
Steinbach, Carl
Heroku (Sponsor)
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
PGCon - PostgreSQL Conference for Users and Developers, Andrea Ross
Release Date
Production Place
Ottawa, Canada

Content Metadata

Subject Area
Bridging the Divide with Distributed Foreign Tables Apache Hadoop is an open-source framework that enables the construction of distributed, data-intensive applications running on clusters of commodity hardware. Building on a foundation initially composed of the MapReduce programming model and Hadoop Distributed Filesystem, in recent years Hadoop has expanded to include applications for data warehousing (Apache Hive), ETL (Apache Pig), and NoSQL column stores (Apache HBase). In this talk we describe recent work done at Citus Data that makes it possible to run a distributed version of PostgreSQL on top of Hadoop in a manner that combines the rich feature set and low-latency responsiveness of PostgreSQL with the scalability and performance characteristics of Hadoop. This talk will begin with a high level overview of Hadoop that focuses on its distributed storage layer and block-based replication model. Next we will look at the data model of the Apache Hive data warehousing system and explain how it enables features such as schema-on-read, support for semi-structured data, and pluggable storage formats. Finally, we will describe how we leveraged these ideas and Foreign Data Wrappers to build a distributed version of PostgreSQL. This version runs natively on Hadoop clusters and seamlessly integrates with other components in the Hadoop ecosystem.
Point (geometry) Database transaction Operations research Presentation of a group Computer file Adaptive behavior Characteristic polynomial Port scanner Software industry Analytic set Bound state Mereology Event horizon Sequence Bit rate Term (mathematics) Database Process (computing) Information security Physical system Enterprise architecture Home page Product (category theory) Process (computing) Database transaction Information Relational database Projective plane Point cloud Entire function Table (information) Database Data storage device Query language Blog Quicksort Oracle
Context awareness Building Web 2.0 File system Cuboid Diagram Partition (number theory) Sanitary sewer Physical system Enterprise architecture Computer file Multitier architecture Data model Process (computing) Lattice (order) Network topology Data storage device Order (biology) Hard disk drive Self-organization Pattern language Right angle Quicksort Block (periodic table) Data management Resultant Point (geometry) Multitier architecture Characteristic polynomial Data storage device Computer Scalability Number Power (physics) Sequence Term (mathematics) Database Computer hardware Data structure Subtraction Installable File System Form (programming) Addition Multiplication Computer Server (computing) Limit (category theory) Local Group Subject indexing Search engine (computing) Computer hardware Computer network Data center Units of measurement
Context awareness Client (computing) Replication (computing) Food energy Image resolution Bit rate File system Cuboid Software framework Physical system Spacetime Block (periodic table) Structural load Gradient Coordinate system Data storage device Perturbation theory Data storage device Order (biology) Phase transition MiniDisc Right angle Quicksort Reading (process) Trail Server (computing) Implementation Computer file Divisor Connectivity (graph theory) Characteristic polynomial Data storage device Auto mechanic Mass Rule of inference Metadata Scalability Number Average Term (mathematics) Default (computer science) Computer Server (computing) Element (mathematics) Client (computing) Scalability Computer network Vertex (graph theory) Object (grammar) Local ring
Context awareness Code View (database) Strukturierte Daten Data analysis Mereology Facebook File system Physical system Enterprise architecture View (database) File format Relational database Electronic mailing list Sound effect Physicalism Data storage device Mountain pass Data storage device MiniDisc Self-organization Website Right angle Quicksort Mathematical optimization Server (computing) Existence Implementation Sequel Open source Variety (linguistics) Tape drive Scalability Revision control Term (mathematics) Operator (mathematics) Database Computer hardware Subject indexing Energy level Data structure Subtraction Home page Information Forcing (mathematics) Projective plane Scalability Subject indexing Integrated development environment Search engine (computing) File archiver Data center Dependent and independent variables Speech synthesis Game theory Mathematical optimization
Suite (music) Serial port Java applet Decision theory Direction (geometry) Equaliser (mathematics) Semantics (computer science) Formal language Software bug Heegaard splitting Maxima and minima Core dump Extension (kinesiology) Exception handling Physical system Enterprise architecture Electric generator Mapping File format Bit Materialization (paranormal) Functional (mathematics) Data model Process (computing) Drill commands Order (biology) Right angle Quicksort Information security Data type Resultant Row (database) Ocean current Dataflow Server (computing) Numbering scheme Batch processing Service (economics) Overhead (computing) Sequel Computer file Variety (linguistics) Connectivity (graph theory) Data storage device Streaming media Rule of inference Revision control Inclusion map Term (mathematics) Database Operator (mathematics) Data structure Computing platform Form (programming) Authentication Overhead (computing) Graph (mathematics) Information Cellular automaton Projective plane Planning Authoring system Library catalog Domain-specific language Compiler Table (information) Uniform resource locator Logic Query language Universe (mathematics) Vertex (graph theory) Mathematical optimization Library (computing)
Architecture Process (computing) Process (computing) Electric generator Database Vertex (graph theory) Sound effect Quicksort Subtraction Local ring Physical system
Point (geometry) Addition Sequel Wrapper (data mining) Equaliser (mathematics) Connectivity (graph theory) Sound effect Principle of locality Client (computing) Variable (mathematics) Compiler Table (information) Architecture Mathematics Stress (mechanics) Term (mathematics) Query language Device driver Vertex (graph theory) Energy level Website Subtraction Physical system
Query language Computer file Mapping Block (periodic table) Metadata Bit Directory service Client (computing) Mereology Metadata Table (information) Query language Average Synchronization Vertex (graph theory) File system Self-organization Data logger Data management Resultant Partial derivative
Slide rule Query language Scheduling (computing) Context awareness Discrete group Fault-tolerant system Replication (computing) Metadata Mathematics Database Energy level Vertex (graph theory) Computing platform Computer architecture Physical system Process (computing) Block (periodic table) Metadata Heat transfer Streaming media Replication (computing) Table (information) Maxima and minima Query language Data storage device Order (biology) Vertex (graph theory) Quicksort Block (periodic table) Task (computing)
Writing Query language Structural load Code Streaming media Data storage device Vertex (graph theory) Block (periodic table) Limit (category theory) Task (computing) Local Group Replication (computing)
Point (geometry) Reading (process) Trail Numbering scheme Context awareness Sequel Structural load Code Direction (geometry) Food energy Power (physics) 2 (number) Revision control Writing Mathematics Bit rate Hacker (term) Term (mathematics) Computer configuration Database Energy level Extension (kinesiology) Form (programming) Area Pairwise comparison Series (mathematics) Standard deviation Focus (optics) Process (computing) Wrapper (data mining) Electronic mailing list Code Data storage device Local Group Word Query language Vertex (graph theory) Website Right angle Quicksort Data management Reading (process) Local ring
Context awareness Equaliser (mathematics) View (database) File format Mereology Food energy Data compression Computer configuration Single-precision floating-point format Central processing unit Physical system Enterprise architecture Spacetime Wrapper (data mining) File format Block (periodic table) Thermal expansion Instance (computer science) Entire function Demoscene Graph coloring Data storage device MiniDisc Self-organization Quicksort Data management Reading (process) Resultant Point (geometry) Slide rule Computer file Sequel Data storage device Similarity (geometry) Portable communications device Revision control Term (mathematics) Database String (computer science) Energy level Subtraction Plug-in (computing) Task (computing) Series (mathematics) Multiplication Standard deviation System call Table (information) Sign (mathematics) Word Predicate (grammar) Query language Hybrid computer Computer network Vertex (graph theory) Local ring
Point (geometry) Building Enterprise architecture Equaliser (mathematics) Connectivity (graph theory) Characteristic polynomial Tape drive Mereology Scalability Architecture Social class Term (mathematics) Database Operator (mathematics) Energy level Distributed computing Data storage device Subtraction Computing platform Descriptive statistics Computer architecture Physical system Addition Enterprise architecture Electric generator Information Closed set Analytic set Scalability Compiler Intrusion detection system Query language Data storage device System programming Vertex (graph theory) Website Self-organization Quicksort Mathematical optimization Reading (process) Resultant Extension (kinesiology)
Digital electronics Structural load Differential (mechanical device) View (database) Direction (geometry) Source code Function (mathematics) Fault-tolerant system Replication (computing) Mereology Mathematics Strategy game Single-precision floating-point format Core dump File system Central processing unit Partition (number theory) Physical system Algorithm Electric generator Process (computing) Mapping Wrapper (data mining) Structural load Bit Perturbation theory Maxima and minima Data model Arithmetic mean Process (computing) Network topology Data storage device Order (biology) Website Right angle Quicksort Multiplication table Mathematical optimization Data management Asynchronous Transfer Mode Point (geometry) Reading (process) Computer programming Sequel Presentation of a group Computer file Open source Characteristic polynomial Data storage device Control flow Streaming media Scalability Metadata Revision control Broadcasting (networking) Term (mathematics) Database Program slicing Energy level Representation (politics) Software testing Subtraction Metropolitan area network Condition number Run time (program lifecycle phase) Addition Multiplication Standard deviation Information Interactive television Analytic set Principle of locality Scalability Table (information) Hypercube Word Query language Blog Data center Vertex (graph theory) Object (grammar) Local ring
things Carlson lock on the managers science data and also see a member of the PMC in Canada on the Apache Hive project which is related to this project related to do so and in the past averted a bunch of enterprise software companies most recently clutter over I was writing the the hygiene of also spent time Informatica and adapt work on storage security products in Oracle sort of started my you know professional we're working on databases on my other sort of the the things that I wrote that blog post about Tomlin's you know habits which I think a lot of you probably rather that to given much so before this presentation lives up to the point of I anyway
uh given talks on this topic a couple times in the past in from those experiences I've learned it's a good idea to sort of start up front of and making sure that we're all on the same page about what distributed analytic databases but I think you probably 95 per cent of people would use a relational database they automatically think 0 OLT you know the standard sort of you create update delete use case scenario with the transactions in an analytic database is definitely not the whole world he itself you typically referred to as all that database so the kind of things that you really do with all that database including your processing let's say clickstream events or processing log files or processing some kind of event or fact type data and scientifically that people were going to do aggregations on this your big joints but in terms of you the way that the actual sort of like database behaves 1 thing that sort of defines all left use cases is large sequential scans were typically looking at the entire table as part of a query and another thing is they're not really transaction-oriented so a lot of this stuff if you see in all databases don't really applied and then I think I mean this is probably the most important to the rest of this presentation in terms of the performance characteristics of all the databases they are typically I bound to a few looking for sort the performance bottleneck in the system it's not going to be used you typically going to be the rate at which the data is process can actually pull information on this and so lot of work you know over the past 10 or 20 years has gone into like figuring out how to reduce the impact of that bottleneck were eliminated altogether so you 1 thing that
also you relates to to what we're going to discuss is the enterprise storage model and this is you know something that's evolved over let's say the past 20 or 30 years and the the goal of this is to basically solve the problem of availability and accessibility for data in the context of what's a large enterprise you accompanied with 100 thousand employees and terrabytes and possibly even petabytes of data so obviously you know that the data is not going to sit on 1 computer between sit on a bunch of computers and you want somehow to make it accessible to a large number of people in your organization so you need to be able to access it over a network of some sort of and you the sort of solution that's evolved to this over time and and really been a whole industry has been built around and is basically structure your data center such that you have a pool of storage and this could be in the form of NASA or sand and then it's connected that is some kind of switching you network fabric to a pool of compute you resources on the the couple problems with this 1 problem that customers feel antialias cost this is really expensive solution is a lot more expensive than going out and buying a hard drive at Fry's Electronics or best by or something like that and you're talking about spending a thousand upwards of a thousand dollars for terror by probably you even more conventionally and then I'll add to that the fact that the the systems that have sort of built in scalability problem and the scalability problem is that really big pipe that sits between your island of you compute resources and the island of of storage resources so as you attempt to scale-out storage pool during to have to make that pipe bigger and bigger and bigger and in practice I mean they're sort of limits to how big you can make that high and there are also limits to how you consequently how much you can scale sort of the underlying storage tier and what happens is as a consequence is that people tend to sort of partition these resources trees will end up with multiple islands of of storage pools and multiple islands of new computer workers and and that adds on top of an additional manageability problem and a sort of a whole industry that's that's grown around just solving at the helping shuttle data from other 1 storage pool to another to meet Salazar or jobs that you have running in the background but as soon as they did this diagram is sort of a of grossly oversimplification tried it the main points on but only foot out it but you didn't besides just having like a storage pool there actually sense multiple different from 2 years of storage people with different sort of price performance characteristics that so often times in the computer pool you might have an all all-out database and that database will have its own sort of private storage pool which will often be in terms of you know cost perturbed by a much more expensive than sort of your a larger pool of online storage and consequently you fall into this new usage patterns where you're always sort of moving data between these 2 pools but you can't fit all the stuff you want to look at in the sort of fast a small pool of storage and and you as a result like they're a pretty big management headaches so you over 10 years ago at this
point the folks at google were building this gigantic search engine and in order to build the inverted index that they needed to power that search engines they were collecting these gigantic datasets by doing web crawls and I think you they realize pretty early on that but if they they got a hooked on sort of bananas or or sand solution that that would not be just a bottleneck for for the data center but it really be a bottleneck for the business and so the thing that they are defined as a different a different solution to that problem and and they knew that you whatever solution they ended up with it had to both be inexpensive and had to be a very scalable so this sort of design priorities that they set out were 1st of all we're going to base this on commodity hardware and once that the hardware that this is visible on top of to be as inexpensive as possible as a the consequence of that is when go with you inexpensive commodity hardware you expect the stuff to fail all the time and if you deploying a thousand of these boxes you expect probably 1 box to fill every day not multiple boxes so you have to have you a system that can be resilient in the face of these constant failures and halogens so another really at the interesting thing about this is that the solution that they came up with course referred to as the Google File System but in many ways it's not really a conventional file system it's more of what's a distributed bookstore store and then there's some sort of interesting and interview online you can find out which Sean Quinlan who was 1 of the original architects of the system is sort of talks about it you know genesis of how the people came up with this sort of set of priorities and things that they decided were not so important I think 1 of the really critical thing is is that they were worried about supporting the building of Posix file system they were worried about supporting random reads or random right so there's a lot of stuff a lot sort of like a file-system dogma that they just dispensed with very quickly and in the interview with when when he makes the point that since the people who were building GFS were actually the same people who were building the system on top of GFS that were consuming and that they were able to sort of very fluid in terms of defining what the requirements were for the system and I think that's kind of is an interesting thing to think about in terms of you how how you rebuild systems in the future right that sometimes just the sort of like little political disagreements between you know groups at different layers in the system can really Stiny advances like this so you have system that they ended up
with was referred to as the rule file system and so this is sort of a high-level overview of of how it works so that when a file is written to GFS is basically split up into a lower number of fixed size chunks so the original GPS implementation these chunks were 64 megabytes in size which is a lot larger and than the box you find a traditional file system but it makes sense in the context of storing files that you are on average a gradient gigabytes if not potentially even terrabytes insights of and the advantage of course of having big blocks Ernie's 1 really significant energy is that it reduces the metadata load on whatever system they're using to store the file system metadata there's less of it store since the number of objects that needs to to track and keep track of our time and number so another sort of significant design element here is that once that wireless would into chunks those chances and parceled out world was randomly to a set of chunk servers and 3 copies or whatever the replication factor for the files as hot new by default 3 3 copies would then be distributed across the chunk servers and the system and so there are this due in 1 sense is very different from the way that sort of traditional distributed file systems achieve reliability by using something like repairing based ranking mechanism which is more efficient in terms of the amount of disk space using you're able to live safe you use you grade 5 here basically you know trading off like if if the storage space for that additional reliability and so this is a lot less efficient in terms of disk space but because they're actually creating multiple copies of every chunk they do get some other benefits in return so 1 of them is that the load can be distributed very unevenly across the strong servers when the same file is being read by multiple clients so basically a coordinator has the option of of of saying you know these 3 clients reading the same file they don't necessarily have to all talk to the same chunk server know the advantage of this is that the you in the system that the device they expected but not just dead disk failures but they also expected to have chunks server fill expected the whole to just disappear and go down and at that point if you're basically building chunk servers on top of grade rate wouldn't matter right because you can use the chunk servers as entirety so this way with chunk server goes down you have to other chunk servers available where you can read the same block from and finally the other sort of characteristic of this is that when a chunk server goes down since the other sort you basically like all the blocks there on that 1 chunk server have replicas distributed and more uniform fashion over the remaining chunk servers the load that's required the new sort of read those chunks back and copy them to a new chunk server to maintain a replication factor is then distributed evenly across the other servers in the pool so
earlier I mentioned that you have the problem that rule was sort of contending with was twofold they were worried about the cost of of the sort of traditional mass and solution and they were also worried about scalability so I think you in if you look at just the stuff that's described in the original GFS paper and you hold it against these these goals so and this somewhat they articulate in that it was pretty clear that they solve the cost problem in articulated a way of building a very scalable of file system using commodity components but I don't think it if you look at just the GFS paper that they really solve scalability problem because fundamentally it's still client-server type of file system where the clients are all going to pull data across the network and that that network is then going to become the I O bottleneck but it wasn't until a year later when they published the MapReduce paper that the missing puzzle piece appears and it was in the MapReduce paper where they said no actually here subject this works this is not a conventional file system where you have clients the pulling data over to the client in order to do work on the data there instead we're going to take the work and push it over to the data so mapreduce is basically a distributed computing framework that allows you to send work over to the nodes for the data actually exists and by a new sort of maintaining locality you almost completely eliminate this network bottleneck all you MapReduce it does require some network traffic between nodes it has you shuffled sort phase and stuff like that but it is the set of 1st order you've basically eliminated this this a significant bottleneck of and that you I think is the is the key benefits of could do it HDFS it's I think you really interesting that if you just look at the G best paper that is missing piece is not mentioned anywhere there but if you go back and read that paper you can tell that that's what they're thinking and sort of an interesting historical things so
I knew these papers were published in 2003 and 2004 and at around that same time on Doug Cutting and Mike Cafarella working on an open source project called notch which was an outgrowth of Apache Lucene and the goal was to build an open source search engine so based on these papers and they realized immediately hey this is really the way that you will have to build that we need to build a distributed file system equivalent to G as devise our own MapReduce implementation and that's what became the so you do this you these 2 things HDFS which is you produce version of GFS and then their own MapReduce and implementation so that a new project came into existence in 2005 it was initially used quite heavily at Yahoo and Facebook in the last I was 3 or 4 years we've seen sort of more general adoption and within enterprise data centers so you might my my previous employer player has sort of build a business around and helping companies that are you know already using Oracle already using sequel server or something like that to figure out ways of sort of transitioning to this lower cost and more scalable infrastructure but with a factored you people are actually using this industry has given us an opportunity to get a sense for you know what are the practical benefits of it can also where the drawbacks in terms of benefits and I think you at the very top of the list is the weighted HDFS has commodities enterprise storage and this is that really significant effects on in terms of what enterprise organizations are doing with the data and also the amount of data that are actually retaining site so you you in the past and it actually still today date you find the most of large organizations have some data that they maintain in an online fashion in the data center but a lot of the data that they collect is getting rid the tape put on the truck the truck rights to a cave somewhere and is deposited there is no 1 ever looks at it again and you know some of you guys might think I'm joking but the key thing that you should look at Iron Mountain which is 1 of these companies they literally have a case so so that's that's only 1 major thing like companies that were formerly of putting a lot of data on tape never to be seen again are now maintaining in HDFS instead they're able to derive a lot of them from the that data another big advantage is the way that the HDFS allows companies to scale up and in a in a genuinely fashions that having to invest more expensive hardware scale up and it also gives them the ability to to run a system a fault-tolerant manner I think in in a manner that wasn't really available before with other for existing as sense solutions is also introduced a new level of flexibility in terms of the data and that these organizations can operate on so you 1 of the 1 of the things about how do is that it is erased it as a particular it's really you write once read many times file system sort like an online archive file system so the ability to operate on data in place in the format in which it lands on disk is very attractive and do some offers you a flexible tools that allow you basically to extract information from a variety of different formats so what's so the same
time I think you people or where did you as a bunch of drawbacks and chief among these is MapReduce itself but MapReduce is very powerful in the sense that extremely flexible but a lot of that flexibility comes from the fact that it's so should America's of general and also so low levels and so to do even sort of simple operations you're going to end up writing pages of Java code and once you've written this code is more or less impossible to maintain and in europe had trouble remembering a month later what what the what the code you wrote does in your ability to share it with a coworker is you close to 0 and so part of this release the fact that MapReduce doesn't have any schema schema a a really valuable I think many times when people think about schemas in the context of like your relational database they may be think in some ways schemas are the enemy right because oftentimes you'll find that before you maybe you have some data you really want to analyze in the database but before you can even get into the database you have to talk to you deviated EPA is going to spend a couple of days trying to figure out the right schema you for this data so there's a lot of latency a lot of some sort of organizational latency of built into this and and the reason for that is that if you get the schema wrong that's an expensive mistake to me so the I new stimulus in in sort of the traditional context a high-stakes game but they also provide a lot of value because they allow you to view data in a logical fashion as opposed to having to worry about the underlying physical structure of and do without schemas forces you to constantly worry about the underlying physical structure and then the other missing features of I think it quickly summarize by saying that people really want due to look more like a database they want to do that you know they like the scalability they like and the flexibility of with the same time databases look the way they do for a reason evolved over the past 20 or 30 years in response to the needs of users and provide a bunch of features that make a lot of sense when we're trying to do is analyze data from these a new things that help you write code or help you write better code like optimizes indexes and views in there's also this a large ecosystem of tools that run on top of databases but did not speak things like ODBC JDBC and which allow you to do really interesting things with your data like business intelligence tools ETL tools for transforming in manipulating the data and then the things like sequel IT user or other environments for for doing data analysis so the way
that of the Hindu community responded to these shortcomings of the sort of core platform was by building a variety of domain-specific languages on top of it to the idea was no longer have to code your job in MapReduce instead you can use this other language which we compiled under MapReduce and so there are a variety of languages that that that evolve like this rule of course as something called solves all which they published a paper on also have a sequel the cell referred to as tensing but within the 2 world we have things like a real so have a cascading which is kind of like the Java dataflow API and the the uh there there the the DSL that I think has been the most interesting adoption with for the enterprise opportunity is hot and so high you made a very pragmatic decision instead of inventing our own language we're going to stick with the people of the features you see full has some of the rough edges it has some bugs but at least we learn those bugs are over the last 30 years and are easy to work around and also in terms of the entire world basically speaks including all of those ecosystem tools like mentioned earlier so it's sort of like as soon as you was like the Esperanto data-processing except people actually speak it on my so now so so anyway that's that's the situation so 1 thing I had a really wanted was sort of like trying to convey is that most people will you mention may think OK hi this like a sequel to MapReduce compilation execution engine which is very true that's what it is but it's also these 2 other things that I don't think gets enough credit for being and so 1 of those things basically is the data model that it provides an by data model and in the way that it basically allows you to map logical structure in the form of schemas to underlying physical data that can be and basically any format 1 so in order to provide that functionality height has 2 components 1 is the height matters store which basically allows you it's it's you to think of it as hides and Table catalog so the megastore maintains mappings of the form table basically to underlying file in HDFS to HDFS location where the data stored it also maintains mappings of Table 2 Suriname where uh certeza basically these plugin libraries that allow you to handle different formats so with these 2 bits of information when high and executes a query plan which is in the form of basically of graph of MapReduce jobs each 1 of these MapReduce jobs gets sent to a data node which you could think of as a HDFS is version of chunk servers GFS and the the MapReduce job which is expressed in terms of high operators is is executed there but the data as is of of HDFS is filtered through 1 of these search the example would be let's say a CSP history or adjacent sturdy or and perhaps of RC files URI which is a high it's called the format but in this way that it allows you something really cool which I think you will start to refer to as the on read as opposed to scheme on right so this is sort of you that the really I think the really cool benefit of high and the thing that I expect will sort of beehives lasting impact on you know the that the database world not you know sequel to MapReduce but this very like a flexible way of mapping logical structure onto underlying physical data that so basically a asserting a mother with 2 reasons for serialize the surrealists so seriously or the so when you're deserializing you're basically reading from a file and what you were pulling out of it or rows splitting the columns with types of it in the reverse direction you feed in basically the structures the type of columns and then it writes about the underlying file suit you were saying could you asserting that goes from CSV 2 of its adjacent or something like that you wouldn't actually used to service for that prosecutors adjacent certainly and he history and what would you will be to construct 2 tables 1 of which is based on top of Jason the other topics and then uh 1 is about hide is that it's added a variety of sort of extensions to base equal including semantics that allow you to easily do things like stream data from 1 table into another table so that answers your question so I
you know I think high salt a bunch of problems but in many ways has also been sort of a victim of its own success because but as soon as you start making something looks like a database people like a this is great I like that looks like a database but I wish you would cover that remaining 50 % and make it look completely like dates you where are the universe missing a sequel feature that I'm used to using where is the the authentication authorization system so here's a data type that I expect you to support participle standard but I don't see it here and then another problem with high from the standpoint of many users is caused by the fact that it's basically delegating execution the MapReduce so mapreduce is a very powerful tool but it makes some compromises which are probably not the best compromise is to make in the context of of the database engine like 5 and the thing compromise that it makes is that it's batch oriented so it guarantees basically really high throughput and but it has this sort of built in latency as a result the fact that you when you start a job at the top the JobTracker the JobTracker has to go and parcel out jobs to on each 1 of the data nodes so that the as a result even if you were running a query against just a small sliver of data would say a megabyte or a hundred megabytes or something like that so the the minimum sort of current time on and where is going to be 15 to 30 seconds and if you ran the same query let's say on PostgreSQL equal on a single node is expected to complete you know in under a 2nd so on with 5 this sort of like a whole sort of way of interacting with the database where you do sort of very small exploratory queries to try to gain like an understanding of you know the way the data was actually there it's not really feasible or practical with high so but in the past year to maybe 2 years I think it's fair to say a certain generation of distributed analytic databases has emerged from which are you largely being designed sort of salt this set of problems and in particular the problem of of the latency overhead and inefficiencies they're introduced by using MapReduce and so examples of these systems would include things like pollen from quarter of the Apache drill project The Rules of Travel is as an example this also within rule insights DVD is another example of that and so on the talk a little bit about
size to be but before I do I wanna show us sort of the the common like architectural difference that sets these new databases apart from let's say the previous generation of NTT databases the worst over the previous decade so this looks a lot like you know what what we've seen recently where people have been adapting and databases to run on top of it due and what they're effectively doing is creating a connector between the worker nodes in the entity cluster and the data nodes in HDFS and the problem with this approach is that you have the the I O bottleneck between these 2 there's so you're not achieving data locality for your jobs and a system like this is just not going to scale and with the solution you know is that this is still 1 illustrated graphically it's pretty straightforward you get rid of the the dedicated and keep the data nodes and you take the processes that worker processes that were running on those nodes and you
co-locate them on each 1 of the data and so in effect we're
doing then is pushing the work down to the data as opposed to I'm pulling the work circling the data over to the work and this new allows systems like this to scale out in the same way that variable skill levels something like MapReduce so I
wanna go into more detail about how this actually plays out a system like site the so sigh Stevie is built on top of PostgreSQL people if we look at the components of the system on the left hand side we have put stress equal clients ODBC JDBC driver I think this is no 1 really significant point I want to make that as far as a client is concerned as far as the user is concerned situs DB looks almost identical to post press equal so that you can basically continues to seek war different new tools that you're familiar with the the support for sequel is you know virtually the same all of the same system tables and there and stuff like that but but in terms of changes we've made up on the master node once again is moralist stock was pressed the equal with the addition of our distributed query compiler an execution engine on and then on each 1 of the HDFS data nodes and we co-locate what is in effect a stock of copy of equal of along with a foreign data wrapper that we've written that allows us to basically of read from HDFS so and so
when they go to a little bit more detail about how this actually works by I'm giving an example of of how queries executed so to start off when you create a table on this side is the master and you would basically specify a directory or file in HDFS that that a logical table needs to map to and then as part of a prior offline step the master node sinks file system metadata off of the HDFS nano this basically allows it to map it was the master node to map HDFS file name to a set of blocks and data nodes in the underlying storage pool and then as once the table is created and what's the metadata has been saying the master node talks to each 1 of the worker nodes and it creates a foreign table per block on each 1 of the data nodes using the HDFS foreign data record so then subsequently would
say the the user of submits a query a simple aggregation query on unifying the average salary of managers you know in your organization so as a 1st step we take that Global queries and we rewrite it as a set of as a set of queries a fragment stories that can be executed on the worker nodes and and those queries vendor there's this effectively 1 query per foreign table were each in table maps to a walk on each 1 of those data nodes the worker nodes execute those fragments queries generate partial result sets which are then returned to you last year which merges those results together and returns of a final answer to on the client so there are some
sort of interesting details that I wanted to to cover of a slightly lower level so on 1 of them has to do with block awareness so you know I had that that previous slide ratio that know in a sense what we're doing is taking a traditional MDP database architecture and the most significant change making is that were co-locating the worker processes on the data but I don't want to leave you with the impression that that's all you have to do in order to achieved data-locality because the problem is that if that's if that was what you do those and kicking worker processes running on the data nodes are still going to have to basically fetch data from all of neighbors in order to satisfy the query is very important that the master node with this scheduling a query when discret scheduling those fragments queries that it sends them to the correct data node so that those data nodes can do those readable locally as we refer to as what where and so you know the other
the thing is worth mentioning is fault tolerance on so our solution to fault tolerance I think it would be summarized by saying that were basically leveraging the best features of the 2 platforms that were building on top of due and post-crisis so what happens if the master node fails well we take advantage of post vesicle streaming replication to make sure that the table metadata is copied over to a hot standby and that there no solves the problem at the national level and the worker nodes since we are actually delegating the storage problem as well as the fault tolerance problem to HDFS if 1 of those nodes fails and we already know that the actual data has been replicated it to a bunch of neighbors completely fail over to that and and then the sort of missing piece the piece that we actually had implemented on our own is the ability to that if a process if a query is in the process of executing detect it and node which is responsible for some of these fragments freeze has gone down into reassign those queries dynamic how to other replicas in the system
so you know 1 of the things that I think of that is the last of the roughly is right the news group
this is that you don't there about history and and you wouldn't care about quickly because it would determine it can't talk there can communicate with any of the 3 replicas in that situation so I mean there there is a limit to like you get out of her her her question questions on so
you back when back when I was your 1st talking to the guys at science data you about joining the company and they were talking about what they had done with the site as a deviant how they build on top of PostgreSQL sequel I wasn't very familiar with postprocedural at the time of my name you know I was mainly familiar with things like bicycle in particular with things like 5 and as they were telling me about some the different features that was received that they were leveraging mention foreign data wrappers and I got really excited about because foreign data wrappers electrically realized are but really a much more powerful version of hide series on and you are using for data wrappers we are able to achieve the same sort of like novel benefits seperate from for a traditional database is the ability to do on read as opposed to right which introduces this whole new level of flexibility in very significantly scheme on right is more 1st Monday is more than about just flexibility it's also in a sense about performance because if you can read data in place regardless of the format without 1st having to go through this process of loading in other words translating it or converting it into the native format of the database before you can read it that's a significant and new performance rate in terms of overall latency just as as a quick side the book couple weeks ago or maybe a month ago someone posted a performance comparison between high and Amazon redshift and this was you know carry on Hacker News and they're just like high level of thing was 0 you redshift like does this query in 30 seconds and it takes hide from you know 10 minutes or something like that but what they left out was that it had taken 17 hours for them the will of the data in direction before they can query a lot so serious very important to consider both lighting kind of query as well as time of query we're talking about these problems on so another significant thing about for data wrappers is this is a public API it means that but we can use this as an extension point and continue to build on it without having to worry about 14 code and without foreign data wrappers we could define this API on our but it would have been you know a very brittle thing but to have to maintain and I think more often than not what you see a DL happen in the past with other companies that have built distributed databases of of post press the and that basically 40 freely on 1 of our design goes from the very beginning has been to structure site is DB in such a way that is coupled enough from post-crisis Chris equal there with each new version of course Chris equal all we can say in a matter of weeks we based the changes that we made on the new version of and then delivered to our customers both you know the the value that we died as well as the new features that the community has provided that something that we're really excited about this know that's probably a better question from my co-workers need to answer this the management of money in my mind my mind I and I don't think this is a list list of all the energy of the nodes so we just say that I love for and interrupt I think that they are like the coolest thing ever but if there's 1 thing I don't like about 4 and interact with that name on because I think it seriously undersells the power of this API and in particular when you're trying to talk to a customer and was used foreign data wrapper like and stuff also reason for new wrappers allows us to all these cool things mean they here 2 words like a foreign
and wrapper which in this context don't necessarily sound that especially a tracking about their locality in safe for anything you know they're obviously like pulling data over the network and and rapid just sounds like some you know like his cross simple method that would be a good option but then the problem was equal matters you have to say like 0 yeah but that's like 10 or 15 year old standard and that sounds boring right you what you want it to be new and you want to have school buzzword attached to his 1st area which is a form of focus group will will work on this a good idea so what I've started saying
is that it really for data wrappers and are pluggable Storage Engines xdPX portable storage and and that that's to me is what it looks lot like it's what allows you know how and and 1 of the ways in fact in which data wrappers for more powerful than height series is that they allow you to filter pushed out and once you have the ability to filter push down on it becomes very practical to build a foreign data wrapper on top of a comma format so addressing how viral no hands how many people are familiar with like all the data formats but equal so some people but not enough that I should explain what I'm talking about so on traditional databases when they lay data out on disk the user row-major format which means that if you're doing a table scan you basically really role 1 then wrote to then row 3 etc. etc. any column-major format data is laid out like a column 1 column 2 column 3 or perhaps a chunk of column 1 a chunk of column 2 etc. and then the next chunk of column 1 column 2 so there is sort of like hybrids formats like rc file packs where these things a trunk like that the advantage of color formats especially for all that databases were most queries are going to end up triggering pool tables and is the if you have a predicate on that I will listen more basic let's say that you're only looking at 1 of the columns of your query as opposed to doing it selects you actually want to read of those other columns of debts and we know that I always work pretty much like the biggest bottlenecks if you can do is seek over just a large expanse of data on disk considered during a read as you have to do with column-major format that is like the single biggest performance benefit you can get right right there the cool thing about Palmer formats on is and the impact they have on compression so people known for a while now it if you have an I bound task of 1 way of improving the performance is actually compressor data and there is a sort of a little bit counterintuitive because you think 0 well you have found using compressed data I have to devote a CPU resources to during the compression and decompression but it turns out that that the amount of time it takes to do that is much less than the amount of time that you say in terms of doing the I O to transfer the compressed data from disk into catch and color formats make this even better because it turns out that most of the compression algorithms that people use it will actually g of a new generate higher compression ratios if the stuff that you're trying to compression is like similar similar words let's say that you have a string column in in column a string column in column so if you have to compress the data in a row-major format but the compression ratio will not be as good as if you were the other case were using common format we compress entire in column followed by an entire string call on so you this is like a you 1 reasons why were really really big on data wrappers for us the ability to be able to like plug-in multiple different color formats and it seems like a huge win especially in the context of since right now there are several different continuing cholera file format standards and it's nice to actually have the option of supporting multiple formats as opposed to just 1 and so and then this is
that my final thought about foreign data wrappers so someone the audience mentions sequel immediate earlier which was the original like ANSI sequel standard that sort of outlined this idea of adding this sort of like a standard for creating a level of indirection between the the tables that you see in 1 database actually having redirected data that's stored somewhere else and the use case that people were thinking about 10 or 15 years ago or whatever it was when they came up with this standard was really database federation and the problem that they were trying to solve was this new enterprise organizations don't have 1 database they probably have 10 or or 100 databases and that creates a big management problem especially when you really want to be able to have a sort of a unified view of all of your data so if you're able to create tables that behind the scenes actually redirected data that's managed by another table that's a huge sort of management with them but what it's equal energy you at the time I don't think anyone was really thinking about was what this would look like in more of a distributed systems context and so you think about this way like let's suppose that you have a single post for instance and you have a Mongo DB foreign data wrapper and you are able from your it's single puts received instance to push a filtered down to longer D. but anything that that results from that sort of like a query that has to be pulled back across the network so that under a single node let's say you can do the rest of the aggregation that you want to do or something like that now if you have a distributed version of PostgreSQL equal you can actually co-locate the post-stressed equal worker processes on the nodes that are used by your distributed blocks start say Mongo DB H space or for example HDFS and then you can push the filter down into that a 1 debate but was you pull the results back up instead of sending them over the network you can apply a part of the aggregation attainable local fashion so the amount of I O that has to be done over the network is significantly reduced so you know that the point frightening cases that like this federation and there's distributed federation distributed federations of cool idea yes so OK it's so support insecure updates in sight TV and that's part of being like all collected but in terms of of supporting you transaction-oriented things in the underlying block storage for example longer DB I think I don't know what would you say all actually received the questions for the and just like can that the like 2 more slides with questions OK Bullmore
point so new 1 of the other reasons why we decided to build size TNtuple post-crisis equal I mean there there are a couple reasons right like why reinvent the wheel when you can and build on top of something that's already great the reason being that you are building a distributed database is more than just building a distributed query compiler execution engine it's also like all of these tools but it sit around it and you know I definitely learned that the hard way from working on high you know where you implement 1 thing and you were like that's great what about these 10 other things that you have finished and so on but I think in another thing that you we really excited about is the fact that you have these these 2 communities on which are both very vibrant which are doing great work of and we really see size as a way of sort of combining those 2 communities and what were these providing people on either side with access to the great tools that are produced by the other so for people who are you are a part of the Hindu community here's a chance to use an enterprise-class database enterprise-class feature set on top of Hadoop while maintaining you know the the sort of like characteristics of a do in terms of scalability and cost-effectiveness and for was pressing users you choose a chance to use the tools they already familiar with but to use them you on top of a a platform that allows you to analyze you petabytes of data and also you know in addition that we 1st introduced this citing new level of flexibility with on read and features like that so I just
wanted you could sort of you get some closing thoughts I think you're the first one is that it's a little bit unfortunate that we have this this name to do which lumps together MapReduce and HDFS because I think as a result people think that like these 2 components are inextricably linked together and a pair of equal value and I mean this is my opinion but the HDFS is much much much more valuable than MapReduce MapReduce is not the only way to distribute computation across the the data nodes your cluster and in size db is 1 example of another way right instead of using MapReduce we have our own custom operator there and and you we we get a more efficient system of more expressive system as a result of but I think the other thing is that the HDFS is really what is sort of driving industry adoption of do so and because of the way that it commodities as storage you know effectively sort of frees you from the the clutches of the nascent sand mafia and also I it's really changing the way that organizations manage data I mean the fact that it's so inexpensive compared to now as in San technology means that this data you were previously just writing the tape and effectively chucking your now keeping on site and once you have an on-site of course will if we have a 1 up where it why not try to extract some information from it and that no internal is sort of a driving this work to try to figure out how we can build a new databases that look familiar that provide this familiar feature set on top of this new storage where in storage there is significantly different from the storage layer that the already existing enterprise and it's these differences this sort of require us to devise this new architecture so it may seem like a little stronger say like this is a new generation of analytic databases but I do think I genuinely think that they are significantly different from the databases that you evolved over the previous decade and I think that the new description is warranted so that
others are opposing but I'm just has to do with you know summary of sort of the core benefits of and sequel on Hadoop and and more it no more specifically postcursor quantitative and to you in terms of sort of new features things that were possible for we have this nice abilities certainty couple the logical schema from the underlying physical data so you don't have to do loads of us you don't have to wait for your deviate to model your data before you can analyze it so it has both but I think an impact on the way that you interact with your co-workers and and your ability to like quickly you run queries against things it also has a very significant impact on the way you you utilize your underlying storage if you not to load data into a database you're not just transforming data also copy you end up with multiple copies of the same data in multiple different storage systems across on your datacenter doing something on really means that you have 1 copy anything even have the uh you humanproject multiple different logical views on to that data that is sort of an additional level of flexibility on the core benefit is this since were building on top of the horizontally scalable file system and and leveraging know that sort of the principle that makes it scalable data locality were able to maintain the same characteristics and the same goes for fault tolerance on and then I think that the key thing it sort of differentiate this new generation of a virtually databases from high is that by replacing MapReduce were able to achieve low-latency which means that you are now able to do interactive queries against small slices of data that you once again really changes the way so you know you can interact with
data were hiring but we really like PostgreSQL equal we have composed equal what it seems like this is a pretty good audience you know for during recruiting and so if any of you folks are interested in perhaps getting a job changing jobs should definitely talk to us and that's the only for our website and we also have a bunch of interesting blog posts and also the cool thing is you can actually download a copy of size to be installed on your machine and pick it up for test right we have easy 2 images available on as well as of the think Watson sent us the role of documents on the questions that the license some you'd actually and to parts of the open source and parts of Iraq source so all the foreign data representation no source code that we use for sinking metadata off of the hidden meaning is open source in the past the right now are basically the distributed letter that we're in and I don't think that we use speaking my co-workers I don't think we have any sort of like inherent objection to at some point in the future possibly open source this it's more just like you know we can't do that right now right on the side with this right so I I would say like uh if you're getting close like let's say 10 terabytes of data that's probably where this becomes you practical solutions and most most cases like I think we have 2 different types of customers so 1 customer at a sort of like archetypes are people who already have had to cluster set up of and they're using hyper they're like man I can't stand the latency or I really wishes you provided better support for sequel or I our use postprocedural and non using do I really wish I could run you want on top of the other then we have the other customers who are you doing things at let's say like terabyte scale right so they have a running pose receive but it's starting to sort of slowdown little bit pick up and they know that they need to go for a distributed solution but they're not necessarily interested in setting up a on Hadoop cluster because that has this additional management there I didn't mention the presentation but sigh history actually has 2 different deployment so in 1 mode you deployed on top of the due in HDFS basically manages the storage in the other deployment model every worker node basically runs postprocedural but it runs on top of the native file system like the exterior Institute in which case it just uses POS Chris equals Storage Engines so for managing the the table for this circuit but we would blood not actually suppose Chris for wrappers would actually create basically like a fragment tables on each 1 of the uh on each 1 of the workers using basically like the size of this fragment tables is configured and then when the user to lose data into this sort of like distributed was Chris database 1 we saw it in parallel then Stream data into these different fragment tables on workers yes for the whole of no it would allow so I have push cruise break so it's is this in the up to the implementor of the 4 near wrapper to to support things like that so you you have the ability right now to on uh that's a extract of filter conditions which is something there are easy to push W after we express that in terms of the API of the thing you're talking to terms of the thing you're building the wrapper on top of the change of change the runtime of the new book at the at let's take this question offline uh I'm not sure how to answer and the question of this the questions if you have a complex queries can you gather all of data to 1 of equal size so I'm not sure I understand why would would probably friends all the mean or OK but it should so for for example the joint right to join that references multiple tables so they're different strategies in distributed databases for doing joint so 1 strategy that we support right now as a broadcast joint where if you're joining a small table for large table you can on broadcast basic your copy the small table to each 1 of the the worker nodes so it has sort of a local copy of there are some other strategies for doing some big table joins in a distributed fashion that we're working on on implementing right now but you know the the basic things that but the basic issue right we're doing a big table join is that you do have to transfer data between the different worker nodes because you can expect that any given will have all the data locally stored data it needs in order to process like it's fragment of the joint on and the the algorithms that have evolved for handling this just joint he's busy we try to minimize the amount of data that they need to transfer to the different strategies in our case from it's a little bit more complicated since but we don't were not able to basically to assume anything about the underlying data in each block and since the storage is completely delegated to HDFS that we can do some of the optimizations that you see in other sort of more traditional entity databases where they on the tree sort or they hatch partition the data across the nodes of node using strategies like that of some of the questions yes you but you basically use our version of the copy command so like literally the 2 different modes for work for operating as a deviant so if you remove 1 and where you're actually using outputs Chris equals storage where instead of using the do and then you would use our distributed copy command you would basically say here's the file that want you copy it into the following table and it would then and distributed fashion this stream that data into each 1 of the worker nodes on but you know in the typical sort of like all lab of use case scenario the data that you want to process is like a fact data or it is a log data so it's not so the direction going to manipulate you basically wanted you joins a aggregations of or analytical queries on on on that information on and then in the cases of new deployment mode to words were not have HDFS you would use of the Hadoop tools for basically loading data and so that you get something called this CPU which is distributed copies program is to use the new questions yes what the and 1 of the most of all of you have to a think of right now that actually that's minister because he's been working on this and I many of them of the of the of the some of the money the post this version we can't afford the diversion we can I knew the questions left and the problem of I'll soon but this because of that that you the you know that kind discussion of the query to reducing the master node level in order to replicate the table metadata there the master and maintained so this is basically information that maps the global table names to all of the foreign tables that exist on each 1 of the worker nodes to this table metadata both x or if global master level as well as the mappings from the global level to other work level now this is what we want to do because there's a single master a single active master data you could have through replication multiple hot standbys and we're not interested in replicating you know this is real data because no real data exists on the master it's all metadata about basically where you have the shot refuges stored on each 1 of the worker nodes the questions OK thank you


  955 ms - page object


AV-Portal 3.9.1 (0da88e96ae8dbbf323d1005dc12c7aa41dfc5a31)