PostgreSQL on Hadoop

Formal Metadata

Title
PostgreSQL on Hadoop
Alternative Title
Distributed Analytic Databases
Title of Series
Number of Parts
25
Author
Contributors
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers
Publisher
Release Date
Language
Production Place
Ottawa, Canada

Content Metadata

Subject Area
Genre
Abstract
Bridging the Divide with Distributed Foreign Tables

Apache Hadoop is an open-source framework that enables the construction of distributed, data-intensive applications running on clusters of commodity hardware. Building on a foundation initially composed of the MapReduce programming model and the Hadoop Distributed File System (HDFS), Hadoop has expanded in recent years to include applications for data warehousing (Apache Hive), ETL (Apache Pig), and NoSQL column stores (Apache HBase). In this talk we describe recent work done at Citus Data that makes it possible to run a distributed version of PostgreSQL on top of Hadoop in a manner that combines the rich feature set and low-latency responsiveness of PostgreSQL with the scalability and performance characteristics of Hadoop.

The talk will begin with a high-level overview of Hadoop that focuses on its distributed storage layer and block-based replication model. Next, we will look at the data model of the Apache Hive data warehousing system and explain how it enables features such as schema-on-read, support for semi-structured data, and pluggable storage formats. Finally, we will describe how we leveraged these ideas and Foreign Data Wrappers to build a distributed version of PostgreSQL that runs natively on Hadoop clusters and integrates seamlessly with other components in the Hadoop ecosystem.