PostgreSQL on Hadoop

CC-Namensnennung - keine kommerzielle Nutzung - Weitergabe unter gleichen Bedingungen 3.0 Unported:
Sie dürfen das Werk bzw. den Inhalt zu jedem legalen und nicht-kommerziellen Zweck nutzen, verändern und in unveränderter oder veränderter Form vervielfältigen, verbreiten und öffentlich zugänglich machen, sofern Sie den Namen des Autors/Rechteinhabers in der von ihm festgelegten Weise nennen und das Werk bzw. diesen Inhalt auch in veränderter Form nur unter den Bedingungen dieser Lizenz weitergeben

Identifikatoren

10.5446/19057 (DOI)

Herausgeber

PGCon - PostgreSQL Conference for Users and Developers, Andrea Ross

Erscheinungsjahr

2013

Sprache

Englisch

Produktionsort

Ottawa, Canada

Inhaltliche Metadaten

Fachgebiet

Informatik

Genre

Konferenz/Talk

Abstract

Bridging the Divide with Distributed Foreign Tables Apache Hadoop is an open-source framework that enables the construction of distributed, data-intensive applications running on clusters of commodity hardware. Building on a foundation initially composed of the MapReduce programming model and Hadoop Distributed Filesystem, in recent years Hadoop has expanded to include applications for data warehousing (Apache Hive), ETL (Apache Pig), and NoSQL column stores (Apache HBase). In this talk we describe recent work done at Citus Data that makes it possible to run a distributed version of PostgreSQL on top of Hadoop in a manner that combines the rich feature set and low-latency responsiveness of PostgreSQL with the scalability and performance characteristics of Hadoop. This talk will begin with a high level overview of Hadoop that focuses on its distributed storage layer and block-based replication model. Next we will look at the data model of the Apache Hive data warehousing system and explain how it enables features such as schema-on-read, support for semi-structured data, and pluggable storage formats. Finally, we will describe how we leveraged these ideas and Foreign Data Wrappers to build a distributed version of PostgreSQL. This version runs natively on Hadoop clusters and seamlessly integrates with other components in the Hadoop ecosystem.