Apache Arrow and Substrait, the secret foundations of Data Engineering
Formal metadata
Title: Apache Arrow and Substrait, the secret foundations of Data Engineering
Series: EuroPython 2023 (talk 133 of 141)
License: CC Attribution - NonCommercial - ShareAlike 4.0 International. You may use, modify and redistribute this work for any legal, non-commercial purpose, provided the author is credited and derived versions are shared under the same license.
DOI: 10.5446/68639
Language: English
Transcript: English (automatically generated)
00:04
So I would like to talk to you about two technologies that I think have the potential to shape the future of data engineering. Those two technologies are Apache Arrow and Substrait.
00:21
Apache Arrow, for those of you who don't know it, was born as an in-memory format to represent data in a way that is understood by all libraries and languages. It was a polyglot, language-agnostic format
00:41
for sharing data between different pieces of software. Over the years it evolved from just a format into a library for working with that format. So now it is a library that supports data interchange,
01:01
a library that supports modifying the in-memory data, transmitting that format over the network, saving it to files and loading it back from files, any kind of IO. It also provides a built-in compute engine that works on top
01:21
of that format, and all kinds of other capabilities. So to begin, I would like to introduce Apache Arrow a little bit, because it is a way that we can start collaborating across very different languages. In the past, when you wanted to share some data
01:42
between different libraries or environments, for example sharing your Pandas data frames with Spark or something like that, those pieces of software had to go through a process of copying the data and converting it into a format they understood,
02:03
and only then could you work with that data. What Apache Arrow is trying to do instead is to let all those pieces of software share the same data without any cost of copying or converting it,
02:22
because they all understand the Apache Arrow format. The foundation of Apache Arrow data is obviously arrays. I hope no R developers were offended by the previous slide.
02:41
The foundation of the data that we can represent in Arrow is arrays, because most of the time in data engineering and data science you care about columns of data: you want to run an operation on a set of multiple rows. In our case as Python users, that is represented as a pyarrow array.
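A minimal sketch of what that looks like (the values are illustrative):

    import pyarrow as pa

    # Arrays of numbers, strings, or nested structures
    numbers = pa.array([1, 2, 3, 4])
    names = pa.array(["Alice", "Bob", "Carol"])
    points = pa.array([{"x": 1, "y": 2}, {"x": 3, "y": 4}])

    print(numbers.type)   # int64
    print(names.type)     # string
    print(points.type)    # struct<x: int64, y: int64>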
03:06
At this layer, pyarrow is not too different from what you might already be used to with NumPy. We can perform standard kinds of operations like creating an array, and that array can contain any kind of data: numbers,
03:25
strings, or structures. And you might be asking, why would I ever use pyarrow, given that many libraries already understand the NumPy format? Well, because NumPy was born without thinking too much
03:44
about the portability of that data at the time. So the format is not very optimized for transmission over networks, or for keeping reads from memory fast.
04:01
For example, typical cases where NumPy does not perform well are when you want to work with strings, with structured objects, or with missing data. In all those cases NumPy does not provide a built-in, optimized implementation;
04:25
it mostly relies on Python. So if you store a string, it will be a Python object. That means that when we want to work over the data
04:41
in the array, we will be slower, because instead of having a contiguous area of memory we just have pointers to Python objects, so we cannot really implement a high-performance algorithm that processes them in a vectorized way. With PyArrow, or better, with the Arrow format,
05:04
everything is stored in contiguous areas of memory. PyArrow tries to give that as a general guarantee for any kind of data, and that includes strings. For example, below you can see the way strings are stored. That means that if I want to look for a string in the array,
05:23
I can just perform a linear scan over the data. If I want to perform a transformation, I can apply a vectorized function to all the strings, and so on. Also, differently from NumPy, Arrow provides built-in support for missing data.
05:42
If a row of your data is not available, you can flag it as missing through a validity bitmap that Arrow keeps alongside the values. You might say, well,
06:03
I could just store zero or something like that to signal a missing value. Yes, you could, but then what happens if zero also occurs as valid data? Or what happens if you want to know why that data is missing? A very common case when working with surveys is
06:21
that you have missing data, but there are multiple reasons why it might be missing. It might be missing because the person didn't answer the question, or because the person answered "I don't know", or because the person answered "I don't want to answer". By having both the values array and the validity bits,
06:44
we can store both that a value is missing and, in the values themselves, why it is missing, for example because the person did not answer. Then obviously you say, well, but I don't usually want to work with just one column.
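A small illustration of that validity bitmap in pyarrow (the survey-style values are made up):

    import pyarrow as pa
    import pyarrow.compute as pc

    # None entries are tracked in a separate validity bitmap,
    # so 0 stays a perfectly valid value.
    answers = pa.array([5, 0, None, 3, None])

    print(answers.null_count)   # 2
    print(answers.is_null())    # [false, false, true, false, true]
    print(pc.sum(answers))      # 8, nulls are skipped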
07:02
I usually want to work with a more complex set of data and understand the relations between the different columns, or something like that. Great. That's why Arrow also provides support for columnar data, mostly in the form of tables, but also in the form of record batches and other ways you can manage tabular data,
07:27
depending on what you need to do. Usually, as a user, you will end up using tables, because those are the most similar to what you can do in something like a Pandas DataFrame or other libraries
07:41
that you might be used to. The foundation of a table is not the array itself; it's actually a chunked array. That's because you might be appending more data, extending the columns, or performing those kinds of operations, and the fact that we can use a chunked array means
08:00
that those operations are zero cost, because we don't have to reallocate the whole array: we just append a chunk at the end. And here is a simple example of a table. In general, you just create a table from a set of columns, assigning a schema to the table,
08:21
and the schema is no more than names and, usually, types, but the types are already known by the arrays themselves. On top of the tables, you could be wondering: great, I have arrays, I have tables, so what?
08:40
Why should I use them? Well, first of all, because, as we said, if you represent your data as Arrow arrays, Arrow tables or Arrow record batches, you will be able to share the data across different libraries without incurring any cost of conversion.
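For instance, building a table from columns looks roughly like this (column names are illustrative):

    import pyarrow as pa

    # A table is built from named columns; each column is a chunked array.
    table = pa.table({
        "first_name": ["Alice", "Bob", "Carol"],
        "country": ["Canada", "Italy", "Canada"],
        "age": [34, 28, 45],
    })

    print(table.schema)                     # names and types of the columns
    print(table.column("age").num_chunks)   # columns are ChunkedArrays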
09:01
And then, obviously, in case you don't want to share the data with something else but just want to run some simple exploration of your data, Arrow implements a built-in compute engine based on the Arrow format itself. Now, let's look at how Arrow allows us
09:21
to store data on disk. Usually, people will rely on the Parquet reader, which is very optimized for loading data from Parquet into the Arrow format itself. But if you want, for various reasons, there is also the native Arrow on-disk format. That allows you to do things like memory-mapping your arrays
09:44
from disk, because the on-disk format is equivalent to the in-memory format, so there is no cost of loading or conversion. For that reason it's usually very fast, but it will be bigger: Parquet implements a whole set of compression algorithms
10:02
and dictionary encoding, another form of more effective and efficient storage. But by virtue of the fact that you can memory-map the data on disk, the Arrow native format allows you to easily work with data that is bigger than your memory, because you just let the kernel
10:22
swap pages in and out of the memory-mapped file. And we have seen that once we load data, we have it in the Arrow format, which allows us to share that data with many different libraries. You might be wondering, why should I even share that data
10:40
with different libraries? Well, one very simple use case that can give you an idea of why this might be helpful is loading a Parquet file. Now, Pandas is based on Apache Arrow, for those who don't know; since version 2.0 it uses Apache Arrow for various use cases.
11:02
But before version 2.0, if you wanted to load a Parquet file into Pandas, it was faster to load it with Arrow and then convert it to a Pandas data frame. It was an order of magnitude faster, so not just a small improvement.
11:20
That's because the Parquet loader was more optimized for loading data into the Arrow format, and then there was no cost of converting that data from Arrow to Pandas, because Pandas understood the Arrow format natively. Well, in reality there was some conversion cost: if you had strings, as we have seen,
11:42
Pandas relies on NumPy, and NumPy does not really have great support for strings. Or for some other kinds of columns, a conversion is required. But most basic types, like numerical types, incur no cost of conversion at all.
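A hedged sketch of the two loading paths mentioned above, plus the hand-off to Pandas; the file names data.parquet and data.arrow are hypothetical:

    import pyarrow as pa
    import pyarrow.ipc as ipc
    import pyarrow.parquet as pq

    # Path 1: read Parquet straight into an Arrow table, then hand it to Pandas.
    table = pq.read_table("data.parquet")
    df = table.to_pandas()

    # Path 2: memory-map the native Arrow IPC format; pages are only pulled
    # into RAM when they are actually touched, so the file can exceed memory.
    source = pa.memory_map("data.arrow", "r")
    mapped = ipc.open_file(source).read_all()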
12:04
And then, we have a compute engine in Arrow. So if I want to perform transformations or analysis on top of the Arrow data itself, yes, I can ship that data to Pandas and perform all my transformations in Pandas. Or I could just run the same transformations
12:22
in Arrow itself, using the Acero compute engine which ships together with Arrow. Generally, Acero is faster than Pandas for most use cases, because it's implemented natively in C++ with vectorized operations, while most compute functions in Pandas, even when optimized and implemented in Cython, are generally not as fast as the Acero kernels.
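Those compute kernels are exposed in Python through pyarrow.compute; a small sketch (the column values are made up):

    import pyarrow as pa
    import pyarrow.compute as pc

    answers = pa.array(["yes", "no", "yes", None, "yes"])

    print(pc.unique(answers))        # distinct values in the column
    print(pc.value_counts(answers))  # each value with its number of occurrences
    print(pc.count(answers))         # counts non-null values only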
12:40
Arrow is also not just a format for disk and memory, but also for networking. So if you want to ship the data across the network,
13:02
you can rely on Arrow Flight, which is a protocol to share Arrow data across networks. Here is a brief example of what you can do if you use the Acero compute engine to perform operations over a table. Generally, you have the basic operations like group-by aggregations,
13:23
filtering, joins, those kinds of things. Pandas has a much richer set of operations, but, as I told you, Acero is generally very fast. In this case, for example, we can see what finding the unique values of an array looks like.
13:43
This is a very common kind of operation: you want to know how many occurrences of a value you have in an array, which might mean how many people answered a specific question of my survey, going back to the example of surveys;
14:00
then you want to run a unique count. Doing that in NumPy took about one and a half seconds. Doing the same exact operation over the same exact data in Arrow took about 0.4 seconds. So you can see that it's usually three or four times faster
14:21
than running the same operations in NumPy. And I told you a little bit about Flight. Flight is a protocol that we can use to ship data, and you can easily implement a Flight server and a Flight client by relying on the classes that are built into pyarrow itself.
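A minimal sketch of such a pair, assuming an in-memory table we want to serve; the host, port, ticket and dataset are all illustrative:

    import pyarrow as pa
    import pyarrow.flight as flight

    class TinyFlightServer(flight.FlightServerBase):
        """Serves a single in-memory table for any ticket it receives."""

        def __init__(self, location="grpc://0.0.0.0:8815"):
            super().__init__(location)
            self._table = pa.table({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

        def do_get(self, context, ticket):
            # Stream the table back as Arrow record batches over gRPC.
            return flight.RecordBatchStream(self._table)

    # Server side (blocks, so run it in a separate process):
    # TinyFlightServer().serve()

    # Client side: fetch the stream and materialize it as an Arrow table.
    client = flight.connect("grpc://localhost:8815")
    table = client.do_get(flight.Ticket(b"anything")).read_all()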
14:40
Flight is mostly based on gRPC: it's a binary-only protocol that uses gRPC for the metadata and the Arrow format for the data itself. The reason why Flight is important is that it, like most of Arrow, is optimized for performance,
15:03
for shipping data across different libraries with very little overhead. And when I talk about very little overhead, I mean it for real. In the case of shipping something like five million or a billion records through an ODBC driver,
15:23
you can see that the difference is huge. With five million records, using ODBC takes something like 60 seconds. Doing the same exact thing using Arrow Flight takes something like five seconds; I don't remember the exact value, but that's the order of magnitude.
15:43
And if we start talking about bigger and bigger data, that difference increases. The reason is that there is never any cost of conversion: the data goes into the network layer without any conversion cost,
16:01
and out of the network layer into the memory of the receiving end without any conversion cost, and the data is generally stored in a very compact and optimized form. So we have seen that with Arrow we can load data
16:21
from disk very fast, we can manipulate it very fast, and we can ship it across the network very fast. So we have, in practice, all the building blocks we might need to work with the data. When working with data, there are two things that we have to do. One is obviously having a way to represent
16:42
and manipulate the data itself. The other is being able to express the manipulations or queries that we want to run on the data. Now, we know that many libraries can support Arrow data.
17:02
They can understand any data you give them in the Arrow format, and they can give you back data in the Arrow format so you can then forward it to something else. But that would be incomplete if we were only able to ship the data without being able
17:22
to tell the library what to do with it. That's the reason why, apart from the Arrow stack itself, we also have support for Substrait. Substrait is a language that allows you to represent query plans themselves.
17:42
So as much as Arrow is a lingua franca for sharing data across different software, Substrait is a lingua franca for sharing queries across different software. It's a way to implement an API-independent
18:04
and language-agnostic form of queries. Imagine something like SQL, but where there are no dialects: everyone understands the same exact queries. And imagine something like SQL, but binary, so optimized
18:21
for computation and for software instead of humans. And imagine something like SQL that is decomposable, so I can take one query and split it into sub-pieces. Why is that important? Imagine a world
18:41
where you send a query to a central compute engine, and that compute engine splits the query into smaller pieces and sends, for example, the filtering part to the data-loading side. For example, some people are thinking of implementing support for basic Substrait filters directly in disk firmware.
19:07
That means that when I need to load data from a file, I will just take the filter part of the Substrait query and forward it to the disk firmware, and the disk firmware will already give me back the pieces
19:22
of the file that contain only the rows matching that filter. Then I can forward those rows to the compute engine, which will run the rest of the query. Or, for example, if there is a query that performs a machine-learning operation and I want to use Spark ML
19:43
for that specific part, that specific function, I can ship the data with Arrow and that piece of the query with Substrait to Spark, and Spark will process that part of the query over my data. And all this is for free, because we don't have
20:04
to convert the data or the query into any other format: they understand Substrait queries and Arrow data natively. Well, I used Spark as an example because it was easy, but support for Substrait in Spark is actually something they're working
20:23
on right now. There is an Intel-backed project named Gluten, which is a fork of Spark that supports Substrait, but the work of implementing Substrait is now happening in the core repository of Spark itself.
20:42
So in the future we will be able to just use PySpark with Substrait. And when I have those Substrait queries and I want to ship them to different software, how can I do that? Well, that's where Flight SQL comes into play.
21:04
Flight SQL is a version of Flight which is optimized for shipping queries and shipping the results of those queries back from the compute engine. So, for example, you could use any driver that is able
21:22
to speak Flight SQL and use it to ship the query in Substrait format, or in SQL if you want; Flight SQL obviously, as you can guess from the name, also supports SQL. And once you ship that query, the endpoint,
21:40
the receiver, will answer you back with Arrow data. That means we get both the benefits of Substrait and the benefits of Arrow in terms of performance and compatibility of the data. To make things easier to work with, because Flight SQL is
22:02
in practice very low level, it sits at the network level, so yes, you could use it, it's not super hard, but it's also not very convenient, ADBC, Arrow Database Connectivity, was born.
22:21
ADBC is in practice able to connect to any database that supports Arrow data, and especially those that support Flight SQL. At the moment, you can connect with ADBC to Postgres, DuckDB, SQLite, and many more.
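A hedged sketch of what that looks like with the SQLite ADBC driver (the driver package is installed separately, e.g. adbc-driver-sqlite, and the database and table are made up):

    import adbc_driver_sqlite.dbapi as adbc_sqlite

    # DB-API style connection, but results come back as Arrow tables.
    with adbc_sqlite.connect("example.db") as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT name, country FROM people WHERE country = 'Canada'")
            table = cur.fetch_arrow_table()  # a pyarrow.Table, no row-by-row conversion

    print(table.num_rows)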
22:42
The reason why this is important is that if you go through ADBC, you benefit from all the performance improvements that you get when connecting through an Arrow-native format. In practice, you gain the same performance that we saw when using Flight.
23:03
Well, it might not be exactly on par, because there are cases where, say, the Postgres protocol might be less efficient than Flight itself. But for general cases, I think we can say that you will get as much performance improvement as you could expect from Flight itself.
23:28
So, if you are wondering, the main difference between Flight SQL and ADBC is that both are Arrow-native protocols, and ADBC can use Flight SQL underneath if you want.
23:41
But ADBC is general: as a user, you don't know and you don't care how you are connecting to the database; ADBC abstracts all of that away for you. Flight SQL, instead, has to be implemented specifically for each database.
24:01
So if you use Flight SQL, you need to speak the exact dialect of that database, and not just in the form of the queries that you send, because the query dialect is something you will have to adhere to also when using ADBC, but also in terms of how the database ships the metadata,
24:22
how the database ships back the pointer to the query result. All of that is not abstracted away for you by Flight SQL; as I told you, it's a fairly database-specific driver.
24:42
So it's lower level than ADBC. But if you use ADBC, you in practice get the benefits you are used to getting from ODBC or JDBC, but with the performance of an Arrow-native system. To recap what we have talked about so far, we have several ways
25:09
to connect and send queries to a compute engine. The most common ones you will probably want to use as a Python user are, first, pyarrow itself, which is tightly coupled
25:25
with the Acero engine. If I send a query to run over my data, pyarrow only supports Acero, so the query runs on Acero. The benefit is that you incur no
25:43
query conversion cost and no cost of shipping the query and the data back and forth: pyarrow directly shares, let's say, the memory pointer to the data with Acero,
26:00
Acero performs your query and shares the result back with pyarrow. That means that for some cases it can be very effective, but it also means that it won't be easy to scale, because it's a local engine that runs on your own system; it's not something that you can deploy remotely across,
26:23
say, 20 nodes or something like that. The reason why I'm mentioning it is that it's frequently a very good way to work when exploring data. You have data that is already in Arrow format somehow, maybe because you loaded it from disk, maybe because you got it back
26:43
from another application, and you can directly use pyarrow on your machine to view the data, perform basic manipulation, filter it, aggregate it, try to understand the shape of that data. But when working in a more production-ready environment,
27:04
you probably want to use something like Ibis. For the people who don't know it, Ibis is a front end to perform common analytics on top of different compute engines or backends.
27:24
So, for example, if I have data in Postgres, I can use Ibis to connect to Postgres, design my query, write it in Ibis, send the query to Postgres and get back the data. Ibis natively supports getting data back
27:43
in the Arrow format or in the Pandas format. So if you are used to Pandas, you can just get your results back as a Pandas data frame, and Ibis obviously also supports Pandas as a backend.
28:02
From the data you get back, you can then perform further transformations using Ibis. That is convenient because it allows you to write a code base that is agnostic to your backend. Suppose that one day you are executing your queries on Postgres,
28:21
and after a year of running on Postgres you find there are limitations and you cannot further improve your application; then you can shift to something like DuckDB, and you won't have to rewrite a single line of code, because both of them are supported by Ibis.
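A small, hedged sketch of that backend-agnostic style with the DuckDB backend (the table and column names are invented, and a recent Ibis version is assumed):

    import ibis

    # Swapping ibis.duckdb.connect(...) for ibis.postgres.connect(...) would
    # leave the rest of this code unchanged.
    con = ibis.duckdb.connect()                   # in-memory DuckDB
    people = con.read_parquet("people.parquet")   # register a Parquet file as a table

    filtered = people.filter(people.country == "Canada")
    expr = filtered.group_by("gender").aggregate(n=filtered.count())

    print(expr.to_pandas())    # results as a Pandas DataFrame
    print(expr.to_pyarrow())   # or as a pyarrow Table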
28:41
And how does Ibis communicate with those backends? By using SQL or Substrait, for the backends that understand one of them. It means that the queries you write in Ibis will be compiled to Substrait or SQL
29:03
and shipped to the backend. How do they get shipped? Well, we are in the Arrow ecosystem, so they can get shipped using ADBC, they can get shipped using Flight SQL, or, in the case of backends that don't support any of the Arrow-native formats,
29:20
they can be shipped with a dedicated custom backend that Ibis usually provides. Inside Ibis there are many backends, I think more than 10 of them or something like that. Once you have your data in Arrow format and your queries in Substrait format,
29:42
you can send them to any compute engine that is compatible with those two. So, for example, using Ibis I could ship that query to Acero itself if I want, or I can ship it to DuckDB, to Meta's Velox, to cuDF or to DataFusion.
30:02
That means that I am able to choose whichever solution works best for my use case. Once those compute engines receive the query, they will be able to execute it on any data that is supported by Arrow itself.
30:21
So, for example, you can run the query with Acero on your Parquet files or something like that, and you can have your Parquet files stored in any place that is supported by the Arrow filesystems, like S3 or things like that. So we have many building blocks that, if you think about it,
30:43
in practice are all the pieces that you need to implement a database for common cases. So one little example that I want to show you: it's just a toy, not something serious that I'm suggesting you should be using in production,
31:00
but it gives you a very concrete sense of what can be done using Arrow, Substrait and all the libraries that are part of that ecosystem. What you can see here is a small database that I've written in about 50 lines of code. For this database, we are going to use Parquet
31:24
as the storage format for our tables, we are going to use Substrait as the representation of our queries, we are going to use SQL as the language that you, as a user, can use to define your queries,
31:42
and we are going to use Arrow tables as the format in which we get back the results of our queries. Well, the first few lines of code are obviously very simple, they are just related to setting up a way for the end user to pass the query:
32:03
we just accept the query as the first argument of our program. As the next step, we want to understand which tables are available, so which files we can run the query against and what their schema is.
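That discovery step might look roughly like this; Parquet footers carry the schema, so no data needs to be loaded yet (the file layout and names are illustrative):

    import glob
    import pathlib
    import sys

    import pyarrow.parquet as pq

    query = sys.argv[1]   # the SQL text is the first command-line argument

    # Every Parquet file in the working directory becomes a "table",
    # keyed by the file name without the extension.
    schemas = {
        pathlib.Path(path).stem: pq.read_schema(path)
        for path in glob.glob("*.parquet")
    }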
32:23
So we just scan for every Parquet file and read its schema. Why do we want to do that before we even have an actual query to run? Because to be able to compile the query into something the compute engine understands,
32:41
you will usually need the schema: the compute engine needs to know whether a column holds integers or floats, so that if your query contains a filter that compares a column against a constant, say x = 10, it knows whether to use the implementation of that comparison for floats or for integers.
33:02
Obviously, as you can imagine, for integers implementing a comparison like that is very simple, because they are exact. Implementing it for floats requires a bit more work, because the value you are passing might not parse to the same exact value that the compute engine has in memory,
33:22
even though you wrote the same number. There might be an approximation occurring when parsing the string into the actual in-memory format, and that can mean that the data you have in your Parquet file is slightly different from the value you wrote in your query,
33:44
due to the approximation that is inherent in floats. Once we have our query as the argument and our tables, what we can do is parse the SQL, and we can do that using Ibis.
34:03
As I told you, Ibis is a tool to represent queries, and by virtue of it being dedicated to representing queries, it also provides a parser for SQL, because in some cases you might not be willing
34:21
to write a query in the Ibis format itself. If you have ever used SQLAlchemy, the Ibis query language is very similar. But maybe you have an old query in a legacy system, and you don't want to convert it
34:42
to the Ibis format as the first step. You first want to set everything up, and once you see that the system is working, you can start slowly converting your queries to Ibis. That's the reason why Ibis has native support for SQL, so I can directly run a SQL query by passing the string to Ibis.
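As a sketch of that step: recent Ibis releases expose a SQL parser that takes the SQL text plus a catalog of table schemas and returns an ordinary Ibis expression. The exact entry point and catalog format have shifted between Ibis versions, so treat the names here as assumptions to check against your installed version:

    import ibis

    # Hypothetical catalog: table name -> schema of the matching Parquet file.
    catalog = {"people": ibis.schema({"first_name": "string",
                                      "surname": "string",
                                      "country": "string"})}

    # ibis.parse_sql is assumed here; it turns the SQL string into an Ibis
    # expression that can later be compiled to Substrait.
    expr = ibis.parse_sql(
        "SELECT first_name, surname FROM people WHERE country = 'Canada'",
        catalog,
    )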
35:06
For the same reason, Ibis also comes with a compiler to Substrait, the ibis-substrait package. So once we have parsed the query into Ibis, we can ask ibis-substrait to compile the query
35:22
we parsed into Substrait itself. And once we have the query in Substrait form, we can run it on the Acero compute engine, which we get for free in pyarrow itself. The only two things that we need to provide
35:40
to Acero are the query itself and a table provider. The table provider is just a lookup function: when there is the name of a table in the query, it allows Acero to find the table,
36:01
that is, the actual pyarrow table that matches that name. So it's just a lookup map from names to pyarrow tables. And in this case, you can see that the way I do it is very straightforward: I just look for a Parquet file
36:21
with the same exact name as the table. So in your query you write the name of the Parquet file instead of a table name, and I read that table. Here you can see that I'm actually reading the data only for the tables that you mention in your query,
36:41
so I'm not going to read everything into memory and then run queries on top of it; we only read into memory the individual tables that we need. Once we have sent our query to Acero, we get back the answer in Arrow format. That means that the answer will be another table,
37:02
with only the columns and rows that we specified in our SQL query. In this case I also convert the table to Pandas, just for the sake of printing it in a way people are more used to.
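Putting the whole pipeline together, a condensed, hedged sketch of such a toy engine is shown below. It assumes the ibis-substrait package and a recent pyarrow, a Parquet file per table in the working directory, and, for brevity, it builds the Ibis expression directly instead of parsing SQL; the file and column names are invented.

    import pyarrow as pa
    import pyarrow.parquet as pq
    import pyarrow.substrait as substrait
    import ibis
    from ibis_substrait.compiler.core import SubstraitCompiler

    # 1. Declare the "table" we expect to find as people.parquet on disk.
    people = ibis.table(
        ibis.schema({"first_name": "string", "surname": "string", "country": "string"}),
        name="people",
    )

    # 2. The query: in the real demo this comes from parsing the user's SQL.
    expr = (
        people.filter(people.country == "Canada")
              .select("first_name", "surname")
              .limit(5)
    )

    # 3. Compile the Ibis expression to a serialized Substrait plan.
    plan = SubstraitCompiler().compile(expr)

    # 4. Acero resolves table names through this callback: name -> Parquet file.
    def table_provider(names, schema=None):
        return pq.read_table(f"{names[0]}.parquet")

    # 5. Run the plan on Acero and print the result through Pandas.
    reader = substrait.run_query(
        pa.py_buffer(plan.SerializeToString()),
        table_provider=table_provider,
    )
    print(reader.read_all().to_pandas())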
37:20
So, to see this in action, given a few Parquet files that I had, I tried to run it. The first case is a simple query that finds every person living in Canada, asking for the first name, surname and title
37:43
and limiting the results to five, because I just want to explore what the data looks like. And if I run my query through the little code base we have seen before, it will answer with that data, a preview of the data in table format.
38:02
But I can also run more complex things. Suppose that I want to know the gender of all the people that I have in my database: I could write a SQL query that does a group-by count to find the occurrences
38:22
of each gender among all the people living in Canada. That's it. Those are two very simple queries, but the purpose was to show you how, with just 50 lines of code, we were actually able to implement a working SQL database. The next step, if you want,
38:42
might be to implement network support, decoupling the client and the server by using Flight SQL, for example. At that point you would have a whole stack based on Arrow that never involves any cost of converting the data. And if I want,
39:02
I can easily replace the three lines of code that run the query with something different, like Meta's Velox. For the people who don't know it, Velox is a very optimized compute engine that Meta wrote for their own analytics; it has been released as open source
39:21
and it's going to be the foundation of Presto, which is the database system Meta will use in the future. So that's all. I hope I gave you a good overview of everything you can find in the Arrow ecosystem. There is obviously much more than just the things that I've shown you,
39:42
and I hope you will go out and experiment with it, because more and more libraries are transitioning to the Arrow format and to Substrait so that they can share data with each other. For example, you can run XGBoost on top of your Arrow data,
40:05
you can use many different libraries, or you can use GeoArrow to represent geographic data. The ecosystem is growing, so this is the right time to start trying it. Questions? Amazing.
40:21
Thank you so much. Wonderful. So, do we have any questions from the audience? Wonderful.
40:41
Thank you for the talk. I know some people who are running analytics on Amazon Athena, which lets you store files in various formats on S3 and then run SQL queries or what have you. The formats supported include ORC,
41:03
Avro and Parquet, which seem to be Apache projects. Do you know if there is any work to add the Arrow formats, or is maybe the Parquet format close enough that it would be simple for Amazon to add? Or, alternatively,
41:21
is there a path for experimenting with this, using Arrow or the Ibis stack, to try running analytics with them instead of Athena?
41:41
Well, there are solutions that will do the same things you would do with Athena, on other systems. DuckDB is a good example: it's a database designed for running analytics on top of Parquet files.
42:00
And the Arrow library provides built-in support for many formats. As you mentioned, we have Parquet, we have ORC, and we are growing support for things like Iceberg. So we are trying to cover all the formats that people might have their data in.
42:22
Obviously, it's a different world: it means moving from a cloud system to something that you might have to manage yourself. But there are cases where companies are starting to provide cloud-based solutions.
42:41
I think the DuckDB people recently founded a company, MotherDuck, and I believe their plan is to provide cloud environments for DuckDB. So yes, in theory, if you have your files on S3 and they are in Parquet format, you don't even need to touch your Athena deployment.
43:02
You can just take a piece of software that supports Arrow, have it connect to S3, and run the queries on top of your Parquet files. So you can start experimenting with an Arrow solution in parallel with your current Athena deployment. Okay, thank you. Wonderful.
43:20
Thank you. So that's all the time we had for Q&A, I'm so sorry. You can find Alessandro around in the corridors, just grab him and ask your other questions. Also, we had a request for your slides, because the talk was really good, to be uploaded on Discord. So there we go.
43:41
Thank you so much again, everyone. Thank you.