
ETL pipeline to achieve reliability at scale


Formal Metadata

Title
ETL pipeline to achieve reliability at scale
Number of Parts
132
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
In an online betting exchange, thousands of money-related transactions are generated per minute. This data flow transforms a common and, in general, tedious task such as accounting into an interesting big data engineering problem. At Smarkets, accounting reports serve two main purposes: housekeeping of our financial operations and documentation for the relevant regulation authorities. In both cases, reliability and accuracy are crucial in the final result. The fact that these reports are generated daily, the need to cope with failure when retrieving data from previous days, and the fast-growing transaction volume obsoleted the original accounting system and required a new pipeline that could scale. This talk presents the ETL pipeline designed to meet the constraints highlighted above, and explains the motivations behind the tech stack chosen for the job, which includes Python3, Luigi and Spark among others. These topics will be covered by describing the main technical problems solved with our design:
- Fault tolerance and reliability, i.e. the ability to identify faulty steps and rerun only those instead of the whole pipeline.
- Fast input/output.
- Fast computations.
Transcript: English (auto-generated)
I'm a software engineer at Smarkets and today I'm going to talk about the ETL pipeline that we built to generate the accounting reports. I'm going to focus on the tech stack and the reasoning behind the technologies that we chose. Smarkets is an online betting exchange where people can place bets on different
events, mainly sports, but we also support other kinds of events such as political elections. In the exchange, many money-related transactions are generated. Those include deposits, withdrawals, orders to place a bet, cancel a bet, and so on.
All of these transactions need to be processed to generate the accounting reports, which include accounting statistics such as the total amount of money that somebody has placed in bets over a month, the total amount of deposits, withdrawals, and so on.
These reports serve two main purposes. The first one is that they allow us to have control over the money that comes to Smarkets and secondly, they provide documentation for the relevant regulators so that they can know how we handle money at Smarkets.
The previous accounting pipeline was designed back in 2013 and at that point, the number of transactions that it needed to handle was below 190,000. The massive business growth at Smarkets during the last four years made this number of transactions increase by over an order of magnitude, and now the number of transactions that the pipeline needs to process is more than 8.8 million. The previous pipeline was not able to handle this number of transactions. The main problem of the pipeline was that it was a collection of scripts without any
formal dependency definition between them, and this created two issues. First, it was difficult to identify errors, and secondly, even if you managed to identify where the error was coming from, it was difficult to know which steps of the pipeline needed to be rerun in order to regenerate the accounting reports.
Apart from that, the system was really slow. We are supposed to generate the accounting reports daily and it was taking more than 24 hours to run. And finally, this pipeline used as persistent storage a volume mounted into the host running the pipeline. This volume was quite expensive and required maintenance to ensure that it was not running low on disk. At this point, we decided that the best solution was to redesign the whole pipeline. This diagram represents the main tasks that this pipeline needs to do. First, we need to fetch the transactions from the exchange and we need to generate
transaction files with those transactions. Afterwards, we need to process these transaction files to compute the daily and monthly account statistics and finally, using these account statistics, we need to generate the final accounting reports. The main requirements of the pipeline are fault tolerance and reliability.
If something goes wrong, we need to be aware of it and we should fix it quickly by running those steps of the pipeline that are affected by the issue. In terms of storage, we need fast reads and writes, high availability, high durability
and the storage should be cheap. We also need good processing performance: it shouldn't take more than a couple of hours to generate the accounting reports. Finally, we need the pipeline to be scalable. The number of transactions at Smarkets continues to grow and we don't want to have to redesign the whole pipeline anytime soon.
In the rest of the presentation, I'm going to describe the design decisions that we made to meet these requirements and also the technologies that we chose. The accounting pipeline involves fairly long batch jobs and things can go wrong while they are running.
In particular, in our case, the communication with the exchange to fetch the transactions may fail. In order to provide fault tolerance and reliability in this scenario, we made the following design decisions. We store the transactions per day and we also compute the financial stats per day. So if something goes wrong on a particular day, we just need to recompute the financial
stats for that day and not for the whole month. Sometimes things go wrong in the exchange, and this creates problems such as missing transactions or other sorts of data corruption. In order to reduce the impact of these issues on the accounting pipeline, we always
compute the stats for the last two days' worth of transactions. And finally, we broke down the pipeline into modular Luigi tasks. Luigi is a Python library that allows you to define dependencies between tasks and it handles the dependency resolution for you. By breaking down the pipeline into Luigi tasks, it's really easy to identify when
things go wrong, which steps of the pipeline are affected, and to rerun only those steps instead of the whole pipeline to regenerate the accounting reports. This is a simplified version of a task that we have in the pipeline, which basically generates a human-readable report with account statistics.
A Luigi task is a Python class that in general defines three methods. The requires method allows us to declare all the dependencies of the task. In this case, the generateHumanReadableAccountingReport task depends on the output of another Luigi task that generates a file with account statistics, but in a binary format. The run method is where the processing takes place: in this case, reading the input file with the account statistics and generating the TSV file with those stats. And the output method allows us to define the target of the task; in this case, the report that we want to generate.
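A minimal sketch of what such a pair of Luigi tasks might look like; the file paths, the upstream stats format and the sample data are invented for illustration, this is not the exact code shown on the slide:

```python
import datetime

import luigi


class GenerateAccountingStats(luigi.Task):
    """Stand-in for the upstream task that produces the account statistics."""
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("stats-%s.csv" % self.date)  # invented path

    def run(self):
        with self.output().open("w") as out:
            out.write("account_id,deposits,withdrawals\n42,100.0,25.0\n")


class GenerateHumanReadableAccountingReport(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        # Declare the dependency; Luigi resolves the graph and only reruns
        # tasks whose output target is missing.
        return GenerateAccountingStats(date=self.date)

    def run(self):
        # The processing step: read the stats and emit a TSV report.
        with self.input().open() as src, self.output().open("w") as dst:
            for line in src:
                dst.write(line.replace(",", "\t"))

    def output(self):
        # The target of the task: the report that we want to generate.
        return luigi.LocalTarget("report-%s.tsv" % self.date)  # invented path


if __name__ == "__main__":
    luigi.build([GenerateHumanReadableAccountingReport(date=datetime.date(2017, 7, 11))],
                local_scheduler=True)
```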
This graph is a simplification of the dependency graph generated by the Luigi Central Scheduler. The node at the top represents the task that we trigger.
In this case, that will be generateHumanReadableAccountingReport. And below it, you can see all the levels of dependencies: generateHumanReadableAccountingReport depends on the output of generateAccountingReport, which in turn depends on the output of many generateAccountingMonthlyStats tasks.
The color of the nodes indicates the status of the task: yellow means pending, blue means running, and green means completed. The next requirement that we wanted to achieve was efficient storage. In order to meet this requirement, we focused on two aspects: the format of the files generated by the pipeline and also where to store all these
files. Regarding the format, instead of going for a conventional row-based format like TSV or CSV, we decided to use the columnar format Parquet. The difference between a row-based and a columnar format is the way data is stored
on disk. In a row-based format, the values of the rows are stored sequentially on disk, and this is a good idea if our access pattern consists of accessing the values of particular records.
On the contrary, in a columnar format, the values of the columns are stored sequentially on disk, and this offers good performance for analytical tasks like the ones in this pipeline, since it allows us to fetch only those columns that need to be processed instead of having to load the whole file into memory, and this minimizes the amount of I/O. Apart from that, since data of the same type is stored together, type-specific encodings can be used, and general compression algorithms work better, which maximizes the compression ratio of these files and further reduces I/O. Parquet files can be loaded into Pandas data frames, and Parquet is also supported by the whole Hadoop ecosystem.
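As a rough illustration of the column-projection benefit described above, here is a small Pandas sketch; it assumes a Parquet engine such as pyarrow is installed, and the data is made up:

```python
import pandas as pd

# A small transactions frame written to Parquet.
df = pd.DataFrame({
    "account_id": [1, 2, 1],
    "amount": [10.0, -5.0, 2.5],
    "kind": ["deposit", "withdrawal", "bet"],
})
df.to_parquet("transactions.parquet")

# Column projection: only the 'amount' column is read from disk, which is
# where the I/O saving of the columnar layout comes from.
amounts = pd.read_parquet("transactions.parquet", columns=["amount"])
print(amounts["amount"].sum())  # 7.5
```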
In terms of persistent storage, we decided to go with Amazon S3, since it satisfies all the requirements that we were looking for. It provides high durability: for regulation purposes, we need to keep the accounting reports for several years, so high durability is very important for us. High availability: we should be able to access the reports whenever we want. Low maintenance: we don't need to worry about running low on disk. Amazon S3 is quite cheap. It also allows us to decouple the processing from the storage.
And what this means is that we can choose the instances of the pipeline based on our processing needs, instead of having to worry about high disk requirements, since all the data that we actually want to persist can be stored in S3. S3 can be accessed from Python using the libraries Boto or Boto3, and it comes with a nice web interface where you can check all the data that you have stored.
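A minimal sketch of that access with Boto3; the bucket and key names are invented:

```python
import boto3

s3 = boto3.client("s3")

# Upload a day's Parquet output to S3.
s3.upload_file("transactions-2017-07-11.parquet",
               "example-accounting-bucket",
               "transactions/2017-07-11.parquet")

# Download it again, e.g. for local inspection.
s3.download_file("example-accounting-bucket",
                 "transactions/2017-07-11.parquet",
                 "/tmp/transactions.parquet")
```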
The next requirement that we wanted to meet was good processing performance. We wanted fast data processing, and we also wanted an engine that was able to scale. That's why we decided to go with Spark. Spark is a general-purpose data processing engine.
And what Spark does is it breaks down the processing jobs into tasks and identifies those tasks that can be run in parallel on different data partitions. And it builds its own execution plans. By doing so, Spark can do a lot of processing in parallel.
Another feature that allows Spark to be really fast is that it keeps data in memory when possible instead of storing intermediate results in disk. And Spark comes with Python support through the PySpark library.
At the core of Spark, we have the RDDs, which are the fundamental unit of data in Spark. RDDs are resilient because they are immutable, and they are fault tolerant. They are also distributed because they are partitioned across multiple nodes in the Spark cluster.
And they are a data set because they hold data. There are two kinds of operations that can be applied to RDDs: transformations and actions. A transformation applies a function on the RDD and creates a new RDD. Examples of transformations are map, filter and aggregate. Actions, on the other hand, return a final result or write data to external storage. Transformations in Spark are lazy: they are not executed right after they are called. Instead, the transformation itself is saved, together with a reference to the data that it modifies. This is called the Spark lineage, and it allows Spark to be very efficient and also fault tolerant. The transformations are only executed when an action is triggered.
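A small PySpark sketch of lazy transformations followed by an action; the sample data is made up:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

transactions = sc.parallelize([("deposit", 10.0), ("bet", -2.0), ("deposit", 5.0)])

# Transformations are lazy: these lines only record the lineage,
# no work is executed on the cluster yet.
deposits = transactions.filter(lambda t: t[0] == "deposit").map(lambda t: t[1])

# The action triggers execution of the whole lineage.
print(deposits.sum())  # 15.0

sc.stop()
```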
In our accounting pipeline, we didn't use RDDs directly; we decided to work with Spark data frames instead. Spark data frames are units of data organized in columns and built on top of RDDs, but their performance is better than that of raw RDDs, since optimizations are applied before the actual operations are executed. The data frames API is also more user-friendly than the RDD one.
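A hedged sketch of the same kind of aggregation with Spark data frames, assuming Spark 2.x's SparkSession API; the S3 paths are invented (on EMR, s3:// paths work via EMRFS):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("df-demo").getOrCreate()

df = spark.read.parquet("s3://example-bucket/transactions/2017-07-11/")

# The optimizer can prune columns and reorder this plan before execution,
# which is why data frames usually outperform hand-written RDD code.
daily_stats = df.groupBy("account_id").agg(F.sum("amount").alias("total"))
daily_stats.write.parquet("s3://example-bucket/stats/2017-07-11/")
```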
I'm going to explain how a Spark application runs. Spark follows a master-slave architecture with a central coordinator called the driver and several distributed workers called executors. The driver instantiates the Spark context, which is in charge of breaking down the processing
job into tasks and creating the execution plans. Once Spark has these execution plans, the task scheduler within the Spark context is going to ask the cluster manager for executors to run these tasks. Spark has its own cluster manager
and it also supports other cluster managers such as Hadoop YARN. Spark jobs can be triggered from Luigi: Luigi comes with a PySparkTask that can be extended to create custom Spark jobs.
In this case, all the Spark operations can be defined in the main method of the class. So here, for example, in this task, we are creating a report with accounting statistics for a particular account by filtering out the rest of the account IDs.
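A rough sketch of what such a task might look like, using luigi.contrib.spark.PySparkTask; the input and output paths are invented, and this is not the exact task from the slide:

```python
import luigi
from luigi.contrib.spark import PySparkTask


class GenerateAccountReport(PySparkTask):
    """Filter the account statistics down to a single account."""
    account_id = luigi.IntParameter()

    def output(self):
        # Invented output location.
        return luigi.LocalTarget("reports/account-%d.parquet" % self.account_id)

    def main(self, sc, *args):
        # PySparkTask hands us the SparkContext; the Spark operations live here.
        from pyspark.sql import SQLContext
        sql_context = SQLContext(sc)
        stats = sql_context.read.parquet("stats/monthly/")  # invented input path
        mine = stats.filter(stats.account_id == self.account_id)
        mine.write.parquet(self.output().path)
```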
The final requirement that we wanted to achieve was scalability. In order to meet this requirement, instead of having Spark running on a single node together with the rest of the pipeline components, we wanted to configure Spark to run on a multi-node cluster.
Instead of configuring our own cluster, we decided to use Spark on EMR. EMR provides fast deployments. It takes around 10 minutes to provision a cluster. It is quite easy to use. Once you know the types of instances that you want for your pipeline and your software
requirements, doing the configuration is quite easy. It is really flexible. It allows you to choose among many different kinds of instances, frameworks to install, and you can even install external software. It comes with the EMR file system, which integrates with S3. So all the logs of the cluster together with the data generated by the cluster can be stored in S3.
The cluster can be shut down once the processing job is done without any real data loss: all the data that you want to persist can be stored in S3. The cost of running the cluster is quite low, and you only pay while the cluster is running.
It comes with a nice web interface where you can check the configuration of the cluster, the tasks that you've run in the cluster, the logs, and so on. I'm going to explain a little bit how Spark runs on EMR.
EMR has two kinds of nodes, a master node and several slave nodes. The master node distributes the data and the tasks across the rest of the nodes, checks the status of the cluster and also the status of the tasks running on the cluster.
And the slave nodes are in charge of running the tasks and also storing the data on the file system of the cluster. EMR uses YARN as the resource manager to allocate resources to the tasks submitted to the cluster.
When we submit a Spark application to EMR, the application is going to run on several YARN containers in the slave nodes. The first thing that happens is that the Spark driver instantiates the Spark context.
And this Spark context is going to create the execution plan. Once we have this execution plan, the Spark context is going to ask the YARN resource manager for executors to run these tasks. The Spark executors which are running on different YARN containers are going to register with the Spark context.
And the Spark context can then start sending tasks for execution to the Spark executors.
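For illustration, this is roughly how a Spark application can be submitted to a running EMR cluster as a step using Boto3; the cluster id, script location and arguments are placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

# Submit a Spark application to an existing cluster as an EMR step.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
    Steps=[{
        "Name": "daily-account-stats",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://example-bucket/jobs/daily_account_stats.py"],
        },
    }],
)
```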
To finish, I'm going to summarize the main steps involved in the accounting pipeline. First, we use Jenkins to trigger the accounting job. The reason behind this is that Jenkins allows us to schedule the job so that it runs daily, and it's also the central place where we have most of our batch jobs, so we can easily monitor that they are all running fine.
The first thing that this accounting job is going to do is create the accounting container with the latest image from the registry. And once this container is up and running, the Luigi Central Scheduler will be started. Then, a Luigi task that has as requirements
all the Luigi tasks that generate final reports is going to be triggered. The first Luigi task that needs Spark for processing is going to create the Spark cluster on EMR. And once this Spark cluster is up and running, all the Luigi tasks that need Spark processing
are going to submit Spark applications to EMR. Once the last Luigi task that needs Spark is done, the Spark cluster will be destroyed. Finally, all the data generated by the pipeline together with all the reports can be found in S3.
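A hedged sketch of that cluster lifecycle with Boto3; the instance types, counts and bucket names are illustrative, not Smarkets' actual configuration:

```python
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

# Provision a small Spark cluster.
cluster = emr.run_job_flow(
    Name="accounting-pipeline",
    ReleaseLabel="emr-5.8.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://example-bucket/emr-logs/",  # invented log bucket
)

# ... Luigi tasks submit their Spark steps while the cluster is alive ...

# Shut the cluster down once the last Spark-dependent task has finished.
emr.terminate_job_flows(JobFlowIds=[cluster["JobFlowId"]])
```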
This is the end of my talk. Thank you very much for coming. Thanks very much. Are there any questions from the audience?
Thanks for the talk. You said you use Luigi. Did you do any comparison with Airflow? No, we mainly use Luigi at Smarkets. I hadn't used Airflow before. Okay, so no pros and cons? Not really, I've never used Airflow myself. Yeah, thank you. So, you said you started up an EMR cluster and ran multiple Luigi tasks on it
and then shut it down at the end. It's not quite like that. We run the Luigi tasks within a node, and the Luigi tasks submit steps to the EMR cluster. We don't run the Luigi tasks on the cluster itself. Okay, I was wondering if you had a way
for Luigi tasks running outside the cluster to submit jobs to the cluster. Yeah, so we had to create our own PySpark task. Okay, cool. Is it open source? No, it's not open source, but I can show you the code. It's not that hard. Okay, cool, thanks. You mentioned that you save data in Parquet files which are in S3.
Earlier there was another great talk which mentioned that they were also using Parquet files and saving them in Azure, and they showed a wrapper which allows you to access these Parquet files stored in Azure from a local computer like any other file object. Is that even possible with S3?
Can I work with those files in S3 from my laptop? So, Parquet files are usually folders, right? They are not just a single file. When we are working in S3, since we are within the same file system, you can access them easily with Boto and read them with Pandas or with Spark. You don't have to do anything; you just read them with Spark, it comes out of the box. However, if you are on your local machine, then what we do when we want to test is recursively download the folder
so that we can then read it with Spark or with Pandas. Thanks.
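A minimal sketch of that recursive download with Boto3; the bucket and prefix are invented:

```python
import os

import boto3

s3 = boto3.client("s3")
bucket, prefix = "example-accounting-bucket", "stats/2017-07-11/"  # invented names

# A Parquet dataset written by Spark is a folder of part-files, so we
# list every object under the prefix and download each one locally.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        local_path = os.path.join("/tmp", obj["Key"])
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(bucket, obj["Key"], local_path)
```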
Maybe time for one more question? Yeah, sure, go ahead. You said you submit those Luigi jobs to one cluster, and you mentioned that you start the cluster and shut it down just for that run. Yeah. How do you manage all those tasks running on that one cluster, and when the cluster shuts down after it's... That's a good question. So, the first task that needs EMR
is going to be the one in charge of creating the cluster. And then, as long as some of the tasks still need EMR, we are not going to kill the cluster. We only kill it when the last task
that needs EMR is done. And we do so using Luigi event handlers: basically, we check whether all the tasks that have been scheduled are done.
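A rough sketch of that teardown check with a Luigi event handler; the bookkeeping of pending Spark tasks is invented:

```python
import luigi

# Would be filled in as Spark-dependent tasks are scheduled (invented here).
SPARK_TASKS_PENDING = {"GenerateDailyStats", "GenerateMonthlyStats"}


@luigi.Task.event_handler(luigi.Event.SUCCESS)
def check_cluster_teardown(task):
    SPARK_TASKS_PENDING.discard(task.task_family)
    if not SPARK_TASKS_PENDING:
        # Last Spark-dependent task is done: terminate the EMR cluster here,
        # e.g. with boto3's terminate_job_flows as sketched earlier.
        print("All Spark tasks done; tearing down the EMR cluster")
```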
Thanks. Anyone else? So, usually the thing with Spark is that you have to be very careful about partitioning and out-of-memory errors and things like that. Do you have any insight on the subject, or does Parquet handle the intermediate steps of that for you
or something like that? Yeah, to be honest, we didn't get any errors like that, maybe because we are not handling huge amounts of data. Okay, thanks.
How many rows? Yeah, in terms of rows, around 8.8 million, or rather twice that, because we always retrieve the last two days' worth, and those are just the new transactions used to generate the reports. We also have pre-processed data that we need to include in the reports. So, the raw transactions, the new transactions that we need to generate stats from, come to around 19 million, more or less.
Thanks. Anyone else? Nope, I can't see any. Okay, well, can we thank our speaker again? It was really good work. Thank you.