We're sorry but this page doesn't work properly without JavaScript enabled. Please enable it to continue.
Feedback

ETL pipeline to achieve reliability at scale

Formal Metadata

Title
ETL pipeline to achieve reliability at scale
Title of Series
Number of Parts
132
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
In an online betting exchange, thousands of money related transactions are generated per minute. This data flow transforms a common and, in general, tedious task such as accounting into an interesting big data engineering problem. At Smarkets, accounting reports serve two main purposes: housekeeping of our financial operations and documentation for the relevant regulation authorities. In both cases, reliability and accuracy are crucial in the final result. The fact that these reports are generated daily, the need to cope with failure when retrieving data from previous days, and the fast growing transaction volume obsoleted the original accounting system and required a new pipeline that could scale. This talk presents the ETL pipeline designed to meet the constraints highlighted above, and explains the motivations behind the tech stack chosen for the job, which includes Python3, Luigi and Spark among others. These topics will be covered by describing the main technical problems solved with our design: - Fault tolerance and reliability, i.e ability to identify faulty steps and only rerun those instead of the whole pipeline. - Fast input/output. - Fast computations.