
Reproducible & Deployable Data Science with Open-Source Python


Formal Metadata

Title
Reproducible & Deployable Data Science with Open-Source Python
Alternative Title
Reproducible and Deployable Data Science with Open-Source Python
Title of Series
Number of Parts
115
Author
Contributors
License
CC Attribution - NonCommercial - ShareAlike 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor, and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Data scientists, data engineers and machine-learning engineers often have to team together to create data science code that scales. Data scientists typically prefer rapid iteration, which can cause friction if their engineering colleagues prefer observability and reliability.  In this talk, we'll show you how to achieve consensus using three open-source industry heavyweights: Kedro, Apache Airflow and Great Expectations. We will explain how to combine rapid iteration while creating reproducible, maintainable and modular data science code with Kedro, orchestrate it using Apache Airflow with Astronomer, and ensure consistent data quality with Great Expectations.  Kedro is a Python framework for creating reproducible, maintainable and modular data science code. Apache Airflow is an extremely popular open-source workflow management platform. Workflows in Airflow are modelled and organised as DAGs, making it a suitable engine to orchestrate and execute a pipeline authored with Kedro. And Great Expectations helps data teams eliminate pipeline debt, through data testing, documentation, and profiling.
Duration
43:26
Transcript: English (auto-generated)
Well, hello, people. We continue with the afternoon talks. Now we are going to have Lim talking about reproducible and deployable data science with open-source Python. How are you, Lim? I'm good, I'm good. Can you hear me okay? Yeah.
Perfect. Where are you streaming from? I'm streaming from the office, actually; I went to the office so that I have some private space to do this. Nice. Is this your first ever EuroPython? Yes, this is my first EuroPython. That's great, it will go well.
Well, if you are ready, we're going to share your screen. Okay, we are going to start. Good luck. Awesome, cheers. Can you see the slides? All right, hi everyone.
Very excited to be here today to talk to you about reproducible and deployable data science with open-source Python. Before I continue, I'm just going to check the chat to see if everything is okay. Yes, okay, everyone can see my slides and my screen, that's great. My name is Lim, I'm a software engineer. My background is in full-stack software engineering, and I mostly worked for startups before.
Some example companies I have worked with include Deliveroo and Memrise. Right now I'm working in the data domain at QuantumBlack Labs. QuantumBlack is the advanced analytics consulting arm of McKinsey, and I'm building data products for data scientists, data engineers and machine learning engineers to accelerate their projects.
I love open-source Python. Currently I'm a core contributor on Kedro, one of the open-source tools explored in this talk today. I'm limdauto everywhere on the internet; it's a nickname in Vietnamese. So, the agenda today: I will be setting the scene in which you need to deploy a realistic Jupyter notebook into production.
I will explore some of the challenges that you might face on this journey and how you can move from a notebook environment into a standard Python project with Kedro. Then I will explain how we can use Kedro's extensibility to integrate with other tools in the MLOps ecosystem, a few deployment strategies we can use to deploy the project, and how we go from one project to hundreds of projects in the future.
So, with that in mind, let's imagine the following scenario: a media organization wants to provide video recommendations to their users. As we have seen in real life, if done correctly this can have a material impact on the company's bottom line.
This is the organization's first attempt at adopting machine learning and data science in their workflow. One of their very talented data scientists has conducted extensive research into different algorithms and architectures for building recommendation systems, and she has organized her findings as a series of Jupyter notebooks.
A disclaimer: all of the notebooks and data science code in this presentation are adapted from Microsoft's Recommenders repository, where they detail best practices in building recommendation systems. I cannot recommend it enough. That's a bad joke, sorry.
When it comes to notebooks, there are a couple of different perspectives. On the one hand, the interactivity, fast feedback loop and convenience of a Jupyter notebook are almost unbeatable, especially for exploratory data analysis and rapid experimentation. On the other hand, notebooks pose some challenges for reproducibility, maintainability and operationalization. Specifically, notebooks make it hard to collaborate within a team. I know there are some really cool recent projects to address this issue, but out of the box a Jupyter notebook is essentially a single-player game. It's also quite hard to conduct code review in a notebook format, and some of the other code quality controls that we usually enjoy in software engineering are hard to implement, such as writing unit tests, documentation generation, linting and so on. And notebooks sometimes give you a false sense of security, because they cache their results.
When you look at the output cell of your notebook and see that it has the correct result, it might make you think that your code runs without errors, even though the logic has changed. All of these problems combined cause, I think, the biggest issue of them all, which is consistency and reproducibility in your project. In a very interesting study in 2019, New York University executed 860,000 notebooks they found on GitHub: only 24% of them ran without error, and only 4% of them actually produced the same results.
So, with that in mind, I'm going to explore some of the strategies we can use to turn a notebook environment into a standard Python project using Kedro. Disclaimer: I'm a core contributor to Kedro, so this talk will be biased. What is Kedro? It is a framework for building reproducible, maintainable and modular data science code by applying software engineering best practices, such as separation of concerns and versioning, to your code. It was created by QuantumBlack from our own battle scars of delivering data science projects to our clients. It is used at startups, major enterprises and in academia, and it is fully open source. But instead of selling you this tool from the top down, I would like to explore some of the principles that motivated us to build Kedro in the first place, and some of the problems it is trying to solve. The first problem is data management.
Data management in Jupyter notebooks has a few challenges. Whenever I come to a new notebook, my first instinct is to ask a few questions: what are the datasets used in this notebook? Where are they stored? How are the data loaded? What are the formats? Can I reuse my data-loading procedure for other similar datasets? How can I incorporate new datasets if necessary? Our example notebook is actually miles ahead of the curve, in that it factors out all of the data-loading logic into a reusable library for the MovieLens dataset. It's still not quite ideal, because the library is programmed directly against the specifics of this dataset and is hard to reuse later on; if you have more datasets, you will have to build more libraries. The way we solve this in Kedro is that we provide a declarative data management interface through a YAML configuration API.
It separates the what, the where and the how of data loading. When you stack all of these declarative dataset definitions together in a centralized data catalog, it gives you instant clarity on which important datasets are used and persisted in your project, even for non-technical team members. It supports a number of features, such as interpolation for dynamic values like environment variables. It also supports changing dataset definitions between different environments: local development, staging, production and so on. A common use case is to use a smaller dataset in local development for rapid iteration and bigger ones in staging and production. It also supports data versioning, partitioning and incremental loading through different dataset types and configuration, and it promotes security best practices by accessing data without leaking credentials into your code.
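To make the declarative idea concrete, here is a minimal sketch of what such dataset definitions could look like, expressed through Kedro's Python API rather than the usual conf/base/catalog.yml file; the dataset names, paths and dataset types are hypothetical and assume the relevant Kedro dataset packages are installed.

```python
# A minimal sketch of declarative dataset definitions using Kedro's DataCatalog.
# The same structure normally lives in conf/base/catalog.yml; names and paths
# here are made up for illustration.
from kedro.io import DataCatalog

catalog_config = {
    "movies": {
        "type": "pandas.CSVDataSet",              # how to load: pandas CSV connector
        "filepath": "data/01_raw/movies.csv",     # where the data lives
    },
    "model_input_table": {
        "type": "pandas.ParquetDataSet",
        "filepath": "data/03_primary/model_input_table.parquet",
        "versioned": True,                        # keep a timestamped copy per run
    },
}

catalog = DataCatalog.from_config(catalog_config)
movies = catalog.load("movies")  # business logic never touches the IO details
```

Switching an entry's `type` (for example to a Spark-backed dataset) then becomes a configuration change rather than a code change, which is the environment-swapping behaviour described above.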
It's also extensible through custom datasets: if you find a particular use case that we don't support out of the box, it's very easy to write a custom data connector and use it in your project. I think the biggest benefit of declarative data management is that it provides a consistent interface between business logic and the different IO implementations. It abstracts away the differences between data sources, processing engines and data formats (out of the box we support many of them), and it promotes reusability of the data connectors, because they are separated from your business logic, so you can swap them in and out without changing the data science code.
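As a rough illustration of such a custom data connector, here is a sketch built on Kedro's AbstractDataSet interface; the newline-delimited JSON format is just an example, not something from the talk.

```python
# A sketch of a custom data connector, assuming Kedro's AbstractDataSet base
# class and its _load/_save/_describe contract. The JSON-lines format is only
# an example.
import json
from pathlib import Path
from typing import Any, Dict, List

from kedro.io import AbstractDataSet


class JSONLinesDataSet(AbstractDataSet):
    """Loads and saves a list of dicts as newline-delimited JSON."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self) -> List[Dict[str, Any]]:
        with self._filepath.open() as f:
            return [json.loads(line) for line in f if line.strip()]

    def _save(self, data: List[Dict[str, Any]]) -> None:
        with self._filepath.open("w") as f:
            f.writelines(json.dumps(row) + "\n" for row in data)

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": str(self._filepath)}
```

Once it is importable, the catalog can reference it by its dotted path, just like a built-in dataset type.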
In data science code, parameters and configuration are also important: the project parameters as well as the hyperparameters of your model. In the screenshot here you can see that I have configuration for the different tools I integrate into my project, such as Great Expectations, which we will see in a few minutes, as well as MLflow and Spark, all in the same place, and my parameters are managed the same way in a YAML file. The trade-off here is that while a YAML domain-specific language works very well for small and medium-sized projects, it becomes unruly for massive ones with hundreds of datasets, even with good IDE support. To mitigate these problems, Kedro supports splitting your data catalog into multiple YAML files.
It also supports templating to avoid repetition, at the expense of some readability, and you can use other YAML-native features such as reusable blocks to improve your configuration. I'm just going to show you a very quick demo of how this looks in real life in our project. This is our data catalog, located under conf/base/catalog.yml in my project. As I mentioned earlier, this is the base environment, but you can also create more environments, such as staging and production, down here to override your catalog definitions in different environments. I really like a feature in VS Code, the outline view, where you can see the outline of your catalog, so it's very clear which datasets are used in this project. To demonstrate the idea that we can swap different datasets in and out, I'm just going to change this one into a spark.SparkDataSet with file_format csv, and it will work the same way with a different processing engine, without touching the project code. Okay, so this is the data catalog, and I hope it gives you some idea of how Kedro can help you with the data management problems of Jupyter notebooks.
The next bit is about how you manage the code in your project. The challenges with managing code in a Jupyter notebook are that cells need to be run in a specific order, there are global-scope variables that may or may not have been initialized, it's hard to unit test specific cells in isolation, and it's still necessary to factor common logic out into Python utilities outside the notebook to prevent the notebook from becoming polluted. Out of the box, Kedro gives you a few simple but powerful coding patterns, as well as abstractions, to help you manage the code better in a project. The first thing is that business logic in Kedro is written as pure Python functions. There are a lot of benefits to this, but one of the biggest is that you can unit test them in isolation, and you can use other tools in the Python ecosystem that work on functions, such as decorators and composition, to help you write more modular code.
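For example, a test for such a function is just an ordinary pytest test; the sketch below uses a hypothetical cleaning function with made-up column names, not code from the talk.

```python
# A tiny illustration of testing a pure function in isolation with pytest.
# The function and column names are hypothetical.
import pandas as pd


def clean_ratings(ratings: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing ratings and duplicate user/item pairs."""
    return ratings.dropna(subset=["rating"]).drop_duplicates(["user_id", "item_id"])


def test_clean_ratings_drops_missing_and_duplicates():
    raw = pd.DataFrame(
        {
            "user_id": [1, 1, 2],
            "item_id": [10, 10, 20],
            "rating": [4.0, 4.0, None],
        }
    )
    cleaned = clean_ratings(raw)
    assert len(cleaned) == 1
    assert cleaned["rating"].notna().all()
```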
These pure Python functions can then be connected together into a bigger pipeline through a concept we call a node. A node is just a thin wrapper around a function, with inputs and outputs that are dynamically injected at runtime from the declarative datasets and catalog you have seen earlier. The pipeline shape is always a DAG, a directed acyclic graph, by design, so there are no cycles in your data flow. And algebraically speaking, a pipeline is just a set of nodes, so pipelines can be concatenated together to form bigger pipelines. In this example you can see that I have three nodes in my data processing pipeline: two that clean my ratings data and my movies data, and a third, create model input table, that uses the outputs of those two nodes as its inputs. That's my data processing pipeline; I can create my data science pipeline the same way, and in the end I just concatenate them together, because at the end of the day a pipeline is a set of nodes. It has these algebraic properties, and you can build pretty big pipelines this way, iteratively and modularly.
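Here is a minimal sketch of that node-and-pipeline pattern using Kedro's node() and Pipeline APIs; the function bodies, dataset names and node names are hypothetical stand-ins for the ones shown on the slides.

```python
# A minimal sketch of pure functions wired into a Kedro pipeline. Dataset
# names ("ratings", "movies", ...) refer to catalog entries; everything here
# is illustrative rather than the speaker's actual code.
from kedro.pipeline import Pipeline, node


def clean_ratings(ratings):
    return ratings.dropna()


def clean_movies(movies):
    return movies.drop_duplicates()


def create_model_input_table(ratings, movies):
    return ratings.merge(movies, on="item_id")


data_processing = Pipeline(
    [
        node(clean_ratings, inputs="ratings", outputs="clean_ratings",
             name="clean_ratings_node"),
        node(clean_movies, inputs="movies", outputs="clean_movies",
             name="clean_movies_node"),
        node(create_model_input_table,
             inputs=["clean_ratings", "clean_movies"],
             outputs="model_input_table",
             name="create_model_input_table_node"),
    ]
)

# Pipelines behave like sets of nodes, so they can simply be added together:
# full_pipeline = data_processing + data_science
```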
This is a demo we have online, but I'd also like to show you an example article that one of our users wrote on Medium, just to show a screenshot of their pipeline, which I think is quite massive. There it is at the end. This is a pipeline that the company runs in production; it's one of the biggest telecom companies in Indonesia, I think. One thing worth pointing out about the coding patterns in Kedro is that the topology of your pipeline is dictated by the data flow: as we saw before, you connect nodes using their inputs and outputs. So a Kedro pipeline is inherently data-centric, a bag of data, whereas if you are familiar with other workflow engines such as Airflow or Prefect, their pipelines are a bag of tasks and data artifacts. So in a way, with Kedro you already get a table-level lineage of your data for free, together with the transformation logic that produces it. It's not column-level lineage, but it's a good start.
So those are the coding patterns that help you extract the code from a Jupyter notebook and put it into a Python project in a maintainable way. The next big thing I want to talk about is the development experience in Kedro, because, as we know, the development experience in a Jupyter notebook is amazing, especially when it comes to exploratory data analysis and rapid experimentation. It also has really great communicative utility: for example, I can come to this recommendation system algorithm notebook and understand very easily what's going on. I think this is a great strength of notebooks that we try to match with the tooling we provide in Kedro. The first thing is that we provide interoperability with Jupyter notebooks for exploratory data analysis by allowing you to use Kedro constructs inside the notebook. For example, you can use the data catalog to load and save data within the notebook itself before you move on to writing code in an IDE. We also allow you to embed a pipeline visualization within a notebook, so you can visualize the shape of your data flow while you explore your data.
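In a notebook session started through Kedro's Jupyter integration, a `catalog` variable is typically injected for you, so that kind of exploration might look roughly like this; the dataset names are hypothetical and the exact variables available depend on the Kedro version.

```python
# Rough sketch of using the injected `catalog` object inside a Kedro-managed
# notebook session. Dataset names are hypothetical.
movies = catalog.load("movies")          # load a dataset defined in catalog.yml
movies["genre"].value_counts().head()    # explore it as usual with pandas

catalog.save("clean_movies", movies.drop_duplicates())  # persist results back
```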
Beyond the notebook environment, we provide a very powerful CLI to help you run your project iteratively. There's a run command that supports running the whole pipeline, running the pipeline in different environments, running sub-pipelines within the main pipeline, running a single node, and overriding parameters at runtime, so that, for example, I can try different hyperparameters for my model and run my pipeline many times with different parameters just to see how it behaves. There's a long list of run options, and most of them are powered by the fact that our pipeline is just a set of nodes, so you can filter it in any way you like.
We also allow people to add their own CLI commands beyond the provided ones, either through plugins (for example, the visualization tool you saw earlier is built as a plugin and provided as a command in Kedro, and you will also see another plugin later, for Airflow, where we create an Airflow DAG from the Kedro pipeline) or by creating your own commands within your project, as you can see down here where I create my own run command in the cli.py of my project. So it's a very extensible way to add more to your development experience in Kedro. Here are some examples of extensions built by our community: there is a kedro-mlflow plugin that provides commands to interact with MLflow, and there's a kedro-diff plugin, which is really cool, that shows you the diff of your pipeline between different git branches. Another tool that comes out of the box with Kedro is a powerful pipeline visualization, the Kedro-Viz plugin. It helps you develop and communicate your pipeline with a fast feedback loop: in my example here, when I change my pipeline definition, the visualization tool listens for changes in the file and automatically refreshes itself, so you can see that your pipeline shape has changed. It's also being actively worked on to turn it into an interactive data science IDE.
My product manager might kill me for saying this, but stay tuned. The last thing is that Kedro lets you scaffold new projects with a standardized project template. It's originally based on Cookiecutter Data Science, and it comes with a few tools out of the box, such as linting with flake8 and isort and code formatting with black. It supports more advanced setups, such as Spark initialization, through a concept of starters, and it supports custom starters tailored to your project's needs, such as specific CI/CD configuration. All right, that's it about Kedro. I would now like to talk a little bit about how you can use Kedro to integrate with other tools in the MLOps ecosystem. If you think about MLOps, there are many different ways people describe it these days, but one of my favourites comes from NVIDIA, where they model MLOps as a lifecycle.
As you can see, Kedro helps you with the collaborative development workflow in the middle, and it provides some tools out of the box to help you with data collection, data ingestion and data analysis. But what about all of the other responsibilities in the MLOps lifecycle? Instead of trying to become a do-it-all kind of tool, Kedro provides a very extensible integration mechanism for you to hook into these different needs in the pipeline lifecycle. For example: how do I run data quality checks after my raw data is loaded? How do I emit runtime metrics after a node runs, so that I can set up a monitoring system? And so on. Kedro provides this extensibility through a concept of lifecycle hooks that map to the MLOps lifecycle you saw before, and we have seen it used to integrate Kedro with different tools, like Grafana and Prometheus for monitoring, or MLflow for experiment tracking.
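As a rough sketch of the runtime-metrics idea, here is what a timing hook might look like, assuming Kedro's before_node_run/after_node_run hook specs; where the numbers are shipped (here just a logger) is up to you, and hook registration details vary by Kedro version.

```python
# Sketch of a lifecycle hook emitting simple per-node runtime metrics.
# It assumes Kedro's before_node_run/after_node_run hook specs; swap the
# logger for Prometheus, MLflow, etc. as needed.
import logging
import time

from kedro.framework.hooks import hook_impl

logger = logging.getLogger(__name__)


class NodeTimingHooks:
    def __init__(self):
        self._start_times = {}

    @hook_impl
    def before_node_run(self, node):
        self._start_times[node.name] = time.perf_counter()

    @hook_impl
    def after_node_run(self, node):
        started = self._start_times.pop(node.name, None)
        if started is not None:
            logger.info("Node %s finished in %.2fs", node.name, time.perf_counter() - started)
```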
In our example today, we will look at how we can automate data quality checking with Great Expectations using Kedro hooks. Great Expectations is a Python-based open-source library for validating, documenting and profiling your data. The way we do it is that we write a hook that runs before we save a dataset, to validate the data, so that if there are changes in the data with bad quality, we stop them from propagating down the pipeline.
I'm going to show you a live demo of this very quickly. This is the code editor, and this is my data catalog, as we saw earlier. In Kedro you can provide these hooks in a file in your project, hooks.py, so this is my hooks.py file, and this is my data validation hook using Great Expectations. When I initialize it, I initialize the Great Expectations data context using the configuration located in conf/base. This is the hook implementation; it uses the same hook mechanism as pytest, and in fact it is powered by pluggy, the library the pytest people made for building plugin architectures. This hook is called before_dataset_saved, and the idea is very simple: before you save a dataset, you try to get an expectation suite, which is a set of validations to run against the dataset, and if there is a suite that matches the dataset name, you simply run it with Great Expectations.
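A condensed sketch of that kind of hook is shown below; it is not the speaker's exact code, and it assumes Kedro's before_dataset_saved hook spec plus Great Expectations' classic pandas API, whose details vary between versions. The data-context path is also an assumption.

```python
# Sketch of a data-validation hook: look up an expectation suite named after
# the dataset and validate the data before it is saved. Not the speaker's
# exact code; Great Expectations APIs differ between versions.
import great_expectations as ge
from great_expectations.exceptions import DataContextError
from kedro.framework.hooks import hook_impl


class DataValidationHooks:
    def __init__(self, context_root_dir="conf/base/great_expectations"):
        self._ge_context = ge.data_context.DataContext(context_root_dir)

    @hook_impl
    def before_dataset_saved(self, dataset_name, data):
        try:
            suite = self._ge_context.get_expectation_suite(dataset_name)
        except DataContextError:
            return  # no suite for this dataset, nothing to validate
        result = ge.from_pandas(data).validate(expectation_suite=suite)
        if not result.success:
            raise ValueError(f"Data validation failed for dataset '{dataset_name}'")
```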
I have configured my project with one expectation suite, which matches my clean movies dataset. The suite is a JSON file, but Great Expectations also provides a Jupyter notebook interface for you to interact with it. When I run my pipeline, let me just do this quickly, this hook is called automatically, and after the validation runs we can open what's called the data docs to view the validation results. Actually, because I changed my data catalog earlier and left a typo in it, I'm going to change it back to a pandas dataset and run this again; I think that should work. Yes. The data docs are located under uncommitted/data_docs, and there's an index.html here that I'm going to open in my browser, just a second. This is the Great Expectations data docs, where you can see all of the previous runs of my pipeline and all of the validations from those runs. If you click on one of them, you will see which expectations were run, and if one failed, it will tell you why it failed.
So that is how you can use Kedro's extensibility to add automated validation checks to your pipeline quite easily, with just a few lines of code. The last bit I want to talk about is deployment: after all of this development effort, and after putting controls in place to ensure that your code quality and your data quality are pristine, we need to think about how to deploy this pipeline into production. To this end, Kedro supports a few deployment strategies. If your pipeline can run on a single machine, we support a single-machine deployment mode where you can containerize it using Docker, or package it as a wheel file, install it in your Python environment in production and run it like any Python package. It also supports a distributed deployment mode: if your pipeline cannot run on a single machine, you will need to split it up and run it on different nodes in a cluster. I will demonstrate how that looks today using Apache Airflow. The idea is very simple: we convert every node in your pipeline into an Airflow task, and the whole pipeline into an Airflow DAG.
As you can see in the screenshot here, my task flow in Kedro looks exactly the same as the task flow in my Airflow DAG. If there's time, I will show you a live demo of this in a bit. A very good question to ask is: why start with Kedro and then convert it into Airflow later, if they look exactly the same? As we saw earlier, starting with Kedro gives you the benefit of rapid development, much closer to that of a Jupyter notebook, and it focuses on data flow rather than task flow. It gives you the flexibility to stay simple: if single-machine deployment works for you, do that before you have to go distributed. It also gives you the flexibility to choose between different distributed orchestrators: if you don't have Airflow, you can go with Argo, Kubeflow, Prefect, whatever; the principle is the same. You convert parts of your pipeline into the primitive constructs of the orchestrator environment, and off you go. And there's a very powerful concept here that I would like to promote: your deployed pipeline doesn't need to have the same granularity as your development pipeline. In development we want as much detail as we can get, for all sorts of purposes, but in production we are often constrained by computing resources and by the production environment, so we might want to slice the pipeline differently and deploy it based on those constraints. In exactly this example, I have split my pipeline into just two tasks in my Airflow DAG: one does the data processing and one does the model training. Theoretically speaking, with Airflow you can run these two tasks on two different workers, where one might support Spark and the other might use a GPU if you use deep learning models.
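A rough sketch of that coarse-grained deployment might look like the DAG below, with each task invoking a named Kedro pipeline through the CLI; the DAG id, schedule and pipeline names are assumptions, and the kedro-airflow plugin can instead generate a node-per-task DAG automatically.

```python
# Sketch of a two-task Airflow DAG wrapping Kedro sub-pipelines via the CLI.
# dag_id, schedule and pipeline names are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="recommender",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    data_processing = BashOperator(
        task_id="data_processing",
        bash_command="kedro run --pipeline=data_processing",
    )
    model_training = BashOperator(
        task_id="model_training",
        # could be routed to a GPU-enabled worker via Airflow queues/pools
        bash_command="kedro run --pipeline=data_science",
    )

    data_processing >> model_training
```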
The last thing I want to talk about is how to go beyond a single project: how do these tools help you scale from one project to hundreds of projects? Basically, they do so by promoting reusability. You can reuse pipelines between projects using an abstraction we call modular pipelines; I didn't have time to cover this today, but it's in our documentation. Kedro helps you build reusable data connectors and reusable extension hooks, so beyond just Great Expectations you can build other hooks, as I mentioned before, for performance monitoring or experiment tracking. It helps you build reusable CLI commands and create scaffolding templates with starters, and you can publish all of these as open-source libraries for the community to use too. And that's it for my presentation today; all of the code for this project is hosted in this repository. Thank you for listening.
Thank you so much, it was really nice and well done. Now we have two minutes left; if you want, we can do a couple of the questions we have. Are you ready? Yeah. So the third question is: does Kedro support pipeline versioning, and if yes, how do you track dependencies? Okay, so, as mentioned before, a pipeline is constructed as pure code, so you can check it into version control. You would version-control your pipeline exactly as you version-control Python code, maybe with git, or with another version control system like Mercurial. As for dependencies, I suppose you mean project dependencies rather than dependencies between pipelines: in terms of project dependencies, they are tracked as in any other Python project with your standard tooling, such as requirements files. Well, we use requirements.txt and pip generally, but I think you can also use Poetry if you're more advanced, or more trendy.
And the next one is: what do you think are the main pros and cons of Kedro with respect to DVC? That's a great question. I'm not that familiar with DVC, so I can't say too much, but what I think Kedro helps with is the development workflow: it helps with collaboration, and it helps standardize your practices across your organization, so every team uses the same standards, which makes it easy to transfer between projects or to reuse code later on. Whereas DVC, in my limited understanding, is specifically concerned with version control of your data and your code, so I think these two tools are complementary; they're orthogonal, in my opinion.
Perfect, thank you so much, Lim. Really good for a first time. Thank you. The time is limited, so if people have more questions, please go to the breakout room, and Lim will be there answering the questions they have. Thank you. Thank you so much.