
Validating Big Data Jobs


Formal Metadata

Title
Validating Big Data Jobs
Subtitle
An exploration with Spark & Airflow (+ friends)
Title of Series
Number of Parts
561
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
If you, like close to half the industry, are deploying the results of your big data jobs into production automatically, then existing unit and integration tests may not be enough to prevent serious failures. Even if you aren't automatically deploying the results to production, having a more reliable deploy-to-production pipeline with automatic validation is well worth the time. Validating big data jobs sounds expensive and hard, but with a variety of techniques it can be done relatively easily, with only minimal additional instrumentation overhead. We'll explore the kinds of instrumentation to add to your pipeline to make it easier to validate. For jobs with hard-to-meet SLAs, we'll also explore what can be done with existing metrics and parallel data validation jobs. After exploring common industry practices for data validation, we'll explore how to integrate these into an Airflow pipeline while keeping it recoverable if manual validation over-rules the automatic safeguards.
Transcript: English (auto-generated)
That's part of why I change jobs all the time. So, yeah, I'm going to talk about validating big data and machine learning pipelines. I think it's a super important topic, and I wish more people validated their pipelines. Oh, I need to speak up? Okay, cool, I'm going to try yelling.
This is going to be a long day, so if my voice starts dipping and you can't hear me, let me know and I will start yelling some more. Okay. My name is Holden, my preferred pronouns are she or her, and I'm a developer advocate at Google. I'm on the Spark PMC, which is why most of my examples involve Spark, even when it is a really bad idea.
But that's okay, because you can use the same techniques with other tools. I'm a co-author of two books on Spark; neither of them talks about anything in this talk, but that should not stop you from buying my books. That is the more important part.
One of the things I've started doing that I think is kind of cool, if anyone is particularly interested in how projects like Spark do code reviews, is code review live streams, where you can watch me review open source pull requests live and try not to swear. I fail at the second part, but I still think it's fun to sort of watch and join in.
Okay, cool. In addition to what I am professionally, I'm trans, queer, a Canadian in America on an H-1B work visa (which they're debating whether or not they want to keep; it's a really great feeling not to know if you can renew your work visa), and part of the leather community. This is not directly related to big data or machine learning. However, I think for those of us who are building machine learning pipelines, or even just tools with data, it's really important that we try to build diverse teams, and this includes us in the open source community. If we don't talk about where we're all from, and if we don't build diverse teams, we're just going to recreate yesterday's problems more efficiently, and that is not really what I want in life. I want us to find new solutions to new problems, and I think diverse teams will help us do that. Okay. That being said, I'm not going to talk about that. We're going to talk about how to avoid having everything catch on fire, and why you should do it, though given that you're here, you might be fairly convinced. And I promised at least one cat picture and at least one picture of my scooter club, which is only tangentially related, but I do wish to try and expense my gas, so I'm working on making it more related to computers.
So hopefully you're nice. Is anyone here familiar with Scala? Yeah? Friends, thank you. Okay, cool. And if you're not familiar with Scala, that's totally fine. How many people are familiar with something like Spark, Beam, or Flink? Okay, that's a good number of people. If you're not, it's okay: these same techniques apply to other systems. Generally they're a lot easier to do in non-distributed systems, so if you happen to be working locally, your validation tasks become way, way easier. And hopefully you're here because you want to make better software.
And if not, I can't convince you. Okay, so: validation is really important. Driver's license tests can be very similar to how we test our software, which is to say they're better than nothing. I would not want my friends to start riding a motorcycle without a license, as my friend is doing here. But at the same time, even if someone has passed their motorcycle license test, they're probably still not the safest driver, and it's just like your code: your tests cover the basics, but they're not going to catch everything. There's always going to be that strange SUV operated by some random drunk person, or you're going to have some null records in your CSV files. One of the two is going to happen, and your pipeline is going to fail, or you're going to crash. Ideally you want to know when something has gone wrong, so you don't make it worse, right?
So our tests are not perfect; we are eventually going to get on the fail boat, and at some point you want to minimize the impact of that. Does anyone here have to carry a pager? That is very, very few people. You are all very lucky; I am kind of jealous. Someone else in your organization may be woken up at three o'clock in the morning to do a rollback from your data pipelines. You probably don't want them being very angry with you, because you're going to need things from your operations staff. So it is still worth saving other people from being woken up at three o'clock in the morning, even if it is not saving yourself.
I did a survey on how many people have had Spark jobs cause a serious production outage: 15% said yes, 50% said no, and 30% were like "I didn't have to update my resume, but we did lose a few million dollars", also known as "depends on what you mean by serious". You really don't want to be in the 15%, and you don't want to end up in the 30% either. You don't want your pipelines to cause really bad failures.
And I have a survey if other people want to give feedback on this. As more people deploy their pipelines automatically into production, and as more people start doing streaming data, you don't have the same time to do manual verification and validation that you're used to. With scooters, the equivalent could be going home after an accident rather than checking on your bones to see if they're all put together. Bit of a stretch: you'd probably notice if your arm was broken. But with computers, we might not notice that we broke a feature and cost our employer a few million dollars. I've done that; it was a really awkward one-on-one the next day.
Like, trust me, I updated my resume, I did everything right, but I was really stressed about that, and you don't want to have that experience. There was another time, much less stressful: I just assumed that everything was a coffee shop, and my only test query was the word "coffee". And then my boss got upset that we were returning Starbucks when he was trying to find a steakhouse. That was an awkward call, but not that awkward, because we were a start-up and we didn't make any money, so it was just, yeah, whatever, jokes. And then there are other ones: some words can have multiple meanings, and those meanings can be really awkward to have to explain to people. And if you have tools which do a good job of keeping minors from seeing inappropriate content, you really don't want those tools to break. And it's really easy for that to happen when your data changes.
With money things: in America, the Veterans Affairs agency couldn't pay a whole bunch of people because of data validation pipeline problems. Because of not validating data, Bank of America foreclosed on a whole bunch of people's mortgages, and other generally terrible things have happened to specific people from us not doing a good job of catching data quality issues. So hopefully this is enough that you care and you'll pay attention. And if none of these problems are things that you care about, it's okay; the Internet is working, somewhat. So let's do some validation. Yeah.
Okay, I'm going to take the unicorn horn off now. Another thing that we might want to validate is that our Slurpee is drinkable. Do you have the concept of Slurpees here? It's an ice drink with more or less sugar, some food coloring, and something resembling flavor. Occasionally these machines are at gas stations, and maybe they're not the most well-maintained machines, and you might not want to drink something from a machine which is fundamentally just a giant petri dish when it's not operating correctly. So a good validation rule that we might have for consuming Slurpees is that our Slurpee should have the food coloring in it; if it doesn't, something has gone wrong, and I probably don't want to get sick. But the new Slurpee ghost white gummy flavor is going to break our validation rule, because now there is a Slurpee flavor without food coloring. And this is okay, right? I think it's totally fine to have validation rules that break occasionally, and this is perhaps different from tests. We really don't want tests to break occasionally; we want our tests to be deterministic. But with validation rules, since they're being used as sort of a last catch, it's okay if they're sometimes wrong. They can't be wrong so often that people just start to ignore them, but if they're occasionally wrong, like once a month, and people don't start tuning out their alerts, it's okay.
So how do we make validation rules? Hopefully at some point you've had software that worked. If you do not have any software that works at all, validation is not your problem; it is time to go and fix your software. But once you've had software that works, maybe we can collect some metrics about how our software is working, and then we can look at future iterations of our software. If an iteration is not looking like the previous ones, maybe that's a thing we can do something about. And we can do similar things for our inputs: we can look at whether today's inputs look similar to yesterday's inputs, or whether the rate of change between days of inputs is within reason.
Right. Does anyone in the audience have something equivalent to this, where you load some data and try to parse it, but maybe the schema doesn't apply, maybe it's missing a field, maybe something is set to null, and, whatever, we just throw away the bad data? It's okay; we just keep the good data in our pipeline. Does anyone do this? Am I just the one bad person here? Okay, so a lot of people are raising their hands, and they're only half raising them, which makes me think more of you do it but you don't want to be caught. And this is okay. Especially for anyone who has to work with JSON data: if you required that all of your inputs were completely correct, you would never produce a result. That's just not going to happen; JSON data is garbage. So we're going to have something like this, and that's okay. But the problem is, we might throw away 99% of our data when historically we've only been throwing away 1% of our data, and then that 1% is no longer a representative sample of my users, and if I'm training a recommendation model, I might make some really bad recommendations, or other kinds of decisions, based on it. So we could write a check, "is it valid", and then count how many records are valid, and say that if we have less valid data than bad data, we'll put some special business handling logic in here. If you're in Java, you know, throw an exception, because everyone likes exceptions. Just do something. This is technically a validation rule. It's not a very complex one, but it's a good start, and it's better than nothing. And at this point, your special validation rule can trigger a page. Apparently all of you are very lucky and have operations people, so their pagers will go off, they'll come and take a look at it, and then yell at you the following morning.
It's less fun, though; this code is less nice to write, and that's the thing I care about. In terms of working in Spark, there are some performance problems with doing these two counts, which is a little sad, and it's similar in other systems: triggering two actions can be kind of not great. Beam is different, because it has a sort of whole-program optimizer, to a degree, so if you're in Beam, this last part doesn't apply. Okay, but we could use counters. Yay, counters.
The other thing is that we don't have to define all of the counters ourselves. All of these systems already keep track of some metrics themselves, and those metrics can actually have a lot of really useful information that can tell us whether or not our job is operating normally. We can look at the number of bytes being shuffled around; we can look at the number of records being read. Execution time is super basic, but if you've got a machine learning pipeline that takes three hours to train and today it takes 15 minutes, you do not want to push that model to production. There is something going on. I'm sure you just sped up your code and you're amazing, but maybe it's time to spot check this one. And we can add counters for things that we ourselves have either had catch on fire, or suspect might catch on fire. And we can still pretend it's nice functional code: we just hide all this mutation in counters inside of the underlying systems and pretend that we're writing functional programming code, for those of you who care about writing functional programming.
And, right, yes: counters solve our problem in the same way that regular expressions can solve our problems, in that they produce a new problem for us to solve. But it's a different problem, so it's like they solved the first one. Okay, so we have a happy counter and a sad counter, and we see how sad we are at the end of our job. If we're really sad, we won't do anything; we'll just go to (you don't have Taco Bell... Pizza Hut?) Pizza Hut, we'll go to Pizza Hut and get some pizza, and we won't push our model to production. We'll take the rest of the day off, and, you know, that's better than nothing. Okay. This is still really tightly coupled to our code, though, and that is less than great.
All right, and there are a bunch of problems with counters. Some of the problems are that Beam's counters are implementation dependent: you can change the runners for Beam, but doing that will change the behavior of your counters. So if you've got something that's working locally and you start using it on a cluster, it's totally possible that the validation rules you've been working on locally will just stop working. That'll be a great experience. Or if you're trying to change runners and you want to use your validation rules to validate that your new runner is performing reasonably, it might just not be, and you won't know. And Spark's counters have their own problems too, with data properties. But where do we put our counters? Fundamentally, we have to understand our problem domain, and this is very similar to the problem domain of running a queer scooter club, which is:
Everyone loves glitter and bubbles and scooters, but it turns out that putting a bunch of soap on the road was a bad idea. Who could have foreseen that having a bubble machine would cause accidents during a parade? No one, obviously. But if we spend time thinking about our problem domain, we can add some tests (sorry, validation rules) like: the road should not be more slippery than it used to be, and if it is, maybe let's turn off our bubble machine and stop riding for a little bit and think about what's going on, right?
So what do people do in practice for making their validation rules? Really depressingly, I ran a survey, and it turns out that the only things most people validate are execution time and the number of records they read in. They're like: if it took three hours today and it took three hours yesterday, and it read six gigs yesterday and seven gigs today, that's fine; everything in between, whatever, it's probably the same. How many people think that's enough? What? Oh, please tell me you don't work at a bank. Oh dear, oh dear; at least not my bank. Okay, well, whatever. It's not like I have a lot of money, so it's not a big problem for me. Anyways, maybe you'll lose my mortgage; that'd be great. Okay, so spark-validator. I did a proof of concept, and I actually have a second proof of concept as well. This one is integrated into your Spark job; I have a second one which integrates into your Airflow pipeline. You can just go ahead and put in all of these counters, and then you can define historic rules on your counters and say how they should be related from today to yesterday, over time. And then it's nice, and you can have sort of decoupled stuff. Input schema validation is really cool: we can write our own input schema validation, and we can look at the percentage of data that's changed. And, okay, validation rules can be separate stages, right? This is important, and it's the difference between spark-validator and the second proof of concept that I made:
To an extent, we can do data validation in parallel in a separate process. That is to say, not all of our validation rules have to be about how our program is behaving; they can be about whether the summary statistics of our data look similar today to yesterday, because a change there can be a really good sign that something has gone wrong. And we can run that in parallel without slowing down our main job, which is really cool too. In fact, there's a tool to do that which runs on top of Beam. It's called TFDV; despite being called TensorFlow Data Validation, you can use it for things other than TensorFlow. It's also open source, and you can use it to compute some basic statistics about your data, compare them to the previous statistics, and find anomalies based on your schema.
And this is a really useful tool. There are some limitations with the places where you can deploy it right now, but it's open source, and you can take the ideas and apply them to your own system. Software changes, too: when your software changes and your data hasn't changed, you should run your old software against your new software and see if the outputs look the same. It is really simple and can save you a lot of time, even when you think that only unrelated changes have happened, because your unrelated changes might change your Fortran libraries, which, oh dear God, are somehow related to Spark. Okay.
And we can put it all together in Airflow, yay. So if you use the second proof of concept that I have, you would put in your data validation task as a spark-submit operator, and you can define it as dependent on the important business logic task. If you were doing the parallel stuff, you wouldn't define the dependency graph that way; you would instead just have it run in parallel, and then you can use the Bash operator to call the TFDV stuff that we showed here, and that's all kinds of fun. Some ending notes: you don't have to be perfect. Just do something.
Something is better than nothing, even if all you validate is execution time. That's okay; it's better than not doing anything. Just start somewhere. Here are some related links, and here are some books. Here are some more books; they're unrelated, but that should not stop you from buying them. I also have a project to teach distributed computing to children. Probably not the children that you like, but the children who don't have your cell phone number, whom you still want to convince to join us in this wonderful adventure that we call programming. And this is not a joke; it sometimes gets confusing. Okay. So this is it. Thanks, bye. I think you said there were five minutes at the end for questions, right? Cool. So I've got time for a few questions, if anyone has them. Yeah, a question?
Yeah, so the question is: I've been talking a lot about machine learning being more and more integrated into distributed systems; what about algorithms which are not easily parallelizable? Do I see distributed systems fitting into that? And I think that yes, even for algorithms which are not easily parallelizable. My friends at a company which will remain nameless (because they know where I live) use distributed systems to do all of their feature prep, and they also do downsampling on their large data, construct representative samples, and do their training locally with non-parallelizable algorithms. And honestly, in real life, I tend to spend more time doing feature prep than cool machine learning stuff. So you'll still have all of these big distributed systems problems, but then at the end you're still going to want to rent a really giant node for, like, the six hours it takes to train your fancy model, because there is no magic wand of parallelization. If there was, I would make a lot more money. Or less; actually, I don't know, people wouldn't need me. One of the two. Okay, question number two. Oh, yeah. So, sorry, I was rushing because I got the you're-out-of-time signal,
but I'll come back here. So, this data validation task. Oh, right, sorry. Okay, the question was: I showed an example in Airflow; why was I doing the validation after the important business logic rather than in parallel? The spark-validator example that I have looks at the counters that are output by your Spark job to do the validation on it, so it doesn't look at the data independently. TFDV looks at the data independently; the spark-validator tool that I made looks at the metrics produced by your Spark job, and that's why I have the dependency chain the way that I do here. Awesome. Cool, I think I'm probably out of time. Thank you all.