
Validating Big Data Jobs

Formal Metadata

Title
Validating Big Data Jobs
Subtitle
An exploration with Spark & Airflow (+ friends)
Title of Series
Number of Parts
561
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
If you, like close to half the industry, are deploying the results of your big data jobs into production automatically, then existing unit and integration tests may not be enough to prevent serious failures. Even if you aren't automatically deploying the results to production, a more reliable deploy-to-production pipeline with automatic validation is well worth the time. Validating big data jobs sounds expensive and hard, but with a variety of techniques it can be done relatively easily and with only minimal additional instrumentation overhead. We'll explore the kinds of instrumentation to add to your pipeline to make it easier to validate. For jobs with hard-to-meet SLAs, we'll also explore what can be done with existing metrics and parallel data validation jobs. After exploring common industry practices for data validation, we'll look at how to integrate these into an Airflow pipeline while keeping it recoverable if manual validation overrules the automatic safeguards.
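
The abstract itself contains no code, but as a rough sketch of the counter-style instrumentation it alludes to, the following PySpark job tallies input and invalid records with accumulators as the data flows through, then checks the ratio before promoting the output. The paths, the schema check, and the 5% threshold are illustrative assumptions, not details from the talk; a real job might instead compare the counters against historical values for the same pipeline.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("instrumented-job").getOrCreate()
sc = spark.sparkContext

# Accumulators are updated inline as records flow through the job, so no
# second pass over the data is needed. Note that accumulators updated
# inside transformations can over-count if tasks are retried, so treat
# them as approximate guard rails rather than exact counts.
records_in = sc.accumulator(0)
records_invalid = sc.accumulator(0)


def parse(line):
    records_in.add(1)
    fields = line.split(",")
    if len(fields) != 3:  # hypothetical schema: three comma-separated fields
        records_invalid.add(1)
        return None
    return fields


parsed = sc.textFile("input.csv").map(parse).filter(lambda r: r is not None)
parsed.saveAsTextFile("staging-output")  # write to a staging location first

# Check the counters before promoting the staged output to production.
if records_in.value == 0 or records_invalid.value / records_in.value > 0.05:
    raise ValueError("validation failed: empty input or too many invalid "
                     "records; leaving output in staging")
```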
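And as one possible shape for the Airflow integration the abstract mentions: a validation task gates the publish step, and a human operator can recover from a false alarm by marking the failed validation task as success and clearing the publish task in the Airflow UI, after which the run proceeds without rerunning the whole job. The DAG id, task logic, and helper below are assumptions for illustration; the imports follow the Airflow 2.x layout.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_job(**_):
    """Hypothetical: submit the Spark job and record its counters."""


def fetch_invalid_ratio():
    # Hypothetical stand-in: a real pipeline would read the counters the
    # job recorded (e.g. from a metrics store or a marker file).
    return 0.0


def validate(**_):
    # Raising here fails the task, which blocks the publish step below.
    ratio = fetch_invalid_ratio()
    if ratio > 0.05:  # arbitrary example threshold
        raise ValueError(f"validation failed: {ratio:.1%} invalid records")


def publish(**_):
    """Hypothetical: move the validated output to its production location."""


with DAG(
    "validated_pipeline",  # hypothetical DAG id
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
) as dag:
    job = PythonOperator(task_id="run_job", python_callable=run_job)
    check = PythonOperator(task_id="validate", python_callable=validate)
    pub = PythonOperator(task_id="publish", python_callable=publish)

    # publish only runs when validate succeeds (the default all_success
    # trigger rule). If a human overrules the automatic safeguard, marking
    # "validate" as success and clearing "publish" recovers the run.
    job >> check >> pub
```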