
Data governance in streaming at scale

Formal Metadata

Title
Data governance in streaming at scale
Title of Series
Number of Parts
69
Author
Contributors
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
letgo is a second-hand marketplace app reshaping second-hand trade in Turkey, making re-use the default, trusted choice. We designed our data platform around the principles of self-service, compliance with privacy laws, data governance at the business-unit level, minimal maintenance, and cost containment by design. We will describe how we defined our company-wide data model, leveraging Avro schemas while enabling high-impact features such as:
- tagging fields that hold sensitive data, for compliance with data privacy laws
- ensuring the quality and structure of the data landing in the company data lake
- efficient and reliable transportation and consumption of data at the platform level
- a data catalog: discovery of the available data by teams
Our design is built around the Apache Kafka ecosystem, with special mention of Kafka Connect, for data ingestion, and on AWS services plus the Spark framework for data transformations and data lake ingestion. Thanks to these principles we can ensure data governance over both batch and real-time data while keeping a multi-tiered data lake: the inner tier holds the most sensitive data, and the outer tiers hold only the data accessible to each individual business unit.
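
The abstract does not spell out how the field tagging works, but a minimal sketch of the idea might look like the following. It assumes a hypothetical custom "pii" attribute on Avro field definitions (Avro allows extra metadata on fields, though the attribute name here is an illustration, not letgo's actual convention) and uses only the Python standard library; in the real pipeline the same projection would presumably be applied by Spark jobs before data lands in the outer tiers of the data lake.

```python
import json
from copy import deepcopy

# Hypothetical Avro schema for a marketplace event. The custom "pii"
# attribute on a field (not part of the Avro spec, but permitted as extra
# metadata) marks sensitive columns that must stay in the inner tier.
LISTING_EVENT_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "ListingCreated",
  "namespace": "com.example.marketplace",
  "fields": [
    {"name": "listing_id",   "type": "string"},
    {"name": "category",     "type": "string"},
    {"name": "price",        "type": "double"},
    {"name": "seller_email", "type": "string", "pii": true},
    {"name": "seller_phone", "type": ["null", "string"], "default": null, "pii": true}
  ]
}
""")


def pii_fields(schema: dict) -> set:
    """Collect the names of fields tagged as PII in the schema."""
    return {f["name"] for f in schema["fields"] if f.get("pii", False)}


def redact_for_outer_tier(record: dict, schema: dict) -> dict:
    """Return a copy of the record with PII fields dropped, suitable for
    the outer, business-unit-facing tiers of the data lake."""
    sensitive = pii_fields(schema)
    return {k: v for k, v in deepcopy(record).items() if k not in sensitive}


if __name__ == "__main__":
    event = {
        "listing_id": "abc-123",
        "category": "furniture",
        "price": 250.0,
        "seller_email": "seller@example.com",
        "seller_phone": "+90 555 000 0000",
    }
    print(redact_for_outer_tier(event, LISTING_EVENT_SCHEMA))
    # {'listing_id': 'abc-123', 'category': 'furniture', 'price': 250.0}
```

Because the tag lives in the schema rather than in application code, the same metadata can drive privacy-law compliance, data-lake access tiers, and the data catalog from a single source of truth, which is the governance property the talk describes.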