Data governance in streaming at scale
Berlin Buzzwords 2021, part 13 of 69
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI: 10.5446/67353
Transcript: English (auto-generated)
00:07
Welcome to this talk. As you know, my company is Letgo, where I'm a data engineer. I'm going to start with a small introduction to us as a company.
00:21
Letgo is a second-hand marketplace, mainly on the mobile channel, although we also have a web page. We started in 2015, and since then we have had phenomenal growth, so the amount of data we need to manage today is quite large.
00:43
Our top market now is Turkey, and we have offices both in Barcelona and in Istanbul, although we have opened recruiting to any place in both countries, Turkey and Spain. As I was saying, when the company started
01:02
in 2015, like many companies, it started as a monolith. However, if growth means the same for you as for me, it means trouble. Pretty soon, we started migrating to an ecosystem of microservices. And as you know, an ecosystem of microservices
01:22
means that you also have a society of teams that need to collaborate with each other. And the collaboration between teams, in terms of data, was done by sending and receiving
01:40
many, many events. At the beginning, it was more or less a free-for-all: everyone could subscribe to anyone else's events, and permissions were very coarse-grained, so either you were able to read everything that was produced or nothing at all. But later on, as we were growing,
02:04
we implemented a data platform and a data bus to manage that number of event types, which is now more than 1,000. And that meant having security controls, for example; otherwise, it's impossible to comply with regulations
02:20
like CCPA or GDPR. We are going to talk about some of the principles of how we made this possible. I want to go back to the title of the talk, that is, data governance in streaming at scale. Let's start with data governance. Data governance means that there are some guarantees
02:43
that we can make. The very minimum guarantee is a minimum of quality, and one way to get that is to have some kind of format or representation where you know that, for example, some fields are present, some other fields are optional,
03:00
and they are always of the same type at a given time, so you can rely on them. But that's the very minimum. On top of that, it's very nice to be able to catalog your data, so you know which events exist and what each field means. And also, especially, where the private information
03:21
of the user is among all those fields. Because, as you can see in the word cloud on the right, there are a lot of things around the world of data that need to be taken care of. Some of the words, like CCPA or GDPR,
03:41
are privacy regulations. But even if we didn't have those privacy regulations, we would need to pay attention to these kinds of things. Because if you don't manage the data, the path of least resistance is to have access to everything from everywhere.
04:02
And that means risk, so it's not enough to comply with the regulation: you want to be able to know where the private information is and who needs to use it, so you can follow the principle of least privilege in your platform. Otherwise, it's very difficult to manage.
04:22
This kind of philosophy also allows you to implement things like data minimization. It's a concept that comes from German law and predates the GDPR. There are many German speakers here, so rather than try to pronounce the long German word,
04:43
I will just use its translation: data minimization. That's a very nice principle that says that you should not design any business function or application using more data,
05:01
or more private data, than the very minimum needed for the function you want to provide. Otherwise, it's a risk you should not take. Now let's focus on the streaming part. If you do this governance,
05:21
meaning checking the properties of your data, data integrity, data quality, and also the part of governance related to access control and privacy, and you do it in streaming, it's like following this path towards the light.
05:43
But you need to do some work; you need to implement it. The typical failure mode, if you deviate to the left and don't do it in streaming, is to have that governance done in batch.
06:00
So you can do your data integration overnight, and you can have your reports joining data from very distant parts of your company. But that creates silos in real time, because you can only join the data one hour or one day later.
06:22
In real time, you cannot use, in one service, the useful data that another service is able to emit as a real-time event. The other failure mode, when we deviate to the right, is when you don't have any access control, and every microservice can access the events
06:43
generated by any other microservice. This is the "access all the things" reckless behavior. So let's try to find a way to go straight towards the light at the end of the path. And let's also focus on the scale.
07:03
By scale, I mean this kind of thing: we can scale in the number of events and in the amount of traffic we have. That kind of scaling means that you need some platform to which you can add more resources
07:21
to handle more load, while the cost of your platform per user or per event doesn't grow. Because otherwise, when you scale in load, you are not viable as a business.
07:42
We have solved that by putting Kafka, which is highly scalable, at the center of our platform. But the other dimension is more challenging, and that dimension is the organization. You can grow 10x in traffic, but you can also grow 10x in the number of teams
08:03
and in the number of different kinds of events and different kinds of private information to manage. And the relationships between the teams are not going to grow linearly; they are going to grow quadratically. If you don't want to end up with a pile of ifs
08:21
for all the special cases, you need some kind of separation between the data platform and the teams, and some way to manage that without special-casing everything. And we will see how at Letgo we have found a way to scale linearly also in the number of teams.
08:45
I can reveal in advance that at the center of the solution we have schemas. And schemas can be used in different ways, with different effects on the flexibility, rigidity, and freedom of the teams.
09:01
So we can draw two axes. On one axis, we go from centralism to anarchy. In the extreme of anarchy, the teams can do whatever they want with the data; they don't need to talk to anyone, and they can go very fast. In the centralist extreme,
09:22
all decisions are taken in a central place, and you can imagine what that is like. On the other axis, we go from no guarantees to strong guarantees. One of the typical things you can do is to say that your events are just JSON. You don't have any schema, and the only guarantee that you give to the other teams
09:43
is that their parsing function is not going to throw an exception. That's very flexible, but it's going to make exploiting the data very difficult, and managing the privacy extremely difficult.
10:00
In this corner, we have the distributed monolith. This is when you don't really have microservices, because you need to coordinate the deployments: you need to have exactly the same version of the schema, or maybe of the library used to produce and consume the data, everywhere. If, in an architecture,
10:23
you need to deploy two services at the same time, they are not independent. It's in fact a monolith that is broken down across several machines, but it's still a monolith. I suppose you were now expecting some technology up here in this corner:
10:42
the unicorn technology that gives you infinite flexibility and strong guarantees. Unfortunately, we don't have that. At Letgo, what we use is Avro, because we can evolve schemas with it. Let's take a look at Avro. Avro is a serialization technology.
11:02
That means you can convert your messages into a wire format that you can use to send the message or to store it. And it is schema-driven: you cannot use Avro without having a schema. That's good, because we want to have
11:22
some description of the data we are going to send. It supports JSON and binary representations. The binary representation is standard, very compact, and also very fast, because you don't need to be scanning the text
11:42
to see where each field starts and ends. The main downside of using a binary serialization format is that you need tooling to inspect the message. As I was saying, you can use Avro with either encoding,
12:04
and it's pretty much a trade-off between readability, compatibility, and so on. At Letgo, we have support for both encodings, and it's a matter of what you want to use for your use case.
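A minimal sketch of that trade-off with the Apache Avro library from Scala (not from the talk; the schema and values are invented for illustration):

```scala
// The same record through Avro's JSON and binary encoders.
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord, GenericRecordBuilder}
import org.apache.avro.io.EncoderFactory

object EncodingDemo extends App {
  val schema = new Schema.Parser().parse(
    """{"type":"record","name":"Ping","fields":[{"name":"user_id","type":"string"}]}""")
  val record = new GenericRecordBuilder(schema).set("user_id", "u-123").build()
  val writer = new GenericDatumWriter[GenericRecord](schema)

  def encode(asJson: Boolean): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val enc =
      if (asJson) EncoderFactory.get().jsonEncoder(schema, out)
      else EncoderFactory.get().binaryEncoder(out, null)
    writer.write(record, enc)
    enc.flush()
    out.toByteArray
  }

  println(new String(encode(asJson = true), "UTF-8")) // {"user_id":"u-123"}
  println(encode(asJson = false).length)              // a few bytes, no field names
}
```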
12:23
One of the most important points for me is that it's polyglot: you are not tied to a single programming language. In fact, at Letgo we have two main programming languages, Scala and PHP, and we can use those schemas from both. The most important thing, in this case for the flexibility,
12:43
is that schemas can and do evolve in Avro. There are some rules; not every change is compatible. If you want to keep backward or forward compatibility, there are some checks that the Avro library can do for you.
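A sketch of such a check (not Letgo's code; event name and fields are made up), where v2 adds an optional field with a default, which is a compatible change:

```scala
import org.apache.avro.{Schema, SchemaCompatibility}

object CompatibilityCheck extends App {
  val v1 = new Schema.Parser().parse(
    """{"type":"record","name":"ChatMessageSent","fields":[
      |  {"name":"sender_id","type":"string"},
      |  {"name":"body","type":"string"}
      |]}""".stripMargin)

  val v2 = new Schema.Parser().parse(
    """{"type":"record","name":"ChatMessageSent","fields":[
      |  {"name":"sender_id","type":"string"},
      |  {"name":"body","type":"string"},
      |  {"name":"client","type":["null","string"],"default":null}
      |]}""".stripMargin)

  // Can a consumer compiled against v1 read events produced with v2?
  val result = SchemaCompatibility.checkReaderWriterCompatibility(v1, v2)
  println(result.getType) // COMPATIBLE
}
```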
13:04
If you follow those rules, that gives you the flexibility, as a producer, to start producing a new version of the event, maybe with some additional field, without needing to restart or redeploy the consumers. And of course, if the change is too extreme,
13:23
you will need to make an incompatible change or, better, a new version of the schema, produced maybe in a separate topic, so that the consumers, once they need the new data, can switch to the new topic
13:42
after they change their code. In practice, the setup we have is that when a team wants to create a new event type
14:02
or wants to modify an existing schema, there is a central repository in which you can open a pull request, and that repository is connected to Jenkins. We have a CI/CD pipeline in Jenkins that makes many checks. Some of the checks are related to compatibility
14:22
with the previous version of the schema, if you are modifying one, and that's done with the collaboration of the schema registry, a piece of the Kafka ecosystem that is able to manage those schemas. Apart from that, we have many checks for conventions that we use as well.
14:41
We detect some bad practices and provide warnings, and after all those checks are green and someone has reviewed the pull request (most of the time, this is done within the same day),
15:03
we can approve it and publish the schema. In production, this schema registry is used with Kafka, and we will have producer and consumer processes, because in Kafka the brokers are not aware of the content
15:24
that you have in the messages. There is something the schema registry does at runtime: when you want to produce, the client library is going to contact the schema registry and is going to say,
15:41
"Hello, Mr. Registry, I want to produce with this schema." The schema registry is going to give the producer the ID of the schema, so you don't need to send the whole schema with each message; you send the ID and the binary data. The consumers are going to read those events, and remember that when you build your consumer,
16:04
maybe you compiled against some version of the schema. If at some point you receive a new version of the schema, the client library takes the ID, retrieves the schema, and does the evolution in place,
16:23
and that means that, with your version of the schema, maybe there is a new field you don't know about, so it is ignored, or a field was removed, and then you get the default value
16:41
of the field automatically. This is the setup that gives you flexibility while keeping some contract.
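A sketch of the producer side of this flow, using Confluent's Avro serializer (a real client class; the broker and registry addresses, topic, and schema below are placeholders, not Letgo's setup):

```scala
import java.util.Properties
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecordBuilder
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProduceWithRegistry extends App {
  val schema = new Schema.Parser().parse(
    """{"type":"record","name":"ChatMessageSent","fields":[
      |  {"name":"sender_id","type":"string"},
      |  {"name":"body","type":"string"}
      |]}""".stripMargin)

  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer",
    "org.apache.kafka.common.serialization.StringSerializer")
  // Registers/looks up the schema and prepends its ID to each payload,
  // so the whole schema never travels with the message.
  props.put("value.serializer",
    "io.confluent.kafka.serializers.KafkaAvroSerializer")
  props.put("schema.registry.url", "http://localhost:8081")

  val event = new GenericRecordBuilder(schema)
    .set("sender_id", "u-123").set("body", "hello").build()

  val producer = new KafkaProducer[String, AnyRef](props)
  producer.send(new ProducerRecord[String, AnyRef]("chat.message.sent", "u-123", event))
  producer.close()
}
```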
17:01
Then we have the other part of the puzzle: we also want to manage the private information, to know where the private information is. And this is domain knowledge, because if you are a data engineer like me, and there are like 1,000 event types, sometimes it's not obvious
17:20
where the email is in all those 1,000 events. Maybe the field is not called exactly "email", or maybe there is a user ID, which is also personal data, but it's not called "user ID"; it's called "sender ID", or "receiver ID", or maybe just "sender".
17:41
The ones that have that knowledge are the teams that produce and curate that information, and we also want to avoid writing a pile of ifs, so the solution is to use tagging, and that tagging can be added to the schema. If you think about it,
18:02
the teams are already writing the schemas, so it's very sensible to also ask the teams to add that tagging information while they write or modify the schema. And fortunately, Avro can be extended with arbitrary metadata.
18:23
In our case, the property we have used is called letgo properties. There are some places in a type definition where you can plug in these additional properties, and the official Apache Avro library
18:41
is going to parse the schema and give you that as a dictionary, so you can process it further. The simplest version of this is what you have on the left; it is just a tag.
19:01
On the right, we have a more complex version that allows you to link the private information to some other field. That's used because sometimes you have several user IDs: imagine you have a domain event with a chat message,
19:20
and there is the ID of the sender and the ID of the receiver, and you have the IP of the user sending the message. You know that that IP is related to the sender ID, but if you don't add that to the tags, we cannot do automatic processing of the event for the data governance.
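A sketch of what such tagging can look like; the property name "letgo_properties" and its shape are guesses, since the talk does not show the exact format, but Avro's parser does preserve unknown properties per field:

```scala
import org.apache.avro.Schema
import scala.jdk.CollectionConverters._

object PiiTags extends App {
  val schema = new Schema.Parser().parse(
    """{"type":"record","name":"ChatMessageSent","fields":[
      |  {"name":"sender_id","type":"string",
      |   "letgo_properties":{"pii":"user_id"}},
      |  {"name":"sender_ip","type":"string",
      |   "letgo_properties":{"pii":"ip","linked_to":"sender_id"}},
      |  {"name":"body","type":"string"}
      |]}""".stripMargin)

  // Walk the fields and collect the tags for automatic processing.
  for (field <- schema.getFields.asScala) {
    val tags = field.getObjectProp("letgo_properties")
    if (tags != null) println(s"${field.name}: $tags")
  }
}
```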
19:40
And of course, the Apache Avro library doesn't interpret this metadata; we need to do it. We have a library that is called wino, and it's called wino because it's used for telling apart
20:03
the grain from the chaff, like in the icon of the library. This library parses those pieces of metadata and is also able to take ACLs from a database that we call
20:20
the DPO database, which is the place where we centralize which pieces of private information each team has access to for processing. The data protection officer is the figure in the company
20:41
that rubber-stamps that a given team is able to access it. For example, imagine you have a team that is in charge of the newsletter of the company. That team will have some ACL giving them access to the email,
21:01
but maybe they don't have access to the IP of the user, because they don't need it. The kinds of actions that this wino library is able to do at the field level are, for example, removing fields or hashing fields. Hashing fields is useful because you can have a team that is interested in the email but only wants to count unique emails: with access to the hash of the email, they can do that with less risk than having the actual data.
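A minimal sketch of those two field-level actions (mine, not the wino library; the action names and event shape are invented):

```scala
import java.security.MessageDigest

object FieldActions extends App {
  sealed trait Action
  case object Remove extends Action
  case object Hash extends Action

  def sha256(s: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(s.getBytes("UTF-8"))
      .map("%02x".format(_)).mkString

  // Apply per-field decisions derived from the tags and the team's ACLs.
  def applyAcl(event: Map[String, String],
               actions: Map[String, Action]): Map[String, String] =
    event.flatMap { case (field, value) =>
      actions.get(field) match {
        case Some(Remove) => None                         // no permission at all
        case Some(Hash)   => Some(field -> sha256(value)) // count-only access
        case None         => Some(field -> value)         // full access
      }
    }

  // The newsletter team: may count unique emails, may not see IPs.
  println(applyAcl(
    Map("email" -> "ada@example.com", "ip" -> "10.0.0.7", "body" -> "hi"),
    Map("email" -> Hash, "ip" -> Remove)))
}
```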
21:21
Well, once we have this, we need to run it. And this is the life cycle of the events in real time.
21:49
The services of each team, or squad as we call them, work with a set of Kafka topics that are isolated from the rest of the teams,
22:02
because they don't have permission to access the other ones. That is tier one. Tier zero is the place where all the private information that should be governed is collected, and you can see that the topics with private information are in red.
22:24
Then we apply the gatekeeper. This process checks that the events have good data quality: they must match the schema, otherwise they are going to be rejected. We have alarms and we have some dashboards.
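A sketch of a gatekeeper-style check (invented here, not Letgo's implementation): decode the payload with the expected schema, and route failures to a rejected topic with alarms; the topic routing itself is omitted:

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import scala.util.{Failure, Success, Try}

object Gatekeeper {
  def validate(payload: Array[Byte], schema: Schema): Either[String, GenericRecord] = {
    val reader = new GenericDatumReader[GenericRecord](schema)
    val decoder = DecoderFactory.get().binaryDecoder(payload, null)
    Try(reader.read(null, decoder)) match {
      case Success(record) => Right(record) // forward to the validated topic
      case Failure(error)  => Left(s"rejected: ${error.getMessage}")
    }
  }
}
```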
22:44
The result is validated topics, with both a JSON version and an Avro version, so you can choose whichever. Then the data pump is the piece that uses wino to do the governance in real time, sending the events to the green topics, so the teams can access private information
23:02
from other teams, but only if they have permission. And with that, we have completed the whole cycle. I want to emphasize the four properties of this approach. First, it is self-service, because the teams configure the data and that drives the behavior of the platform.
23:23
Second, it's metadata-driven, because the schemas and ACLs drive everything. Third, it's focused on availability. And finally, there is privacy safety: if you start a new team and you don't get any additional permissions, you cannot access private information from any other team.
23:41
It is safe by default. And that's the solution we have implemented. Now, I think we'll have a couple of minutes for questions, maybe. Yeah, thanks a lot, Sebastián, for this awesome talk. I think there were some really good best practices,
24:02
and it was really, really interesting to see how you tackle personal private information and GDPR requirements at Letgo. Let's check if there are some questions from the audience. So, I have a question.
24:21
Is there anything that you would wish for from Apache Avro? For instance, is there a feature that you think would make your life easier, or do you think it's already good for what it's doing?
24:41
We should distinguish between the Apache Avro standard and the implementation. I think that the standard is nice as it is. But in the implementation, the data classes are mutable
25:00
and use inheritance. And because our code base for manipulating this is written in Scala, I would like to have a native implementation in Scala in which you have case classes, they are immutable, and it's easier to transform the values.
25:22
Because I was writing some wrapping code to work around this. We have two implementations of wino. One is for streaming; it goes with Spark,
25:42
and we need to transform that metadata into a Spark operation, and that's a little bit tricky. The other works with normal data objects, and that's a little bit easier. Imagine doing recursion over objects like that.
26:04
So that would be an improvement.