Data lake: Design for schema evolution

EuroPython

Yadav, Prakshi

Formale Metadaten

Titel

Serientitel

EuroPython 2021

Anzahl der Teile

115

Autor

Yadav, Prakshi

Mitwirkende

Holderness, Eli (Moderation)

Lizenz

CC-Namensnennung - keine kommerzielle Nutzung - Weitergabe unter gleichen Bedingungen 4.0 International:
Sie dürfen das Werk bzw. den Inhalt zu jedem legalen und nicht-kommerziellen Zweck nutzen, verändern und in unveränderter oder veränderter Form vervielfältigen, verbreiten und öffentlich zugänglich machen, sofern Sie den Namen des Autors/Rechteinhabers in der von ihm festgelegten Weise nennen und das Werk bzw. diesen Inhalt auch in veränderter Form nur unter den Bedingungen dieser Lizenz weitergeben.

Identifikatoren

10.5446/58820 (DOI)

Herausgeber

EuroPython

Erscheinungsjahr

2021

Sprache

Englisch

Inhaltliche Metadaten

Fachgebiet

Informatik

Genre

Konferenz/Talk

Abstract

Designing a data lake necessitates well-researched storage, management, scalability, and availability solutions. However, managing schema evolution remains a difficult task. The structure of data differs from one company to the next, making it difficult to generalize a solution to the schema evolution problem. At Episource, we faced a similar challenge - our data of interest is the output from our NLP engine. Episource's machine learning and natural language processing platform processes millions of pages of medical documents, with up to 15 ML/DL models working together to produce the results. The result of such a challenging pipeline is a complex nested JSON series. With each major update, our NLP engine evolves, causing the inference data structure to evolve as well. As data grew in size and complexity, storing it and making it searchable became a pressing necessity. We needed a solution that kept schema compatibility, versioning, and data integrity intact. We wanted to make sure data reads and writes were unaffected by the Schema mismatch problem. After several iterations and proofs of concept, we settled on a solution that uses the AVRO format to evolve our data's schema. Avro is a format similar to Parquet but can also accommodate schema evolution. To keep track of changes made to the system, schema versions are saved in a Schema registry. To read the AVRO data stored in S3, our data lake uses Athena, a distributed SQL engine based on Presto. The solution makes use of python libraries to glue various components of this pipeline. The following are some of the things that a participant can expect to learn during this talk: In a data lake, best practices for storage, control, scalability, and availability Managing schema evolution in a data lake The ability to use both "schema-on-write" and "schema-on-read"