
Data lake: Design for schema evolution

Formal Metadata

Title
Data lake: Design for schema evolution
Title of Series
Number of Parts
115
Author
Contributors
License
CC Attribution - NonCommercial - ShareAlike 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Designing a data lake necessitates well-researched storage, management, scalability, and availability solutions. Managing schema evolution, however, remains a difficult task. The structure of data differs from one company to the next, which makes it hard to generalize a solution to the schema evolution problem.

At Episource, we faced this challenge with the output of our NLP engine. Episource's machine learning and natural language processing platform processes millions of pages of medical documents, with up to 15 ML/DL models working together to produce the results. The output of such a pipeline is a series of complex, nested JSON documents. With each major update, our NLP engine evolves, and the structure of its inference data evolves with it. As the data grew in size and complexity, storing it and making it searchable became a pressing necessity. We needed a solution that kept schema compatibility, versioning, and data integrity intact, so that data reads and writes would be unaffected by schema mismatches.

After several iterations and proofs of concept, we settled on a solution that uses the Avro format to evolve our data's schema. Avro is a format similar to Parquet, but one that can also accommodate schema evolution. To keep track of changes to the system, schema versions are saved in a schema registry. To read the Avro data stored in S3, our data lake uses Athena, a distributed SQL engine based on Presto. Python libraries glue the various components of this pipeline together.

Participants can expect to learn about the following during this talk:
- Best practices for storage, control, scalability, and availability in a data lake
- Managing schema evolution in a data lake
- Using both "schema-on-write" and "schema-on-read"
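As a rough sketch of the schema-evolution mechanism the abstract describes, the snippet below writes a record with an older writer schema and reads it back with a newer reader schema using the fastavro library. The record type, field names, and schema contents are illustrative, not Episource's actual NLP output.

```python
import io
import fastavro

# Writer schema: the shape an earlier release of the pipeline produced.
writer_schema = fastavro.parse_schema({
    "type": "record",
    "name": "InferenceResult",
    "fields": [
        {"name": "document_id", "type": "string"},
        {"name": "label", "type": "string"},
    ],
})

# Reader schema: a later release added a field. Giving it a default
# keeps the schemas compatible, so older records remain readable.
reader_schema = fastavro.parse_schema({
    "type": "record",
    "name": "InferenceResult",
    "fields": [
        {"name": "document_id", "type": "string"},
        {"name": "label", "type": "string"},
        {"name": "confidence", "type": "double", "default": 0.0},
    ],
})

# Write a record with the old schema...
buf = io.BytesIO()
fastavro.writer(buf, writer_schema, [{"document_id": "doc-1", "label": "I10"}])

# ...and read it back with the new one: Avro's schema resolution
# fills in the default for the missing field at read time.
buf.seek(0)
for record in fastavro.reader(buf, reader_schema=reader_schema):
    print(record)  # {'document_id': 'doc-1', 'label': 'I10', 'confidence': 0.0}
```

The key design point is that newly added fields carry defaults, so records written before a field existed can still be deserialized against the current schema.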
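The talk does not specify which schema registry product was used, so the following is only a minimal sketch of the pattern: each schema version is stored as an immutable object, keyed by name and version, so a reader can always recover the exact writer schema a record was produced with. The bucket name and key layout are hypothetical.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-schema-registry"  # hypothetical bucket name

def register_schema(name: str, version: int, schema: dict) -> None:
    """Store one immutable schema version per object key."""
    s3.put_object(
        Bucket=BUCKET,
        Key=f"schemas/{name}/v{version}.avsc",
        Body=json.dumps(schema).encode("utf-8"),
    )

def fetch_schema(name: str, version: int) -> dict:
    """Fetch the exact writer schema a record was produced with."""
    obj = s3.get_object(Bucket=BUCKET, Key=f"schemas/{name}/v{version}.avsc")
    return json.loads(obj["Body"].read())
```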
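Querying the Avro files in S3 through Athena can be driven from Python with boto3, roughly as below. Athena executes queries asynchronously, so the client polls until the query reaches a terminal state. Database, table, region, and output-location names here are placeholders, not values from the talk.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

run = athena.start_query_execution(
    QueryString="SELECT document_id, label FROM inference_results LIMIT 10",
    QueryExecutionContext={"Database": "nlp_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = run["QueryExecutionId"]

# Athena is asynchronous: poll until the query finishes.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    result = athena.get_query_results(QueryExecutionId=query_id)
    # The first row of an Athena result set holds the column headers.
    for row in result["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```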