Feature store: the missing data layer in ML pipelines?

CC-Namensnennung 2.0 Belgien:
Sie dürfen das Werk bzw. den Inhalt zu jedem legalen Zweck nutzen, verändern und in unveränderter oder veränderter Form vervielfältigen, verbreiten und öffentlich zugänglich machen, sofern Sie den Namen des Autors/Rechteinhabers in der von ihm festgelegten Weise nennen.

Identifikatoren

10.5446/44265 (DOI)

Herausgeber

FOSDEM VZW

Erscheinungsjahr

2019

Sprache

Englisch

Inhaltliche Metadaten

Fachgebiet

Informatik

Genre

Konferenz/Talk

Abstract

Data may be the new oil, but refined data is the fuel for AI. Machine learning (ML) systems are only as good as the data they are trained on and getting the data in the right format at the right time is a challenge. ML systems are trained using sets of features, a feature can be as simple as the value of a column in a database entry, or it can be a complex value that is computed from diverse sources. A feature store is a central vault for storing documented and curated features, ideally with support for access control. A feature store enables automatic feature analysis and monitoring, feature sharing across models and teams, feature discovery, feature backfilling, and feature versioning. The feature store is a data management layer that fills an important piece in the modern machine learning infrastructure, it empowers enterprises to scale their machine learning workflows and make full use of their investment in machine learning. In this talk, we will present key points on how to take your machine learning workflow to the next level using a feature store, and demonstrate how the feature store fits into the larger machine learning pipeline. We will introduce HopsML, an open-source, end-to-end machine learning pipeline built on the world's most fastest and most scalable Hadoop distribution, Hops Hadoop. With HopsML you can build production-ready machine learning pipelines using open source software, where features are stored in a shared feature store that is automatically backfilled as new data arrive, where machine learning models can be trained on datasets in the order of billions examples using distributed deep learning, where data scientists can follow engineering principles by using versioned and reproducible experiments, and where models can be automatically deployed in an elastic manner using auto-scaling.