Supervised Learning with Missing Values

Institut des Hautes Études Scientifiques (IHÉS)

Varoquaux, Gaël

Formale Metadaten

Titel

Serientitel

Journée Statistique et Informatique pour la Science des Données à Paris Saclay, 2021

Anzahl der Teile

Autor

Varoquaux, Gaël

Mitwirkende

Soussen, Charles (Moderation)

Lizenz

CC-Namensnennung 3.0 Unported:
Sie dürfen das Werk bzw. den Inhalt zu jedem legalen Zweck nutzen, verändern und in unveränderter oder veränderter Form vervielfältigen, verbreiten und öffentlich zugänglich machen, sofern Sie den Namen des Autors/Rechteinhabers in der von ihm festgelegten Weise nennen.

Identifikatoren

10.5446/54727 (DOI)

Herausgeber

Institut des Hautes Études Scientifiques (IHÉS)

Erscheinungsjahr

2021

Sprache

Englisch

Produktionsjahr

2020

Inhaltliche Metadaten

Fachgebiet

Informatik Mathematik

Genre

Workshop/Interaktives Format

Abstract

Some data come with missing values. For instance, a survey’s participant may ignore some questions. There is an abundant statistical literature on this topic, establishing for instance how to fit model without biases due to the missingness, and imputation strategies to provide practical solutions to the analyst. In machine learning, to build models that minimize a prediction risk, most work default to these practices. As we will see, these different settings lead to different theoretical and practical solutions. I will outline some conditions under which machine-learning models yield the best-possible predictions in the presence of missing values. A striking result is that naive imputation strategies can be optimal, as the supervised-learning model does the hard work [1]. A challenge to fitting a machine-learning model is that there is a combinatorial explosion of possible missing-values patterns such that even when the output is a linear function of the fully-observed data, the optimal predictor is complex [2]. I will show how the same dedicated neural architecture can approximate well the optimal predictor for multiple missing-values mechanisms, including difficult missing-not-at-random settings [3]. [1] Josse, J., Prost, N., Scornet, E., & Varoquaux, G. (2019). On the consistency of supervised learning with missing values. arXiv preprint. [2] Le Morvan, M., Prost, N., Josse, J., Scornet, E., & Varoquaux, G. (2020). Linear predictor on linearly-generated data with missing values: non consistency and solutions. AISTATS 2020.