
Leverage knowledge from under-represented classes in machine learning

Formal Metadata

Title
Leverage knowledge from under-represented classes in machine learning
Number of Parts
43
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Place
Erlangen, Germany

Content Metadata

Abstract
The curse of imbalanced data sets

An imbalanced data set is one in which the number of samples in one class is much smaller than in the others. This issue is often encountered in real-world data sets, such as medical imaging applications (e.g. cancer detection), fraud detection, etc. Under these conditions, machine learning algorithms learn sub-optimal models that generally favor the class with the largest number of samples. In this talk, we will present the imbalanced-learn package, which implements some of the state-of-the-art algorithms for tackling the class imbalance problem.

A scikit-learn-contrib project

scikit-learn includes a tremendous set of pre-processing methods (i.e. transformers, standardizers, etc.) to optimally train machine learning algorithms. However, it currently offers no estimators that reduce or generate samples. Therefore, imbalanced-learn provides a new type of estimator, named sampler, which resamples a data set whenever desired. The samplers are fully compatible with the current scikit-learn API and expose three main methods inspired by scikit-learn: (i) fit, (ii) sample, and (iii) fit_sample (a usage sketch follows the abstract). Additionally, a Pipeline class inherited from scikit-learn makes it possible to incorporate samplers in the usual classification pipeline. During the talk, we will also present the key parameters shared by all the samplers.

A data science perspective

Regarding the data science aspect of this talk, we will highlight the distinctive characteristics of the different families of algorithms: (i) over-sampling, (ii) controlled under-sampling, (iii) cleaning under-sampling, (iv) combination of over-sampling and cleaning under-sampling, and (v) ensemble samplers.

Concrete examples

In addition, we will briefly present a couple of examples in which the package has been used on real-world data sets.

Perspectives

Our package is still under heavy development, and we aim to improve the following points:
- Speed optimization through benchmarking and profiling.
- Quantitative classification performance benchmarking.
- Additional algorithms (categorical features, ...).
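
As a rough illustration of the sampler API described in the abstract, the sketch below balances a synthetic imbalanced data set with SMOTE. Note one assumption: recent imbalanced-learn releases name the resampling method fit_resample; the fit_sample name used in the abstract belongs to the package's early API.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Build a two-class data set with roughly a 1:9 class ratio.
X, y = make_classification(n_samples=1000, weights=[0.1, 0.9],
                           random_state=0)
print("before:", Counter(y))

# SMOTE generates synthetic minority-class samples until the
# classes are balanced. (fit_resample is the current method name;
# early releases contemporary with this talk called it fit_sample.)
smote = SMOTE(random_state=0)
X_res, y_res = smote.fit_resample(X, y)
print("after:", Counter(y_res))
```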
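
The Pipeline integration mentioned in the abstract can be sketched in the same spirit. The example below assumes a recent imbalanced-learn release providing imblearn.pipeline.make_pipeline and RandomUnderSampler; the key property is that the sampler is applied only when fitting, so validation data keeps its original class distribution.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.1, 0.9],
                           random_state=0)

# The imblearn pipeline resamples each training fold only;
# the held-out folds are scored on the original distribution,
# avoiding a common evaluation pitfall.
model = make_pipeline(RandomUnderSampler(random_state=0),
                      LogisticRegression())
scores = cross_val_score(model, X, y, scoring="roc_auc", cv=5)
print("mean ROC AUC:", scores.mean())
```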