
Leverage knowledge from under-represented classes in machine learning

Formal Metadata

Title
Leverage knowledge from under-represented classes in machine learning
Number of Parts
43
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Place
Erlangen, Germany

Content Metadata

Abstract
The curse of imbalanced data sets

An imbalanced data set is one in which the number of samples in one class is much smaller than in the others. This issue is often encountered in real-world data sets, such as medical imaging applications (e.g. cancer detection), fraud detection, etc. Under these conditions, machine learning algorithms learn sub-optimal models that generally favor the class with the largest number of samples. In this talk, we will present the imbalanced-learn package, which implements some of the state-of-the-art algorithms for tackling the class imbalance problem.

A scikit-learn-contrib project

scikit-learn includes a tremendous set of pre-processing methods (i.e. transformers, standardizers, etc.) to optimally train machine learning algorithms. However, it currently offers no estimators that reduce or generate samples. Therefore, imbalanced-learn provides a new type of estimator, named sampler, which resamples a data set whenever desired. The samplers are fully compatible with the current scikit-learn API and expose three main methods inspired by scikit-learn: (i) fit, (ii) sample, and (iii) fit_sample (a usage sketch follows the abstract). Additionally, a Pipeline class inherited from scikit-learn makes it possible to incorporate samplers in the usual classification pipeline. During the talk, we will also present the key parameters shared by all the samplers.

A data science perspective

Regarding the data science aspect of this talk, we will highlight the distinctive characteristics of the different families of algorithms: (i) over-sampling, (ii) controlled under-sampling, (iii) cleaning under-sampling, (iv) combination of over-sampling and cleaning under-sampling, and (v) ensemble samplers.

Concrete examples

In addition, we will briefly present a couple of examples in which the package has been used on real-world data sets.

Perspectives

Our package is still under heavy development, and we aim to improve the following points:
- Speed optimization through benchmarking and profiling.
- Quantitative classification performance benchmarking.
- Additional algorithms (categorical features, ...).
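
As a rough illustration of the sampler API described in the abstract, the sketch below balances a synthetic imbalanced data set with SMOTE. Note one assumption: recent imbalanced-learn releases name the resampling method fit_resample; the fit_sample name used in the abstract belongs to the package's early API.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Build a two-class data set with roughly a 1:9 class ratio.
X, y = make_classification(n_samples=1000, weights=[0.1, 0.9],
                           random_state=0)
print("before:", Counter(y))

# SMOTE generates synthetic minority-class samples until the
# classes are balanced. (fit_resample is the current method name;
# early releases contemporary with this talk called it fit_sample.)
smote = SMOTE(random_state=0)
X_res, y_res = smote.fit_resample(X, y)
print("after:", Counter(y_res))
```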
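
The Pipeline integration mentioned in the abstract can be sketched in the same spirit. The example below assumes a recent imbalanced-learn release providing imblearn.pipeline.make_pipeline and RandomUnderSampler; the key property is that the sampler is applied only when fitting, so validation data keeps its original class distribution.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.1, 0.9],
                           random_state=0)

# The imblearn pipeline resamples each training fold only;
# the held-out folds are scored on the original distribution,
# avoiding a common evaluation pitfall.
model = make_pipeline(RandomUnderSampler(random_state=0),
                      LogisticRegression())
scores = cross_val_score(model, X, y, scoring="roc_auc", cv=5)
print("mean ROC AUC:", scores.mean())
```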