## Challenges in CIL

The abundance of raw data in many of today's real-world applications creates opportunities for learning from imbalanced data to play an important role across different application domains. At the same time, however, it raises several new challenges [9]. Here, we briefly present several aspects that suggest future research directions in this critical research domain.

**Understanding the Fundamental Problems**

The majority of imbalanced learning research focuses on improving the performance of specific algorithms paired with specific datasets, with only a limited theoretical understanding of the principles of the problem space and the consequences of the various assumptions made. For instance, several imbalanced learning algorithms published in the literature claim to improve a performance metric by some margin over previous solutions, on a case-by-case basis for the datasets attempted. Yet there are also situations where learning from the original dataset provides better performance. This leads us to an important question: to what extent do imbalanced learning methods actually help with learning? This question can be further refined as:

1. When a method outperformed other methods, what are the underlying effects that led to the better performance?

2. Does the solution provide clarity on the fundamental understanding of the problem at hand?

3. Can the solution scale to various other types of data?

We believe that these fundamental questions should be studied with greater interest both theoretically and empirically in order to thoroughly understand the essence of imbalanced learning problems and solutions. Furthermore, we should also ﬁnd answers to the following speciﬁc questions that would allow us to gauge the solutions better:

- What assumptions must hold for imbalanced learning algorithms to work better than learning from the original distributions?
- To what degree should one artificially balance [73, 74] the original dataset by adjusting its sample distribution?
- How do imbalanced data distributions affect the computational complexity of learning algorithms?
- What is the general error bound, given an imbalanced data distribution?
- Is there a general theoretical methodology that can alleviate the impediment of learning from imbalanced datasets for speciﬁc algorithms and application domains?

Estabrooks et al. [73] suggested that a combination of different expressions of resampling methods may be an effective solution to the tuning problem. Weiss and Provost [74] analyzed, for a fixed training set size, the relationship between the class distribution of the training data (expressed as the percentage of minority class examples) and classifier performance in terms of accuracy and AUC. Based on a thorough analysis of 26 datasets, they suggested that if accuracy is selected as the performance criterion, the best class distribution tends to be near the naturally occurring class distribution; if AUC is selected as the assessment metric, the best class distribution tends to be near the balanced class distribution. Based on these observations, a "budget-sensitive" progressive sampling strategy was proposed to efficiently sample minority and majority class examples such that the resulting training class distribution provides the best performance.
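As a toy illustration of adjusting the training class distribution, the sketch below undersamples the majority class until the minority class reaches a target fraction of the training set. The function `resample_to_target` and its parameters are illustrative assumptions; this is not the budget-sensitive progressive sampling strategy of [74] itself.

```python
import random

def resample_to_target(majority, minority, minority_frac, seed=0):
    """Undersample the majority class so that the minority class makes up
    roughly `minority_frac` of the resulting training set.

    Keeps all minority examples and solves for the majority sample size
    from: minority_frac = |min| / (|min| + |maj'|).
    """
    rng = random.Random(seed)
    n_major = round(len(minority) * (1 - minority_frac) / minority_frac)
    n_major = min(n_major, len(majority))  # cannot sample more than we have
    sample = rng.sample(majority, n_major) + list(minority)
    rng.shuffle(sample)
    return sample

# Usage: 1000 majority vs. 10 minority examples, rebalanced to 50/50.
data = resample_to_target(list(range(1000)), ["min"] * 10, minority_frac=0.5)
print(len(data), data.count("min"))  # 20 examples, 10 of them minority
```

In practice, [74] tunes this target fraction per dataset and metric rather than fixing it at a balanced 50/50 split.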

## Uniform Benchmark Platform

Class imbalance learning researchers typically use standard multi-class datasets in a one-versus-rest configuration to emulate binary class imbalance problems when reporting their results. Although there are currently many publicly available benchmarks for assessing the performance of classification algorithms, such as the UCI Repository [64] and the LIBSVM datasets^{[10]}, only a very limited number of benchmarks are solely dedicated to imbalanced learning problems. None of the existing classification data repositories identify or mention the imbalance ratio as a dataset characteristic. This limitation can severely affect the long-term development of research in class imbalance learning in the following ways:

1. Lack of a uniform benchmark for standardized performance assessments.

2. Lack of data sharing and data interoperability across different disciplinary domains.

3. Increased procurement costs.

## Standardized Evaluation

Accuracy is a measure of trueness, given by (TP+TN)/N and computed from the confusion matrix. Consider a classifier that returns the label of the majority class for any input. When it is evaluated on 1001 random test points drawn with a 1:1000 imbalance ratio, the estimated accuracy is (1000+0)/1001 ≈ 99.9%, which gives the misleading impression that the classifier's performance is high.
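The arithmetic above can be reproduced in a few lines. The `accuracy` helper below is a hypothetical illustration of the (TP+TN)/N formula, not code from any cited work:

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / N, from confusion-matrix counts."""
    return (tp + tn) / (tp + tn + fp + fn)

# 1 minority (positive) vs. 1000 majority (negative) test points.
# Always predicting the majority class gives TN=1000, FN=1, TP=FP=0,
# yet accuracy is ~0.999 despite never detecting the minority class.
print(round(accuracy(tp=0, tn=1000, fp=0, fn=1), 4))
```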

Precision is a measure of exactness, given by TP/(TP+FP), and recall is a measure of completeness, given by TP/(TP+FN). It is apparent from the formulas that precision is sensitive to the data distribution, while recall is not. An assertion based solely on recall is ambiguous, since recall provides no insight into how many examples are incorrectly labeled as positive (minority). Similarly, precision cannot assert how many positive examples are labeled incorrectly. The F-score combines precision and recall to evaluate classification performance effectively in imbalanced learning scenarios.
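These definitions translate directly into code on the confusion-matrix counts. The counts in the usage example are made up for illustration, and the helpers guard against empty denominators by returning 0.0 (a convention chosen here, not prescribed by the text):

```python
def precision(tp, fp):
    """Exactness: TP / (TP + FP)."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Completeness: TP / (TP + FN)."""
    return tp / (tp + fn) if tp + fn else 0.0

def f_score(tp, fp, fn, beta=1.0):
    """F-beta score: weighted harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    if p + r == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

# Hypothetical counts for a minority (positive) class:
tp, fp, fn = 8, 4, 2
print(round(precision(tp, fp), 3))   # 0.667
print(round(recall(tp, fn), 3))      # 0.8
print(round(f_score(tp, fp, fn), 3)) # 0.727
```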

The traditional use of point-based evaluation metrics such as accuracy, precision, and recall is not sufficient when handling class imbalance learning problems, as these metrics are sensitive to the data distribution. It becomes very difficult to provide any concrete relative evaluation of different algorithms over varying data distributions without an accompanying curve-based analysis. Therefore, the community should establish, as a standard, the practice of using curve-based evaluation techniques such as the ROC curve, the precision-recall curve, and the cost curve. This is not only because each technique provides its own answers to different fundamental questions, but also because an analysis in the evaluation space of one technique can be correlated with the evaluation space of another. This standard would lead to increased transitivity and a broader understanding of the functional abilities of existing and future works.
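As a minimal sketch of one such curve-based technique, the following computes ROC points and the trapezoidal area under the curve (AUC) from classifier scores and binary labels. The function names are illustrative, and tied scores are handled naively by sorting order, which a production implementation would treat more carefully:

```python
def roc_points(scores, labels):
    """Sweep a threshold over scores (higher = more positive) and return
    (FPR, TPR) points from the most to the least conservative threshold."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under a curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Usage: two positives and two negatives, one ranking error.
print(auc(roc_points([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])))  # 0.75
```

A precision-recall curve can be traced by the same threshold sweep, substituting precision and recall for FPR and TPR at each point.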

## SSL from Imbalanced Data

The key idea of semi-supervised learning is to exploit unlabeled examples, using them to modify, refine, or reprioritize the hypothesis obtained from the labeled data alone [75]. Some pertinent questions include:

1. How can we identify whether an unlabeled data example came from a balanced or imbalanced underlying distribution?

2. Given imbalanced training data with labels, what are effective and efficient methods for recovering the labels of the unlabeled data examples?

3. What kind of biases may be introduced in the recovery process given imbalanced labeled data?

## Summary

We have motivated the need for class imbalance learning, a special use case of machine learning, by its wide applicability in real-world classification tasks. We did so by introducing the fundamentals of class imbalance learning and the battery of solutions available in the literature to combat it. We have also presented some real-world problem scenarios with imbalance characteristics. We concluded by listing the available opportunities and challenges in the field of class imbalance learning research.

**Acknowledgements**

This research work was partly supported by a funding grant from IIT Madras under project CSE/1415/831/RFTP/BRAV.

^{[10]}https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
