Solutions to the problem of class imbalance learning
When standard learning algorithms are applied to imbalanced data, the induction rules that describe the minority concepts are often fewer and weaker than those of the majority concepts, since the minority class is often both outnumbered and underrepresented. To provide a concrete understanding of the direct effects of the imbalanced learning problem on standard learning algorithms, consider the popular decision tree learning algorithm, where imbalanced data exposes inadequacies in the splitting criterion at each node.
Decision trees use a recursive, top-down greedy search that applies a feature selection scheme (e.g., information gain) to choose the best feature as the split criterion at each node of the tree; a successor node is then created for each of the possible values of the split feature. As a result, the training set is successively partitioned into smaller subsets that are ultimately used to form disjoint rules pertaining to class concepts. These rules are finally combined so that the final hypothesis minimizes the total error rate across the classes. This procedure is problematic in the presence of imbalanced data in two ways. First, successive partitioning of the data space leaves fewer and fewer observations of minority class examples, resulting in fewer leaves describing minority concepts and successively weaker confidence estimates. Second, concepts that depend on different feature space conjunctions can go unlearned because of the sparseness introduced through partitioning. Here, the first issue correlates with the problems of relative and absolute imbalance, while the second issue best correlates with the between-class imbalance and the problem of high dimensionality. In both cases, the effect of imbalanced data on decision tree classification performance is detrimental. In the following sections, we evaluate the solutions proposed to overcome these effects.
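The insensitivity of the splitting criterion to the minority class can be seen directly in the arithmetic of information gain. The sketch below is purely illustrative (the class counts are invented for the example): at a node with a 990-to-10 class ratio, even a split that cleanly isolates half of the minority examples produces only a tiny gain, so the criterion has little incentive to prefer it over splits that serve the majority class.

```python
import math

def entropy(counts):
    """Shannon entropy of a class distribution given per-class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_splits):
    """Reduction in entropy obtained by splitting the parent node
    into the given child nodes (each a list of per-class counts)."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_splits)
    return entropy(parent_counts) - weighted

# A 990-vs-10 imbalanced node: a split that isolates all 10 minority
# examples on one side still yields a very small information gain,
# because the parent entropy was already close to zero.
gain = information_gain([990, 10], [[495, 10], [495, 0]])
```

Note that the same split applied to a balanced node would produce a far larger gain; the smallness of `gain` here comes entirely from the skewed class prior.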
A battery of methods to address the class imbalance condition is available in the literature. Typically, these methods are categorized into sampling methods, cost-sensitive methods, kernel methods, and active learning methods.
Sampling methods

The mechanics of random oversampling follow naturally from its description: for a set of randomly selected minority examples, augment the original dataset by replicating the selected examples and adding the copies to it. In this way, the total number of examples is increased and the class distribution balance is adjusted accordingly, providing a mechanism for varying the degree of class distribution balance to any desired level. While oversampling appends data to the original dataset, random undersampling removes data from it. At first glance, the oversampling and undersampling methods appear to be functionally equivalent, since they both alter the size of the original dataset and can actually produce the same proportion of balance. However, this commonality is only superficial; each method introduces its own set of problematic consequences that can potentially hinder learning. In the case of undersampling, the problem is relatively obvious: removing examples from the majority class may cause the classifier to miss important concepts pertaining to the majority class. In regards to oversampling, the problem is a little more opaque: since oversampling simply appends replicated data to the original dataset, multiple instances of certain examples become "tied," leading to overfitting. In particular, overfitting in oversampling occurs when classifiers produce multiple clauses in a rule for multiple copies of the same example, which causes the rule to become too specific; although the training accuracy will be high in this scenario, the classification performance on unseen testing data is generally far worse.
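The two baseline methods can be sketched in a few lines; the function names and toy data below are our own, not a standard API.

```python
import random

def random_oversample(majority, minority, rng):
    """Replicate randomly chosen minority examples until the classes balance."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

def random_undersample(majority, minority, rng):
    """Discard randomly chosen majority examples until the classes balance."""
    return rng.sample(majority, len(minority)), minority

rng = random.Random(0)
maj_class = list(range(100))         # 100 majority examples (toy ids)
min_class = list(range(100, 110))    # 10 minority examples
maj_o, min_o = random_oversample(maj_class, min_class, rng)
maj_u, min_u = random_undersample(maj_class, min_class, rng)
```

Both calls produce a 1:1 class ratio, but by opposite means: the first duplicates minority examples (the "tied" copies that invite overfitting), while the second discards 90% of the majority class outright.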
Informed undersampling based on EasyEnsemble and BalanceCascade was proposed to overcome the deficiency of information loss introduced by the traditional random undersampling method. Another example of informed undersampling uses the K-nearest neighbor (KNN) classifier to achieve undersampling. Based on the characteristics of the given data distribution, four KNN undersampling methods were proposed, namely, NearMiss-1, NearMiss-2, NearMiss-3, and the "most distant" method. Another method, one-sided selection (OSS), selects a representative subset of the majority class and combines it with the set of all minority examples to form a preliminary set, which is further refined by using data cleaning techniques.
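To make the KNN-based selection concrete, the following simplified sketch follows the NearMiss-1 idea: retain the majority examples whose average distance to their k nearest minority examples is smallest. The data, helper names, and use of squared Euclidean distance are our own illustrative choices.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def near_miss_1(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points with the smallest mean distance
    to their k nearest minority neighbors (NearMiss-1-style selection)."""
    def score(p):
        nearest = sorted(sq_dist(p, m) for m in minority)[:k]
        return sum(nearest) / len(nearest)
    return sorted(majority, key=score)[:n_keep]

# Majority points close to the minority region are the ones retained.
minority = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5)]
majority = [(0.2, 0.2), (0.3, 0.1), (5.0, 5.0), (6.0, 6.0), (7.0, 7.0)]
kept = near_miss_1(majority, minority, n_keep=2)
```

NearMiss-2 and NearMiss-3 differ only in the scoring rule (distance to the *farthest* minority neighbors, and a per-minority-point selection, respectively), so they fit the same skeleton.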
Synthetic sampling is a powerful method that has shown a great deal of success in various applications. The SMOTE algorithm creates artificial data based on the feature-space similarities between existing minority examples. To create a synthetic sample around a minority point in vector space, randomly select one of its K nearest minority neighbors, multiply the corresponding feature vector difference by a random number in [0, 1], and finally add this vector to the minority point vector. Though it has many promising benefits, the SMOTE algorithm also has its drawbacks, including overgeneralization and variance, which are largely attributed to the way in which it creates synthetic samples. Specifically, SMOTE generates the same number of synthetic data samples for each original minority example and does so without consideration of neighboring examples, which increases the occurrence of overlapping between classes. To overcome these issues, later methods subject only selected subsets of the minority class to synthetic sample generation. Borderline-SMOTE uses only the minority samples near the decision boundary to generate new synthetic samples. MWMOTE identifies the hard-to-learn informative minority class samples and assigns them weights according to their Euclidean distance from the nearest majority class samples; it then generates the synthetic samples from the weighted informative minority class samples using a clustering approach. SCUT oversamples minority class examples through the generation of synthetic examples and employs cluster analysis in order to undersample majority classes; in addition, it handles both within-class and between-class imbalance. Along the same lines, various adaptive sampling methods have been proposed to overcome these limitations; some representative work includes Adaptive Synthetic Sampling (ADASYN), ProWSyn, and R-SMOTE.
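The interpolation step described above can be sketched directly; this is a minimal single-sample version of the SMOTE generation rule, with toy data and our own function name.

```python
import random

def smote_sample(x, minority_neighbors, k, rng):
    """One SMOTE-style synthetic point: pick one of x's k nearest minority
    neighbors nn, then return x + u * (nn - x) with u drawn from [0, 1]."""
    ranked = sorted(minority_neighbors,
                    key=lambda n: sum((a - b) ** 2 for a, b in zip(x, n)))
    nn = rng.choice(ranked[:k])
    u = rng.random()
    return tuple(a + u * (b - a) for a, b in zip(x, nn))

rng = random.Random(1)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
x = minority[0]
synthetic = smote_sample(x, minority[1:], k=2, rng=rng)
```

Because the synthetic point always lies on the line segment between `x` and one of its nearest minority neighbors, samples generated from a minority point that sits inside the majority region land inside it too; this is exactly the overgeneralization problem that Borderline-SMOTE and MWMOTE address by restricting which points are used as `x`.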
Data cleaning techniques, such as Tomek links, have been effectively applied to remove the overlapping introduced by sampling methods. A Tomek link is a pair of opposite-class instances that are each other's nearest neighbors; if two instances form a Tomek link, then either one of them is noise or both are near a border. Therefore, one can use Tomek links to "clean up" unwanted overlapping between classes after synthetic sampling: all Tomek links are removed until every minimally distanced nearest neighbor pair is of the same class. By removing overlapping examples, one can establish well-defined class clusters in the training set, which can, in turn, lead to well-defined classification rules and improved classification performance. Some representative work in this area includes the OSS method, the integration of the condensed nearest neighbor (CNN) rule with Tomek links (CNN+Tomek links), the neighborhood cleaning rule (NCL) based on the edited nearest neighbor (ENN) rule, which removes examples whose class differs from that of at least two of their three nearest neighbors, and the integrations of SMOTE with ENN (SMOTE+ENN) and SMOTE with Tomek links (SMOTE+Tomek).
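Detecting Tomek links reduces to a mutual-nearest-neighbor test across class labels; a minimal sketch (toy points and labels of our own) might look like:

```python
def squared_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(i, points):
    """Index of the nearest other point to points[i]."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: squared_dist(points[i], points[j]))

def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbors
    with opposite class labels."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if labels[i] != labels[j] and nearest(j, points) == i and i < j:
            links.append((i, j))
    return links

# Two class clusters plus one majority point intruding into the
# minority cluster; only the intruder forms a Tomek link.
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (4.95, 5.0)]
lbl = ["maj", "maj", "min", "min", "maj"]
links = tomek_links(pts, lbl)
```

A cleaning pass would then delete one or both members of each detected pair (depending on whether the goal is undersampling or full overlap removal) and repeat until no links remain.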
Cluster-based sampling algorithms are particularly interesting because they provide an added element of flexibility that is not available in most simple and synthetic sampling algorithms, and accordingly they can be tailored to target very specific problems. The cluster-based oversampling (CBO) algorithm was proposed to deal effectively with the within-class imbalance problem in tandem with the between-class imbalance problem. The CBO algorithm makes use of the K-means clustering technique. Empirical results for CBO are very suggestive of the nature of the imbalanced learning problem; namely, that targeting within-class imbalance in tandem with between-class imbalance is an effective strategy for imbalanced datasets.
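The within-class step of this idea can be sketched as follows: cluster one class's examples, then replicate within each cluster until every cluster matches the largest one. This is a simplified sketch, not the full CBO procedure (which also balances between classes afterward); the toy K-means, deterministic initialization, and data are our own.

```python
import random

def sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_assign(points, k, iters=20):
    """Toy K-means (deterministically seeded from the first k points);
    returns a cluster index for every point."""
    centers = list(points[:k])
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sq(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign

def cluster_oversample(points, k, rng):
    """Replicate examples within each cluster up to the largest cluster's
    size, addressing within-class imbalance among the sub-concepts."""
    assign = kmeans_assign(points, k)
    clusters = [[p for p, a in zip(points, assign) if a == c] for c in range(k)]
    target = max(len(c) for c in clusters)
    out = []
    for members in clusters:
        out.extend(members)
        out.extend(rng.choice(members) for _ in range(target - len(members)))
    return out

# A minority class with two sub-concepts of unequal size (4 vs 2 points).
minority = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (0.0, 0.2),
            (5.0, 5.0), (5.1, 5.0)]
balanced = cluster_oversample(minority, k=2, rng=random.Random(0))
```

After this step the smaller sub-concept is represented as strongly as the larger one, so a subsequent between-class balancing pass does not drown it out.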
The integration of sampling strategies with ensemble learning techniques has also been studied in the community. For instance, the SMOTEBoost algorithm is based on the idea of integrating SMOTE with AdaBoost.M2. Specifically, SMOTEBoost introduces synthetic sampling at each boosting iteration. In this way, each successive classifier in the ensemble focuses more on the minority class. Since each classifier is built on a different sampling of the data, the final voted classifier is expected to have a broadened and well-defined decision region for the minority class. Another integrated approach, the DataBoost-IM method, combines data generation techniques with AdaBoost.M1 to achieve high predictive accuracy for the minority class without sacrificing accuracy on the majority class. Briefly, DataBoost-IM generates synthetic samples according to the ratio of difficult-to-learn samples between classes. RAMOBoost adaptively ranks minority class instances at each learning iteration according to a sampling probability distribution based on the underlying data distribution, and can adaptively shift the decision boundary toward difficult-to-learn minority and majority class instances by using a hypothesis assessment procedure.
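A heavily simplified sketch of the SMOTEBoost idea follows. It is not the published algorithm: we use 1-D features, decision stumps as weak learners, an AdaBoost.M1-style weight update instead of AdaBoost.M2's pseudo-loss, plain interpolation between random minority pairs in place of full K-nearest-neighbor SMOTE, and unnormalized weights for the synthetic points. It only illustrates the loop structure: synthesize minority data, fit a weak learner on the augmented set, then update weights on the original set.

```python
import math
import random

def stump_fit(xs, ys, w):
    """Best 1-D threshold stump under weights w; returns (thr, sign),
    predicting sign for x >= thr and -sign otherwise."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if (sign if xi >= thr else -sign) != yi)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best[1], best[2]

def stump_predict(thr, sign, x):
    return sign if x >= thr else -sign

def smoteboost_sketch(xs, ys, rounds, n_syn, rng):
    """Boosting loop with synthetic minority (y = +1) points generated
    by interpolating between random minority pairs each round."""
    ensemble = []
    w = [1.0 / len(xs)] * len(xs)
    minority = [x for x, y in zip(xs, ys) if y == 1]
    for _ in range(rounds):
        syn = [a + rng.random() * (b - a)
               for a, b in (rng.sample(minority, 2) for _ in range(n_syn))]
        ax, ay = xs + syn, ys + [1] * n_syn
        aw = w + [1.0 / len(xs)] * n_syn       # simplification: flat weights
        thr, sign = stump_fit(ax, ay, aw)      # fit on augmented data
        err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                  if stump_predict(thr, sign, xi) != yi)
        err = max(err, 1e-10)                  # avoid log(1/0) on perfect fits
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, sign))
        # reweight and renormalize on the ORIGINAL examples only
        w = [wi * math.exp(-alpha * yi * stump_predict(thr, sign, xi))
             for xi, yi, wi in zip(xs, ys, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    return 1 if sum(a * stump_predict(t, s, x) for a, t, s in ensemble) >= 0 else -1

rng = random.Random(3)
xs = [0.0, 0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.48, 0.5, 0.9, 1.0]
ys = [-1] * 10 + [1, 1]                        # 10 majority vs 2 minority
ens = smoteboost_sketch(xs, ys, rounds=5, n_syn=8, rng=rng)
```

The synthetic points thicken the minority region seen by each weak learner, which is what broadens the voted ensemble's decision region for the minority class.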