
Class Imbalance Learning

Solutions to the problem of Class Imbalance Learning

When standard learning algorithms are applied to imbalanced data, the induction rules that describe the minority concepts are often fewer and weaker than those of majority concepts, since the minority class is often both outnumbered and underrepresented. To provide a concrete understanding of the direct effects of the imbalanced learning problem on standard learning algorithms, let's consider the popular decision tree learning algorithm, where imbalanced datasets expose inadequacies in the splitting criterion at each node.

Decision trees use a recursive, top-down greedy search algorithm that uses a feature selection scheme (e.g., information gain) to select the best feature as the split criterion at each node of the tree; a successor (leaf) is then created for each of the possible values corresponding to the split feature. As a result, the training set is successively partitioned into smaller subsets that are ultimately used to form disjoint rules pertaining to class concepts. These rules are finally combined so that the final hypothesis minimizes the total error rate across each class. The problem with this procedure in the presence of imbalanced data is twofold. First, successive partitioning of the data space results in fewer and fewer observations of minority class examples, resulting in fewer leaves describing minority concepts and successively weaker confidence estimates. Second, concepts that have dependencies on different feature space conjunctions can go unlearned because of the sparseness introduced through partitioning. Here, the first issue correlates with the problems of relative and absolute imbalance, while the second issue best correlates with between-class imbalance and the problem of high dimensionality. In both cases, the effects of imbalanced data on decision tree classification performance are detrimental. In the following sections, we evaluate the solutions proposed to overcome the effects of imbalanced data.
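To make the splitting mechanics concrete, the following is a minimal sketch (our illustration, not any particular library's implementation) of the entropy-based information gain used to choose a split at a tree node; the 990/10 class counts are purely illustrative.

```python
# A minimal sketch of entropy-based information gain at a decision tree node.
import numpy as np

def entropy(labels):
    # Shannon entropy (in bits) of the class distribution at a node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, split_mask):
    # Reduction in entropy when a node is split into the subsets
    # selected and rejected by the boolean split_mask.
    n = len(labels)
    left, right = labels[split_mask], labels[~split_mask]
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children

# With 990 majority and 10 minority examples, even a split that perfectly
# isolates every minority example yields a tiny absolute gain (about 0.08
# bits), so it is easily outranked by splits that refine the majority concept.
y = np.array([0] * 990 + [1] * 10)
perfect_minority_split = np.array([False] * 990 + [True] * 10)
print(information_gain(y, perfect_minority_split))
```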

A battery of methods to address the class imbalance condition is available in the literature [10] [11]. Typically, the methods are categorized into sampling methods, cost-sensitive methods, kernel methods and active learning methods.

Sampling methods

The mechanics of random oversampling follow naturally from its description: a set of randomly selected minority examples is replicated and the copies are added to the original dataset. In this way, the number of total examples is increased and the class distribution balance is adjusted accordingly; this provides a mechanism for varying the degree of class distribution balance to any desired level. While oversampling appends data to the original dataset, random undersampling removes data from the original dataset. At first glance, the oversampling and undersampling methods appear to be functionally equivalent, since they both alter the size of the original dataset and can actually provide the same proportion of balance. However, this commonality is only superficial; each method introduces its own set of problematic consequences that can potentially hinder learning [12] [13]. In the case of undersampling, the problem is relatively obvious: removing examples from the majority class may cause the classifier to miss important concepts pertaining to the majority class. With regard to oversampling, the problem is a little more opaque: since oversampling simply appends replicated data to the original dataset, multiple instances of certain examples become "tied," leading to overfitting [12]. In particular, overfitting in oversampling occurs when classifiers produce multiple clauses in a rule for multiple copies of the same example, which causes the rule to become too specific; although the training accuracy will be high in this scenario, the classification performance on unseen testing data is generally far worse [14].
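As a rough illustration of these two mechanics, the sketch below implements random oversampling and random undersampling with NumPy; the fully balanced 1:1 target and the function names are our own illustrative choices, not fixed parts of either method.

```python
# A minimal sketch of random oversampling and random undersampling.
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y, minority_label):
    # Replicate randomly chosen minority examples until both classes
    # are the same size, then append the copies to the original data.
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

def random_undersample(X, y, minority_label):
    # Discard randomly chosen majority examples until both classes
    # are the same size.
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    kept_majority = rng.choice(majority, size=len(minority), replace=False)
    keep = np.concatenate([minority, kept_majority])
    return X[keep], y[keep]
```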

Informed undersampling based on EasyEnsemble and BalanceCascade [15] was proposed to overcome the information loss introduced by the traditional random undersampling method. Another example of informed undersampling uses the K-nearest neighbor (KNN) classifier to achieve undersampling. Based on the characteristics of the given data distribution, four KNN undersampling methods were proposed in [16], namely NearMiss-1, NearMiss-2, NearMiss-3, and the "most distant" method. Another method, the one-sided selection (OSS) method [17], selects a representative subset of the majority class and combines it with the set of all minority examples to form a preliminary set, which is further refined by using data cleaning techniques.
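For concreteness, the following is a simplified sketch of the NearMiss-1 idea: keep the majority examples whose mean distance to their k closest minority examples is smallest. The value k = 3 and the Euclidean metric are illustrative assumptions, not prescriptions from [16].

```python
# A simplified, brute-force sketch of NearMiss-1-style informed undersampling.
import numpy as np

def nearmiss_1(X, y, minority_label, k=3):
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    X_min = X[minority]
    # Mean distance from each majority example to its k nearest minority examples.
    scores = np.array([
        np.sort(np.linalg.norm(X_min - X[i], axis=1))[:k].mean()
        for i in majority
    ])
    # Retain only as many majority examples as there are minority examples.
    kept_majority = majority[np.argsort(scores)[:len(minority)]]
    keep = np.concatenate([minority, kept_majority])
    return X[keep], y[keep]
```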

Synthetic sampling is a powerful method that has shown a great deal of success in various applications [18]. The SMOTE algorithm creates artificial data based on the feature-space similarities between existing minority examples. To create a synthetic sample around a minority point in vector space, randomly select one of its K nearest minority neighbors, multiply the corresponding feature vector difference by a random number in [0, 1], and finally add this vector to the minority point vector. Though it has many promising benefits, the SMOTE algorithm also has its drawbacks, including overgeneralization and variance [19], which are largely attributed to the way in which it creates synthetic samples. Specifically, SMOTE generates the same number of synthetic data samples for each original minority example and does so without consideration of neighboring examples, which increases the occurrence of overlap between classes. To overcome these issues, only selected subsets of the minority class are subjected to synthetic sample generation. Borderline-SMOTE [20] uses only the minority samples near the decision boundary to generate new synthetic samples. MWMOTE [21] identifies the hard-to-learn, informative minority class samples and assigns them weights according to their Euclidean distance from the nearest majority class samples; it then generates synthetic samples from the weighted informative minority class samples using a clustering approach. SCUT [22] oversamples minority class examples through the generation of synthetic examples and employs cluster analysis to undersample majority classes; in addition, it handles both within-class and between-class imbalance. Various adaptive sampling methods have also been proposed to overcome these limitations; some representative work includes Adaptive Synthetic Sampling (ADASYN) [23], ProWSyn [24], and R-SMOTE [25].
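The interpolation step at the heart of SMOTE can be sketched in a few lines. The neighborhood size k = 5 and one synthetic sample per minority point are illustrative defaults here, not fixed parts of the algorithm.

```python
# A minimal sketch of the SMOTE interpolation step described above.
import numpy as np

def smote(X_min, k=5, n_new_per_point=1, seed=0):
    rng = np.random.default_rng(seed)
    synthetic = []
    for x in X_min:
        # Indices of the k nearest minority neighbors of x (excluding x itself).
        distances = np.linalg.norm(X_min - x, axis=1)
        neighbors = np.argsort(distances)[1:k + 1]
        for _ in range(n_new_per_point):
            nn = X_min[rng.choice(neighbors)]
            gap = rng.random()                    # random number in [0, 1]
            synthetic.append(x + gap * (nn - x))  # point on the segment from x to nn
    return np.array(synthetic)
```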

Data cleaning techniques, such as Tomek links [26], have been effectively applied to remove the overlap that is introduced by sampling methods. If two instances form a Tomek link, then either one of them is noise or both are near a class border. Therefore, one can use Tomek links to "clean up" unwanted overlap between classes after synthetic sampling, removing Tomek links until all minimally distanced nearest neighbor pairs belong to the same class. By removing overlapping examples, one can establish well-defined class clusters in the training set, which can, in turn, lead to well-defined classification rules and improved classification performance. Some representative work in this area includes the OSS method [17], the integration of the condensed nearest neighbor (CNN) rule with Tomek links (CNN+Tomek links) [27], the neighborhood cleaning rule (NCL) [28], which builds on the edited nearest neighbor (ENN) rule (ENN removes examples whose class label differs from that of at least two of their three nearest neighbors), and the integrations of SMOTE with ENN (SMOTE+ENN) and SMOTE with Tomek links (SMOTE+Tomek) [28].
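A Tomek link can be detected with a simple mutual nearest neighbor test, as in the following brute-force sketch (an illustration, not an optimized implementation):

```python
# A brute-force sketch of Tomek-link detection: a pair of examples from
# different classes forms a Tomek link when each is the other's nearest neighbor.
import numpy as np

def tomek_links(X, y):
    # Pairwise Euclidean distances, with self-distances masked out.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)          # index of each example's nearest neighbor
    links = []
    for a in range(len(X)):
        b = nn[a]
        # Mutual nearest neighbors with different labels form a Tomek link;
        # the a < b check keeps each pair only once.
        if y[a] != y[b] and nn[b] == a and a < b:
            links.append((a, b))
    return links
```

Once the links are found, cleaning after synthetic sampling amounts to dropping the linked examples (or, for undersampling, only their majority-class members) from the training set.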

Cluster-based sampling algorithms are particularly interesting because they provide an added element of flexibility that is not available in most simple and synthetic sampling algorithms, and accordingly they can be tailored to target very specific problems. The cluster-based oversampling (CBO) algorithm [5] was proposed to deal with the within-class imbalance problem in tandem with the between-class imbalance problem; it makes use of the K-means clustering technique. Empirical results for CBO are very suggestive of the nature of the imbalanced learning problem, namely, that targeting within-class imbalance in tandem with between-class imbalance is an effective strategy for imbalanced datasets.
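The sketch below is a simplified illustration of the cluster-based idea rather than the exact CBO procedure of [5]: each class is clustered with K-means and every cluster is randomly oversampled to the size of that class's largest cluster, which removes within-class imbalance; the choice of three clusters per class is an illustrative assumption, and CBO additionally equalizes the totals of the two classes.

```python
# A simplified sketch of cluster-based oversampling with K-means.
import numpy as np
from sklearn.cluster import KMeans

def cluster_oversample(X, y, n_clusters=3, seed=0):
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [X], [y]
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        clusters = KMeans(n_clusters=n_clusters, n_init=10,
                          random_state=seed).fit_predict(X[idx])
        sizes = np.bincount(clusters, minlength=n_clusters)
        target = sizes.max()
        for c in range(n_clusters):
            # Replicate random members of this cluster up to the target size.
            members = idx[clusters == c]
            extra = rng.choice(members, size=target - sizes[c], replace=True)
            X_parts.append(X[extra])
            y_parts.append(y[extra])
    return np.concatenate(X_parts), np.concatenate(y_parts)
```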

The integration of sampling strategies with ensemble learning techniques has also been studied in the community. For instance, the SMOTEBoost [29] algorithm is based on the idea of integrating SMOTE with AdaBoost.M2. Specifically, SMOTEBoost introduces synthetic sampling at each boosting iteration. In this way, each successive classifier ensemble focuses more on the minority class. Since each classifier ensemble is built on a different sampling of the data, the final voted classifier is expected to have a broadened and well-defined decision region for the minority class. Another integrated approach, the DataBoost-IM [30] method, combines the data generation techniques introduced in [31] with AdaBoost.M1 to achieve high predictive accuracy for the minority class without sacrificing accuracy on the majority class. Briefly, DataBoost-IM generates synthetic samples according to the ratio of difficult-to-learn samples between classes. RAMOBoost [32] adaptively ranks minority class instances at each learning iteration according to a sampling probability distribution based on the underlying data distribution, and can adaptively shift the decision boundary toward difficult-to-learn minority and majority class instances by using a hypothesis assessment procedure.
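The following is a deliberately simplified sketch of the SMOTEBoost idea for a binary task: synthetic minority samples are injected at every boosting round before the weak learner is fit. It uses a plain AdaBoost-style weight update rather than AdaBoost.M2, a decision stump as the weak learner, and pairwise interpolation as a lightweight stand-in for full SMOTE; all of these choices (and the parameter values) are illustrative assumptions rather than the published algorithm.

```python
# A deliberately simplified sketch of boosting with per-round synthetic sampling.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def smoteboost_sketch(X, y, minority_label, n_rounds=10, n_synthetic=100, seed=0):
    rng = np.random.default_rng(seed)
    w = np.full(len(y), 1.0 / len(y))     # weights over the original examples
    X_min = X[y == minority_label]
    learners, alphas = [], []
    for _ in range(n_rounds):
        # Fresh synthetic minority points for this round: interpolate between
        # randomly chosen pairs of minority examples.
        i = rng.integers(0, len(X_min), n_synthetic)
        j = rng.integers(0, len(X_min), n_synthetic)
        X_syn = X_min[i] + rng.random((n_synthetic, 1)) * (X_min[j] - X_min[i])
        X_aug = np.vstack([X, X_syn])
        y_aug = np.concatenate([y, np.full(n_synthetic, minority_label)])
        w_aug = np.concatenate([w, np.full(n_synthetic, w.mean())])
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X_aug, y_aug, sample_weight=w_aug)
        # Error and reweighting use the original examples only, so the synthetic
        # points steer each round's learner but do not accumulate weight.
        pred = stump.predict(X)
        err = np.sum(w[pred != y])
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-10))
        w *= np.exp(np.where(pred != y, alpha, -alpha))
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas
```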
