RUSBoost [33] is a modification of AdaBoost.M1 for solving between-class imbalance problems by undersampling the majority class. It is reported [33] to perform better than SMOTEBoost [29], which addresses class imbalance by oversampling the minority class. At every AdaBoost iteration, RUSBoost randomly undersamples the majority class down to the population size of the minority class, as prescribed by the data sample distribution computed from the misclassification error and exponential loss estimates.
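The per-iteration resampling step can be sketched as follows. This is a hedged illustration of the idea, not the reference implementation from [33]: it keeps every minority example, draws an equally sized random subset of the majority class, and renormalizes the boosting weights of the retained examples.

```python
import numpy as np

def rus_step(X, y, weights, minority_label, rng=None):
    """One illustrative random-undersampling step, as performed inside
    each RUSBoost iteration: keep all minority examples, draw a random
    majority subset of the same size, and renormalize the boosting
    weights of the retained examples for this round."""
    rng = np.random.default_rng(rng)
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    # undersample the majority class to match the minority population
    kept_majority = rng.choice(majority, size=minority.size, replace=False)
    idx = np.concatenate([minority, kept_majority])
    w = weights[idx] / weights[idx].sum()  # temporary distribution for this round
    return X[idx], y[idx], w
```

A weak learner would then be trained on the balanced `(X[idx], y[idx], w)` triple before the usual AdaBoost weight update is applied to the full training set.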

Cost-Free Learning (CFL) [34] is a type of learning that does not require the cost terms associated with misclassifications and/or rejects as inputs. The goal of this type of learning is to obtain optimal classification results without using any cost information. Grounded in information theory, CFL seeks to maximize the normalized mutual information between the targets and the decision outputs of the classifiers, which may be binary or multi-class, with or without abstaining. Another advantage of the method is its ability to derive optimal rejection thresholds for abstaining classifications and the "equivalent" costs in binary classification, which can serve as a reference for cost adjustments in cost-sensitive learning.
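The objective CFL maximizes can be computed empirically. The sketch below estimates the normalized mutual information NI = I(T; Y) / H(T) between targets and decisions from label counts; normalizing by the target entropy H(T) is one common convention, assumed here for illustration.

```python
import numpy as np

def normalized_mutual_information(targets, decisions):
    """Empirical normalized mutual information between true labels T and
    classifier decisions Y: NI = I(T; Y) / H(T).  Illustrative sketch of
    the quantity CFL maximizes; other normalizations also appear in the
    literature."""
    t, y = np.asarray(targets), np.asarray(decisions)
    t_vals, y_vals = np.unique(t), np.unique(y)
    # joint and marginal empirical distributions
    p_ty = np.array([[np.mean((t == a) & (y == b)) for b in y_vals]
                     for a in t_vals])
    p_t, p_y = p_ty.sum(axis=1), p_ty.sum(axis=0)
    mask = p_ty > 0
    mi = np.sum(p_ty[mask] * np.log2(p_ty[mask] / np.outer(p_t, p_y)[mask]))
    h_t = -np.sum(p_t * np.log2(p_t))  # target entropy (all p_t > 0 by construction)
    return mi / h_t
```

Perfect predictions give NI = 1, while a classifier that always outputs the same label gives NI = 0, regardless of how imbalanced the classes are, which is what makes the measure attractive for imbalanced problems.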

An adaptive sampling method with optimal cost [35] has been proposed for class-imbalance learning: it adaptively oversamples the minority (positive) examples and undersamples the majority (negative) examples, forms sub-classifiers from different subsets of the training data with the best cost ratio chosen adaptively, and combines these sub-classifiers according to their accuracy to create a strong classifier. The sample weights are computed from the prediction probability of each sample, obtained from a pair of SVM classifiers induced on two equal-sized partitions of the training instances.

Weighted Extreme Learning Machines (WELM) [36], [37] were proposed as a generalized cost-sensitive learning method for imbalanced data distributions, in which a weight is assigned to every training instance according to the user's needs. Although per-sample weights are possible, the authors proposed using a common per-class weight derived from the class proportion for every sample of a class. They also proposed an alternative weighting scheme that uses the golden ratio^{[9]} in computing the common weights for the majority classes. An adaptive semi-unsupervised weighted oversampling (A-SUWO) method [38] has been proposed for imbalanced datasets; it clusters the minority instances using a semi-unsupervised hierarchical clustering approach and adaptively determines how much to oversample each sub-cluster using its classification complexity and cross-validation. The minority instances are weighted by their Euclidean distance to the majority class, and are oversampled accordingly.
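The two WELM weighting schemes can be sketched as below. This assumes the common reading of [36]: scheme W1 gives each sample the inverse of its class size, and scheme W2 additionally damps classes larger than the average class size by the golden-ratio factor 0.618; treat the exact thresholding as an assumption of this sketch.

```python
import numpy as np

def welm_weights(y, scheme="W1"):
    """Per-sample weights for Weighted ELM (hedged sketch).
    W1: weight 1/n_c for each sample of a class with n_c members.
    W2: as W1, but majority classes (size above average) are damped
        by the golden-ratio factor 0.618."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    avg = counts.mean()
    w = np.empty(y.size)
    for c, n in zip(classes, counts):
        factor = 0.618 if (scheme == "W2" and n > avg) else 1.0
        w[y == c] = factor / n
    return w
```

Either weight vector is then placed on the diagonal of the weight matrix in the regularized ELM output-weight solution, so minority samples contribute more to the training error.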

An inverse random undersampling [39] method has been proposed for class-imbalance learning, in which several distinct training sets are constructed by severely undersampling the majority class to sizes smaller than the minority class, which biases each learned decision boundary towards the minority class.
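The construction of these training sets can be sketched as follows; the number of sets and the 0.5 majority-to-minority size ratio are illustrative assumptions, not values from [39].

```python
import numpy as np

def inverse_rus_sets(X, y, minority_label, n_sets=5, majority_frac=0.5, rng=None):
    """Build several training sets in the spirit of inverse random
    undersampling: each set keeps all minority examples and a majority
    subsample *smaller* than the minority class, so every learned
    boundary is biased towards the minority class.  Hedged sketch."""
    rng = np.random.default_rng(rng)
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    size = max(1, int(majority_frac * minority.size))  # fewer majority than minority
    sets = []
    for _ in range(n_sets):
        kept = rng.choice(majority, size=size, replace=False)
        idx = np.concatenate([minority, kept])
        sets.append((X[idx], y[idx]))
    return sets
```

A classifier is trained on each set and the resulting decision boundaries are then fused, so the severe per-set bias is averaged out across the ensemble.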

**Cost-Sensitive methods** While sampling methods attempt to balance distributions by considering the representative proportions of class examples in the distribution, cost-sensitive learning methods consider the costs associated with misclassifying examples [40]. Instead of creating balanced data distributions through different sampling strategies, cost-sensitive learning targets the imbalanced learning problem by using different cost matrices that describe the costs for misclassifying any particular data example. Recent research indicates that there is a strong connection between cost-sensitive learning and learning from imbalanced data; therefore, the theoretical foundations and algorithms of cost-sensitive methods can be naturally applied to imbalanced learning problems [41], [15].

There are many different ways of implementing cost-sensitive learning, but, in general, the majority of techniques fall under three categories. The first class of techniques applies misclassification costs to the dataset as a form of data space weighting; these techniques are essentially cost-sensitive bootstrap sampling approaches where misclassification costs are used to select the best training distribution for induction. The second class applies cost-minimizing techniques to the combination schemes of ensemble methods; this class consists of various meta-techniques where standard learning algorithms are integrated with ensemble methods to develop cost-sensitive classifiers. Both of these classes have rich theoretical foundations that justify their approaches, with cost-sensitive data space weighting methods building on the translation theorem [42], and cost-sensitive meta-techniques building on the MetaCost framework [43]. In fact, many existing research works integrate the MetaCost framework with data space weighting and adaptive boosting to achieve stronger classification results. To this end, we consider both of these classes of algorithms as one in the following section. The last class of techniques incorporates cost-sensitive functions or features directly into classification paradigms to essentially "fit" the cost-sensitive framework into these classifiers. Because many of these techniques are specific to a particular paradigm, there is no unifying framework for this class of cost-sensitive learning, but in many cases, solutions that work for one paradigm can often be abstracted to work for others.
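The data-space-weighting idea behind the first class can be illustrated with cost-proportionate rejection sampling, which realizes the translation theorem by accepting each example with probability proportional to its misclassification cost; training an ordinary (cost-blind) learner on the accepted subset then approximates cost-sensitive learning on the original distribution. A minimal sketch, not code from [42]:

```python
import random

def cost_proportionate_sample(examples, costs, rng=None):
    """Cost-proportionate rejection sampling (hedged sketch): accept
    each example with probability cost / max_cost, so costly examples
    are over-represented in the induced training distribution."""
    rng = rng or random.Random()
    z = max(costs)  # normalizer so acceptance probabilities lie in [0, 1]
    return [ex for ex, c in zip(examples, costs) if rng.random() < c / z]
```

In practice the procedure is repeated to form several samples, and the classifiers trained on them are averaged to reduce the variance the rejection step introduces.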

Motivated by the pioneering work of the AdaBoost algorithms [44], several cost-sensitive boosting methods for imbalanced learning have been proposed. Three cost-sensitive boosting methods, AdaC1, AdaC2, and AdaC3, were proposed in [45], which introduce cost items into the weight-updating strategy of AdaBoost. In essence, these algorithms increase the probability of sampling a costly example at each iteration, giving the classifier more instances of costly examples for a more targeted approach to induction. In general, it was observed that the inclusion of cost factors into the weighting scheme of AdaBoost imposes a bias towards the minority concepts and also increases the use of more relevant data samples in each hypothesis, providing a more robust form of classification. Another cost-sensitive boosting algorithm that follows a similar methodology is AdaCost [46]. AdaCost, like AdaC1, introduces cost sensitivity inside the exponent of the weight-update formula of AdaBoost. However, instead of applying the cost items directly, AdaCost uses a cost-adjustment function that aggressively increases the weights of costly misclassifications and conservatively decreases the weights of high-cost examples that are correctly classified.
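The three variants differ only in where the cost item $c_i$ enters the AdaBoost weight update. On our reading of [45], $c_i$ enters inside the exponent for AdaC1, outside it for AdaC2, and in both places for AdaC3:

```latex
\begin{align*}
D_{t+1}(i) &= \frac{D_t(i)\,\exp\!\bigl(-\alpha_t\, c_i\, y_i\, h_t(x_i)\bigr)}{Z_t} && \text{(AdaC1)}\\
D_{t+1}(i) &= \frac{c_i\, D_t(i)\,\exp\!\bigl(-\alpha_t\, y_i\, h_t(x_i)\bigr)}{Z_t} && \text{(AdaC2)}\\
D_{t+1}(i) &= \frac{c_i\, D_t(i)\,\exp\!\bigl(-\alpha_t\, c_i\, y_i\, h_t(x_i)\bigr)}{Z_t} && \text{(AdaC3)}
\end{align*}
```

Here $D_t(i)$ is the sample distribution at iteration $t$, $h_t$ the weak hypothesis, $\alpha_t$ its vote weight, and $Z_t$ the normalizer; the placement of $c_i$ also changes how $\alpha_t$ is computed in each variant.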

Though these cost-sensitive algorithms can significantly improve classification performance, they take for granted the availability of a cost matrix and its associated cost items. In many situations, an explicit description of misclassification costs is unknown, i.e., only an informal assertion is known, such as that misclassifications on the positive class are more expensive than those on the negative class [47]. With respect to cost-sensitive decision trees, cost-sensitive fitting can take three forms: first, cost-sensitive adjustments can be applied to the decision threshold; second, cost-sensitive considerations can be given to the split criteria at each node; and lastly, cost-sensitive pruning schemes can be applied to the tree. A decision tree threshold-moving scheme for imbalanced data with unknown misclassification costs was studied in [48]. When considering cost sensitivity in the split criterion, the task at hand is to fit an impurity function that is insensitive to unequal costs. For instance, traditionally, accuracy is used as the impurity function for decision trees, which chooses the split with minimal error at each node. However, this metric is sensitive to changes in sample distributions, and thus inherently sensitive to unequal costs. In [49], three specific impurity functions, Gini, Entropy, and DKM, were shown to have improved cost insensitivity compared with the accuracy/error rate baseline.
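When the cost matrix *is* known, threshold moving reduces to a one-line rule: with zero cost for correct decisions, the Bayes-optimal cutoff on the estimated positive-class probability depends only on the false-positive and false-negative costs. A sketch, assuming a 2x2 cost matrix with zero diagonal:

```python
def cost_sensitive_threshold(c_fp, c_fn):
    """Bayes-optimal decision threshold implied by a 2x2 cost matrix
    with zero cost for correct decisions: predict positive when
    P(pos | x) >= c_fp / (c_fp + c_fn)."""
    return c_fp / (c_fp + c_fn)

def predict(prob_pos, c_fp, c_fn):
    """Label an example positive when its estimated positive-class
    probability clears the cost-derived threshold."""
    return int(prob_pos >= cost_sensitive_threshold(c_fp, c_fn))
```

With equal costs the threshold is the familiar 0.5; making false negatives nine times costlier drops it to 0.1, so far more borderline examples are labeled positive.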

Cost sensitivity can be introduced to neural networks in four ways [50]: first, cost-sensitive modifications can be applied to the probabilistic estimate; second, the neural network outputs can be made cost-sensitive; third, cost-sensitive modifications can be applied to the learning rate η; and fourth, the error minimization function can be adapted to account for expected costs. A ROC-based method for determining the cost is proposed in [51], which allows selecting the most efficient cost factor for a given dataset. A neural-network-based cost-sensitive method proposed in [36] uses Weighted Extreme Learning Machines to solve class imbalance problems. The authors also showed that, by assigning different weights to each training sample, WELM can be generalized into a cost-sensitive learning method.
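The fourth option, adapting the error-minimization function, can be illustrated by scaling each example's loss by the misclassification cost of its true class. The weights below are hypothetical costs for illustration, not values from [50]:

```python
import numpy as np

def weighted_cross_entropy(p, y, class_weight):
    """Class-weighted binary cross-entropy (hedged sketch): each
    example's loss term is scaled by the cost assigned to its true
    class, so errors on the costly (minority) class dominate the
    gradient during training."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)  # numerical safety
    y = np.asarray(y, dtype=float)
    w = np.where(y == 1, class_weight[1], class_weight[0])
    return float(np.mean(-w * (y * np.log(p) + (1 - y) * np.log(1 - p))))
```

With equal class weights this reduces to ordinary cross-entropy; raising the positive-class weight scales that class's contribution proportionally.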

**Kernel-methods** The SMOTE with Different Costs (SDC) method [52] and the ensembles of over/undersampled SVMs [53], [54], [55], [56] combine kernel methods and sampling to solve the imbalance problem. The SDC algorithm uses different error costs [52] for different classes to bias the SVM, shifting the decision boundary away from the positive instances and making the positive instances more densely distributed, in an attempt to guarantee a more well-defined boundary. Meanwhile, the methods proposed in [54], [55] develop ensemble systems by modifying the data distributions without modifying the underlying SVM classifier.
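The class-dependent costs in SDC amount to penalizing margin violations differently per class in the SVM objective. The sketch below shows only that objective term with illustrative cost values, not the full solver from [52]:

```python
import numpy as np

def weighted_hinge_loss(scores, y, c_pos, c_neg):
    """Class-dependent SVM hinge loss (hedged sketch of the SDC idea):
    margin violations by positive examples are penalized by c_pos and
    those by negative examples by c_neg, which biases the learned
    boundary away from the (minority) positive class when c_pos > c_neg.
    Labels are assumed to be in {+1, -1}."""
    y = np.asarray(y, dtype=float)
    margins = np.maximum(0.0, 1.0 - y * np.asarray(scores, dtype=float))
    costs = np.where(y > 0, c_pos, c_neg)
    return float(np.sum(costs * margins))
```

An SVM trained against this loss (plus the usual regularizer) trades a few extra negative-side errors for a boundary that sits further from the positives.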

The Granular Support Vector Machines with Repetitive Undersampling algorithm (GSVM-RU) was proposed in [57] to integrate SVM learning with undersampling methods. The major characteristics of GSVMs are two-fold. First, GSVMs can effectively analyze the inherent data distribution by observing the trade-offs between the local significance of a subset of data and its global correlation. Second, GSVMs improve the computational efficiency of SVMs through the use of parallel computation. In the context of imbalanced learning, the GSVM-RU method takes advantage of the GSVM by using an iterative learning procedure that uses the SVM itself for undersampling.

A kernel-boundary-alignment algorithm is proposed in [58], which treats training-data imbalance as prior information to augment SVMs and improve class-prediction accuracy. Using a simple example, the authors first show that SVMs can suffer from a high incidence of false negatives when the training instances of the target class are heavily outnumbered by the training instances of a non-target class. The remedy they propose is to adjust the class boundary by modifying the kernel matrix according to the imbalanced data distribution.

One example of kernel modification is the kernel classifier construction algorithm proposed in [59], based on orthogonal forward selection (OFS) and the regularized orthogonal weighted least squares (ROWLS) estimator. This algorithm optimizes generalization in the kernel-based learning model by introducing two major components that deal with imbalanced data distributions for two-class datasets. The first component integrates the concepts of leave-one-out (LOO) cross-validation and the area under the ROC curve (AUC) evaluation metric to develop an LOO-AUC objective function as a selection mechanism for the optimal kernel model. The second component takes advantage of the cost sensitivity of the parameter-estimation cost function in the ROWLS algorithm to assign greater weights to erroneous data examples in the minority class than to those in the majority class.

^{[9]}https://en.wikipedia.org/wiki/Golden_ratio
