Foundations of Imbalanced Learning
Any dataset that exhibits an unequal distribution among its classes can be considered imbalanced. However, the common understanding in the community is that imbalanced data correspond to datasets exhibiting significant, and in some cases extreme, imbalances. Specifically, this form of imbalance is referred to as a between-class imbalance; between-class imbalances on the order of 100:1, 1000:1, and 10000:1 are not uncommon, and in each case one class severely outrepresents another. To highlight the implications of the imbalanced learning problem in the real world, consider the problem of classifying a visitor to an online retail portal as a buyer or a non-buyer, where the ratio of buyers to non-buyers is typically 1:1000 or more skewed. In practice, we find that classifiers trained on such data tend to provide a severely imbalanced degree of accuracy, with the majority class having close to 100 percent accuracy and the minority class having accuracies of 0-5 percent. We require a classifier that provides high accuracy for the minority class without severely jeopardizing the accuracy of the majority class.
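The gap between overall and per-class accuracy can be made concrete with a small sketch. The data below are hypothetical (10 buyers among 10,000 non-buyers), and the "classifier" is a degenerate one that always predicts the majority class; the function name `per_class_accuracy` is introduced here for illustration only.

```python
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    """Return (overall accuracy, {label: accuracy restricted to that label})."""
    correct = Counter()
    total = Counter()
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    overall = sum(correct.values()) / len(y_true)
    by_class = {label: correct[label] / total[label] for label in total}
    return overall, by_class

# Hypothetical data: roughly 1 buyer per 1000 non-buyers.
y_true = ["buyer"] * 10 + ["non-buyer"] * 10_000
# Degenerate classifier: always predict the majority class.
y_pred = ["non-buyer"] * len(y_true)

overall, by_class = per_class_accuracy(y_true, y_pred)
print(f"overall accuracy: {overall:.4f}")  # 0.9990
print(by_class)
```

The overall accuracy looks excellent even though no buyer is ever identified, which is why the per-class figures matter more than the aggregate one.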
Intrinsic imbalance is a direct result of the nature of the data space. However, imbalanced data are not restricted to the intrinsic variety. Variable factors such as time and storage also give rise to imbalanced datasets. Imbalances of this type are considered extrinsic, i.e., the imbalance is not directly related to the nature of the data space. Extrinsic imbalances are just as interesting as their intrinsic counterparts, since the data space from which an extrinsically imbalanced dataset is obtained may not be imbalanced at all. For instance, suppose a dataset is procured from a continuous stream of balanced data over a specific interval of time. If the transmission suffers sporadic interruptions during this interval, so that some data are not transmitted, the acquired dataset may be imbalanced; it would then be an extrinsically imbalanced dataset obtained from a balanced data space. In addition to intrinsic and extrinsic imbalance, it is important to understand the difference between relative imbalance and imbalance due to rare instances (or “absolute rarity”).
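The interrupted-stream scenario can be sketched as follows. This is an illustration under stated assumptions, not a model of any particular system: the stream is assumed to be bursty (alternating runs of class "A" and class "B" of equal length), and the outage windows are hypothetical values chosen to coincide with runs of "B".

```python
from collections import Counter
from itertools import cycle, islice

def bursty_balanced_stream(n_items, run_length):
    """Return (index, label) pairs in alternating runs of equal length."""
    labels = cycle(["A"] * run_length + ["B"] * run_length)
    return list(enumerate(islice(labels, n_items)))

def acquire(stream, outages):
    """Keep labels whose index lies outside every outage window [start, stop)."""
    return [
        label
        for index, label in stream
        if not any(start <= index < stop for start, stop in outages)
    ]

source = bursty_balanced_stream(n_items=1000, run_length=100)
outages = [(100, 180), (300, 390)]  # hypothetical transmission interruptions

source_counts = Counter(label for _, label in source)
acquired_counts = Counter(acquire(source, outages))
print("source:  ", dict(source_counts))
print("acquired:", dict(acquired_counts))
```

The source is balanced (500 of each class), but because both outages fall inside runs of "B", the acquired dataset holds 500 "A" and only 330 "B" examples. The imbalance comes from the acquisition process, not from the data space.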
Data complexity is a broad term that comprises issues such as overlapping, lack of representative data, small disjuncts, and others. As a simple example, consider the distributions depicted in Figure 1, where the stars and circles represent the minority and majority classes, respectively. By inspection, we see that the distributions in both Figure 1a and Figure 1b exhibit relative imbalances. However, Figure 1a has no overlapping examples between its classes and has only one concept pertaining to each class, whereas Figure 1b has both multiple concepts and severe overlapping. Also of interest is subconcept C in the distribution of Figure 1b. This concept might go unlearned by some inducers due to its lack of representative data; this issue embodies imbalances due to rare instances, which we proceed to explore. Imbalance due to rare instances is representative of domains where minority class examples are very limited, i.e., where the target concept is rare. In such a situation, the lack of representative data will make learning difficult regardless of the between-class imbalance. Furthermore, the minority concept may additionally contain a subconcept with limited instances, amounting to diverging degrees of classification difficulty. This, in fact, is the result of another form of imbalance called within-class imbalance, which concerns the distribution of representative data for subconcepts within a class. These ideas are again highlighted in our simplified example in Figure 1. In Figure 1b, cluster B represents the dominant minority class concept and cluster C represents a subconcept of the minority class. Cluster D represents two subconcepts of the majority class, and cluster A (anything not enclosed) represents the dominant majority class concept.
For both classes, the examples in the dominant clusters significantly outnumber the examples in their respective subconcept clusters, so this data space exhibits both within-class and between-class imbalances. Moreover, if we completely remove the examples in cluster B, the data space would have a homogeneous minority class concept (cluster C) that is easily identified but can go unlearned due to its severe underrepresentation.
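The two kinds of imbalance can be quantified with simple ratios. The sketch below uses the cluster names from Figure 1b, but the cluster sizes are hypothetical numbers chosen for illustration; they are not read from the figure, and the helper names are assumptions.

```python
# Hypothetical cluster sizes; only the cluster names follow Figure 1b.
cluster_sizes = {
    "majority": {"A": 900, "D": 60},
    "minority": {"B": 35, "C": 5},
}

def between_class_ratio(sizes):
    """Ratio of the largest class total to the smallest class total."""
    totals = [sum(clusters.values()) for clusters in sizes.values()]
    return max(totals) / min(totals)

def within_class_ratios(sizes):
    """For each class, ratio of its largest cluster to its smallest cluster."""
    return {
        cls: max(clusters.values()) / min(clusters.values())
        for cls, clusters in sizes.items()
    }

between = between_class_ratio(cluster_sizes)  # 960 / 40
within = within_class_ratios(cluster_sizes)   # 900 / 60 and 35 / 5
print(f"between-class ratio: {between:.1f}")
print("within-class ratios:", within)

# Removing cluster B leaves a homogeneous but severely underrepresented minority.
without_b = {"majority": cluster_sizes["majority"], "minority": {"C": 5}}
between_without_b = between_class_ratio(without_b)  # 960 / 5
print(f"between-class ratio without B: {between_without_b:.1f}")
```

With these numbers the minority class becomes a single concept once B is removed, yet the between-class ratio rises from 24:1 to 192:1, which illustrates why an easily identified concept can still go unlearned.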
The existence of within-class imbalances is closely intertwined with the problem of small disjuncts, which has been shown to greatly degrade classification performance. Briefly, the problem of small disjuncts can be understood as follows: a classifier will attempt to learn a concept by creating multiple disjunct rules that describe the main concept. In the case of homogeneous concepts, the classifier will generally create large disjuncts, i.e., rules that cover a large portion (cluster) of examples pertaining to the main concept. However, in the case of heterogeneous concepts, small disjuncts, i.e., rules that cover a small cluster of examples pertaining to the main concept, arise as a direct result of underrepresented subconcepts. Moreover, since classifiers attempt to learn both majority and minority concepts, the problem of small disjuncts is not restricted to the minority concept. On the contrary, small disjuncts of the majority class can arise from noisy misclassified minority class examples or underrepresented subconcepts. However, because of the vast representation of majority class data, this occurrence is infrequent. A more common scenario is that noise may influence disjuncts in the minority class. In this case, the validity of the clusters corresponding to the small disjuncts becomes an important issue, i.e., whether these examples represent an actual subconcept or are merely attributed to noise. For example, suppose that in Figure 1b a classifier generates a disjunct for each of the two noisy minority samples in cluster A. These would be illegitimate disjuncts attributed to noise, in contrast to cluster C, for example, which is a legitimate cluster formed from a severely underrepresented subconcept.
The last issue to consider is the combination of imbalanced data and the small sample size problem. In many of today’s data analysis and knowledge discovery applications, it is often unavoidable to have data with high dimensionality and small sample size; some specific examples include face recognition and gene expression data analysis, among others. Traditionally, the small sample size problem has been studied extensively in the pattern recognition community. Dimensionality reduction methods have been widely adopted to handle this issue, e.g., principal component analysis (PCA) and various extension methods. However, when the concepts in the representative datasets exhibit imbalances of the forms described earlier, the combination of imbalanced data and small sample size presents a new challenge to the community. In this situation, two critical issues arise simultaneously. First, since the sample size is small, all of the issues related to absolute rarity and within-class imbalances are applicable. Second and more importantly, learning algorithms often fail to generalize inductive rules over the sample space when presented with this form of imbalance. In this case, the combination of small sample size and high dimensionality hinders learning because of the difficulty involved in forming conjunctions over the large number of features with limited samples. If the sample space is sufficiently large, a set of general (albeit complex) inductive rules can be defined for the data space. However, when samples are limited, the rules formed can become too specific, leading to overfitting.
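The overfitting risk in the small-sample, high-dimensional regime can be illustrated with a rough sketch. Here every feature is pure noise (random bits), so any feature that reproduces the training labels exactly is a spurious "rule". The sample sizes, feature count, and function name are assumptions chosen for illustration.

```python
import random

def count_spurious_features(n_samples, n_features, seed=0):
    """Count pure-noise binary features that match the training labels exactly."""
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_samples)]
    matches = 0
    for _ in range(n_features):
        feature = [rng.randint(0, 1) for _ in range(n_samples)]
        if feature == labels:
            matches += 1
    return matches

# With 6 samples, each noise feature matches with probability 2**-6,
# so about 2000 / 64 (roughly 31) matches are expected among 2000 features.
small_sample = count_spurious_features(n_samples=6, n_features=2000)
# With 200 samples, a chance match has probability 2**-200 per feature.
large_sample = count_spurious_features(n_samples=200, n_features=2000)

print("spurious perfect rules, 6 samples:  ", small_sample)
print("spurious perfect rules, 200 samples:", large_sample)
```

A learner given only the six samples has no way to tell these chance matches from a genuine rule, and the situation is worse still for a minority class that contributes only a fraction of those samples.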
Furthermore, these observations also suggest that the conventional evaluation practice of using a single assessment criterion, such as overall accuracy or error rate, does not provide adequate information in the case of imbalanced learning. Therefore, more informative assessment metrics, such as receiver operating characteristic (ROC) curves, precision-recall curves, and cost curves, are necessary for conclusive evaluations of performance in the presence of imbalanced data.
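As a minimal sketch of what such metrics capture, the code below computes accuracy, precision, recall, and the area under the ROC curve for a small hypothetical set of scores. The AUC is computed through its rank interpretation (the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half). The data, the 0.5 threshold, and the function names are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def precision_recall(tp, fp, fn):
    """Precision and recall, defined as 0.0 when the denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def roc_auc(y_true, scores, positive=1):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores: 2 minority (positive) and 8 majority examples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision, recall = precision_recall(tp, fp, fn)
auc = roc_auc(y_true, scores)
print(f"accuracy={accuracy}, precision={precision}, recall={recall}, auc={auc}")
```

Here the accuracy is 0.8 while precision and recall on the minority class are both 0.5, and the AUC of 0.9375 summarizes ranking quality independently of the chosen threshold.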