
Class Imbalance Learning

SSL Methods

A semi-supervised classification method for prognosis of acute-on-chronic liver failure (ACLF) was proposed in 2015 [60], where the authors constructed an imbalanced prediction model based on the small sphere and large margin approach (SSLM) [61], which separates the two classes of samples (improved patients and deceased patients) by maximizing the margin between them. SSLM was shown to perform better than the One-Class SVM and Support Vector Data Description (SVDD) methods. The authors also experimented with a semi-supervised Twin SVM [62] by adding unlabeled patients to the dataset.
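
To make the geometry concrete, here is a schematic form of the SSLM idea (not the exact formulation of [61]; the margin term and the scaling of the slack variables differ there): the target class is enclosed in a small sphere of radius R centred at c, while samples of the other class are pushed outside the sphere by a margin ρ,

$$\min_{R,\,c,\,\rho,\,\xi}\;\; R^2 \;-\; \nu\rho \;+\; C_1\sum_i \xi_i \;+\; C_2\sum_j \xi_j$$

$$\text{s.t.}\quad \|\phi(x_i)-c\|^2 \le R^2 + \xi_i \;\;\text{(improved patients, inside the sphere)},\qquad \|\phi(x_j)-c\|^2 \ge R^2 + \rho - \xi_j \;\;\text{(deceased patients, outside with margin }\rho\text{)},\qquad \xi_i,\xi_j \ge 0.$$

Dropping the margin ρ and the second set of constraints recovers an SVDD-style one-class problem, which is why SSLM can exploit the second class to sharpen the boundary that One-Class SVM and SVDD ignore.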

Transductive graph-based semi-supervised learning methods usually build an undirected graph using both labeled and unlabeled samples as vertices. These methods propagate the label information of labeled samples to their neighbors through the graph edges in order to predict the labels of the unlabeled samples. Most popular semi-supervised learning approaches are sensitive to the initial label distribution, and in an imbalanced labeled dataset the class boundary becomes severely skewed towards the majority classes. In [63], the authors propose a simple and effective approach to alleviate this unfavorable influence: iteratively select a few unlabeled samples and add them to the minority classes, so that the subsequent learning method starts from a balanced labeled dataset. Experiments on UCI datasets [64] and the MNIST handwritten digits dataset [65] showed that the proposed approach outperforms existing state-of-the-art methods.
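
A minimal sketch of this balancing idea (my reconstruction, not the code of [63]): before the final propagation, a few unlabeled samples that the current graph assigns most confidently to the minority class are promoted into the labeled set, until the labeled class counts are even. The use of scikit-learn's LabelSpreading as the graph-based learner and the per_round/n_neighbors settings are assumptions for illustration.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading  # assumed graph-based SSL backend


def balance_then_propagate(X, y, per_round=5, max_rounds=20):
    """y uses -1 for unlabeled samples and 0..K-1 for labeled classes."""
    y = np.asarray(y).copy()
    for _ in range(max_rounds):
        classes, counts = np.unique(y[y >= 0], return_counts=True)
        if counts.min() == counts.max():              # labeled set is already balanced
            break
        model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y)
        proba = model.label_distributions_            # class distribution per sample
        minority = classes[counts.argmin()]
        col = list(model.classes_).index(minority)
        unlabeled_idx = np.where(y == -1)[0]
        # promote the unlabeled samples the graph most confidently assigns
        # to the current minority class
        best = unlabeled_idx[np.argsort(-proba[unlabeled_idx, col])][:per_round]
        y[best] = minority
    # final graph-based propagation on the (now balanced) labeled set
    return LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y).transduction_
```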

The SSL method in [66] uses a transductive learning approach that builds upon a graph-based phase field model [67] to handle imbalanced class distributions. The method can encourage or penalize the membership of data points in different classes according to an explicit a priori model, which avoids biased classifications. Experiments conducted on real-world benchmarks show that the model performs better than several state-of-the-art semi-supervised learning algorithms.

Predicting splice sites in a genome with a semi-supervised learning approach [68] is challenging because of the highly imbalanced distribution of the data, i.e., the small number of splice sites compared to the number of non-splice sites. To address this challenge, the authors propose ensembles of semi-supervised classifiers, specifically self-training and co-training classifiers. Experiments on five highly imbalanced splice-site datasets, with positive-to-negative ratios of 1-to-99, showed that ensemble-based semi-supervised approaches are a good choice even when the labeled data amount to less than 1% of all training data. In particular, ensembles of co-training and self-training classifiers that dynamically balance the set of labeled instances during the semi-supervised iterations improve over the corresponding supervised ensemble baselines.
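
As a rough sketch of one building block, the self-training round below keeps its pseudo-labeled additions balanced by taking the same number of confident positives and negatives each iteration; [68] builds ensembles of such self-training and co-training classifiers, and the logistic-regression base learner and the k/n_iter parameters here are illustrative assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression


def balanced_self_training(X_lab, y_lab, X_unl, base=None, n_iter=10, k=10):
    """Binary self-training (labels 0/1) that adds k confident positives and
    k confident negatives per round, keeping the pseudo-labeled set balanced."""
    if base is None:
        base = LogisticRegression(max_iter=1000)
    X_lab, y_lab, X_unl = X_lab.copy(), y_lab.copy(), X_unl.copy()
    for _ in range(n_iter):
        if len(X_unl) < 2 * k:
            break
        clf = clone(base).fit(X_lab, y_lab)
        p_pos = clf.predict_proba(X_unl)[:, 1]
        pos = np.argsort(-p_pos)[:k]          # most confident positives
        neg = np.argsort(p_pos)[:k]           # most confident negatives
        chosen = np.concatenate([pos, neg])
        X_lab = np.vstack([X_lab, X_unl[chosen]])
        y_lab = np.concatenate([y_lab, np.r_[np.ones(k), np.zeros(k)]])
        X_unl = np.delete(X_unl, chosen, axis=0)
    return clone(base).fit(X_lab, y_lab)
```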

Learning from both labeled and unlabeled instances via self-training with decision tree learners as base learners is proposed in [69]. The authors show that a standard decision tree algorithm is not effective as the base learner in a self-training algorithm, mainly because the basic decision tree learner does not produce reliable probability estimates for its predictions and therefore cannot provide a proper selection criterion for self-training. They considered several modifications to the basic decision tree learner that produce better probability estimates than the raw class distributions at the leaves: Naive Bayes Tree, a combination of no-pruning and Laplace correction, grafting, and a distance-based measure. They show that these modifications do not improve performance when only the labeled data are used, but they benefit more from the unlabeled data in self-training. They then extended this improvement to ensembles of decision trees and show that the ensemble learner gives a further improvement over the adapted decision tree learners.
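
To illustrate one of these modifications, the sketch below combines no pruning with a Laplace correction on the leaf class counts and uses the resulting probabilities as the self-training selection criterion. Subclassing scikit-learn's DecisionTreeClassifier and the 0.9 confidence threshold are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


class LaplaceTree(DecisionTreeClassifier):
    """Unpruned decision tree whose probabilities use a Laplace correction on
    the leaf class counts instead of the raw leaf frequencies."""

    def fit(self, X, y, **kw):
        super().fit(X, y, **kw)                      # default sklearn tree: no pruning
        y = np.asarray(y)
        leaves = self.apply(X)
        # recount training samples per (leaf, class) for Laplace smoothing
        self.leaf_counts_ = {
            leaf: np.array([(y[leaves == leaf] == c).sum() for c in self.classes_],
                           dtype=float)
            for leaf in np.unique(leaves)
        }
        return self

    def predict_proba(self, X):
        k = len(self.classes_)
        out = np.empty((len(X), k))
        for i, leaf in enumerate(self.apply(X)):
            c = self.leaf_counts_[leaf]
            out[i] = (c + 1.0) / (c.sum() + k)       # Laplace-corrected estimate
        return out


def self_train(X_lab, y_lab, X_unl, threshold=0.9, n_iter=10):
    """Self-training that only trusts predictions above a confidence threshold."""
    for _ in range(n_iter):
        tree = LaplaceTree().fit(X_lab, y_lab)
        proba = tree.predict_proba(X_unl)
        conf, pred = proba.max(axis=1), tree.classes_[proba.argmax(axis=1)]
        keep = conf >= threshold
        if not keep.any() or keep.all():
            break
        X_lab = np.vstack([X_lab, X_unl[keep]])
        y_lab = np.concatenate([y_lab, pred[keep]])
        X_unl = X_unl[~keep]
    return LaplaceTree().fit(X_lab, y_lab)
```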

In [70], the authors describe the stochastic semi-supervised learning approach used in their submission to all six tasks of the 2009-2010 Active Learning Challenge. The method was designed for binary classification when the number of labeled data points is extremely small and the two classes are highly imbalanced. It starts with only one positive seed provided by the contest organizers. The authors randomly picked additional unlabeled data points and treated them as "negative" seeds, relying on the fact that the positive label is rare across all datasets. A classifier was trained on these "labeled" data points and used to predict the unlabeled dataset, and the final result was taken to be the average over "n" such stochastic iterations. Once a large number of labels had been purchased, supervised learning was used instead. The approach worked well on 5 out of 6 datasets, which ranked the authors 3rd in the contest.
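
A compact sketch of the stochastic procedure as described above (not the contestants' code): each iteration treats a random draw of unlabeled points as "negative" seeds, trains a classifier against the positive seed, and the positive-class scores are averaged over the iterations. The logistic-regression learner and the n_neg/n_iter values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def stochastic_negative_seeding(X, pos_idx, n_iter=100, n_neg=50, seed=None):
    """Average positive-class scores over n_iter classifiers, each trained on the
    known positive seed(s) plus a random sample of unlabeled points treated as
    'negative' seeds (reasonable when positives are rare)."""
    rng = np.random.default_rng(seed)
    pos_idx = np.atleast_1d(pos_idx)
    candidates = np.setdiff1d(np.arange(len(X)), pos_idx)   # unlabeled pool
    scores = np.zeros(len(X))
    for _ in range(n_iter):
        neg_idx = rng.choice(candidates, size=n_neg, replace=False)
        idx = np.concatenate([pos_idx, neg_idx])
        y = np.concatenate([np.ones(len(pos_idx)), np.zeros(n_neg)])
        clf = LogisticRegression(max_iter=1000).fit(X[idx], y)
        scores += clf.predict_proba(X)[:, 1]
    return scores / n_iter          # averaged positive-class score for every point
```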

A framework that addresses the imbalanced data problem using semi-supervised learning is proposed in [71]. Specifically, the authors turn a supervised problem into a semi-supervised one and then use a semi-supervised learning method to identify the most relevant instances with which to establish a well-defined training set. Extensive experimental results demonstrate that the proposed framework significantly outperforms all other sampling algorithms in 67% of the cases across three different classifiers, and ranks second best in the remaining 33% of the cases.

A combined co-training and random subspace generation technique is proposed in [72] for sentiment classification. The dynamic strategy for generating random subspaces has two main advantages over the static strategy. First, it keeps the subspace classifiers quite different from each other even when their training data become similar after some iterations. Second, since the most helpful features for sentiment classification (e.g., sentiment words) usually account for only a small portion of the feature set, a single random subspace might contain few useful features; when this happens in the static strategy, the corresponding subspace classifier performs badly at selecting correct samples from the unlabeled data, and semi-supervised learning fails. The dynamic strategy avoids this problem, as sketched below.
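
A sketch of the dynamic strategy (my reconstruction, not the implementation of [72]): at every co-training iteration each view re-draws a fresh random feature subspace, trains on it, and hands its most confident unlabeled samples to the other view. The multinomial naive Bayes base learner (suited to bag-of-words counts) and the sub_frac/k settings are assumptions for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.naive_bayes import MultinomialNB


def dynamic_subspace_cotraining(X_lab, y_lab, X_unl, n_iter=10,
                                sub_frac=0.5, k=5, base=None, seed=None):
    """Co-training where each view is a freshly drawn random feature subspace
    at every iteration, keeping the two view classifiers diverse even as their
    training pools grow similar."""
    if base is None:
        base = MultinomialNB()                       # assumes non-negative features
    rng = np.random.default_rng(seed)
    n_feat = X_lab.shape[1]
    n_sub = max(1, int(sub_frac * n_feat))
    # one training pool per view; each view teaches the other
    pools = [(X_lab.copy(), y_lab.copy()), (X_lab.copy(), y_lab.copy())]
    for _ in range(n_iter):
        if len(X_unl) < 2 * k:
            break
        for teacher in (0, 1):
            feats = rng.choice(n_feat, size=n_sub, replace=False)  # new subspace
            Xp, yp = pools[teacher]
            clf = clone(base).fit(Xp[:, feats], yp)
            proba = clf.predict_proba(X_unl[:, feats])
            conf = proba.max(axis=1)
            pred = clf.classes_[proba.argmax(axis=1)]
            chosen = np.argsort(-conf)[:k]            # most confident samples
            Xs, ys = pools[1 - teacher]               # hand them to the other view
            pools[1 - teacher] = (np.vstack([Xs, X_unl[chosen]]),
                                  np.concatenate([ys, pred[chosen]]))
            X_unl = np.delete(X_unl, chosen, axis=0)
    # final classifier on all features, trained on both enlarged pools
    X_all = np.vstack([pools[0][0], pools[1][0]])
    y_all = np.concatenate([pools[0][1], pools[1][1]])
    return clone(base).fit(X_all, y_all)
```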
