Class Imbalance Learning
The Class Imbalance Learning (CIL) problem is concerned with the performance of classification algorithms in the presence of under-represented data and severe class distribution skews. Because imbalanced datasets have inherently complex characteristics, learning from them requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representations. Studying CIL is necessary because real-world classification problems rarely follow balanced class distributions. We present some real-world use cases in which the data distribution is naturally imbalanced and the class of interest is typically the minority class.
Computer Assisted Coding: A computer assisted coding system (CACS) is a software system that analyzes healthcare documents (medical charts) and produces appropriate medical codes for specific phrases and terms within the document. ICD-10 is the 10th revision of the International Statistical Classification of Diseases and Related Health Problems (ICD), a medical classification list maintained by the World Health Organization (WHO). It contains codes for diseases, signs and symptoms, abnormal findings, complaints, social circumstances, and external causes of injury or disease. ICD-10 Clinical Modification has about 69,833 codes, and ICD-10 Procedure Coding System has 71,918 codes. This makes medical coding a demanding test case for class imbalance learning: the task is a multi-label classification problem with thousands of classes and a severely skewed class distribution. Companies such as BuddiHealth provide hybrid deep learning and knowledge-based solutions to this extreme classification problem, while companies such as 3M, Optum, Nuance, Atiego, ezCAC, and Opera provide statistical NLP and rule-based solutions.
Suicide Prevention Tools: Facebook has introduced a suicide-prevention feature that uses AI to identify posts indicating suicidal or harmful thoughts. The AI scans posts and their associated comments, compares them to others that merited intervention, and, in some cases, passes them along to the company's community team for review. The company plans to proactively reach out to users it believes are at risk, showing them a screen with suicide-prevention resources, including options to contact a helpline or reach out to a friend. Algorithms, trained on report data from the network’s close to two billion users, are constantly on the lookout for warning signs in the content that users post, as well as in the replies they receive. When a red flag is raised, a team of human reviewers is alerted, and the user can be contacted and offered help.
Fraud Management: Fraud management is an example of big data with class imbalance characteristics, where suspicious behaviors are “fortunately” rare events. Big data analytics link heterogeneous information from transaction data, which enables the service provider to detect these behaviors automatically. For example, a series of cash-in transactions to the same account from different locations might be an attempt to avoid paying for domestic transfers, and several cash-ins immediately followed by a cash-out could indicate money laundering. Best practice states that no actions are purely automated; the fraud analyst always has the final say. As a population’s behavior evolves over time, the parameters of the fraud detection models must adapt to remain optimal. To this end, machine learning algorithms predict the natural evolution of behavior from historical data and from previous decisions taken by fraud managers, and they propose modifications to the detection model for future anomaly detection.
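The two transaction patterns mentioned above can be sketched as simple rules. This is only an illustrative sketch: the `Transaction` record, the function name `flag_suspicious_accounts`, the rule names, and all thresholds (three cash-ins, three locations, a one-hour window) are assumptions introduced here, not values from any production system. In line with the best practice noted above, the output is a list of accounts to hand to a fraud analyst, not an automated decision.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    account: str      # receiving account identifier (hypothetical schema)
    kind: str         # "cash_in" or "cash_out"
    location: str     # where the transaction was made
    timestamp: float  # seconds since an arbitrary epoch


def flag_suspicious_accounts(transactions, min_cash_ins=3, min_locations=3,
                             window_seconds=3600.0):
    """Return {account: sorted rule names} for accounts matching either rule.

    All thresholds are illustrative assumptions.
    """
    by_account = defaultdict(list)
    for tx in transactions:
        by_account[tx.account].append(tx)

    flags = defaultdict(set)
    for account, txs in by_account.items():
        txs.sort(key=lambda tx: tx.timestamp)
        cash_ins = [tx for tx in txs if tx.kind == "cash_in"]

        # Rule 1: several cash-ins to one account from distinct locations.
        if (len(cash_ins) >= min_cash_ins
                and len({tx.location for tx in cash_ins}) >= min_locations):
            flags[account].add("multi_location_cash_in")

        # Rule 2: a run of cash-ins followed shortly by a cash-out.
        run = []  # consecutive cash-ins seen since the last cash-out
        for tx in txs:
            if tx.kind == "cash_in":
                run.append(tx)
            elif tx.kind == "cash_out":
                if (len(run) >= min_cash_ins
                        and tx.timestamp - run[-1].timestamp <= window_seconds):
                    flags[account].add("cash_in_then_cash_out")
                run = []
    return {account: sorted(rules) for account, rules in flags.items()}


# Toy data: account "A" matches both rules, account "B" matches neither.
txs = [
    Transaction("A", "cash_in", "north", 0.0),
    Transaction("A", "cash_in", "south", 60.0),
    Transaction("A", "cash_in", "east", 120.0),
    Transaction("A", "cash_out", "east", 180.0),
    Transaction("B", "cash_in", "north", 0.0),
    Transaction("B", "cash_out", "north", 30.0),
]
flagged = flag_suspicious_accounts(txs)  # candidates for analyst review
```

Fixed thresholds like these are exactly the parameters that would need to adapt as customer behavior evolves, which is where the learned models described above come in.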
Churn Prediction: One of the most important business metrics for executives is the churn rate, the rate at which customers stop doing business with a company. Today, data-driven companies use data science to predict which customers are likely to churn. Such programs allow companies to proactively protect revenue by giving potential churners incentives to stay. Networking and communication companies typically have business-level data, usage logs, and call center and support tickets, among other data assets. This data is generated from their interactions with consumer and business customers, and the datasets vary in volume and in user behavior. The machine learning task of interest is to predict whether a customer will churn; churners are typically a very small subset of the entire customer population.
Buyer Prediction: Millions of consumers visit e-commerce sites, but only a few thousand visitors buy products, which puts the imbalance ratio on the order of 1000:1 or more. A typical e-retailer wants to improve the customer experience and the conversion rate. The objective is to identify potential buyers based on their demographics, historical transaction patterns, click patterns, browsing patterns across different pages, and so on. Deep data analysis reveals buying behaviors that depend heavily on visitor activity, such as the number of clicks, session duration, previous sessions, purchase sessions, and click rate per session. By applying machine learning and predictive analytics methods, the propensity score of each visitor can be estimated. This brings multiple benefits for the e-retailer: offering the right, targeted product to customers at the right time, increasing the conversion rate, and improving customer satisfaction.
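To make the 1000:1 figure concrete, the following sketch computes the imbalance ratio for a scaled-down, hypothetical set of visitor labels and derives inverse-frequency class weights, one common heuristic for keeping a propensity model from ignoring the minority class. The function names and the label counts are assumptions made for illustration; the weighting formula n_samples / (n_classes * class_count) is the widely used "balanced" heuristic, not a method prescribed by the text.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Ratio of the largest class count to the smallest class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


# Hypothetical, scaled-down traffic: 100,000 non-buyers (0) and 100 buyers (1).
labels = [0] * 100_000 + [1] * 100

ratio = imbalance_ratio(labels)           # 1000.0, i.e. 1000:1
weights = balanced_class_weights(labels)  # {0: 0.5005, 1: 500.5}
```

With these weights, misclassifying one buyer costs as much as misclassifying a thousand non-buyers, so each class contributes equally to a weighted training loss.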