Explorations in imbalanced data classification

, Precognox

Human vs. automatic classification

As human beings, we are compulsive classifiers. For our ancestors, distinguishing a lion from a cat was essential for survival. Thousands of years later we still do the same: each morning when we start working, our brain has to classify the objects around us as a chair, a desk, a computer, a mouse, and so on. Classification is one of the most basic and, more importantly, one of the most complex human cognitive abilities. One would therefore assume that humans are masters of classification, and that learning how humans acquire and apply their categories must be a great challenge for machines driven by artificial intelligence. So, are computers worse classifiers than humans?

It depends on the task. In some image recognition tasks, such as sorting images into categories, computers have already achieved a lower error rate than humans. However, humans still cope better than computers with evaluations that depend strongly on subjective factors. Our data science team took up the challenge and decided to train an algorithm that can handle such a highly complex classification problem.

Human-annotated and imbalanced data

Our data, provided by our customer, Járókelő, consisted of service providers' responses to users' complaints. You can read more about our joint project with Járókelő here. As a first step, annotators rated the texts written by the authorities in charge of tackling the problems raised by the users on a scale of 1 to 5. The better the response was deemed, the higher the score it received. The scores were not given at random; certain factors were taken into consideration, e.g. the degree of politeness, whether the user was addressed directly, and the length of the reply. Later, these factors were passed to the automatic classifier as features, in order to teach our algorithm to classify the responses as similarly to humans as possible. Naturally, we also included features that computers can identify more precisely than humans, such as the proportion of nouns and the proportions of positive, negative and neutral sentiment in the texts. Human annotators may also have had a general impression of these factors and been influenced by them when forming their overall evaluation. In the end, we had a data set that contained automatically extracted features and five classes.

However, it soon turned out that the five target classes were unevenly distributed: we had far fewer examples of classes 1 and 2 than of classes 3, 4 and 5. Because of this overrepresentation, we expected our algorithm to overlearn the frequent classes and to perform worse on the underrepresented ones. Consequently, we needed to train a classifier that could handle human-annotated and imbalanced data.
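A quick way to spot this kind of imbalance is to count the examples per class before training. The sketch below is only an illustration: the label list is made up to mimic a skew towards classes 3 to 5, and `class_distribution` is a hypothetical helper, not part of the original project.

```python
from collections import Counter


def class_distribution(labels):
    """Return {class: (count, share of all examples)}, sorted by class label."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: (n, n / total) for cls, n in sorted(counts.items())}


# Hypothetical annotation scores (NOT the real data), skewed towards 3, 4 and 5.
toy_labels = [1] * 5 + [2] * 10 + [3] * 40 + [4] * 30 + [5] * 15

for cls, (n, share) in class_distribution(toy_labels).items():
    print(f"class {cls}: {n:3d} examples ({share:.0%})")
```

If one or two classes hold only a small share of the examples, as classes 1 and 2 do in this toy list, a classifier can reach a decent accuracy while rarely predicting them at all.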

And the winner is…

To pursue our mission, we chose Random Forest as the learning algorithm because of its strong performance compared to other learning methods. We carried out the task in Orange, which has the great advantage that it can be fed separate training and test sets. The Random Forest classifier was trained on four versions of the same training set: the original imbalanced one, an oversampled one, an undersampled one, and one that combines the latter two techniques. To be able to compare these methods, we evaluated all four models on the same imbalanced test data. Let us see which one came closest to the classification done by the annotator team.

1. Imbalanced training set

Training the algorithm on the imbalanced training set and testing it on imbalanced data resulted in a precision of 0.420, a recall of 0.500 and a classification accuracy of 0.479. The classifier predicted far more instances of class 3 than there actually are (see the confusion matrix below). This can be explained by the lack of clear boundaries between the classes: the algorithm cannot properly distinguish instances of class 3 from those of classes 2 and 4. Furthermore, since classes 3, 4 and 5 are overrepresented in the training data, the algorithm tends to overpredict these classes in the test data, too.

[Figure: confusion matrix of the classifier trained on the imbalanced training set. Source: Precognox]
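For readers who want to see how a confusion matrix relates to precision, recall and accuracy, here is a minimal sketch in plain Python. The labels are invented for illustration and the function names are hypothetical. The averaging Orange applied to arrive at the single precision and recall figures above is not stated in this post, so the sketch reports per-class scores only.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(a, p)] for p in classes] for a in classes]


def per_class_scores(matrix):
    """Return one (precision, recall) tuple per class."""
    scores = []
    for i in range(len(matrix)):
        true_positives = matrix[i][i]
        predicted_i = sum(row[i] for row in matrix)  # column sum
        actual_i = sum(matrix[i])                    # row sum
        precision = true_positives / predicted_i if predicted_i else 0.0
        recall = true_positives / actual_i if actual_i else 0.0
        scores.append((precision, recall))
    return scores


def accuracy(matrix):
    """Share of examples on the diagonal, i.e. classified correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical labels (NOT the real data): class 3 is predicted too often.
classes = [1, 2, 3, 4, 5]
actual = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
predicted = [3, 3, 2, 3, 3, 4, 3, 4, 5, 4]

matrix = confusion_matrix(actual, predicted, classes)
for cls, row in zip(classes, matrix):
    print(f"actual {cls}: {row}")
print("accuracy:", accuracy(matrix))
for cls, (p, r) in zip(classes, per_class_scores(matrix)):
    print(f"class {cls}: precision={p:.3f} recall={r:.3f}")
```

In this toy example the column for class 3 sums to five although only three examples truly belong to it, which is the same kind of overprediction described above.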

2. Undersampled training set

When classes 3, 4 and 5 were undersampled in the training data, precision increased noticeably: it rose from 0.420 to 0.472. In addition, the predicted and actual numbers of instances per class are far closer than before, as the confusion matrix below shows. Recall and classification accuracy, however, did not show a comparable improvement.

[Figure: confusion matrix of the classifier trained on the undersampled training set. Source: Precognox]
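The post does not say which undersampling method was applied. The simplest variant is random undersampling, which keeps only a random subset of each frequent class. The sketch below assumes that variant; the function name, the target size and the toy data are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def undersample(rows, labels, majority_classes, target_size, seed=0):
    """Randomly keep at most target_size examples of each listed class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    out_rows, out_labels = [], []
    for label, members in sorted(by_class.items()):
        if label in majority_classes and len(members) > target_size:
            members = rng.sample(members, target_size)  # drop the rest
        out_rows.extend(members)
        out_labels.extend([label] * len(members))
    return out_rows, out_labels


# Hypothetical training data (NOT the real data): one toy feature per row.
toy_labels = [1] * 5 + [2] * 10 + [3] * 40 + [4] * 30 + [5] * 15
toy_rows = [[float(i)] for i in range(len(toy_labels))]

new_rows, new_labels = undersample(toy_rows, toy_labels, {3, 4, 5}, target_size=10)
print(Counter(new_labels))
```

Only the training set would be resampled this way; the test set stays imbalanced, as in the experiment described here.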

3. Oversampled training set

Surprisingly, balancing the data by oversampling the poorly represented classes 1 and 2 did not improve precision or recall. However, the classification accuracy did improve: it rose from 0.479 to 0.491, which means that our classifier trained on oversampled data classified more examples correctly than the one trained on imbalanced data.
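As with undersampling, the exact oversampling method is not named in this post. A minimal sketch of the simplest variant, random oversampling by duplication, could look like this; the function name, the target size and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def oversample(rows, labels, minority_classes, target_size, seed=0):
    """Duplicate random examples of each listed class up to target_size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    out_rows, out_labels = [], []
    for label, members in sorted(by_class.items()):
        if label in minority_classes and len(members) < target_size:
            missing = target_size - len(members)
            # Draw with replacement, so the extra rows are exact copies.
            members = members + [rng.choice(members) for _ in range(missing)]
        out_rows.extend(members)
        out_labels.extend([label] * len(members))
    return out_rows, out_labels


# Hypothetical training data (NOT the real data): one toy feature per row.
toy_labels = [1] * 5 + [2] * 10 + [3] * 40 + [4] * 30 + [5] * 15
toy_rows = [[float(i)] for i in range(len(toy_labels))]

new_rows, new_labels = oversample(toy_rows, toy_labels, {1, 2}, target_size=20)
print(Counter(new_labels))
```

Note that duplication adds no new information about the rare classes; it only changes how much weight they carry during training.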

4. SMOTEENN training set

Combining over- and undersampling of the training set (a method called SMOTEENN) seems to be our number one solution in terms of precision, with a remarkable score of 0.667. However, as the confusion matrix below suggests, it heavily overpredicts classes 1 and 2, which is exactly the opposite of what happens with the imbalanced training set.

[Figure: confusion matrix of the classifier trained on the SMOTEENN training set. Source: Precognox]
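SMOTEENN chains two steps: SMOTE creates synthetic minority examples by interpolating between neighbouring minority points, and Edited Nearest Neighbours (ENN) then removes examples whose label disagrees with their neighbours. The following plain-Python sketch shows the idea on toy data. It is a simplification under stated assumptions (Euclidean distance, majority-vote ENN, invented points and function names), not the implementation used in the experiment.

```python
import math
import random
from collections import Counter


def nearest(point, candidates, k):
    """Indices of the k candidates closest to point (Euclidean distance)."""
    order = sorted(range(len(candidates)),
                   key=lambda i: math.dist(point, candidates[i]))
    return order[:k]


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbour = others[rng.choice(nearest(base, others, k))]
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def edited_nearest_neighbours(rows, labels, k=3):
    """Keep only examples whose label matches the majority of their k neighbours."""
    kept_rows, kept_labels = [], []
    for i, (row, label) in enumerate(zip(rows, labels)):
        other_rows = rows[:i] + rows[i + 1:]
        other_labels = labels[:i] + labels[i + 1:]
        votes = Counter(other_labels[j] for j in nearest(row, other_rows, k))
        if votes.most_common(1)[0][0] == label:
            kept_rows.append(row)
            kept_labels.append(label)
    return kept_rows, kept_labels


# Hypothetical 2-D points (NOT the real data). The last class-3 point sits
# inside the class-1 region and acts as a noisy example.
class_1 = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
class_3 = [[5.0, 5.0], [5.0, 6.0], [6.0, 5.0], [6.0, 6.0], [5.5, 5.5], [0.5, 0.5]]

synthetic = smote(class_1, n_new=3)                   # step 1: oversample class 1
rows = class_1 + synthetic + class_3
labels = [1] * (len(class_1) + len(synthetic)) + [3] * len(class_3)
clean_rows, clean_labels = edited_nearest_neighbours(rows, labels)  # step 2: clean

print(Counter(labels), "->", Counter(clean_labels))
```

In this toy setting the cleaning step removes the stray class-3 point that lies among the class-1 examples. Trimming frequent classes near the rare ones in this way, on top of adding synthetic rare examples, is one possible reason why a model trained on such data leans towards the rare classes.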

Boosting our achievements further

What this experiment suggests is that approximating human performance in automatic classification is much harder with imbalanced, human-annotated data in five classes than with less complex classification tasks. We can conclude that there is no single one-size-fits-all method; instead, there are several solutions, and the choice among them should follow the needs of your clients. With our clients' satisfaction in mind, we are still improving our automatic classifier by optimizing the distance measures used by the algorithms.