Technical Blog | Are Labels Useful to Unsupervised Algorithms?

One of the significant advantages of unsupervised machine learning (UML) over detection methods such as supervised machine learning (SML) is that UML does not require labels for successful detection. Since manually labeling data is time-consuming, difficult, and often inaccurate, UML can save significant time and can detect fraud or abuse that might have been missed during manual data labeling.

But does the fact that UML doesn’t require labels mean that there is no benefit at all to labels? If label data exists already, how can it be used to improve UML detection results? In this article we discuss how labels can be effectively used in UML detection, even if they are not required.

How is Unsupervised Learning Different from Supervised Learning?

Supervised Machine Learning (SML) requires a training data set where the abuse or fraud has been identified, or labeled, in advance. The model is trained on this data set so that it can later receive a new, unlabeled data set and detect the same types of abuse and fraud that were labeled in the original training data. For fraud and abuse detection, SML models perform classification: they apply a specific label to each data point. For example, a list of credit card transactions might be classified with the labels “fraud”, “not fraud”, and “unknown.”
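As a minimal sketch of this idea, the Python snippet below trains a toy nearest-centroid classifier on a handful of labeled transactions and then labels a new one. The two features (amount and transactions in the last hour), the sample values, and the function names are assumptions made for illustration; real SML models are considerably more sophisticated.

```python
from collections import defaultdict
from math import dist

# Hypothetical labeled training data: (amount_usd, txns_in_last_hour) -> label
TRAINING = [
    ((12.0, 1), "not fraud"),
    ((30.0, 2), "not fraud"),
    ((25.0, 1), "not fraud"),
    ((900.0, 9), "fraud"),
    ((750.0, 12), "fraud"),
]


def train_centroids(rows):
    """Average the feature vectors of each label to get one centroid per class."""
    grouped = defaultdict(list)
    for features, label in rows:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(column) for column in zip(*vectors))
        for label, vectors in grouped.items()
    }


def classify(centroids, features):
    """Assign the label whose centroid is closest to the new data point."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


centroids = train_centroids(TRAINING)
prediction = classify(centroids, (820.0, 10))  # a new, unlabeled transaction
```

Note that this toy model can only ever answer with a label it saw during training, which is the limitation discussed next.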

SML has many benefits, including ease of setup, since many open-source and commercial SML algorithms are readily available, and the ability to specify the classes of the classification scheme in advance. However, for fraud detection it has many limitations as well. First, labeling training data is a time-consuming and often inaccurate process. If a novel fraud or abuse technique suddenly gains popularity, the SML model won’t be able to detect it until after the technique has been manually labeled and the model has been retrained. In addition, if a new attack technique is not identified by human labelers, it won’t be properly labeled in the training data and the SML model won’t be able to detect it. Second, SML models require a lot of data preprocessing to prevent the problem of overfitting, which we’ll detail in another blog post. This limits the quantity of data that most SML models can realistically use, meaning potentially valuable data sources might be left out.

Unsupervised Machine Learning (UML) differs from SML in that no training data set is required. The model is given a completely unlabeled data set and attempts to find clusters of data points that are similar to each other and unusual compared to the rest of the data set. For example, in attempting to detect fake reviews, the UML model might detect a cluster of reviews for a company in Indiana where the reviewers are all from India and all using Android phones with outdated operating systems.
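A rough sketch of this kind of grouping, using the fake-review example: the snippet below buckets unlabeled reviews by a combination of attributes and keeps only the unusually large buckets. The field names, the sample records, and the `min_cluster_size` threshold are assumptions for illustration; an actual UML engine would rely on far richer similarity measures than exact matches.

```python
from collections import defaultdict

# Hypothetical unlabeled reviews: nothing here says which ones are fake.
REVIEWS = [
    {"id": 1, "business_state": "Indiana", "reviewer_country": "India", "device": "Android 4.4"},
    {"id": 2, "business_state": "Indiana", "reviewer_country": "India", "device": "Android 4.4"},
    {"id": 3, "business_state": "Indiana", "reviewer_country": "India", "device": "Android 4.4"},
    {"id": 4, "business_state": "Indiana", "reviewer_country": "India", "device": "Android 4.4"},
    {"id": 5, "business_state": "Indiana", "reviewer_country": "United States", "device": "iOS 11"},
    {"id": 6, "business_state": "Indiana", "reviewer_country": "United States", "device": "Android 8.0"},
]


def find_suspicious_clusters(reviews, keys, min_cluster_size=3):
    """Group reviews sharing the same values for `keys`; keep only large groups."""
    buckets = defaultdict(list)
    for review in reviews:
        signature = tuple(review[key] for key in keys)
        buckets[signature].append(review["id"])
    return {
        signature: ids
        for signature, ids in buckets.items()
        if len(ids) >= min_cluster_size
    }


clusters = find_suspicious_clusters(
    REVIEWS, keys=("business_state", "reviewer_country", "device")
)
```

With this sample data, the only group that survives the size filter is the set of four reviews written from India on Android 4.4 for the Indiana business.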

By avoiding the need for labels, UML is able to detect new attack techniques far more quickly, often during the attack incubation period, before damage is done. In addition, because UML models can accept a wide range of data without preprocessing, they can analyze orders of magnitude more data without risk of overfitting. This allows them to find patterns in data sources that might not have been analyzed by an SML model.

How Labels are used in Supervised Machine Learning

In Supervised Machine Learning, labeled training data are divided into three different data sets, as described in Brian D. Ripley’s classic monograph Pattern Recognition and Neural Networks (1996):

  1. Training set: The training data set is used first to tune the model parameter weights.
  2. Validation set: Next, a smaller set of data is used to tune the overall model architecture, as opposed to the weights. For example, several candidate artificial neural network architectures can be compared against each other using the validation set.
  3. Test set: After the model is tuned on the training and validation sets, the test set is used to measure the performance of the model.

The test set is kept out of training to guard against overfitting, where the SML model is only able to predict the results of the training data set and cannot effectively predict the results of new data. Because the model has never seen the test set, its performance there shows how well it generalizes.
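The three-way split described above can be sketched in a few lines of Python. The 60/20/20 proportions, the fixed random seed, and the function name are assumptions chosen for illustration; the source does not prescribe specific ratios.

```python
import random


def split_labeled_data(rows, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle labeled rows and split them into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out for testing
    return train, validation, test


# Hypothetical labeled data set: (record_id, label)
labeled = [(i, "fraud" if i % 10 == 0 else "not fraud") for i in range(100)]
train, validation, test = split_labeled_data(labeled)
```

The three slices never overlap, so no record used to tune weights or architecture can leak into the final performance measurement.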

How Labels are used in Unsupervised Machine Learning

In Unsupervised Machine Learning, labels are used to validate the model rather than to train it. If labels are available, they can be used to calculate common model performance criteria like recall, precision, and accuracy, which are described in our earlier post on UML. These performance criteria in turn allow the UML model parameters to be tuned to improve its performance.
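As a hedged sketch of this validation loop, the snippet below scores the output of a hypothetical unsupervised model against existing labels and then picks the detection threshold that scores best. The anomaly scores, the account IDs, the candidate thresholds, and the use of the F1 score (the harmonic mean of precision and recall) as the tuning objective are all assumptions for illustration.

```python
# Hypothetical anomaly scores from an unsupervised model (higher = more suspicious).
SCORES = {"a1": 0.95, "a2": 0.90, "a3": 0.70, "a4": 0.40, "a5": 0.20, "a6": 0.10}
# Existing labels (True = fraud), used only for validation, never for training.
LABELS = {"a1": True, "a2": True, "a3": False, "a4": True, "a5": False, "a6": False}


def score_detections(flagged, labels):
    """Compare the set of flagged account IDs with the known labels."""
    tp = sum(1 for acct, is_fraud in labels.items() if is_fraud and acct in flagged)
    fp = sum(1 for acct, is_fraud in labels.items() if not is_fraud and acct in flagged)
    fn = sum(1 for acct, is_fraud in labels.items() if is_fraud and acct not in flagged)
    tn = len(labels) - tp - fp - fn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(labels) if labels else 0.0,
    }


def f1(metrics):
    """Harmonic mean of precision and recall (an assumed tuning objective)."""
    p, r = metrics["precision"], metrics["recall"]
    return 2 * p * r / (p + r) if p + r else 0.0


def tune_threshold(scores, labels, candidates):
    """Pick the threshold whose flagged set has the highest F1 against the labels."""
    def flagged_at(threshold):
        return {acct for acct, score in scores.items() if score >= threshold}

    return max(candidates, key=lambda t: f1(score_detections(flagged_at(t), labels)))


best_threshold = tune_threshold(SCORES, LABELS, candidates=[0.3, 0.5, 0.8])
```

The model itself never sees the labels; they only decide which parameter setting to keep, which is the sense in which labels validate rather than train a UML model.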

If training labels are not available, then human review is required to evaluate how well the UML model is detecting fraud or abuse. Then the parameters of the UML engine are tuned based on human feedback. 

When Unsupervised Machine Learning is most effective

Unsupervised Machine Learning is particularly effective at detecting fraud and abuse when the activity is coordinated and rapidly evolving.

For internet companies, common use cases include mass registration, account takeover, spam, fake reviews, and promotion abuse. In all of these use cases, attackers attempt to control huge armies of fake or stolen accounts to spread spam, phishing, or fake reviews. They use sophisticated techniques to make each account appear real and are often able to fool human reviewers and supervised machine learning models. 

Unsupervised machine learning is particularly effective at seeing through these techniques because it is able to see the hidden connections between accounts. While labeling isn’t necessary for these use cases, it can help improve the UML model.
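One simple way to picture “hidden connections” is to link accounts that share an attribute value and then look for connected groups. The sketch below does this with a small union-find structure; the account records, the attribute names, and the function name are hypothetical, and this is only an illustration of the idea rather than a description of how any particular UML engine works.

```python
from collections import defaultdict

# Hypothetical account records; each account looks normal in isolation.
ACCOUNTS = {
    "u1": {"ip": "10.0.0.1", "device": "d-A"},
    "u2": {"ip": "10.0.0.1", "device": "d-B"},
    "u3": {"ip": "10.0.0.7", "device": "d-B"},
    "u4": {"ip": "10.0.0.9", "device": "d-C"},
}


def linked_groups(accounts):
    """Return groups of accounts connected through any shared attribute value."""
    parent = {acct: acct for acct in accounts}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Link each account to the first account seen with the same attribute value.
    first_seen = {}
    for acct, attrs in accounts.items():
        for key, value in attrs.items():
            owner = first_seen.setdefault((key, value), acct)
            union(acct, owner)

    groups = defaultdict(set)
    for acct in accounts:
        groups[find(acct)].add(acct)
    # Largest groups first; members sorted for stable output.
    return sorted((sorted(group) for group in groups.values()), key=len, reverse=True)


groups = linked_groups(ACCOUNTS)
```

Here `u1` and `u3` share nothing directly, yet they end up in the same group because both are linked to `u2`, one by IP address and the other by device.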

In the financial industry, UML is highly effective at detecting transaction fraud, credit application fraud, and money laundering activities. Here again, it is the coordination of accounts that UML is uniquely able to detect, because UML analyzes the connections between accounts. As in the internet industry, labels here help improve model performance, although they are not required.

In summary, labels are not required for unsupervised detection, but where they already exist they can be used to validate and tune the UML model in the application scenarios described above.

2018-06-22T12:10:32+00:00 June 5th, 2018|Technical Post|