Unsupervised Machine Learning: A 5-Minute Beginner’s Guide

Background

We get a lot of questions about Unsupervised Machine Learning here at DataVisor, because UML is at the core of our detection platform. In this 5-minute primer on UML, we start by defining the overarching field of Artificial Intelligence, then we drill down to the sub-field of Machine Learning, and lastly we discuss the various machine learning techniques, including UML, and when each ML technique is most effective.

What is Artificial Intelligence?

Artificial intelligence is a broad branch of computer science dealing with the simulation of intelligent behavior in computers. After analyzing the cat in Figure 1A, both a person and a working AI model can identify that Figure 1B is also a cat.

Figure 1A
Figure 1B

This ability is a simulation of human intelligence. The AI model has the ability to identify cartoon cats based on real cats.

But how does the AI model identify that Figure 1B is a cat? A rudimentary method would be for a programmer to manually create an enormous, detailed decision tree, hard-coding each branch by hand, that would allow the model to identify the cat. Machine learning is the branch of artificial intelligence that solves this problem by using training data to “teach” an algorithm how to do a task, rather than having a programmer hard-code it manually.

What is Machine Learning?

Machine Learning is a branch of Artificial Intelligence that allows algorithms to learn from existing data and then apply that knowledge to new data. In our example of identifying a cat, a machine learning model would analyze large numbers of cat photos and illustrations and it would “learn” to identify cats based on that data.

Figure 2: A training data set of cat photos

Many machine learning algorithms have been developed to help computers identify objects, including neural networks, Bayesian classifiers, decision trees, and clustering algorithms. These algorithms can broadly be grouped into three categories: supervised learning, unsupervised learning, and reinforcement learning. We’ll cover each in turn and discuss their most common use cases.

There are three primary categories of machine learning techniques:

Supervised Machine Learning (and Deep Learning)

Supervised learning is the most common type of machine learning. It requires labeled training data, and the training goal is to label new data (test data) correctly. For example, to teach an algorithm to label e-mails as spam, we manually label a set of e-mails as spam or non-spam and provide these to the supervised machine learning model as training data. The model learns from the e-mails and their labels. Once training is complete, we introduce new, unlabeled e-mails, and the model identifies whether each one is spam or non-spam based on what it learned from the training data set.
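To make the spam example concrete, here is a minimal sketch in Python of one possible supervised learner, a small naive Bayes classifier. The training e-mails, the train and predict function names, and the choice of naive Bayes are all assumptions made for illustration; the article does not say which algorithm a real spam filter would use.

```python
import math
from collections import Counter

# Hypothetical hand-labeled training data: (e-mail text, label).
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("limited offer win cash", "spam"),
    ("meeting agenda for monday", "non-spam"),
    ("lunch on monday with the team", "non-spam"),
    ("please review the project agenda", "non-spam"),
]

def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the label with the highest naive Bayes log-score."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen during training
            # Add-one smoothing avoids a zero probability for unseen pairs.
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAINING)
new_emails = ["claim your free prize", "agenda for the team meeting"]
for email in new_emails:
    print(email, "->", predict(model, email))
```

The labels come only from the training data: the model is never told what makes an e-mail spam, it just learns which words tend to appear under each label.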

One particularly popular technique, often used for supervised learning, is deep learning, in which a computer algorithm loosely simulates the way a human brain learns: it creates and reinforces connections between features in a way similar to how the brain creates and reinforces neural connections. A deep learning model analyzes the photos in many different, and sometimes hidden, ways. Each analysis step is called a layer, and the deep learning model builds many layers at different levels of abstraction to discover ways to represent the data. Low-level layers might capture basic color or contrast data. Mid-level layers might capture edges and shapes. High-level layers might capture human-recognizable features like whiskers, eyes, and ears. By analyzing the layers at different levels, deep learning is able to learn to group cat photos.

Figure 3: Lynx
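The idea of layers feeding into one another can be sketched in a few lines of Python. This is only an illustration of a forward pass through stacked layers: the three input values and all of the weights below are made up by hand, whereas a real deep learning model would learn its weights from training data.

```python
def relu(values):
    """Keep positive signals and zero out negative ones."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One layer: each output is a weighted sum of all inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

# Hypothetical, hand-picked weights (a trained model would learn these).
W1, B1 = [[1.0, -1.0, 0.0], [0.0, 1.0, 1.0]], [0.0, 0.0]   # low-level layer
W2, B2 = [[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0]             # mid-level layer
W3, B3 = [[2.0, 1.0]], [-1.0]                              # high-level layer

pixels = [0.2, 0.8, 0.5]                   # stand-in for raw image data
hidden1 = relu(dense(pixels, W1, B1))      # e.g. color or contrast signals
hidden2 = relu(dense(hidden1, W2, B2))     # e.g. edges and shapes
score = dense(hidden2, W3, B3)[0]          # e.g. a final "cat" score

print(hidden1, hidden2, score)
```

Each layer only sees the output of the layer before it, which is how later layers can represent more abstract features than earlier ones.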

Unsupervised Machine Learning

Unsupervised learning is often used to discover patterns within large amounts of unlabeled data. Its training data is unlabeled, and the training goal is to identify clusters of similar data points. For example, an unsupervised learning algorithm should be able to distinguish a group of “cat” photos from a large variety of other pictures, based on the characteristics shared by the photos of cats.

What makes DataVisor’s anti-fraud algorithm unique is its use of unsupervised learning. There are three main applications of unsupervised learning: clustering, anomaly detection, and dimensionality reduction. Using the clustering method, an algorithm assigns observations to groups one by one, so that the members of each group share one or more features.

Properly extracting features is the most critical aspect of unsupervised learning. For example, in the identification of cats, attempts are made to extract the characteristics of cats: fur, limbs, ears, eyes, whiskers, teeth, tongues, and the like. By clustering animals with the same characteristics, cats can be grouped together. At this point, however, we don’t know what the group is. We only know that all data in the group belongs to the same category. Rabbits and airplanes are not in this category, since their characteristics do not fit. The validity of the features directly determines the effectiveness of the algorithm. If we cluster by weight and ignore body features, it is difficult to distinguish between rabbits and cats.
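As a rough illustration of clustering, the Python sketch below runs k-means, one common clustering algorithm, on a handful of made-up animals. The two features (ear length and tail length in centimeters), the data points, and the choice of k-means are assumptions for the example and are not a description of DataVisor’s algorithm.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        updated = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster in place
        centroids = updated
    return labels, centroids

# Hypothetical unlabeled observations: (ear length cm, tail length cm).
animals = [
    (6.0, 28.0), (6.5, 30.0), (7.0, 27.0),   # short ears, long tails
    (11.0, 5.0), (12.0, 6.0), (10.5, 4.5),   # long ears, short tails
]

# Start from two of the observations as initial guesses for the group centers.
labels, centers = kmeans(animals, [animals[0], animals[-1]])
print(labels)
```

The output is only a group number for each animal. Nothing in the result says which group is “cat” and which is “rabbit”; the algorithm knows only that the members of each group resemble one another on the chosen features.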

DataVisor’s anti-fraud work catches fraudulent activity, including malicious registrations, hacking, fraudulent loans, and so on. DataVisor’s strength is modeling user behavior and analyzing relationships between users, which lets it effectively capture fraud groups and stop fraud in a timely manner.

Reinforcement Learning

Reinforcement learning is often used in robotics. The goal of the algorithm is to train the machine to perform various actions. Most of the time, the machine is placed in a specific environment in which it can self-train continuously, and the environment gives positive or negative feedback. The model continuously improves its decision-making by learning from feedback from past actions.
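The feedback loop described above can be sketched with a very small Python example. It uses an epsilon-greedy strategy on a one-step problem (often called a multi-armed bandit), which is a deliberately simplified form of reinforcement learning. The environment function, the three actions, and the reward values are hypothetical.

```python
import random

def environment(action):
    """Hypothetical environment: action 2 earns positive feedback,
    every other action earns negative feedback."""
    return 1.0 if action == 2 else -1.0

def train(n_actions=3, steps=200, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    value = [0.0] * n_actions   # running estimate of each action's reward
    tries = [0] * n_actions
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)                       # explore
        else:
            action = max(range(n_actions), key=value.__getitem__)   # exploit
        reward = environment(action)
        tries[action] += 1
        # Move the estimate toward the observed reward (incremental average).
        value[action] += (reward - value[action]) / tries[action]
    return value

values = train()
best_action = max(range(len(values)), key=values.__getitem__)
print(values, best_action)
```

The agent is never told which action is correct. It tries actions, receives positive or negative feedback, and gradually prefers the action that has paid off in the past.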

Which Machine Learning Technique Should I Use?

Different machine learning techniques are appropriate for different situations. So, how do we evaluate the fitness of the algorithm? To start, let’s define a few terms so that we can precisely discuss when an algorithm is successful or unsuccessful.

True Positive (TP): A positive instance that is correctly identified as a positive instance by the model

True Negative (TN): A negative instance that is correctly identified as a negative instance by the model

False Positive (FP): A negative instance that is misidentified as a positive instance by the model

False Negative (FN): A positive instance that is misidentified as a negative instance by the model

Take cat identification as an example. Assume that the model has acquired some recognition ability through learning. We give it 4 pictures, and the model’s predictions are as follows:

Figure 5: Machine Judgement Result
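As a minimal sketch of how the four outcomes are counted, the Python below uses four made-up pictures (not the ones in Figure 5). The lists actual and predicted and the function name count_outcomes are assumptions for the example; True means “cat”.

```python
def count_outcomes(actual, predicted):
    """Count TP, TN, FP, and FN for paired lists of booleans (True = cat)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for is_cat, said_cat in zip(actual, predicted):
        if is_cat and said_cat:
            counts["TP"] += 1      # a cat, correctly called a cat
        elif not is_cat and not said_cat:
            counts["TN"] += 1      # not a cat, correctly rejected
        elif said_cat:
            counts["FP"] += 1      # not a cat, but called a cat
        else:
            counts["FN"] += 1      # a cat, but missed by the model
    return counts

# Hypothetical ground truth and model output for four pictures.
actual = [True, True, False, False]
predicted = [True, False, True, False]

outcomes = count_outcomes(actual, predicted)
print(outcomes)
```

In this made-up case, each of the four outcomes occurs exactly once.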

To understand the effectiveness of a machine learning technique, there are three commonly used evaluation indicators: precision, recall, and accuracy.

Precision: What fraction of positive identifications were correct? This is calculated as TP/(TP+FP).

Recall: What fraction of all truly positive instances did we identify as positive? This is calculated as TP/(TP+FN).

Accuracy: What fraction of predictions (both positive and negative) were correct? This is calculated as (TP+TN)/(TP+TN+FN+FP).

The higher the three indicators, the more effective the algorithm.
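The three formulas translate directly into code. The Python sketch below applies them to made-up counts; the numbers are hypothetical and are not results from any real model.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 if the model made no positive calls."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 if there were no positive instances."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def accuracy(tp, tn, fp, fn):
    """(TP + TN) / (TP + TN + FN + FP); returns 0.0 for an empty data set."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

# Hypothetical counts from evaluating a model on 100 pictures.
tp, tn, fp, fn = 8, 85, 2, 5

p = precision(tp, fp)
r = recall(tp, fn)
a = accuracy(tp, tn, fp, fn)
print(f"precision={p:.2f} recall={r:.2f} accuracy={a:.2f}")
```

With these made-up counts, precision is 0.80, recall is about 0.62, and accuracy is 0.93. The three numbers can differ noticeably on the same predictions, which is why all three are reported.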

Coordinated fraud is common in today’s online environment and unsupervised algorithms can effectively capture fraud rings. When DataVisor’s unsupervised algorithm is applied to certain fraud scenarios, its accuracy rate can be as high as 99%. This demonstrates the applicability and effectiveness of unsupervised algorithms in the Internet industry.

June 5th, 2018 | Quick Take