The basics of clustering behind Deepviz – part 1

In recent years we have seen many AV companies turn to machine learning and artificial intelligence to help isolate new malware and emerging malware families quickly and effectively. The main problem is that many people look at machine learning as if it were a magic wand: you start using machine learning, feed the algorithm as many samples as you can, and the magic is done! Not really.

When we started developing Deepviz we wanted to build not only a powerful and fully scalable automated malware analysis infrastructure, but also a strong and effective threat intelligence platform, able to take all the data extracted by our malware analyzer and correlate it in a way that gets the most out of it.

Machine learning is the key to how Deepviz identifies new malware and finds similar samples, correlates them and spots larger malware families. However, implementing an effective machine learning approach is not as easy as many people think; it does not work like "give the intelligence as many details as possible and it will do the trick".

In this first blog post we want to shed some light on our approach to machine learning and how it helps us identify new malware every day.

Clustering is an analysis technique that aims to identify significant groups in a given dataset. Clustering malware can be very useful, as it allows us to find malware families or samples that share common characteristics. In order to cluster a set of data, all we need is a representation of the distance between the individual elements of the dataset.

Each element is represented by a feature set: a set of attributes that are in some way significant for the object itself. There are two main points to take into consideration here:

  • the first is understanding how to select attributes. As said above, people who think that extracting millions of attributes from a given object is all that clustering needs overlook the fact that the greater the number of attributes considered, the more computational time and effort is required to estimate similarity. At the same time we need enough attributes to isolate the specific behaviors of different malware families: for instance, running a clustering algorithm using only the entropy of PE files as a feature set will lead to poor clustering. This is where our malware analyzer plays the biggest part, extracting the most significant attributes needed for malware analysis.

  • the second is finding the best measure to validate and compare the attributes of malware. Each sample can be represented by numeric (e.g. section entropy) or nominal (e.g. contacted IPs) attributes. In mathematics, a similarity measure is a function that describes how similar two objects are. Euclidean distance is one of the best-known and most widely used measures for comparing numeric values. What about nominal values? Given two samples, each with a list of contacted IPs or URLs, how can we compare these two sets? How can we use them in the clustering process?
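To make the numeric case concrete, here is a minimal sketch of Euclidean distance over numeric attributes; the entropy values are illustrative placeholders, not data from real samples.

```python
import math

# Hypothetical per-section entropy values for two PE files
# (illustrative numbers, not real sample data).
sample_a = [6.8, 7.2, 5.1]
sample_b = [6.5, 7.4, 4.9]

# Euclidean distance: square root of the sum of squared differences.
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(sample_a, sample_b)))
```

The smaller the value, the more similar the two feature vectors. This works only for numeric attributes; sets of URLs or IPs need a different kind of measure.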

Let’s take an example. This picture shows all MD5s contacting the domain


Our clustering algorithm grouped the samples into 4 different families. The picture above is just the last step of the whole process.

In order to find similar sets, we need to know how similar, or different, every element is to every other. These values can be represented through a distance matrix.

Based on what we have written so far, we need to:

  • Choose a feature set, composed of one or more attributes that can be used to evaluate the distance between the elements of a given dataset
  • Choose an appropriate distance measure that will be applied to the attributes

Then the needed steps are:

  • Compute, for each element, the distance between itself and every other element in the dataset
  • Apply a clustering algorithm to group similar items. In our example the resulting sets are malicious families
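The first of these steps can be sketched as follows; the function name and the toy one-dimensional dataset are our own illustrative assumptions, not Deepviz internals.

```python
def distance_matrix(elements, dist):
    """Compute the pairwise distance matrix for a dataset, given a
    distance function; only half is computed, since distance is symmetric."""
    n = len(elements)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(elements[i], elements[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix

# Toy dataset with a single numeric attribute per element.
elements = [1.0, 1.1, 5.0, 5.2]
matrix = distance_matrix(elements, lambda a, b: abs(a - b))
```

A clustering algorithm would then consume this matrix and group the first two and the last two elements together.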

When we wrote about calculating distance, we said that Euclidean distance is widely used for numeric values. Now we want to introduce the Jaccard distance as a measure for nominal values.

Let’s give an example and consider a feature set composed of a single attribute: the list of contacted URLs for the samples 26414a9d627606c4974d8c3f372b0797 and 27f72541c93e206dcd5b2d4171e66f9a:


Sample: 26414a9d627606c4974d8c3f372b0797
Sample: 27f72541c93e206dcd5b2d4171e66f9a

The Jaccard similarity measure is one of the most widely used similarity measures for comparing sets of nominal values. It is defined as the size of the intersection divided by the size of the union of the two given sets. If there are no intersecting elements, the Jaccard measure is 0; if the two sets contain exactly the same elements, the Jaccard measure is 1. Here is the formula:

J(A, B) = |A ∩ B| / |A ∪ B|


Considering the example above:

  • set A is composed of 19 elements
  • set B is composed of 12 elements
  • the intersection is composed of 4 elements

Thus the union contains 19 + 12 - 4 = 27 elements and the Jaccard similarity between the two sets is 4/27 ≈ 0.15. Note that a similarity measure can be converted into a distance measure with the following formula:

distance = 1 - similarity
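A minimal Python sketch of both formulas; the URL sets below are hypothetical placeholders, not the real lists of the two samples above.

```python
def jaccard_similarity(a, b):
    # |A ∩ B| / |A ∪ B|; we treat two empty sets as having similarity 0.
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical contacted-URL sets (placeholders, not real sample data).
set_a = {"example.com/a", "example.com/b", "shared.example/x", "shared.example/y"}
set_b = {"example.org/c", "shared.example/x", "shared.example/y"}

similarity = jaccard_similarity(set_a, set_b)
distance = 1 - similarity
```

Here the sets share 2 of 5 distinct URLs, giving a similarity of 0.4 and a distance of 0.6; with the set sizes from the article (19, 12, and an intersection of 4) the same function yields 4/27 ≈ 0.15.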


If we calculate the Jaccard distance between all the samples in the set we obtain the distance matrix:


The distance between A and B is 0.15, between A and C it is 0.8, and so on. The distance between an element and itself is obviously 0. On a side note, the distance between A and B is the same as the distance between B and A: a distance measure must satisfy the symmetric property. To reduce the computational cost of building the matrix, we can exploit this symmetry and compute only half of it.

Once computed, the distance matrix can be used as the input to a clustering algorithm. Clearly, the more effective and significant the attributes extracted and used to build the distance matrix, the better the clustering algorithm will perform.
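As a toy illustration of this last step (our own sketch of a threshold-based, single-linkage-style grouping, not Deepviz's actual algorithm), a clustering routine consuming a precomputed distance matrix might look like this:

```python
def cluster(matrix, threshold):
    """Group elements whose pairwise distance is below the threshold;
    groups are merged transitively, single-linkage style."""
    n = len(matrix)
    labels = list(range(n))  # each element starts in its own cluster
    for i in range(n):
        for j in range(i + 1, n):
            if matrix[i][j] < threshold:
                # merge j's cluster into i's cluster
                old, new = labels[j], labels[i]
                labels = [new if l == old else l for l in labels]
    return labels

# Distance matrix for four hypothetical samples A, B, C, D
# (illustrative values in the spirit of the matrix described above).
matrix = [
    [0.0, 0.15, 0.8, 0.9],
    [0.15, 0.0, 0.85, 0.8],
    [0.8, 0.85, 0.0, 0.1],
    [0.9, 0.8, 0.1, 0.0],
]
labels = cluster(matrix, threshold=0.5)  # A,B share one label; C,D another
```

With these distances, A and B end up in one group and C and D in another, which is exactly the kind of family separation described above.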

In the next blog post we will cover the basics of clustering algorithms. Stay tuned!
