The basics of clustering behind Deepviz – part 2

In our previous blog post we introduced the basics of clustering and the first steps to follow when you want to start clustering data (malware, in this case). In this follow-up we cover the next steps: what needs to be done once you have selected the right attributes and the best measure to compare them, starting with the basics of clustering algorithms.

The next logical step, once you have selected the attributes and calculated the distance between the elements, is to group all elements into sets which share common characteristics, such as contacted IPs, URLs, imported APIs and whatever else the researcher selected as attributes.

Clustering algorithms group items based on their mutual distance. Put simply, whether an item is included in a specific set depends on its distance from the items already in that set.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is one of the most widely used density-based clustering algorithms. The algorithm is based on the idea that items forming a high-density spatial region can be considered a cluster. How can this approach be applied to our malware analysis?

As shown in our previous blog post, we have computed the distance matrix of a set of items, so we know the distance of each element from all the others.

The algorithm takes 3 parameters as input:

  • min_pts
  • eps – which, together with min_pts, defines our idea of density
  • distance_matrix

Consider the following picture:

[Figure: cluster]

 

  • density: the number of points inside the area described by a radius named eps.
  • core points: points with a density of at least min_pts. Core points are always assigned to a cluster.
  • border points: points with a density lower than min_pts that are still interesting because a core point lies at a distance less than or equal to eps.
  • noise points: points that are neither core points nor border points.
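The three kinds of points can be told apart directly from the distance matrix. The snippet below is only a minimal sketch: the function name `classify_points` and the toy matrix are made up for illustration, and it assumes that a point counts itself in its own neighbourhood and that "dense enough" means at least min_pts points, which is the usual DBSCAN convention.

```python
def classify_points(distance_matrix, eps, min_pts):
    """Label each point as 'core', 'border' or 'noise' (illustrative helper)."""
    n = len(distance_matrix)
    # Neighbourhood of i: every point within eps of i.
    # Point i is included because its distance from itself is 0.
    neighbours = [
        [j for j in range(n) if distance_matrix[i][j] <= eps]
        for i in range(n)
    ]
    # Core points have at least min_pts points in their neighbourhood.
    core = {i for i in range(n) if len(neighbours[i]) >= min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbours[i]):
            # Not dense enough itself, but within eps of a core point.
            labels.append("border")
        else:
            labels.append("noise")
    return labels


# Toy symmetric distance matrix for five hypothetical samples.
toy_matrix = [
    [0.0, 0.1, 0.2, 0.4, 0.9],
    [0.1, 0.0, 0.1, 0.6, 0.9],
    [0.2, 0.1, 0.0, 0.7, 0.9],
    [0.4, 0.6, 0.7, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.9, 0.0],
]
print(classify_points(toy_matrix, eps=0.4, min_pts=3))
# -> ['core', 'core', 'core', 'border', 'noise']
```

In this toy case the fourth sample is a border point because it is within eps of the first sample only, while the last sample has no neighbour at all and is left as noise.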

 

Based on the definition of a core point, every cluster contains at least min_pts points. The higher the eps value, the less restrictive the clustering will be: samples at a medium distance will be grouped into the same cluster, and noise points will become border points (or perhaps new core points). The min_pts and eps values can be tuned to define our idea of a malware family.

As a side note, remember that a noise point isn’t less relevant than a cluster: it really depends on whether we want to focus on finding new variants or on well-known malware families.

A more detailed description of the algorithm can be found here: DBSCAN
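To make the procedure more concrete, here is a compact sketch of DBSCAN working on a precomputed distance matrix. It is not the code used by Deepviz, just a textbook-style version written for this post: the function name `dbscan`, the `NOISE` marker and the toy matrix are assumptions, and a point is assumed to count itself when its density is evaluated.

```python
from collections import deque

NOISE = -1  # label given to points that end up in no cluster


def dbscan(distance_matrix, eps, min_pts):
    """Return one cluster label per point (NOISE for noise points)."""
    n = len(distance_matrix)
    neighbours = [
        [j for j in range(n) if distance_matrix[i][j] <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster_id = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        if len(neighbours[start]) < min_pts:
            # Not a core point: noise for now, may become a border point later.
            labels[start] = NOISE
            continue
        # start is a core point: grow a new cluster from it.
        labels[start] = cluster_id
        queue = deque(neighbours[start])
        while queue:
            point = queue.popleft()
            if labels[point] == NOISE:
                # Previously marked as noise, now reached from a core point: border.
                labels[point] = cluster_id
                continue
            if labels[point] is not None:
                continue  # already assigned to a cluster
            labels[point] = cluster_id
            if len(neighbours[point]) >= min_pts:
                # Only core points keep expanding the cluster.
                queue.extend(neighbours[point])
        cluster_id += 1
    return labels


# Toy symmetric distance matrix for five hypothetical samples.
toy_matrix = [
    [0.0, 0.1, 0.2, 0.4, 0.9],
    [0.1, 0.0, 0.1, 0.6, 0.9],
    [0.2, 0.1, 0.0, 0.7, 0.9],
    [0.4, 0.6, 0.7, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.9, 0.0],
]
print(dbscan(toy_matrix, eps=0.4, min_pts=3))
# -> [0, 0, 0, 0, -1]
```

The first four samples end up in cluster 0 (three core points plus one border point), while the last one is reported as noise.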

Let’s now proceed with a more practical example.

In our previous blog post we based our malware distance matrix on the “contacted URLs” attribute. Now we want to perform some clustering based on the IPs contacted by the malware.

From the network page of our threat intel portal we have retrieved an interesting IP: 1.234.83.146

 

[Figure: ip_1]

 

Using our Threat Intelligence APIs, we have found that the IP 1.234.83.146 was contacted by 353 samples at the time of writing.

[Figure: search_intel]

 

As a first step, we compute the distance matrix using the Jaccard Distance, based on each sample’s list of contacted IPs.
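As a rough illustration of this step, the following sketch builds a Jaccard distance matrix from lists of contacted IPs. The sample names and IP addresses are invented (they come from the ranges reserved for documentation), the helper names are hypothetical, and treating two empty lists as identical is an assumption made here only to avoid a division by zero.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two collections of IPs."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty lists are treated as identical
    return 1.0 - len(a & b) / len(union)


def build_distance_matrix(ip_lists):
    """Pairwise Jaccard distances, as a list of rows."""
    return [[jaccard_distance(x, y) for y in ip_lists] for x in ip_lists]


# Hypothetical samples with made-up contacted IPs.
contacted_ips = {
    "sample_a": ["192.0.2.1", "192.0.2.2"],
    "sample_b": ["192.0.2.1", "192.0.2.2"],
    "sample_c": ["192.0.2.1", "198.51.100.7"],
    "sample_d": ["203.0.113.9"],
}
matrix = build_distance_matrix(list(contacted_ips.values()))
for name, row in zip(contacted_ips, matrix):
    print(name, [round(d, 2) for d in row])
```

Here sample_a and sample_b are at distance 0 (same IPs), sample_a and sample_c share one IP out of three and are at distance 2/3, and sample_d is at distance 1 from all the others.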

Since the Jaccard Distance ranges from 0 (two samples have the same list of contacted IPs) to 1 (two samples have no IPs in common), we can apply the DBSCAN algorithm with:

  • eps: 0.5
  • min_pts: 1 (each noise point will be considered a cluster of its own)

[Figure: cluster1]

DBSCAN found two different clusters:

Cluster 1
290f3104a53cc5776d3ad8b562291680
0124995e09a3f5be548c4e5cadc116a1
96695193ac9870f973b9267cfe6c7009
5b875a5570014cad5e657cba72b451e9
5a72a1a53720ef4501e87e7d1c82a9ba
bd132a4410580bfc065d9260c956a79f
4fa660009cba0b3401f71439b885e067
ae7038d91f0b0af1e9422f3d7fa9a013
aa86216fc7585878e10b714cf2149933
7a764dba191d2bf20206bf4499eb2917
[...]

Cluster 2
6368cc6d88c559bb27da31ef251a52a1
5b41df0eccd56c7a0a6c441b99c77d1f
23a79b803e870f4c21a8d697a788e1a1
1052b6252a07e91a3ff300028100b338
8f68c9a4a1769f57651a6a26b0ea2cf9

cluster1.docx (Full list)

 

Please note that DBSCAN is not applied to the points in the picture. The picture itself is just a spatial representation of the distance matrix.

Let’s try to use different input parameters:

  • eps: 0.2
  • min_pts: 1

[Figure: cluster2]

As expected, DBSCAN has identified more clusters.

 

Cluster 1
290f3104a53cc5776d3ad8b562291680
0124995e09a3f5be548c4e5cadc116a1
96695193ac9870f973b9267cfe6c7009
5b875a5570014cad5e657cba72b451e9
5a72a1a53720ef4501e87e7d1c82a9ba
bd132a4410580bfc065d9260c956a79f
ae7038d91f0b0af1e9422f3d7fa9a013
aa86216fc7585878e10b714cf2149933
7a764dba191d2bf20206bf4499eb2917
aa34f88764f54592afe967f1e983d8e3
[...]

Cluster 2
4fa660009cba0b3401f71439b885e067
64409a372ff880026ff44d27b5441f80

Cluster 3
6368cc6d88c559bb27da31ef251a52a1
5b41df0eccd56c7a0a6c441b99c77d1f
23a79b803e870f4c21a8d697a788e1a1
1052b6252a07e91a3ff300028100b338

Cluster 4
27bd99bf75491447fb3383d1f54f4e40
a8fb2f72e3afe57964200154692e5b9b
630e12b9a1731fcc9d9f793086090b60
053dacd3cfb45d9d7f69601a4fe06000

Cluster 5
c7b19f8250b70ae5bd46590749bf9660
b23826aefbbd36166c976df201fbdd2f

Cluster 6
8f68c9a4a1769f57651a6a26b0ea2cf9

cluster2.docx (Full list)
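The effect of lowering eps can be reproduced on toy data. The sketch below relies on an observation rather than on our production code: with min_pts set to 1, and assuming a point counts itself, every point is a core point, so DBSCAN boils down to finding the groups of samples connected through distances not greater than eps. The function name and the toy matrix are made up for illustration.

```python
def clusters_with_min_pts_one(distance_matrix, eps):
    """Cluster labels for min_pts = 1: connected components of the eps-graph."""
    n = len(distance_matrix)
    labels = [None] * n  # None means "not assigned yet"
    cluster_id = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        labels[start] = cluster_id
        stack = [start]
        while stack:
            point = stack.pop()
            for other in range(n):
                # Pull in every unassigned sample within eps of the current one.
                if labels[other] is None and distance_matrix[point][other] <= eps:
                    labels[other] = cluster_id
                    stack.append(other)
        cluster_id += 1
    return labels


# Toy symmetric distance matrix for five hypothetical samples.
toy_matrix = [
    [0.0, 0.1, 0.4, 1.0, 1.0],
    [0.1, 0.0, 0.3, 1.0, 1.0],
    [0.4, 0.3, 0.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 0.0, 0.5],
    [1.0, 1.0, 1.0, 0.5, 0.0],
]
for eps in (0.5, 0.2):
    labels = clusters_with_min_pts_one(toy_matrix, eps)
    print(eps, labels, "->", len(set(labels)), "clusters")
# 0.5 [0, 0, 0, 1, 1] -> 2 clusters
# 0.2 [0, 0, 1, 2, 3] -> 4 clusters
```

As in the real example, a smaller eps splits the looser groups apart, and samples with no close neighbour become single-element clusters.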

 

This is just a basic explanation, but it shows how you can leverage clustering algorithms to identify new, unidentified samples and/or well-known malware families once you have extracted the right data, i.e. the data that is relevant for malware isolation and identification.

Below are some links to the isolated samples as well as to our threat intelligence portal:

Deepviz Threat Intel

290f3104a53cc5776d3ad8b562291680

4fa660009cba0b3401f71439b885e067

6368cc6d88c559bb27da31ef251a52a1

27bd99bf75491447fb3383d1f54f4e40

c7b19f8250b70ae5bd46590749bf9660

8f68c9a4a1769f57651a6a26b0ea2cf9