In our **previous blog post** we introduced the basics of clustering and the first steps to follow when you want to start clustering data (malware in this case). In this blog post we want to cover the next steps, i.e. what needs to be done once you have selected the right attributes and the best measure to validate and compare attributes. In this follow-up **we’ll discuss the basics of clustering algorithms**.

The next logical step, once you have selected the attributes and calculated the distance between the elements, **is grouping all elements into specific sets which share common characteristics**, such as *contacted IPs*, *URLs*, *imported APIs* and whatever else the researcher selected as attributes.

Clustering algorithms group items based upon their mutual distance. To make it easier: **the choice of including an item in a specific set is based on its distance from the set itself**.

**DBSCAN** (*Density-Based Spatial Clustering of Applications with Noise*) is one of the most widely used density-based clustering algorithms. The algorithm is based on the idea that items which form high-density spatial regions can be considered a cluster. How can this approach be applied to our malware analysis?

As shown in our previous blog post, we have computed the distance matrix of a set of items, so we know the distance of each element from all the others.

The algorithm takes as input 3 parameters:

- *min_pts*
- *eps* (together with *min_pts*, this defines our idea of density)
- *distance_matrix*

Consider the following picture:

- **density:** the number of points inside the area described by a radius named *eps*.
- **core-points:** points with a density of at least *min_pts*. Core points are always assigned to a cluster.
- **border-points:** points with a density lower than *min_pts* that are still of interest because a core-point lies at a distance less than or equal to *eps*.
- **noise-points:** points that are neither core-points nor border-points.
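The definitions above can be sketched in a few lines of Python. This is only an illustration, not the code used for this post: the function name `classify_points` and the toy matrix are hypothetical, and it assumes a point counts itself in its own density and is a core point when at least *min_pts* points lie within *eps*.

```python
def classify_points(distance_matrix, eps, min_pts):
    """Label each point as 'core', 'border' or 'noise'."""
    n = len(distance_matrix)
    # Neighbours of i: every point (i included) at distance <= eps.
    neighbors = [
        [j for j in range(n) if distance_matrix[i][j] <= eps]
        for i in range(n)
    ]
    # Assumption: density counts the point itself, core means >= min_pts.
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors[i]):
            labels.append("border")  # not dense, but close to a core point
        else:
            labels.append("noise")
    return labels


# Hypothetical symmetric distance matrix for five samples.
toy_matrix = [
    [0.0, 0.1, 0.1, 0.6, 0.9],
    [0.1, 0.0, 0.1, 0.6, 0.9],
    [0.1, 0.1, 0.0, 0.25, 0.9],
    [0.6, 0.6, 0.25, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.9, 0.0],
]
point_types = classify_points(toy_matrix, eps=0.3, min_pts=3)
print(point_types)
```

In this toy matrix the first three samples are dense enough to be core points, the fourth is only close to one of them, and the fifth is isolated.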

Based on the definition of a core-point, any cluster has at least *min_pts* points in it. The higher the *eps* value, the less restrictive the clustering will be: samples at a medium distance will be grouped in the same cluster, and noise points will become border points (or perhaps new core points). The *min_pts* and *eps* values can be changed to match our idea of a malware family.

As a side note, remember that a noise point isn’t less relevant than a cluster. It depends on whether we want to focus on finding new variants or on well-known malware families.

A more detailed description of the algorithm can be found here: **DBSCAN**
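For readers who prefer code, here is a minimal sketch of DBSCAN working directly on a precomputed distance matrix. It is a simplified illustration rather than the implementation behind this post: the names `dbscan`, `NOISE` and the toy matrix are hypothetical, and a point is assumed to be a core point when at least *min_pts* points (itself included) lie within *eps*.

```python
from collections import deque

NOISE = -1  # label used for points that end up in no cluster


def dbscan(distance_matrix, eps, min_pts):
    """Return one cluster label per point; NOISE marks noise points."""
    n = len(distance_matrix)
    neighbors = [
        [j for j in range(n) if distance_matrix[i][j] <= eps]
        for i in range(n)
    ]
    is_core = [len(neighbors[i]) >= min_pts for i in range(n)]
    labels = [NOISE] * n
    cluster_id = 0
    for start in range(n):
        # Only an unassigned core point can seed a new cluster.
        if labels[start] != NOISE or not is_core[start]:
            continue
        labels[start] = cluster_id
        queue = deque([start])
        while queue:
            point = queue.popleft()
            for other in neighbors[point]:
                if labels[other] != NOISE:
                    continue
                labels[other] = cluster_id
                # Border points join the cluster but do not expand it.
                if is_core[other]:
                    queue.append(other)
        cluster_id += 1
    return labels


# Hypothetical symmetric distance matrix for five samples.
toy_matrix = [
    [0.0, 0.1, 0.1, 0.6, 0.9],
    [0.1, 0.0, 0.1, 0.6, 0.9],
    [0.1, 0.1, 0.0, 0.25, 0.9],
    [0.6, 0.6, 0.25, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.9, 0.0],
]
strict_labels = dbscan(toy_matrix, eps=0.3, min_pts=3)
loose_labels = dbscan(toy_matrix, eps=0.3, min_pts=1)
print(strict_labels, loose_labels)
```

With *min_pts* set to 3 the isolated fifth sample is left as noise, while with *min_pts* set to 1 it becomes a cluster of its own. In practice a library implementation is usually preferable; scikit-learn's `DBSCAN`, for instance, accepts a distance matrix when created with `metric="precomputed"`.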

Let’s now proceed with a more practical example.

In our previous blog post we based our malware distance matrix on the “*contacted URLs*” attribute. Now we want to cluster malware based on the IPs each sample contacted.

From our threat intel’s network webpage we have retrieved an interesting IP: **1.234.83.146**

Using our **Threat Intelligence APIs**, we have found that the IP **1.234.83.146** is contacted by **353 samples** at the time of this blog post.

As a first step, we compute the distance matrix using the **Jaccard Distance** based on the list of contacted IPs of each sample.
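As a rough illustration of this step, the sketch below builds a Jaccard distance matrix from per-sample IP lists. The sample names, the helper names and every IP other than 1.234.83.146 are made up for the example (the extra addresses come from documentation ranges), and treating two empty lists as identical is an assumption of this sketch.

```python
def jaccard_distance(ips_a, ips_b):
    """1 - |A ∩ B| / |A ∪ B| for two collections of contacted IPs."""
    a, b = set(ips_a), set(ips_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty lists count as identical
    return 1.0 - len(a & b) / len(union)


def build_distance_matrix(contacted_ips):
    """Return (sample names, square matrix of pairwise Jaccard distances)."""
    names = sorted(contacted_ips)
    matrix = [
        [jaccard_distance(contacted_ips[x], contacted_ips[y]) for y in names]
        for x in names
    ]
    return names, matrix


# Hypothetical samples; only 1.234.83.146 comes from the post itself.
contacted_ips = {
    "sample_a": ["1.234.83.146", "192.0.2.10"],
    "sample_b": ["1.234.83.146", "192.0.2.10"],
    "sample_c": ["1.234.83.146", "198.51.100.7"],
    "sample_d": ["203.0.113.5"],
}
names, matrix = build_distance_matrix(contacted_ips)
for name, row in zip(names, matrix):
    print(name, [round(value, 2) for value in row])
```

Identical IP lists give a distance of 0, disjoint lists give 1, and partially overlapping lists fall in between (here `sample_a` and `sample_c` share one IP out of three, so their distance is 2/3).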

As the Jaccard Distance varies from 0 (two samples have the same list of contacted IPs) to 1 (two samples have no IPs in common), we can apply the DBSCAN algorithm with:

- *eps*: 0.5
- *min_pts*: 1 (*noise-points* will be considered a cluster)

DBSCAN found two different clusters:

**Cluster 1**

- 290f3104a53cc5776d3ad8b562291680
- 0124995e09a3f5be548c4e5cadc116a1
- 96695193ac9870f973b9267cfe6c7009
- 5b875a5570014cad5e657cba72b451e9
- 5a72a1a53720ef4501e87e7d1c82a9ba
- bd132a4410580bfc065d9260c956a79f
- 4fa660009cba0b3401f71439b885e067
- ae7038d91f0b0af1e9422f3d7fa9a013
- aa86216fc7585878e10b714cf2149933
- 7a764dba191d2bf20206bf4499eb2917
- [...]

**Cluster 2**

- 6368cc6d88c559bb27da31ef251a52a1
- 5b41df0eccd56c7a0a6c441b99c77d1f
- 23a79b803e870f4c21a8d697a788e1a1
- 1052b6252a07e91a3ff300028100b338
- 8f68c9a4a1769f57651a6a26b0ea2cf9

Please note that DBSCAN is **not** applied to the points in the picture. The picture itself is just a spatial representation of the distance matrix.

Let’s try to use different input parameters:

- *eps*: 0.2
- *min_pts*: 1

As expected, DBSCAN has identified more clusters.

**Cluster 1**

- 290f3104a53cc5776d3ad8b562291680
- 0124995e09a3f5be548c4e5cadc116a1
- 96695193ac9870f973b9267cfe6c7009
- 5b875a5570014cad5e657cba72b451e9
- 5a72a1a53720ef4501e87e7d1c82a9ba
- bd132a4410580bfc065d9260c956a79f
- ae7038d91f0b0af1e9422f3d7fa9a013
- aa86216fc7585878e10b714cf2149933
- 7a764dba191d2bf20206bf4499eb2917
- aa34f88764f54592afe967f1e983d8e3
- [...]

**Cluster 2**

- 4fa660009cba0b3401f71439b885e067
- 64409a372ff880026ff44d27b5441f80

**Cluster 3**

- 6368cc6d88c559bb27da31ef251a52a1
- 5b41df0eccd56c7a0a6c441b99c77d1f
- 23a79b803e870f4c21a8d697a788e1a1
- 1052b6252a07e91a3ff300028100b338

**Cluster 4**

- 27bd99bf75491447fb3383d1f54f4e40
- a8fb2f72e3afe57964200154692e5b9b
- 630e12b9a1731fcc9d9f793086090b60
- 053dacd3cfb45d9d7f69601a4fe06000

**Cluster 5**

- c7b19f8250b70ae5bd46590749bf9660
- b23826aefbbd36166c976df201fbdd2f

**Cluster 6**

- 8f68c9a4a1769f57651a6a26b0ea2cf9
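Why does a smaller *eps* produce more clusters? With *min_pts* set to 1 every sample is a core point, so DBSCAN simply groups samples that are linked by a chain of distances no larger than *eps*. The sketch below illustrates this on a made-up distance matrix; the function name `cluster_min_pts_one` and the values are hypothetical and are not taken from the 353 samples analysed here.

```python
def cluster_min_pts_one(distance_matrix, eps):
    """DBSCAN with min_pts = 1: connected groups of the 'distance <= eps' graph."""
    n = len(distance_matrix)
    labels = [None] * n
    cluster_id = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        # Every point is a core point, so each unassigned one seeds a cluster.
        labels[start] = cluster_id
        stack = [start]
        while stack:
            point = stack.pop()
            for other in range(n):
                if labels[other] is None and distance_matrix[point][other] <= eps:
                    labels[other] = cluster_id
                    stack.append(other)
        cluster_id += 1
    return labels


# Hypothetical symmetric distance matrix for five samples.
toy_matrix = [
    [0.0, 0.1, 0.4, 0.9, 0.9],
    [0.1, 0.0, 0.4, 0.9, 0.9],
    [0.4, 0.4, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.0, 0.15],
    [0.9, 0.9, 0.9, 0.15, 0.0],
]
labels_eps_05 = cluster_min_pts_one(toy_matrix, eps=0.5)
labels_eps_02 = cluster_min_pts_one(toy_matrix, eps=0.2)
print(len(set(labels_eps_05)), "clusters with eps=0.5")
print(len(set(labels_eps_02)), "clusters with eps=0.2")
```

In this toy case the third sample sits at distance 0.4 from the first two, so it joins their cluster when *eps* is 0.5 and is split off into its own cluster when *eps* drops to 0.2.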

This is only a basic explanation, but it shows how you can leverage clustering algorithms to identify new, unidentified samples and/or well-known malware families **once you have extracted the right data, i.e. the data that is relevant for malware isolation and identification**.

Below are some links related to the isolated samples and to our threat intelligence portal:

**290f3104a53cc5776d3ad8b562291680**

**4fa660009cba0b3401f71439b885e067**

**6368cc6d88c559bb27da31ef251a52a1**

**27bd99bf75491447fb3383d1f54f4e40**