K-means Clustering and its real use case in the Security domain

Priya Singh
6 min read · Jul 16, 2021

What Is K-means Clustering?

K-Means clustering is an unsupervised learning algorithm, so there is no labeled data, unlike in supervised learning. K-Means divides objects into clusters such that objects in the same cluster are similar to each other and dissimilar to objects in other clusters. It is an iterative algorithm that tries to minimize the distance between each data point and the centroid (the average point) of its cluster.

The term ‘K’ is a number: you need to tell the system how many clusters to create. For example, K = 2 refers to two clusters. There are ways of finding the optimal value of K for a given dataset, such as the elbow method sketched under Step 1 below.

For a better understanding of k-means, let’s take an example from cricket. Imagine you received data on a lot of cricket players from all over the world, giving the runs scored and the wickets taken by each player in the last ten matches. Based on this information, we need to group the data into two clusters, namely batsmen and bowlers.
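To make this concrete, here is a minimal sketch of that cricket example using scikit-learn's KMeans. The runs/wickets numbers are invented purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row: [runs scored, wickets taken] in the last ten matches (made-up numbers)
players = np.array([
    [410, 1], [385, 0], [450, 2], [395, 1],    # batsman-like profiles
    [60, 18], [45, 22], [70, 15], [55, 20],    # bowler-like profiles
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(players)
print(kmeans.labels_)           # which cluster each player was assigned to
print(kmeans.cluster_centers_)  # average runs/wickets of each cluster
```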

What is unsupervised learning?

Unsupervised learning is where you train a machine learning algorithm without giving it the answer to the problem. In unsupervised learning, the machine uses unlabeled data and learns on its own, without any supervision. The machine tries to find a pattern in the unlabeled data and produces a response.

Steps for K-Means Clustering

K-means is a centroid-based algorithm, or a distance-based algorithm, where we calculate the distances to assign a point to a cluster. In K-Means, each cluster is associated with a centroid.

“The main objective of the K-Means algorithm is to minimize the sum of distances between the points and their respective cluster centroid.”
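That objective is the within-cluster sum of squared distances (scikit-learn calls it inertia). A small NumPy sketch of the quantity being minimized, assuming `points`, `centroids`, and `labels` arrays are already available:

```python
import numpy as np

def kmeans_objective(points, centroids, labels):
    """Within-cluster sum of squared distances to each point's assigned centroid."""
    diffs = points - centroids[labels]    # vector from each point to its own centroid
    return float(np.sum(diffs ** 2))      # sum of squared Euclidean distances
```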

Let’s now take an example to understand how K-Means actually works:

We have these 8 points and we want to apply k-means to create clusters for these points. Here’s how we can do it.

Step 1: Choose the number of clusters k

The first step in k-means is to pick the number of clusters, k.
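One common way to pick k is the elbow method: run k-means for several candidate values of k and look for the point where the objective stops dropping sharply. A sketch on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # stand-in dataset

inertias = []
for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)   # objective value for this k

# Plot k against inertia and look for the "elbow" where the curve flattens out.
print(inertias)
```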

Step 2: Select k random points from the data as centroids

Next, we randomly select the centroid for each cluster. Let’s say we want to have 2 clusters, so k is equal to 2 here. We then randomly select the centroid:

Here, the red and green circles represent the centroid for these clusters.
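A sketch of this initialization step in NumPy: pick k distinct rows of the data at random to serve as the starting centroids.

```python
import numpy as np

def init_centroids(points, k, seed=0):
    """Pick k distinct data points at random as the initial centroids."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=k, replace=False)
    return points[idx]
```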

Step 3: Assign all the points to the closest cluster centroid

Once we have initialized the centroids, we assign each point to the closest cluster centroid:

Here you can see that the points which are closer to the red point are assigned to the red cluster whereas the points which are closer to the green point are assigned to the green cluster.
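The assignment step can be written as a distance computation followed by an argmin; a sketch assuming `points` (n × d) and `centroids` (k × d) are NumPy arrays:

```python
import numpy as np

def assign_clusters(points, centroids):
    """Return the index of the nearest centroid for every point."""
    # Distance from every point to every centroid: shape (n_points, k)
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(distances, axis=1)
```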

Step 4: Recompute the centroids of newly formed clusters

Now, once we have assigned all of the points to either cluster, the next step is to compute the centroids of newly formed clusters:

Here, the red and green crosses are the new centroids.
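Recomputing a centroid is just averaging the points currently assigned to it; a sketch (ignoring the rare case of an empty cluster):

```python
import numpy as np

def update_centroids(points, labels, k):
    """New centroid = mean of the points currently assigned to that cluster."""
    return np.array([points[labels == j].mean(axis=0) for j in range(k)])
```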

Step 5: Repeat steps 3 and 4

We then repeat steps 3 and 4:

Computing the centroids and assigning all the points to clusters based on their distance from the centroid makes up a single iteration; we repeat these two steps until the centroids stop changing.
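Putting the pieces together, a minimal from-scratch loop might look like the sketch below, stopping once the centroids no longer move (empty clusters are ignored for brevity):

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]  # step 2

    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)

        # Step 4: recompute each centroid as the mean of its cluster
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])

        # Step 5: stop once the centroids stop moving (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, labels
```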

How does k-means clustering work?

The k-means clustering algorithm attempts to split a given anonymous data set (a set containing no information as to class identity) into a fixed number (k) of clusters.

Initially, k so-called centroids are chosen. A centroid is a data point (imaginary or real) at the center of a cluster. Typically, each centroid starts as an existing data point in the given input data set, picked at random, such that all centroids are unique (that is, for all centroids ci and cj, ci ≠ cj). Each data point is then assigned to its nearest centroid (equivalent to a 1-nearest-neighbour classification against the centroids), which produces an initial, essentially random, set of clusters. Each centroid is thereafter set to the arithmetic mean of the cluster it defines. The process of assignment and centroid adjustment is repeated until the values of the centroids stabilize. The final centroids are used to produce the final clustering of the input data, effectively turning the set of initially anonymous data points into a set of data points, each with a class identity.

  • Iteration 2 shows the new location of the centroid centers.
  • Iteration 3 has a handful more blue points as the centroids move.
  • Jumping to iteration 6, we see the red centroid has moved further to the right.
  • Iteration 9 shows the green section is much smaller than in iteration 2, blue has taken over the top, and the red centroid is thinner than in iteration 6.
  • The 9th iteration’s results were the same as the 8th iteration’s, so it has “converged”.
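In practice one would usually rely on a library implementation, which reports how many iterations were needed before the centroids stabilized. A short scikit-learn sketch on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # synthetic 2-D data

model = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=42).fit(X)
print(model.n_iter_)    # iterations until the centroids stopped moving
print(model.inertia_)   # final within-cluster sum of squared distances
```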

Use cases in the security domain

1. Identifying crime localities

k-means can easily be applied to identify crime-prone localities. With crime data available for specific localities in a city, the category of crime, the area of the crime, and the association between the two can give quality insight into crime-prone areas within a city or a locality.
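A minimal sketch of the idea: cluster incident coordinates so that each cluster roughly corresponds to a crime-prone locality. The coordinates and the choice of k below are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (latitude, longitude) of reported incidents
incidents = np.array([
    [28.61, 77.20], [28.62, 77.21], [28.60, 77.19],   # locality A
    [28.53, 77.39], [28.54, 77.40], [28.52, 77.38],   # locality B
])

hotspots = KMeans(n_clusters=2, n_init=10, random_state=0).fit(incidents)
print(hotspots.cluster_centers_)   # approximate centres of the crime-prone localities
```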

2. Cyber-profiling criminals

The idea of cyber profiling is derived from criminal profiling, which provides the investigation division with information to classify the types of criminals present at a crime scene. A profile is information about an individual or a group of individuals that is accumulated, stored, and used for various purposes, for example by monitoring their behavior through their internet activity. A difficulty in implementing cyber profiling is the diversity of user data, and the fact that online behavior sometimes differs from actual behavior. Here, clustering techniques are used to group web-based content according to user preferences. These preferences act as an initial grouping of the data, so the resulting clusters reveal user profiles.

3. Insurance fraud detection

Machine learning has a critical role to play in fraud detection, with numerous applications in automobile, healthcare, and insurance fraud detection. Using historical data on fraudulent claims, it is possible to isolate new claims based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can have a multi-million-dollar impact on a company, the ability to detect fraud is crucial.
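One plausible way to use "proximity to clusters" here: cluster historical claims, note which clusters are dominated by known fraudulent claims, and then check which cluster a new claim falls into. The feature names and numbers below are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical claim features: [claim amount, days since policy start]
past_claims = np.array([[1200, 400], [900, 350], [1100, 380],    # ordinary claims
                        [9500, 12], [8800, 15], [9900, 10]])     # fraud-like pattern

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(past_claims)

new_claim = np.array([[9200, 14]])
cluster = kmeans.predict(new_claim)[0]
print(f"New claim falls into cluster {cluster}")  # compare against the fraud-heavy clusters
```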

4. Automatic clustering of IT alerts

Large enterprise infrastructure components such as networks, storage, and databases generate large volumes of alert messages. Because alert messages potentially point to operational issues, they must be manually screened and prioritized for downstream processes. Clustering the alert data can provide insight into categories of alerts and mean time to repair, and can help with failure prediction.
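One plausible way to cluster raw alert text is to turn each message into a TF-IDF vector and run k-means on those vectors; the alert strings below are invented.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

alerts = [
    "disk usage above 90% on db-server",
    "disk usage above 95% on db-server",
    "network link down on switch-3",
    "network link flapping on switch-3",
]

X = TfidfVectorizer().fit_transform(alerts)                             # text -> numeric vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # alerts about the same underlying issue land in the same cluster
```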

5. Anomaly detection

Anomaly detection refers to methods that provide early warnings of unusual behavior that may compromise the security or performance of a communication network. Anomalous behavior can be identified by comparing the distance between real data points and the cluster centroids; spotting such anomalies early is essential for the communication networks of enterprises and institutions.
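A sketch of that distance-based rule: fit k-means on data representing normal behaviour, then flag new points whose distance to the nearest centroid exceeds a threshold. The data and threshold here are purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in for "normal" network measurements (e.g. packets/s, bytes/s)
normal, _ = make_blobs(n_samples=200, centers=3, random_state=1)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(normal)

new_points = np.array([[0.0, 0.0], [50.0, 50.0]])           # the second point is far from every cluster
dist_to_nearest = kmeans.transform(new_points).min(axis=1)  # distance to the closest centroid
threshold = 5.0                                             # illustrative cut-off
print(dist_to_nearest > threshold)                          # True -> flag as anomalous
```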

Thank you for reading this article 🥰!

-Priya Singh
