The initialization of the cluster centroids is important to ensure fast convergence of K-means. Solutions range from the simple random generation of centroids to the application of genetic algorithms to evaluate the fitness of centroid candidates. We selected an efficient and fast initialization algorithm developed by M. Agha and W. Ashour [4:4].
The steps of the initialization are as follows:
Compute the standard deviation of the set of observations.
Select the dimension k {xk,0, xk,1 … xk,n} with maximum standard deviation.
Rank the observations by their increasing value of standard deviation for the dimension k.
Divide the ranked observations set equally into K sets {Sm}.
Find the median values, size (Sm)/2.
Use the corresponding observations as centroids.
The initialization algorithm is implemented by the private initialize method:
The second step in the K-means algorithm is the assignment of the observations to the clusters for which the centroids have been initialized in step 1. This feat is accomplished by the private assignToClusters method:
def assignToClusters(xt: XTSeries[Array[T]], clusters: List[Cluster[T]], membership: Array[Int]): Int = {
xt.toArray
.zipWithIndex
.filter(x => { //1
val nearestCluster = getNearestCluster(clusters, x._1)//2
val reassigned = nearestCluster != membership(x._2)
clusters(nearestCluster) += x._2 //3
membership(x._2) = nearestCluster //4
reassigned
}).size
}
The core of the assignment of observations to each cluster is the filter on the time series (line 1). The filter computes the index of the closest cluster and checks whether the observation is to be reassigned (line 2). The observation at the index x._2 is added to the nearest cluster, clusters(nearestCluster) (line 3). The current membership of the observations is then updated (line 4).
The cluster closest to an observation data is computed by the getNearestCluster method as follows:
The final step is to implement the iterative computation of the reconstruction error. In this implementation, the iteration terminates when no more observations are reassigned to different clusters. As with other data processing units, the extraction of K-means clusters is encapsulated by the pipe operator |>, so that clustering can be integrated into a workflow using dependency injection described in the Dependency Injection section in Chapter 2, Hello World!.
The generation of the K clusters is executed by the data transformation |>: