Title: | Learn Clustering Techniques Through Examples and Code |
---|---|
Description: | A comprehensive educational package combining clustering algorithms with detailed step-by-step explanations. Provides implementations of both traditional (hierarchical, k-means) and modern (Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Gaussian Mixture Models (GMM), genetic k-means) clustering methods as described in Ezugwu et. al., (2022) <doi:10.1016/j.engappai.2022.104743>. Includes educational datasets highlighting different clustering challenges, based on 'scikit-learn' examples (Pedregosa et al., 2011) <https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html>. Features detailed algorithm explanations, visualizations, and weighted distance calculations for enhanced learning. |
Authors: | Eduardo Ruiz Sabajanes [aut],
Roberto Alcantara [aut],
Juan Jose Cuadrado Gallego [aut]
|
Maintainer: | Andriy Protsak Protsak <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.1 |
Built: | 2025-02-17 16:23:52 UTC |
Source: | https://github.com/andriyprotsak5/UAHDataScienceUC |
Perform a hierarchical agglomerative cluster analysis on a set of observations
agglomerative_clustering( data, proximity = "single", distance_method = "euclidean", learn = FALSE, waiting = TRUE, ... )
agglomerative_clustering( data, proximity = "single", distance_method = "euclidean", learn = FALSE, waiting = TRUE, ... )
data |
a set of observations, presented as a matrix-like object where every row is a new observation. |
proximity |
the proximity definition to be used. This should be one
of |
distance_method |
the distance measure to use. Supported values are:
|
learn |
a Boolean determining whether intermediate logs explaining how the algorithm works should be printed or not. |
waiting |
a Boolean determining whether the intermediate logs should be printed in chunks waiting for user input before printing the next or not. |
... |
additional arguments passed to |
This function performs a hierarchical cluster analysis for the
objects being clustered. The definition of a set of clusters using
this method follows a
step process, which repeats until a single
cluster remains:
Initially, each object is assigned to its own cluster. The matrix of distances between clusters is computed.
The two clusters with closest proximity will be joined together and
the proximity matrix updated. This is done according to the specified
proximity
. This step is repeated until a single cluster remains.
The definitions of proximity
considered by this function are:
single
. Defines
the proximity between two clusters as the distance between the closest
objects among the two clusters. It produces clusters where each object is
closest to at least one other object in the same cluster. It is known as
SLINK, single-link and minimum-link.
complete
.
Defines the proximity between two clusters as the distance between the
furthest objects among the two clusters. It is known as CLINK,
complete-link and maximum-link.
average
. Defines the proximity between two
clusters as the average distance between every pair of objects, one from
each cluster. It is also known as UPGMA or average-link.
An stats::hclust()
object which describes the tree produced by the
clustering process.
Eduardo Ruiz Sabajanes, [email protected]
cl <- agglomerative_clustering( db5[1:6, ], 'single', learn = TRUE, waiting = FALSE )
cl <- agglomerative_clustering( db5[1:6, ], 'single', learn = TRUE, waiting = FALSE )
Performs hierarchical correlation clustering by applying weights, distance metrics, and other parameters to analyze relationships between data points and a target.
correlation_clustering( data, target = NULL, weight = c(), distance_method = "euclidean", normalize = TRUE, labels = NULL, learn = FALSE, waiting = FALSE )
correlation_clustering( data, target = NULL, weight = c(), distance_method = "euclidean", normalize = TRUE, labels = NULL, learn = FALSE, waiting = FALSE )
data |
A data frame containing the main data |
target |
A data frame, numeric vector or matrix to use as correlation target. Default is NULL. |
weight |
A numeric vector of weights. Default is empty vector. |
distance_method |
A string specifying the distance metric to use. Options are:
|
normalize |
A boolean parameter indicating whether to normalize weights. Default is TRUE. |
labels |
A string vector for graphical solution labeling. Default is NULL. |
learn |
A boolean indicating whether to show detailed algorithm explanations. Default is FALSE. |
waiting |
A boolean controlling pauses between explanations. Default is TRUE. |
This function executes the complete hierarchical correlation method in the following steps:
The function transforms data into useful objects
Creates the clusters
Calculates the distance from the target to every cluster using the specified distance metric
Orders the distances in ascending order
Orders the clusters according to their distance from the previous step
Shows the sorted clusters and the distances used
An R object containing:
dendrogram - A hierarchical clustering dendrogram
sortedValues - A data frame with the sorted cluster values
distances - A data frame with the sorted distances
Original authors:
Roberto Alcantara [email protected]
Juan Jose Cuadrado [email protected]
Universidad de Alcala de Henares
data <- matrix(c(1,2,1,4,5,1,8,2,9,6,3,5,8,5,4), ncol=3) dataFrame <- data.frame(data) target1 <- c(1,2,3) target2 <- dataFrame[1,] weight1 <- c(1,6,3) weight2 <- c(0.1,0.6,0.3) # Basic usage correlation_clustering(dataFrame, target1) # With weights correlation_clustering(dataFrame, target1, weight1) # Without weight normalization correlation_clustering(dataFrame, target1, weight1, normalize = FALSE) # Using Canberra distance with weights correlation_clustering(dataFrame, target1, weight2, distance = "canberra", normalize = FALSE) # With detailed explanations correlation_clustering(dataFrame, target1, learn = TRUE)
data <- matrix(c(1,2,1,4,5,1,8,2,9,6,3,5,8,5,4), ncol=3) dataFrame <- data.frame(data) target1 <- c(1,2,3) target2 <- dataFrame[1,] weight1 <- c(1,6,3) weight2 <- c(0.1,0.6,0.3) # Basic usage correlation_clustering(dataFrame, target1) # With weights correlation_clustering(dataFrame, target1, weight1) # Without weight normalization correlation_clustering(dataFrame, target1, weight1, normalize = FALSE) # Using Canberra distance with weights correlation_clustering(dataFrame, target1, weight2, distance = "canberra", normalize = FALSE) # With detailed explanations correlation_clustering(dataFrame, target1, learn = TRUE)
Test Database 1
db1
db1
db1
A data frame with 500 rows and 2 columns.
The data points form two concentric circles.
Test Database 2
db2
db2
db2
A data frame with 500 rows and 2 columns.
The data points form two moons.
Test Database 3
db3
db3
db3
A data frame with 500 rows and 2 columns.
The data points form three overlapping elliptical clusters of varying densities.
Test Database 4
db4
db4
db4
A data frame with 500 rows and 2 columns.
The data points form three diagonal parallel segments.
Test Database 5
db5
db5
db5
A data frame with 500 rows and 2 columns.
The data points form three non-overlapping circular clusters of similar density.
Test Database 6
db6
db6
db6
A data frame with 500 rows and 2 columns.
The data points are uniformly distributed on the plane.
Perform DBSCAN clustering on a data matrix.
dbscan(data, epsilon, min_pts = 4, learn = FALSE, waiting = TRUE, ...)
dbscan(data, epsilon, min_pts = 4, learn = FALSE, waiting = TRUE, ...)
data |
a set of observations, presented as a matrix-like object where every row is a new observation. |
epsilon |
how close two observations have to be to be considered neighbors. |
min_pts |
the minimum amount of neighbors for a region to be considered dense. |
learn |
a Boolean determining whether intermediate logs explaining how the algorithm works should be printed or not. |
waiting |
a Boolean determining whether the intermediate logs should be printed in chunks waiting for user input before printing the next or not. |
... |
additional arguments passed to |
The data given by data
is clustered by the DBSCAN method,
which aims to partition the points into clusters such that the points in a
cluster are close to each other and the points in different clusters are far
away from each other. The clusters are defined as dense regions of points
separated by regions of low density.
The DBSCAN method follows a 2 step process:
For each point, the neighborhood of radius epsilon
is computed. If
the neighborhood contains at least min_pts
points, then the point is
considered a core point. Otherwise, the point is considered an
outlier.
For each core point, if the core point is not already assigned to a cluster, a new cluster is created and the core point is assigned to it. Then, the neighborhood of the core point is explored. If a point in the neighborhood is a core point, then the neighborhood of that point is also explored. This process is repeated until all points in the neighborhood have been explored. If a point in the neighborhood is not already assigned to a cluster, then it is assigned to the cluster of the core point.
Whatever points are not assigned to a cluster are considered outliers.
A clustlearn::dbscan()
object. It is a list with the following
components:
cluster |
a vector of integers (from 0 to max(cl$cluster) )
indicating the cluster to which each point belongs. Points in cluster number
0 are considered outliers. |
epsilon |
the value of epsilon used. |
min_pts |
the value of min_pts used. |
size |
a vector with the number of data points belonging to each cluster (where the first element is the number of outliers). |
Eduardo Ruiz Sabajanes, [email protected]
# Basic usage data <- db5[1:20,] result <- dbscan(data, epsilon = 0.3, min_pts = 4) # With learning output result <- dbscan(data, epsilon = 0.3, min_pts = 4, learn = TRUE) # Visualize results plot(data, col = result$cluster + 1, pch = 20)
# Basic usage data <- db5[1:20,] result <- dbscan(data, epsilon = 0.3, min_pts = 4) # With learning output result <- dbscan(data, epsilon = 0.3, min_pts = 4, learn = TRUE) # Visualize results plot(data, col = result$cluster + 1, pch = 20)
Perform a hierarchical Divisive cluster analysis on a set of observations
divisive_clustering(data, learn = FALSE, waiting = TRUE, ...)
divisive_clustering(data, learn = FALSE, waiting = TRUE, ...)
data |
a set of observations, presented as a matrix-like object where every row is a new observation. |
learn |
a Boolean determining whether intermediate logs explaining how the algorithm works should be printed or not. |
waiting |
a Boolean determining whether the intermediate logs should be printed in chunks waiting for user input before printing the next or not. |
... |
additional arguments passed to |
This function performs a hierarchical cluster analysis for the
objects being clustered. The definition of a set of clusters using
this method follows a
step process, which repeats until
clusters remain:
Initially, each object is assigned to the same cluster. The sum of squares of the distances between objects and their cluster center is computed.
The cluster with the highest sum of squares is split into two using
the k-means algorithm. This step is repeated until clusters remain.
An stats::hclust()
object which describes the tree produced by the
clustering process.
Eduardo Ruiz Sabajanes, [email protected]
### !! This algorithm is very slow, so we'll only test it on some datasets !! ### Helper function test <- function(db, k) { # Save old par settings old_par <- par(no.readonly = TRUE) # Ensure par settings are restored when function exits on.exit(par(old_par)) print(cl <- divisive_clustering(db, max_iterations = 5)) par(mfrow = c(1, 2)) plot(db, col = cutree(cl, k), asp = 1, pch = 20) h <- rev(cl$height)[50] clu <- as.hclust(cut(as.dendrogram(cl), h = h)$upper) ctr <- unique(cutree(cl, k)[cl$order]) plot(clu, labels = FALSE, hang = -1, xlab = "Cluster", sub = "", main = "") rect.hclust(clu, k = k, border = ctr) } ### Example 1 # test(db1, 2) ### Example 2 # test(db2, 2) ### Example 3 # test(db3, 3) ### Example 4 # test(db4, 3) ### Example 5 test(db5, 3) ### Example 6 test(db6, 3) ### Example 7 (with explanations, no plots) cl <- divisive_clustering( db5[1:6, ], learn = TRUE, waiting = FALSE )
### !! This algorithm is very slow, so we'll only test it on some datasets !! ### Helper function test <- function(db, k) { # Save old par settings old_par <- par(no.readonly = TRUE) # Ensure par settings are restored when function exits on.exit(par(old_par)) print(cl <- divisive_clustering(db, max_iterations = 5)) par(mfrow = c(1, 2)) plot(db, col = cutree(cl, k), asp = 1, pch = 20) h <- rev(cl$height)[50] clu <- as.hclust(cut(as.dendrogram(cl), h = h)$upper) ctr <- unique(cutree(cl, k)[cl$order]) plot(clu, labels = FALSE, hang = -1, xlab = "Cluster", sub = "", main = "") rect.hclust(clu, k = k, border = ctr) } ### Example 1 # test(db1, 2) ### Example 2 # test(db2, 2) ### Example 3 # test(db3, 3) ### Example 4 # test(db4, 3) ### Example 5 test(db5, 3) ### Example 6 test(db6, 3) ### Example 7 (with explanations, no plots) cl <- divisive_clustering( db5[1:6, ], learn = TRUE, waiting = FALSE )
Perform Gaussian mixture model clustering on a data matrix.
gaussian_mixture(data, k, max_iter = 10, learn = FALSE, waiting = TRUE, ...)
gaussian_mixture(data, k, max_iter = 10, learn = FALSE, waiting = TRUE, ...)
data |
a set of observations, presented as a matrix-like object where every row is a new observation. |
k |
the number of clusters to find. |
max_iter |
the maximum number of iterations to perform. |
learn |
a Boolean determining whether intermediate logs explaining how the algorithm works should be printed or not. |
waiting |
a Boolean determining whether the intermediate logs should be printed in chunks waiting for user input before printing the next or not. |
... |
additional arguments passed to |
The data given by data
is clustered by the model-based
algorithm that assumes every cluster follows a normal distribution, thus
the name "Gaussian Mixture".
The normal distributions are parameterized by their mean vector, covariance matrix and mixing proportion. Initially, the mean vector is set to the cluster centers obtained by performing a k-means clustering on the data, the covariance matrix is set to the covariance matrix of the data points belonging to each cluster and the mixing proportion is set to the proportion of data points belonging to each cluster. The algorithm then optimizes the gaussian models by means of the Expectation Maximization (EM) algorithm.
The EM algorithm is an iterative algorithm that alternates between two steps:
Compute how much is each observation expected to belong to each component of the GMM.
Recompute the GMM according to the expectations from the E-step in order to maximize them.
The algorithm stops when the changes in the expectations are sufficiently small or when a maximum number of iterations is reached.
A clustlearn::gaussian_mixture()
object. It is a list with the
following components:
cluster |
a vector of integers (from 1:k ) indicating the
cluster to which each point belongs. |
mu |
the final mean parameters. |
sigma |
the final covariance matrices. |
lambda |
the final mixing proportions. |
loglik |
the final log likelihood. |
all.loglik |
a vector of each iteration's log likelihood. |
iter |
the number of iterations performed. |
size |
a vector with the number of data points belonging to each cluster. |
Eduardo Ruiz Sabajanes, [email protected]
### !! This algorithm is very slow, so we'll only test it on some datasets !! ### Helper functions dmnorm <- function(x, mu, sigma) { k <- ncol(sigma) x <- as.matrix(x) diff <- t(t(x) - mu) num <- exp(-1 / 2 * diag(diff %*% solve(sigma) %*% t(diff))) den <- sqrt(((2 * pi)^k) * det(sigma)) num / den } test <- function(db, k) { print(cl <- gaussian_mixture(db, k, 100)) x <- seq(min(db[, 1]), max(db[, 1]), length.out = 100) y <- seq(min(db[, 2]), max(db[, 2]), length.out = 100) plot(db, col = cl$cluster, asp = 1, pch = 20) for (i in seq_len(k)) { m <- cl$mu[i, ] s <- cl$sigma[i, , ] f <- function(x, y) cl$lambda[i] * dmnorm(cbind(x, y), m, s) z <- outer(x, y, f) contour(x, y, z, col = i, add = TRUE) } } ### Example 1 test(db1[1:250,], 2) ### Example 2 (with explanations, no plots) cl <- gaussian_mixture( db5[1:10, ], 3, learn = TRUE, waiting = FALSE )
### !! This algorithm is very slow, so we'll only test it on some datasets !! ### Helper functions dmnorm <- function(x, mu, sigma) { k <- ncol(sigma) x <- as.matrix(x) diff <- t(t(x) - mu) num <- exp(-1 / 2 * diag(diff %*% solve(sigma) %*% t(diff))) den <- sqrt(((2 * pi)^k) * det(sigma)) num / den } test <- function(db, k) { print(cl <- gaussian_mixture(db, k, 100)) x <- seq(min(db[, 1]), max(db[, 1]), length.out = 100) y <- seq(min(db[, 2]), max(db[, 2]), length.out = 100) plot(db, col = cl$cluster, asp = 1, pch = 20) for (i in seq_len(k)) { m <- cl$mu[i, ] s <- cl$sigma[i, , ] f <- function(x, y) cl$lambda[i] * dmnorm(cbind(x, y), m, s) z <- outer(x, y, f) contour(x, y, z, col = i, add = TRUE) } } ### Example 1 test(db1[1:250,], 2) ### Example 2 (with explanations, no plots) cl <- gaussian_mixture( db5[1:10, ], 3, learn = TRUE, waiting = FALSE )
Performs Genetic K-Means clustering on a data matrix.
genetic_kmeans( data, k, population_size = 10, mut_probability = 0.5, max_generations = 10, learn = FALSE, waiting = TRUE, ... )
genetic_kmeans( data, k, population_size = 10, mut_probability = 0.5, max_generations = 10, learn = FALSE, waiting = TRUE, ... )
data |
a set of observations, presented as a matrix-like object where every row is a new observation. |
k |
the number of clusters. |
population_size |
the number of individuals in the population. |
mut_probability |
the probability of a mutation occurring. |
max_generations |
the maximum number of iterations allowed. |
learn |
a Boolean determining whether intermediate logs explaining how the algorithm works should be printed or not. |
waiting |
a Boolean determining whether the intermediate logs should be printed in chunks waiting for user input before printing the next or not. |
... |
additional arguments passed to |
A kmeans object as returned by the original kmeans function.
Eduardo Ruiz Sabajanes, [email protected]
### Example 1: Simple usage with circles dataset result1 <- genetic_kmeans(db1[1:20,], 2, learn = TRUE, waiting = FALSE) ### Example 2: Moons dataset with different population size result2 <- genetic_kmeans(db2[1:20,], 2, population_size = 20, learn = TRUE, waiting = FALSE) ### Example 3: Varying density clusters with different mutation probability result3 <- genetic_kmeans(db3[1:20,], 3, mut_probability = 0.7, learn = TRUE, waiting = FALSE) ### Example 4: Well-separated clusters with larger population result5 <- genetic_kmeans(db5[1:20,], 3, population_size = 30, mut_probability = 0.6, learn = TRUE, waiting = FALSE) ### Example 5: Using different parameters combinations result6 <- genetic_kmeans(db1[1:20,], 2, population_size = 15, mut_probability = 0.8, max_generations = 15, learn = TRUE, waiting = FALSE)
### Example 1: Simple usage with circles dataset result1 <- genetic_kmeans(db1[1:20,], 2, learn = TRUE, waiting = FALSE) ### Example 2: Moons dataset with different population size result2 <- genetic_kmeans(db2[1:20,], 2, population_size = 20, learn = TRUE, waiting = FALSE) ### Example 3: Varying density clusters with different mutation probability result3 <- genetic_kmeans(db3[1:20,], 3, mut_probability = 0.7, learn = TRUE, waiting = FALSE) ### Example 4: Well-separated clusters with larger population result5 <- genetic_kmeans(db5[1:20,], 3, population_size = 30, mut_probability = 0.6, learn = TRUE, waiting = FALSE) ### Example 5: Using different parameters combinations result6 <- genetic_kmeans(db1[1:20,], 2, population_size = 15, mut_probability = 0.8, max_generations = 15, learn = TRUE, waiting = FALSE)
Allele mutation probability computation
gka_allele_mutation(data, k, centers, ...)
gka_allele_mutation(data, k, centers, ...)
data |
a set of observations, presented as a matrix-like object where
every row is a new observation. The matrix is of size |
k |
the number of clusters. |
centers |
a matrix of size |
... |
additional arguments passed to |
a matrix of size n
by k
with the probability of each
allele mutating to a specific cluster.
Eduardo Ruiz Sabajanes, [email protected]
Centroid computation
gka_centers(data, k, population)
gka_centers(data, k, population)
data |
a set of observations, presented as a matrix-like object where
every row is a new observation. The matrix is of size |
k |
the number of clusters. |
population |
a matrix of size |
a 3D array of size p
by k
by m
.
Eduardo Ruiz Sabajanes, [email protected]
This method fixes chromosomes which do not have at least one observation assigned to each cluster.
gka_chromosome_fix(population, k)
gka_chromosome_fix(population, k)
population |
a matrix of size |
k |
the number of clusters. |
a matrix of size p
by n
with the cluster assignments
for each observation.
Eduardo Ruiz Sabajanes, [email protected]
K-Means Operator (KMO) which replaces the crossover operator in the Genetic K-Means algorithm (GKA).
gka_crossover(data, centers)
gka_crossover(data, centers)
data |
a set of observations, presented as a matrix-like object where
every row is a new observation. The matrix is of size |
centers |
a matrix of size |
a vector of size n
with the cluster assignments for each
observation i.e. a new chromosome.
Eduardo Ruiz Sabajanes, [email protected]
Fitness function
gka_fitness(twcv)
gka_fitness(twcv)
twcv |
a vector of size |
a vector of size p
with the fitness of each individual in the
population.
Eduardo Ruiz Sabajanes, [email protected]
Initialization method
gka_initialization(n, p, k)
gka_initialization(n, p, k)
n |
the number of observations in the data. |
p |
the number of individuals in the population. |
k |
the number of clusters. |
a matrix of size p
by n
with the cluster assignments
for each observation.
Eduardo Ruiz Sabajanes, [email protected]
Mutation method
gka_mutation(chromosome, prob, k, mut_probability)
gka_mutation(chromosome, prob, k, mut_probability)
chromosome |
a vector of size |
prob |
a matrix of size |
k |
the number of clusters. |
mut_probability |
the probability of a mutation occurring. |
a vector of size n
with the cluster assignments for each
observation i.e. a new chromosome.
Eduardo Ruiz Sabajanes, [email protected]
Selection method
gka_selection(p, fitness)
gka_selection(p, fitness)
p |
the number of individuals in the population. |
fitness |
a vector of size |
the index of the individual selected for reproduction.
Eduardo Ruiz Sabajanes, [email protected]
Total Within Cluster Variation (TWCV) computation
gka_twcv(data, k, population, centers)
gka_twcv(data, k, population, centers)
data |
a set of observations, presented as a matrix-like object where
every row is a new observation. The matrix is of size |
k |
the number of clusters. |
population |
a matrix of size |
centers |
a 3D array of size |
a vector of size p
with the total within cluster variation of
each individual in the population.
Eduardo Ruiz Sabajanes, [email protected]
Perform K-Means clustering on a data matrix.
kmeans_( data, centers, max_iterations = 10, initialization = "kmeans++", learn = FALSE, waiting = TRUE, ... )
kmeans_( data, centers, max_iterations = 10, initialization = "kmeans++", learn = FALSE, waiting = TRUE, ... )
data |
a set of observations, presented as a matrix-like object where every row is a new observation. |
centers |
either the number of clusters or a set of initial cluster
centers. If a number, the centers are chosen according to the
|
max_iterations |
the maximum number of iterations allowed. |
initialization |
the initialization method to be used. This should be
one of |
learn |
a Boolean determining whether intermediate logs explaining how the algorithm works should be printed or not. |
waiting |
a Boolean determining whether the intermediate logs should be printed in chunks waiting for user input before printing the next or not. |
... |
additional arguments passed to |
The data given by data
is clustered by the -means
method, which aims to partition the points into
groups such that the
sum of squares from points to the assigned cluster centers is minimized. At
the minimum, all cluster centers are at the mean of their Voronoi sets (the
set of data points which are nearest to the cluster center).
The -means method follows a 2 to
step process:
The first step can be subdivided into 3 steps:
Selection of the number of clusters, into which the data is
going to be grouped and of which the centers will be the representatives.
This is determined through the use of the
centers
parameter.
Computation of the distance from each data point to each center.
Assignment of each observation to a cluster. The observation is assigned to the cluster represented by the nearest center.
The next steps are just like the first but for the first sub-step:
Computation of the new centers. The center of each cluster is computed as the mean of the observations assigned to said cluster.
The algorithm stops once the centers in step are the same as the
ones in step
. However, this convergence does not always take place.
For this reason, the algorithm also stops once a maximum number of iterations
max_iterations
is reached.
The initialization
methods provided by this function are:
random
A set of centers
observations is chosen at
random from the data as the initial centers.
kmeans++
The centers
observations are chosen using the
kmeans++ algorithm. This algorithm chooses the first center at
random and then chooses the next center from the remaining observations with
probability proportional to the square distance to the closest center. This
process is repeated until centers
centers are chosen.
A stats::kmeans()
object.
Eduardo Ruiz Sabajanes, [email protected]
### Voronoi tesselation voronoi <- suppressMessages(suppressWarnings(require(deldir))) cols <- c( "#00000019", "#DF536B19", "#61D04F19", "#2297E619", "#28E2E519", "#CD0BBC19", "#F5C71019", "#9E9E9E19" ) ### Helper function test <- function(db, k) { print(cl <- kmeans_(db, k, 100)) plot(db, col = cl$cluster, asp = 1, pch = 20) points(cl$centers, col = seq_len(k), pch = 13, cex = 2, lwd = 2) if (voronoi) { x <- c(min(db[, 1]), max(db[, 1])) dx <- c(x[1] - x[2], x[2] - x[1]) y <- c(min(db[, 2]), max(db[, 2])) dy <- c(y[1] - y[2], y[2] - y[1]) tesselation <- deldir( cl$centers[, 1], cl$centers[, 2], rw = c(x + dx, y + dy) ) tiles <- tile.list(tesselation) plot( tiles, asp = 1, add = TRUE, showpoints = FALSE, border = "#00000000", fillcol = cols ) } } ### Example 1 test(db1, 2) ### Example 2 test(db2, 2) ### Example 3 test(db3, 3) ### Example 4 test(db4, 3) ### Example 5 test(db5, 3) ### Example 6 test(db6, 3) ### Example 7 (with explanations, no plots) cl <- kmeans_( db5[1:20, ], 3, learn = TRUE, waiting = FALSE )
### Voronoi tesselation voronoi <- suppressMessages(suppressWarnings(require(deldir))) cols <- c( "#00000019", "#DF536B19", "#61D04F19", "#2297E619", "#28E2E519", "#CD0BBC19", "#F5C71019", "#9E9E9E19" ) ### Helper function test <- function(db, k) { print(cl <- kmeans_(db, k, 100)) plot(db, col = cl$cluster, asp = 1, pch = 20) points(cl$centers, col = seq_len(k), pch = 13, cex = 2, lwd = 2) if (voronoi) { x <- c(min(db[, 1]), max(db[, 1])) dx <- c(x[1] - x[2], x[2] - x[1]) y <- c(min(db[, 2]), max(db[, 2])) dy <- c(y[1] - y[2], y[2] - y[1]) tesselation <- deldir( cl$centers[, 1], cl$centers[, 2], rw = c(x + dx, y + dy) ) tiles <- tile.list(tesselation) plot( tiles, asp = 1, add = TRUE, showpoints = FALSE, border = "#00000000", fillcol = cols ) } } ### Example 1 test(db1, 2) ### Example 2 test(db2, 2) ### Example 3 test(db3, 3) ### Example 4 test(db4, 3) ### Example 5 test(db5, 3) ### Example 6 test(db6, 3) ### Example 7 (with explanations, no plots) cl <- kmeans_( db5[1:20, ], 3, learn = TRUE, waiting = FALSE )