At the end of this lab you will upload your RStudio script for points.
For CIS 389 download this dataset: iris.csv
First, we need data to run the algorithm on. We will use the classic iris dataset (UCI Machine Learning Repository: Iris Data Set).
The example's way of loading the data is wrong; load it the way the book taught us instead. Use:
For PC:
iris <- read.csv("C:/Users/Student/Desktop/iris.csv")
For Mac:
iris <- read.csv("/Users/UserName/Downloads/iris.csv")
First, read in the data and display the first 6 rows of the iris data frame. We will use petallength and petalwidth as the variables for k-means clustering. Why? Because exploring the data has shown that these two variables differ markedly among the species.
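As a rough illustration of these two steps, the sketch below previews the first 6 rows and compares the average petal measurements per species. It assumes R's built-in iris data as a stand-in for iris.csv. The name iris_demo and the two sepal column names are assumptions made for this sketch, not names taken from the lab file.

```r
# Assumption: the built-in iris data stands in for iris.csv.
# Columns are renamed to mimic the lab file; the sepal names are guesses.
iris_demo <- datasets::iris
names(iris_demo) <- c("sepallength", "sepalwidth",
                      "petallength", "petalwidth", "class")

# Display the first 6 rows of the data frame
print(head(iris_demo))

# Mean petal length and width for each species
petal_means <- aggregate(cbind(petallength, petalwidth) ~ class,
                         data = iris_demo, FUN = mean)
print(petal_means)
```

If the per-species means are far apart, as they should be here, the two petal variables are reasonable inputs for clustering.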
Load the cluster package, which the plotting step later in the lab needs:
library(cluster)
We will create a new data frame containing just these two variables.
1. kmeans_variables = data.frame(iris$petallength, iris$petalwidth)
Convert the character vector of class labels to a factor to avoid an invalid color name error when plotting.
2. pClass <- as.factor(iris$class)
Let’s display the data.
3. plot(kmeans_variables,col=pClass)
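To see why the conversion matters, here is a minimal sketch. The labels below are hypothetical values in the style of the iris$class column, and labels_demo, pClass_demo and codes are names invented for this example. A label such as "Iris-setosa" is not a color name, but a factor is stored as integer codes, and integers can be used as positions in R's color palette.

```r
# Hypothetical class labels in the style of the lab's iris$class column
labels_demo <- c("Iris-setosa", "Iris-versicolor",
                 "Iris-virginica", "Iris-setosa")

# A factor keeps the distinct labels as levels (sorted alphabetically)
pClass_demo <- as.factor(labels_demo)
print(levels(pClass_demo))

# Internally each label is an integer code pointing at its level
codes <- as.integer(pClass_demo)
print(codes)

# Integer codes can index the current color palette
print(palette()[codes])
```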
Applying K-means
4. KMC = kmeans(kmeans_variables, centers = 3, iter.max=50, nstart=20)
centers: the number of clusters, k. We use 3 because there are 3 different species.
iter.max: the maximum number of iterations to perform.
nstart: the number of random starts. R will try 20 different random sets of starting centers and keep the run with the lowest within-cluster variation.
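The effect of nstart can be imitated by hand. The sketch below is a conceptual illustration, not how kmeans() is implemented internally: it runs 20 single-start fits and keeps the one with the smallest total within-cluster sum of squares. It assumes the built-in iris data as a stand-in for iris.csv, and the seed and the names x, runs, tot_within and best are arbitrary choices for this example.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Assumption: built-in iris petal columns stand in for kmeans_variables
x <- datasets::iris[, c("Petal.Length", "Petal.Width")]

# 20 independent fits, each from a single random start
runs <- lapply(1:20, function(i) {
  kmeans(x, centers = 3, iter.max = 50, nstart = 1)
})

# Total within-cluster sum of squares for each fit
tot_within <- sapply(runs, function(r) r$tot.withinss)

# Keep the fit with the lowest within-cluster variation
best <- runs[[which.min(tot_within)]]
print(range(tot_within))
print(best$size)
```

If the printed range has two different values, some starts ended in a worse solution, which is the situation nstart guards against.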
Output –
K-means clustering with 3 clusters of sizes 49, 48, 52
Cluster means:
  iris.petallength iris.petalwidth
1         1.465306        0.244898
2         5.595833        2.037500
3         4.269231        1.342308
Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[38] 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[75] 3 3 2 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 3 2 2 2 2 2
[112] 2 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2
[149] 2
Within cluster sum of squares by cluster:
[1] 2.032245 16.291667 13.057692
(between_SS / total_SS = 94.2 %)
Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"
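The between_SS / total_SS percentage in the output can be rebuilt from the listed components. The sketch below assumes the built-in iris data as a stand-in for iris.csv, so its numbers may differ slightly from the output above. The names x, fit, manual_totss and pct are invented for this example.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Assumption: built-in iris petal columns stand in for kmeans_variables
x <- datasets::iris[, c("Petal.Length", "Petal.Width")]
fit <- kmeans(x, centers = 3, iter.max = 50, nstart = 20)

# Total sum of squares: squared distances of all points to the overall mean
manual_totss <- sum(scale(x, scale = FALSE)^2)

# The total splits into a within-cluster part and a between-cluster part
print(c(total = fit$totss,
        within = fit$tot.withinss,
        between = fit$betweenss))

# Share of the total variation explained by the clustering, in percent
pct <- 100 * fit$betweenss / fit$totss
print(round(pct, 1))
```

A value close to 100 % means the points sit tightly around their own cluster centers compared with how spread out the data is overall.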
KMC$cluster gives the cluster assigned to each data point.
KMC$centers gives the centroid of each cluster:
> KMC$centers
  iris.petallength iris.petalwidth
1         1.465306        0.244898
2         5.595833        2.037500
3         4.269231        1.342308
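A centroid is simply the mean of the points assigned to that cluster. As a quick check, the sketch below recomputes the centers by hand and compares them with the centers component. It assumes the built-in iris data as a stand-in for iris.csv, and x, fit and manual_centers are names invented for this example.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Assumption: built-in iris petal columns stand in for kmeans_variables
x <- datasets::iris[, c("Petal.Length", "Petal.Width")]
fit <- kmeans(x, centers = 3, iter.max = 50, nstart = 20)

# Average each variable within each assigned cluster
manual_centers <- aggregate(x, by = list(cluster = fit$cluster), FUN = mean)
print(manual_centers)

# The centers reported by kmeans() should match the hand-computed means
print(fit$centers)
```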
KMC$size gives the number of data points in each cluster:
> KMC$size
[1] 49 48 52
For the other parameters, see “K-Means Clustering”. For a beginner, though, I think this much is sufficient.
Let’s see which species ended up in which cluster.
> table(KMC$cluster, iris$class)
    Iris-setosa Iris-versicolor Iris-virginica
  1          49               0              0
  2           0               2             46
  3           0              48              4
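We can see that setosa falls entirely in cluster 1, versicolor mostly in cluster 3 (48 of 50) and virginica mostly in cluster 2 (46 of 50). Only 6 of the 149 points land in another species' cluster.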
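The same reading of the table can be done in code. The sketch below types in the counts from the table above, finds the majority species for each cluster, and computes the share of points that fall in their species' majority cluster. The names tab, majority, matched and agreement are invented for this example.

```r
# Counts copied from the table above (rows = clusters, columns = species)
tab <- matrix(c(49,  0,  0,
                 0,  2, 46,
                 0, 48,  4),
              nrow = 3, byrow = TRUE,
              dimnames = list(cluster = c("1", "2", "3"),
                              species = c("Iris-setosa",
                                          "Iris-versicolor",
                                          "Iris-virginica")))

# Species with the largest count in each cluster
majority <- colnames(tab)[apply(tab, 1, which.max)]
names(majority) <- rownames(tab)
print(majority)

# Points that sit in the cluster dominated by their own species
matched <- sum(apply(tab, 1, max))
agreement <- matched / sum(tab)
print(matched)              # 143
print(round(agreement, 3))  # 0.96
```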
Plotting k-means
clusplot(iris, KMC$cluster, color=TRUE, shade=TRUE, lines=0)
What you are actually seeing in the clusplot() output is a plot of your observations in the principal plane. The function calculates the principal component scores for each observation, plots those scores and colors them by cluster.
Principal component analysis (PCA) is a dimension reduction technique; it “summarizes” the information of all variables into a couple of “new” variables called components.
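To make the idea concrete, the sketch below computes principal component scores by hand with prcomp() from base R. It only illustrates the idea and does not reproduce clusplot()'s exact internals. It assumes the built-in iris data as a stand-in for iris.csv, and measurements, fit, pca and scores are names invented for this example.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Assumption: the four numeric columns of the built-in iris data
measurements <- datasets::iris[, 1:4]

# Cluster on the two petal variables, as in the lab
fit <- kmeans(measurements[, c("Petal.Length", "Petal.Width")],
              centers = 3, iter.max = 50, nstart = 20)

# PCA summarizes the four variables into new components
pca <- prcomp(measurements)

# Scores on the first two components: one row per observation
scores <- pca$x[, 1:2]
print(head(scores))

# Proportion of the variance captured by each of the two components
print(summary(pca)$importance["Proportion of Variance", 1:2])
```

Calling plot(scores, col = fit$cluster) afterwards draws the observations in the principal plane, colored by cluster.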
For a simple plot, use:
plot(kmeans_variables, col=KMC$cluster)
NOTE: To install the cluster package in RStudio, run install.packages("cluster")
Sources –
K Means Clustering in R
Using the stats package in R for kmeans clustering
How to produce a pretty plot of the results of k-means cluster analysis?