This repository has been archived by the owner on Feb 10, 2025. It is now read-only.
K_means.Rmd
---
title: "Lab 27. K-means"
output:
  pdf_document: default
  html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Data Preprocessing
```{r extract-plants}
plants <- read.csv("plants.csv", sep=';')
head(plants)
```
1. Remove rows that contain three or more NAs.
2. Replace the remaining NAs in `pdias` and `longindex` with the column means.
3. Standardize the numeric (double) columns `pdias` and `longindex`.
```{r pressure}
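prepare_dataframe <- function(data) {
  # Drop the plant name; it is an identifier, not a feature
  data <- subset(data, select = -c(plant.name))

  # 1. Keep only rows with fewer than 3 NAs
  cnt_na <- rowSums(is.na(data))
  data <- data[cnt_na < 3, ]

  # 2. Replace the remaining NAs in pdias and longindex with the column means
  mean_pdias <- mean(data$pdias, na.rm = TRUE)
  mean_longindex <- mean(data$longindex, na.rm = TRUE)
  data$pdias[is.na(data$pdias)] <- mean_pdias
  data$longindex[is.na(data$longindex)] <- mean_longindex

  # 3. Standardize; as.numeric() drops the one-column matrix that scale() returns
  data$pdias <- as.numeric(scale(data$pdias))
  data$longindex <- as.numeric(scale(data$longindex))

  return(data)
}
plants <- prepare_dataframe(plants)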
```
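The three preprocessing steps can be tried on a small made-up data frame before running them on `plants.csv`. The sketch below is only an illustration: the `toy` data frame and all of its values are invented, and only the column names `pdias`, `longindex` and `insects` follow the lab data.

```r
# Toy data with invented values; column names mirror the lab data
toy <- data.frame(
  pdias     = c(1.0, NA, 3.0, NA, 5.0),
  longindex = c(0.2, 0.4, NA, NA, 0.8),
  insects   = c(1, 0, 1, NA, 0)
)

# Step 1: drop rows with three or more NAs (only the fourth row qualifies)
na_per_row <- rowSums(is.na(toy))
toy <- toy[na_per_row < 3, ]

# Step 2: replace the remaining NAs with the column mean
for (col in c("pdias", "longindex")) {
  toy[[col]][is.na(toy[[col]])] <- mean(toy[[col]], na.rm = TRUE)
}

# Step 3: standardize to mean 0 and standard deviation 1
for (col in c("pdias", "longindex")) {
  toy[[col]] <- as.numeric(scale(toy[[col]]))
}

print(toy)
```

Imputing before scaling matters here: the mean used for imputation is computed on the original units, and the imputed cells end up at exactly 0 after standardization.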
```{r}
plants <- subset(plants, select=c(pdias, longindex, insects, leafy))
```
## K-means
Find the largest number of clusters for which adding one more cluster still significantly decreases the error (the total within-cluster sum of squares).
```{r}
set.seed(1234)
cluster_num <- 2:10
inner_dists <- numeric(length(cluster_num))
for (i in seq_along(cluster_num)) {
  model <- kmeans(plants, centers = cluster_num[i])
  # Use $ to extract the number; single brackets would return a one-element list
  inner_dists[i] <- model$tot.withinss
}
plot(cluster_num, inner_dists, type = "b",
     xlab = "Number of clusters",
     ylab = "Total within-cluster sum of squares")
```
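One way to make "significantly decreases" more concrete is to look at the relative drop in the total within-cluster sum of squares when going from k - 1 to k clusters. The sketch below does this on synthetic data rather than on `plants`; the three blobs, the range `1:6` and the choice `nstart = 10` are assumptions made for the illustration, not part of the lab.

```r
set.seed(1234)

# Synthetic data: three well-separated 2-D blobs (invented for illustration)
centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
toy <- do.call(rbind, lapply(1:3, function(j) {
  cbind(rnorm(50, centers[j, 1]), rnorm(50, centers[j, 2]))
}))

# Total within-cluster sum of squares for k = 1..6
ks  <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# Relative decrease achieved by going from k - 1 to k clusters
rel_drop <- -diff(wss) / head(wss, -1)
names(rel_drop) <- ks[-1]
print(round(rel_drop, 3))
```

On data like this, the drop is large up to the true number of blobs and small afterwards; the last k with a large drop is the candidate read off the elbow plot. How large counts as "significant" remains a judgment call.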
```{r}
library(NbClust)
res <- NbClust(data = plants, distance = 'euclidean', min.nc = 2, max.nc = 10, method = 'kmeans')
```