[ADP] Hierarchical Clustering Analysis

> # 09. Hierarchical Clustering Analysis
>
>
> library(flexclust)
>
> data(nutrient)
> summary(nutrient)
     energy         protein          fat           calcium            iron
Min.   : 45.0   Min.   : 7.0   Min.   : 1.00   Min.   :  5.00   Min.   :0.500
1st Qu.:135.0   1st Qu.:16.5   1st Qu.: 5.00   1st Qu.:  9.00   1st Qu.:1.350
Median :180.0   Median :19.0   Median : 9.00   Median :  9.00   Median :2.500
Mean   :207.4   Mean   :19.0   Mean   :13.48   Mean   : 43.96   Mean   :2.381
3rd Qu.:282.5   3rd Qu.:22.0   3rd Qu.:22.50   3rd Qu.: 31.50   3rd Qu.:2.600
Max.   :420.0   Max.   :26.0   Max.   :39.00   Max.   :367.00   Max.   :6.000
>
> # Standardize the data (each column to mean 0, standard deviation 1)
> nutrient.scaled <- scale(nutrient)
> # Compute the distance matrix (default method = Euclidean)
> d <- dist(nutrient.scaled)
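
To make the two preprocessing steps concrete, here is a minimal sketch that redoes `scale()` and the default Euclidean `dist()` by hand. The small matrix `x` and its values are made up for illustration and are not the nutrient data.

```r
# Made-up stand-in data (NOT the nutrient data set)
x <- matrix(c(300, 18, 25,
              210, 22, 12,
              410, 14, 36,
              150, 20,  6),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("a", "b", "c", "d"),
                            c("energy", "protein", "fat")))

# scale() centers each column on its mean and divides by its sample sd
x.scaled <- scale(x)
x.manual <- sweep(sweep(x, 2, colMeans(x), "-"), 2, apply(x, 2, sd), "/")

# dist() defaults to the Euclidean distance between rows
d <- dist(x.scaled)
d.ab <- sqrt(sum((x.scaled["a", ] - x.scaled["b", ])^2))

# Both hand computations should agree with the built-in functions
same.scaling  <- isTRUE(all.equal(x.manual, x.scaled, check.attributes = FALSE))
same.distance <- isTRUE(all.equal(as.matrix(d)["a", "b"], d.ab))
```

Standardizing first matters here because the variables are on very different scales; without it, the variable with the largest range would dominate the distances.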
> # Choose the hierarchical clustering algorithm (average linkage)
> fit.average <- hclust(d, method = "average")
> # Dendrogram
> plot(fit.average, hang=-1, cex=.8, main="Average Linkage Clustering")
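
For intuition about what `method = "average"` does, the sketch below checks one property on made-up random data: under average linkage, the height of a merge is the mean of all pairwise distances between the two clusters being joined. Only the final merge is checked, and the data and variable names are assumptions for illustration.

```r
set.seed(1)
# Ten made-up observations on four variables (NOT the nutrient data)
x <- scale(matrix(rnorm(40), ncol = 4))
d <- dist(x)
fit <- hclust(d, method = "average")

# The two clusters that are joined in the very last merge
groups <- cutree(fit, k = 2)

# All pairwise distances between members of cluster 1 and cluster 2
dm <- as.matrix(d)
between <- dm[groups == 1, groups == 2, drop = FALSE]

mean.between <- mean(between)        # average between-cluster distance
final.height <- tail(fit$height, 1)  # height of the last merge in the tree
```

The two values should match, which is why the vertical axis of the dendrogram can be read as an average distance between the groups being joined.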

>
> # Cluster validity measures can be used to decide the number of clusters.
> # Representative approaches are the NbClust package, the within-cluster sum of squares, and the gap statistic.
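
The within-cluster sum of squares mentioned above is not shown in the original post, so here is a rough sketch of how it could be computed for a hierarchical solution. The data are made up (three well-separated groups), and the helper `wss()` is a hypothetical function written for this example, not part of any package.

```r
set.seed(42)
# Made-up data with three well-separated groups (NOT the nutrient data)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2))
fit <- hclust(dist(x), method = "average")

# Hypothetical helper: total within-cluster sum of squares for one assignment
wss <- function(data, groups) {
  sum(sapply(split(as.data.frame(data), groups), function(g) {
    # squared deviations of each member from its cluster's column means
    sum(scale(g, center = TRUE, scale = FALSE)^2)
  }))
}

# Cut the same tree at k = 1..8 and record the WSS for each cut
ks <- 1:8
wss.by.k <- sapply(ks, function(k) wss(x, cutree(fit, k = k)))

# Look for the "elbow" where the curve stops dropping sharply
plot(ks, wss.by.k, type = "b",
     xlab = "Number of clusters", ylab = "Within-cluster sum of squares")
```

On this toy data the curve drops steeply up to 3 clusters and flattens afterwards; on real data the elbow is often less clear, which is one reason to combine it with other indices.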
>
> library(NbClust)
>
> devAskNewPage(ask = TRUE)
> # The call specifies Euclidean distance; the function automatically prints the plots and the majority-rule result.
> nc <- NbClust(nutrient.scaled, distance = "euclidean", min.nc = 2, max.nc = 15, method = "average")
Hit <Return> to see next plot:
*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot.

● 1st plot: the Hubert index bends sharply (left panel), and its second-differences plot peaks at 5 clusters (right panel).

Hit <Return> to see next plot:
*** : The D index is a graphical method of determining the number of clusters.
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure.

● 2nd plot: like the Hubert index, the D index plot also bends sharply at 5 clusters.

*******************************************************************
* Among all indices:
* 4 proposed 2 as the best number of clusters
* 4 proposed 3 as the best number of clusters
* 2 proposed 4 as the best number of clusters
* 4 proposed 5 as the best number of clusters
* 1 proposed 9 as the best number of clusters
* 1 proposed 10 as the best number of clusters
* 2 proposed 13 as the best number of clusters
* 1 proposed 14 as the best number of clusters
* 4 proposed 15 as the best number of clusters

                   ***** Conclusion *****

* According to the majority rule, the best number of clusters is  2
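● By majority rule, NbClust reports 2 as the best number of clusters for this hierarchical clustering, although 2, 3, 5, and 15 are tied at four votes each.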

*******************************************************************
Warning messages:
1: In pf(beale, pp, df2) : NaNs produced
2: In pf(beale, pp, df2) : NaNs produced
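
To see how close the vote actually is, the counts printed in the "Among all indices" summary can be tallied directly. This sketch types those counts in by hand so it runs without NbClust. With a live result, the same tally is commonly built from the first row of `nc$Best.nc`, which is an assumption about the returned object worth checking against the package documentation.

```r
# Votes copied from the "Among all indices" summary printed above
votes <- rep(c(2, 3, 4, 5, 9, 10, 13, 14, 15),
             times = c(4, 4, 2, 4, 1, 1, 2, 1, 4))

# Count how many indices proposed each number of clusters
tally <- table(votes)

# Every candidate that shares the highest vote count
top <- as.integer(names(tally)[as.vector(tally) == max(tally)])

barplot(tally,
        xlab = "Number of clusters", ylab = "Number of indices",
        main = "Votes per number of clusters")
```

Four candidates share the top count, so the majority rule alone does not settle the choice; the graphical indices are what separate them.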
>
> # Final choice: cut the tree at 5 clusters, the value both index plots point to
> clusters <- cutree(fit.average, k=5)
> table(clusters)
clusters
1  2  3  4  5
7 16  1  2  1
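
Once cluster labels exist, a common next step is to profile each cluster on the original, unscaled variables. The sketch below shows the idea on made-up data with two obvious groups; the data frame `raw` and its column values are assumptions for illustration, not the nutrient data.

```r
set.seed(7)
# Made-up data: five low-energy and five high-energy items (NOT nutrient)
raw <- data.frame(energy = c(rnorm(5, 150, 10), rnorm(5, 350, 10)),
                  fat    = c(rnorm(5, 5, 1),    rnorm(5, 30, 2)))

# Same pipeline as in the post: scale, distance, average linkage, cut
fit <- hclust(dist(scale(raw)), method = "average")
groups <- cutree(fit, k = 2)

# Cluster sizes, as table(clusters) does in the post
sizes <- table(groups)

# Median of each original variable per cluster
profile <- aggregate(raw, by = list(cluster = groups), FUN = median)
print(profile)
```

Medians are used instead of means so that a single extreme item, such as the one-member clusters in the table above, distorts the profile less.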
>
> par(mfrow=c(1,1))
> plot(fit.average, hang=-1, cex=.8, main="Average Linkage Clustering\n5 Cluster Solution")
> rect.hclust(fit.average, k=5)

Source: 2020 데이터 분석 전문가 ADP 필기 한 권으로 끝내기

Meongtae's IT Blog
