@@ -41,8 +41,8 @@ \subsection{Mutual Info Score}
% ---------- rand index ----------
\clearpage
\thispagestyle{clusteringstyle}
- \section{Rand index}
- \subsection{Rand index}
+ \section{Rand Index}
+ \subsection{Rand Index}

The Rand Index (RI) is a clustering metric that measures the similarity between two clusterings using the predicted labels generated by an algorithm
and the true labels, or labels coming from a reference clustering.
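
As a quick illustration of the computation (assuming scikit-learn as the implementation, which this text does not specify), \texttt{rand\_score} compares the two labelings directly:

\begin{verbatim}
from sklearn.metrics import rand_score

# True labels and labels predicted by a clustering algorithm.
labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [1, 1, 0, 0, 2, 2]

# The RI counts agreeing pairs, so identical partitions score 1.0
# even though the cluster ids themselves are permuted.
print(rand_score(labels_true, labels_pred))  # 1.0
\end{verbatim}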
@@ -77,11 +77,37 @@ \subsection{Rand index}
RI scores even when the clusterings are significantly different.
}
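
One well-known cause is the lack of chance correction: two unrelated labelings still agree on many pairs. A short sketch (again assuming scikit-learn) contrasting the raw RI with the chance-corrected Adjusted Rand Index:

\begin{verbatim}
import numpy as np
from sklearn.metrics import adjusted_rand_score, rand_score

# Two completely unrelated random labelings over 10 clusters.
rng = np.random.default_rng(0)
labels_true = rng.integers(0, 10, size=1000)
labels_pred = rng.integers(0, 10, size=1000)

print(rand_score(labels_true, labels_pred))           # roughly 0.82
print(adjusted_rand_score(labels_true, labels_pred))  # roughly 0.0
\end{verbatim}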

- % ---------- calinski harabasz score ----------
+ % ---------- calinski harabasz index ----------
\clearpage
\thispagestyle{clusteringstyle}
- \section{CH Score}
- \subsection{Calinski Harabasz Score}
+ \section{CH Index}
+ \subsection{Calinski Harabasz Index}
+
+ The Calinski–Harabasz Index (CH Index), also known as the Variance Ratio Criterion, is a clustering evaluation metric that does
+ not require ground-truth labels. It measures the quality of a clustering by comparing the dispersion between clusters to the
+ dispersion within clusters.
+
+ \begin{center}
+ $\mathrm{CH} = \dfrac{\mathrm{BCSS} / (k - 1)}{\mathrm{WCSS} / (n - k)}$
+ \end{center}
+
+ The CH Index is defined as the ratio of the between-cluster dispersion (BCSS) to the within-cluster dispersion (WCSS),
+ each normalized by its degrees of freedom, $k - 1$ and $n - k$, where $n$ is the number of samples and $k$ the number of
+ clusters. This normalization keeps scores comparable across different values of $k$ and avoids artificially inflating the
+ score for higher cluster counts.
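+
+ For concreteness, writing $c_i$ for the centroid of cluster $C_i$, $c$ for the overall centroid, and $n_i$ for the number
+ of points in cluster $i$ (notation introduced here for illustration), the two dispersion terms are the usual sums of
+ squared distances:
+
+ \begin{center}
+ $\mathrm{BCSS} = \sum_{i=1}^{k} n_i \lVert c_i - c \rVert^2 \qquad \mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - c_i \rVert^2$
+ \end{center}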
+
+ \textbf{When to use the Calinski–Harabasz Index?}
+
+ Use the CH Index when no ground-truth labels are available to validate clustering quality. It can also be used to identify
+ the optimal number of clusters by maximizing the CH Index across different cluster counts, as in the sketch below.
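+
+ A minimal sketch of that selection loop, assuming scikit-learn's \texttt{calinski\_harabasz\_score} and $k$-means
+ (the dataset here is synthetic and purely illustrative):
+
+ \begin{verbatim}
+ from sklearn.cluster import KMeans
+ from sklearn.datasets import make_blobs
+ from sklearn.metrics import calinski_harabasz_score
+
+ # Synthetic data with 4 well-separated blobs (illustrative only).
+ X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
+
+ # Score each candidate k; the maximum suggests the best cluster count.
+ scores = {}
+ for k in range(2, 9):
+     labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
+     scores[k] = calinski_harabasz_score(X, labels)
+
+ print(max(scores, key=scores.get))  # expected to recover k = 4 here
+ \end{verbatim}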
+
+ \coloredboxes{
+ \item The CH Index does not rely on labeled data.
+ \item Degrees-of-freedom normalization ensures fair comparison across varying $k$ and sample sizes.
+ }
+ {
+ \item The calculation assumes a Euclidean distance metric, which may limit its applicability for non-Euclidean data.
+ }
+

% ---------- contingency matrix ----------