Commit d8c8b9e

committed
Calinski Harabasz Index
1 parent 9edd30f commit d8c8b9e

1 file changed

+31
-5
lines changed

book/4-clustering.tex

Lines changed: 31 additions & 5 deletions
@@ -41,8 +41,8 @@ \subsection{Mutual Info Score}
4141
% ---------- rand index ----------
4242
\clearpage
4343
\thispagestyle{clusteringstyle}
44-
\section{Rand index}
45-
\subsection{Rand index}
44+
\section{Rand Index}
45+
\subsection{Rand Index}
4646

4747
The Rand Index (RI) is a clustering metric that measures the similarity between two clusterings using the predicted labels generated by an algorithm
4848
and the true labels or labels coming from a reference clustering.
@@ -77,11 +77,37 @@ \subsection{Rand index}
7777
RI scores even when the clusterings are significantly different.
7878
}
7979

80-
% ---------- calinski harabasz score ----------
80+
% ---------- calinski harabasz index ----------
8181
\clearpage
8282
\thispagestyle{clusteringstyle}
83-
\section{CH Score}
84-
\subsection{Calinski Harabasz Score}
83+
\section{CH Index}
84+
\subsection{Calinski Harabasz Index}
85+
86+
The Calinski–Harabasz Index (CH Index), also known as the Variance Ratio Criterion, is a clustering evaluation metric that does
87+
not require ground-truth labels. It measures the quality of clustering by comparing the dispersion between clusters to the
88+
dispersion within clusters.
89+
90+
\begin{center}
91+
FORMULA GOES HERE
92+
\end{center}
93+
94+
The CH Index is defined as the ratio of the between-cluster dispersion (BCSS) to the within-cluster dispersion (WCSS),
95+
each divided by its degrees of freedom. This normalization keeps scores comparable
96+
across different values of $k$, which helps avoid artificially inflating the score for higher cluster counts.
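
The ratio described above can be sketched as follows. The notation is introduced here for illustration and is an assumption, not taken from the text: $n$ samples are partitioned into $k$ clusters $C_1, \dots, C_k$, where cluster $C_j$ has $n_j$ points and centroid $c_j$, and $c$ is the centroid of the whole dataset.

\[
CH = \frac{BCSS / (k - 1)}{WCSS / (n - k)}, \qquad
BCSS = \sum_{j=1}^{k} n_j \, \| c_j - c \|^2, \qquad
WCSS = \sum_{j=1}^{k} \sum_{x \in C_j} \| x - c_j \|^2
\]

As a small worked example under these assumptions, take the one-dimensional points $\{0, 2\}$ and $\{10, 12\}$ as two clusters ($n = 4$, $k = 2$). The cluster centroids are $1$ and $11$, and the overall centroid is $6$, so $BCSS = 2 \cdot 5^2 + 2 \cdot 5^2 = 100$ and $WCSS = 1 + 1 + 1 + 1 = 4$. The index is then $CH = (100 / 1) / (4 / 2) = 50$.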
97+
98+
\textbf{When to use Calinski-Harabasz Index?}
99+
100+
Use the CH Index when no ground-truth labels are available to validate the clustering quality. It can also be used to identify the
101+
optimal number of clusters by choosing the $k$ that maximizes the CH Index.
102+
103+
\coloredboxes{
104+
\item The CH Index does not rely on labeled data.
105+
\item Normalizing by degrees of freedom makes scores more comparable across different values of $k$ and sample sizes.
106+
}
107+
{
108+
\item The calculation assumes a Euclidean distance metric, which may limit its applicability for non-Euclidean data.
109+
}
110+
85111

86112

87113
% ---------- contingency matrix ----------
