Commit 34e6f03

Discuss Holm and Bonferroni in the documentation

1 parent 498afe9

File tree

2 files changed: +20, -6 lines


docs/src/index.md

Lines changed: 17 additions & 5 deletions
@@ -1,17 +1,25 @@
 # [CriticalDifferenceDiagrams.jl](@id Home)
 
-Critical difference (CD) diagrams are a powerful tool to compare outcomes of multiple treatments over multiple observations. For instance, in machine learning research we often compare the performance (outcome) of multiple methods (treatments) over multiple data sets (observations). This Julia package generates Tikz code to produce publication-ready vector graphics. A wrapper for Python is also available.
+Critical difference (CD) diagrams are a powerful tool to compare outcomes of multiple treatments over multiple observations. For instance, in machine learning research we often compare the performance (i.e., outcome) of multiple methods (i.e., treatments) over multiple data sets (i.e., observations). This Julia package generates Tikz code to produce publication-ready vector graphics. A [wrapper for Python](python-wrapper/) is also available.
 
 
 ## Reading a CD diagram
 
-Let's take a look at the treatments `clf1` to `clf5`. Their position represents their mean ranks across all outcomes of the observations, where low ranks indicate that a treatment wins more often than its competitors with higher ranks. Two or more treatments are connected with each other if we can not tell their outcomes apart, in the sense of statistical significance. For the above example, we can not tell from the data whether `clf3` and `clf5` are actually different from each other. We can tell, however, that both of them are different from all of the other treatments. This example above is adapted from https://github.com/hfawaz/cd-diagram.
+Let's take a look at the treatments `clf1` to `clf5`. Their position represents their mean ranks across all outcomes of the observations, where low ranks indicate that a treatment wins more often than its competitors with higher ranks. Two or more treatments are connected with each other if we cannot tell their outcomes apart, in the sense of statistical significance. In the example above, we cannot tell from the data whether `clf3` and `clf5` are actually different from each other. We can tell, however, that both of them are different from all of the other treatments. This example is adapted from [github.com/hfawaz/cd-diagram](https://github.com/hfawaz/cd-diagram).

 ```@raw html
 <img alt="assets/example.svg" src="assets/example.svg" style="width: 480px; max-width: 100%; margin: 2em auto; display: block;">
 ```
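The mean-rank computation described above can be sketched in a few lines of Python. This is an illustrative example with made-up accuracies, not this package's implementation; for simplicity it ignores ties, which real implementations resolve with average ranks:

```python
# Hypothetical accuracies (outcomes) of three classifiers (treatments)
# over four data sets (observations); all values are made up.
outcomes = {
    "clf1": [0.85, 0.78, 0.91, 0.88],
    "clf2": [0.80, 0.74, 0.86, 0.84],
    "clf3": [0.70, 0.65, 0.75, 0.73],
}

def mean_ranks(outcomes, maximize=True):
    """Average rank of each treatment across all observations (rank 1 = best).
    Ties are ignored here; proper implementations assign average ranks."""
    names = list(outcomes)
    n_obs = len(next(iter(outcomes.values())))
    ranks = {name: 0.0 for name in names}
    for i in range(n_obs):
        # sort treatments by their outcome on observation i, best first
        order = sorted(names, key=lambda t: outcomes[t][i], reverse=maximize)
        for rank, name in enumerate(order, start=1):
            ranks[name] += rank / n_obs
    return ranks

print(mean_ranks(outcomes))  # clf1 wins every observation, so its mean rank is 1.0
```

A treatment that wins on every observation ends up at mean rank 1.0, which places it at the far end of the diagram's axis.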
 
-A diagram like the one above concisely represents multiple hypothesis tests that are conducted over the observed outcomes. Before anything is plotted at all, the Friedman test tells us whether there are significant differences at all. If this test fails, we have not sufficient data to tell any of the treatments apart and we must abort. If, however, the test sucessfully rejects this possibility we can proceed with the post-hoc analysis. In this second step, a Wilcoxon signed-rank test tells us whether each pair of treatments exhibits a significant difference. Since we are testing multiple hypotheses, we must adjust the Wilcoxon test with Holm's method. For each group of treatments which we can not distinguish from the Holm-adjusted Wilcoxon test, we add a thick line to the diagram.
+### Hypothesis testing
+
+A diagram like the one above concisely represents multiple hypothesis tests that are conducted over the observed outcomes. Before anything is plotted at all, the *Friedman test* tells us whether there are any significant differences at all. If this test fails, we do not have sufficient data to tell any of the treatments apart and we must abort. If, however, the test successfully rejects this possibility, we can proceed with the post-hoc analysis. In this second step, a *Wilcoxon signed-rank test* tells us whether each pair of treatments exhibits a significant difference.
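+This two-step procedure can be sketched with SciPy's implementations of both tests. The accuracies below are made up for illustration; this is not this package's internal code:
+
+```python
+from scipy.stats import friedmanchisquare, wilcoxon
+
+# Made-up accuracies of three treatments over ten observations.
+clf1 = [0.85, 0.80, 0.78, 0.91, 0.88, 0.83, 0.76, 0.89, 0.81, 0.84]
+clf2 = [0.80, 0.76, 0.74, 0.86, 0.84, 0.79, 0.71, 0.85, 0.77, 0.80]
+clf3 = [0.70, 0.66, 0.65, 0.75, 0.73, 0.68, 0.62, 0.74, 0.66, 0.69]
+
+# Step 1: the Friedman test checks for any significant difference at all.
+_, p_friedman = friedmanchisquare(clf1, clf2, clf3)
+if p_friedman < 0.05:
+    # Step 2: post-hoc pair-wise Wilcoxon signed-rank tests.
+    _, p12 = wilcoxon(clf1, clf2)
+    _, p13 = wilcoxon(clf1, clf3)
+    _, p23 = wilcoxon(clf2, clf3)
+```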
+
+### Multiple testing
+
+Since we are testing multiple hypotheses, we must *adjust* the Wilcoxon test with Holm's method or with Bonferroni's method. For each group of treatments which we cannot distinguish with the Holm-adjusted (or Bonferroni-adjusted) Wilcoxon test, we add a thick line to the diagram.
+
+Whether we choose Holm's method or Bonferroni's method for the adjustment depends on our personal requirements. Holm's method has the advantage of greater statistical power than Bonferroni's method, i.e., this adjustment is capable of rejecting more null hypotheses that indeed should be rejected. However, its disadvantage is that the rejection of each null hypothesis depends on the outcomes of the other null hypotheses. If this property is not desired, one should instead use Bonferroni's method, which ensures that each pair-wise hypothesis test is independent of all others.
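+The difference between the two adjustments can be illustrated with a small self-contained sketch. The p-values are made up, and this is a textbook-style implementation rather than the one used by this package:
+
+```python
+def bonferroni(p_values):
+    """Multiply each p-value by the number of tests (capped at 1)."""
+    m = len(p_values)
+    return [min(1.0, p * m) for p in p_values]
+
+def holm(p_values):
+    """Step-down adjustment: sort ascending, multiply the j-th smallest
+    p-value by (m - j) for j = 0, 1, ..., and enforce monotonicity
+    with a running maximum."""
+    m = len(p_values)
+    order = sorted(range(m), key=lambda i: p_values[i])
+    adjusted = [0.0] * m
+    running_max = 0.0
+    for j, i in enumerate(order):
+        running_max = max(running_max, min(1.0, (m - j) * p_values[i]))
+        adjusted[i] = running_max
+    return adjusted
+
+# Made-up p-values of four pair-wise Wilcoxon tests.
+p = [0.010, 0.040, 0.030, 0.005]
+print([round(x, 6) for x in bonferroni(p)])  # [0.04, 0.16, 0.12, 0.02]
+print([round(x, 6) for x in holm(p)])        # [0.03, 0.06, 0.06, 0.02]
+```
+
+Note that every Holm-adjusted p-value is at most as large as its Bonferroni counterpart, which is exactly the greater statistical power mentioned above.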
 
## Getting started
@@ -37,7 +45,9 @@ plot = CriticalDifferenceDiagrams.plot(
     :dataset_name, # the name of the observation column
     :accuracy; # the name of the outcome column
     maximize_outcome=true, # compute ranks for minimization (default) or maximization
-    title="CriticalDifferenceDiagrams.jl" # give an optional title
+    title="CriticalDifferenceDiagrams.jl", # give an optional title
+    alpha=0.05, # the significance level (default: 0.05)
+    adjustment=:holm # :holm (default) or :bonferroni
 )
 
 # configure the preamble of PGFPlots.jl (optional)
@@ -52,7 +62,9 @@ PGFPlots.save("example.svg", plot)
 
 ## Cautions
 
-The hypothesis tests underneath the CD diagram do not account for variances of the outcomes. It is therefore important that these outcomes are "reliable" in the sense that each of them is obtained from a sufficiently large sample. Ideally, they come from a cross validation or from a repeated stratified split. Moreover, all treatments must have been evaluated on the same set of observations.
+The hypothesis tests underneath the CD diagram do not account for variances of the outcomes. It is therefore important that these outcomes are *reliable* in the sense that each of them is obtained from a sufficiently large sample. Ideally, they come from a cross-validation or from a repeated stratified split. Moreover, all treatments must have been evaluated on the same set of observations.
+
+The adjustments by Holm and Bonferroni can lead to different cliques. For more information, see the [Multiple testing](#multiple-testing) section above.
 
 
 ## 2-dimensional sequences of CD diagrams

docs/src/python-wrapper.md

Lines changed: 3 additions & 1 deletion
@@ -31,7 +31,9 @@ plot = cdd.plot(
     "dataset_name", # the name of the observation column
     "accuracy", # the name of the outcome column
     maximize_outcome=True, # compute ranks for minimization (default) or maximization
-    title="CriticalDifferenceDiagrams.jl" # give an optional title
+    title="CriticalDifferenceDiagrams.jl", # give an optional title
+    alpha=0.05, # the significance level (default: 0.05)
+    adjustment="holm" # "holm" (default) or "bonferroni"
 )
 
 # configure the preamble of PGFPlots.jl (optional)
