
Commit ec40a81

v0.9.0-prepped
1 parent b458a08 commit ec40a81

35 files changed (+1776 −367 lines)

.devel/benchmarks/perf_mst_202506-hades.csv

Lines changed: 10 additions & 0 deletions
@@ -24,3 +24,13 @@
 #
 # * - compiled with -O3 -march=native
 method,elapsed,Δdist,Σdist,Δidx,nleaves,n,d,M,s,nthreads,trial,seed,time,host
+quitefast_single_kd_tree,99.88024163246155,0.0,32420.29247599636,0,22101508,100000000,2,1,norm,10,1,123,1753182443,hades
+quitefast_sesqui_kd_tree,98.463210105896,0.0,32420.29247599636,0,22101508,100000000,2,1,norm,10,1,123,1753182443,hades
+quitefast_single_kd_tree,102.39674782752991,0.0,32420.29247599636,0,22101508,100000000,2,1,norm,10,2,123,1753182443,hades
+quitefast_sesqui_kd_tree,99.7770323753357,0.0,32420.29247599636,0,22101508,100000000,2,1,norm,10,2,123,1753182443,hades
+quitefast_single_kd_tree,102.14451837539673,0.0,32420.29247599636,0,22101508,100000000,2,1,norm,10,3,123,1753182443,hades
+quitefast_sesqui_kd_tree,98.59938883781433,0.0,32420.29247599636,0,22101508,100000000,2,1,norm,10,3,123,1753182443,hades
+quitefast_single_kd_tree,179.2492880821228,0.0,84266.00496595647,0,67520649,100000000,2,10,norm,10,1,123,1753182443,hades
+quitefast_sesqui_kd_tree,268.3936069011688,0.0,84266.00496595647,7829,67520926,100000000,2,10,norm,10,1,123,1753182443,hades
+quitefast_single_kd_tree,366.5785620212555,0.0,84266.00496595647,0,67520651,100000000,2,10,norm,10,2,123,1753182443,hades
+quitefast_sesqui_kd_tree,333.3607323169708,0.0,84266.00496595647,7807,67520936,100000000,2,10,norm,10,2,123,1753182443,hades
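As an aside, here is a hedged sketch of how these benchmark rows can be summarised in base R; the relative path and the `#` comment lines are taken from the diff above, and running from the repository root is an assumption:

```r
# Sketch only: aggregate the benchmark timings added above.
# Assumes the working directory is the repository root.
b <- read.csv(".devel/benchmarks/perf_mst_202506-hades.csv", comment.char = "#")

# median elapsed time [s] per MST method for each (n, d, M) configuration
aggregate(elapsed ~ method + n + d + M, data = b, FUN = median)
```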

.devel/sphinx/news.md

Lines changed: 13 additions & 1 deletion
@@ -1,6 +1,18 @@
 # Changelog
 
-## 0.9.0 (2025-07-21)
+
+## To Do
+
+* Parallelise the K-d tree building procedure.
+
+* [Python] Set up OpenMP on macOS.
+
+* Extend the online documentation: tutorials, benchmarks, definitions.
+
+* bibliography.bib
+
+
+## 0.9.0 (2025-07-22)
 
 * [R] Initial CRAN release.
 
.devel/sphinx/rapi/knn_euclid.md

Lines changed: 13 additions & 11 deletions
@@ -2,7 +2,7 @@
 
 ## Description
 
-If `Y` is `NULL`, then the function determines the first `k` amongst the nearest neighbours of each point in `X` with respect to the Euclidean distance. It is assumed that each query point is not its own neighbour.
+If `Y` is `NULL`, then the function determines the first `k` nearest neighbours of each point in `X` with respect to the Euclidean distance. It is assumed that each query point is not its own neighbour.
 
 Otherwise, for each point in `Y`, this function determines the `k` nearest points thereto from `X`.
 
@@ -24,29 +24,31 @@ knn_euclid(
 
 | | |
 |----|----|
-| `X` | the \"database\"; a matrix of shape (n,d) |
-| `k` | number of nearest neighbours (should be rather small, say, \<= 20) |
-| `Y` | the \"query points\"; `NULL` or a matrix of shape (m,d); note that setting `Y=X`, contrary to `NULL`, will include the query points themselves amongst their own neighbours |
-| `algorithm` | `"auto"`, `"kd_tree"` or `"brute"`; K-d trees can only be used for d between 2 and 20 only; `"auto"` selects `"kd_tree"` in low-dimensional spaces |
+| `X` | the \"database\"; a matrix of shape $n\times d$ |
+| `k` | requested number of nearest neighbours (should be rather small) |
+| `Y` | the \"query points\"; `NULL` or a matrix of shape $m\times d$; note that setting `Y=X`, contrary to `NULL`, will include the query points themselves amongst their own neighbours |
+| `algorithm` | `"auto"`, `"kd_tree"` or `"brute"`; K-d trees can be used for `d` between 2 and 20 only; `"auto"` selects `"kd_tree"` in low-dimensional spaces |
 | `max_leaf_size` | maximal number of points in the K-d tree leaves; smaller leaves use more memory, yet are not necessarily faster; use `0` to select the default value, currently set to 32 |
-| `squared` | whether to return the squared Euclidean distance |
+| `squared` | whether the output `nn.dist` should be based on the squared Euclidean distance |
 | `verbose` | whether to print diagnostic messages |
 
 ## Details
 
-The implemented algorithms, see the `algorithm` parameter, assume that `k` is rather small; say, `k <= 20`.
+The implemented algorithms, see the `algorithm` parameter, assume that $k$ is rather small, say, $k \leq 20$.
 
-Our implementation of K-d trees (Bentley, 1975) has been quite optimised; amongst others, it has good locality of reference, features the sliding midpoint (midrange) rule suggested by Maneewongvatana and Mound (1999), and a node pruning strategy inspired by the discussion by Sample et al. (2001). Still, it is well-known that K-d trees perform well only in spaces of low intrinsic dimensionality. Thus, due to the so-called curse of dimensionality, for high `d`, the brute-force algorithm is recommended.
+Our implementation of K-d trees (Bentley, 1975) has been quite optimised; amongst others, it has good locality of reference (at the cost of making a copy of the input dataset), features the sliding midpoint (midrange) rule suggested by Maneewongvatana and Mount (1999), node pruning strategies inspired by some ideas from (Sample et al., 2001), and a couple of further tuneups proposed by the current author. Still, it is well-known that K-d trees perform well only in spaces of low intrinsic dimensionality. Thus, due to the so-called curse of dimensionality, for high `d`, the brute-force algorithm is recommended.
 
 The number of threads used is controlled via the `OMP_NUM_THREADS` environment variable or via the [`omp_set_num_threads`](omp.md) function at runtime. For best speed, consider building the package from sources using, e.g., `-O3 -march=native` compiler flags.
 
 ## Value
 
-A list with two elements, `nn.index` and `nn.dist`.
+A list with two elements, `nn.index` and `nn.dist`, is returned.
 
-`nn.dist` has shape (n,k) or (m,k); `nn.dist[i,]` is sorted nondecreasingly for all `i`. `nn.dist[i,j]` gives the weight of the edge `{i, ind[i,j]}`, i.e., the distance between the `i`-th point and its `j`-th NN.
+`nn.dist` and `nn.index` have shape $n\times k$ or $m\times k$, depending on whether `Y` is given.
 
-`nn.index` is of the same shape. `nn.index[i,j]` is the index (between `1` and `n`) of the `j`-th nearest neighbour of `i`.
+`nn.index[i,j]` is the index (between $1$ and $n$) of the $j$-th nearest neighbour of $i$.
+
+`nn.dist[i,j]` gives the weight of the edge `{i, nn.index[i,j]}`, i.e., the distance between the $i$-th point and its $j$-th nearest neighbour, $j=1,\dots,k$. `nn.dist[i,]` is sorted nondecreasingly for all $i$.
 
 ## Author(s)
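To complement the documentation above, here is a minimal usage sketch based solely on the parameters and return values it describes; the package name `quitefastmst` is an assumption and does not appear in this diff:

```r
# Minimal sketch of calling knn_euclid() as documented above;
# the package name (quitefastmst) is an assumption, not taken from this diff.
library("quitefastmst")

set.seed(123)
X <- matrix(rnorm(10000 * 2), ncol = 2)  # the "database": 10000 points in R^2

res <- knn_euclid(X, k = 5)   # Y left NULL: each point is not its own neighbour
str(res)                      # a list with nn.index and nn.dist, both 10000 x 5

# nn.index[i, j] is the index of the j-th nearest neighbour of point i;
# nn.dist[i, ] is sorted nondecreasingly
stopifnot(all(diff(res$nn.dist[1, ]) >= 0))
```

As noted in the parameter table, passing `Y=X` instead of leaving `Y` as `NULL` would include each query point amongst its own neighbours.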

.devel/sphinx/rapi/mst_euclid.md

Lines changed: 25 additions & 19 deletions
@@ -6,11 +6,11 @@ The function determines the/a(\*) minimum spanning tree (MST) of a set of $n$ po
 
 MSTs have many uses in, amongst others, topological data analysis (clustering, density estimation, dimensionality reduction, outlier detection, etc.).
 
-In clustering and density estimation, the parameter `M` plays the role of a smoothing factor; see (Campello et al. 2015) and the references therein for discussion. It corresponds to the <span class="pkg">hdbscan</span> Python package\'s `min_samples=M-1`.
+In clustering and density estimation, the parameter `M` plays the role of a smoothing factor; for discussion, see (Campello et al., 2015) and the references therein. `M` corresponds to the <span class="pkg">hdbscan</span> Python package\'s `min_samples=M-1`.
 
-For $M\leq 2$, we get a spanning tree that minimises the sum of uclidean distances between the points, i.e., the classic Euclidean minimum spanning tree (EMST). If $M=2$, the function additionally returns the distance to each point\'s nearest neighbour.
+For $M\leq 2$, we get a spanning tree that minimises the sum of Euclidean distances between the points, i.e., the classic Euclidean minimum spanning tree (EMST). If $M=2$, the function additionally returns the distance to each point\'s nearest neighbour.
 
-If $M>2$, the spanning tree is the smallest wrt the degree-M mutual reachability distance (Campello et al., 2013) given by $d_M(i, j)=\max\{ c_M(i), c_M(j), d(i, j)\}$, where $d(i,j)$ is the Euclidean distance between the $i$-th and the $j$-th point, and $c_M(i)$ is the $i$-th $M$-core distance defined as the distance between the $i$-th point and its $(M-1)$-th nearest neighbour (not including the query points themselves).
+If $M>2$, the spanning tree is the smallest with respect to the degree-$M$ mutual reachability distance (Campello et al., 2013) given by $d_M(i, j)=\max\{ c_M(i), c_M(j), d(i, j)\}$, where $d(i,j)$ is the standard Euclidean distance between the $i$-th and the $j$-th point, and $c_M(i)$ is the $i$-th $M$-core distance defined as the distance between the $i$-th point and its $(M-1)$-th nearest neighbour (not including the query point itself).
 
 ## Usage
 
@@ -30,41 +30,47 @@ mst_euclid(
 
 | | |
 |----|----|
-| `X` | the \"database\"; a matrix of shape (n,d) |
-| `M` | the degree of the mutual reachability distance (should be rather small, say, $\leq 20$). $M\leq 2$ denotes the ordinary Euclidean distance |
-| `algorithm` | `"auto"`, `"single_kd_tree"` `"sesqui_kd_tree"`, `"dual_kd_tree"` or `"brute"`; K-d trees can only be used for d between 2 and 20 only; `"auto"` selects `"sesqui_kd_tree"` for $d\leq 20$. `"brute"` is used otherwise |
+| `X` | the \"database\"; a matrix of shape $n\times d$ |
+| `M` | the degree of the mutual reachability distance (should be rather small). $M\leq 2$ denotes the ordinary Euclidean distance |
+| `algorithm` | `"auto"`, `"single_kd_tree"`, `"sesqui_kd_tree"`, `"dual_kd_tree"`, or `"brute"`; K-d trees can be used for $d$ between 2 and 20 only; `"auto"` selects `"sesqui_kd_tree"` for $d\leq 20$. `"brute"` is used otherwise |
 | `max_leaf_size` | maximal number of points in the K-d tree leaves; smaller leaves use more memory, yet are not necessarily faster; use `0` to select the default value, currently set to 32 for the single-tree and sesqui-tree and 8 for the dual-tree Borůvka algorithm |
-| `first_pass_max_brute_size` | minimal number of points in a node to treat it as a leaf (unless it\'s actually a leaf) in the first iteration of the algorithm; use `0` to select the default value, currently set to 32 |
-| `mutreach_adj` | adjustment for mutual reachability distance ambiguity (for $M>2$) whose fractional part should be close to 0: values in $(-1,0)$ prefer connecting to farther NNs, values in $(0, 1)$ fall for closer NNs (which is what many other implementations provide), values in $(-2,-1)$ prefer connecting to points with smaller core distances, values in $(1, 2)$ favour larger core distances; see above for more details |
+| `first_pass_max_brute_size` | minimal number of points in a node to treat it as a leaf (unless it actually is a leaf) in the first iteration of the algorithm; use `0` to select the default value, currently set to 32 |
+| `mutreach_adj` | adjustment for mutual reachability distance ambiguity (for $M>2$) whose fractional part should be close to 0: values in $(-1,0)$ prefer connecting to farther nearest neighbours, values in $(0, 1)$ fall for closer NNs (which is what many other implementations provide), values in $(-2,-1)$ prefer connecting to points with smaller core distances, values in $(1, 2)$ favour larger core distances; see below for more details |
 | `verbose` | whether to print diagnostic messages |
 
 ## Details
 
-(\*) We note that if there are many pairs of equidistant points, there can be many minimum spanning trees. In particular, it is likely that there are point pairs with the same mutual reachability distances. To make the definition less ambiguous (albeit with no guarantees), internally, the brute-force algorithm relies on the adjusted distance: $d_M(i, j)=\max\{c_M(i), c_M(j), d(i, j)\}+\varepsilon d(i, j)$ or $d_M(i, j)=\max\{c_M(i), c_M(j), d(i, j)\}-\varepsilon \min\{c_M(i), c_M(j)\}$, where $\varepsilon$ is close to 0. \|`mutreach_adj`\|\<1 selects the former formula (ε=`mutreach_adj`) whilst 1\<\|`mutreach_adj`\|\<2 chooses the latter (ε=`mutreach_adj`±1). For the K-d tree-based methods, on the other hand, `mutreach_adj` indicates the preference towards connecting to farther/closer points wrt the original metric or having smaller/larger core distances if a point $i$ has multiple nearest-neighbour candidates $j'$, $j''$ with $c_M(i) \geq \max\{d(i, j'), c_M(j')\}`$ and $c_M(i) \geq \max\{d(i, j''), c_M(j'')\}`$. Generally, the smaller the `mutreach_adj`, the more leaves there will be in the tree (note that there are only four types of adjustments, though).
+(\*) We note that if there are many pairs of equidistant points, there can be many minimum spanning trees. In particular, it is likely that there are point pairs with the same mutual reachability distances.
 
-The implemented algorithms, see the `algorithm` parameter, assume that `M` is rather small; say, $M \leq 20$.
+To make the definition less ambiguous (albeit with no guarantees), internally, the brute-force algorithm relies on the adjusted distance: $d_M(i, j)=\max\{c_M(i), c_M(j), d(i, j)\}+\varepsilon d(i, j)$ or $d_M(i, j)=\max\{c_M(i), c_M(j), d(i, j)\}-\varepsilon \min\{c_M(i), c_M(j)\}$, where $\varepsilon$ is close to $0$. \|`mutreach_adj`\|\<1 selects the former formula ($\varepsilon$=`mutreach_adj`) whilst 1\<\|`mutreach_adj`\|\<2 chooses the latter ($\varepsilon$=`mutreach_adj`±1).
 
-Our implementation of K-d trees (Bentley, 1975) has been quite optimised; amongst others, it has good locality of reference (at the cost of making a copy of the input dataset), features the sliding midpoint (midrange) rule suggested by Maneewongvatana and Mound (1999), node pruning strategies inspired by some ideas from (Sample et al. ,2001), and a couple of further tuneups proposed by the current author.
+For the K-d tree-based methods, on the other hand, `mutreach_adj` indicates the preference towards connecting to farther/closer points with respect to the original metric or having smaller/larger core distances if a point $i$ has multiple nearest-neighbour candidates $j'$, $j''$ with $c_M(i) \geq \max\{d(i, j'), c_M(j')\}$ and $c_M(i) \geq \max\{d(i, j''), c_M(j'')\}$.
 
-The \"single-tree\" version of the Borůvka algorithm is naively parallelisable: in every iteration, it seeks each point\'s nearest \"alien\", i.e., the nearest point thereto from another cluster. The \"dual-tree\" Borůvka version of the algorithm is, in principle, based on (March et al., 2010). As far as our implementation is concerned, the dual-tree approach is often only faster in 2- and 3-dimensional spaces, for $M\leq 2$, and in a single-threaded setting. For another (approximate) adaptation of the dual-tree algorithm to the mutual reachability distance, see (McInnes and Healy, 2017).
+Generally, the smaller the `mutreach_adj`, the more leaves there should be in the tree (note that there are only four types of adjustments, though).
 
-The \"sesqui-tree\" variant (by the current author) is a mixture of the two approaches: it compares leaves against the full tree. It is usually faster than the single- and dual-tree methods in very low dimensional spaces and usually not much slower than the single-tree variant otherwise.
+The implemented algorithms, see the `algorithm` parameter, assume that $M$ is rather small; say, $M \leq 20$.
 
-Nevertheless, it is well-known that K-d trees perform well only in spaces of low intrinsic dimensionality (a.k.a. the \"curse\"). For high `d`, the \"brute-force\" algorithm is recommended. Here, we provided a parallelised (see Olson, 1995) version of the Jarník (1930) (a.k.a. Prim (1957) or Dijkstra) algorithm, where the distances are computed on the fly (only once for `M<=2`).
+Our implementation of K-d trees (Bentley, 1975) has been quite optimised; amongst others, it has good locality of reference (at the cost of making a copy of the input dataset), features the sliding midpoint (midrange) rule suggested by Maneewongvatana and Mount (1999), node pruning strategies inspired by some ideas from (Sample et al., 2001), and a couple of further tuneups proposed by the current author.
+
+The \"single-tree\" version of the Borůvka algorithm is parallelised: in every iteration, it seeks each point\'s nearest \"alien\", i.e., the nearest point thereto from another cluster. The \"dual-tree\" Borůvka version of the algorithm is, in principle, based on (March et al., 2010). As far as our implementation is concerned, the dual-tree approach is often only faster in 2- and 3-dimensional spaces, for $M\leq 2$, and in a single-threaded setting. For another (approximate) adaptation of the dual-tree algorithm to mutual reachability distances, see (McInnes and Healy, 2017).
+
+The \"sesqui-tree\" variant (by the current author) is a mixture of the two approaches: it compares leaves against the full tree and can be run in parallel. It is usually faster than the single- and dual-tree methods in very low dimensional spaces and usually not much slower than the single-tree variant otherwise.
+
+Nevertheless, it is well-known that K-d trees perform well only in spaces of low intrinsic dimensionality (the \"curse\"). For high $d$, the \"brute-force\" algorithm is recommended. Here, we provide a parallelised (see Olson, 1995) version of the Jarník (1930) (a.k.a. Prim, 1957) algorithm, where the distances are computed on the fly (only once for $M\leq 2$).
 
 The number of threads used is controlled via the `OMP_NUM_THREADS` environment variable or via the [`omp_set_num_threads`](omp.md) function at runtime. For best speed, consider building the package from sources using, e.g., `-O3 -march=native` compiler flags.
 
 ## Value
 
-A list with two (M=1) or four (M\>1) elements, `mst.index` and `mst.dist`, and additionally `nn.index` and `nn.dist`.
+A list with two $(M=1)$ or four $(M>1)$ elements, `mst.index` and `mst.dist`, and additionally `nn.index` and `nn.dist`.
 
-`mst.index` is a matrix with $n-1$ rows and `2` columns, whose rows define the tree edges.
+`mst.index` is a matrix with $n-1$ rows and $2$ columns, whose rows define the tree edges.
 
-`mst.dist` is a vector of length `n-1` giving the weights of the corresponding edges.
+`mst.dist` is a vector of length $n-1$ giving the weights of the corresponding edges.
 
-The tree edges are ordered w.r.t. weights nondecreasingly, and then by the indexes (lexicographic ordering of the `(weight, index1, index2)` triples). For each `i`, it holds `mst_ind[i,1]<mst_ind[i,2]`.
+The tree edges are ordered with respect to weights nondecreasingly, and then by the indexes (lexicographic ordering of the `(weight, index1, index2)` triples). For each `i`, it holds `mst_ind[i,1]<mst_ind[i,2]`.
 
-`nn.index` is an `n` by `M-1` matrix giving the indexes of each point\'s nearest neighbours. `nn.dist` provides the corresponding distances.
+`nn.index` is an $n$ by $M-1$ matrix giving the indexes of each point\'s nearest neighbours with respect to the Euclidean distance. `nn.dist` provides the corresponding distances.
 
 ## Author(s)
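The following sketch cross-checks the documented definition of the degree-$M$ mutual reachability distance against the returned edge weights; the package name `quitefastmst` is assumed, and the check is only approximate in the presence of ties and the internal `mutreach_adj` adjustment:

```r
# Sketch: MST under the degree-M mutual reachability distance, as documented above.
# The package name (quitefastmst) is an assumption.
library("quitefastmst")

set.seed(123)
M <- 5
X <- matrix(rnorm(10000 * 2), ncol = 2)

res <- mst_euclid(X, M = M)     # four elements for M > 1
dim(res$mst.index)              # (n-1) x 2; mst.index[i,1] < mst.index[i,2]
length(res$mst.dist)            # n-1 edge weights, sorted nondecreasingly

# Cross-check one edge against d_M(i,j) = max{c_M(i), c_M(j), d(i,j)},
# where c_M(i) is the distance to the (M-1)-th nearest neighbour of i:
i <- res$mst.index[1, 1]; j <- res$mst.index[1, 2]
d_ij <- sqrt(sum((X[i, ] - X[j, ])^2))  # Euclidean distance between points i and j
c_i  <- res$nn.dist[i, M - 1]           # M-core distance of i
c_j  <- res$nn.dist[j, M - 1]           # M-core distance of j
max(c_i, c_j, d_ij)   # should agree with res$mst.dist[1] up to ties/adjustments
```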

.devel/sphinx/rapi/omp.md

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ omp_get_max_threads()
 
 `omp_get_max_threads` returns the maximal number of threads that will be used during the next call to a parallelised function, not the maximal number of threads possibly available. If there is no built-in support for OpenMP, 1 is always returned.
 
-For `omp_set_num_threads`, the previous value of `max_threads` is output.
+For `omp_set_num_threads`, the previous value of `max_threads` is returned.
 
 ## Author(s)
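For completeness, a short sketch of the runtime thread-count control described above (package name assumed as before):

```r
# Sketch: controlling OpenMP threads at runtime; quitefastmst is an assumed package name.
library("quitefastmst")

omp_get_max_threads()          # threads to be used by the next parallelised call (1 without OpenMP)
old <- omp_set_num_threads(2)  # returns the previous value of max_threads
omp_set_num_threads(old)       # restore the previous setting
```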
