FasterRaha

Collection of different cluster algorithms integrated in Raha to improve scalability.

To improve the scalability of Raha, multiple cluster algorithms have been implemented, and their usage has been parallelized. In addition, duplicates can now be eliminated to decrease runtime.

Usage

It is recommended to use k-means or hierarchical agglomerative clustering with single-linkage, combined with feature reduction. If you have more knowledge about your dataset, you are free to use the other implemented clustering algorithms.

cluster4raha/fasterdetection.py
- the class "FasterDetection" is designed to replace the class "Detection" from Raha.
- FasterDetection.CLUSTER_ALGORITHM defines the clustering algorithm that is used.
- FasterDetection.FEATUREREDUCTION determines whether deduplication (and the subsequent reduplication) is activated.
- FasterDetection.N_JOBS determines how many cores are used for high-level parallelisation.
- all cluster algorithms can be configured with:
```
FasterDetection.MBATCH_SIZE
FasterDetection.BIRCH_THRESH
FasterDetection.HDBSCAN_MIN_CLUSTER_SIZE
FasterDetection.HDBSCAN_MIN_SAMPLES
FasterDetection.PARC_DIST_STD_LOCAL
FasterDetection.PARC_JAC_STD_GLOBAL
```
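To picture what the de- and reduplication toggled by FEATUREREDUCTION refers to, here is a minimal sketch in plain Python. It is not FasterRaha's code: the helper names `deduplicate`, `reduplicate` and `toy_cluster` are hypothetical, and the clustering step is a trivial stand-in. The idea shown is only that identical rows are collapsed before clustering and the resulting labels are mapped back to every original row afterwards.

```python
def deduplicate(rows):
    """Return (unique_rows, inverse) so that rows[i] == unique_rows[inverse[i]]."""
    index_of = {}      # maps a row to its position in unique_rows
    unique_rows = []
    inverse = []
    for row in rows:
        key = tuple(row)
        if key not in index_of:
            index_of[key] = len(unique_rows)
            unique_rows.append(key)
        inverse.append(index_of[key])
    return unique_rows, inverse


def reduplicate(unique_labels, inverse):
    """Expand the labels of the unique rows back to the original row order."""
    return [unique_labels[i] for i in inverse]


def toy_cluster(unique_rows):
    """Hypothetical stand-in for a real clustering algorithm:
    assigns each row to a cluster named after its first feature."""
    return [int(row[0]) for row in unique_rows]


features = [(0, 1), (1, 0), (0, 1), (0, 1), (1, 0), (1, 1)]
unique_rows, inverse = deduplicate(features)      # clustering sees 3 rows, not 6
labels = reduplicate(toy_cluster(unique_rows), inverse)
```

In this sketch the clustering step only runs on the distinct rows, which is where the runtime saving would come from when a dataset contains many identical rows.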
cluster4raha/benchmarks.py
- this file contains a script which can be used to benchmark FasterDetection. It generates an output directory which contains the output files. Output files called average.bench contain the averaged results of the benchmarks. All other files store the measured values for each iteration on one dataset.
- the script accepts the following arguments:
  - algorithms : determines which algorithms are benchmarked. More than one can be given. Choices: ['kmeans', 'mbatch', 'hdbscan', 'agglomerative', 'kmodes', 'parc', 'birch', 'agglomerative_single']
  - --big (optional) (default = false) : adds the tax dataset to the datasets that are benchmarked
  - --iter (optional) (default = 10) : sets how many iterations are run for the benchmark
  - --verbose (optional) (default = false) : adds a counter for the iterations
  - --n_jobs (optional) (default = 1) : sets how many cores are used for high-level parallelisation
  - --datasets (optional) (default = all) : sets which datasets the benchmark is run on. More than one can be given. Choices: ['rayyan', 'hospital', 'beers', 'movies_1', 'flights']
  - --reduction (optional) (default = false) : activates the deduplication of the dataset. This is only useful for k-means and hierarchical agglomerative clustering with single-linkage
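The argument list above can be mirrored with `argparse` as follows. This is a sketch under assumptions, not the actual parser in benchmarks.py: the argument names, defaults and choices come from the list above, while the way multiple values are passed (space-separated) and the use of on/off flags for the boolean options are guesses made for illustration.

```python
import argparse

ALGORITHMS = ["kmeans", "mbatch", "hdbscan", "agglomerative",
              "kmodes", "parc", "birch", "agglomerative_single"]
DATASETS = ["rayyan", "hospital", "beers", "movies_1", "flights"]


def build_parser():
    """Hypothetical reconstruction of the benchmark command line."""
    parser = argparse.ArgumentParser(description="Benchmark sketch")
    # One or more algorithms, given as positional arguments (assumption).
    parser.add_argument("algorithms", nargs="+", choices=ALGORITHMS)
    parser.add_argument("--big", action="store_true")       # add the tax dataset
    parser.add_argument("--iter", type=int, default=10)     # iterations per benchmark
    parser.add_argument("--verbose", action="store_true")   # print an iteration counter
    parser.add_argument("--n_jobs", type=int, default=1)    # cores for parallelisation
    parser.add_argument("--datasets", nargs="+", choices=DATASETS,
                        default=DATASETS)                   # default: all datasets
    parser.add_argument("--reduction", action="store_true") # enable deduplication
    return parser


# Parse an example command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["kmeans", "agglomerative_single", "--iter", "5",
     "--datasets", "hospital", "beers", "--reduction"]
)
```

Under these assumptions, the example corresponds to benchmarking k-means and single-linkage agglomerative clustering for five iterations on the hospital and beers datasets with deduplication enabled.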
cluster4raha/visualization.py
- uses t-SNE to generate one visualisation of the clusters produced on a dataset and one visualisation of the ground truth values.
  - the algorithm and the dataset must be set inside the code

References

FasterRaha is based upon and uses code from Raha.
