Collection of different cluster algorithms integrated in Raha to improve scalability.
To improve the scalability of Raha, multiple cluster algorithms have been implemented, and their use has been parallelized. In addition, duplicates can now be eliminated to decrease runtime.
It is recommended to use k-means or hierarchical agglomerative clustering with single-linkage, combined with feature reduction. If you know more about your dataset, you are free to use the other implemented clustering algorithms.
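The duplicate elimination mentioned above can be pictured with the following minimal sketch. It is not the code used in FasterDetection: the function names and the toy clusterer are hypothetical and only illustrate the general idea of clustering each distinct feature vector once and then copying its label back to every duplicate.

```python
def cluster_with_deduplication(rows, cluster_fn):
    """Cluster only the distinct rows, then copy each label back to all duplicates."""
    # Deduplication: map every distinct feature vector to a position in unique_rows.
    index_of = {}
    unique_rows = []
    inverse = []  # inverse[i] is the position of rows[i] in unique_rows
    for row in rows:
        key = tuple(row)
        if key not in index_of:
            index_of[key] = len(unique_rows)
            unique_rows.append(key)
        inverse.append(index_of[key])

    # The expensive clustering step sees each distinct vector only once.
    unique_labels = cluster_fn(unique_rows)

    # Reduplication: every original row inherits the label of its representative.
    labels = [unique_labels[i] for i in inverse]
    return labels, len(unique_rows)


def toy_cluster_fn(unique_rows):
    """Hypothetical stand-in for a real clusterer: groups vectors by their feature sum."""
    return [sum(row) for row in unique_rows]


rows = [(1, 0, 0), (1, 0, 0), (0, 1, 1), (1, 0, 0), (0, 1, 1), (1, 1, 1)]
labels, n_unique = cluster_with_deduplication(rows, toy_cluster_fn)
print(labels, n_unique)  # [1, 1, 2, 1, 2, 3] 3
```

Here six rows shrink to three distinct vectors before clustering. Whether the actual implementation weights the unique rows by how often they occur is not stated here; this sketch does not.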
-
cluster4raha/fasterdetection.py
-
The class "FasterDetection" is designed to replace the class "Detection" from Raha.
- FasterDetection.CLUSTER_ALGORITHM: defines the clustering algorithm that is used.
- FasterDetection.FEATUREREDUCTION: determines whether deduplication and reduplication are activated.
- FasterDetection.N_JOBS: determines how many cores are used for high-level parallelisation.
- The cluster algorithms can be configured with: FasterDetection.MBATCH_SIZE, FasterDetection.BIRCH_THRESH, FasterDetection.HDBSCAN_MIN_CLUSTER_SIZE, FasterDetection.HDBSCAN_MIN_SAMPLES, FasterDetection.PARC_DIST_STD_LOCAL, FasterDetection.PARC_JAC_STD_GLOBAL
-
cluster4raha/benchmarks.py
- This file contains a script that can be used to benchmark FasterDetection. It generates an output directory containing the output files. Files named average.bench contain the averaged results of the benchmarks. All other files store the measured values for each iteration on one dataset.
- Several arguments can be given:
  - algorithms : determines which algorithms should be benchmarked. More than one can be given. ['kmeans', 'mbatch', 'hdbscan', 'agglomerative', 'kmodes', 'parc', 'birch', 'agglomerative_single']
  - --big (optional) (default = false) : adds the tax dataset to the datasets that are benchmarked
  - --iter (optional) (default = 10) : sets how many iterations are run for the benchmark
  - --verbose (optional) (default = false) : adds a counter for the iterations
  - --n_jobs (optional) (default = 1) : sets how many cores are used for high-level parallelisation
  - --datasets (optional) (default = all) : sets which datasets the benchmark should be run on. More than one can be given. ["rayyan", "hospital", "beers", "movies_1", "flights"]
  - --reduction (optional) (default = false) : activates the deduplication of the dataset. This is only useful for k-means and hierarchical agglomerative clustering with single-linkage
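To show how the arguments listed above fit together, here is an illustrative argparse sketch. It is not taken from benchmarks.py. It assumes that the algorithms are passed as positional values and that the boolean options are plain switches; the real script may define them differently.

```python
import argparse

ALGORITHMS = ["kmeans", "mbatch", "hdbscan", "agglomerative",
              "kmodes", "parc", "birch", "agglomerative_single"]
DATASETS = ["rayyan", "hospital", "beers", "movies_1", "flights"]


def build_parser():
    parser = argparse.ArgumentParser(description="Benchmark FasterDetection (illustrative sketch).")
    # Assumption: one or more algorithms are given as positional values.
    parser.add_argument("algorithms", nargs="+", choices=ALGORITHMS)
    # Assumption: boolean options are switches that default to False.
    parser.add_argument("--big", action="store_true", help="also benchmark the tax dataset")
    parser.add_argument("--iter", type=int, default=10, help="number of benchmark iterations")
    parser.add_argument("--verbose", action="store_true", help="show an iteration counter")
    parser.add_argument("--n_jobs", type=int, default=1, help="cores for high-level parallelisation")
    parser.add_argument("--datasets", nargs="+", choices=DATASETS, default=DATASETS)
    parser.add_argument("--reduction", action="store_true", help="deduplicate the dataset")
    return parser


# Example: benchmark two algorithms on two datasets with deduplication enabled.
args = build_parser().parse_args(
    ["kmeans", "agglomerative_single", "--iter", "5", "--n_jobs", "4",
     "--reduction", "--datasets", "hospital", "beers"]
)
print(args.algorithms, args.iter, args.n_jobs, args.reduction, args.datasets)
```

The example combines --reduction with exactly the two algorithms for which deduplication is described as useful.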
-
cluster4raha/visualization.py
- Generates a visualisation of the clusters found on a dataset and a visualisation of the ground truth values, both using t-SNE.
- The algorithm and the dataset to visualise must be set inside the code.
FasterRaha is based upon and uses code from Raha.