LUH-DBS/FasterRaha

FasterRaha

Collection of different cluster algorithms integrated in Raha to improve scalability.

To improve the scalability of Raha, multiple clustering algorithms have been implemented and their execution has been parallelized. In addition, duplicates can now be eliminated to decrease runtime.
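The duplicate elimination mentioned above can be illustrated with a minimal sketch in plain Python: identical feature vectors are collapsed before clustering, and the resulting labels are mapped back to every original row afterwards. This is not the code used in FasterRaha; the function names and the `toy_cluster` routine are hypothetical stand-ins chosen only to keep the example self-contained.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Tuple[int, ...]


def cluster_with_deduplication(
    vectors: Sequence[Vector],
    cluster_fn: Callable[[List[Vector]], List[int]],
) -> List[int]:
    """Cluster only the unique vectors, then restore one label per input row."""
    index_of: Dict[Vector, int] = {}  # unique vector -> position in `unique`
    unique: List[Vector] = []         # deduplicated vectors, first-seen order
    inverse: List[int] = []           # for each input row, index into `unique`

    # Deduplication: every distinct vector is stored once.
    for vector in vectors:
        key = tuple(vector)
        if key not in index_of:
            index_of[key] = len(unique)
            unique.append(key)
        inverse.append(index_of[key])

    # The (possibly expensive) clustering step sees only unique vectors.
    unique_labels = cluster_fn(unique)

    # Re-duplication: copy each unique label back to all matching rows.
    return [unique_labels[i] for i in inverse]


def toy_cluster(vectors: List[Vector]) -> List[int]:
    """Hypothetical placeholder for a real clustering algorithm:
    the label is simply the first component of each vector."""
    return [int(v[0]) for v in vectors]


features = [(0, 1), (1, 0), (0, 1), (1, 0), (1, 1)]
labels = cluster_with_deduplication(features, toy_cluster)
print(labels)
```

In this sketch the clustering routine is called on three unique vectors instead of five rows, which is where the runtime saving would come from on data with many repeated feature vectors.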

Usage

It is recommended to use k-means or hierarchical agglomerative clustering with single-linkage, in both cases with feature reduction enabled. If you know more about your dataset, you are free to use any of the other implemented clustering algorithms.

  • cluster4raha/fasterdetection.py

    • the class "FasterDetection" is designed to replace the class "Detection" from Raha.

    • FasterDetection.CLUSTER_ALGORITHM defines the clustering algorithm that is used.

    • FasterDetection.FEATUREREDUCTION determines whether deduplication (and the subsequent re-duplication) is activated.

    • FasterDetection.N_JOBS determines how many cores are used for high-level parallelisation.

    • the clustering algorithms can be configured with the following parameters:

      FasterDetection.MBATCH_SIZE
      FasterDetection.BIRCH_THRESH
      FasterDetection.HDBSCAN_MIN_CLUSTER_SIZE
      FasterDetection.HDBSCAN_MIN_SAMPLES
      FasterDetection.PARC_DIST_STD_LOCAL
      FasterDetection.PARC_JAC_STD_GLOBAL
  • cluster4raha/benchmarks.py

    • this file contains a script that can be used to benchmark FasterDetection. It generates an output directory containing the result files. Files named average.bench contain the averaged benchmark results. All other files store the measured values for each iteration on one dataset.
    • the script accepts the following arguments:
      • algorithms : determines which algorithms are benchmarked. More than one can be given. ['kmeans', 'mbatch', 'hdbscan', 'agglomerative', 'kmodes', 'parc', 'birch', 'agglomerative_single']
      • --big (optional) (default = false) : adds the tax dataset to the datasets that are benchmarked
      • --iter (optional) (default = 10) : sets how many iterations are run for the benchmark
      • --verbose (optional) (default = false) : adds a counter for the iterations
      • --n_jobs (optional) (default = 1) : sets how many cores are used for high-level parallelisation
      • --datasets (optional) (default = all) : sets which datasets the benchmark is run on. More than one can be given. ["rayyan", "hospital", "beers", "movies_1", "flights"]
      • --reduction (optional) (default = false) : activates the deduplication of the dataset. This is only useful for k-means and hierarchical agglomerative clustering with single-linkage
  • cluster4raha/visualization.py

    • generates a t-SNE visualisation of the clusters produced on a dataset, along with a visualisation of the ground-truth values.
      • the algorithm and dataset to visualise need to be set inside the code
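The benchmark arguments listed above could be modelled with `argparse` roughly as follows. This is a reconstruction from the argument list in this README, not the actual contents of benchmarks.py: treating the boolean options as `store_true` flags, the helper names, and the identifier "tax" for the large dataset are all assumptions.

```python
import argparse
from typing import List

ALGORITHMS = ["kmeans", "mbatch", "hdbscan", "agglomerative",
              "kmodes", "parc", "birch", "agglomerative_single"]
DATASETS = ["rayyan", "hospital", "beers", "movies_1", "flights"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented benchmark arguments."""
    parser = argparse.ArgumentParser(description="Benchmark sketch")
    # One or more algorithms, restricted to the documented names.
    parser.add_argument("algorithms", nargs="+", choices=ALGORITHMS)
    # Assumption: boolean options are plain flags that default to False.
    parser.add_argument("--big", action="store_true")
    parser.add_argument("--iter", type=int, default=10)
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("--n_jobs", type=int, default=1)
    # Defaults to all five datasets when the option is not given.
    parser.add_argument("--datasets", nargs="+", choices=DATASETS,
                        default=DATASETS)
    parser.add_argument("--reduction", action="store_true")
    return parser


def selected_datasets(args: argparse.Namespace) -> List[str]:
    """Return the datasets to benchmark, adding the large one for --big."""
    datasets = list(args.datasets)
    if args.big:
        datasets.append("tax")  # assumed identifier for the tax dataset
    return datasets


args = build_parser().parse_args(
    ["kmeans", "agglomerative_single", "--iter", "3", "--reduction"]
)
print(args.algorithms, args.iter, args.reduction, selected_datasets(args))
```

Under these assumptions, the parsed arguments above would correspond to a command line such as `python benchmarks.py kmeans agglomerative_single --iter 3 --reduction`, which benchmarks the two recommended algorithms with deduplication enabled on all default datasets.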

References

FasterRaha is based upon and uses code from Raha.
