Skip to content

Rank-based outcome space #421

@kahaaga

Description

@kahaaga

Describe the feature you'd like to have

While implementing the Chatterjee coefficient of correlation in Associations.jl, I needed to compute ranks of an input vector, according to some method. That made me think: why not have an encoding and an outcome space in ComplexityMeasures.jl that does the same? I imagine something like this:

The encoding takes as an input the input data, which allows a mapping encoding the raw values xi onto an index idx_i, and decode that index onto xi. If length(x) == length(rank(x)), then there are as many outcomes as there are values in x (so the outcome space is a bit useless for complexity quantification). However, in datasets with repetitions, we can have length(rank(x)) << length(x). Then each rank index is an outcome, and we meaningfully estimate counts/probabilities by counting how many input data points map to each rank index.

Cite scientific papers related to the feature/algorithm

I have no references atm. This may or may not have been done by someone before.

If possible, sketch out an implementation strategy

Relatively straight-forward: follow the dev docs on implementing the encoding and outcome space.

A few notes:

  • The encoding must store the raw data, so that we can both and encode/decode.
  • There are many possible rank methods. The user can control which method to use by applying keyword argument to the encoding/outcome space (probably most elegant with dispatch on a T <: RankType). StatsBase.jl implements a few such ranking methods we can use for inspiration. I also implemented a randomization-tie-breaking variant for the Chatterjee coefficient that we would use.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions