average_cell_entropy - string conversion causes incorrect outputs #36

@my-alaska

Hello! I really enjoy working with this library. It's my first time working with CA and the library makes the learning very easy.

I noticed a small problem while working with larger numbers of unique values.

In the average_cell_entropy() function, shannon_entropy() is called on the concatenated string conversions of each column's elements.
The entropy is therefore computed over a string, that is, over its individual characters rather than over the original cell values.

This creates an issue when a single input value is converted to a string of multiple characters. For example, the number 10 becomes the string "10", which is made of 2 different characters.

Computing entropy on such strings can lead to incorrect results.

Easy example to reproduce:

>>> import numpy as np
>>> from cellpylib import average_cell_entropy
>>> average_cell_entropy(np.array([[9]]))
0.0
>>> average_cell_entropy(np.array([[10]]))
1.0

Computing entropy on a 1x1 array, which holds a single unique value, should always return 0.0: that one value has probability 1.

With the current implementation, the input 10 returns entropy 1.0 instead of 0.0. That is the entropy of two unique values with probability 0.5 each.
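
To illustrate where the 1.0 comes from, here is a minimal sketch of a character-level Shannon entropy. The helper `char_entropy` is hypothetical and written only for this illustration; it is not the library's `shannon_entropy()`. It assumes base-2 logarithms, which is consistent with the 1.0 reported above.

```python
import math
from collections import Counter


def char_entropy(s):
    """Shannon entropy (base 2) over the characters of a string.

    Hypothetical helper for illustration, not cellpylib code.
    """
    counts = Counter(s)
    total = len(s)
    # Sum p * log2(1/p) over every distinct character.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


print(char_entropy("9"))   # one distinct character
print(char_entropy("10"))  # two distinct characters, probability 0.5 each
```

The string "9" has one symbol with probability 1, while "10" is treated as two symbols, '1' and '0', even though it came from a single cell value.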

This can lead to faulty results when the number of unique values is large enough that some of them have multiple digits.
If the column values are [0, 1, 2, ..., 19], the string used for the entropy computation is "012345678910111213141516171819", in which '1' accounts for 12 of the 30 characters, over one third.
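
One possible direction, sketched here under assumptions, is to count whole cell values as symbols instead of characters. The helper `symbol_entropy` is hypothetical and does not reflect how cellpylib is implemented or how the maintainers might fix this; it only contrasts the two ways of counting on the [0, 1, ..., 19] column.

```python
import math
from collections import Counter


def symbol_entropy(symbols):
    """Shannon entropy (base 2), treating each item of `symbols` as one symbol.

    Hypothetical helper for illustration, not cellpylib code.
    Works on a string (items are characters) or a list of cell values.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


column = list(range(20))                   # cell values 0..19
joined = "".join(str(v) for v in column)   # the concatenated string form

print(joined.count("1"), len(joined))  # 12 30
print(symbol_entropy(joined))          # entropy over characters
print(symbol_entropy(column))          # entropy over values, equals log2(20)
print(symbol_entropy([10]))            # 0.0, a single value regardless of digits
```

Counting values keeps 20 equally likely symbols, whereas counting characters yields only 10 symbols with '1' heavily over-represented, so the character-level entropy comes out lower.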
