Error - Found multiple entries with exon_id #360

@monicarojasp

Describe the bug
I am trying to generate the JSON file for Ensembl version 110.

Running the script https://github.com/bcgsc/mavis/blob/master/src/tools/generate_ensembl_json.py fails with the following error:

Found multiple entries with exon_id=ENSE00001132905 ([('8', 144464809, 144465096, '-', '', 'ENSG00000291316'), ('8', 144464809, 144465096, '-', 'TMEM276', 'ENSG00000291317')])

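This occurs because the same exon is associated with more than one gene annotation, which can happen in Ensembl data due to overlapping or alternative annotations. Here, exon ENSE00001132905 maps to both ENSG00000291316 (no gene name) and ENSG00000291317 (TMEM276).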
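As a minimal sketch of the condition described above, the snippet below groups records by exon ID and reports any exon that maps to more than one distinct record. The two records are copied from the error message; the field order (chromosome, start, end, strand, gene name, gene ID) is inferred from that message, and `find_ambiguous_exons` is a hypothetical helper, not a function from MAVIS.

```python
from collections import defaultdict

# Records copied from the error message above. The field order
# (chromosome, start, end, strand, gene name, gene ID) is an assumption.
rows = [
    ("ENSE00001132905", ("8", 144464809, 144465096, "-", "", "ENSG00000291316")),
    ("ENSE00001132905", ("8", 144464809, 144465096, "-", "TMEM276", "ENSG00000291317")),
]


def find_ambiguous_exons(rows):
    """Return {exon_id: [records]} for exons with more than one distinct record."""
    by_exon = defaultdict(list)
    for exon_id, record in rows:
        # Keep input order and skip exact duplicates of the same record.
        if record not in by_exon[exon_id]:
            by_exon[exon_id].append(record)
    return {
        exon_id: records
        for exon_id, records in by_exon.items()
        if len(records) > 1
    }


ambiguous = find_ambiguous_exons(rows)
for exon_id, records in ambiguous.items():
    print(exon_id, [record[-1] for record in records])
```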

  1. Clone and install MAVIS tools and dependencies
    git clone https://github.com/bcgsc/mavis.git
    cd mavis
    pip install ".[tools]"
  2. Ensure compatible versions of dependencies
    MAVIS requires a compatible version of numpy and pandas. If using an incompatible version, you may encounter the following error:
    ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
    To resolve this:
    pip uninstall -y numpy pandas
    pip install "numpy<2.0" "pandas<2.0"
    Verify the installed versions; the commands below should print 1.26.4 and 1.5.3, respectively:
    python -c "import numpy; print(numpy.__version__)"
    python -c "import pandas; print(pandas.__version__)"
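The same version check can be done from Python without importing the packages themselves. This is only a sketch: `major_version` and `check_below_major` are hypothetical helper names, and the "below 2.0" limit simply mirrors the `pip install` pins used above.

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '1.26.4'."""
    return int(version.split(".")[0])


def check_below_major(package: str, max_major: int = 2) -> str:
    """Describe whether the installed `package` has a major version below `max_major`."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package}: not installed"
    status = "ok" if major_version(installed) < max_major else "too new"
    return f"{package} {installed}: {status}"


if __name__ == "__main__":
    for name in ("numpy", "pandas"):
        print(check_below_major(name))
```

Reading the version from package metadata avoids triggering the binary incompatibility error that importing a mismatched numpy/pandas pair can raise.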

Running the Script
Run the script with your species and Ensembl version:
python src/tools/generate_ensembl_json.py -s human -r 110 -o my_output/ensembl_human_v110.json
To Reproduce
Steps to reproduce the behavior:

  1. run command '...'
  2. See error ...

Expected behavior

  1. Print a warning message if multiple results exist (len(results) > 1).
  2. Prioritize entries that have a non-empty gene name.
  3. If both entries have gene names, choose the gene with the higher confidence or more supporting evidence for its association with the exon, if that information is available. If no clinical prioritization is available and the exon is not tied to any known disease, select the first gene in the list.
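The first two rules above, plus the final "first in the list" fallback, could be sketched as follows. This is an illustration under assumptions, not the script's actual code: `pick_exon_entry` and `GENE_NAME_INDEX` are hypothetical names, the tuple layout is inferred from the error message, and the confidence/evidence ranking in rule 3 is not implemented because the records shown carry no such field.

```python
import warnings

# Assumed position of the gene name in each record, inferred from the
# error message: (chromosome, start, end, strand, gene name, gene ID).
GENE_NAME_INDEX = 4


def pick_exon_entry(exon_id, results):
    """Pick one record for `exon_id`, warning instead of failing on duplicates."""
    if not results:
        raise KeyError(f"No entries found with exon_id={exon_id}")
    if len(results) > 1:
        # Rule 1: warn rather than raise when several records match.
        warnings.warn(
            f"Found multiple entries with exon_id={exon_id} ({results}); "
            "choosing one"
        )
    # Rule 2: prefer records that have a non-empty gene name.
    named = [record for record in results if record[GENE_NAME_INDEX]]
    # Fallback from rule 3: take the first remaining candidate.
    return named[0] if named else results[0]


results = [
    ("8", 144464809, 144465096, "-", "", "ENSG00000291316"),
    ("8", 144464809, 144465096, "-", "TMEM276", "ENSG00000291317"),
]
chosen = pick_exon_entry("ENSE00001132905", results)
print(chosen)
```

With the two records from the error message, this selects the TMEM276 entry because it is the only one with a gene name.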

Input Data
-s human -r 110

Versions (please complete the following information):

  • OS: MacOS Sequoia 15.4.1
  • Python Version: Python 3.9.6
