v3.2.4 stalls on very large dataset but v3.2.1 does not #238

@peter-kanvas

Description

I have a workflow that uses kmc to count all k-mers in an extremely large dataset of about 250,000 fasta files. The workflow was originally built with v3.2.1 of kmc, but it stalled when I updated to v3.2.4. Unfortunately, it doesn't exit or report an error. Here are the details I can provide:

KMC call:

kmc -fm -ci0 -cx100000000000 -t94 -k75 -m745 @reference_list database databse_dir

Result:

The program spends some time printing * characters, then prints Stage 1: 0% and stalls. There are 511 bin files in the workdir. htop shows no CPU activity, but the kmc processes are still listed.

Before changing versions, I spent time trying to make sure that none of the fasta.gz files were corrupted.

  • gzip -t was clean for all genomes
  • py_fasta_validator did not indicate a problem with any of the fasta formatting
  • I ran kmc on each genome individually and it returned a result for all of them. It did fail on a few genomes at first, but those passed when I reran them; the initial failures may have been caused by using xargs to run 94 jobs in parallel.
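
For anyone who wants to repeat the first check, here is a minimal sketch of a serial gzip -t pass. It assumes the list file (reference_list in the call above) holds one path per line. The function name check_gz is hypothetical and not part of kmc. Running serially is slower, but it removes the 94-way parallelism as a possible cause of intermittent failures.

```shell
#!/bin/sh
# check_gz (hypothetical helper): run `gzip -t` on every path listed in a file
# and print the paths that fail the integrity test, one per line.
# Assumption: the list file contains one path per line.
check_gz() {
    list_file="$1"
    while IFS= read -r path; do
        # Skip empty lines in the list.
        [ -n "$path" ] || continue
        # gzip -t exits non-zero if the file is corrupt or not gzip data.
        if ! gzip -t "$path" 2>/dev/null; then
            printf '%s\n' "$path"
        fi
    done < "$list_file"
}
```

Usage: `check_gz reference_list > failed.txt`; an empty failed.txt means every listed file passed.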
