Commit 22026e5

Add options for an initial search length and set of characters to exclude from k-mer candidacy
1 parent 6dc2d83 commit 22026e5

4 files changed

+118
-18
lines changed

docs/source/commands.rst

Lines changed: 29 additions & 3 deletions
@@ -58,15 +58,26 @@ to be unique at each position from a given range of k-mer lengths. See the

 Options
 -------
+- `initial-search-length`: The initial k-mer length to search for unique sequences.
+  Only valid when the set of k-mer lengths is a continuous range specified
+  with the ``kmer-lengths`` option (which is a pair of values separated by a
+  colon). Useful when the majority of largest minimum unique lengths are
+  likely to be much smaller than the maximum search length from your specified range.
+- `exclude-bases`: A string of bases to exclude from the search for unique
+  sequences. Case sensitive. Default is 'Nn'.
 - `kmer-batch-size`: The maximum number of sequence positions to search for at
   a time per sequence ID. Useful for controlling memory requirements. Default
   is 1000000.
-- `thread-count`: The number of threads to use for counting on the index. Default is 1.
-- `verbose`: Print verbose output. Includes summary statistics at end of each sequence. Default is False.
+- `thread-count`: The number of threads to use for counting on the index.
+  Default is 1.
+- `verbose`: Print verbose output. Includes summary statistics at end of each
+  sequence. Default is False.

 Positional Arguments
 --------------------
-- `kmer-lengths`: The range of k-mer lengths to search for unique sequences.
+- `kmer-lengths`: The range of k-mer lengths to search for unique sequences. A
+  colon separated pair of values specifies a continuous range. A comma
+  separated list specifies specific lengths to search.
 - `index-file`: The name of the index file to use for searching for unique sequences.
 - `fasta-file`: The name of the fasta file containing sequence(s) where each
   sequence ID will have a ``unique`` file generated. Must be equal to or a
@@ -101,6 +112,21 @@ Notably if your the largest k-mer length found is smaller than the maximum
 length and your minimum is larger than your (colon separated) range, it
 signifies that the sequence has been exhaustively searched.

+Ambiguous bases
+^^^^^^^^^^^^^^^
+
+Due to the implementation of the AWFM-index, `all non-ACGT bases are treated as
+an equivalent base
+<https://almob.biomedcentral.com/articles/10.1186/s13015-021-00204-6/tables/1>`_.
+Unless specified otherwise, if a k-mer sequence contains a non-ACGT base, it
+will be counted on the index as an exact match to any other non-ACGT base. By
+default, the bases 'N' and 'n' are treated as ambiguous bases and will be
+excluded from being candidates for unique k-mers. The index library does not
+distinguish between upper and lower cases; however, we provide the ability to
+exclude candidate k-mers based on case sensitivity to allow flexibility for
+soft-masked sequences conventionally introduced by software such as
+`RepeatMasker <https://www.repeatmasker.org/>`_.
+
 Threading
 ^^^^^^^^^
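The documented rules for ``kmer-lengths`` and ``initial-search-length`` can be sketched as follows. This is a minimal illustration, not newmap's actual parsing code: the helper names `parse_kmer_lengths` and `check_initial_search_length` are hypothetical, and returning the range as a `[min, max]` pair is an assumption made for the example.

```python
def parse_kmer_lengths(arg: str) -> tuple[list[int], bool]:
    """Return (lengths, is_range) for a 'min:max' range or a comma list.

    Hypothetical helper; the name and return shape are assumptions.
    """
    if ":" in arg:
        # A colon separated pair describes a continuous, inclusive range
        low, high = (int(value) for value in arg.split(":"))
        if low > high:
            raise ValueError("range minimum exceeds range maximum")
        return [low, high], True
    # A comma separated list names the specific lengths to search
    return [int(value) for value in arg.split(",")], False


def check_initial_search_length(initial: int, is_range: bool) -> None:
    """Reject an initial search length unless a colon range was given."""
    # 0 is the option's default and is treated as "not set"
    if initial and not is_range:
        raise ValueError("Initial search length only valid when a range of "
                         "k-mer lengths is given")


lengths, is_range = parse_kmer_lengths("20:200")
check_initial_search_length(32, is_range)  # accepted: a range was given
```

With a comma separated list such as `24,36,50`, the same check raises `ValueError` for any non-zero initial search length.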

docs/source/umap.rst

Lines changed: 2 additions & 1 deletion
@@ -12,5 +12,6 @@ To reproduce the ``unique.uint8`` datasets that would have been generated from
 Umap, the following criteria must be met:

 1. K-mers overlapping specifically with dinucleotide sequence ``N`` are not
-   considered unique. This is the default.
+   considered unique and should be excluded as candidates. Note that this
+   excludes any soft-masked regions with ``n`` characters.
 2. Only the following k-mer lengths should be included: 24,36,50,100,150,200

newmap/main.py

Lines changed: 19 additions & 4 deletions
@@ -12,6 +12,7 @@
 DEFAULT_THREAD_COUNT = 1
 DEFAULT_MINIMUM_KMER_LENGTH = 20
 DEFAULT_MAXIMUM_KMER_LENGTH = 200
+DEFAULT_EXCLUDED_BASES = 'Nn'

 # Defaults for mappability output
 DEFAULT_KMER_SIZE = 24
@@ -79,10 +80,10 @@ def parse_subcommands():

     unique_length_parser.add_argument(
         "kmer_lengths",
-        help="Specify k-mer lengths to find unique kmers. "
+        help="Specify k-mer lengths to find unique k-mers. "
              "Use a comma separated list of increasing lengths "
              "or a full inclusive set of lengths separated by a colon. "
-             "Example: 20,24,30 or 20-30.")
+             "Example: 20,24,30 or 20:30.")

     unique_length_parser.add_argument(
         "index_file",
@@ -92,21 +93,35 @@ def parse_subcommands():
         "fasta_file",
         help="Filename of (gzipped) fasta file for kmer generation")

+    unique_length_parser.add_argument(
+        "--initial-search-length", "-l",
+        type=int,
+        default=0,
+        help="Specify the initial search length for unique k-mers. Only valid "
+             "when search range is a continuous range separated by a colon."
+    )
+
+    unique_length_parser.add_argument(
+        "--exclude-bases", "-e",
+        default=DEFAULT_EXCLUDED_BASES,
+        help=f"Specify bases to exclude from k-mer candidacy. Case sensitive. "
+             f"Default is {DEFAULT_EXCLUDED_BASES}")
+
     unique_length_parser.add_argument(
         "--kmer-batch-size", "-s",
         default=DEFAULT_KMER_BATCH_SIZE,
         type=int,
         help="Maximum number of kmers to batch per reference sequence from "
              "given fasta file. "
              "Use to control memory usage. "
-             "Default is {}" .format(DEFAULT_KMER_BATCH_SIZE))
+             "Default is {}".format(DEFAULT_KMER_BATCH_SIZE))

     unique_length_parser.add_argument(
         "--thread-count", "-t",
         default=DEFAULT_THREAD_COUNT,
         type=int,
         help="Number of threads to parallelize kmer counting. "
-             "Default is {}" .format(DEFAULT_THREAD_COUNT))
+             "Default is {}".format(DEFAULT_THREAD_COUNT))

     unique_length_parser.add_argument(
         "--verbose",
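To see how the two new options behave once parsed, here is a small standalone sketch. It builds a throwaway `argparse` parser with the same flags and defaults as the diff above; the program name `unique-lengths-sketch` and the sample argument lists are assumptions for illustration only.

```python
import argparse

DEFAULT_EXCLUDED_BASES = "Nn"

# Stand-in parser carrying only the arguments relevant to this commit
parser = argparse.ArgumentParser(prog="unique-lengths-sketch")
parser.add_argument("kmer_lengths")
parser.add_argument("--initial-search-length", "-l", type=int, default=0)
parser.add_argument("--exclude-bases", "-e", default=DEFAULT_EXCLUDED_BASES)

# A colon range, an initial search length, and only upper-case N excluded
args = parser.parse_args(["20:200", "-l", "32", "-e", "N"])

# With neither option given, the defaults are 0 ("not set") and 'Nn'
defaults = parser.parse_args(["24,36,50"])
```

Because the default initial search length is 0, the code can test it for truthiness to decide whether the user set it.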

newmap/unique_counts.py

Lines changed: 68 additions & 10 deletions
@@ -19,6 +19,8 @@ def write_unique_counts(fasta_filename: Path,
                         index_filename: Path,
                         kmer_batch_size: int,
                         kmer_lengths: list[int],
+                        initial_search_length: int,
+                        exclude_bases: set[bytes],
                         num_threads: int,
                         use_binary_search=False,
                         verbose: bool = False):
@@ -110,6 +112,8 @@ def write_unique_counts(fasta_filename: Path,
                 sequence_segment,
                 min_kmer_length,
                 max_kmer_length,
+                initial_search_length,
+                exclude_bases,
                 num_threads,
                 data_type,  # type: ignore
                 verbose)
@@ -119,6 +123,7 @@ def write_unique_counts(fasta_filename: Path,
                 sequence_segment,
                 kmer_lengths,
                 num_kmers,
+                exclude_bases,
                 num_threads,
                 data_type,  # type: ignore
                 verbose)
@@ -155,6 +160,8 @@ def binary_search(index_filename: Path,
                   sequence_segment: SequenceSegment,
                   min_kmer_length: int,
                   max_kmer_length: int,
+                  initial_search_length: int,
+                  exclude_bases: set[bytes],
                   num_threads: int,
                   data_type: Union[np.uint8, np.uint16, np.uint32],
                   verbose: bool) -> tuple[npt.NDArray[np.uint], int]:
@@ -163,13 +170,16 @@ def binary_search(index_filename: Path,

     # NB: Floor division for midpoint
     # NB: Avoid an overflow error by dividing first before sum
-    starting_kmer_length = (max_kmer_length // 2) + (min_kmer_length // 2)
+    if initial_search_length:
+        starting_kmer_length = initial_search_length
+    else:
+        starting_kmer_length = (max_kmer_length // 2) + (min_kmer_length // 2)

     # Track which kmer positions have finished searching,
     # skipping any kmers starting with an ambiguous base
-    finished_search = np.frombuffer(sequence_segment.data,
-                                    dtype=np.uint8,
-                                    count=num_kmers) == ord(b'N')
+    finished_search = get_ambiguous_positions(sequence_segment,
+                                              num_kmers,
+                                              exclude_bases)

     # Print out the number of ambiguous positions skipped
     ambiguous_positions_skipped = finished_search.sum()
@@ -194,8 +204,10 @@ def binary_search(index_filename: Path,
                               current_length_query,
                               finished_search,
                               max_kmer_length,
-                              min_kmer_length)
+                              min_kmer_length,
+                              initial_search_length)

+    # Print out the number of ambiguous positions skipped if verbosity is on
     if verbose:
         upper_bound_change_count = np.count_nonzero(
             upper_length_bound[(~finished_search).nonzero()] < max_kmer_length)
@@ -209,7 +221,7 @@ def binary_search(index_filename: Path,
         verbose_print(verbose, f"{short_kmers_discarded_count} k-mers shorter "
                       "than the minimum length discarded due to ambiguity")

-    # List of minimum lengths (where 0 is nothing was found)
+    # List of unique minimum lengths (where 0 is nothing was found)
     unique_lengths = np.zeros(num_kmers, dtype=data_type)

     iteration_count = 1
@@ -303,7 +315,8 @@ def update_upper_search_bound(upper_length_bound_array: npt.NDArray[np.uint],
                               current_length_query_array: npt.NDArray[np.uint],
                               finished_search_array: npt.NDArray[np.bool_],
                               max_kmer_length,
-                              min_kmer_length):
+                              min_kmer_length,
+                              initial_search_length):
     """Modifies the input arrays in place to update the upper search bound based on
     ambiguous bases in the sequence data.
     Updates the query lengths between the new maximum upper bound
@@ -329,12 +342,23 @@ def update_upper_search_bound(upper_length_bound_array: npt.NDArray[np.uint],
         # Set the maximum length up to 1 next to the ambiguous base position
         upper_length_bound_array[length_change_position:i] = \
             max_lengths_to_ambiguous_position
+
         # Calculate the new query length as the midpoint between the updated
         # upper and the current lower bounds
-        current_length_query_array[length_change_position:i] = np.floor(
+        new_initial_search_array = np.floor(
             (upper_length_bound_array[length_change_position:i] / 2) +
             (lower_length_bound_array[length_change_position:i] / 2)).astype(
                 data_type)
+
+        # If we have an initial search length
+        if initial_search_length:
+            # Use the initial search length if it is less than the new midpoint
+            new_initial_search_array = np.fmin(new_initial_search_array,
+                                               initial_search_length)
+
+        current_length_query_array[length_change_position:i] = \
+            new_initial_search_array
+
         # Mark positions with values of (min length - 1) to 1 as finished
         finished_search_array[minimum_length_position+1:i] = True

@@ -343,14 +367,16 @@ def linear_search(index_filename: Path,
                   sequence_segment: SequenceSegment,
                   kmer_lengths: list[int],
                   num_kmers: int,
+                  exclude_bases: set[bytes],
                   num_threads: int,
                   data_type: Union[np.uint8, np.uint16, np.uint32],
                   verbose: bool) -> tuple[npt.NDArray[np.uint], int]:
     # Track which kmer positions have finished searching,
     # skipping any kmers starting with an ambiguous base
     # NB: Iterating over bytes returns ints
-    finished_search = np.array([c == ord(b'N') for c in
-                                sequence_segment.data[:num_kmers]])
+    finished_search = get_ambiguous_positions(sequence_segment,
+                                              num_kmers,
+                                              exclude_bases)

     ambiguous_positions_skipped = finished_search.sum()
     verbose_print(verbose, f"Skipping {ambiguous_positions_skipped} ambiguous "
@@ -478,6 +504,26 @@ def get_num_kmers(sequence_segment: SequenceSegment,
     return sequence_length - lookahead_length


+def get_ambiguous_positions(sequence_segment: SequenceSegment,
+                            num_positions: int,
+                            ambiguous_bases: set[bytes]):
+    """Returns a boolean array of ambiguous positions in a sequence segment
+    Where True is an ambiguous position and False is a non-ambiguous"""
+
+    # Track which kmer positions have finished searching,
+    # skipping any kmers starting with an ambiguous base
+    sequence_buffer = np.frombuffer(sequence_segment.data,
+                                    dtype=np.uint8,
+                                    count=num_positions)
+
+    ambiguous_array_positions = np.full(num_positions, False, dtype=bool)
+    for base in ambiguous_bases:
+        ambiguous_array_positions |= (sequence_buffer == ord(base))
+
+    return ambiguous_array_positions
+
+
+
 def print_summary_statisitcs(verbose: bool,
                              total_unique_lengths_count: int,
                              total_ambiguous_positions: int,
@@ -505,6 +551,8 @@ def main(args):
     index_filename = args.index_file
     kmer_batch_size = args.kmer_batch_size
     kmer_lengths_arg = args.kmer_lengths
+    initial_search_length = args.initial_search_length
+    exclude_bases_arg = args.exclude_bases
     num_threads = args.thread_count
     verbose = args.verbose
510558

@@ -530,10 +578,20 @@ def main(args):
     else:
         kmer_lengths = list(map(int, kmer_lengths_arg.split(",")))

+    if (initial_search_length and
+       not use_binary_search):
+        raise ValueError("Initial search length only valid when a range of "
+                         "k-mer lengths is given")
+
+    exclude_bases = set([bytes(base, encoding="utf-8")
+                         for base in exclude_bases_arg])
+
     write_unique_counts(Path(fasta_filename),
                         Path(index_filename),
                         kmer_batch_size,
                         kmer_lengths,
+                        initial_search_length,
+                        exclude_bases,
                         num_threads,
                         use_binary_search,
                         verbose)
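The two behavioural changes in this file can be illustrated without NumPy. The sketch below is a pure-Python stand-in, not the committed implementation: `ambiguous_positions` and `capped_query_length` are hypothetical names. The first mirrors the masking done by `get_ambiguous_positions`; the second mirrors, for a single position, the midpoint-then-`np.fmin` step in `update_upper_search_bound`.

```python
import math


def ambiguous_positions(data: bytes, num_positions: int,
                        excluded: set[bytes]) -> list[bool]:
    """Flag each of the first num_positions bytes that is an excluded base."""
    # ord() accepts a length-1 bytes object, so each excluded base must
    # encode to exactly one byte (true for ASCII letters such as b"N")
    excluded_codes = {ord(base) for base in excluded}
    # Iterating over bytes yields ints, so this is an int membership test
    return [code in excluded_codes for code in data[:num_positions]]


def capped_query_length(lower: int, upper: int, initial: int = 0) -> int:
    """Midpoint of the bounds, capped by the initial search length if set."""
    # Halve each bound before summing, then floor, as the diff does
    midpoint = math.floor((upper / 2) + (lower / 2))
    # An initial search length of 0 means "not set"
    return min(midpoint, initial) if initial else midpoint


# Same conversion as in main(): one bytes object per character
excluded = {bytes(base, encoding="utf-8") for base in "Nn"}
mask = ambiguous_positions(b"ACGNnT", 6, excluded)
```

Passing only `{b"N"}` leaves the soft-masked `n` as a candidate position, which is the case sensitivity described in the documentation change.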
