Status: research prototype (v0.1.0) • Developer: Yasin Kaya, MPIPZ (ykaya@mpipz.mpg.de | yyasinkkaya@gmail.com) • License: MIT
- Fast, portable, single binary. Streams large genomes/reads in one pass.
- Context-aware priors. Pick a quick heuristic or supply your own logistic k-mer weights.
- FASTA/FASTQ auto-detection. Handles multi-line FASTA and 4-line FASTQ.
- Practical TSV output: `seq_id`, `pos_1based`, `context`, `strand`, `kmer`, `prob`.
- Works from stdin. Pipe compressed input via `zcat`/`pigz -dc`.
- A C++17 compiler (GCC ≥ 8, Clang ≥ 7, or MSVC ≥ 2019).
g++ -std=c++17 -O3 -march=native -o revelio revelio.cpp
# or
clang++ -std=c++17 -O3 -march=native -o revelio revelio.cpp
cl /std:c++17 /O2 /EHsc revelio.cpp /Fe:revelio.exe
Drop this into a file named `Makefile`:
CXX ?= g++
CXXFLAGS ?= -std=c++17 -O3 -march=native -Wall -Wextra -DNDEBUG
revelio: revelio.cpp
	$(CXX) $(CXXFLAGS) -o $@ $<
clean:
	rm -f revelio
Heuristic priors (no model):
./revelio -i genome.fa --heuristic > calls.tsv
FASTQ with a logistic weights table and 7-mer context:
./revelio -i reads.fq --model weights.tsv -k 7 --header --out calls.tsv
Filter by context (only CG and CHG):
./revelio -i genome.fa --heuristic --contexts CG,CHG > calls.tsv
Stream a gzipped FASTA:
zcat genome.fa.gz | ./revelio -i /dev/stdin --heuristic > calls.tsv
revelio 0.1.0 — lightweight methylation-context predictor
Usage:
revelio -i <input.fa|fq|/dev/stdin> [--fasta|--fastq] [-k 5]
[--heuristic | --model weights.tsv]
[--contexts CG,CHG,CHH] [--header] [--out output.tsv]
Options:
-i, --input <path> Input FASTA/FASTQ or /dev/stdin
--fasta Force FASTA (auto-detected by default)
--fastq Force FASTQ
-k, --kmer <odd> Centered k-mer size (default: 5; must be odd)
--heuristic Use built-in heuristic priors (no weights table)
--model <tsv> Use logistic weights table (see below)
--contexts <list> Comma-separated subset (CG,CHG,CHH); default: all
--header Print header line in TSV output
--out <path> Write to file (default: stdout)
-h, --help Show help and exit
Output columns (TSV)
seq_id pos_1based context strand kmer prob
Example:
Chr1 123 CG + ACGTA 0.903114
Chr1 456 CHH + CATTG 0.377540
Note: Positions report the C on the + strand. The current version outputs `strand=+` only (reverse-strand aggregation is on the roadmap).
A lightweight prior that scores sites based on context class and simple k-mer checks (e.g., centered C, valid CG/CHG/CHH, optional flanking patterns). Intended for fast ranking, not calibrated probabilities.
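The context check that the heuristic relies on can be sketched as below. This is a minimal illustration, not Revelio's actual code: the function name `classify_context` is hypothetical, and treating ambiguous bases such as `N` as unclassifiable is an assumption. It covers only the classification of a forward-strand C into CG, CHG, or CHH, not the scoring itself.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Classify the cytosine at 0-based index i on the forward strand.
// Returns "CG", "CHG", "CHH", or "" when the site cannot be classified
// (not a C, too close to the sequence end, or followed by a non-ACGT base).
// Hypothetical helper; assumes uppercase input.
inline std::string classify_context(std::string_view seq, std::size_t i) {
    auto is_acgt = [](char b) { return b == 'A' || b == 'C' || b == 'G' || b == 'T'; };
    if (i >= seq.size() || seq[i] != 'C') return "";
    if (i + 1 >= seq.size() || !is_acgt(seq[i + 1])) return "";
    if (seq[i + 1] == 'G') return "CG";
    // The next base is H (A, C, or T); the base after it decides CHG vs CHH.
    if (i + 2 >= seq.size() || !is_acgt(seq[i + 2])) return "";
    return seq[i + 2] == 'G' ? "CHG" : "CHH";
}
```

A caller would scan each sequence once, call this for every `C`, and skip sites that return an empty string or that are excluded by `--contexts`.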
Loads a user-provided (context, k-mer) → weight mapping and computes:
p = sigmoid( BIAS + weight(context, kmer) )
sigmoid(x) = 1 / (1 + e^(-x))
- `-k` must be odd (e.g., 5, 7, 9), and the k-mer is centered on the cytosine.
- If a `(context, kmer)` pair is missing, Revelio falls back to `BIAS` (if provided) or a neutral default.
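The formula and fallback rule above can be sketched as follows. The `Model` struct, the `score` function, and the tab-joined map key are hypothetical names chosen for illustration. The sketch also assumes that a missing pair contributes a weight of 0 and that the neutral default is a bias of 0 (so p = 0.5); the real tool may define these differently.

```cpp
#include <cmath>
#include <string>
#include <unordered_map>

// Hypothetical in-memory model: weights keyed by "context\tkmer" plus a global bias.
struct Model {
    double bias = 0.0;  // assumed neutral default when no BIAS row is given
    std::unordered_map<std::string, double> weights;
};

inline double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// p = sigmoid(BIAS + weight(context, kmer)).
// A missing (context, kmer) pair adds nothing, so the result is sigmoid(BIAS).
inline double score(const Model& m, const std::string& context, const std::string& kmer) {
    const auto it = m.weights.find(context + '\t' + kmer);
    const double w = (it != m.weights.end()) ? it->second : 0.0;
    return sigmoid(m.bias + w);
}
```

With a bias of 0 and a weight of 2.25, this gives a probability of roughly 0.905; an unknown pair scores exactly 0.5.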
A simple TSV without header:
# optional global bias row
BIAS 0.0
# context kmer weight
CG ACGTA 2.25
CHG CAGTC 1.10
CHH CATTG 0.30
Guidelines:
- `context` ∈ {CG, CHG, CHH}
- `kmer` length must equal `-k` and be uppercase DNA (A/C/G/T).
- Use `BIAS` to shift all logits up/down (acts like an intercept).
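A parser for this table might look like the sketch below. The names `Weights` and `load_weights` are hypothetical, and silently skipping rows that break the guidelines is an assumption; Revelio itself may report an error instead.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical container for a parsed weights table.
struct Weights {
    double bias = 0.0;
    std::unordered_map<std::string, double> table;  // key: "context\tkmer"
};

// Parse the weights TSV from a stream. Comment lines (#) and blank lines are
// ignored; rows that violate the guidelines are skipped (an assumption).
inline Weights load_weights(std::istream& in, std::size_t k) {
    Weights w;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;
        std::istringstream fields(line);
        std::string first, kmer;
        double value = 0.0;
        if (!(fields >> first)) continue;
        if (first == "BIAS") {                       // optional global bias row
            if (fields >> value) w.bias = value;
            continue;
        }
        if (first != "CG" && first != "CHG" && first != "CHH") continue;
        if (!(fields >> kmer >> value)) continue;
        if (kmer.size() != k) continue;              // length must equal -k
        if (kmer.find_first_not_of("ACGT") != std::string::npos) continue;
        w.table[first + '\t' + kmer] = value;
    }
    return w;
}
```

Opening the file with `std::ifstream` and passing `k` from the command line would be enough to feed this into a scoring step.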
- FASTA: multi-line sequences supported.
- FASTQ: 4-line records, qualities ignored (sequence only).
- Compression: not handled internally; use shell pipes (`zcat | revelio -i /dev/stdin`).
| Column | Description |
|---|---|
| `seq_id` | FASTA/FASTQ record identifier |
| `pos_1based` | 1-based coordinate of the C on the + strand |
| `context` | One of `CG`, `CHG`, `CHH` |
| `strand` | Currently `+` (forward) |
| `kmer` | Centered k-mer (odd `-k`, C is the center) |
| `prob` | Heuristic/logistic probability in [0,1] |
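How the `kmer` and `pos_1based` columns relate to a 0-based index into the sequence can be sketched as follows. The helper names are hypothetical, and skipping sites whose window would run past either end of the sequence is an assumption about edge handling, not documented behavior.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Return the k-mer centered on 0-based index i, or nothing if k is even or
// the window would extend past either end of the sequence.
inline std::optional<std::string> centered_kmer(std::string_view seq,
                                                std::size_t i, std::size_t k) {
    const std::size_t half = k / 2;
    if (k % 2 == 0 || i < half || i + half >= seq.size()) return std::nullopt;
    return std::string(seq.substr(i - half, k));
}

// Convert a 0-based index into the value reported in the pos_1based column.
inline std::size_t to_pos_1based(std::size_t i) { return i + 1; }
```

For the sequence `TTACGTA`, the C at 0-based index 3 has the centered 5-mer `TACGT` and would be reported at `pos_1based` 4.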
- Compile with `-O3 -march=native` on a recent compiler.
- Prefer smaller odd `-k` (5 or 7) for speed; larger `k` can improve specificity but increases table size and cache pressure.
- Stream from disk or stdin; Revelio processes input in a single pass and uses bounded memory.
- Use Revelio to pre-rank CG/CHG/CHH sites or flag hotspots before signal-level callers.
- For ground-truth methylation on ONT data, switch to Remora/Dorado, Megalodon, or DeepSignal-plant and then reconcile with Revelio’s ranked candidates.
- Reverse-strand aggregation and duplex-aware reporting
- Optional BED/BEDMethyl output
- Native gzip input
- Pluggable k-mer featurizers (e.g., degeneracy, repeats)
MIT. See https://github.com/yykaya/Revelio/blob/main/LICENSE.
Thanks to the plant epigenetics community for foundational work on CG/CHG/CHH methylation and to the ONT ecosystem for practical modified-base pipelines that complement Revelio.