aedera
diff --git a/README.md b/README.md
Lines changed: 134 additions & 0 deletions
diff --git a/img/kmer-norms.png b/img/kmer-norms.png
381 KB
diff --git a/m6anormalization/__init__.py b/m6anormalization/__init__.py
Lines changed: 39 additions & 0 deletions
diff --git a/m6anormalization/cli/__init__.py b/m6anormalization/cli/__init__.py
diff --git a/m6anormalization/cli/apply.py b/m6anormalization/cli/apply.py
Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,134 @@
+# m6anormalization
+
+This repository provides a package for calculating k-mer normalization
+constants for m6A levels inferred from DNA Nanopore reads. For more
+detailed information, refer to the publication:
+
+[Simultaneous Profiling of Chromatin Accessibility and DNA Methylation
+in Complete Plant Genomes Using Long-Read
+Sequencing](https://www.biorxiv.org/content/10.1101/2023.11.15.567180v2).
+
+<figure>
+  <p align="center">
+  <img src="img/kmer-norms.png" alt="m6a normalization constants" height="400" style="vertical-align:middle"/>
+  </p>
+</figure>
+
+## Requirements
+
+Before installing, make sure Python 3.6 or later and Conda are available on your system. The package installation automatically handles the pip and NumPy dependencies.
+
+## Installation
+
+
+### Create conda environment
+
+It is highly recommended to create a dedicated Conda environment:
+
+```shell
+conda create -n m6a python=3.6
+conda activate m6a
+```
+
+### Install pip
+
+If [pip](https://pypi.org/project/pip/) is not already installed, use
+the following command:
+
+```shell
+conda install pip
+```
+
+### Clone the repository and install the package
+
+```shell
+git clone https://github.com/aedera/m6anormalization
+cd m6anormalization
+pip install .
+```
+
+## Usage
+
+### Process m6a calls
+
+To call m6A from Nanopore reads, use the
+[megalodon](https://github.com/nanoporetech/megalodon) tool. Its
+output file `per_read_modified_base_calls.txt` is required for
+generating k-mer normalization constants. Run the following
+commands to preprocess it before calculating the constants:
+
+```shell
+awk '$7=="Y" {if(exp($5)>=0.75) { print $2"\t"$4"\t"$4"\t"1"\t"$3} else if(exp($5)<0.75) print $2"\t"$4"\t"$4"\t"0"\t"$3}' per_read_modified_base_calls.txt >  bper_read.tmp
+sort -k 1,1 -k2,2n bper_read.tmp > bper_read.sorted.tmp
+bedtools merge -i bper_read.sorted.tmp -c 4,4,5 -o count,mean,distinct -d -2 | \
+awk '{if($3==$2) { print $1"\t"$2"\t"$3+1"\t"$5"\t"$6"\t"$4} else print  $1"\t"$2+1"\t"$2+2"\t"$5"\t"$6"\t"$4}' > m6a.bed
+```
+
+These commands discretize the per-read m6A calls and aggregate them
+into a methylation level for each genomic adenine. These levels are
+stored in the `m6a.bed` file. A per-read call counts as methylated
+when its probability is at least 0.75.
+
+The `m6a.bed` file should look like this:
+
+```
+Chr1    32      33      0.100   -       20
+Chr1    34      35      0.071   +       14
+Chr1    35      36      0.067   +       15
+Chr1    36      37      0.174   -       23
+Chr1    39      40      0.125   -       24
+Chr1    40      41      0.083   -       24
+Chr1    41      42      0.000   +       16
+Chr1    42      43      0.000   +       17
+Chr1    43      44      0.125   -       24
+Chr1    47      48      0.083   -       24
+Chr1    48      49      0.000   +       18
+Chr1    49      50      0.000   +       18
+Chr1    50      51      0.000   +       18
+```
+
+where the columns are:
+
+|chromosome|start|end|m6a level|strand|coverage|
+|----------|-----|---|---------|------|--------|
+
+
+### Generate k-mer normalization constants
+
+Use the `m6a.bed` file to generate k-mer normalization constants:
+
+```shell
+m6anormalization generate --bed m6a.bed \
+                          --fas fas_file \
+                          --chrs Chr1,Chr2,Chr3,Chr4,Chr5 \
+                          --out kconstants.tsv
+```
+
+`fas_file` is the FASTA file used as input for megalodon. K-mers are extracted only from the chromosomes listed in `--chrs`.
+
+### Apply normalization constants to genome
+
+Apply the generated normalization constants (`kconstants.tsv`) to
+normalize the m6a levels:
+
+```shell
+m6anormalization apply --norm kconstants.tsv \
+                       --fas fas_file \
+                       --bed m6a.bed \
+                       --chrs Chr1,Chr2,Chr3,Chr4,Chr5
+```
+
+This step writes one BED file per chromosome (`<chromosome>.bed`, in the
+current directory) with normalized m6A levels in the following format:
+
+|chromosome|start|end|m6a level|strand|coverage|k-mer|normalized m6a level|
+|----------|-----|---|---------|------|--------|-----|--------------------|
+
+## Contributing
+
+Contributions are welcome. To get started, open an issue [here](https://github.com/aedera/m6anormalization/issues).
+
+
+## License
+
+This package is released under the [MIT License](LICENSE).
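
The discretize-and-aggregate step that the README performs with awk and bedtools can also be sketched in Python. This is a minimal illustration under stated assumptions, not part of the package: `aggregate_calls` and its in-memory tuple input are hypothetical stand-ins for the megalodon text file, and only the 0.75 threshold comes from the README. Grouping sites by strand as well as position is a simplification made here.

```python
import math
from collections import defaultdict

THRESHOLD = 0.75  # same cutoff as the awk command in the README


def aggregate_calls(calls, threshold=THRESHOLD):
    """Turn per-read calls into per-site methylation levels.

    `calls` is an iterable of (chrom, pos, strand, mod_log_prob) tuples,
    a hypothetical in-memory stand-in for the megalodon output.
    Returns {(chrom, pos, strand): (level, coverage)}.
    """
    # site -> [methylated reads, total reads]
    counts = defaultdict(lambda: [0, 0])
    for chrom, pos, strand, mod_log_prob in calls:
        site = counts[(chrom, pos, strand)]
        # a read counts as methylated when its probability reaches the threshold
        site[0] += int(math.exp(mod_log_prob) >= threshold)
        site[1] += 1
    return {key: (meth / total, total) for key, (meth, total) in counts.items()}


# made-up calls, for illustration only
calls = [
    ("Chr1", 32, "-", math.log(0.90)),
    ("Chr1", 32, "-", math.log(0.10)),
    ("Chr1", 32, "-", math.log(0.80)),
    ("Chr1", 32, "-", math.log(0.20)),
    ("Chr1", 34, "+", math.log(0.30)),
]
levels = aggregate_calls(calls)
```

Each value pairs the fraction of reads called methylated with the number of reads covering the site, which corresponds to the `m6a level` and `coverage` columns of `m6a.bed`.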
@@ -0,0 +1,39 @@
+from argparse import ArgumentDefaultsHelpFormatter, ArgumentParser
+
+__version__ = '0.1.0'
+
+from m6anormalization.cli import (
+    generate, apply,
+)
+
+modules = [
+    'generate', 'apply'
+]
+
+def main():
+    parser = ArgumentParser(
+        'm6anormalization',
+        formatter_class=ArgumentDefaultsHelpFormatter
+    )
+
+    parser.add_argument(
+        '-v', '--version', action='version',
+        version='%(prog)s {}'.format(__version__)
+    )
+
+    subparsers = parser.add_subparsers(
+        title='subcommands', description='valid commands',
+        help='additional help', dest='command'
+    )
+    subparsers.required = True
+
+    for module in modules:
+        mod = globals()[module]
+        p = subparsers.add_parser(module, parents=[mod.argparser()])
+        p.set_defaults(func=mod.main)
+
+    args = parser.parse_args()
+    args.func(args)
+
+
+
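
The subcommand wiring in `main()` relies on argparse parent parsers. The pattern can be sketched in isolation as follows, with a single subcommand reduced to one option. The names `apply_argparser`, `apply_main`, and `build_parser` are illustrative and are not the package's API.

```python
import argparse


def apply_argparser():
    # A parent parser must disable -h; otherwise it clashes with the
    # -h that the subcommand parser adds for itself.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--norm", required=True, type=str)
    return parser


def apply_main(args):
    return f"applying {args.norm}"


def build_parser():
    parser = argparse.ArgumentParser("m6anormalization")
    subparsers = parser.add_subparsers(dest="command")
    subparsers.required = True
    # the subcommand inherits every option declared on the parent parser
    p = subparsers.add_parser("apply", parents=[apply_argparser()])
    # store the handler so the caller can dispatch with args.func(args)
    p.set_defaults(func=apply_main)
    return parser


args = build_parser().parse_args(["apply", "--norm", "kconstants.tsv"])
result = args.func(args)
```

Keeping each subcommand's options in its own `argparser()` function lets every CLI module stay self-contained while the top-level parser only handles dispatch.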
@@ -0,0 +1,88 @@
+import argparse
+import multiprocessing
+
+from m6anormalization.utils.io import read_fas
+from m6anormalization.utils.utils import get_kmer
+
+# import pdb
+
+# class ForkedPdb(pdb.Pdb):
+#     """A Pdb subclass that may be used
+#     from a forked multiprocessing child
+
+#     """
+#     def interaction(self, *args, **kwargs):
+#         _stdin = sys.stdin
+#         try:
+#             sys.stdin = open('/dev/stdin')
+#             pdb.Pdb.interaction(self, *args, **kwargs)
+#         finally:
+#             sys.stdin = _stdin
+
+class Writer(multiprocessing.Process):
+    def __init__(self, knorm_fn, fas_fn, bed_fn, chrn):
+        super().__init__()
+        self.knorm_fn = knorm_fn
+        self.fas_fn   = fas_fn
+        self.bed_fn   = bed_fn
+        self.tchrn    = chrn
+
+    def run(self):
+        # read kmer normalization
+        k2v = {}
+        with open(self.knorm_fn) as f:
+            for line in f:
+                kmer, val = line.strip().split('\t')
+                k2v[kmer] = float(val)
+
+        # read fasta file to retrieve k-mers associated to positions
+        seq = read_fas(self.fas_fn, self.tchrn)
+
+        # write one output BED file for this chromosome
+        with open(f'{self.tchrn}.bed', 'w') as out, open(self.bed_fn) as f:
+            for line in f:
+                chrn, start, stop, mlevel, strand, cov = line.strip().split('\t')
+                if chrn != self.tchrn:
+                    continue
+                start  = int(start)
+                mlevel = float(mlevel)
+
+                # skip positions whose k-mer has no normalization constant
+                kmer = get_kmer(seq, start)
+                norm = k2v.get(kmer)
+                if norm is None:
+                    continue
+
+                # a constant of zero cannot be divided by; report 0 instead
+                if norm != 0:
+                    new_level = mlevel/norm
+                else:
+                    new_level = 0.
+
+                out.write(f'{self.tchrn}\t{start}\t{start+1}\t{mlevel}\t{strand}\t{cov}\t{kmer}\t{new_level:.4f}\n')
+
+def main(args):
+    knorm_fn = args.norm
+    fas_fn   = args.fas
+    bed_fn   = args.bed
+    chrs     = args.chrs
+    chrs     = [chrn.strip() for chrn in chrs.split(',')]
+
+    # launch one process per chromosome, then wait for all of them
+    processes = [Writer(knorm_fn, fas_fn, bed_fn, chrn) for chrn in chrs]
+    for p in processes:
+        p.start()
+    for p in processes:
+        p.join()
+
+def argparser():
+    parser = argparse.ArgumentParser(
+        description='Apply normalization constants to genome',
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+        add_help=False
+    )
+    parser.add_argument("--norm", required=True, type=str)
+    parser.add_argument("--fas",  required=True, type=str)
+    parser.add_argument("--bed",  required=True, type=str)
+    parser.add_argument("--chrs", required=True, type=str)
+    return parser
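
The per-site arithmetic in `Writer.run` can be sketched on its own, without the FASTA lookup or multiprocessing. This is a minimal sketch rather than the package's code: `load_constants` and `normalize` are hypothetical helper names, and the k-mers and constants below are made up for illustration.

```python
def load_constants(lines):
    """Parse tab-separated 'kmer<TAB>constant' lines into a dict."""
    k2v = {}
    for line in lines:
        kmer, val = line.strip().split('\t')
        k2v[kmer] = float(val)
    return k2v


def normalize(mlevel, kmer, k2v):
    """Return the normalized level, or None when the k-mer has no constant."""
    norm = k2v.get(kmer)
    if norm is None:
        return None
    # a constant of zero cannot be divided by; report 0 instead
    return mlevel / norm if norm != 0 else 0.0


# made-up constants, for illustration only
k2v = load_constants(["ACGAT\t0.25\n", "TTATT\t0\n"])

scaled = normalize(0.125, "ACGAT", k2v)    # level divided by its constant
zeroed = normalize(0.125, "TTATT", k2v)    # zero constant
missing = normalize(0.125, "GGAGG", k2v)   # k-mer absent from the table
```

Returning `None` for an unknown k-mer mirrors how `apply` skips such positions instead of writing them to the output file.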