Commit f1f757d

Merge pull request #33 from Noble-Lab/varun
Pull request for some issue fixing (#19, #22, #23, #30, #31, #34, #25)
2 parents a7cefa2 + 1ec641b commit f1f757d

File tree

11 files changed

+8653
-259
lines changed

README.md

Lines changed: 70 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -5,14 +5,43 @@
55

66
Data and pre-trained model weights are available [here](https://zenodo.org/record/5976003).
77

8-
## How to get started with Casanovo?
8+
A link to the preprint of the paper where we discuss our methods and tests can be found [here](https://www.biorxiv.org/content/10.1101/2022.02.07.479481v1).
99

10-
Install `casanovo` as a Python package from this repo (requires Python 3.7+, dependencies will be installed automatically as needed):
10+
# How to get started with Casanovo?
11+
## Our recommendation:
12+
13+
Install **Anaconda**! It keeps the environment for casanovo and its dependencies separate from your other Python environments. **This is especially helpful because casanovo only works within a specific range of Python versions (3.7 <= Python version < 3.10).**
14+
15+
- Check out the [Windows](https://docs.anaconda.com/anaconda/install/windows/#), [MacOS](https://docs.anaconda.com/anaconda/install/mac-os/), and [Linux](https://docs.anaconda.com/anaconda/install/linux/) installation instructions.
16+
17+
Once you have Anaconda installed, you can use this helpful [cheat sheet](https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf) to see common commands and what they do.
18+
19+
## Environment creation:
20+
21+
Open up the anaconda prompt and run this command:
22+
```
23+
conda create --name casanovo_env python=3.7
24+
```
25+
This will create an anaconda environment called `casanovo_env` that has python 3.7 installed (You can check if it was created by typing `conda env list`).
26+
27+
You can activate this environment by typing:
28+
```
29+
conda activate casanovo_env
30+
```
31+
To the left of your anaconda prompt line it should now say **(casanovo_env)** instead of **(base)**. If this is the case, then you have set up anaconda and the environment properly.
32+
33+
**Be sure to rerun the activation command in your terminal whenever you reopen Anaconda and want to use casanovo.** The base environment most likely will not work.
34+
35+
## Installation:
36+
37+
Install `casanovo` as a Python package from this repo (requires 3.7 <= [Python version] < 3.10; dependencies will be installed automatically as needed):
1138
```
1239
pip install git+https://github.com/Noble-Lab/casanovo.git#egg=casanovo
1340
```
1441
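
The version requirement above can be checked before installing. The following is a minimal sketch, assuming the intended range is Python 3.7 inclusive up to (but not including) 3.10; the bounds and the helper name `python_version_supported` are illustrative assumptions and are not part of casanovo.

```python
import sys

# Assumed bounds (an interpretation of the README, not confirmed by the package):
# 3.7 is allowed, 3.10 is not.
MIN_VERSION = (3, 7)
MAX_VERSION_EXCLUSIVE = (3, 10)


def python_version_supported(version=None):
    """Return True if (major, minor) falls inside the assumed supported range."""
    major, minor = (version or sys.version_info)[:2]
    return MIN_VERSION <= (major, minor) < MAX_VERSION_EXCLUSIVE


if __name__ == "__main__":
    current = sys.version_info[:2]
    status = "inside" if python_version_supported() else "outside"
    print(f"Python {current[0]}.{current[1]} is {status} the assumed range")
```

Versions are compared as integer tuples rather than as strings or floats, because `"3.10" < "3.7"` as text and `3.10 == 3.1` as a float.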

15-
Once installed, Casanovo can be used with a simple command line interface. Run `casanovo --help` for more details. All auxiliary data, model and training-related variables can be specified in a user created `.py` file, see `casanovo/config.py` for the default configuration that was used to obtain the reported results.
42+
Once installed, Casanovo can be used with a simple command line interface. **Run `casanovo --help` for more details.** All auxiliary data, model, and training-related variables can be specified in a user-created `.yaml` file; see `casanovo/config.yaml` for the default configuration that was used to obtain the reported results.
43+
44+
# Example Commands:
1645

1746
- To evaluate _de novo_ sequencing performance of a pre-trained model (peptide annotations are needed for spectra):
1847
```
@@ -28,5 +57,43 @@ casanovo --mode=denovo --model_path='path/to/pretrained' --test_data_path='path/
2857
```
2958
casanovo --mode=train --model_path='path/to/pretrained' --train_data_path='path/to/train/mgf/files/dir' --val_data_path='path/to/validation/mgf/files/dir' --config_path='path/to/config'
3059
```
60+
# Example Job:
61+
## A short walkthrough on how to use casanovo with a very small set of spectra (~100)
3162

63+
### The spectra file (.mgf) that we will be running this job on can be seen in the sample_data folder.
64+
65+
- Step 1: Install casanovo (see above for details)
66+
- Step 2: Download the casanovo_pretrained_model_weights.zip from [here](https://zenodo.org/record/5976003). Place these models in a location that you can easily access and whose path you know.
67+
- We will be using pretrained_excl_mouse.ckpt for this job.
68+
- Step 3: Ensure you are in the proper anaconda environment by typing ```conda activate casanovo_env```. (If you named it differently, type in that name instead)
69+
- Step 4: Run this command:
70+
```
71+
casanovo --mode=denovo --model_path='[PATH_TO]/pretrained_excl_mouse.ckpt' --test_data_path='sample_data' --preprocess_spec=False
72+
```
73+
Make sure you have the proper filepath to the pretrained_excl_mouse.ckpt file.
74+
- Note: If you want the output csv written somewhere OTHER than where you ran this command, specify a directory with the --output_path CLI option
75+
- It would look like ```--output_path='path/to/output/location'``` appended onto the end of the above command. Be sure to provide a directory, not a file!
76+
77+
This job should take very little time to run (< 1 minute), and the result should be a file named ```casanovo_output.csv``` wherever you specified.
78+
79+
If the first few lines look like:
80+
```
81+
spectrum_id,denovo_seq,peptide_score,aa_scores
82+
0,LAHYNKR,0.9912219984190804,"[1.0, 1.0, 1.0, 0.99948...
83+
```
84+
Congratulations! You got casanovo to work!
85+
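
Once the output file exists, it can be inspected programmatically. The following is a minimal sketch, assuming the four column names shown in the header above and assuming `aa_scores` is stored as a Python-style list literal inside a quoted field. The rows in `SAMPLE_CSV`, the helper `read_predictions`, and the 0.9 cutoff are made up for illustration and are not real casanovo output.

```python
import ast
import csv
import io

# Illustrative stand-in for casanovo_output.csv. Only the column names come
# from the README; the rows are invented for this example.
SAMPLE_CSV = '''spectrum_id,denovo_seq,peptide_score,aa_scores
0,PEPK,0.95,"[1.0, 0.9, 0.95, 0.95]"
1,AGK,0.40,"[0.5, 0.3, 0.4]"
'''


def read_predictions(handle):
    """Parse rows, converting the score columns from text to numbers."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "spectrum_id": row["spectrum_id"],
            "denovo_seq": row["denovo_seq"],
            "peptide_score": float(row["peptide_score"]),
            # Assumption: aa_scores is a list literal such as "[1.0, 0.9]".
            "aa_scores": ast.literal_eval(row["aa_scores"]),
        })
    return rows


predictions = read_predictions(io.StringIO(SAMPLE_CSV))
# Keep only sequences above an arbitrary, illustrative score cutoff.
confident = [p["denovo_seq"] for p in predictions if p["peptide_score"] >= 0.9]
print(confident)
```

To read a real result, replace the `io.StringIO(SAMPLE_CSV)` argument with a file opened via `open("casanovo_output.csv", newline="")`.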
86+
# Common Troubleshooting/FAQ
87+
88+
## I installed casanovo and it worked before, but after reopening Anaconda it says casanovo is not installed
89+
Make sure you are in the `casanovo_env` environment. You can activate it by typing:
90+
```
91+
conda activate casanovo_env
92+
```
93+
## What CLI Prompts can I use?
94+
Run the following command in your command prompt:
95+
```
96+
casanovo --help
97+
```
98+
It should give you a comprehensive list of all CLI options you can tag onto a casanovo job and how/why to use them.
3299

casanovo/casanovo.py

Lines changed: 56 additions & 15 deletions
@@ -1,45 +1,86 @@
11
"""The command line entry point for casanovo"""
2+
from email.policy import default
23
import click, logging
3-
from casanovo.denovo import train, test_evaluate, test_denovo
4-
4+
import yaml
5+
import os
6+
from casanovo.denovo import train, evaluate, denovo
57

8+
#Required options
69
@click.command()
7-
@click.option("--mode", default='eval', help="Choose to train a model or test denovo predictions")
8-
@click.option("--model_path", help="Specify path to pre-trained model weights for testing or to continue to train")
9-
@click.option("--train_data_path", help="Specify path to mgf files to be used as training data")
10-
@click.option("--val_data_path", help="Specify path to mgf files to be used as validation data")
11-
@click.option("--test_data_path", help="Specify path to mgf files to be used as test data")
12-
@click.option("--config_path", help="Specify path to config file which includes data and model related options")
13-
@click.option("--output_path", help="Specify path to output de novo sequences")
10+
@click.option("--mode", required=True, default='eval', help="Choose on a high level what the program will do. \"train\" will train a model from scratch or continue training a pre-trained model. \"eval\" will evaluate de novo sequencing performance of a pre-trained model (peptide annotations are needed for spectra). \"denovo\" will run de novo sequencing without evaluation (specify directory path for output csv file with de novo sequences).", type=click.Choice(['train', 'eval', 'denovo']))
11+
@click.option("--model_path", required=True, help="Specify path to pre-trained model weights (.ckpt file) for testing or to continue to train.", type=click.Path(exists=True, dir_okay=False, file_okay=True))
12+
#Base options
13+
@click.option("--train_data_path", help="Specify path to .mgf files to be used as training data", type=click.Path(exists=True, dir_okay=True, file_okay=False))
14+
@click.option("--val_data_path", help="Specify path to .mgf files to be used as validation data", type=click.Path(exists=True, dir_okay=True, file_okay=False))
15+
@click.option("--test_data_path", help="Specify path to .mgf files to be used as test data", type=click.Path(exists=True, dir_okay=True, file_okay=False))
16+
@click.option("--config_path", help="Specify path to custom config file which includes data and model related options. If not included, the default config.yaml will be used.", type=click.Path(exists=True, dir_okay=False, file_okay=True))
17+
@click.option("--output_path", help="Specify path to output de novo sequences. Output format is .csv", type=click.Path(exists=True, dir_okay=True, file_okay=False))
18+
#De Novo sequencing options
19+
@click.option("--preprocess_spec", default=None, help="True if spectra data should be preprocessed, False if using preprocessed data.", type=click.BOOL)
20+
@click.option("--num_workers", default=None, help="Number of workers to use for spectra reading.", type=click.INT)
21+
@click.option("--gpus", default=(), help="Specify gpus for usage. For multiple gpus, use format: --gpus=0 --gpus=1 --gpus=2... etc. etc.", type=click.INT, multiple=True)
1422

1523
def main(
24+
#Req + base vars
1625
mode,
1726
model_path,
1827
train_data_path,
1928
val_data_path,
2029
test_data_path,
2130
config_path,
22-
output_path
31+
output_path,
32+
#De Novo vars
33+
preprocess_spec,
34+
num_workers,
35+
gpus
2336
):
24-
"""The command line function"""
37+
"""
38+
The command line function for casanovo. De Novo Mass Spectrometry Peptide Sequencing with a Transformer Model.
39+
40+
\b
41+
Training option requirements:
42+
mode, model_path, train_data_path, val_data_path, config_path
43+
44+
\b
45+
Evaluation option requirements:
46+
mode, model_path, test_data_path, config_path
47+
48+
\b
49+
De Novo option requirements:
50+
mode, model_path, test_data_path, config_path, output_path
51+
"""
2552
logging.basicConfig(
2653
level=logging.INFO,
2754
format="%(levelname)s: %(message)s",
2855
)
56+
if config_path == None:
57+
abs_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'config.yaml')
58+
with open(abs_path) as f:
59+
config = yaml.safe_load(f)
60+
else:
61+
with open(config_path) as f:
62+
config = yaml.safe_load(f)
63+
64+
if(preprocess_spec != None):
65+
config['preprocess_spec'] = preprocess_spec
66+
if(num_workers != None):
67+
config['num_workers'] = num_workers
68+
if(gpus != ()):
69+
config['gpus'] = gpus
2970
if mode == 'train':
30-
71+
3172
logging.info('Training Casanovo...')
32-
train(train_data_path, val_data_path, model_path, config_path)
73+
train(train_data_path, val_data_path, model_path, config)
3374

3475
elif mode == 'eval':
3576

3677
logging.info('Evaluating Casanovo...')
37-
test_evaluate(test_data_path, model_path, config_path)
78+
evaluate(test_data_path, model_path, config)
3879

3980
elif mode == 'denovo':
4081

4182
logging.info('De novo sequencing with Casanovo...')
42-
test_denovo(test_data_path, model_path, config_path, output_path)
83+
denovo(test_data_path, model_path, config, output_path)
4384

4485
pass
4586
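
The override logic in `main` (values from the config file are replaced only when the matching CLI option was actually given) can be isolated as a small pure function. The following is a minimal sketch of that pattern and not the code from this commit: the helper name `apply_cli_overrides` is hypothetical, and unlike `main` it copies the dictionary and converts the `gpus` tuple to a list.

```python
def apply_cli_overrides(config, preprocess_spec=None, num_workers=None, gpus=()):
    """Return a copy of config with CLI values applied where they were given."""
    merged = dict(config)
    # "is not None" keeps an explicit False (e.g. --preprocess_spec=False).
    if preprocess_spec is not None:
        merged["preprocess_spec"] = preprocess_spec
    if num_workers is not None:
        merged["num_workers"] = num_workers
    # An empty tuple means no --gpus flag was passed, so the config value stays.
    if gpus:
        merged["gpus"] = list(gpus)
    return merged


# Default values mirror the corresponding entries in config.yaml.
defaults = {"preprocess_spec": True, "num_workers": 8, "gpus": [0]}
merged = apply_cli_overrides(defaults, preprocess_spec=False, gpus=(0, 1))
print(merged)
```

Returning a copy leaves the loaded defaults untouched, which makes the merge easy to test without reading a YAML file.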

casanovo/config.py

Lines changed: 0 additions & 92 deletions
This file was deleted.

casanovo/config.yaml

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
1+
###
2+
# Config file for casanovo. Edit this to change parameters for train, eval, and denovo.
3+
# Note that blank entries are interpreted as "None"
4+
###
5+
6+
#Random seed to be used across the pipeline
7+
random_seed: 454
8+
9+
#Train data options
10+
train_annot_spec_idx_path: casanovo_train.hdf5 #path to write the training data index file, starts at os.getcwd()
11+
train_spec_idx_overwrite: True
12+
13+
#Validation data options
14+
val_annot_spec_idx_path: casanovo_val.hdf5 #path to write the validation data index file, starts at os.getcwd()
15+
val_spec_idx_overwrite: True
16+
17+
#Test data options
18+
test_annot_spec_idx_path: casanovo_test.hdf5 #path to write the test data index file
19+
test_spec_idx_overwrite: True
20+
21+
#Preprocessing parameters
22+
preprocess_spec: True #Should be True when using raw mass spec data. Use False when reproducing Casanovo results with the provided benchmark data set which is pre-processed
23+
n_peaks: 150
24+
min_mz: 50.52564895 #1.0005079 * 50.5
25+
max_mz: 2500
26+
min_intensity: 0.01
27+
fragment_tol_mass: 2 # Da
28+
29+
#Hardware options
30+
num_workers: 8
31+
gpus: [0] #None for CPU, int list to specify GPUs
32+
33+
#Model options
34+
dim_model: 512
35+
n_head: 8
36+
dim_feedforward: 1024
37+
n_layers: 9
38+
dropout: 0
39+
dim_intensity:
40+
custom_encoder:
41+
max_length: 100
42+
residues:
43+
G: 57.021463735
44+
A: 71.037113805
45+
S: 87.032028435
46+
P: 97.052763875
47+
V: 99.068413945
48+
T: 101.047678505
49+
C+57.021: 160.030644505 #103.009184505 + 57.02146
50+
L: 113.084064015
51+
I: 113.084064015
52+
N: 114.042927470
53+
D: 115.026943065
54+
Q: 128.058577540
55+
K: 128.094963050
56+
E: 129.042593135
57+
M: 131.040484645
58+
H: 137.058911875
59+
F: 147.068413945
60+
R: 156.101111050
61+
Y: 163.063328575
62+
W: 186.079312980
63+
# AA mods:
64+
M+15.995: 147.035399645 #131.040484645 + 15.994915 # Met Oxidation
65+
N+0.984: 115.02694347 #114.042927470 + 0.984016, # Asn Deamidation
66+
Q+0.984: 129.04259354 #128.058577540 + 0.984016, # Gln Deamidation
67+
max_charge: 10
68+
n_log: 1
69+
tb_summarywriter:
70+
warmup_iters: 100000
71+
max_iters: 600000
72+
learning_rate: 5e-4
73+
weight_decay: 1e-5
74+
75+
#Training/inference options
76+
train_batch_size: 32
77+
val_batch_size: 1024
78+
test_batch_size: 1024
79+
80+
accelerator: "ddp"
81+
logger:
82+
max_epochs: 30
83+
num_sanity_val_steps: 0
84+
85+
train_from_scratch: True
86+
87+
save_model: False
88+
model_save_folder_path: ''
89+
save_weights_only: True
90+
every_n_epochs: 1
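
The inline comments in the `residues` block state how each modified-residue mass was derived (base mass plus modification shift). The following is a minimal sketch that rechecks that arithmetic, using only numbers copied from the config above; the helper names `modified_mass` and `mismatches` and the 1e-6 tolerance are illustrative assumptions.

```python
import math

# Unmodified residue masses from config.yaml (the C value is taken from the
# inline comment on the C+57.021 entry).
BASE_MASS = {
    "C": 103.009184505,
    "M": 131.040484645,
    "N": 114.042927470,
    "Q": 128.058577540,
}
# Modification mass shifts, taken from the inline comments in config.yaml.
MASS_SHIFT = {
    "C+57.021": 57.02146,
    "M+15.995": 15.994915,
    "N+0.984": 0.984016,
    "Q+0.984": 0.984016,
}
# Modified-residue masses as listed in config.yaml.
LISTED_MASS = {
    "C+57.021": 160.030644505,
    "M+15.995": 147.035399645,
    "N+0.984": 115.02694347,
    "Q+0.984": 129.04259354,
}


def modified_mass(name):
    """Base residue mass plus the shift for a key such as 'M+15.995'."""
    residue = name.partition("+")[0]
    return BASE_MASS[residue] + MASS_SHIFT[name]


def mismatches(tolerance=1e-6):
    """Names whose listed mass differs from base + shift by more than tolerance."""
    return [
        name for name, listed in LISTED_MASS.items()
        if not math.isclose(modified_mass(name), listed, abs_tol=tolerance)
    ]


print(mismatches())  # an empty list means the listed masses match the comments
```

The same kind of check applies to `min_mz`, whose comment gives it as `1.0005079 * 50.5`.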

casanovo/denovo/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
11
"""The de novo sequencing model"""
22
from .model import Spec2Pep
33
from .dataloaders import DeNovoDataModule
4-
from .train_test import train, test_denovo, test_evaluate
4+
from .train_test import train, denovo, evaluate
55
