Commit c3d2bba

Add non-enzymatic dataset to FAQ (#288)
* Add non-enzymatic dataset to FAQ
* Minor text changes

---------

Co-authored-by: Wout Bittremieux <bittremieux@users.noreply.github.com>
1 parent e17808b commit c3d2bba

1 file changed: +4 -1 lines changed

docs/faq.md

Lines changed: 4 additions & 1 deletion
@@ -86,7 +86,7 @@ This will need to be set with each new shell session, or you can add it to your

 **Where can I find the data that Casanovo was trained on?**

-The [Casanovo results reported ](https://doi.org/10.1101/2023.01.03.522621) were obtained by training on two different datasets: (i) a commonly used nine-species benchmark dataset, and (ii) a large-scale training dataset derived from the MassIVE Knowledge Base (MassIVE-KB).
+The [Casanovo results reported](https://doi.org/10.1101/2023.01.03.522621) were obtained by training on two different datasets: (i) a commonly used nine-species benchmark dataset, and (ii) a large-scale training dataset derived from the MassIVE Knowledge Base (MassIVE-KB).

 All data for the _nine-species benchmark_ is available as annotated MGF files [on MassIVE](https://doi.org/doi:10.25345/C52V2CK8J).
 Using these data, Casanovo was trained in a cross-validated fashion, training on eight species and testing on the remaining species.
@@ -97,6 +97,9 @@ To compile this dataset yourself, on the [MassIVE website](https://massive.ucsd.
 This will give you a zipped TSV file with the metadata and peptide identifications for all 30 million PSMs.
 Using the filename (column "filename") you can then retrieve the corresponding peak files from the MassIVE FTP server and extract the desired spectra using their scan number (column "scan").

+The _non-enzymatic dataset_, used to train a non-tryptic version of Casanovo, was created by selecting PSMs with a uniform distribution of amino acids at the C-terminal peptide positions from two datasets: MassIVE-KB and PROSPECT.
+Training, validation, and test splits for the non-enzymatic dataset are available as annotated MGF files [on MassIVE](https://doi.org/doi:10.25345/C5KS6JG0W).
+
 **How do I know which model to use after training Casanovo?**

 By default, Casanovo saves a snapshot of the model weights after every 50,000 training steps.
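
The FAQ context in this diff describes a retrieval step: look up peak files by the "filename" column, then pull spectra by the "scan" column. One way to begin is to group the scan numbers per file. The following is a minimal Python sketch, not part of the commit or of Casanovo. It assumes the unzipped TSV is tab-separated with a header row. The sample rows, the file names, and the `annotation` column are made up for illustration. Downloading from the MassIVE FTP server and parsing the peak files are left out.

```python
import csv
import io
from collections import defaultdict


def group_scans_by_file(tsv_handle):
    """Map each peak file name to the sorted scan numbers to extract from it."""
    scans = defaultdict(set)
    # Column names "filename" and "scan" are the ones named in the FAQ text.
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    for row in reader:
        scans[row["filename"]].add(int(row["scan"]))
    return {name: sorted(numbers) for name, numbers in scans.items()}


# Hypothetical rows for illustration only; real rows come from the MassIVE-KB TSV.
example_tsv = io.StringIO(
    "filename\tscan\tannotation\n"
    "run_a.mzML\t12\tPEPTIDEK\n"
    "run_b.mzML\t7\tSAMPLER\n"
    "run_a.mzML\t3\tELVISLIVESK\n"
)
scans_by_file = group_scans_by_file(example_tsv)
print(scans_by_file)  # {'run_a.mzML': [3, 12], 'run_b.mzML': [7]}
```

Grouping first means each peak file only needs to be fetched and read once, however many of its spectra appear in the TSV.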
