Add non-enzymatic dataset to FAQ (#288)

melihyilmaz · bittremieux · web-flow · commit c3d2bbac7cc2 · 2024-02-07T09:14:39.000+01:00
* Add non-enzymatic dataset to FAQ

* Minor text changes

---------

Co-authored-by: Wout Bittremieux <bittremieux@users.noreply.github.com>
diff --git a/docs/faq.md b/docs/faq.md
@@ -86,7 +86,7 @@ This will need to be set with each new shell session, or you can add it to your
 
 **Where can I find the data that Casanovo was trained on?**
 
-The [Casanovo results reported ](https://doi.org/10.1101/2023.01.03.522621) were obtained by training on two different datasets: (i) a commonly used nine-species benchmark dataset, and (ii) a large-scale training dataset derived from the MassIVE Knowledge Base (MassIVE-KB).
+The [Casanovo results reported](https://doi.org/10.1101/2023.01.03.522621) were obtained by training on two different datasets: (i) a commonly used nine-species benchmark dataset, and (ii) a large-scale training dataset derived from the MassIVE Knowledge Base (MassIVE-KB).
 
 All data for the _nine-species benchmark_ is available as annotated MGF files [on MassIVE](https://doi.org/doi:10.25345/C52V2CK8J).
 Using these data, Casanovo was trained in a cross-validated fashion, training on eight species and testing on the remaining species.
@@ -97,6 +97,9 @@ To compile this dataset yourself, on the [MassIVE website](https://massive.ucsd.
 This will give you a zipped TSV file with the metadata and peptide identifications for all 30 million PSMs.
 Using the filename (column "filename") you can then retrieve the corresponding peak files from the MassIVE FTP server and extract the desired spectra using their scan number (column "scan").
 
+The _non-enzymatic dataset_, used to train a non-tryptic version of Casanovo, was created by selecting PSMs with a uniform distribution of amino acids at the C-terminal peptide positions from two datasets: MassIVE-KB and PROSPECT.
+Training, validation, and test splits for the non-enzymatic dataset are available as annotated MGF files [on MassIVE](https://doi.org/doi:10.25345/C5KS6JG0W).
+
 **How do I know which model to use after training Casanovo?**
 
 By default, Casanovo saves a snapshot of the model weights after every 50,000 training steps.
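
Note on the MassIVE-KB retrieval step in the FAQ context above: the text says to use the "filename" and "scan" columns of the downloaded TSV to fetch peak files and pull out spectra. A minimal sketch of the bookkeeping part, assuming the TSV is tab-separated with a header row and that scan numbers are integers, could look like the following. The function name `scans_by_file`, the example rows, and the `annotation` column are hypothetical and only for illustration; downloading from the FTP server and parsing the peak files are not shown.

```python
import csv
import io
from collections import defaultdict


def scans_by_file(tsv_text):
    """Map each peak-file name to the sorted scan numbers requested from it."""
    wanted = defaultdict(set)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # "filename" and "scan" are the column names given in the FAQ text.
        # Assumption: every scan value parses as an integer.
        wanted[row["filename"]].add(int(row["scan"]))
    return {name: sorted(scans) for name, scans in wanted.items()}


# Made-up rows for illustration; the "annotation" column is hypothetical
# and a real export contains more columns than shown here.
example_tsv = (
    "filename\tscan\tannotation\n"
    "run_A.mzML\t120\tPEPTIDEK\n"
    "run_B.mzML\t7\tLESLIEK\n"
    "run_A.mzML\t45\tSAMPLER\n"
)

if __name__ == "__main__":
    for name, scans in scans_by_file(example_tsv).items():
        print(name, scans)
```

Grouping by file first means each peak file only needs to be retrieved and read once, however many PSMs point at it.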