Hi there,

Thank you for building such a great tool! PhyloFisher, along with its many useful scripts, has been a huge help to me.

I have a question about the paralog database. Can I use the paralog database within PhyloFisher to filter out possible paralogs in my dataset using BLAST? I am just not sure whether this is the right way to go. I used OrthoFinder to select orthogroups and a tree-based pruning strategy to filter for single-copy orthologs. I then ran BLAST with the results against the PhyloFisher paralog database, and there were a lot of hits. Does that mean they are probably paralogs and should be removed from the final matrix?

Mia
Replies: 2 comments 1 reply
Hi @Mia1349,

We are glad to hear you are finding PhyloFisher useful in your work!

The short answer is that I don't think the strategy you suggest above will be informative in determining whether or not you have paralogs remaining in your dataset.

Here is the much longer explanation as to why.

1) The sequences in the paralogs dataset provided with PhyloFisher will produce significant hits to closely related sequences, which in your case may be the desired ortholog. This is the reason we maintain them and manually inspect homolog trees for ortholog selection.

2) With the strategy you have taken, some of our "paralogs" may be fine to keep in your dataset because they have an orthologous relationship to one another. For example, the gene CDK5 (used in the PhyloFisher dataset) had a duplication event early in the history of eukaryotes. Some extant taxa have maintained "copy 1" and some have "copy 2." Some may even have both, but I cannot remember at the moment. For the sake of conversation, we will say "copy 1" is maintained as the ortholog in the PhyloFisher database and "copy 2" is maintained as the paralog. However, if we had split this tree to produce two sequence files (as your algorithm might well have done), one containing only sequences in the "copy 1" clade and the other containing only sequences in the "copy 2" clade, then the sequences within each file have an orthologous relationship to one another and are therefore fine to use in the final analysis.

Again for the sake of conversation, you might well have a file in your dataset that contains only sequences of CDK5 "copy 2." That file would be fine for inclusion in your final matrix but would produce highly significant hits to the paralogs database of PhyloFisher. However, you might also have a file with a mixture of "copy 1" and "copy 2," and from the BLAST hits alone you would not be able to tell.
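To make the "copy 1" / "copy 2" point concrete, here is a minimal Python sketch of a rough screen. It assumes something not described above: that you BLAST your sequences against a combined set of ortholog and paralog reference sequences (not the paralogs alone) and save standard 12-column tabular output. All sequence IDs, orthogroup names, and the two lookup dictionaries are hypothetical. The sketch only reports whether the best hits within one file agree with each other; it is not a substitute for inspecting the gene trees.

```python
from collections import defaultdict


def row(query, subject, bitscore):
    """Build one 12-column tabular BLAST line with placeholder middle columns."""
    cols = [query, subject, "0", "0", "0", "0", "0", "0", "0", "0", "0", str(bitscore)]
    return "\t".join(cols)


def best_hit_labels(blast_lines, subject_label, query_group):
    """Return {orthogroup: set of labels of each query's best hit by bitscore}."""
    best = {}  # query id -> (bitscore, subject id)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, subject)
    labels = defaultdict(set)
    for query, (_, subject) in best.items():
        labels[query_group[query]].add(subject_label[subject])
    return dict(labels)


def classify(labels):
    """Report 'mixed' when one file has best hits to both copies."""
    return {g: "mixed" if len(s) > 1 else next(iter(s)) for g, s in labels.items()}


# Hypothetical reference labels: "copy 1" kept as ortholog, "copy 2" as paralog.
subject_label = {"CDK5_copy1_ref": "ortholog", "CDK5_copy2_ref": "paralog"}
# Hypothetical mapping from your sequences to your orthogroup files.
query_group = {"taxA_og1": "OG1", "taxB_og1": "OG1",
               "taxA_og2": "OG2", "taxB_og2": "OG2"}
blast_lines = [
    row("taxA_og1", "CDK5_copy1_ref", 250), row("taxA_og1", "CDK5_copy2_ref", 500),
    row("taxB_og1", "CDK5_copy2_ref", 480),
    row("taxA_og2", "CDK5_copy1_ref", 510), row("taxA_og2", "CDK5_copy2_ref", 240),
    row("taxB_og2", "CDK5_copy2_ref", 470), row("taxB_og2", "CDK5_copy1_ref", 260),
]

result = classify(best_hit_labels(blast_lines, subject_label, query_group))
print(result)
```

In this toy input, OG1 hits only "copy 2" sequences, so it is internally consistent and could be fine to keep even though every hit is to a "paralog." OG2 is the case to worry about, because its sequences split between the two copies. Best BLAST hits can mislead for divergent sequences, so treat a "mixed" flag as a prompt to look at the tree, not as a verdict.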
One strategy is to make a custom PhyloFisher database from your ortholog files (some or all, depending on how thorough you want to be) using the script build_database.py, and then move through the PhyloFisher workflow by recollecting sequences from all, or even just a few, of the taxa in your dataset. You would then build and manually inspect the resulting gene trees to evaluate ortholog selection in your dataset.

I hope this is helpful. Please let me know if I can further clarify anything for you or if you have additional questions I can answer.

Thank you for your interest in PhyloFisher.

Alex