Commit 991544b

2 parents 0a41e24 + 880413d commit 991544b

1 file changed: +40 -19 lines changed

README.md

Lines changed: 40 additions & 19 deletions
@@ -1,8 +1,10 @@
11
# clan_check
2-
Check trees for compatibility with defined monophyletic [edit - not right terminology ] groups - "The incontrovertible clan test"
2+
Check trees for compatibility with defined monophyletic [edit - not right terminology ] groups - "The incontestable clan test"
33

44
## Background
5-
###What does it do?
5+
6+
### What does it do?
7+
68
Clan_check analyses single-copy phylogenetic trees to assess if they violate clans* defined by the user.
79

810
>*see the following paper for a definition of a "clan"
@@ -15,29 +17,30 @@ The output is a list for all the trees of each clan using a scoring of 1 or 0 wh
1517

1618
The software will also return a 1 if none of the taxa from the clan are found in the tree, or if only 1 of the taxa is found.
1719

18-
A "0" means that two or more of the taxa from that clan were found and they were not monophyletic.
20+
A "0" means that two or more of the taxa from that clan were found and they were not in a clan (i.e. they were not together to the exclusion of all other taxa on the tree).
1921

2022
### But... why?
23+
2124
This is designed for large-scale phylogenomic analyses where the user may have thousands of phylogenetic trees. While every effort may have been taken to ensure that the best orthologs have been chosen, hidden paralogy can sometimes make it hard to get the choice right.
2225

23-
In these cases, the only evidence that the gene family may be problematic is when the resulting phylogeentic tree is "incorrect".
26+
In these cases, the only evidence that the gene family may be problematic is when the resulting phylogenetic tree is incorrect for known or "incontestable" groups.
2427

25-
One way to test for "problematic" gene families is to look for "incontrovertible relationships" that are not part of the question being asked in the study, but without doubt should exist if the taxa are in the tree.
28+
This involves looking for "incontestable relationships" that are not part of the question being asked in the study, but without doubt should exist if the taxa are in the tree.
2629

27-
An example of this is, if I was carrying out a phylogenomic study of the fishes and used several mammals as an outgroup, then I should never expect the mammal clan to be paraphyletic [edit - whats the equivalent of paraphyly for a clan?].
30+
For example, if a phylogenomic study analysed the relationships of the birds and used several mammals as an outgroup, then the mammals would always be expected to group together.
2831

29-
In this case the mammals are an incontrovertible clan. If the mammals are paraphyletic with the fishes, then it is very likely that one of the internal branches of the tree represents a duplication and not a speciation event, and so they are not all orthologs.
32+
In this case the mammals are an incontestable clan. If the mammals do not group together, then it is very likely that one of the internal branches of the tree represents a duplication and not a speciation event, and so some of the genes in the family may not be orthologs.
3033

31-
Clan_check searches for these instances.
34+
`Clan_check` searches for these instances.
3235

33-
If given many such clans to check, researchers can assess the number of these clans that are violated and decide on the weight of evidence necessary to remove or re-visit the analysis of that gene family.
36+
If given many such clans to check, researchers can assess the number of these clans that are violated and decide on the weight of evidence necessary to remove or re-visit the analysis of any gene families.
3437

35-
Care must be taken choosing the clans to be tested and in the designing of the study, to include taxa that allows this test to be made.
38+
Care must be taken in choosing the clans to be tested, and in designing the study, to include taxa that allow this test to be made.
3639

3740
You can provide trees and clans of any size and `clan_check` will search for the appropriate sub-set of the clans defined.
3841

3942
For example:
40-
>if you have a tree with `(A,B,(C,D));` and a clan definition of `C D E`, clan_check will search for monophylies of `C` and `D` only.
43+
>if you have a tree with `(A,B,(C,D));` and a clan definition of `C D E`, clan_check will search for clans containing `C` and `D` only.
4144
4245
If only 1 of the taxa from a clan is in the tree, clan_check will assume that the clan is not violated, and return a "1" for that test (see output files detail below).
4346

@@ -64,7 +67,7 @@ Usage: `clan_check -f [phylip formatted tree file] -c [clan file] `
6467

6568
Where: [phylip formatted tree file] is a phylip formatted file of trees to be assessed
6669

67-
[clan file] is a file lists of taxa in each line (space seperated) that are to be checked for monophylies.
70+
[clan file] is a file that contains one list of taxa per line (space separated); each list is checked to see whether its taxa form a clan.
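
As an illustration of the clan file layout described above (one clan per line, taxa separated by spaces), here is a minimal Python sketch of how such a file could be read. The helper name `parse_clans` is hypothetical and is not part of `clan_check`; the three example lines are the clans named elsewhere in this README (Clans 1, 3 and 6), not the complete example clan file.

```python
def parse_clans(text):
    """Return one list of taxon names per non-empty line of a clan file."""
    # Hypothetical helper: split each line on whitespace, skip blank lines.
    return [line.split() for line in text.splitlines() if line.strip()]


# Three of the clans mentioned in this README, written as clan-file lines.
example = "c d b\nc d a\n\ng d\n"
clans = parse_clans(example)
```

Each inner list can then be tested against every tree in the tree file.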
6871

6972
Two example files are provided:
7073

@@ -73,6 +76,7 @@ Two example files are provided:
7376
```
7477
(((a,(b,(c,d))),f),e);
7578
(((a,(b,(e,d))),c),g);
79+
((a,(b,(e,d))),c);
7680
```
7781
These trees can be rooted or unrooted. `clan_check` will unroot all rooted trees before carrying out the analysis.
7882

@@ -90,16 +94,33 @@ g d
9094

9195
The output will be named `[phylip formatted tree file].scores.txt` and will have the following format:
9296

93-
```
94-
Tree number size Clan 1 Clan 2 Clan 3 Clan 4 Clan 5 Clan 6
95-
Tree 1 6 1 1 0 1 1 1
96-
Tree 2 6 0 1 0 1 1 1
97-
```
97+
98+
|Tree number | size | Clan 1 | Clan 2 | Clan 3 | Clan 4 | Clan 5 | Clan 6 |
99+
|------------|------|--------|--------|--------|--------|--------|--------|
100+
|Tree 1 | 6 | 1 | 1 | 0 | 1 | 1 | ? |
101+
|Tree 2 | 6 | 0 | 1 | 0 | 1 | 1 | 1 |
102+
|Tree 3 | 5 | 0 | 1 | 0 | 1 | 1 | ? |
103+
98104
Where `tree number` is in the same order as the input trees, `size` = the number of taxa in the tree, `Clan x` is the clan defined by the xth line of the clan file.
99105

100-
In this example Clan 3 defined as having the monophyly of "c d a" was violated in both tree 1 and tree 2.
106+
### Interpreting the results
107+
108+
A "1" in the table means that this tree did not violate this clan.
109+
110+
A "0" in the table means that this tree violated this clan.
111+
112+
A "?" in the table means that there were not enough taxa from the Clan in this tree to carry out the test (minimum required is 2 taxa).
113+
114+
So in the test data:
115+
116+
* None of the three trees contained Clan 3 (c d a), despite all three trees containing all three taxa
117+
118+
* Neither tree 2 nor tree 3 contained Clan 1 (c d b), despite both trees containing all three taxa
119+
120+
* We could not test Clan 6 (g d) against Tree 1 or Tree 3 as neither of those trees had taxon "g".
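
The test described above can be sketched in a few lines of Python. This is an illustrative sketch only, not `clan_check`'s own implementation, and it may differ from the program in edge cases. It assumes that a set of taxa forms a clan in an unrooted tree when some branch separates exactly those taxa from all the others, and it follows the "1" / "0" / "?" convention of this section. The function names `leaf_sets` and `clan_status` are hypothetical.

```python
import re


def leaf_sets(newick):
    """Return (all leaf names, leaf set below every parenthesised group)."""
    tokens = re.findall(r"[(),]|[^(),;\s]+", newick)
    stack, clades, leaves = [], [], set()
    prev = None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            clades.append(frozenset(done))
            if stack:
                stack[-1] |= done
        elif tok != "," and prev != ")":
            # A label that does not follow ")" is a leaf; drop any ":length".
            name = tok.split(":")[0]
            leaves.add(name)
            if stack:
                stack[-1].add(name)
        prev = tok
    return leaves, clades


def clan_status(newick, clan):
    """Score one clan against one tree as "1", "0" or "?"."""
    leaves, clades = leaf_sets(newick)
    present = frozenset(clan) & leaves      # only taxa that are in the tree
    if len(present) < 2:
        return "?"                          # too few taxa to test
    for clade in clades:
        # Unrooted view: either side of a branch can be the clan.
        if present == clade or present == leaves - clade:
            return "1"
    return "0"


tree1 = "(((a,(b,(c,d))),f),e);"
status_clan1 = clan_status(tree1, ["c", "d", "b"])   # "1"
status_clan3 = clan_status(tree1, ["c", "d", "a"])   # "0"
status_clan6 = clan_status(tree1, ["g", "d"])        # "?"
```

Comparing against the complement of each group is what makes the check independent of where the tree happens to be rooted.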
121+
122+
For each tree, you can express the number of Clans violated as a sum or a percentage, or treat any violation as a reason to exclude the tree from further analyses. It all depends on what question you are asking and the level of stringency you wish to apply.
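
One way to turn a row of scores into such a summary is sketched below in Python. This is an assumption about how a user might post-process the results, not something `clan_check` does itself; the function name `summarise` is hypothetical, the rows are copied from the example table above, and "?" entries are treated as untested rather than as passes or failures.

```python
def summarise(scores):
    """Return (clans violated, clans tested, percent violated) for one tree."""
    tested = [s for s in scores if s != "?"]   # ignore untestable clans
    violated = tested.count("0")
    percent = 100.0 * violated / len(tested) if tested else 0.0
    return violated, len(tested), percent


# Rows taken from the example output table (Clan 1 to Clan 6).
table = {
    "Tree 1": ["1", "1", "0", "1", "1", "?"],
    "Tree 2": ["0", "1", "0", "1", "1", "1"],
    "Tree 3": ["0", "1", "0", "1", "1", "?"],
}

summary = {name: summarise(row) for name, row in table.items()}

# Strictest filter: keep only trees that violate no clan at all.
kept = [name for name, (violated, _, _) in summary.items() if violated == 0]
```

With the example rows every tree violates at least one clan, so the strictest filter keeps none of them; a looser threshold on the percentage would keep more.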
101123

102-
In this result Tree 2 violated 2 of the clans and tree 1 violoated 1.
103124

104125
## Caveats
105126
