-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add filtering by reversions to naive sequence, generalize ranking strategy #130
Conversation
Would close #90 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Will—this seems very well thought-out! As for the min/max -/+ question, another approach would be to adjust our definitions so that the defaults are always minimizing. It wouldn't be so unnatural to change the log-likelihoods to negative log-likelihoods, and call them "loss". Then the sign usage would be less prone to misinterpretation, and the default ranking string would have no "-". As a side point, the docstring says "likelihood" for B and C, but isn't it log-likelihood? This difference matters for doing linear combinations.
I am curious to what extent reversions are eliminated, given that we are starting out with max parsimony trees.
Thanks for the point about log likelihoods, I'll change that. Is your proposal to change criterion definitions your recommendation? It seems reasonable. I did look at how reversion counting would change inference on the gcreplay data. Keep in mind that this is reversion to the naive sequence, so it doesn't necessarily imply that there are fewer reverted mutations in the tree. I compared these lexicographic orderings:
Over 145 alignments, prioritizing naive reversions over context likelihood, but less than branching process likelihood results in a reversion reduction in 6.21 percent of gcreplay alignments. The median improvement was 1. The max improvement was 4. The median log context likelihood reduction was 0.77. The max log context likelihood reduction was 8.54. Prioritizing naive reversions over everything else, including branching process likelihood results in a reversion reduction in 42.07 percent of gcreplay alignments. The median improvement was 1. The max improvement was 10. The median log bp likelihood reduction was 0.61. The max log bp likelihood reduction was 2.77. |
Further changes and additions:
I think this is ready to merge @wsdewitt. Since it contains changes that aren't backward-compatible I suppose it should have a new major version. |
I've attempted to re-create the behavior of the old ranking coefficient arguments by translating them into a ranking strategy expression, if provided. I brought back some old tests to make sure that this works, so we should now be able to release this as something like |
Motivation: Minimizing reversions
We continue to receive feedback that users of gctree often see reversions in the inferred trees, and that they edit the trees to remove these. Unfortunately, counting the total number of reversions in each tree efficiently is difficult because of the data structure we use for storing trees. However, it is easy to count only reversions to the naive sequence.
The primary purpose of this PR is to add the option to rank trees according to the number of reversions to the naive base at any site. However, experimenting with real gcreplay data, I see that prioritizing reversions above context likelihood or branching process likelihood often results in a significant reduction in likelihood of the top-ranked tree. Users of this option should consider how their intuition about reversions relates to the intuition behind likelihood models already available.
Also keep in mind that reversions are an inevitable feature of parsimony trees, when data they’re constructed on contains sites that support conflicting partitions of the taxa. A tree without reversions is possible in this case, but it will not be maximally parsimonious (it will contain many duplicate mutations occurring independently on different branches). Trees constructed using non-parsimony methods on densely sampled data will almost certainly also contain reversions, because it’s considered unlikely for the same mutation to occur independently many times. Since GCtree only ranks maximally parsimonious trees, it cannot completely eliminate reversions.
Additional changes in this PR
Adding reversion counting as a ranking criterion brings us to a total of six available ranking criteria. Previous versions of gctree allowed ranking of trees in two ways: either lexicographically with a fixed, default order of criteria (BP likelihood, isotype parsimony, context likelihood, allele count), or using a linear combination of available criteria, specified using the options
--branching_process_ranking_coeff
,--ranking_coeffs
, and--use_old_mut_parsimony
. This system of options was a big mistake, because it’s hard to add additional criteria while maintaining backward compatibility (the number of ranking coefficients needs to change) and despite its complexity it doesn’t allow customization of lexicographic orderings.For this reason, I think it would be best to deprecate the three CLI options mentioned above, and replace them with a
--ranking_strategy
option which takes a string expression. Here’s how this is described in the docs:It is difficult to maintain consistent behavior and backward compatibility with the old CLI options, so their use now throws an error, urging the user to employ the new
--ranking_strategy
option. However, the default behavior without any of these CLI options remains identical to older versions of gctree.If we ever want to add additional ranking criteria to gctree, these changes will make it very straightforward to do so while maintaining backward-compatibility.
Other changes:
Mut. Pars.
toMutabilityParsimony
andIsotype Pars.
toIsotypeParsimony
for consistency and to aid in manual parsing of gctree log filestests/smalltest.sh
for various kinds of--ranking_strategy
expressions.--tree_stats
(a csv file and pair plot showing all trees and best ranked tree) and--summarize_forest
.Remaining questions / TODOs:
B,C,R
to automatically maximize branching process likelihood and context likelihood, but minimize reversions. However, as noted in the docstring for the--ranking_strategy
option, this creates ambiguity about the behavior of the ranking strategyB
and-B
, since linear combinations of criteria are always minimized. However, gctree output includes a description of how ranking is done, so that should help. For example:Ranking trees to maximize LogBPLikelihood then minimize IsotypeParsimony then maximize LogContextLikelihood then minimize Alleles then minimize NaiveReversions
orRanking trees to minimize a linear combination of -1(LogBPLikelihood) + 0.01(IsotypeParsimony) + -1(LogContextLikelihood)
. There is also a warning any time a coefficient's sign contradicts the default ordering. The alternative obvious approach would be to expect the user to sprinkle in negative signs whenever they want a criterion maximized rather than minimized, for instance-B,-C,R
to do whatB,C,R
does now. That would eliminate any ambiguity/inconsistency, but it's less intuitive. Thoughts @wsdewitt?