Add filtering by reversions to naive sequence, generalize ranking strategy #130

willdumm · 2024-07-12T20:43:48Z

Motivation: Minimizing reversions

We continue to receive feedback that users of gctree often see reversions in the inferred trees, and that they edit the trees to remove these. Unfortunately, counting the total number of reversions in each tree efficiently is difficult because of the data structure we use for storing trees. However, it is easy to count only reversions to the naive sequence.

The primary purpose of this PR is to add the option to rank trees according to the number of reversions to the naive base at any site. However, experimenting with real gcreplay data, I see that prioritizing reversions above context likelihood or branching process likelihood often results in a significant reduction in likelihood of the top-ranked tree. Users of this option should consider how their intuition about reversions relates to the intuition behind likelihood models already available.

Also keep in mind that reversions are an inevitable feature of parsimony trees, when data they’re constructed on contains sites that support conflicting partitions of the taxa. A tree without reversions is possible in this case, but it will not be maximally parsimonious (it will contain many duplicate mutations occurring independently on different branches). Trees constructed using non-parsimony methods on densely sampled data will almost certainly also contain reversions, because it’s considered unlikely for the same mutation to occur independently many times. Since GCtree only ranks maximally parsimonious trees, it cannot completely eliminate reversions.

Additional changes in this PR

Adding reversion counting as a ranking criterion brings us to a total of six available ranking criteria. Previous versions of gctree allowed ranking of trees in two ways: either lexicographically with a fixed, default order of criteria (BP likelihood, isotype parsimony, context likelihood, allele count), or using a linear combination of available criteria, specified using the options --branching_process_ranking_coeff, --ranking_coeffs, and --use_old_mut_parsimony. This system of options was a big mistake, because it’s hard to add additional criteria while maintaining backward compatibility (the number of ranking coefficients needs to change) and despite its complexity it doesn’t allow customization of lexicographic orderings.

For this reason, I think it would be best to deprecate the three CLI options mentioned above, and replace them with a --ranking_strategy option which takes a string expression. Here’s how this is described in the docs:

Expression describing tree ranking strategy. If provided, takes precedence over all other ranking arguments. Two types of expressions are permitted: First are those describing lexicographic orderings, like B,C,A, which means choose trees to maximize branching process likelihood, then maximize context likelihood, then minimize number of alleles. Next are expressions describing linear combinations of criteria, like B+2C-1.1A, which means choose trees to minimize the specified linear combination of criteria. If linear combination expression has leading -, use = instead of space to separate argument, e.g. --ranking_strategy=-B+R. These two methods of ranking cannot be combined. For example, B+C,A is not a valid ranking strategy expression. Ranking criteria are specified using the following identifiers:
B - branching process likelihood (default maximized),
I - isotype parsimony (default minimized),
C - context-based Poisson likelihood (default maximized),
M - old mutability parsimony (deprecated, default minimized),
A - number of alleles (default minimized),
R - sitewise reversions to naive sequence (default minimized)
To compute the value of a criterion on ranked trees without affecting the ranking, include that ranking criterion with a coefficient of zero, as in B+2C+0A, or B,C,0A.
To reverse the default ordering on a criterion in lexicographic ranking, provide a negative coefficient. For example, B,-A will first maximize branching process likelihood, then maximize the number of alleles, since the default is to minimize the number of alleles. -B, A will minimize branching process likelihood, since the default is to maximize it, then minimize alleles.
A ranking strategy string containing a single ranking criterion identifier will be treated as a lexicographic ordering. That is, B will maximize branching process likelihood, and -B will minimize branching process likelihood, while B+0C will minimize branching process likelihood and -B+0C will maximize it. gctree infer --verbose will describe the ranking strategy used. Examine this output to make sure it’s as expected.

It is difficult to maintain consistent behavior and backward compatibility with the old CLI options, so their use now throws an error, urging the user to employ the new --ranking_strategy option. However, the default behavior without any of these CLI options remains identical to older versions of gctree.

If we ever want to add additional ranking criteria to gctree, these changes will make it very straightforward to do so while maintaining backward-compatibility.

Other changes:

use Pandas DataFrame printing utilities to format gctree output more consistently
change names of ranking criteria: Mut. Pars. to MutabilityParsimony and Isotype Pars. to IsotypeParsimony for consistency and to aid in manual parsing of gctree log files
add tests to tests/smalltest.sh for various kinds of --ranking_strategy expressions.
as mentioned in the quoted docstring above, we can compute criteria without ranking with respect to them by giving them a coefficient of zero. These criteria are included in outputs from --tree_stats (a csv file and pair plot showing all trees and best ranked tree) and --summarize_forest.

Remaining questions / TODOs:

I think it's nice for a ranking strategy of e.g. B,C,R to automatically maximize branching process likelihood and context likelihood, but minimize reversions. However, as noted in the docstring for the --ranking_strategy option, this creates ambiguity about the behavior of the ranking strategy B and -B, since linear combinations of criteria are always minimized. However, gctree output includes a description of how ranking is done, so that should help. For example: Ranking trees to maximize LogBPLikelihood then minimize IsotypeParsimony then maximize LogContextLikelihood then minimize Alleles then minimize NaiveReversions or Ranking trees to minimize a linear combination of -1(LogBPLikelihood) + 0.01(IsotypeParsimony) + -1(LogContextLikelihood). There is also a warning any time a coefficient's sign contradicts the default ordering. The alternative obvious approach would be to expect the user to sprinkle in negative signs whenever they want a criterion maximized rather than minimized, for instance -B,-C,R to do what B,C,R does now. That would eliminate any ambiguity/inconsistency, but it's less intuitive. Thoughts @wsdewitt?
I'll need to update the Quickstart to describe these changes

willdumm · 2024-07-12T21:00:47Z

Would close #90

wsdewitt

Thanks, Will—this seems very well thought-out! As for the min/max -/+ question, another approach would be to adjust our definitions so that the defaults are always minimizing. It wouldn't be so unnatural to change the log-likelihoods to negative log-likelihoods, and call them "loss". Then the sign usage would be less prone to misinterpretation, and the default ranking string would have no "-". As a side point, the docstring says "likelihood" for B and C, but isn't it log-likelihood? This difference matters for doing linear combinations.

I am curious to what extent reversions are eliminated, given that we are starting out with max parsimony trees.

willdumm · 2024-07-15T21:09:33Z

Thanks for the point about log likelihoods, I'll change that.

Is your proposal to change criterion definitions your recommendation? It seems reasonable.

I did look at how reversion counting would change inference on the gcreplay data. Keep in mind that this is reversion to the naive sequence, so it doesn't necessarily imply that there are fewer reverted mutations in the tree.

I compared these lexicographic orderings:

BP likelihood, context likelihood, alleles, reversion count (essentially the currently used method, since context likelihood almost always picks out a single tree)
BP likelihood, reversions, context likelihood, alleles
reversions, BP likelihood, context likelihood, alleles

Over 145 alignments, prioritizing naive reversions over context likelihood, but less than branching process likelihood results in a reversion reduction in 6.21 percent of gcreplay alignments. The median improvement was 1. The max improvement was 4. The median log context likelihood reduction was 0.77. The max log context likelihood reduction was 8.54.

Prioritizing naive reversions over everything else, including branching process likelihood results in a reversion reduction in 42.07 percent of gcreplay alignments. The median improvement was 1. The max improvement was 10. The median log bp likelihood reduction was 0.61. The max log bp likelihood reduction was 2.77.

willdumm · 2024-07-24T20:59:18Z

Further changes and additions:

Will's suggestion about minimizing all criteria by default is now implemented, and instead of log likelihoods we call them log loss.
The Quickstart is updated to reflect changes made in this PR
An option to draw nucleotide mutations on tree branches --show_nucleotide_mutations is added. This is nice for looking at reversions, since the existing aa mutation annotations don't show silent mutations
When ranking by linear combination, the value of the ranking expression is shown on the pair plot produced by --tree_stats

I think this is ready to merge @wsdewitt. Since it contains changes that aren't backward-compatible I suppose it should have a new major version.

willdumm · 2024-08-06T23:13:52Z

I've attempted to re-create the behavior of the old ranking coefficient arguments by translating them into a ranking strategy expression, if provided. I brought back some old tests to make sure that this works, so we should now be able to release this as something like v4.3.0

willdumm added 9 commits July 9, 2024 15:55

add reversion func and rewrite ranking again

5b17328

update tests

7938c08

working ranking expression parsing

eca4da2

cleanup, format, and lint

1edf8d0

fix forest summary

a3cc5d1

more fixes and cleanup

0c5e549

finish fixing output for zero coefficients

6295d93

WIP

804ad71

remove reversion counting from default ordering

5240f7e

willdumm requested a review from wsdewitt July 12, 2024 20:58

wsdewitt approved these changes Jul 15, 2024

View reviewed changes

willdumm and others added 7 commits July 17, 2024 15:47

fix tree counting with pickled forest input

14f21aa

add nucleotide annotations

ba30393

change to default minimize all criteria

f9421f5

simplify ranking assuming all criteria are minimized

be3946c

WIP add linear combination to tree_stats pairplot

cf5e4b8

format and lint

9f858f7

fix quickstart and docs

7da2859

restore backward compatibility and tests

074cf1b

willdumm merged commit 10eabc7 into main Aug 8, 2024
8 checks passed

willdumm deleted the wd-wt-reversions branch August 8, 2024 17:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add filtering by reversions to naive sequence, generalize ranking strategy #130

Add filtering by reversions to naive sequence, generalize ranking strategy #130

willdumm commented Jul 12, 2024 •

edited

Loading

willdumm commented Jul 12, 2024

wsdewitt left a comment •

edited

Loading

willdumm commented Jul 15, 2024

willdumm commented Jul 24, 2024

willdumm commented Aug 6, 2024

Add filtering by reversions to naive sequence, generalize ranking strategy #130

Add filtering by reversions to naive sequence, generalize ranking strategy #130

Conversation

willdumm commented Jul 12, 2024 • edited Loading

Motivation: Minimizing reversions

Additional changes in this PR

Other changes:

Remaining questions / TODOs:

willdumm commented Jul 12, 2024

wsdewitt left a comment • edited Loading

Choose a reason for hiding this comment

willdumm commented Jul 15, 2024

willdumm commented Jul 24, 2024

willdumm commented Aug 6, 2024

willdumm commented Jul 12, 2024 •

edited

Loading

wsdewitt left a comment •

edited

Loading