hapSum.<regionName>.s.<s>.png and hapSum_log.<regionName>.s.<s>.png
Across the region being imputed (x-axis, physical position), these plots show either the fraction (normal version) or the log10 (log version) of the expected number of haplotypes each sample carries from each ancestral haplotype. The most important thing to check in this plot is that the relative proportions of the ancestral haplotypes change gradually rather than drastically. Large swings in ancestral haplotype sums indicate that the heuristics are working sub-optimally (see next plot), though they could also reflect interesting biology. It is also useful to check how many ancestral haplotypes are being used, to see whether some could be trimmed, for example if some are very infrequently used. This is rare, however, as STITCH will try to fill them up.

shuffleHaplotypes.<iteration>.<regionName>.png
(Optional) This plot requires the option plot_shuffle_haplotype_attempts, but it is very useful when working with a species whose recombination rate differs substantially from what is expected, or just generally for visualizing some results from the model. Each plot, made according to the choices of shuffleHaplotypeIterations (the <iteration> marker above), has three subplots. First, the top plot shows recombination rate (green) and inverse cumulative recombination rate (gold) (see plot title). Second, the middle plot shows normalized recombination rate, normalized by averaging over shuffle_bin_radius base pairs, with heuristically determined recombination hotspots (two purple vertical bars surrounding the red vertical bar at the hotspot). These hotspots are checked for artificial ancestral haplotype swapping, i.e. where, in the idealized case, two ancestral haplotypes have been correctly inferred but erroneously include a swap between them at some point. Under each hotspot being checked there is a red or green colour; green indicates that an ancestral haplotype swap was inferred, which will be corrected in the subsequent iteration. Finally, the bottom plot shows posterior state usage of the ancestral haplotypes among the first 20 samples in the model (note the data is unphased, so this is the sum of the marginal posterior hidden state probabilities).

Together, what one wants to see is increasingly less jumbled posterior state probabilities in the bottom plot, i.e. the chunks of colour that extend horizontally reach progressively further, meaning that each sample has fewer recombinations against the ancestral haplotypes. If the recombination rate and heterozygosity of the species are very high (as in insects), you may need to decrease shuffle_bin_radius substantially, to say 100, so that it can properly find the hotspots at which to try to resolve swaps. When heterozygosity is lower, as in humans or mice, you want to set this higher, to minimize the effects of noise.

In addition, by the final iteration among shuffleHaplotypeIterations, you should no longer be able to identify many clear, consistent switches between samples in the bottom plot. For example, if many of the samples in the bottom plot have the same colour over long stretches, then all switch simultaneously to another colour, this suggests an unresolved ancestral haplotype flip. Ideally this won't happen much, if at all, by the final iteration at which shuffling is searched for. If it does, it suggests increasing the value of shuffleHaplotypeIterations, and possibly niterations as well.

r2.<regionName>.jpg
Scatter plot of real allele frequency as estimated from the pileup of sequencing reads (x-axis) against estimated allele frequency from the posterior of the model (y-axis). The model estimate is computed per SNP as the sum, over ancestral haplotypes, of the average usage of each ancestral haplotype times the probability that the ancestral haplotype emits an alternate base, divided by the number of samples. One should generally see good agreement between the two. Note that these are at "good SNPs" with info score > 0.4 and HWE p-value > 1x10^-6 (the latter criterion in particular might not be appropriate for all settings).

metricsForPostImputationQCChromosomeWide.<regionName>.sample.jpg
The top plot gives -log10(HWE p-value), the middle plot shows info score, and the bottom plot shows allele frequency, with the plots lined up vertically. This can be useful to get a sense of the LD structure (through the bottom plot, with the distribution of allele frequencies), as well as its relationship to imputation performance (the middle plot, the info score, which is a measure of the confidence the imputation has in its performance). One can also use this to test how various parameter choices affect imputation. For example, for species that have recently been through bottlenecks, allele frequencies are often highly correlated locally, visible as multiple nearby SNPs that have the same allele frequency. Choices of imputation parameters that tighten the spread of these allele frequencies often correlate with increased imputation performance.

metricsForPostImputationQC.<regionName>.sample.jpg
Similar to the above, but plotting each of the three metrics against each other. Useful for getting an overall sense of the distribution of the metrics, particularly info, which can help when choosing a threshold for filtering out variants after imputation. For example, in the middle plot of estimated allele frequency vs info, if you expect your data to cluster into a small range of allele frequencies, and your data seems concentrated in those frequencies at high info, you can think about an info score cutoff that allows you to capture most of these variants.

alphaMat.<regionName>.s.<s>.png and alphaMat.<regionName>.normalized.s.<s>.png
Likely not informative for general use. alphaMat is the name of the internal variable that stores the probability of jumping into an ancestral haplotype conditional on a jump. The un-normalized version is with respect to movement of all samples, while the normalized version is rescaled to sum to 1.
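
To make the difference between the two alphaMat plots concrete, here is a minimal sketch of the rescaling step. It is not STITCH's internal code. It assumes the un-normalized values are arranged with one row per interval and one column per ancestral haplotype, and that normalization is done within each row. The function name, the toy values, and the uniform fallback for an all-zero row are all hypothetical.

```python
def normalize_alpha_mat(alpha_mat):
    """Rescale each row so that its entries sum to 1.

    Assumption: rows are intervals, columns are ancestral haplotypes.
    """
    normalized = []
    for row in alpha_mat:
        total = sum(row)
        if total == 0:
            # Hypothetical fallback: no movement recorded, so use a uniform row.
            normalized.append([1.0 / len(row)] * len(row))
        else:
            normalized.append([value / total for value in row])
    return normalized


# Toy un-normalized values for K = 4 ancestral haplotypes (made-up numbers).
alpha_unnormalized = [
    [2.0, 1.0, 1.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.0],
]

alpha_normalized = normalize_alpha_mat(alpha_unnormalized)
for row in alpha_normalized:
    print([round(value, 3) for value in row])
```

After rescaling, each row can be read directly as a probability distribution over which ancestral haplotype is entered given that a jump occurs, whereas the un-normalized rows also reflect how much movement there was overall.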