Do the things we don't say (but perhaps that we thought) affect what we say (or think!) in the future? Modern (standard) LLMs output sequences of tokens, one token at a time. However, in order to emit a single token at timestep t, the model first computes a probability distribution over its entire vocabulary, conditioned on everything that came before; the emitted token is a single draw from that distribution, and the rest of what the model "considered" is thrown away.
Given a model and an observed sequence of T tokens, we can therefore record that full distribution at every position, which yields a T × |V| tensor of probabilities (where |V| is the vocabulary size): the model's step-by-step "thoughts", of which only one token per step was actually "said".
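As a concrete sketch of what that looks like in practice (assuming a Hugging Face causal LM, with GPT-2 used here purely as a stand-in), the full tensor can be read off the model's logits in a few lines:

```python
# Minimal sketch: extract the full per-position probability tensor from a causal LM.
# Model choice and text are placeholders, not anything specific to this article.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM with a standard LM head would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Do the things we don't say affect what we say later?"
input_ids = tokenizer(text, return_tensors="pt").input_ids  # shape: (1, T)

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (1, T, |V|)

# Full next-token distribution at every position: a T x |V| tensor.
# Position t holds the distribution over what comes *after* token t,
# i.e., everything the model "considered", of which only one token is observed.
probs = torch.softmax(logits[0], dim=-1)  # shape: (T, |V|)
print(probs.shape)
```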
We can then ask: given an observed sequence of tokens from a human conversation or narrative, can we better explain the token-by-token probabilities using that full tensor (e.g., by accounting for tokens that received probability mass but were never emitted), or is all of the predictive power carried by the single observed sequence alone?
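One way to make the question concrete (purely an illustrative operationalization, not necessarily the analysis intended here) is to separate, at each position, the probability assigned to the token that was actually observed from summary statistics of the mass spread over tokens that were never emitted, such as the entropy of the distribution or the strongest unchosen alternative. The sketch below continues from `probs`, `input_ids`, and `tokenizer` in the previous snippet; the particular summaries are assumptions about how one might quantify "tokens not emitted".

```python
# Continuing from `probs` (T x |V|), `input_ids`, and `tokenizer` above.
import torch

observed = input_ids[0][1:]      # token actually "said" at each step
step_probs = probs[:-1]          # distribution in place *before* each observed token

# Signal carried by the observed sequence alone: log-probability of what was said.
logp_observed = torch.log(step_probs.gather(1, observed.unsqueeze(1))).squeeze(1)

# Signals carried by what was *not* said: entropy of each distribution and the
# probability of the strongest alternative that went unspoken.
entropy = -(step_probs * torch.log(step_probs + 1e-12)).sum(dim=-1)
masked = step_probs.scatter(1, observed.unsqueeze(1), 0.0)  # zero out the emitted token
runner_up = masked.max(dim=-1).values

for tok, lp, h, ru in zip(observed.tolist(), logp_observed.tolist(),
                          entropy.tolist(), runner_up.tolist()):
    print(f"{tokenizer.decode([tok])!r:>12}  logp={lp:6.2f}  H={h:5.2f}  alt={ru:.3f}")
```

If features like the entropy or runner-up mass at earlier positions improve prediction of later tokens beyond what the observed tokens themselves provide, that would be one signal that the "unsaid" part of the tensor matters.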