Allow split T5 & CLIP prompts for flux & add a separate T5 token counter #1906
All examples are generated with Flux 1 Dev using the same seed and sampling parameters: euler sampler with simple scheduler, 20 steps, 896x1152 resolution, distilled CFG scale 3.5, CFG scale 1, & seed 65432. The prompts demonstrate some of the differences in T5 and CLIP tokenization, as seen in this debug output from the 2nd example below (SPLIT case):
[Images: Example 1, Example 2, Example 3]
Hey @DenOfEquity! I noticed your forgeFlux_dualPrompt extension and saw that we're working on similar ideas around multi-prompt support. I recently submitted this PR for dual prompt support and a T5 token counter for Flux models, and I'm also thinking about future improvements such as providing a complete tokenization breakdown for the user. That would be beneficial for users of single encoder models too, although it's an interesting challenge to do that in a way that's not disruptive to the existing interface... Anyway, I'd love to hear your thoughts on the approach in this PR and whether you're interested in collaborating. Before I started working on this, I looked through past issues and PRs here but didn't see anything similar, so I'm curious whether this was something you and @lllyasviel had previously discussed, especially in terms of native support vs. an extension.
Hi. I haven't had time to test this yet, and I'd also like to see if lllyasviel has new opinions on it. IMO this keyword method is ideal as there are no major UI changes. The method should be extensible to at least SDXL (CLIP-L, CLIP-G).
Thanks for the initial feedback.
I agree, and there is no API change either. It all flows through the same prompt argument as before, and there is no change if the SPLIT keyword is not provided. Plus, users are already familiar with this paradigm (AND, BREAK, etc)... but mostly, I implemented it this way so that the change would be small and not increase maintenance burden.
One reason I included the tokenization breakdown in my example is that I think it makes it much easier to see why we should care about the ability to specify different prompts per encoder. I've been working with print debugging to see the tokenization in the terminal, but I want to find the right way to show this to the user, because understanding this will help them to write better prompts (even when using single encoder models). Two designs I'm thinking about: (1) show the prompt tokenization on mouseover of the token counter, or (2) add a new accordion (collapsed by default to minimize visual disruption) which shows the tokenization once expanded. I think another good comparison to try is simply
Anyway, since this project aims to be more research focused, I think there is also a good argument for exposing the functionality to users, as they may discover some methods and applications that we did not consider.
I haven't experimented with that yet, and while my gut intuition is that it might be less valuable than the Flux case due to architectural differences, it would still be a very small change to implement, as you are pointing out. Thinking about it also has me rethinking a few things, like naming the token counters primary/secondary instead of CLIP/T5, and adding the additional counter to the negative prompt as well. That might actually simplify things even further: it may even be possible to treat both counters as a single UI element, and do most all changes
Plus, some of my tests with setting CFG != 1 on Flux have me thinking perhaps negative prompts aren't totally DOA there after all.
No, it requires a small adjustment... and I should also test how embeddings are handled just to make sure. Thanks for pointing this out.
CLIP and T5 are very different in the way they process and understand language, with a major difference being tokenization. T5 is case-sensitive, while CLIP is case-insensitive. T5 uses a prefix ("▁") to mark tokens that occur at the beginning of a word, whereas CLIP appends "</w>" to tokens that end a word. Furthermore, T5 includes a rich vocabulary that extends into other languages, such as French and German, while CLIP is more likely to have single-token recognition of internet-specific terms like "womancrushwednesday" or "hamillhimself."
So to get the most out of a dual encoder model architecture like Flux, it's valuable to be able to pass separate prompts to the T5 and CLIP encoders to leverage their respective strengths and exert fine-grained artistic control over the conditioning.
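To make the two word-boundary conventions concrete, here is a toy sketch that only mimics the markers and the case handling at the whole-word level. It is not either real tokenizer: the actual T5 and CLIP tokenizers split words into subword pieces from their own vocabularies, and the function names `t5_style_markers` and `clip_style_markers` are hypothetical, made up for this illustration.

```python
def t5_style_markers(text: str) -> list[str]:
    """Toy illustration: prefix each word with the word-start marker, keeping case."""
    return ["▁" + word for word in text.split()]


def clip_style_markers(text: str) -> list[str]:
    """Toy illustration: lowercase each word and append the end-of-word marker."""
    return [word.lower() + "</w>" for word in text.split()]


if __name__ == "__main__":
    sample = "Hello World"
    print(t5_style_markers(sample))    # case preserved, marker at word start
    print(clip_style_markers(sample))  # lowercased, marker at word end
```

Even in this simplified form, the same text yields different sequences for each encoder, which is the motivation for letting users see both token counts.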
This commit introduces the SPLIT keyword for Flux models. When present, everything before SPLIT is processed by T5, while everything after is processed by CLIP. Additionally, an extra token counter has been added to the UI, helping users understand how each encoder interprets their prompts.
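As a rough illustration of the behavior described above, the keyword handling could be sketched as follows. This is a minimal standalone sketch, not the code in this PR: the names `EncoderPrompts` and `split_prompt` are hypothetical, and it assumes SPLIT must appear as a standalone uppercase word and that only the first occurrence is honored.

```python
import re
from typing import NamedTuple


class EncoderPrompts(NamedTuple):
    """Hypothetical container for the per-encoder prompt text."""
    t5: str
    clip: str


# Assumption: the keyword is matched as a whole uppercase word, so "SPLITTING" is ignored.
_SPLIT_RE = re.compile(r"\bSPLIT\b")


def split_prompt(prompt: str) -> EncoderPrompts:
    """Route text before SPLIT to T5 and text after it to CLIP."""
    parts = _SPLIT_RE.split(prompt, maxsplit=1)
    if len(parts) == 1:
        # No keyword present: both encoders receive the prompt unchanged.
        return EncoderPrompts(t5=prompt, clip=prompt)
    before, after = parts
    return EncoderPrompts(t5=before.strip(), clip=after.strip())


if __name__ == "__main__":
    print(split_prompt("A detailed oil painting of a lighthouse SPLIT lighthouse, oil painting"))
    print(split_prompt("a lighthouse at dusk"))
```

Because a prompt without the keyword is passed through untouched, existing prompts and API calls would behave exactly as before under this sketch.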
Example images forthcoming.