-
Notifications
You must be signed in to change notification settings - Fork 57
Add pattern match to reduce suffix duplication #71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
A perfect llm wouldn't need this, but if the suffix is already there we should avoid appending it again, particularly in multi-line suggestions. Except closing brackets or similar. We should still append those.
If you want to see an example of this, just write out multiple lines repeating the phrase |
Thanks, I've switched to this branch to give it a spin for a few days and see if it reduces the duplicated completions. Will report back. |
Awesome, hope it goes well. As noted, the main possible regression to look for, besides any unforeseen bug, is if situations arise where the LLM explicitly and correctly wanted to duplicate the suffix, such as adding a closing bracket within a nested function call. Otherwise this should strictly be an improvement on the existing logic, assuming my construction of the escaped regex pattern is sound and safe. It's likely excessively strict to limit the new behavior to cases with two word chars in the suffix, though I chose to err on the side of caution here, as it seems preferable to avoid possibly breaking correct completions at the expense of missed opportunities to correct a mistake. Thinking a bit more about it though, my current condition is likely very poorly chosen for non-latin use-cases, as the |
Hmm, very strange. I'm assuming the extra We could probably add an extra check before the current logic to see if the extra lines beyond the first are duplicates, though I'm not sure if sometimes the LLM would want to duplicate lines. I haven't given that specific case much thought, we probably need to think more about this. In theory some logic to match against the subsequent lines and delete them if they match what's there would work, and would then allow the current suffix logic implemented here to do it's job as-is. Unfortunately this specific example with the trailing open-brace will still fail the punctuation check. Like I said, the LLM may have valid reason to insert extra closing punctuation, and I think it might be over-complicating the code to try to specifically cover all combinations of closing punctuation vs opening punctuation, though it's not impossible. It would be good to have a think about this still, maybe it's easier than expected, if we just maintain a list of closing or middle punctuation (I'm thinking commas should be included) and construct some sensible pattern out of that. I will have a think about that. This does raise the interesting side question though, of what to do if the LLM decides to fill-in-middle on the first line, but add additional correct lines after. Maybe we need to be repeating our suffix pattern check against the first line of completion, If you're running into the simpler-to-solve cases a fair bit, I can implement those two bits relatively easily so you can take them for a test drive, should I add them to the PR or make a new branch? |
I think the correct approach is to not allow multi-line completions when we are FIM-ing before the end of the current line and add logic to deduplicate the line suffix from the generated suggestion. For example: {cur_line_prefix}█{cur_line_suffix} If |
Good point, that way the user can request a multi-line completion by completing at the end of the line. Not sure whether the LLM should still be allowed to hard-wrap its output though, but then I don't have a good feel for how often that'll come up in practice in general code. I'll cook up a branch with and without for you to try out when I get the time. I'll also add opening braces, brackets, and angles to the allowable suffix match patterns, I think that should only cause a mismatch in the very specific and weird scenario where it could want to open a new bracket that is immediately closed by an existing one before being reopened, all at the end of the line, going from I would test this out more myself, but my work is currently involving more running of code with different parameters than coding. Sorry for relying on you to test out this stuff 😆. If we do disallow that multi-line pattern, I might wanna see how much it affects the hard-wrapped latex code I write, as wrapping should come up a bit more there. EDIT: if we do force inline suggestions for an inline completion, we should modify the rendering, at least for neovim, to use inline virtual text drawing. That way the suffix of the line gets pushed about and soft-wrapped, and also doesn't change colour as we don't draw over the top of it. Not sure what the equivalent is for vim. As the other pattern of completion will be at the end-of-line, we can also try out inline completion on the first line there. |
A perfect LLM wouldn't need this, because it would understand the suffix is already there and that this project's code will append it anyway. But real LLMs aren't perfect, so if the suffix is already there in the last line then we should avoid appending it again, particularly in multi-line suggestions.
However, the suffix could be a closing bracket and the completion could add a function call ending in a closing bracket or other punctuation, so to avoid messing up otherwise perfect completions we only use this under the condition that there are some word characters in there, which hopefully will be rarely intentionally duplicated, unlike the example of a closing bracket.
If anyone else has opinions on what should determine if the suffix is purely that kind of punctuation or not then feel free to tweak this, but for now I just check for two consecutive word characters. That may need tweaking if it causes issues for a particular programming language or something like that.
Also would be very grateful if someone had a check of my use of
escape()
in the pattern matching. This should be safe, due to the use of very magic pattern matching (\V
), and even failing that the regex shouldn't be able to do anything nasty anyway, but I still feel a little uneasy passing in arbitrary data here, given my lack of experience writing this kind of code before.