Add heuristics for adversarial suffixes #58

Open
seanpmorgan opened this issue Oct 4, 2023 · 4 comments
Labels
enhancement New feature or request help wanted Extra attention is needed heuristics

Comments

@seanpmorgan
Member

It would be interesting to see what kinds of heuristics can be applied against adversarial suffixes. For background:
https://arxiv.org/abs/2307.15043
https://github.com/llm-attacks/llm-attacks

To be clear, this wouldn't be a defense against all possible adversarial attacks. It does seem like we could screen out some of them, though.

@seanpmorgan seanpmorgan added enhancement New feature or request help wanted Extra attention is needed heuristics labels Oct 4, 2023
@ristomcgehee
Contributor

What would you think about using a machine learning classifier instead? We could generate a list of several hundred or several thousand adversarial suffixes, and then train a model to classify text as adversarial vs. non-adversarial. It would probably need to be a neural network in order to handle the complexities of language, but if it were not transformer-based, I would think it wouldn't share the same underlying weakness as the LLM. A determined attacker could still train a suffix generator that evades our classifier, but it would significantly increase the attacker's costs.
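
As a rough illustration of the classifier idea, here is a minimal sketch. It swaps in a plain logistic regression over hashed character trigrams rather than a neural network, purely to stay dependency-free. Every name here (`featurize`, `train`, `predict_proba`, `DIM`) is hypothetical and not part of Rebuff, and the training strings in the usage example are synthetic placeholders, not real attack suffixes.

```python
import math
import zlib
from typing import Dict, List, Tuple

DIM = 4096  # number of hash buckets; an arbitrary choice for this sketch


def featurize(text: str, n: int = 3) -> Dict[int, float]:
    """Sparse, L2-normalized counts of hashed character n-grams."""
    counts: Dict[int, float] = {}
    for i in range(max(len(text) - n + 1, 0)):
        bucket = zlib.crc32(text[i:i + n].encode("utf-8")) % DIM
        counts[bucket] = counts.get(bucket, 0.0) + 1.0
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {k: v / norm for k, v in counts.items()} if norm else counts


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(samples: List[Tuple[str, int]], epochs: int = 100, lr: float = 1.0):
    """Fit logistic regression with plain SGD; label 1 means adversarial."""
    weights: Dict[int, float] = {}
    bias = 0.0
    data = [(featurize(text), label) for text, label in samples]
    for _ in range(epochs):
        for feats, label in data:
            z = bias + sum(weights.get(k, 0.0) * v for k, v in feats.items())
            err = label - sigmoid(z)
            bias += lr * err
            for k, v in feats.items():
                weights[k] = weights.get(k, 0.0) + lr * err * v
    return weights, bias


def predict_proba(model, text: str) -> float:
    """Estimated probability that `text` is adversarial."""
    weights, bias = model
    feats = featurize(text)
    return sigmoid(bias + sum(weights.get(k, 0.0) * v for k, v in feats.items()))
```

Usage would look something like `model = train([("what is the capital city of france", 0), ("}]( !! ;) =={ [[ ++ :: ]( %%", 1)])` followed by `predict_proba(model, user_input)`. A real version would need a much larger labeled set and a held-out evaluation.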

It seems to me that coming up with heuristics for this would be quite challenging. I kinda think one needs an understanding of normal language in order to recognize a suffix as suspicious, and it would be laborious to try to encode that understanding manually.

@ristomcgehee
Contributor

After thinking about this more, perhaps a better approach would be to fine-tune an existing LLM. The fine-tuned LLM could be trained to recognize a wide variety of prompt injection attacks, not just adversarial suffixes. I think fine-tuning could help with situations where the attacker tries to prompt-inject Rebuff itself. I expect that sooner or later, OpenAI will have some mechanism to share fine-tuned models. It also seems it might be possible to fine-tune Llama 2 and distribute the modified weights.

@seanpmorgan
Member Author

So the issue with using a machine learning classifier here is that a gradient-based adversarial input can be crafted to simultaneously trick the LLM and the classifier (especially since the model would be publicly available). We want to rely on more traditional "heuristics" to infer that a crafted input is suspicious. The current heuristics we have built in are pretty basic, but we can utilize more advanced techniques such as grammar parsing.
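
To make the "traditional heuristics" idea concrete, here is a minimal sketch of one possible check. It flags prompts whose tail is dominated by symbols and non-word tokens. The function names, the 120-character window, and both thresholds are assumptions made up for illustration, not existing Rebuff code and not tuned on real data.

```python
import string


def symbol_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are not letters or digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if not c.isalnum()) / len(chars)


def odd_token_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens that are not plain words."""
    tokens = text.split()
    if not tokens:
        return 0.0

    def is_plain_word(token: str) -> bool:
        # Ignore leading/trailing punctuation such as "please?" or "(note)".
        return token.strip(string.punctuation).isalpha()

    return sum(1 for t in tokens if not is_plain_word(t)) / len(tokens)


def looks_like_adversarial_suffix(
    prompt: str,
    tail_chars: int = 120,          # assumed window size
    symbol_threshold: float = 0.2,  # assumed, untuned
    odd_threshold: float = 0.4,     # assumed, untuned
) -> bool:
    """Flag prompts whose tail is unusually heavy in symbols and non-words."""
    tail = prompt[-tail_chars:]
    return (
        symbol_ratio(tail) >= symbol_threshold
        and odd_token_ratio(tail) >= odd_threshold
    )
```

A check like this would miss suffixes made of ordinary-looking words and would probably flag legitimate inputs such as code snippets, so it could only ever be one signal among several.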

That doesn't mean we can't use ML models as defense layers in general, though, just that they're not the solution for gradient-based attacks. I think it's a great idea in general, and we can start working on #13 to support that and other modular defenses that we want to add.

@ristomcgehee
Contributor

Yeah, you're right that an adversary can trick both the LLM and the classifier. I'm just having trouble thinking of heuristics that might work against this sort of attack. Though maybe if I knew more about traditional NLP, I'd be able to think of some ideas.

I wonder if adversarial suffix attacks are similar to each other in vector space? Perhaps the vector similarity defense could help here.
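
One way to probe that question is to compare a candidate input against a set of known attack strings by cosine similarity. The sketch below is only an illustration: `embed` is a toy stand-in (hashed character trigram counts) for whatever real embedding model would be used, and `max_similarity` and `DIM` are hypothetical names.

```python
import math
import zlib
from typing import List

DIM = 256  # toy embedding size; an arbitrary choice for this sketch


def embed(text: str, n: int = 3) -> List[float]:
    """Toy embedding: hashed character n-gram counts, L2-normalized.

    Stand-in for a real embedding model; returns a zero vector for
    inputs shorter than n characters.
    """
    vec = [0.0] * DIM
    for i in range(max(len(text) - n + 1, 0)):
        gram = text[i:i + n]
        vec[zlib.crc32(gram.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors that are already L2-normalized."""
    return sum(x * y for x, y in zip(a, b))


def max_similarity(candidate: str, known_attacks: List[str]) -> float:
    """Highest similarity between the candidate and any known attack string."""
    cand = embed(candidate)
    return max((cosine(cand, embed(k)) for k in known_attacks), default=0.0)
```

An input would then be flagged when `max_similarity(user_input, known_attacks)` exceeds some threshold. Whether optimized suffixes actually cluster closely enough for this to work is exactly the open question, so the threshold would have to be determined empirically.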


2 participants