Android App Llama-v3-2-3B-Chat Quantized returns garbled text #34
Comments
I have faced a similar issue. I found the generation quality degraded a lot compared to running it with onnx-runtime in 4-bit.
Hi @zhehui-chen and @franklyd, sorry to hear you are seeing poor results through the app. We know that the app has issues on some consumer devices (especially on Android 14 or earlier); there are two underlying features needed in the Android "metabuild", which is why the app README says it only works on Android 15. If you can tell us exactly what device, what Android version, and ideally the exact Android build (it should be in the settings), that would be really helpful so that we can investigate further why it's not working, especially if it's on Android 15, where we expect it to work. Thanks!
Thanks! Actually, I was running genie-t2t-run directly.
This is the command I used:
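(Representative form of the invocation, with placeholder paths:)

```bash
# Run Genie text-to-text generation against the exported model bundle.
# <genie_bundle_path> is a placeholder; the prompt follows the Llama 3
# chat template that this model expects.
./genie-t2t-run \
  -c <genie_bundle_path>/genie_config.json \
  -p "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nWhat is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
```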
In the README's Note on Model Creation: using the conversion method results in a fixed context length, which remains unchanged thereafter, as far as I understand. There is also a note that the latest context-length issues are fixed; the older version I tested had this issue. And under Device Requirements: "At least Genie SDK from QNN SDK 2.29.0 (earlier versions have issues with long prompts)."
@franklyd The Android 15 requirement affects all uses of the Genie SDK, whether through genie-t2t-run or the app. Which version of Android are you on (and what model is your phone)? The model should produce sensible results, so if it does not, it is either the metabuild requirement or some other issue we have yet to discover.

@Zant12 Are you also seeing garbled responses? If so, what is your Android OS version and device?
Thanks! Previously I tested on Android 14, on the OnePlus Gen 3 pad. Now I have a Xiaomi 15 phone (with Android 15), and I can test on that.
@gustavla I have recently retested the app. I do still see an issue on a second submission. To reproduce:
Model verified: Llama 3.2 3B. I believe this app caches the previous responses, which would explain the additional complexity?
@Zant12 Thank you for the detailed description. Given that you do get OK results on the first prompt, it does suggest that there might be a bug in the app. Have you tried to reproduce this through genie-t2t-run? You would need to manually construct the prompts to build up the chat history, as in the sketch below. We will try to find time to see if we can reproduce this as well.
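For example (an illustrative sketch; the messages are made up, and the standard Llama 3 chat template is assumed):

```bash
# Illustrative two-turn prompt for genie-t2t-run using the Llama 3 chat
# template: the first user message and the model's first reply are replayed
# verbatim, followed by the new user message. All contents are placeholders.
PROMPT="<|begin_of_text|>\
<|start_header_id|>user<|end_header_id|>\n\nSummarize this speech: <speech text><|eot_id|>\
<|start_header_id|>assistant<|end_header_id|>\n\n<first model response><|eot_id|>\
<|start_header_id|>user<|end_header_id|>\n\nNow shorten it to one sentence.<|eot_id|>\
<|start_header_id|>assistant<|end_header_id|>\n\n"
./genie-t2t-run -c <genie_bundle_path>/genie_config.json -p "$PROMPT"
```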
@gustavla Actually, after more patient testing, it does return a correct response in both the CLI and the app, but I never saw it because prompt processing for the second response takes minutes, instead of the few seconds you normally see. *Sorry, it was not the Gettysburg Address (which is shorter); I tested with the JFK inaugural address. It should still fit into the 2k context window.
I'm having the same issue running the app on a Samsung S25 Ultra (Snapdragon 8 Elite).
@Zant12 Thanks for clarifying. We will try to reproduce this on our end.

@marcusnagy Sorry to hear that. Can you give us some more information about your device? For instance, the Android OS version and any additional details you can provide.
@gustavla Here is our genie_config:

```json
{
"dialog": {
"version": 1,
"type": "basic",
"context": {
"version": 1,
"size": 2048,
"n-vocab": 128256,
"bos-token": -1,
"eos-token": [128001, 128009, 128008]
},
"sampler": {
"version": 1,
"seed": 42,
"temp": 0.8,
"top-k": 40,
"top-p": 0.95
},
"tokenizer": {
"version": 1,
"path": "<tokenizer_path>"
},
"engine": {
"version": 1,
"n-threads": 3,
"backend": {
"version": 1,
"type": "QnnHtp",
"QnnHtp": {
"version": 1,
"use-mmap": true,
"spill-fill-bufsize": 0,
"mmap-budget": 0,
"poll": true,
"cpu-mask": "0xe0",
"kv-dim": 128,
"allow-async-init": false
},
"extensions": "<htp_backend_ext_path>"
},
"model": {
"version": 1,
"type": "binary",
"binary": {
"version": 1,
"ctx-bins": [
"<models_path>/llama_v3_2_3b_chat_quantized_part_1_of_3.bin",
"<models_path>/llama_v3_2_3b_chat_quantized_part_2_of_3.bin",
"<models_path>/llama_v3_2_3b_chat_quantized_part_3_of_3.bin"
]
},
"positional-encoding": {
"type": "rope",
"rope-dim": 64,
"rope-theta": 500000,
"rope-scaling": {
"rope-type": "llama3",
"factor": 8.0,
"low-freq-factor": 1.0,
"high-freq-factor": 4.0,
"original-max-position-embeddings": 8192
}
}
}
}
}
}
```
We resolved the issue by recompiling the model with the correct context length set. Is there a reason why we cannot have a longer context?
@marcusnagy So you resolved it? What did you originally set the context length to that made it fail? What I can tell you about context length is this:
We set the context length to 4096... changing. We understood that it was not enough to just rebuild the models; we also had to update the context length inside the config files. For some reason, though, that didn't work. That's why we ultimately figured we should just try to reproduce exactly the same setup that you have, and with that it ended up working. With the 4096 context length we got this issue of gibberish text.
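For anyone hitting the same wall: the context length is baked in when the model is exported, and "size" under "context" in genie_config.json must match it. A hypothetical re-export sketch (the flag and chipset names are assumptions; check --help on the export script in your qai_hub_models version):

```bash
# Hypothetical sketch: re-export the quantized model with the desired context
# length, then set "context.size" in genie_config.json to the same value.
# The --chipset and --context-length flag names are assumptions; verify them
# against your installed qai_hub_models release.
python -m qai_hub_models.models.llama_v3_2_3b_chat_quantized.export \
  --chipset qualcomm-snapdragon-8-elite \
  --context-length 4096
```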
I followed the guidelines to build the ChatApp with Llama-v3-2-3B-Chat Quantized. The QNN version I used is 2.28.2.
I successfully ran the ChatApp on my Android device (OnePlus 13 with Snapdragon 8 Elite).
However, while chatting with the app, it always returns garbled text like the following.
Does anyone have any idea about this problem?