Android App Llama-v3-2-3B-Chat Quantized returns garbled text #34

Open

zhehui-chen opened this issue Dec 30, 2024 · 15 comments

@zhehui-chen

I followed the guidelines to build the ChatApp with Llama-v3-2-3B-Chat Quantized. The QNN version I used is 2.28.2.

I can successfully run the ChatApp on my Android device (OnePlus 13 with Snapdragon 8 Elite).

However, while chatting with the app, it always returns garbled text like the following.

[screenshot of garbled chat output]

Does anyone have any idea about this problem?

@franklyd

franklyd commented Jan 3, 2025

I have faced a similar issue. I found the generation quality degraded a lot compared to running it with ONNX Runtime in 4-bit.

@gustavla

gustavla commented Jan 7, 2025

Hi @zhehui-chen and @franklyd,

Sorry to hear you are seeing poor results through the app.

We know that the app has issues on some consumer devices (especially on Android 14 or earlier). There are two underlying features that need to be present in the Android "metabuild", which is why the app README says it only works on Android 15. If you can provide us with exactly what device, what Android version, and ideally the exact Android build (it should be in the settings), that would be really helpful so that we can investigate further why it's not working, especially if it's on Android 15, where we expect it to work. Thanks!

@franklyd

Thanks! Actually, I was running genie-t2t-run directly on an Android device (Snapdragon 8 Gen 3).
The instruction for the LLM was to generate some JSON output, but I observed much poorer quality compared to ONNX q4; in particular, it cannot follow the instruction to output JSON.
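
For reference, the invocation looked roughly like this (following the Genie bundle instructions; the prompt uses the Llama 3 chat template, and the instruction text is a placeholder):

./genie-t2t-run -c genie_config.json -p "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<instruction asking for JSON output><|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"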

@Zant12

Zant12 commented Feb 16, 2025

@zhehui-chen

This is the command I used:

python -m qai_hub_models.models.llama_v3_2_3b_chat_quantized.export  --chipset qualcomm-snapdragon-8-elite --context-length 2048 --output-dir genie_bundle  --skip-inferencing --skip-profiling 

In genie_config.json, make sure "size": 2048 (see the fragment at the end of this comment).

Note on model creation: as far as I understand, the conversion method results in a fixed context length, which cannot be changed afterwards.

There is also a note here that the latest version fixes the context-length issues; the older version I tested had this problem.

Device Requirements: "At least Genie SDK from QNN SDK 2.29.0 (earlier versions have issues with long prompts)."
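
For clarity, the surrounding context block in genie_config.json looks like this; "size" here must match the --context-length passed to the export script:

{
    "context": {
        "version": 1,
        "size": 2048,
        "n-vocab": 128256,
        "bos-token": -1,
        "eos-token": [128001, 128009, 128008]
    }
}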

@gustavla

@franklyd The Android 15 requirement affects all uses of the Genie SDK, whether through genie-t2t-run or the app. Which version of Android are you on (and what model is your phone)? The model should produce sensible results, so if it is not, it is either the metabuild requirement or some other issue we have yet to discover.

@Zant12 Are you also seeing garbled responses? If so, what is your Android OS version and device?

@franklyd

Thanks! Previously I tested on Android 14, on the OnePlus Pad (Snapdragon 8 Gen 3). Now I have a Xiaomi 15 phone (with Android 15), so I can test on that.

@Zant12

Zant12 commented Feb 18, 2025

@gustavla I have recently retested the app on version 2.31.0.250130; the version before was 2.28.

I do still see an issue on a second submission.

To reproduce:

  • Submit the transcript of the Gettysburg Address and ask for a summary. There are no issues; you can have multi-turn chats after the summary.
  • Submit the transcript again as your second message, still within the context window. This will fail to produce a response and corrupts the context from here onwards, which produces the gibberish effect.

Model verified: Llama 3.2 3B
Device name: Oneplus 13
OS: ColorOS 15.0
Build version: PJZ110_15.0.0.210(CN01)

I believe this app caches the previous responses, which explains the additional complexity?

@gustavla

@Zant12 Thank you for the detailed description. Given that you do get OK results on the first prompt, it does suggest that there might be a bug in the app. Have you tried to reproduce this through genie-t2t-run? You would need to manually construct the prompts to build up the chat history.

We will try to find time to see if we can reproduce this as well.
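
Roughly, rebuilding the history by hand looks like this (a sketch using the Llama 3 chat template; the bracketed messages are placeholders):

./genie-t2t-run -c genie_config.json -p "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<first user message><|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n<first model response><|eot_id|><|start_header_id|>user<|end_header_id|>\n\n<second user message><|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"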

@Zant12

Zant12 commented Feb 18, 2025

@gustavla Actually, after more patient testing, it will return a correct response in both the CLI and the app, but I never saw it because prompt processing for the second response takes minutes instead of the few seconds you normally see.
If you submit a new query while it is still processing (which I did, since I thought it had returned an empty response), you will get the gibberish response.

*Sorry, it was not the Gettysburg Address (which is shorter); I tested with the JFK inaugural address. It should still fit into the 2k context window.

@mjnong

mjnong commented Feb 19, 2025

I'm having the same issue running the app on Snapdragon 8 Elite, Samsung S25 Ultra.
I only get gibberish on a request, and any second attempt at sending a message crashes the application.

@gustavla

@Zant12 Thanks for clarifying. We will try to reproduce this on our end.

@marcusnagy Sorry to hear that. Can you give us some more information about your device? For instance Android OS and whatever additional information that you can provide?

@mjnong

mjnong commented Feb 20, 2025

@gustavla
Phone specs
Name: Samsung Galaxy S25 Ultra
Model Name: SM-S938B/DS
Android version: 15

genie_config

{
    "dialog": {
        "version": 1,
        "type": "basic",
        "context": {
            "version": 1,
            "size": 2048,
            "n-vocab": 128256,
            "bos-token": -1,
            "eos-token": [128001, 128009, 128008]
        },
        "sampler": {
            "version": 1,
            "seed": 42,
            "temp": 0.8,
            "top-k": 40,
            "top-p": 0.95
        },
        "tokenizer": {
            "version": 1,
            "path": "<tokenizer_path>"
        },
        "engine": {
            "version": 1,
            "n-threads": 3,
            "backend": {
                "version": 1,
                "type": "QnnHtp",
                "QnnHtp": {
                    "version": 1,
                    "use-mmap": true,
                    "spill-fill-bufsize": 0,
                    "mmap-budget": 0,
                    "poll": true,
                    "cpu-mask": "0xe0",
                    "kv-dim": 128,
                    "allow-async-init": false
                },
                "extensions": "<htp_backend_ext_path>"
            },
            "model": {
                "version": 1,
                "type": "binary",
                "binary": {
                    "version": 1,
                    "ctx-bins": [
                        "<models_path>/llama_v3_2_3b_chat_quantized_part_1_of_3.bin",
                        "<models_path>/llama_v3_2_3b_chat_quantized_part_2_of_3.bin",
                        "<models_path>/llama_v3_2_3b_chat_quantized_part_3_of_3.bin"
                    ]
                },
                "positional-encoding": {
                    "type": "rope",
                    "rope-dim": 64,
                    "rope-theta": 500000,
                    "rope-scaling": {
                        "rope-type": "llama3",
                        "factor": 8.0,
                        "low-freq-factor": 1.0,
                        "high-freq-factor": 4.0,
                        "original-max-position-embeddings": 8192
                    }
                }
            }
        }
    }
}

@mjnong

mjnong commented Feb 20, 2025

We resolved the issue by recompiling the model with the correct context length set. Is there a reason why we cannot have a longer context?

@gustavla

@marcusnagy So you resolved it? What did you originally set the context length to that made it fail?

What I can tell you about context length is this:

  • Context lengths longer than 4096 are not tested and may not work due to memory limitations. With an S25, you should be able to use 4096 for all the models that we have.
  • You have to match compile-time and runtime context lengths, so you cannot change the context length just in the Genie configuration. You have to go back to the export script and change the --context-length parameter. I'm not sure what kind of error Genie will throw if they are mismatched.

@mjnong

mjnong commented Feb 25, 2025

We set the context length to 4096. We understood that it was not enough to just rebuild the models; we also had to update the context length inside the config files. For some reason, though, that didn't work. That's why we ultimately decided to just reproduce exactly the same setup that you have, and with that it ended up working. With the 4096 context length we got this issue of gibberish text.
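
For anyone hitting the same thing, the combination that worked for us was roughly the following (chipset and paths as in the export command earlier in this thread):

# Re-export so the compile-time context length matches the runtime config:
python -m qai_hub_models.models.llama_v3_2_3b_chat_quantized.export \
    --chipset qualcomm-snapdragon-8-elite \
    --context-length 2048 \
    --output-dir genie_bundle \
    --skip-inferencing --skip-profiling

# ...and keep genie_config.json in sync:
#     "context": { "size": 2048, ... }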
