Android App Llama-v3-2-3B-Chat Quantized returns garbled text #34

Open

zhehui-chen opened this issue Dec 30, 2024 · 15 comments

@zhehui-chen

I followed the guidelines to build the ChatApp with Llama-v3-2-3B-Chat Quantized. The QNN version I used is 2.28.2.

I can successfully run the ChatApp on my Android device (OnePlus 13 with Snapdragon 8 Elite).

However, while chatting with the app, it always returns garbled text like the following.

[screenshot of garbled chat output]

Does anyone have any idea about this problem?

@franklyd

franklyd commented Jan 3, 2025

I have faced a similar issue. I found the generation quality degraded a lot compared to running it with ONNX Runtime in 4-bit.

@gustavla

gustavla commented Jan 7, 2025

Hi @zhehui-chen and @franklyd,

Sorry to hear you are seeing poor results through the app.

We know that the app has issues on some consumer devices (especially on Android 14 or earlier). There are two underlying features that need to be present in the Android "metabuild", which is why the app README says it only works on Android 15. If you can provide us with exactly what device, what Android version, and ideally the exact Android build (it should be in the settings), that would be really helpful so that we can investigate further why it's not working, especially if it's on Android 15, where we expect it to work. Thanks!

@franklyd

Thanks! Actually, I was running genie-t2t-run directly on an Android device (Snapdragon 8 Gen 3).
The instruction for the LLM was to generate some JSON output, but I observed much poorer quality compared to ONNX q4; in particular, it cannot follow the instruction to output JSON.
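
For reference, the invocation looked roughly like this (following the Genie bundle instructions; the prompt uses the Llama 3 chat template, and the instruction text is a placeholder):

./genie-t2t-run -c genie_config.json -p "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<instruction asking for JSON output><|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"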

@Zant12

Zant12 commented Feb 16, 2025

@zhehui-chen

This is the command I used:

python -m qai_hub_models.models.llama_v3_2_3b_chat_quantized.export  --chipset qualcomm-snapdragon-8-elite --context-length 2048 --output-dir genie_bundle  --skip-inferencing --skip-profiling 

In genie_config.json, make sure "size": 2048 (see the fragment at the end of this comment).

Note on model creation: as far as I understand, the conversion method results in a fixed context length, which cannot be changed afterwards.

There is also a note here that the latest version fixes the context-length issues; the older version I tested had this problem.

Device Requirements: "At least Genie SDK from QNN SDK 2.29.0 (earlier versions have issues with long prompts)."
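
For clarity, the surrounding context block in genie_config.json looks like this; "size" here must match the --context-length passed to the export script:

{
    "context": {
        "version": 1,
        "size": 2048,
        "n-vocab": 128256,
        "bos-token": -1,
        "eos-token": [128001, 128009, 128008]
    }
}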

@gustavla

@franklyd The Android 15 requirement affects all uses of the Genie SDK, whether through genie-t2t-run or the app. Which version of Android are you on (and what model is your phone)? The model should produce sensible results, so if it is not, it is either the metabuild requirement or some other issue we have yet to discover.

@Zant12 Are you also seeing garbled responses? If so, what is your Android OS version and device?

@franklyd

Thanks! Previously I tested on Android 14, on the OnePlus Pad (Snapdragon 8 Gen 3). Now I have a Xiaomi 15 phone (with Android 15), so I can test on that.

@Zant12

Zant12 commented Feb 18, 2025

@gustavla I have recently retested the app on version 2.31.0.250130; the version before was 2.28.

I do still see an issue on a second submission.

To reproduce:

  • Submit the transcript of the Gettysburg Address and ask for a summary. There are no issues; you can have multi-turn chats after the summary.
  • Submit the transcript again as your second message, still within the context window. This will fail to produce a response and corrupts the context from here onwards, which produces the gibberish effect.

Model verified: Llama 3.2 3B
Device name: Oneplus 13
OS: ColorOS 15.0
Build version: PJZ110_15.0.0.210(CN01)

I believe this app caches the previous responses, which explains the additional complexity?

@gustavla

@Zant12 Thank you for the detailed description. Given that you do get OK results on the first prompt, it does suggest that there might be a bug in the app. Have you tried to reproduce this through genie-t2t-run? You would need to manually construct the prompts to build up the chat history.

We will try to find time to see if we can reproduce this as well.
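
Roughly, rebuilding the history by hand looks like this (a sketch using the Llama 3 chat template; the bracketed messages are placeholders):

./genie-t2t-run -c genie_config.json -p "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<first user message><|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n<first model response><|eot_id|><|start_header_id|>user<|end_header_id|>\n\n<second user message><|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"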

@Zant12

Zant12 commented Feb 18, 2025

@gustavla Actually, after more patient testing, it will return a correct response in both the CLI and the app, but I never saw it because prompt processing for the second response takes minutes instead of the few seconds you normally see.
If you submit a new query while it is still processing (which I did, since I thought it had returned an empty response), you will get the gibberish response.

*Sorry, it was not the Gettysburg Address (which is shorter); I tested with the JFK inaugural address. It should still fit into the 2k context window.

@mjnong

mjnong commented Feb 19, 2025

I'm having the same issue running the app on Snapdragon 8 Elite, Samsung S25 Ultra.
I only get gibberish on a request, and any second attempt at sending a message crashes the application.

@gustavla

@Zant12 Thanks for clarifying. We will try to reproduce this on our end.

@marcusnagy Sorry to hear that. Can you give us some more information about your device? For instance Android OS and whatever additional information that you can provide?

@mjnong

mjnong commented Feb 20, 2025

@gustavla
Phone specs
Name: Samsung Galaxy S25 Ultra
Model Name: SM-S938B/DS
Android version: 15

genie_config

{
    "dialog": {
        "version": 1,
        "type": "basic",
        "context": {
            "version": 1,
            "size": 2048,
            "n-vocab": 128256,
            "bos-token": -1,
            "eos-token": [128001, 128009, 128008]
        },
        "sampler": {
            "version": 1,
            "seed": 42,
            "temp": 0.8,
            "top-k": 40,
            "top-p": 0.95
        },
        "tokenizer": {
            "version": 1,
            "path": "<tokenizer_path>"
        },
        "engine": {
            "version": 1,
            "n-threads": 3,
            "backend": {
                "version": 1,
                "type": "QnnHtp",
                "QnnHtp": {
                    "version": 1,
                    "use-mmap": true,
                    "spill-fill-bufsize": 0,
                    "mmap-budget": 0,
                    "poll": true,
                    "cpu-mask": "0xe0",
                    "kv-dim": 128,
                    "allow-async-init": false
                },
                "extensions": "<htp_backend_ext_path>"
            },
            "model": {
                "version": 1,
                "type": "binary",
                "binary": {
                    "version": 1,
                    "ctx-bins": [
                        "<models_path>/llama_v3_2_3b_chat_quantized_part_1_of_3.bin",
                        "<models_path>/llama_v3_2_3b_chat_quantized_part_2_of_3.bin",
                        "<models_path>/llama_v3_2_3b_chat_quantized_part_3_of_3.bin"
                    ]
                },
                "positional-encoding": {
                    "type": "rope",
                    "rope-dim": 64,
                    "rope-theta": 500000,
                    "rope-scaling": {
                        "rope-type": "llama3",
                        "factor": 8.0,
                        "low-freq-factor": 1.0,
                        "high-freq-factor": 4.0,
                        "original-max-position-embeddings": 8192
                    }
                }
            }
        }
    }
}

@mjnong

mjnong commented Feb 20, 2025

We resolved the issue by recompiling the model with the correct context length set. Is there a reason why we cannot have a longer context?

@gustavla

@marcusnagy So you resolved it? What did you originally set the context length to that made it fail?

What I can tell you about context length is this:

  • Context lengths longer than 4096 are not tested and may not work due to memory limitations. With an S25, you should be able to use 4096 for all the models that we have.
  • You have to match compile-time and runtime context lengths, so you cannot change the context length just in the Genie configuration. You have to go back to the export script and change the --context-length parameter. I'm not sure what kind of error Genie will throw if they are mismatched.

@mjnong

mjnong commented Feb 25, 2025

We set the context length to 4096. We understood that it was not enough to just rebuild the models; we also had to update the context length inside the config files. For some reason, though, that didn't work. That's why we ultimately decided to just reproduce exactly the same setup that you have, and with that it ended up working. With the 4096 context length we got this issue of gibberish text.
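
For anyone hitting the same thing, the combination that worked for us was roughly the following (chipset and paths as in the export command earlier in this thread):

# Re-export so the compile-time context length matches the runtime config:
python -m qai_hub_models.models.llama_v3_2_3b_chat_quantized.export \
    --chipset qualcomm-snapdragon-8-elite \
    --context-length 2048 \
    --output-dir genie_bundle \
    --skip-inferencing --skip-profiling

# ...and keep genie_config.json in sync:
#     "context": { "size": 2048, ... }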
