
Using Exllama backend requires all the modules to be on GPU - how? #306

Open
@tigerinus


Sorry, I haven't been able to find any documentation online explaining how to load all modules on the GPU.

I got this error message from my code:

Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU. You can deactivate exllama backend by setting disable_exllama=True in the quantization config object

Here is a snippet from my code. (To make it load successfully, I had to uncomment the config part, but then it no longer uses Exllama.)

    from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "TheBloke/Llama-2-13b-Chat-GPTQ"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # Workaround: uncommenting these lines disables the Exllama backend,
    # which makes loading succeed but gives up the Exllama kernels.
    # config = AutoConfig.from_pretrained(MODEL_ID)
    # config.quantization_config["disable_exllama"] = True

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        # config=config,
    )
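
For reference, here is a minimal sketch of what I think is being asked for. This is only a guess, assuming a single CUDA GPU with enough VRAM for the 13B GPTQ weights: passing device_map={"": 0} so that every module is placed on GPU 0 and nothing gets offloaded to cpu/disk.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "TheBloke/Llama-2-13b-Chat-GPTQ"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # Place every module on GPU 0. device_map="auto" could still spill
    # layers to cpu/disk if VRAM runs short, which retriggers the error.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        device_map={"": 0},
    )

Is this the intended way to do it, or is there a dedicated option for the Exllama backend?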

Any help is greatly appreciated!
