Using Exllama backend requires all the modules to be on GPU - how?

I'm sorry I am unable to find relevant doc on Internet on how to load all modules on GPU.

I got this error message from my code:

```
Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU.You can deactivate exllama backend by setting disable_exllama=True in the quantization config object
```

A snippet from my code (to make it work, I had to uncomment the `config` part, but it won't be using Exllama)

```python
    MODEL_ID = "TheBloke/Llama-2-13b-Chat-GPTQ"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # config = AutoConfig.from_pretrained(MODEL_ID)
    # config.quantization_config["disable_exllama"] = True

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        # config=config,
    )
```

Any help is greatly appreciated!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Using Exllama backend requires all the modules to be on GPU - how? #306

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

Using Exllama backend requires all the modules to be on GPU - how? #306

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions