Description
When I run the updated version of the script examples/multimodal_vision/qwen_2_5_vl_example.py with the same model and version as before, I hit an error about linear kernel implementations (full output under Errors below).
In the older version of the codebase, I was importing the model as:
```python
from llmcompressor.transformers.tracing import TraceableQwen2_5_VLForConditionalGeneration
```
However, in the new version the import has changed to:
```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
```
Running the same model and version with this new import fails with the kernel-selection error shown under Errors.
Is there any way to access the old import that worked previously?
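For reference, here is a minimal sketch of the new-style flow I am running, condensed from the example script; the model id, recipe arguments, and ignore list reflect my reading of the script and may differ slightly between versions:

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"  # assumption: the model id used by the example

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# W4A16 GPTQ recipe as in the example script: 4-bit grouped weights
# (group_size=128 by default), skipping the LM head and the vision tower.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head", "re:visual.*"],
)

# Calibration dataset/collator arguments elided; they are unchanged
# from the stock example.
oneshot(model=model, recipe=recipe)
```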
Environment
- OS: Google Colab
- Python: 3.11.13
- llmcompressor: 0.5.3
- torch: 2.6.0
Errors
```
ValueError: Failed to find a kernel that can implement the WNA16 linear layer. Reasons:
- MacheteLinearKernel requires capability 90, current compute capability is 80
- AllSparkLinearKernel cannot implement due to: For Ampere GPU, AllSpark does not support group_size = 128. Only group_size = -1 are supported.
- MarlinLinearKernel cannot implement due to: Weight output_size_per_partition = 3420 is not divisible by min_thread_n = 64. Consider reducing tensor_parallel_size or running with --quantization gptq.
- BitBLASLinearKernel cannot implement due to: bitblas is not installed. Please install bitblas by running pip install bitblas>=0.1.0
- ExllamaLinearKernel cannot implement due to: Output features must be a multiple of the pack factor (32 / num_bits) so that we can correctly pack the zero points
```
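In case it helps triage: the AllSpark message says that on Ampere (compute capability 80) only channel-wise weights (group_size = -1) are supported, and the Marlin failure points at a 3420-wide layer, which appears to match a vision-tower MLP dimension in Qwen2.5-VL. Below is a minimal sketch of a channel-wise recipe I am considering as a workaround, assuming the compressed-tensors QuantizationScheme/QuantizationArgs config-group API; argument names are unverified against my installed versions:

```python
from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme
from llmcompressor.modifiers.quantization import GPTQModifier

# Channel-wise 4-bit weights (the equivalent of group_size = -1), which
# AllSpark reports as the only supported layout on Ampere GPUs.
recipe = GPTQModifier(
    config_groups={
        "group_0": QuantizationScheme(
            targets=["Linear"],
            weights=QuantizationArgs(
                num_bits=4,
                type="int",
                symmetric=True,
                strategy="channel",  # per-output-channel scales, i.e. group_size = -1
            ),
        )
    },
    # Keeping the vision tower and lm_head unquantized should also avoid the
    # Marlin divisibility failure on odd-sized layers such as 3420.
    ignore=["lm_head", "re:visual.*"],
)
```

I have not yet confirmed that this restores a working kernel on the Colab GPU; it is only what the error messages above point toward.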