
Add GLM4.1V model (Draft) #19331


Open

wants to merge 18 commits into base: main

Conversation

Contributor

@zRzRzRzRzRzRzR commented Jun 8, 2025

This PR adds support for the GLM-4.1V model.
Because the implementation in the transformers library has been upgraded, some interfaces have changed.
This model requires the video's FPS and duration for frame sampling.
The related transformers PR is here.


github-actions bot commented Jun 8, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist bot left a comment

Hello @zRzRzRzRzRzRzR, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, gemini-code-assist here with a summary of this pull request. This PR introduces support for the GLM-4.1V multimodal model. The primary goal is to integrate this new model into vLLM, enabling inference with both image and video inputs. This required adapting the model architecture, multimodal processing logic, and specific components like positional embeddings from the upstream transformers library implementation. Key changes include adding the model definition, implementing its vision tower and multimodal processing, adding a dedicated video loader with frame sampling capabilities, and updating core vLLM components like rotary embeddings and chat utilities to handle the model's specific requirements and token formats.

Highlights

  • New Model Support: Adds the Glm4vForConditionalGeneration model to support the GLM-4.1V multimodal model.
  • Multimodal Processing: Implements the GLM-4.1V specific vision transformer, patch merger, and multimodal processing logic to handle image and video inputs.
  • Video Loading: Introduces a dedicated video loader (Glm4vVideoLoader) for GLM-4.1V that handles frame sampling and metadata extraction.
  • Positional Encoding: Adds GLM-4.1V specific logic to the rotary embedding layer to correctly calculate positions for multimodal inputs.
  • Chat Utilities: Updates chat utility functions to recognize and handle GLM-4.1V's unique image and video placeholder tokens.
  • Examples and Registry: Adds an example for offline inference with GLM-4.1V and registers the new model within vLLM.

Changelog

Click here to see the changelog
  • examples/offline_inference/vision_language.py
    • Added run_glm4_1v function to demonstrate offline inference for GLM-4.1V, including setting up engine arguments and prompt formatting for image and video.
  • tests/models/registry.py
    • Added Glm4vForConditionalGeneration to the check_available_online registry, mapping it to the Hugging Face model name 'THUDM/GLM-4.1V-9B'.
  • vllm/entrypoints/chat_utils.py
    • Added handling for the glm4v model type in _placeholder_str to use the correct image and video placeholder tokens (<|begin_of_image|><|image|><|end_of_image|> and <|begin_of_video|><|video|><|end_of_video|>); see the sketch after this changelog.
  • vllm/model_executor/layers/rotary_embedding.py
    • Imported itertools, List, Tuple.
    • Added _glm4v_get_input_positions_tensor class method to calculate rotary positional embeddings specifically for GLM-4.1V, considering image and video grid structures.
    • Integrated the call to _glm4v_get_input_positions_tensor within get_input_positions_tensor for glm4v model types.
  • vllm/model_executor/models/glm4_1v.py
    • Added new file implementing the Glm4vForConditionalGeneration model.
    • Defined input types for image and video pixel values and embeddings.
    • Implemented GLM-4.1V specific vision encoder components (Glm4vVisionMLP, Glm4vVisionAttention, Glm4vVisionPatchEmbed, Glm4vPatchMerger, Glm4vVisionEmbeddings, Glm4vVisionRotaryEmbedding, Glm4vVisionTransformer).
    • Implemented multimodal processing logic (Glm4vProcessingInfo, Glm4vDummyInputsBuilder, Glm4vMultiModalProcessor) including handling image resizing, patch merging, and prompt updates.
    • Integrated the vision tower and language model backbone.
    • Added methods for parsing multimodal inputs, processing them into embeddings, and merging them with text embeddings.
    • Implemented weight loading logic specific to GLM-4.1V.
    • Registered the model and its processor with the vLLM multimodal registry.
  • vllm/model_executor/models/registry.py
    • Added Glm4vForConditionalGeneration to the VLLM_REGISTERED_MODELS dictionary, linking it to the new glm4_1v module.
  • vllm/multimodal/video.py
    • Imported Dict and Tuple.
    • Added Glm4vVideoLoader registered under the 'glm4v' backend, using OpenCV to load video frames and metadata with support for uniform frame sampling.
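
As a quick illustration of the placeholder handling mentioned above, here is a minimal sketch built only from the token strings listed in the changelog; the helper name is hypothetical, and the real _placeholder_str in vllm/entrypoints/chat_utils.py has a different signature and more cases.

def glm4v_placeholder(modality: str) -> str:
    # Hypothetical standalone helper mirroring the glm4v branch described
    # above; not the actual _placeholder_str signature.
    if modality == "image":
        return "<|begin_of_image|><|image|><|end_of_image|>"
    if modality == "video":
        return "<|begin_of_video|><|video|><|end_of_video|>"
    raise ValueError(f"Unsupported modality for glm4v: {modality}")
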
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@mergify bot added the documentation (Improvements or additions to documentation), frontend, and multi-modality (Related to multi-modality, #4194) labels on Jun 8, 2025
Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the GLM-4.1V model, which is a valuable addition. The implementation involves adding a new model file, updating the rotary embedding logic for multimodal inputs, and integrating the model into the examples and registry. Overall, the structure of the changes looks good, following the pattern for adding new multimodal models. However, I've identified a few areas, particularly in the multimodal position embedding and video handling, that require attention to ensure correctness and consistency.

Summary of Findings

  • Incorrect handling of video position IDs and grid dimensions: The logic for calculating 3D position IDs for video tokens in _glm4v_get_input_positions_tensor appears to incorrectly use image grid dimensions for video height and width, and the temporal counter logic seems flawed for videos with multiple tokens.
  • Inconsistent video metadata: The _parse_and_validate_video_input function hardcodes FPS and video backend in the metadata, which is inconsistent with the Glm4vVideoLoader that samples these values from the video file.
  • Type mismatch in vision embeddings forward pass: The Glm4vVisionTransformer.forward method passes batch sequence lengths (seqlens) to Glm4vVisionEmbeddings.forward, which expects sequence lengths per image item (lengths).
  • Potential issue in v0 multimodal embedding merging: The v0 compatibility method get_input_embeddings_v0 merges image and video embeddings sequentially, which might be incorrect for interleaved multimodal inputs compared to the simultaneous merging approach in the v1 method.
  • Strict assertion in rotary embedding: An assertion in _glm4v_get_input_positions_tensor requires text segments to have a length greater than 0, which might be too strict.
  • Hardcoded FPS in example script: The example script hardcodes FPS in the mm_processor_kwargs, which might conflict with the model's actual video processing logic.
  • Misleading limit_mm_per_prompt in example script: The example script sets an image-specific limit for a function designed to handle both image and video.

Merge Readiness

This pull request introduces support for a new model and includes significant changes to the multimodal infrastructure. While the overall structure is good, there are several high-severity issues related to the correctness of multimodal position encoding and video metadata handling that need to be addressed before merging. I am unable to approve this pull request, and others should review and approve this code before merging.

Comment on lines +1219 to +1222
                        video_frame_num,
                        image_grid_thw[mm_data_idx][1],
                        image_grid_thw[mm_data_idx][2],
                    )
Contributor

high

In the video processing block, the temporal dimension t is set to video_frame_num (line 1219), and the height h and width w are taken from image_grid_thw (lines 1220-1221). This seems incorrect for video. The height and width should likely come from video_grid_thw instead of image_grid_thw.

                    t, h, w = (
                        video_frame_num,
                        video_grid_thw[mm_data_idx][1],
                        video_grid_thw[mm_data_idx][2],
                    )

torch.stack([t_index, h_index, w_index]) + st_idx)

mm_data_idx += 1
video_frame_num += 1
Contributor

high

The video_frame_num is incremented by 1 after processing a video block. If a single video item consists of multiple 'video' tokens (e.g., representing different frames or chunks), this logic might be incorrect. The temporal dimension t for a video item should likely be derived from the video's properties (like number of frames or temporal patches) rather than a simple counter that increments per video block.
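
For reference, a toy sketch of how 3D rotary indices are usually built per video item in the Qwen2-VL-style scheme this code follows, where the temporal index is derived from the item's own (t, h, w) grid rather than a running counter. The grid values below are made up for illustration.

import torch

t, h, w = 4, 26, 46  # made-up (t, h, w) grid for one video item
t_index = torch.arange(t).view(-1, 1).expand(-1, h * w).flatten()
h_index = torch.arange(h).view(1, -1, 1).expand(t, -1, w).flatten()
w_index = torch.arange(w).view(1, 1, -1).expand(t, h, -1).flatten()
positions_3d = torch.stack([t_index, h_index, w_index])  # shape (3, t*h*w)
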

Comment on lines +1257 to +1258
mrope_position_delta = (llm_positions.max() + 1 -
                        len(input_tokens)).item()
Contributor

high

The calculation of mrope_position_delta as (llm_positions.max() + 1 - len(input_tokens)).item() seems specific to how GLM4V handles positions. Could you confirm if this delta calculation is correct and aligns with the model's expected positional encoding behavior, especially considering the 3D position IDs?
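
For context, a minimal sketch of how such a delta is typically consumed during decoding in M-RoPE-style schemes: positions for text-only continuation advance linearly, offset by the delta computed at prefill time. The helper name below is hypothetical, not vLLM's API.

import torch

def decode_positions(mrope_position_delta: int, context_len: int,
                     seq_len: int) -> torch.Tensor:
    # Each new token gets position (index + delta), broadcast to the three
    # rotary axes (t, h, w) used by M-RoPE.
    pos = torch.arange(context_len, seq_len) + mrope_position_delta
    return pos.unsqueeze(0).expand(3, -1)
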

Comment on lines +736 to +737
x = self.embeddings(x, seqlens, grid_thw, image_type_ids[:, 0],
                    image_type_ids[:, 1])
Contributor

high

In Glm4vVisionTransformer.forward, self.embeddings is called with seqlens as the second argument (lengths). However, Glm4vVisionEmbeddings.adapt_position_encoding expects lengths to be a list or tensor of sequence lengths for each image ([batch_size]). seqlens is a list of sequence lengths for the batch of tokens used in attention, which is derived from cu_seqlens. This seems like a type mismatch and potential correctness issue. The lengths argument for position encoding should likely be derived from grid_thw (which has shape (num_images, 3)).
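
To make the distinction concrete, a small illustration with assumed grid values: per-item lengths for position encoding can be derived from grid_thw, which is not the same object as the attention seqlens computed from cu_seqlens.

import torch

grid_thw = torch.tensor([[1, 26, 46],
                         [1, 26, 46]])   # (num_items, 3): t, h, w per item
per_item_lengths = grid_thw.prod(-1)     # tensor([1196, 1196]) tokens per item
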

Comment on lines +1195 to +1199
fps=2.0,
duration=num_frames / 2.0,
total_num_frames=num_frames,
video_backend="pyav",
))
Contributor

high

In _parse_and_validate_video_input, the video_metadata is constructed by hardcoding fps=2.0 and video_backend="pyav". However, the Glm4vVideoLoader samples the actual FPS and duration from the video file. This hardcoding seems inconsistent with the loader and might lead to incorrect metadata being passed if the actual video properties differ.
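
A sketch of the alternative being suggested, using the field names visible in the snippet above: build the metadata from whatever the loader actually sampled instead of constants. The dict keys are assumptions; VideoMetadata is the transformers class referenced elsewhere in this thread and requires a recent transformers build.

from transformers.video_utils import VideoMetadata  # recent transformers only

def metadata_from_loader(sampled: dict) -> VideoMetadata:
    # Use the measured values rather than hardcoded fps/backend.
    return VideoMetadata(
        fps=sampled["fps"],
        duration=sampled["duration"],
        total_num_frames=sampled["total_num_frames"],
        video_backend=sampled.get("video_backend", "opencv"),
    )
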

Collaborator

Do we really need to pass VideoMetadata into Glm4vVideoPixelInputs? It doesn't seem to be used in the model's forward pass.

"max_pixels": 1280 * 28 * 28,
"fps": 1,
},
limit_mm_per_prompt={"image": 1},
Contributor

medium

The limit_mm_per_prompt is set to {"image": 1} here. However, the run_glm4_1v function is designed to handle both image and video modalities. If video input is used with this function, this limit might be incorrect or misleading. Should this limit be configurable based on the modality used, or should it allow for both image and video if the model supports both?
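
One possible shape for the fix, sketched under the assumption that the example selects a single modality per run; the helper name is illustrative only.

def mm_limit_for(modality: str) -> dict[str, int]:
    # Limit whichever modality the example actually exercises, rather than
    # always limiting images.
    assert modality in ("image", "video")
    return {modality: 1}

# e.g. engine_args = dict(limit_mm_per_prompt=mm_limit_for("video"))
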

Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
@zRzRzRzRzRzRzR
Contributor Author

I looked at the part where the video is read in. The existing video reading path uses OpenCV but does not return the frame rate. Our model needs FPS as a sampling parameter, so I implemented Glm4vVideoLoader separately. I don't know whether this is a good approach, or whether you plan to modify OpenCVVideoBackend in a future update to return the other information transformers needs, which is covered by from transformers.video_utils import VideoMetadata in the latest transformers source.
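
For readers following along, here is a rough, self-contained sketch of the idea behind such a loader: sample frames uniformly with OpenCV and return the fps/duration the model needs. This is illustrative only and is not the PR's Glm4vVideoLoader.

import cv2
import numpy as np

def load_video_with_metadata(path: str, num_frames: int = 16):
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num=min(num_frames, total)).astype(int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()

    metadata = {
        "fps": fps,
        "duration": total / fps if fps else 0.0,
        "total_num_frames": total,
        "frame_indices": indices.tolist(),
    }
    return np.stack(frames), metadata
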

Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Collaborator

@Isotr0py left a comment

Added some initial comments, PTAL :)

Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
@@ -427,6 +427,9 @@ def _parse_video_data(
        if self._is_embeddings(data):
            return VideoEmbeddingItems(data)

        if isinstance(data, tuple) and len(data) == 2:
            frames, metadata = data
            data = frames
Member

@DarkLight1337 Jun 9, 2025

The metadata should be passed to VideoProcessorItems as well so it can be used by the HF processor (preferably via get_processor_data)

Contributor Author

Updated like this?

Member

Yes, but make sure the metadata is passed to the HF processor correctly as well.

Contributor Author

@zRzRzRzRzRzRzR Jun 9, 2025

For now it is working; I checked that the HF processor receives the metadata and handles it correctly. But there is an issue: this feature is only supported in the current transformers source code, which means it will land in the next transformers release. I contacted their team, and they said this feature is still being polished. I know that this class may not be supported by some older models.

Contributor Author

GLM-4.1V will also be supported in the next transformers release, and the transformers team is expected to land the processor improvements this week.

Member

I see. Perhaps we should then add a transformers version check to get_processor_data to maintain backward compatibility.
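
A minimal sketch of such a guard, assuming the check keys off the transformers version that ships VideoMetadata; the version threshold, helper name, and dict key are placeholders, not confirmed values.

import transformers
from packaging.version import Version

# Placeholder threshold: the release where transformers.video_utils.VideoMetadata lands.
_SUPPORTS_VIDEO_METADATA = Version(transformers.__version__) >= Version("4.53.0")

def maybe_attach_video_metadata(processor_data: dict, metadata) -> dict:
    # Only forward video metadata when the installed transformers understands it.
    if _SUPPORTS_VIDEO_METADATA and metadata is not None:
        processor_data["video_metadata"] = metadata  # assumed key
    return processor_data
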

Contributor Author

Yes, but I happened to run into their update mid-implementation. The GLM-4.1V adaptation targets the new transformers interface directly, so I am not sure what impact this change has on video-understanding models under earlier transformers versions, such as 4.52 and before.

Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
@zRzRzRzRzRzRzR
Contributor Author

zRzRzRzRzRzRzR commented Jun 9, 2025

I suggest merging this PR after the transformers team has confirmed their merge (expected this week), to ensure that the transformers main branch is stable.

@zRzRzRzRzRzRzR
Contributor Author

In the latest version of transformers, images and videos are processed separately, with dedicated image processors and video processors. Is there a related implementation in vLLM for reference?

In the current implementation, I've noticed that the video processor is never called. The GLM-4.1V model processes videos by sampling them into multiple image frames. However, I've found that only one frame is passed to vLLM each time, while transformers works normally. Do you know which files contain the video processing logic?

@DarkLight1337
Member

DarkLight1337 commented Jun 23, 2025

We use the parent processor that is constructed from AutoProcessor in vLLM. So if only the image processor is being called, I suggest inspecting the inputs to AutoProcessor.__call__ to figure out why.

@zRzRzRzRzRzRzR
Contributor Author

zRzRzRzRzRzRzR commented Jun 23, 2025

I observed that the HF video processor was called, but in the return from the HF processor:

        return BatchFeature(data=data, tensor_type=return_tensors)

__all__ = ["Glm4vVideoProcessor"]

both pixel_values_videos and video_grid_thw do not align with the kwargs expected in vLLM.
Below is the print output from transformers right before the return:

pixel_values_videos: tensor([[ 1.4632,  1.4632,  1.4632,  ...,  2.0890,  2.0890,  2.0890],
        [ 1.4778,  1.4778,  1.4778,  ...,  2.0606,  2.0606,  2.0606],
        [ 1.5070,  1.5070,  1.5070,  ...,  2.0890,  2.0890,  2.0890],
        ...,
        [ 0.6457,  0.6749,  0.6749,  ..., -0.6412, -0.6412, -0.6412],
        [ 0.6457,  0.6311,  0.6311,  ..., -0.7977, -0.7692, -0.7692],
        [ 0.7041,  0.7479,  0.7041,  ..., -0.6697, -0.6981, -0.7123]]) with shape torch.Size([19136, 1176])
        
video_grid_thw:[[1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46]]

But in vLLM, the output is printed as

tensor([[[ 1.4609,  1.4609,  1.4609,  ...,  2.0938,  2.0938,  2.0938],
         [ 1.4766,  1.4766,  1.4766,  ...,  2.0625,  2.0625,  2.0625],
         [ 1.5078,  1.5078,  1.5078,  ...,  2.0938,  2.0938,  2.0938],
         ...,
         [ 0.7930,  0.7930,  0.8359,  ...,  0.9648,  0.9648,  0.9531],
         [-0.1572,  0.0033,  0.2520,  ..., -0.4141, -0.3711, -0.4004],
         [ 0.3398,  0.3105,  0.2656,  ..., -0.5000, -0.5000, -0.3574]]], with shape torch.Size([299, 1176])
       device='cuda:0', dtype=torch.bfloat16)
tensor([[[ 1, 26, 46]]], device='cuda:0')

@DarkLight1337
Member

DarkLight1337 commented Jun 23, 2025

Are the inputs to __call__ the same?

@zRzRzRzRzRzRzR
Contributor Author

They are the same. I printed inside:

class Glm4vMultiModalProcessor(BaseMultiModalProcessor[Glm4vProcessingInfo]):

    def _call_hf_processor(

The return result is the same as HF's.

@DarkLight1337
Member

Can you try to disable the multimodal preprocessing cache and see if the results are still different? Perhaps HF does some padding based on the other inputs in the same batch

@zRzRzRzRzRzRzR
Contributor Author

How do I do that? Also, the batch size is currently set to 1. GLM's video grid is different from Qwen's: it is a list containing multiple image grids.
Additionally, I suspect the issue is not in glm4_1v.py itself, but in functions it inherits that are not explicitly written in this file.

@DarkLight1337
Member

You can set --disable-mm-preprocessor-cache
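
For the offline API, the equivalent of that flag is an engine argument; a minimal sketch, assuming the LLM constructor forwards it to the engine and using the model name from the registry entry above.

from vllm import LLM

llm = LLM(
    model="THUDM/GLM-4.1V-9B",           # registry entry used in this PR
    disable_mm_preprocessor_cache=True,  # offline equivalent of the CLI flag
)
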

@zRzRzRzRzRzRzR
Contributor Author

It's still different.

Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
@zRzRzRzRzRzRzR
Contributor Author

https://github.com/huggingface/transformers/pull/38431
This PR has been merged. Could you please help take a look at the preprocessing part? Only the video preprocessing is not aligned: each time only one frame of the video is passed in. I really couldn't find the issue.

@zRzRzRzRzRzRzR
Contributor Author

I can provide the model; _call_hf_processor in Glm4vMultiModalProcessor returns the correct sample.

Comment on lines +1048 to +1053
    def _get_mm_fields_config(
        self,
        hf_inputs: BatchFeature,
        hf_processor_mm_kwargs: Mapping[str, object],
    ) -> Mapping[str, MultiModalFieldConfig]:
        return _qwen2vl_field_config(hf_inputs)
Collaborator

It seems we're reusing qwen2_vl's mm_fields_config here; perhaps this is the cause of the incorrect video_grid_thw shape?

def _qwen2vl_field_config(hf_inputs: Mapping[str, torch.Tensor]):
    image_grid_thw = hf_inputs.get("image_grid_thw", torch.empty((0, 3)))
    image_grid_sizes = image_grid_thw.prod(-1)

    video_grid_thw = hf_inputs.get("video_grid_thw", torch.empty((0, 3)))
    video_grid_sizes = video_grid_thw.prod(-1)

    return dict(
        pixel_values=MultiModalFieldConfig.flat_from_sizes(
            "image", image_grid_sizes),
        image_embeds=MultiModalFieldConfig.flat_from_sizes(
            "image", image_grid_sizes),
        image_grid_thw=MultiModalFieldConfig.batched("image"),
        pixel_values_videos=MultiModalFieldConfig.flat_from_sizes(
            "video", video_grid_sizes),
        video_embeds=MultiModalFieldConfig.flat_from_sizes(
            "video", video_grid_sizes),
        video_grid_thw=MultiModalFieldConfig.batched("video"),
    )

Contributor Author

I checked here and printed video_grid_thw; it is correct.

Contributor Author

@zRzRzRzRzRzRzR Jun 25, 2025

This is its correct shape:

tensor([[ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46]])

and I printed it here:

def _qwen2vl_field_config(hf_inputs: Mapping[str, torch.Tensor]):
    image_grid_thw = hf_inputs.get("image_grid_thw", torch.empty((0, 3)))
    image_grid_sizes = image_grid_thw.prod(-1)

    video_grid_thw = hf_inputs.get("video_grid_thw", torch.empty((0, 3)))
    print(video_grid_thw)
    video_grid_sizes = video_grid_thw.prod(-1)

It is correct there.

But when I print the kwargs in _parse_and_validate_video_input, it turns into:

[[[ 1, 26, 46]]]

Collaborator

@Isotr0py Jun 25, 2025

Yeah, hf_inputs should be fine here because it's exactly the output of _call_hf_processor.

I suspect that mm_field_config (the returned dictionary) may be configured incorrectly here. Perhaps you can take a look at MultiModalKwargs.from_hf_inputs in the processor:

        mm_kwargs = MultiModalKwargs.from_hf_inputs(
            processed_data,
            self._get_mm_fields_config(processed_data, hf_processor_mm_kwargs),
        )

I guess video_grid_thw in mm_kwargs will have an incorrect shape starting from there.

Contributor Author

mm_kwargs = MultiModalKwargs.from_hf_inputs(
            processed_data,
            self._get_mm_fields_config(processed_data, hf_processor_mm_kwargs),
        )
print(mm_kwargs)

It's still correct.

@zRzRzRzRzRzRzR
Contributor Author

The issue doesn't seem to be in processing.py; I have printed in many places, and they all show the complete grids.

@Isotr0py
Collaborator

I see, so the multimodal processor implementation should be fine, and the issue can only arise after the processor call. Are the input **kwargs of _parse_and_validate_video_input still correct at that step?

@zRzRzRzRzRzRzR
Contributor Author

zRzRzRzRzRzRzR commented Jun 26, 2025

I see, so the multimodal processor implementation should be fine, and the issue can only arise after the processor call. Are the input **kwargs of _parse_and_validate_video_input still correct at that step?

No, the **kwargs of _parse_and_validate_video_input are still wrong, but I found that the length of the video input is 1. Might this cause issues?

In the transformers processing, I first use a special placeholder token for the video and then split the video into individual frames, so each video corresponds to a list of images and a grid.

@Isotr0py
Collaborator

Hmmm, the only remaining possible cause I can think of now is v1's mm_encoder, which is unlikely to cause issues...

    def _execute_mm_encoder(self, scheduler_output: "SchedulerOutput"):
        scheduled_encoder_inputs = scheduler_output.scheduled_encoder_inputs
        if not scheduled_encoder_inputs:
            return

        # Batch the multi-modal inputs.
        mm_inputs = list[MultiModalKwargs]()
        req_ids_pos = list[tuple[str, int, PlaceholderRange]]()
        for req_id, encoder_input_ids in scheduled_encoder_inputs.items():
            req_state = self.requests[req_id]
            for mm_input_id in encoder_input_ids:
                mm_inputs.append(req_state.mm_inputs[mm_input_id])
                req_ids_pos.append(
                    (req_id, mm_input_id, req_state.mm_positions[mm_input_id]))

        # Batch mm inputs as much as we can: if a request in the batch has
        # multiple modalities or a different modality than the previous one,
        # we process it separately to preserve item order.
        # FIXME(ywang96): This is a hacky way to deal with multiple modalities
        # in the same batch while still being able to benefit from batching
        # multimodal inputs. The proper solution should be reordering the
        # encoder outputs.
        grouped_mm_inputs_list = group_mm_inputs_by_modality(mm_inputs)

        encoder_outputs = []
        for grouped_mm_inputs in grouped_mm_inputs_list:
            batched_mm_inputs = MultiModalKwargs.batch(
                grouped_mm_inputs, pin_memory=self.pin_memory)
            batched_mm_inputs = MultiModalKwargs.as_kwargs(
                batched_mm_inputs,
                device=self.device,
            )

BTW, have you checked whether V0 also has this issue? If so, the issue is unlikely to come from the mm_encoder.

@zRzRzRzRzRzRzR
Contributor Author

I printed

        for grouped_mm_inputs in grouped_mm_inputs_list:
            print(grouped_mm_inputs)

The grouped_mm_inputs here is already incorrect.

@zRzRzRzRzRzRzR
Contributor Author

I still haven't found the issue. I can provide the weights (via Slack); could you help me troubleshoot this problem?

@Isotr0py
Collaborator

I still haven't found the issue. I can provide the weights (via Slack); could you help me troubleshoot this problem?

Sure!

@Isotr0py
Collaborator

During my debugging, the incorrect mm_kwargs comes from here:

        mm_kwargs = MultiModalKwargs.from_items([
            item.value for cache_items in mm_cache_items_merged.values()
            for item in cache_items
        ])

If I set disable_mm_preprocessor_cache=True, then video_grid_thw is something like:

tensor([[ 1, 28, 52],
        [ 1, 28, 52],
        ...
        [ 1, 28, 52]])
shape: 76 x 3

Then it raised an error about a mismatched number of videos:

RuntimeError: Expected there to be 1 video items in keyword arguments corresponding to 1 video data items, but only found 76! There is likely a problem with your implementation of merged multi-modal processor for this model (usually arising from an inconsistency between `_call_hf_processor` and `_get_mm_fields_config`).
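
For reference, a toy illustration of the shape mismatch behind that error, under the assumption that a batched "video" field counts one item per row of video_grid_thw; folding the per-frame rows into one row per video is shown only as one conceivable reconciliation, not necessarily the fix this PR adopts.

import torch

per_frame_grid = torch.tensor([[1, 28, 52]] * 76)  # one row per sampled frame
num_items_seen = per_frame_grid.shape[0]           # 76 "video items"

# One row per video instead: t = number of frames, h and w from any frame.
per_video_grid = torch.tensor([[num_items_seen,
                                int(per_frame_grid[0, 1]),
                                int(per_frame_grid[0, 2])]])
print(per_video_grid)  # tensor([[76, 28, 52]])
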

Labels
documentation (Improvements or additions to documentation), frontend, multi-modality (Related to multi-modality, #4194)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants