
Add GLM4.1V model (Draft) #19331


Open

wants to merge 18 commits into base: main

Conversation

Contributor

@zRzRzRzRzRzRzR commented Jun 8, 2025

This PR adds support for the GLM-4.1V model.
Because the implementation in the transformers library has been upgraded, some interfaces have changed.
This model requires the video's FPS and duration for frame sampling.
The related transformers PR is here.


github-actions bot commented Jun 8, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist bot left a comment

Hello @zRzRzRzRzRzRzR, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, gemini-code-assist here with a summary of this pull request. This PR introduces support for the GLM-4.1V multimodal model. The primary goal is to integrate this new model into vLLM, enabling inference with both image and video inputs. This required adapting the model architecture, multimodal processing logic, and specific components like positional embeddings from the upstream transformers library implementation. Key changes include adding the model definition, implementing its vision tower and multimodal processing, adding a dedicated video loader with frame sampling capabilities, and updating core vLLM components like rotary embeddings and chat utilities to handle the model's specific requirements and token formats.

Highlights

  • New Model Support: Adds the Glm4vForConditionalGeneration model to support the GLM-4.1V multimodal model.
  • Multimodal Processing: Implements the GLM-4.1V specific vision transformer, patch merger, and multimodal processing logic to handle image and video inputs.
  • Video Loading: Introduces a dedicated video loader (Glm4vVideoLoader) for GLM-4.1V that handles frame sampling and metadata extraction.
  • Positional Encoding: Adds GLM-4.1V specific logic to the rotary embedding layer to correctly calculate positions for multimodal inputs.
  • Chat Utilities: Updates chat utility functions to recognize and handle GLM-4.1V's unique image and video placeholder tokens.
  • Examples and Registry: Adds an example for offline inference with GLM-4.1V and registers the new model within vLLM.

Changelog

Click here to see the changelog
  • examples/offline_inference/vision_language.py
    • Added run_glm4_1v function to demonstrate offline inference for GLM-4.1V, including setting up engine arguments and prompt formatting for image and video.
  • tests/models/registry.py
    • Added Glm4vForConditionalGeneration to the check_available_online registry, mapping it to the Hugging Face model name 'THUDM/GLM-4.1V-9B'.
  • vllm/entrypoints/chat_utils.py
    • Added handling for the glm4v model type in _placeholder_str to use the correct image and video placeholder tokens (<|begin_of_image|><|image|><|end_of_image|> and <|begin_of_video|><|video|><|end_of_video|>); see the sketch after this changelog.
  • vllm/model_executor/layers/rotary_embedding.py
    • Imported itertools, List, Tuple.
    • Added _glm4v_get_input_positions_tensor class method to calculate rotary positional embeddings specifically for GLM-4.1V, considering image and video grid structures.
    • Integrated the call to _glm4v_get_input_positions_tensor within get_input_positions_tensor for glm4v model types.
  • vllm/model_executor/models/glm4_1v.py
    • Added new file implementing the Glm4vForConditionalGeneration model.
    • Defined input types for image and video pixel values and embeddings.
    • Implemented GLM-4.1V specific vision encoder components (Glm4vVisionMLP, Glm4vVisionAttention, Glm4vVisionPatchEmbed, Glm4vPatchMerger, Glm4vVisionEmbeddings, Glm4vVisionRotaryEmbedding, Glm4vVisionTransformer).
    • Implemented multimodal processing logic (Glm4vProcessingInfo, Glm4vDummyInputsBuilder, Glm4vMultiModalProcessor) including handling image resizing, patch merging, and prompt updates.
    • Integrated the vision tower and language model backbone.
    • Added methods for parsing multimodal inputs, processing them into embeddings, and merging them with text embeddings.
    • Implemented weight loading logic specific to GLM-4.1V.
    • Registered the model and its processor with the vLLM multimodal registry.
  • vllm/model_executor/models/registry.py
    • Added Glm4vForConditionalGeneration to the VLLM_REGISTERED_MODELS dictionary, linking it to the new glm4_1v module.
  • vllm/multimodal/video.py
    • Imported Dict and Tuple.
    • Added Glm4vVideoLoader registered under the 'glm4v' backend, using OpenCV to load video frames and metadata with support for uniform frame sampling.
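
As a quick illustration of the placeholder handling mentioned above, here is a minimal sketch built only from the token strings listed in the changelog; the helper name is hypothetical, and the real _placeholder_str in vllm/entrypoints/chat_utils.py has a different signature and more cases.

def glm4v_placeholder(modality: str) -> str:
    # Hypothetical standalone helper mirroring the glm4v branch described
    # above; not the actual _placeholder_str signature.
    if modality == "image":
        return "<|begin_of_image|><|image|><|end_of_image|>"
    if modality == "video":
        return "<|begin_of_video|><|video|><|end_of_video|>"
    raise ValueError(f"Unsupported modality for glm4v: {modality}")
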
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@mergify bot added the documentation (Improvements or additions to documentation), frontend, and multi-modality (Related to multi-modality, #4194) labels on Jun 8, 2025
Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the GLM-4.1V model, which is a valuable addition. The implementation involves adding a new model file, updating the rotary embedding logic for multimodal inputs, and integrating the model into the examples and registry. Overall, the structure of the changes looks good, following the pattern for adding new multimodal models. However, I've identified a few areas, particularly in the multimodal position embedding and video handling, that require attention to ensure correctness and consistency.

Summary of Findings

  • Incorrect handling of video position IDs and grid dimensions: The logic for calculating 3D position IDs for video tokens in _glm4v_get_input_positions_tensor appears to incorrectly use image grid dimensions for video height and width, and the temporal counter logic seems flawed for videos with multiple tokens.
  • Inconsistent video metadata: The _parse_and_validate_video_input function hardcodes FPS and video backend in the metadata, which is inconsistent with the Glm4vVideoLoader that samples these values from the video file.
  • Type mismatch in vision embeddings forward pass: The Glm4vVisionTransformer.forward method passes batch sequence lengths (seqlens) to Glm4vVisionEmbeddings.forward, which expects sequence lengths per image item (lengths).
  • Potential issue in v0 multimodal embedding merging: The v0 compatibility method get_input_embeddings_v0 merges image and video embeddings sequentially, which might be incorrect for interleaved multimodal inputs compared to the simultaneous merging approach in the v1 method.
  • Strict assertion in rotary embedding: An assertion in _glm4v_get_input_positions_tensor requires text segments to have a length greater than 0, which might be too strict.
  • Hardcoded FPS in example script: The example script hardcodes FPS in the mm_processor_kwargs, which might conflict with the model's actual video processing logic.
  • Misleading limit_mm_per_prompt in example script: The example script sets an image-specific limit for a function designed to handle both image and video.

Merge Readiness

This pull request introduces support for a new model and includes significant changes to the multimodal infrastructure. While the overall structure is good, there are several high-severity issues related to the correctness of multimodal position encoding and video metadata handling that need to be addressed before merging. I am unable to approve this pull request, and others should review and approve this code before merging.

Comment on lines +1219 to +1222
                        video_frame_num,
                        image_grid_thw[mm_data_idx][1],
                        image_grid_thw[mm_data_idx][2],
                    )
Contributor

high

In the video processing block, the temporal dimension t is set to video_frame_num (line 1219), and the height h and width w are taken from image_grid_thw (lines 1220-1221). This seems incorrect for video. The height and width should likely come from video_grid_thw instead of image_grid_thw.

                    t, h, w = (
                        video_frame_num,
                        video_grid_thw[mm_data_idx][1],
                        video_grid_thw[mm_data_idx][2],
                    )

torch.stack([t_index, h_index, w_index]) + st_idx)

mm_data_idx += 1
video_frame_num += 1
Contributor

high

The video_frame_num is incremented by 1 after processing a video block. If a single video item consists of multiple 'video' tokens (e.g., representing different frames or chunks), this logic might be incorrect. The temporal dimension t for a video item should likely be derived from the video's properties (like number of frames or temporal patches) rather than a simple counter that increments per video block.
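
For reference, a toy sketch of how 3D rotary indices are usually built per video item in the Qwen2-VL-style scheme this code follows, where the temporal index is derived from the item's own (t, h, w) grid rather than a running counter. The grid values below are made up for illustration.

import torch

t, h, w = 4, 26, 46  # made-up (t, h, w) grid for one video item
t_index = torch.arange(t).view(-1, 1).expand(-1, h * w).flatten()
h_index = torch.arange(h).view(1, -1, 1).expand(t, -1, w).flatten()
w_index = torch.arange(w).view(1, 1, -1).expand(t, h, -1).flatten()
positions_3d = torch.stack([t_index, h_index, w_index])  # shape (3, t*h*w)
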

Comment on lines +1257 to +1258
mrope_position_delta = (llm_positions.max() + 1 -
                        len(input_tokens)).item()
Contributor

high

The calculation of mrope_position_delta as (llm_positions.max() + 1 - len(input_tokens)).item() seems specific to how GLM4V handles positions. Could you confirm if this delta calculation is correct and aligns with the model's expected positional encoding behavior, especially considering the 3D position IDs?
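
For context, a minimal sketch of how such a delta is typically consumed during decoding in M-RoPE-style schemes: positions for text-only continuation advance linearly, offset by the delta computed at prefill time. The helper name below is hypothetical, not vLLM's API.

import torch

def decode_positions(mrope_position_delta: int, context_len: int,
                     seq_len: int) -> torch.Tensor:
    # Each new token gets position (index + delta), broadcast to the three
    # rotary axes (t, h, w) used by M-RoPE.
    pos = torch.arange(context_len, seq_len) + mrope_position_delta
    return pos.unsqueeze(0).expand(3, -1)
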

Comment on lines +736 to +737
x = self.embeddings(x, seqlens, grid_thw, image_type_ids[:, 0],
                    image_type_ids[:, 1])
Contributor

high

In Glm4vVisionTransformer.forward, self.embeddings is called with seqlens as the second argument (lengths). However, Glm4vVisionEmbeddings.adapt_position_encoding expects lengths to be a list or tensor of sequence lengths for each image ([batch_size]). seqlens is a list of sequence lengths for the batch of tokens used in attention, which is derived from cu_seqlens. This seems like a type mismatch and potential correctness issue. The lengths argument for position encoding should likely be derived from grid_thw (which has shape (num_images, 3)).
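
To make the distinction concrete, a small illustration with assumed grid values: per-item lengths for position encoding can be derived from grid_thw, which is not the same object as the attention seqlens computed from cu_seqlens.

import torch

grid_thw = torch.tensor([[1, 26, 46],
                         [1, 26, 46]])   # (num_items, 3): t, h, w per item
per_item_lengths = grid_thw.prod(-1)     # tensor([1196, 1196]) tokens per item
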

Comment on lines +1195 to +1199
fps=2.0,
duration=num_frames / 2.0,
total_num_frames=num_frames,
video_backend="pyav",
))
Contributor

high

In _parse_and_validate_video_input, the video_metadata is constructed by hardcoding fps=2.0 and video_backend="pyav". However, the Glm4vVideoLoader samples the actual FPS and duration from the video file. This hardcoding seems inconsistent with the loader and might lead to incorrect metadata being passed if the actual video properties differ.
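
A sketch of the alternative being suggested, using the field names visible in the snippet above: build the metadata from whatever the loader actually sampled instead of constants. The dict keys are assumptions; VideoMetadata is the transformers class referenced elsewhere in this thread and requires a recent transformers build.

from transformers.video_utils import VideoMetadata  # recent transformers only

def metadata_from_loader(sampled: dict) -> VideoMetadata:
    # Use the measured values rather than hardcoded fps/backend.
    return VideoMetadata(
        fps=sampled["fps"],
        duration=sampled["duration"],
        total_num_frames=sampled["total_num_frames"],
        video_backend=sampled.get("video_backend", "opencv"),
    )
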

Collaborator

Do we really need to pass VideoMetadata into Glm4vVideoPixelInputs? It doesn't seem to be used in the model's forward pass.

"max_pixels": 1280 * 28 * 28,
"fps": 1,
},
limit_mm_per_prompt={"image": 1},
Contributor

medium

The limit_mm_per_prompt is set to {"image": 1} here. However, the run_glm4_1v function is designed to handle both image and video modalities. If video input is used with this function, this limit might be incorrect or misleading. Should this limit be configurable based on the modality used, or should it allow for both image and video if the model supports both?
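
One possible shape for the fix, sketched under the assumption that the example selects a single modality per run; the helper name is illustrative only.

def mm_limit_for(modality: str) -> dict[str, int]:
    # Limit whichever modality the example actually exercises, rather than
    # always limiting images.
    assert modality in ("image", "video")
    return {modality: 1}

# e.g. engine_args = dict(limit_mm_per_prompt=mm_limit_for("video"))
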

Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
@zRzRzRzRzRzRzR
Contributor Author

I looked at the part where the video is read in. The existing video reading path uses OpenCV but does not return the frame rate. Our model needs FPS as a sampling parameter, so I implemented Glm4vVideoLoader separately. I don't know whether this is a good approach, or whether you plan to modify OpenCVVideoBackend in a future update to return the other information transformers needs, which is covered by from transformers.video_utils import VideoMetadata in the latest transformers source.
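
For readers following along, here is a rough, self-contained sketch of the idea behind such a loader: sample frames uniformly with OpenCV and return the fps/duration the model needs. This is illustrative only and is not the PR's Glm4vVideoLoader.

import cv2
import numpy as np

def load_video_with_metadata(path: str, num_frames: int = 16):
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num=min(num_frames, total)).astype(int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()

    metadata = {
        "fps": fps,
        "duration": total / fps if fps else 0.0,
        "total_num_frames": total,
        "frame_indices": indices.tolist(),
    }
    return np.stack(frames), metadata
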

Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Collaborator

@Isotr0py left a comment

Added some initial comments, PTAL :)

Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
@@ -427,6 +427,9 @@ def _parse_video_data(
        if self._is_embeddings(data):
            return VideoEmbeddingItems(data)

        if isinstance(data, tuple) and len(data) == 2:
            frames, metadata = data
            data = frames
Member

@DarkLight1337 Jun 9, 2025

The metadata should be passed to VideoProcessorItems as well so it can be used by the HF processor (preferably via get_processor_data)

Contributor Author

Updated like this?

Member

Yes, but make sure the metadata is passed to the HF processor correctly as well.

Contributor Author

@zRzRzRzRzRzRzR Jun 9, 2025

For now it is working; I checked that the HF processor receives the metadata and handles it correctly. But there is an issue: this feature is only supported in the current transformers source code, which means it will land in the next transformers release. I contacted their team, and they said this feature is still being polished. I know that this class may not be supported by some older models.

Contributor Author

GLM-4.1V will also be supported in the next transformers release, and the transformers team is expected to land the processor improvements this week.

Member

I see. Perhaps we should then add a transformers version check to get_processor_data to maintain backward compatibility.
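
A minimal sketch of such a guard, assuming the check keys off the transformers version that ships VideoMetadata; the version threshold, helper name, and dict key are placeholders, not confirmed values.

import transformers
from packaging.version import Version

# Placeholder threshold: the release where transformers.video_utils.VideoMetadata lands.
_SUPPORTS_VIDEO_METADATA = Version(transformers.__version__) >= Version("4.53.0")

def maybe_attach_video_metadata(processor_data: dict, metadata) -> dict:
    # Only forward video metadata when the installed transformers understands it.
    if _SUPPORTS_VIDEO_METADATA and metadata is not None:
        processor_data["video_metadata"] = metadata  # assumed key
    return processor_data
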

Contributor Author

Yes, but I happened to run into their update mid-implementation. The GLM-4.1V adaptation targets the new transformers interface directly, so I am not sure what impact this change has on video-understanding models under earlier transformers versions, such as 4.52 and before.

Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
@zRzRzRzRzRzRzR
Contributor Author

zRzRzRzRzRzRzR commented Jun 9, 2025

I suggest merging this PR after the transformers team has confirmed their merge (expected this week), to ensure that the transformers main branch is stable.

@zRzRzRzRzRzRzR
Contributor Author

In the latest version of transformers, images and videos are processed separately, with dedicated image processors and video processors. Is there a related implementation in vLLM for reference?

In the current implementation, I've noticed that the video processor is never called. The GLM-4.1V model processes videos by sampling them into multiple image frames. However, I've found that only one frame is passed to vLLM each time, while transformers works normally. Do you know which files contain the video processing logic?

@DarkLight1337
Member

DarkLight1337 commented Jun 23, 2025

We use the parent processor that is constructed from AutoProcessor in vLLM. So if only the image processor is being called, I suggest inspecting the inputs to AutoProcessor.__call__ to figure out why.

@zRzRzRzRzRzRzR
Contributor Author

zRzRzRzRzRzRzR commented Jun 23, 2025

I observed that the HF video processor was called, but in the return from the HF processor:

        return BatchFeature(data=data, tensor_type=return_tensors)

__all__ = ["Glm4vVideoProcessor"]

both pixel_values_videos and video_grid_thw do not align with the kwargs expected in vLLM.
Below is the print output from transformers right before the return:

pixel_values_videos: tensor([[ 1.4632,  1.4632,  1.4632,  ...,  2.0890,  2.0890,  2.0890],
        [ 1.4778,  1.4778,  1.4778,  ...,  2.0606,  2.0606,  2.0606],
        [ 1.5070,  1.5070,  1.5070,  ...,  2.0890,  2.0890,  2.0890],
        ...,
        [ 0.6457,  0.6749,  0.6749,  ..., -0.6412, -0.6412, -0.6412],
        [ 0.6457,  0.6311,  0.6311,  ..., -0.7977, -0.7692, -0.7692],
        [ 0.7041,  0.7479,  0.7041,  ..., -0.6697, -0.6981, -0.7123]]) with shape torch.Size([19136, 1176])
        
video_grid_thw:[[1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46], [1, 26, 46]]

But in vLLM, the output is printed as

tensor([[[ 1.4609,  1.4609,  1.4609,  ...,  2.0938,  2.0938,  2.0938],
         [ 1.4766,  1.4766,  1.4766,  ...,  2.0625,  2.0625,  2.0625],
         [ 1.5078,  1.5078,  1.5078,  ...,  2.0938,  2.0938,  2.0938],
         ...,
         [ 0.7930,  0.7930,  0.8359,  ...,  0.9648,  0.9648,  0.9531],
         [-0.1572,  0.0033,  0.2520,  ..., -0.4141, -0.3711, -0.4004],
         [ 0.3398,  0.3105,  0.2656,  ..., -0.5000, -0.5000, -0.3574]]], with shape torch.Size([299, 1176])
       device='cuda:0', dtype=torch.bfloat16)
tensor([[[ 1, 26, 46]]], device='cuda:0')

@DarkLight1337
Member

DarkLight1337 commented Jun 23, 2025

Are the inputs to __call__ the same?

@zRzRzRzRzRzRzR
Contributor Author

They are the same. I printed inside:

class Glm4vMultiModalProcessor(BaseMultiModalProcessor[Glm4vProcessingInfo]):

    def _call_hf_processor(

The return result is the same as HF's.

@DarkLight1337
Member

Can you try to disable the multimodal preprocessing cache and see if the results are still different? Perhaps HF does some padding based on the other inputs in the same batch

@zRzRzRzRzRzRzR
Contributor Author

How do I do that? Also, the batch size is currently set to 1. GLM's video grid is different from Qwen's: it is a list containing multiple image grids.
Additionally, I suspect the issue is not in glm4_1v.py itself, but in functions it inherits that are not explicitly written in this file.

@DarkLight1337
Member

You can set --disable-mm-preprocessor-cache
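
For the offline API, the equivalent of that flag is an engine argument; a minimal sketch, assuming the LLM constructor forwards it to the engine and using the model name from the registry entry above.

from vllm import LLM

llm = LLM(
    model="THUDM/GLM-4.1V-9B",           # registry entry used in this PR
    disable_mm_preprocessor_cache=True,  # offline equivalent of the CLI flag
)
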

@zRzRzRzRzRzRzR
Contributor Author

It's still different.

Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
@zRzRzRzRzRzRzR
Contributor Author

https://github.com/huggingface/transformers/pull/38431
This PR has been merged. Could you please help take a look at the preprocessing part? Only the video preprocessing is not aligned: each time only one frame of the video is passed in. I really couldn't find the issue.

@zRzRzRzRzRzRzR
Contributor Author

I can provide the model; _call_hf_processor in Glm4vMultiModalProcessor returns the correct sample.

Comment on lines +1048 to +1053
    def _get_mm_fields_config(
        self,
        hf_inputs: BatchFeature,
        hf_processor_mm_kwargs: Mapping[str, object],
    ) -> Mapping[str, MultiModalFieldConfig]:
        return _qwen2vl_field_config(hf_inputs)
Collaborator

It seems we're reusing qwen2_vl's mm_fields_config here; perhaps this is the cause of the incorrect video_grid_thw shape?

def _qwen2vl_field_config(hf_inputs: Mapping[str, torch.Tensor]):
    image_grid_thw = hf_inputs.get("image_grid_thw", torch.empty((0, 3)))
    image_grid_sizes = image_grid_thw.prod(-1)

    video_grid_thw = hf_inputs.get("video_grid_thw", torch.empty((0, 3)))
    video_grid_sizes = video_grid_thw.prod(-1)

    return dict(
        pixel_values=MultiModalFieldConfig.flat_from_sizes(
            "image", image_grid_sizes),
        image_embeds=MultiModalFieldConfig.flat_from_sizes(
            "image", image_grid_sizes),
        image_grid_thw=MultiModalFieldConfig.batched("image"),
        pixel_values_videos=MultiModalFieldConfig.flat_from_sizes(
            "video", video_grid_sizes),
        video_embeds=MultiModalFieldConfig.flat_from_sizes(
            "video", video_grid_sizes),
        video_grid_thw=MultiModalFieldConfig.batched("video"),
    )

Contributor Author

I checked here and printed video_grid_thw; it is correct.

Contributor Author

@zRzRzRzRzRzRzR Jun 25, 2025

This is its correct shape:

tensor([[ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46],
        [ 1, 26, 46]])

and I printed it here:

def _qwen2vl_field_config(hf_inputs: Mapping[str, torch.Tensor]):
    image_grid_thw = hf_inputs.get("image_grid_thw", torch.empty((0, 3)))
    image_grid_sizes = image_grid_thw.prod(-1)

    video_grid_thw = hf_inputs.get("video_grid_thw", torch.empty((0, 3)))
    print(video_grid_thw)
    video_grid_sizes = video_grid_thw.prod(-1)

It is correct there.

But when I print the kwargs in _parse_and_validate_video_input, it turns into:

[[[ 1, 26, 46]]]

Collaborator

@Isotr0py Jun 25, 2025

Yeah, hf_inputs should be fine here because it's exactly the output of _call_hf_processor.

I suspect that mm_field_config (the returned dictionary) may be configured incorrectly here. Perhaps you can take a look at MultiModalKwargs.from_hf_inputs in the processor:

        mm_kwargs = MultiModalKwargs.from_hf_inputs(
            processed_data,
            self._get_mm_fields_config(processed_data, hf_processor_mm_kwargs),
        )

I guess video_grid_thw in mm_kwargs will have an incorrect shape starting from there.

Contributor Author

mm_kwargs = MultiModalKwargs.from_hf_inputs(
            processed_data,
            self._get_mm_fields_config(processed_data, hf_processor_mm_kwargs),
        )
print(mm_kwargs)

It's still correct.

@zRzRzRzRzRzRzR
Contributor Author

The issue doesn't seem to be in processing.py; I have printed in many places, and they all show the complete grids.

@Isotr0py
Collaborator

I see, so the multimodal processor implementation should be fine, and the issue can only arise after the processor call. Are the input **kwargs of _parse_and_validate_video_input still correct at that step?

@zRzRzRzRzRzRzR
Contributor Author

zRzRzRzRzRzRzR commented Jun 26, 2025

I see, so the multimodal processor implementation should be fine, and the issue can only arise after the processor call. Are the input **kwargs of _parse_and_validate_video_input still correct at that step?

No, the **kwargs of _parse_and_validate_video_input are still wrong, but I found that the length of the video input is 1. Might this cause issues?

In the transformers processing, I first use a special placeholder token for the video and then split the video into individual frames, so each video corresponds to a list of images and a grid.

@Isotr0py
Collaborator

Hmmm, the only remaining possible cause I can think of now is v1's mm_encoder, which is unlikely to cause issues...

    def _execute_mm_encoder(self, scheduler_output: "SchedulerOutput"):
        scheduled_encoder_inputs = scheduler_output.scheduled_encoder_inputs
        if not scheduled_encoder_inputs:
            return

        # Batch the multi-modal inputs.
        mm_inputs = list[MultiModalKwargs]()
        req_ids_pos = list[tuple[str, int, PlaceholderRange]]()
        for req_id, encoder_input_ids in scheduled_encoder_inputs.items():
            req_state = self.requests[req_id]
            for mm_input_id in encoder_input_ids:
                mm_inputs.append(req_state.mm_inputs[mm_input_id])
                req_ids_pos.append(
                    (req_id, mm_input_id, req_state.mm_positions[mm_input_id]))

        # Batch mm inputs as much as we can: if a request in the batch has
        # multiple modalities or a different modality than the previous one,
        # we process it separately to preserve item order.
        # FIXME(ywang96): This is a hacky way to deal with multiple modalities
        # in the same batch while still being able to benefit from batching
        # multimodal inputs. The proper solution should be reordering the
        # encoder outputs.
        grouped_mm_inputs_list = group_mm_inputs_by_modality(mm_inputs)

        encoder_outputs = []
        for grouped_mm_inputs in grouped_mm_inputs_list:
            batched_mm_inputs = MultiModalKwargs.batch(
                grouped_mm_inputs, pin_memory=self.pin_memory)
            batched_mm_inputs = MultiModalKwargs.as_kwargs(
                batched_mm_inputs,
                device=self.device,
            )

BTW, have you checked whether V0 also has this issue? If so, the issue is unlikely to come from the mm_encoder.

@zRzRzRzRzRzRzR
Contributor Author

I printed

        for grouped_mm_inputs in grouped_mm_inputs_list:
            print(grouped_mm_inputs)

The grouped_mm_inputs here is already incorrect.

@zRzRzRzRzRzRzR
Contributor Author

I still haven't found the issue. I can provide the weights (via Slack); could you help me troubleshoot this problem?

@Isotr0py
Collaborator

I still haven't found the issue. I can provide the weights (via Slack); could you help me troubleshoot this problem?

Sure!

@Isotr0py
Collaborator

During my debugging, the incorrect mm_kwargs comes from here:

        mm_kwargs = MultiModalKwargs.from_items([
            item.value for cache_items in mm_cache_items_merged.values()
            for item in cache_items
        ])

If I set disable_mm_preprocessor_cache=True, then video_grid_thw is something like:

tensor([[ 1, 28, 52],
        [ 1, 28, 52],
        ...
        [ 1, 28, 52]])
shape: 76 x 3

Then it raised an error about a mismatched number of videos:

RuntimeError: Expected there to be 1 video items in keyword arguments corresponding to 1 video data items, but only found 76! There is likely a problem with your implementation of merged multi-modal processor for this model (usually arising from an inconsistency between `_call_hf_processor` and `_get_mm_fields_config`).
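
For reference, a toy illustration of the shape mismatch behind that error, under the assumption that a batched "video" field counts one item per row of video_grid_thw; folding the per-frame rows into one row per video is shown only as one conceivable reconciliation, not necessarily the fix this PR adopts.

import torch

per_frame_grid = torch.tensor([[1, 28, 52]] * 76)  # one row per sampled frame
num_items_seen = per_frame_grid.shape[0]           # 76 "video items"

# One row per video instead: t = number of frames, h and w from any frame.
per_video_grid = torch.tensor([[num_items_seen,
                                int(per_frame_grid[0, 1]),
                                int(per_frame_grid[0, 2])]])
print(per_video_grid)  # tensor([[76, 28, 52]])
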

Labels
documentation (Improvements or additions to documentation), frontend, multi-modality (Related to multi-modality, #4194)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants