Skip to content

Added snapshot feature for vhost-device-vsock. #827

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

WeiChungHsu
Copy link

Added snapshot feature for vhost-device-vsock.
There is a hang issue during snapshot disable vring, my solution is to check vring enable/disable status but it requires a new function get_enabled from vring.
The vring new function get_enabled PR is here: rust-vmm/vhost#284

More detailed explain the hang issue:
When the Vring is disabled (and queue set to not ready), the vhu_vsock_thread.rs still try to iterate and read RX queue, therefore, it hang at this line:
https://github.com/rust-vmm/vhost-device/blob/main/vhost-device-vsock/src/vhu_vsock_thread.rs#L591-L594

Furthermore, it hang because this Queue iter() operation return error at this line:
https://github.com/rust-vmm/vm-virtio/blob/main/virtio-queue/src/queue.rs#L592-L597

Requirements

Before submitting your PR, please make sure you addressed the following
requirements:

  • All commits in this PR have Signed-Off-By trailers (with
    git commit -s), and the commit message has max 60 characters for the
    summary and max 75 characters for each description line.
  • All added/changed functionality has a corresponding unit/integration
    test.
  • All added/changed public-facing functionality has entries in the "Upcoming
    Release" section of CHANGELOG.md (if no such section exists, please create one).
  • Any newly added unsafe code is properly documented.

@stefano-garzarella
Copy link
Member

@WeiChungHsu CI is failing, please fix it.

@stefano-garzarella
Copy link
Member

The vring new function get_enabled PR is here: rust-vmm/vhost#284

We can't merge code that depends on ongoing work, so I mark this as draft.
Please also temporary change the vhost crate dep to a git repo that contains the change, so the CI can test this code.
When the changes to vhost will be merged, we can remove that and mark this as ready.

@stefano-garzarella stefano-garzarella marked this pull request as draft March 18, 2025 09:54
@WeiChungHsu
Copy link
Author

@WeiChungHsu CI is failing, please fix it.

Sorry, I am fixing and working on it now.

pub(crate) struct VhostUserVsockBackend {
config: VirtioVsockConfig,
queue_size: usize,
pub threads: Vec<Mutex<VhostUserVsockThread>>,
queues_per_thread: Vec<u64>,
pub exit_event: EventFd,
pub handler_thread: Mutex<HandlerThread>,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this whole abstraction is needed, because if a thread is created, then Option<JoinHandle<()>> is not None. Plus, created field does not guard against handle being None because even if they are guarded under the same mutex, they are accessed separately (!). Change the handler_thread field to Mutex<Option<JoinHandle<()>>> and in check_device_state function you can lock the mutex and use Option::take to move the thread handle and join it if it's not None.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just fixed the codes based on the suggestion. Please let me know if it meets your expectation.

self.handler_thread.lock().unwrap().created = false;
self.handler_thread.lock().unwrap().handle.take().unwrap().join().unwrap();
}
return Ok(());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/return Ok(());/Ok(())

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

@epilys epilys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the commit message is all on one line. Please put a short descriptive commit title and use the commit message body for longer text.

There is a hang issue during snapshot disable
vring, my solution is to check vring enable/disable
status but it requires a new function get_enabled
from vring.

Signed-off-by: Wei-Chung Hsu <[email protected]>
@WeiChungHsu
Copy link
Author

Also, the commit message is all on one line. Please put a short descriptive commit title and use the commit message body for longer text.

Fixed the commit message with sign-off. Thanks!

Fixed CI errors and reviewer's comments.

Signed-off-by: Wei-Chung Hsu <[email protected]>
@WeiChungHsu
Copy link
Author

WeiChungHsu commented Mar 25, 2025

I fixed the most of CI failures, but the this failure is based on another PR in vhost crate: "error[E0599]: no method named get_enabled found". Therefore, I have to wait that PR merged first.
The pending PR in vhost crate is: rust-vmm/vhost#291

@stefano-garzarella
Copy link
Member

I fixed the most of CI failures, but the this failure is based on another PR in vhost crate: "error[E0599]: no method named get_enabled found". Therefore, I have to wait that PR merged first. The pending PR in vhost crate is: rust-vmm/vhost#284

As I mentioned here #827 (comment) can you change Cargo.toml to point to your branch used for rust-vmm/vhost#284 ?

In this way we can test also this PR. When that PR will be merged and a new vhost-user-backend released, we can remove that change and mark this PR ready to be merged.

@stefano-garzarella
Copy link
Member

@WeiChungHsu also, instead of clicking on "Update branch", you can go to the arrow and do "Update with rebase" (maybe I should see if I can enable it by default)

WeiChungHsu and others added 2 commits April 15, 2025 09:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants