
zvol: cleanup & fixup zvol destruction sequence and locking #17625


Closed
wants to merge 5 commits

Conversation

robn (Member) commented Aug 12, 2025

[Sponsors: Klara, Inc., Railway Corporation]

Motivation and Context

We have a customer whose workload involves rapid creation and destruction of zvols. This could easily result in panics or deadlocks, which ultimately are caused by incorrect or inadequate locking through the zvol lifecycle.

Description

There are three main issues resolved here which all contribute to the overall problem.

OS-side object not destroyed before freeing the zvol_state_t

When destroying a zvol, it is not "unpublished" from the system (that is, /dev/zd* node removed) until zvol_os_free(). Under Linux, at the time del_gendisk() and put_disk() are called, the device node may still have an active hold, from a userspace program or something inside the kernel (a partition probe). As it stands, this can lead to calls to zvol_open() or zvol_release() while the zvol_state_t is partially or fully freed. zvol_open() has some protection against this by checking that private_data is NULL, but zvol_release() does not.

This implements a better ordering for all of this by adding a new OS-side method, zvol_os_remove_minor(), which is responsible for fully decoupling the "private" (OS-side) objects from the zvol_state_t. For Linux, that means calling put_disk(), nulling private_data, and freeing zv_zso.

This takes the place of zvol_os_clear_private(), which was a nod in that direction but did not do enough, and did not do it early enough.
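
For illustration only, here is a minimal sketch of the ordering described above, using the names from this PR (zvol_os_remove_minor(), zv_zso, private_data); the struct layout and the exact call sequence are assumptions, not the actual patch:

```c
/*
 * Sketch: unpublish the device node and detach the OS-side state from
 * the zvol_state_t before the zvol_state_t itself can be freed, so a
 * straggling zvol_open()/zvol_release() sees NULL instead of freed memory.
 */
static void
zvol_os_remove_minor(zvol_state_t *zv)
{
	struct zvol_state_os *zso = zv->zv_zso;	/* hypothetical layout */

	/* Remove the /dev/zd* node; existing holds drain through release. */
	del_gendisk(zso->zvo_disk);

	/* Decouple the OS-side object: late callers now see NULL and bail. */
	zso->zvo_disk->private_data = NULL;

	/* Drop the gendisk and free the OS-side state. */
	put_disk(zso->zvo_disk);
	kmem_free(zso, sizeof (*zso));
	zv->zv_zso = NULL;
}
```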

zvol_state_lock used to protect OS-side private data

zvol_state_lock is intended to protect access to the global name->zvol lists (zvol_find_by_name()), but has also been used to control access to OS-side private data, accessed through whatever kernel object is used to represent the volume (gendisk, geom, etc).

This appears to have been necessary to some degree because the OS-side object is what's used to get a handle on zvol_state_t, so zv_state_lock and zv_suspend_lock can't be used to manage access, but also, with the private object and the zvol_state_t being shut down and destroyed at the same time in zvol_os_free(), we must ensure that the private object pointer only ever corresponds to a real zvol_state_t, not one in partial destruction. Taking the global lock seems like a convenient way to ensure this.

The problem with this is that zvol_state_lock does not actually protect access to the zvol_state_t internals, so we need to take zv_state_lock and/or zv_suspend_lock. If those are contended, this can then cause OS-side operations (eg zvol_open()) to sleep waiting for them while holding zvol_state_lock. This then blocks out all other OS-side operations which want to get the private data, and any ZFS-side control operations that would take the write half of the lock. It's even worse if ZFS-side operations induce OS-side calls back into the zvol (eg creating a zvol triggers a partition probe inside the kernel, and also a userspace access from udev to set up device links). And it gets worse again if anything decides to defer those ops to a task and wait on them, which zvol_remove_minors_impl() will do under high load.

However, since the previous commit, we have a guarantee that the private data pointer will always be NULL'd out in zvol_os_remove_minor() before the zvol_state_t is made invalid, but it won't happen until all users are ejected. So, if we make access to the private object pointer atomic, we remove the need to take a global lock to access it, and so we can remove all acquisitions of zvol_state_lock from the OS side.
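
As a rough illustration of that idea (not the actual change), the OS side can then resolve the private pointer with a single atomic read and fall back to per-zvol locks only, never zvol_state_lock; zvol_state_from_disk() and zvol_open_sketch() here are hypothetical names:

```c
/*
 * Sketch: zvol_os_remove_minor() stores NULL into private_data before the
 * zvol_state_t becomes invalid, so a plain atomic read is enough to know
 * whether the minor is still live. No global lock is needed.
 */
static zvol_state_t *
zvol_state_from_disk(struct gendisk *disk)
{
	/* Paired with the NULL store in zvol_os_remove_minor(). */
	return (READ_ONCE(disk->private_data));
}

static int
zvol_open_sketch(struct gendisk *disk)
{
	zvol_state_t *zv = zvol_state_from_disk(disk);

	if (zv == NULL)
		return (SET_ERROR(ENXIO));	/* minor already removed */

	/* Only per-zvol locks from here; zvol_state_lock is never taken. */
	mutex_enter(&zv->zv_state_lock);
	zv->zv_open_count++;
	mutex_exit(&zv->zv_state_lock);

	return (0);
}
```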

Use of async fallbacks in response to failed lock acquisitions

Since both ZFS- and OS-sides of a zvol now take care of their own locking and don't get in each other's way, there's no need for the very complicated removal code to fall back to async tasks if the locks needed at each stage can't be obtained right now.

Here we change it to be a linear three-step process: select zvols of interest and flag them for removal, then wait for them to shed activity and then remove them, and finally, free them.
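
Roughly, the shape of that flow looks like the sketch below (names such as zv_removing_cv and the list handling are illustrative, not a copy of the code):

```c
/*
 * Sketch of the linear removal flow: zvols were already selected and
 * flagged with ZVOL_REMOVING in step one and collected on to_remove.
 */
static void
zvol_remove_minors_sketch(list_t *to_remove)
{
	zvol_state_t *zv;

	/* Step two: wait for each zvol to shed activity, then unpublish it. */
	for (zv = list_head(to_remove); zv != NULL;
	    zv = list_next(to_remove, zv)) {
		mutex_enter(&zv->zv_state_lock);
		while (zv->zv_open_count > 0)
			cv_wait(&zv->zv_removing_cv, &zv->zv_state_lock);
		mutex_exit(&zv->zv_state_lock);

		/* Detach the OS-side object; late opens now see NULL. */
		zvol_os_remove_minor(zv);
	}

	/* Step three: with the OS side gone, free the zvol_state_t. */
	while ((zv = list_remove_head(to_remove)) != NULL)
		zvol_os_free(zv);
}
```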

How Has This Been Tested?

Imagine I wrote this a week ago. I would have said:

I include a test that aggressively creates and destroys a few hundred zvols in parallel.

It’s been tested against Linux 4.19, 5.10, 6.1, 6.6, 6.9 and 6.12, and FreeBSD 14.3.

On Linux 6.0+, it will break within a few seconds, either with a GPF dereferencing the pointer it got from disk->private_data, or by hanging on acquiring locks for first_open/last_close from calls back into the block device for partition probes or udev activity.

I was in the middle of preparing this patch set (as evidenced by the existence of #17596) when 0b6fd02 (via #17575) landed. And since then, I cannot make this break the same way.

As best I can tell, it's simply that creating zvols now fails faster, and we detect errors rather than pushing on, so there are vastly fewer async operations in flight, and so the remove process is way less likely to need to bounce work out to an async taskq.

Whether that means we're now managing ordering much better, I'm not sure. But still, I feel pretty confident that these changes are still good: the remove process was too complicated (#16364) and the OS side really has no business taking zvol_state_lock, which is only supposed to protect the global name lookup lists. And just generally, the code is a lot easier to follow.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

robn (Member Author) commented Aug 12, 2025

As you can see from the Ubuntu 24 failure, even when you're trying really hard, it's really easy to re-enter something with the wrong locks held.

I hadn't seen the kernel call back into the blockdev to get the event mask during shutdown. The fix will be to drop zl_state_lock earlier. I'll do that tomorrow; too late in the day now.

behlendorf added the Status: Code Review Needed (Ready for review and testing) label Aug 12, 2025
robn force-pushed the zvol-removal-locking branch 2 times, most recently from 6ca5343 to 20a17e2 on August 13, 2025 21:37
robn (Member Author) commented Aug 13, 2025

First failure was me dropping a lock too late, fixed yesterday.

The second failure seems unrelated; it occurred in a completely different part of the code, some 40 minutes after the last test involving zvols.

However, if it's not this PR at fault, then I find this stack extremely troubling:

[ 3892.893341] ZTS run /usr/share/zfs/zfs-tests/tests/functional/cp_files/cp_stress
[ 3929.314940] VERIFY3B(node->next == ((void *) 0x100 + (0xdead000000000000UL)), ==, node->prev == ((void *) 0x122 + (0xdead000000000000UL))) failed (0 == 1)
[ 3929.315128] PANIC at list.h:188:list_link_active()
[ 3929.315327] Showing stack for process 532055
[ 3929.316203] CPU: 0 UID: 0 PID: 532055 Comm: seekflood Tainted: P           OE     -------  ---  6.12.0-55.24.1.el10_0.x86_64 #1
[ 3929.316400] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 3929.316533] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 3929.317023] Call Trace:
[ 3929.317147]  <TASK>
[ 3929.317275]  dump_stack_lvl+0x4e/0x70
[ 3929.317837]  spl_panic+0xf4/0x10b [spl]
[ 3929.319757]  ? dnode_hold_impl+0x8eb/0x1080 [zfs]
[ 3929.332342]  list_link_active+0x69/0x70 [zfs]
[ 3929.332698]  dnode_is_dirty+0x62/0x190 [zfs]
[ 3929.333018]  dmu_offset_next+0xc5/0x260 [zfs]
[ 3929.333327]  zfs_holey_common+0xa1/0x190 [zfs]
[ 3929.333644]  zfs_holey+0x51/0x80 [zfs]
[ 3929.333922]  zpl_llseek+0x89/0xd0 [zfs]
[ 3929.334191]  ksys_lseek+0x61/0xb0
[ 3929.334888]  do_syscall_64+0x7d/0x160

Rebased and pushed again, we'll see what shakes out this time.

behlendorf self-requested a review August 13, 2025 22:01
behlendorf (Contributor) commented:

@fuporovvStack since you've been recently working in this area, could you also take a look at this PR?

@robn I don't recall seeing that failure recently. I don't see how this PR could have caused it, and that's concerning.

amotin (Member) commented Aug 14, 2025

I haven't looked closely into this area for a while, and it may not necessarily be required for this PR, but since you are refactoring it, I would like to voice my long-time pain: FreeBSD, unlike Linux, does allow forced destruction of zvols, both in GEOM and DEV modes. Instead of setting zso_dying and then waiting for the last close, which may never happen, on FreeBSD ZVOLs should first request provider/dev destruction and then wait for the close, which should happen as soon as the last active request completes.

behlendorf (Contributor) left a comment

LGTM. It's nice to see the refactoring here to simplify this.

robn (Member Author) commented Aug 14, 2025

> I don't recall seeing that failure recently. I don't see how this PR could have caused it, and that's concerning.

@behlendorf yeah, I still think it's unrelated. I'm going to try putting seekflood on a long run on my test rig over the weekend and see if I can smoke something out. There will be a separate issue or PR if I do.

robn (Member Author) commented Aug 14, 2025

@amotin this is great intel, thank you. I have another round of zvol rework coming sometime soon and I think it slots nicely in there. (Also I should just generally learn more about geom, because the couple of times I've been near it I've thought that it seemed quite sensible really).

fuporovvStack (Contributor) left a comment

Ok, you made minors removal synchronous, but other minor operations still go through system_taskq and spa_zvol_taskq.
To be honest, I can't decide for myself which way is preferred, but I think it would be better to keep it consistent: either make all minor operations synchronous or leave them asynchronous. Or, at least, if the removal operations are synchronous, revert the creation operations to synchronous mode too.


/* Remove it from the name lookup lists */
rw_enter(&zvol_state_lock, RW_WRITER);
zvol_remove(zv);
fuporovvStack (Contributor) commented:

zvol_remove() is the last step of minor removal. Possibly we could do it a little earlier, near the zv->zv_flags |= ZVOL_REMOVING; update.

robn (Member Author) replied:

I've made it the last thing because it's the only thing that keeps the volume name "in use" on zvol_state_list; without it, it's (theoretically) possible for a dataset create to try to create a new zvol state with the same name while the old one is being torn down.

I think we can sidestep all this in the future by doing what the ZPL does and storing the zvol_state_t as the objset user data (dmu_objset_set_user()) and then just letting the DSL manage the lifetime of the name, but it's more change than I want for this PR.
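
For context, a hypothetical sketch of what that could look like; dmu_objset_set_user()/dmu_objset_get_user() are the existing DMU hooks the ZPL uses for its zfsvfs_t, and the wrapper names here are illustrative only:

```c
/*
 * Hypothetical: register the zvol_state_t as the objset user data when
 * the minor is created, the way the ZPL attaches a zfsvfs_t. The DSL
 * then keeps the name alive for as long as a user is registered, so the
 * zvol_state_list entry no longer has to carry that responsibility.
 */
static void
zvol_attach_objset_user(objset_t *os, zvol_state_t *zv)
{
	dmu_objset_set_user(os, zv);
}

static zvol_state_t *
zvol_from_objset(objset_t *os)
{
	return (dmu_objset_get_user(os));
}
```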

robn added 5 commits August 17, 2025 10:14
Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Signed-off-by: Rob Norris <[email protected]>
zvol_remove_minor_impl() and zvol_remove_minors_impl() should be
identical except for how they select zvols to remove, so let's just use
the same function with a flag to indicate if we should include children
and snapshots or not.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Signed-off-by: Rob Norris <[email protected]>
When destroying a zvol, it is not "unpublished" from the system (that
is, /dev/zd* node removed) until zvol_os_free(). Under Linux, at the
time del_gendisk() and put_disk() are called, the device node may still
have an active hold, from a userspace program or something inside the
kernel (a partition probe). As it is currently, this can lead to calls
to zvol_open() or zvol_release() while the zvol_state_t is partially or
fully freed. zvol_open() has some protection against this by checking
that private_data is NULL, but zvol_release does not.

This implements a better ordering for all of this by adding a new
OS-side method, zvol_os_remove_minor(), which is responsible for fully
decoupling the "private" (OS-side) objects from the zvol_state_t. For
Linux, that means calling put_disk(), nulling private_data, and freeing
zv_zso.

This takes the place of zvol_os_clear_private(), which was a nod in that
direction but did not do enough, and did not do it early enough.

Equivalent changes are made on the FreeBSD side to follow the API
change.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Signed-off-by: Rob Norris <[email protected]>
zvol_state_lock is intended to protect access to the global name->zvol
lists (zvol_find_by_name()), but has also been used to control access to
OS-side private data, accessed through whatever kernel object is used to
represent the volume (gendisk, geom, etc).

This appears to have been necessary to some degree because the OS-side
object is what's used to get a handle on zvol_state_t, so zv_state_lock
and zv_suspend_lock can't be used to manage access, but also, with the
private object and the zvol_state_t being shutdown and destroyed at the
same time in zvol_os_free(), we must ensure that the private object
pointer only ever corresponds to a real zvol_state_t, not one in partial
destruction. Taking the global lock seems like a convenient way to
ensure this.

The problem with this is that zvol_state_lock does not actually protect
access to the zvol_state_t internals, so we need to take zv_state_lock
and/or zv_suspend_lock. If those are contended, this can then cause
OS-side operations (eg zvol_open()) to sleep waiting for them while holding
zvol_state_lock. This then blocks out all other OS-side operations which
want to get the private data, and any ZFS-side control operations that
would take the write half of the lock. It's even worse if ZFS-side
operations induce OS-side calls back into the zvol (eg creating a zvol
triggers a partition probe inside the kernel, and also a userspace
access from udev to set up device links). And it gets worse again
if anything decides to defer those ops to a task and wait on them, which
zvol_remove_minors_impl() will do under high load.

However, since the previous commit, we have a guarantee that the private
data pointer will always be NULL'd out in zvol_os_remove_minor()
_before_ the zvol_state_t is made invalid, but it won't happen until all
users are ejected. So, if we make access to the private object pointer
atomic, we remove the need to take a global lock to access it, and so
we can remove all acquisitions of zvol_state_lock from the OS side.

While here, I've rewritten much of the locking theory comment at the top
of zvol.c. It wasn't wrong, but it hadn't been followed exactly, so I've
tried to describe the purpose of each lock in a little more detail, and
in particular describe where it should and shouldn't be used.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Signed-off-by: Rob Norris <[email protected]>
Since both ZFS- and OS-sides of a zvol now take care of their own
locking and don't get in each other's way, there's no need for the very
complicated removal code to fall back to async tasks if the locks needed
at each stage can't be obtained right now.

Here we change it to be a linear three-step process: select zvols of
interest and flag them for removal, then wait for them to shed activity
and then remove them, and finally, free them.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Signed-off-by: Rob Norris <[email protected]>
robn (Member Author) commented Aug 17, 2025

> Ok, you made minors removal synchronous, but other minor operations still go through system_taskq and spa_zvol_taskq.

At this level I'm more in favour of sync than async; upper layers can put it on a taskq or whatever if they want to background something. This allows each upper subsystem to manage its lifetime properly.

That isn't currently the case, which is why eg spa_export_common() has to special-case queueing up removing everything and waiting on spa_zvol_taskq, or even worse, zvol_fini_impl() doing it, because removal releases the dataset ownership hold, closes the ZIL, etc. It's a ball of spaghetti that really gets confused with a lot of high-frequency transitions.

But even if not, the "mixed" async that was in the remove path where it would punt things off to a task if it couldn't get a lock just led to a chaos of inversions and deadlocks. So regardless of where any async is done, each specific operation should remain a single unit.

As best I can tell, "be like ZPL" is gonna take care of 99% of all these zvol quirks, so that seems like an uncontroversial path forward.

robn force-pushed the zvol-removal-locking branch from 20a17e2 to 3a41b4d on August 17, 2025 00:27
behlendorf added the Status: Accepted (Ready to integrate: reviewed, tested) label and removed the Status: Code Review Needed (Ready for review and testing) label Aug 19, 2025
behlendorf (Contributor) commented:

As best I can tell, "be like ZPL" is gonna take care of 99% of all these zvol quirks

Yup, moving in this direction to align the zvol layer more with the zpl would be a nice way to go.

behlendorf pushed a commit that referenced this pull request Aug 19, 2025
zvol_remove_minor_impl() and zvol_remove_minors_impl() should be
identical except for how they select zvols to remove, so let's just use
the same function with a flag to indicate if we should include children
and snapshots or not.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Fedor Uporov <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #17625
behlendorf pushed a commit that referenced this pull request Aug 19, 2025
When destroying a zvol, it is not "unpublished" from the system (that
is, /dev/zd* node removed) until zvol_os_free(). Under Linux, at the
time del_gendisk() and put_disk() are called, the device node may still
have an active hold, from a userspace program or something inside the
kernel (a partition probe). As it is currently, this can lead to calls
to zvol_open() or zvol_release() while the zvol_state_t is partially or
fully freed. zvol_open() has some protection against this by checking
that private_data is NULL, but zvol_release does not.

This implements a better ordering for all of this by adding a new
OS-side method, zvol_os_remove_minor(), which is responsible for fully
decoupling the "private" (OS-side) objects from the zvol_state_t. For
Linux, that means calling put_disk(), nulling private_data, and freeing
zv_zso.

This takes the place of zvol_os_clear_private(), which was a nod in that
direction but did not do enough, and did not do it early enough.

Equivalent changes are made on the FreeBSD side to follow the API
change.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Fedor Uporov <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #17625
behlendorf pushed a commit that referenced this pull request Aug 19, 2025
zvol_state_lock is intended to protect access to the global name->zvol
lists (zvol_find_by_name()), but has also been used to control access to
OS-side private data, accessed through whatever kernel object is used to
represent the volume (gendisk, geom, etc).

This appears to have been necessary to some degree because the OS-side
object is what's used to get a handle on zvol_state_t, so zv_state_lock
and zv_suspend_lock can't be used to manage access, but also, with the
private object and the zvol_state_t being shutdown and destroyed at the
same time in zvol_os_free(), we must ensure that the private object
pointer only ever corresponds to a real zvol_state_t, not one in partial
destruction. Taking the global lock seems like a convenient way to
ensure this.

The problem with this is that zvol_state_lock does not actually protect
access to the zvol_state_t internals, so we need to take zv_state_lock
and/or zv_suspend_lock. If those are contended, this can then cause
OS-side operations (eg zvol_open()) to sleep waiting for them while holding
zvol_state_lock. This then blocks out all other OS-side operations which
want to get the private data, and any ZFS-side control operations that
would take the write half of the lock. It's even worse if ZFS-side
operations induce OS-side calls back into the zvol (eg creating a zvol
triggers a partition probe inside the kernel, and also a userspace
access from udev to set up device links). And it gets worse again
if anything decides to defer those ops to a task and wait on them, which
zvol_remove_minors_impl() will do under high load.

However, since the previous commit, we have a guarantee that the private
data pointer will always be NULL'd out in zvol_os_remove_minor()
_before_ the zvol_state_t is made invalid, but it won't happen until all
users are ejected. So, if we make access to the private object pointer
atomic, we remove the need to take a global lock to access it, and so
we can remove all acquisitions of zvol_state_lock from the OS side.

While here, I've rewritten much of the locking theory comment at the top
of zvol.c. It wasn't wrong, but it hadn't been followed exactly, so I've
tried to describe the purpose of each lock in a little more detail, and
in particular describe where it should and shouldn't be used.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Fedor Uporov <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #17625
behlendorf pushed a commit that referenced this pull request Aug 19, 2025
Since both ZFS- and OS-sides of a zvol now take care of their own
locking and don't get in each other's way, there's no need for the very
complicated removal code to fall back to async tasks if the locks needed
at each stage can't be obtained right now.

Here we change it to be a linear three-step process: select zvols of
interest and flag them for removal, then wait for them to shed activity
and then remove them, and finally, free them.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Fedor Uporov <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #17625