
zvol: cleanup & fixup zvol destruction sequence and locking #17625


Closed
wants to merge 5 commits

Conversation

robn (Member) commented Aug 12, 2025

[Sponsors: Klara, Inc., Railway Corporation]

Motivation and Context

We have a customer whose workload involves rapid creation and destruction of zvols. This could easily result in panics or deadlocks, which ultimately are caused by incorrect or inadequate locking through the zvol lifecycle.

Description

There are three main issues resolved here which all contribute to the overall problem.

OS-side object not destroyed before freeing the zvol_state_t

When destroying a zvol, it is not "unpublished" from the system (that is, /dev/zd* node removed) until zvol_os_free(). Under Linux, at the time del_gendisk() and put_disk() are called, the device node may still have an active hold, from a userspace program or something inside the kernel (a partition probe). As it stands, this can lead to calls to zvol_open() or zvol_release() while the zvol_state_t is partially or fully freed. zvol_open() has some protection against this by checking that private_data is NULL, but zvol_release() does not.

This implements a better ordering for all of this by adding a new OS-side method, zvol_os_remove_minor(), which is responsible for fully decoupling the "private" (OS-side) objects from the zvol_state_t. For Linux, that means calling put_disk(), nulling private_data, and freeing zv_zso.

This takes the place of zvol_os_clear_private(), which was a nod in that direction but did not do enough, and did not do it early enough.
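
For illustration only, here is a minimal sketch of the ordering described above, using the names from this PR (zvol_os_remove_minor(), zv_zso, private_data); the struct layout and the exact call sequence are assumptions, not the actual patch:

```c
/*
 * Sketch: unpublish the device node and detach the OS-side state from
 * the zvol_state_t before the zvol_state_t itself can be freed, so a
 * straggling zvol_open()/zvol_release() sees NULL instead of freed memory.
 */
static void
zvol_os_remove_minor(zvol_state_t *zv)
{
	struct zvol_state_os *zso = zv->zv_zso;	/* hypothetical layout */

	/* Remove the /dev/zd* node; existing holds drain through release. */
	del_gendisk(zso->zvo_disk);

	/* Decouple the OS-side object: late callers now see NULL and bail. */
	zso->zvo_disk->private_data = NULL;

	/* Drop the gendisk and free the OS-side state. */
	put_disk(zso->zvo_disk);
	kmem_free(zso, sizeof (*zso));
	zv->zv_zso = NULL;
}
```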

zvol_state_lock used to protect OS-side private data

zvol_state_lock is intended to protect access to the global name->zvol lists (zvol_find_by_name()), but has also been used to control access to OS-side private data, accessed through whatever kernel object is used to represent the volume (gendisk, geom, etc).

This appears to have been necessary to some degree because the OS-side object is what's used to get a handle on zvol_state_t, so zv_state_lock and zv_suspend_lock can't be used to manage access, but also, with the private object and the zvol_state_t being shut down and destroyed at the same time in zvol_os_free(), we must ensure that the private object pointer only ever corresponds to a real zvol_state_t, not one in partial destruction. Taking the global lock seems like a convenient way to ensure this.

The problem with this is that zvol_state_lock does not actually protect access to the zvol_state_t internals, so we need to take zv_state_lock and/or zv_suspend_lock. If those are contended, this can then cause OS-side operations (eg zvol_open()) to sleep waiting for them while holding zvol_state_lock. This then blocks out all other OS-side operations which want to get the private data, and any ZFS-side control operations that would take the write half of the lock. It's even worse if ZFS-side operations induce OS-side calls back into the zvol (eg creating a zvol triggers a partition probe inside the kernel, and also a userspace access from udev to set up device links). And it gets worse again if anything decides to defer those ops to a task and wait on them, which zvol_remove_minors_impl() will do under high load.

However, since the previous commit, we have a guarantee that the private data pointer will always be NULL'd out in zvol_os_remove_minor() before the zvol_state_t is made invalid, but it won't happen until all users are ejected. So, if we make access to the private object pointer atomic, we remove the need to take a global lock to access it, and so we can remove all acquisitions of zvol_state_lock from the OS side.
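
As a rough illustration of that idea (not the actual change), the OS side can then resolve the private pointer with a single atomic read and fall back to per-zvol locks only, never zvol_state_lock; zvol_state_from_disk() and zvol_open_sketch() here are hypothetical names:

```c
/*
 * Sketch: zvol_os_remove_minor() stores NULL into private_data before the
 * zvol_state_t becomes invalid, so a plain atomic read is enough to know
 * whether the minor is still live. No global lock is needed.
 */
static zvol_state_t *
zvol_state_from_disk(struct gendisk *disk)
{
	/* Paired with the NULL store in zvol_os_remove_minor(). */
	return (READ_ONCE(disk->private_data));
}

static int
zvol_open_sketch(struct gendisk *disk)
{
	zvol_state_t *zv = zvol_state_from_disk(disk);

	if (zv == NULL)
		return (SET_ERROR(ENXIO));	/* minor already removed */

	/* Only per-zvol locks from here; zvol_state_lock is never taken. */
	mutex_enter(&zv->zv_state_lock);
	zv->zv_open_count++;
	mutex_exit(&zv->zv_state_lock);

	return (0);
}
```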

Use of async fallbacks in response to failed lock acquisitions

Since both ZFS- and OS-sides of a zvol now take care of their own locking and don't get in each other's way, there's no need for the very complicated removal code to fall back to async tasks if the locks needed at each stage can't be obtained right now.

Here we change it to be a linear three-step process: select zvols of interest and flag them for removal, then wait for them to shed activity and then remove them, and finally, free them.
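
Roughly, the shape of that flow looks like the sketch below (names such as zv_removing_cv and the list handling are illustrative, not a copy of the code):

```c
/*
 * Sketch of the linear removal flow: zvols were already selected and
 * flagged with ZVOL_REMOVING in step one and collected on to_remove.
 */
static void
zvol_remove_minors_sketch(list_t *to_remove)
{
	zvol_state_t *zv;

	/* Step two: wait for each zvol to shed activity, then unpublish it. */
	for (zv = list_head(to_remove); zv != NULL;
	    zv = list_next(to_remove, zv)) {
		mutex_enter(&zv->zv_state_lock);
		while (zv->zv_open_count > 0)
			cv_wait(&zv->zv_removing_cv, &zv->zv_state_lock);
		mutex_exit(&zv->zv_state_lock);

		/* Detach the OS-side object; late opens now see NULL. */
		zvol_os_remove_minor(zv);
	}

	/* Step three: with the OS side gone, free the zvol_state_t. */
	while ((zv = list_remove_head(to_remove)) != NULL)
		zvol_os_free(zv);
}
```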

How Has This Been Tested?

Imagine I wrote this a week ago. I would have said:

I include a test that aggressively creates and destroys a few hundred zvols in parallel.

It’s been tested against Linux 4.19, 5.10, 6.1, 6.6, 6.9 and 6.12, and FreeBSD 14.3.

On Linux 6.0+, it will break within a few seconds, either with a GPF dereferencing the pointer it got from disk->private_data, or by hanging on acquiring locks for first_open/last_close from calls back into the block device for partition probes or udev activity.

I was in the middle of preparing this patch set (as evidenced by the existence of #17596) when 0b6fd02 (via #17575) landed. And since then, I cannot make this break the same way.

As best I can tell, it's simply that creating zvols now fails faster, and we detect errors rather than pushing on, so there are vastly fewer async operations in flight, and so the remove process is way less likely to need to bounce work out to an async taskq.

Whether that means we're now managing ordering much better, I'm not sure. But still, I feel pretty confident that these changes are still good: the remove process was too complicated (#16364) and the OS side really has no business taking zvol_state_lock, which is only supposed to protect the global name lookup lists. And just generally, the code is a lot easier to follow.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

robn (Member Author) commented Aug 12, 2025

As you can see from the Ubuntu 24 failure, even when you're trying really hard, it's really easy to re-enter something with the wrong locks held.

I hadn't seen the kernel call back into the blockdev to get the event mask during shutdown. The fix will be to drop zl_state_lock earlier. I'll do that tomorrow; too late in the day now.

behlendorf added the Status: Code Review Needed (Ready for review and testing) label Aug 12, 2025
robn force-pushed the zvol-removal-locking branch 2 times, most recently from 6ca5343 to 20a17e2 on August 13, 2025 21:37
robn (Member Author) commented Aug 13, 2025

First failure was me dropping a lock too late, fixed yesterday.

The second failure seems unrelated; it occurred in a completely different part of the code, some 40 minutes after the last test involving zvols.

However, if it's not this PR at fault, then I find this stack extremely troubling:

[ 3892.893341] ZTS run /usr/share/zfs/zfs-tests/tests/functional/cp_files/cp_stress
[ 3929.314940] VERIFY3B(node->next == ((void *) 0x100 + (0xdead000000000000UL)), ==, node->prev == ((void *) 0x122 + (0xdead000000000000UL))) failed (0 == 1)
[ 3929.315128] PANIC at list.h:188:list_link_active()
[ 3929.315327] Showing stack for process 532055
[ 3929.316203] CPU: 0 UID: 0 PID: 532055 Comm: seekflood Tainted: P           OE     -------  ---  6.12.0-55.24.1.el10_0.x86_64 #1
[ 3929.316400] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 3929.316533] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 3929.317023] Call Trace:
[ 3929.317147]  <TASK>
[ 3929.317275]  dump_stack_lvl+0x4e/0x70
[ 3929.317837]  spl_panic+0xf4/0x10b [spl]
[ 3929.319757]  ? dnode_hold_impl+0x8eb/0x1080 [zfs]
[ 3929.332342]  list_link_active+0x69/0x70 [zfs]
[ 3929.332698]  dnode_is_dirty+0x62/0x190 [zfs]
[ 3929.333018]  dmu_offset_next+0xc5/0x260 [zfs]
[ 3929.333327]  zfs_holey_common+0xa1/0x190 [zfs]
[ 3929.333644]  zfs_holey+0x51/0x80 [zfs]
[ 3929.333922]  zpl_llseek+0x89/0xd0 [zfs]
[ 3929.334191]  ksys_lseek+0x61/0xb0
[ 3929.334888]  do_syscall_64+0x7d/0x160

Rebased and pushed again, we'll see what shakes out this time.

behlendorf self-requested a review August 13, 2025 22:01
behlendorf (Contributor) commented:

@fuporovvStack since you've been recently working in this area, could you also take a look at this PR?

@robn I don't recall seeing that failure recently. I don't see how this PR could have caused it, and that's concerning.

amotin (Member) commented Aug 14, 2025

I haven't looked closely into this area for a while, and it may not necessarily be required for this PR, but since you are refactoring it, I would like to voice my long-time pain: FreeBSD, unlike Linux, does allow forced destruction of zvols, both in GEOM and DEV modes. Instead of setting zso_dying and then waiting for the last close, which may never happen, on FreeBSD ZVOLs should first request provider/dev destruction and then wait for the close, which should happen as soon as the last active request completes.

behlendorf (Contributor) left a comment

LGTM. It's nice to see the refactoring here to simplify this.

robn (Member Author) commented Aug 14, 2025

> I don't recall seeing that failure recently. I don't see how this PR could have caused it, and that's concerning.

@behlendorf yeah, I still think it's unrelated. I'm going to try putting seekflood on a long run on my test rig over the weekend and see if I can smoke something out. There will be a separate issue or PR if I do.

robn (Member Author) commented Aug 14, 2025

@amotin this is great intel, thank you. I have another round of zvol rework coming sometime soon and I think it slots nicely in there. (Also I should just generally learn more about geom, because the couple of times I've been near it I've thought that it seemed quite sensible really).

fuporovvStack (Contributor) left a comment

Ok, you made minors removal synchronous, but other minor operations still go through system_taskq and spa_zvol_taskq.
To be honest, I can't decide for myself which way is preferred, but I think it would be better to keep it consistent: either make all minor operations synchronous or leave them asynchronous. Or, at least, if the removal operations are synchronous, revert the creation operations to synchronous mode too.


/* Remove it from the name lookup lists */
rw_enter(&zvol_state_lock, RW_WRITER);
zvol_remove(zv);
fuporovvStack (Contributor) commented:

zvol_remove() is the last step of minor removal. Possibly we could do it a little earlier, near the zv->zv_flags |= ZVOL_REMOVING; update.

robn (Member Author) replied:

I've made it the last thing because it's the only thing that keeps the volume name "in use" on zvol_state_list; without it, it's (theoretically) possible for a dataset create to try to create a new zvol state with the same name while the old one is being torn down.

I think we can sidestep all this in the future by doing what the ZPL does and storing the zvol_state_t as the objset user data (dmu_objset_set_user()) and then just letting the DSL manage the lifetime of the name, but it's more change than I want for this PR.
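
For context, a hypothetical sketch of what that could look like; dmu_objset_set_user()/dmu_objset_get_user() are the existing DMU hooks the ZPL uses for its zfsvfs_t, and the wrapper names here are illustrative only:

```c
/*
 * Hypothetical: register the zvol_state_t as the objset user data when
 * the minor is created, the way the ZPL attaches a zfsvfs_t. The DSL
 * then keeps the name alive for as long as a user is registered, so the
 * zvol_state_list entry no longer has to carry that responsibility.
 */
static void
zvol_attach_objset_user(objset_t *os, zvol_state_t *zv)
{
	dmu_objset_set_user(os, zv);
}

static zvol_state_t *
zvol_from_objset(objset_t *os)
{
	return (dmu_objset_get_user(os));
}
```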

robn added 5 commits August 17, 2025 10:14
Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Signed-off-by: Rob Norris <[email protected]>
zvol_remove_minor_impl() and zvol_remove_minors_impl() should be
identical except for how they select zvols to remove, so let's just use
the same function with a flag to indicate if we should include children
and snapshots or not.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Signed-off-by: Rob Norris <[email protected]>
When destroying a zvol, it is not "unpublished" from the system (that
is, /dev/zd* node removed) until zvol_os_free(). Under Linux, at the
time del_gendisk() and put_disk() are called, the device node may still
have an active hold, from a userspace program or something inside the
kernel (a partition probe). As it is currently, this can lead to calls
to zvol_open() or zvol_release() while the zvol_state_t is partially or
fully freed. zvol_open() has some protection against this by checking
that private_data is NULL, but zvol_release does not.

This implements a better ordering for all of this by adding a new
OS-side method, zvol_os_remove_minor(), which is responsible for fully
decoupling the "private" (OS-side) objects from the zvol_state_t. For
Linux, that means calling put_disk(), nulling private_data, and freeing
zv_zso.

This takes the place of zvol_os_clear_private(), which was a nod in that
direction but did not do enough, and did not do it early enough.

Equivalent changes are made on the FreeBSD side to follow the API
change.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Signed-off-by: Rob Norris <[email protected]>
zvol_state_lock is intended to protect access to the global name->zvol
lists (zvol_find_by_name()), but has also been used to control access to
OS-side private data, accessed through whatever kernel object is used to
represent the volume (gendisk, geom, etc).

This appears to have been necessary to some degree because the OS-side
object is what's used to get a handle on zvol_state_t, so zv_state_lock
and zv_suspend_lock can't be used to manage access, but also, with the
private object and the zvol_state_t being shutdown and destroyed at the
same time in zvol_os_free(), we must ensure that the private object
pointer only ever corresponds to a real zvol_state_t, not one in partial
destruction. Taking the global lock seems like a convenient way to
ensure this.

The problem with this is that zvol_state_lock does not actually protect
access to the zvol_state_t internals, so we need to take zv_state_lock
and/or zv_suspend_lock. If those are contended, this can then cause
OS-side operations (eg zvol_open()) to sleep waiting for them while holding
zvol_state_lock. This then blocks out all other OS-side operations which
want to get the private data, and any ZFS-side control operations that
would take the write half of the lock. It's even worse if ZFS-side
operations induce OS-side calls back into the zvol (eg creating a zvol
triggers a partition probe inside the kernel, and also a userspace
access from udev to set up device links). And it gets worse again
if anything decides to defer those ops to a task and wait on them, which
zvol_remove_minors_impl() will do under high load.

However, since the previous commit, we have a guarantee that the private
data pointer will always be NULL'd out in zvol_os_remove_minor()
_before_ the zvol_state_t is made invalid, but it won't happen until all
users are ejected. So, if we make access to the private object pointer
atomic, we remove the need to take a global lock to access it, and so
we can remove all acquisitions of zvol_state_lock from the OS side.

While here, I've rewritten much of the locking theory comment at the top
of zvol.c. It wasn't wrong, but it hadn't been followed exactly, so I've
tried to describe the purpose of each lock in a little more detail, and
in particular describe where it should and shouldn't be used.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Signed-off-by: Rob Norris <[email protected]>
Since both ZFS- and OS-sides of a zvol now take care of their own
locking and don't get in each other's way, there's no need for the very
complicated removal code to fall back to async tasks if the locks needed
at each stage can't be obtained right now.

Here we change it to be a linear three-step process: select zvols of
interest and flag them for removal, then wait for them to shed activity
and then remove them, and finally, free them.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Signed-off-by: Rob Norris <[email protected]>
robn (Member Author) commented Aug 17, 2025

> Ok, you made minors removal synchronous, but other minor operations still go through system_taskq and spa_zvol_taskq.

At this level I'm more in favour of sync than async; upper layers can put it on a taskq or whatever if they want to background something. This allows each upper subsystem to manage its lifetime properly.

That isn't currently the case, which is why eg spa_export_common() has to special-case queueing up removing everything and waiting on spa_zvol_taskq, or even worse, zvol_fini_impl() doing it, because removal releases the dataset ownership hold, closes the ZIL, etc. It's a ball of spaghetti that really gets confused with a lot of high-frequency transitions.

But even if not, the "mixed" async that was in the remove path where it would punt things off to a task if it couldn't get a lock just led to a chaos of inversions and deadlocks. So regardless of where any async is done, each specific operation should remain a single unit.

As best I can tell, "be like ZPL" is gonna take care of 99% of all these zvol quirks, so that seems like an uncontroversial path forward.

robn force-pushed the zvol-removal-locking branch from 20a17e2 to 3a41b4d on August 17, 2025 00:27
behlendorf added the Status: Accepted (Ready to integrate: reviewed, tested) label and removed the Status: Code Review Needed (Ready for review and testing) label Aug 19, 2025
behlendorf (Contributor) commented:

As best I can tell, "be like ZPL" is gonna take care of 99% of all these zvol quirks

Yup, moving in this direction to align the zvol layer more with the zpl would be a nice way to go.

behlendorf pushed a commit that referenced this pull request Aug 19, 2025
zvol_remove_minor_impl() and zvol_remove_minors_impl() should be
identical except for how they select zvols to remove, so let's just use
the same function with a flag to indicate if we should include children
and snapshots or not.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Fedor Uporov <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #17625
behlendorf pushed a commit that referenced this pull request Aug 19, 2025
When destroying a zvol, it is not "unpublished" from the system (that
is, /dev/zd* node removed) until zvol_os_free(). Under Linux, at the
time del_gendisk() and put_disk() are called, the device node may still
have an active hold, from a userspace program or something inside the
kernel (a partition probe). As it is currently, this can lead to calls
to zvol_open() or zvol_release() while the zvol_state_t is partially or
fully freed. zvol_open() has some protection against this by checking
that private_data is NULL, but zvol_release does not.

This implements a better ordering for all of this by adding a new
OS-side method, zvol_os_remove_minor(), which is responsible for fully
decoupling the "private" (OS-side) objects from the zvol_state_t. For
Linux, that means calling put_disk(), nulling private_data, and freeing
zv_zso.

This takes the place of zvol_os_clear_private(), which was a nod in that
direction but did not do enough, and did not do it early enough.

Equivalent changes are made on the FreeBSD side to follow the API
change.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Fedor Uporov <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #17625
behlendorf pushed a commit that referenced this pull request Aug 19, 2025
zvol_state_lock is intended to protect access to the global name->zvol
lists (zvol_find_by_name()), but has also been used to control access to
OS-side private data, accessed through whatever kernel object is used to
represent the volume (gendisk, geom, etc).

This appears to have been necessary to some degree because the OS-side
object is what's used to get a handle on zvol_state_t, so zv_state_lock
and zv_suspend_lock can't be used to manage access, but also, with the
private object and the zvol_state_t being shutdown and destroyed at the
same time in zvol_os_free(), we must ensure that the private object
pointer only ever corresponds to a real zvol_state_t, not one in partial
destruction. Taking the global lock seems like a convenient way to
ensure this.

The problem with this is that zvol_state_lock does not actually protect
access to the zvol_state_t internals, so we need to take zv_state_lock
and/or zv_suspend_lock. If those are contended, this can then cause
OS-side operations (eg zvol_open()) to sleep waiting for them while holding
zvol_state_lock. This then blocks out all other OS-side operations which
want to get the private data, and any ZFS-side control operations that
would take the write half of the lock. It's even worse if ZFS-side
operations induce OS-side calls back into the zvol (eg creating a zvol
triggers a partition probe inside the kernel, and also a userspace
access from udev to set up device links). And it gets worse again
if anything decides to defer those ops to a task and wait on them, which
zvol_remove_minors_impl() will do under high load.

However, since the previous commit, we have a guarantee that the private
data pointer will always be NULL'd out in zvol_os_remove_minor()
_before_ the zvol_state_t is made invalid, but it won't happen until all
users are ejected. So, if we make access to the private object pointer
atomic, we remove the need to take a global lock to access it, and so
we can remove all acquisitions of zvol_state_lock from the OS side.

While here, I've rewritten much of the locking theory comment at the top
of zvol.c. It wasn't wrong, but it hadn't been followed exactly, so I've
tried to describe the purpose of each lock in a little more detail, and
in particular describe where it should and shouldn't be used.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Fedor Uporov <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #17625
behlendorf pushed a commit that referenced this pull request Aug 19, 2025
Since both ZFS- and OS-sides of a zvol now take care of their own
locking and don't get in each other's way, there's no need for the very
complicated removal code to fall back to async tasks if the locks needed
at each stage can't be obtained right now.

Here we change it to be a linear three-step process: select zvols of
interest and flag them for removal, then wait for them to shed activity
and then remove them, and finally, free them.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Fedor Uporov <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #17625