zvol: cleanup & fixup zvol destruction sequence and locking #17625
Conversation
As you see from the Ubuntu 24 failure, even when you're trying really hard it's really easy to reenter something with the wrong locks held. I hadn't seen the kernel call back into the blockdev to get the event mask during shutdown. The fix will be to drop
Force-pushed from 6ca5343 to 20a17e2.
First failure was me dropping a lock too late, fixed yesterday. Second failure seems unrelated; it occurred in a different part of the code, some 40 minutes after the last test involving zvols. However, if it's not this PR at fault, then I find this stack extremely troubling:
Rebased and pushed again, we'll see what shakes out this time.
@fuporovvStack since you've recently been working in this area, could you also take a look at this PR? @robn I don't recall seeing that failure recently. I don't see how this PR could have caused it, and that's concerning.
I haven't looked closely into this area for a while, and it may not necessarily be required for this PR, but since you are refactoring it, I would like to voice my long-time pain: FreeBSD, unlike Linux, does allow forced destruction of zvols, both in GEOM and DEV modes. Instead of setting
LGTM. It's nice to see the refactoring here to simplify this.
@behlendorf yeah, I still think it's unrelated. I'm going to try putting seekflood on a long run on my test rig over the weekend and see if I can smoke something out. There will be a separate issue or PR if I do.
@amotin this is great intel, thank you. I have another round of zvol rework coming sometime soon and I think it slots nicely in there. (Also I should just generally learn more about geom, because the couple of times I've been near it I've thought that it seemed quite sensible really.)
Ok, you made minors removal synchronous, but the other minor operations still go through system_taskq and spa_zvol_taskq.
To be honest, I can't decide which way is preferable, but I think it would be better to keep it consistent: either make all minor operations synchronous or leave them all asynchronous. Or, at the very least, if the removal operations are synchronous, revert the creation operations to synchronous mode too.
tests/zfs-tests/tests/functional/zvol/zvol_stress/zvol_stress_destroy.ksh
/* Remove it from the name lookup lists */
rw_enter(&zvol_state_lock, RW_WRITER);
zvol_remove(zv);
zvol_remove() is the last step of minors removal. Possibly we could do it near the zv->zv_flags |= ZVOL_REMOVING; update, a little bit earlier.
I've made it the last thing because it's the only thing that keeps the volume name "in use" on zvol_state_list; without it, it's (theoretically) possible for a dataset create to try to create a new zvol state with the name while the old one is being torn down.
I think we can sidestep all this in the future by doing what the ZPL does and storing the zvol_state_t as the objset user data (dmu_objset_set_user()) and then just letting the DSL manage the lifetime of the name, but it's more change than I want for this PR.
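To make that race concrete, here is a minimal userland sketch (stand-in names and a plain pthreads rwlock, not the real zvol code) of why the name has to stay on the lookup list until teardown is finished: a concurrent create checks the list first, so unlisting early would let it build a second state for the same name.

```c
/*
 * Userland model only: state_t, create(), destroy() are illustrative
 * stand-ins, not the OpenZFS zvol API.
 */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct state {
	struct state *next;
	char name[64];
	int removing;		/* models ZVOL_REMOVING */
} state_t;

static state_t *list;
static pthread_rwlock_t list_lock = PTHREAD_RWLOCK_INITIALIZER;

static state_t *
find_locked(const char *name)
{
	for (state_t *s = list; s != NULL; s = s->next)
		if (strcmp(s->name, name) == 0)
			return (s);
	return (NULL);
}

/* Create fails while the old entry, even one being torn down, still exists. */
static int
create(const char *name)
{
	pthread_rwlock_wrlock(&list_lock);
	if (find_locked(name) != NULL) {
		pthread_rwlock_unlock(&list_lock);
		return (EEXIST);
	}
	state_t *s = calloc(1, sizeof (*s));
	snprintf(s->name, sizeof (s->name), "%s", name);
	s->next = list;
	list = s;
	pthread_rwlock_unlock(&list_lock);
	return (0);
}

/* Teardown: flag it, do the slow OS-side work, and only then unlist it. */
static void
destroy(state_t *s)
{
	s->removing = 1;
	/* ... wait for users, remove OS-side objects ... */
	pthread_rwlock_wrlock(&list_lock);
	for (state_t **p = &list; *p != NULL; p = &(*p)->next)
		if (*p == s) { *p = s->next; break; }
	pthread_rwlock_unlock(&list_lock);
	free(s);
}

int
main(void)
{
	create("tank/vol");
	pthread_rwlock_rdlock(&list_lock);
	state_t *s = find_locked("tank/vol");
	pthread_rwlock_unlock(&list_lock);
	printf("re-create while present: %d (EEXIST=%d)\n",
	    create("tank/vol"), EEXIST);
	destroy(s);
	printf("re-create after destroy: %d\n", create("tank/vol"));
	return (0);
}
```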
Sponsored-by: Klara, Inc. Sponsored-by: Railway Corporation Signed-off-by: Rob Norris <[email protected]>
zvol_remove_minor_impl() and zvol_remove_minors_impl() should be identical except for how they select zvols to remove, so let's just use the same function with a flag to indicate if we should include children and snapshots or not. Sponsored-by: Klara, Inc. Sponsored-by: Railway Corporation Signed-off-by: Rob Norris <[email protected]>
When destroying a zvol, it is not "unpublished" from the system (that is, /dev/zd* node removed) until zvol_os_free(). Under Linux, at the time del_gendisk() and put_disk() are called, the device node may still have an active hold, from a userspace program or something inside the kernel (a partition probe). As it is currently, this can lead to calls to zvol_open() or zvol_release() while the zvol_state_t is partially or fully freed. zvol_open() has some protection against this by checking that private_data is NULL, but zvol_release does not. This implements a better ordering for all of this by adding a new OS-side method, zvol_os_remove_minor(), which is responsible for fully decoupling the "private" (OS-side) objects from the zvol_state_t. For Linux, that means calling put_disk(), nulling private_data, and freeing zv_zso. This takes the place of zvol_os_clear_private(), which was a nod in that direction but did not do enough, and did not do it early enough. Equivalent changes are made on the FreeBSD side to follow the API change. Sponsored-by: Klara, Inc. Sponsored-by: Railway Corporation Signed-off-by: Rob Norris <[email protected]>
zvol_state_lock is intended to protect access to the global name->zvol lists (zvol_find_by_name()), but has also been used to control access to OS-side private data, accessed through whatever kernel object is used to represent the volume (gendisk, geom, etc). This appears to have been necessary to some degree because the OS-side object is what's used to get a handle on zvol_state_t, so zv_state_lock and zv_suspend_lock can't be used to manage access, but also, with the private object and the zvol_state_t being shut down and destroyed at the same time in zvol_os_free(), we must ensure that the private object pointer only ever corresponds to a real zvol_state_t, not one in partial destruction. Taking the global lock seems like a convenient way to ensure this. The problem with this is that zvol_state_lock does not actually protect access to the zvol_state_t internals, so we need to take zv_state_lock and/or zv_suspend_lock. If those are contended, this can then cause OS-side operations (eg zvol_open()) to sleep waiting for them while holding zvol_state_lock. This then blocks out all other OS-side operations which want to get the private data, and any ZFS-side control operations that would take the write half of the lock. It's even worse if ZFS-side operations induce OS-side calls back into the zvol (eg creating a zvol triggers a partition probe inside the kernel, and also a userspace access from udev to set up device links). And it gets even worse again if anything decides to defer those ops to a task and wait on them, which zvol_remove_minors_impl() will do under high load. However, since the previous commit, we have a guarantee that the private data pointer will always be NULL'd out in zvol_os_remove_minor() _before_ the zvol_state_t is made invalid, but it won't happen until all users are ejected. So, if we make access to the private object pointer atomic, we remove the need to take a global lockout to access it, and so we can remove all acquisitions of zvol_state_lock from the OS side. While here, I've rewritten much of the locking theory comment at the top of zvol.c. It wasn't wrong, but it hadn't been followed exactly, so I've tried to describe the purpose of each lock in a little more detail, and in particular describe where it should and shouldn't be used. Sponsored-by: Klara, Inc. Sponsored-by: Railway Corporation Signed-off-by: Rob Norris <[email protected]>
Since both ZFS- and OS-sides of a zvol now take care of their own locking and don't get in each other's way, there's no need for the very complicated removal code to fall back to async tasks if the locks needed at each stage can't be obtained right now. Here we change it to be a linear three-step process: select zvols of interest and flag them for removal, then wait for them to shed activity and then remove them, and finally, free them. Sponsored-by: Klara, Inc. Sponsored-by: Railway Corporation Signed-off-by: Rob Norris <[email protected]>
At this level I'm more in favour of sync than async; upper layers can put it on a taskq or whatever if they want to background something. This allows each upper subsystem to manage its lifetime properly. That isn't currently the case (which is why, eg, ...). But even if not, the "mixed" async that was in the remove path, where it would punt things off to a task if it couldn't get a lock, just led to a chaos of inversions and deadlocks. So regardless of where any async is done, each specific operation should remain a single unit. As best I can tell, "be like ZPL" is gonna take care of 99% of all these zvol quirks, so that seems like an uncontroversial path forward.
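As a rough illustration of that split, the sketch below (plain pthreads rather than the kernel taskq API; all names are made up) keeps the low-level remove fully synchronous and leaves it to the caller to decide whether to background it.

```c
/*
 * Userland model: remove_minors() does all its own work synchronously;
 * a caller that wants it asynchronous dispatches the same single unit of
 * work to a worker thread.  Not the OpenZFS API.
 */
#include <pthread.h>
#include <stdio.h>

/* Fully synchronous at this layer: flag, wait, remove, free. */
static void
remove_minors(const char *name)
{
	printf("removing minors for %s (synchronously)\n", name);
}

static void *
remove_task(void *arg)
{
	remove_minors(arg);	/* same unit of work, just backgrounded */
	return (NULL);
}

int
main(void)
{
	/* Caller that doesn't care about latency: call it directly. */
	remove_minors("tank/vol0");

	/* Caller that wants it in the background: its choice, not ours. */
	pthread_t t;
	pthread_create(&t, NULL, remove_task, "tank/vol1");
	pthread_join(&t, NULL);
	return (0);
}
```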
Force-pushed from 20a17e2 to 3a41b4d.
Yup, moving in this direction to align the zvol layer more with the ZPL would be a nice way to go.
zvol_remove_minor_impl() and zvol_remove_minors_impl() should be identical except for how they select zvols to remove, so let's just use the same function with a flag to indicate if we should include children and snapshots or not. Sponsored-by: Klara, Inc. Sponsored-by: Railway Corporation Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Fedor Uporov <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #17625
When destroying a zvol, it is not "unpublished" from the system (that is, /dev/zd* node removed) until zvol_os_free(). Under Linux, at the time del_gendisk() and put_disk() are called, the device node may still have an active hold, from a userspace program or something inside the kernel (a partition probe). As it is currently, this can lead to calls to zvol_open() or zvol_release() while the zvol_state_t is partially or fully freed. zvol_open() has some protection against this by checking that private_data is NULL, but zvol_release does not. This implements a better ordering for all of this by adding a new OS-side method, zvol_os_remove_minor(), which is responsible for fully decoupling the "private" (OS-side) objects from the zvol_state_t. For Linux, that means calling put_disk(), nulling private_data, and freeing zv_zso. This takes the place of zvol_os_clear_private(), which was a nod in that direction but did not do enough, and did not do it early enough. Equivalent changes are made on the FreeBSD side to follow the API change. Sponsored-by: Klara, Inc. Sponsored-by: Railway Corporation Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Fedor Uporov <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #17625
zvol_state_lock is intended to protect access to the global name->zvol lists (zvol_find_by_name()), but has also been used to control access to OS-side private data, accessed through whatever kernel object is used to represent the volume (gendisk, geom, etc). This appears to have been necessary to some degree because the OS-side object is what's used to get a handle on zvol_state_t, so zv_state_lock and zv_suspend_lock can't be used to manage access, but also, with the private object and the zvol_state_t being shut down and destroyed at the same time in zvol_os_free(), we must ensure that the private object pointer only ever corresponds to a real zvol_state_t, not one in partial destruction. Taking the global lock seems like a convenient way to ensure this. The problem with this is that zvol_state_lock does not actually protect access to the zvol_state_t internals, so we need to take zv_state_lock and/or zv_suspend_lock. If those are contended, this can then cause OS-side operations (eg zvol_open()) to sleep waiting for them while holding zvol_state_lock. This then blocks out all other OS-side operations which want to get the private data, and any ZFS-side control operations that would take the write half of the lock. It's even worse if ZFS-side operations induce OS-side calls back into the zvol (eg creating a zvol triggers a partition probe inside the kernel, and also a userspace access from udev to set up device links). And it gets even worse again if anything decides to defer those ops to a task and wait on them, which zvol_remove_minors_impl() will do under high load. However, since the previous commit, we have a guarantee that the private data pointer will always be NULL'd out in zvol_os_remove_minor() _before_ the zvol_state_t is made invalid, but it won't happen until all users are ejected. So, if we make access to the private object pointer atomic, we remove the need to take a global lockout to access it, and so we can remove all acquisitions of zvol_state_lock from the OS side. While here, I've rewritten much of the locking theory comment at the top of zvol.c. It wasn't wrong, but it hadn't been followed exactly, so I've tried to describe the purpose of each lock in a little more detail, and in particular describe where it should and shouldn't be used. Sponsored-by: Klara, Inc. Sponsored-by: Railway Corporation Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Fedor Uporov <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #17625
Since both ZFS- and OS-sides of a zvol now take care of their own locking and don't get in each other's way, there's no need for the very complicated removal code to fall back to async tasks if the locks needed at each stage can't be obtained right now. Here we change it to be a linear three-step process: select zvols of interest and flag them for removal, then wait for them to shed activity and then remove them, and finally, free them. Sponsored-by: Klara, Inc. Sponsored-by: Railway Corporation Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Fedor Uporov <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #17625
[Sponsors: Klara, Inc., Railway Corporation]
Motivation and Context
We have a customer whose workload involves rapid creation and destruction of zvols. This could easily result in panics or deadlocks, which ultimately are caused by incorrect or inadequate locking through the zvol lifecycle.
Description
There are three main issues resolved here which all contribute to the overall problem.
OS-side object not destroyed before freeing the zvol_state_t

When destroying a zvol, it is not "unpublished" from the system (that is, /dev/zd* node removed) until zvol_os_free(). Under Linux, at the time del_gendisk() and put_disk() are called, the device node may still have an active hold, from a userspace program or something inside the kernel (a partition probe). As it is currently, this can lead to calls to zvol_open() or zvol_release() while the zvol_state_t is partially or fully freed. zvol_open() has some protection against this by checking that private_data is NULL, but zvol_release() does not.

This implements a better ordering for all of this by adding a new OS-side method, zvol_os_remove_minor(), which is responsible for fully decoupling the "private" (OS-side) objects from the zvol_state_t. For Linux, that means calling put_disk(), nulling private_data, and freeing zv_zso.

This takes the place of zvol_os_clear_private(), which was a nod in that direction but did not do enough, and did not do it early enough.
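A minimal userland model of that ordering (stand-in types, not the Linux gendisk API or the real zvol code): the remove step detaches the OS-side object and clears private_data while the state is still valid, so a late open sees NULL and fails cleanly instead of touching a half-freed state.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct zv_state { int open_count; } zv_state_t;
typedef struct disk { void *private_data; } disk_t;	/* models gendisk */

static int
dev_open(disk_t *d)
{
	zv_state_t *zv = d->private_data;
	if (zv == NULL)		/* minor already removed: refuse, don't crash */
		return (-ENXIO);
	zv->open_count++;
	return (0);
}

/* Step 1: decouple the OS-side object from the state ("remove minor"). */
static void
remove_minor(disk_t *d)
{
	d->private_data = NULL;	/* from here on, opens fail cleanly */
	/* the real change also tears down the disk object (put_disk()) here */
}

/* Step 2: only now is it safe to free the state itself ("os free"). */
static void
free_state(zv_state_t *zv)
{
	free(zv);
}

int
main(void)
{
	zv_state_t *zv = calloc(1, sizeof (*zv));
	disk_t d = { .private_data = zv };

	printf("open before removal: %d\n", dev_open(&d));
	remove_minor(&d);
	printf("open after removal:  %d\n", dev_open(&d));	/* -ENXIO */
	free_state(zv);
	return (0);
}
```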
zvol_state_lock used to protect OS-side private data

zvol_state_lock is intended to protect access to the global name->zvol lists (zvol_find_by_name()), but has also been used to control access to OS-side private data, accessed through whatever kernel object is used to represent the volume (gendisk, geom, etc).

This appears to have been necessary to some degree because the OS-side object is what's used to get a handle on zvol_state_t, so zv_state_lock and zv_suspend_lock can't be used to manage access, but also, with the private object and the zvol_state_t being shut down and destroyed at the same time in zvol_os_free(), we must ensure that the private object pointer only ever corresponds to a real zvol_state_t, not one in partial destruction. Taking the global lock seems like a convenient way to ensure this.

The problem with this is that zvol_state_lock does not actually protect access to the zvol_state_t internals, so we need to take zv_state_lock and/or zv_suspend_lock. If those are contended, this can then cause OS-side operations (eg zvol_open()) to sleep waiting for them while holding zvol_state_lock. This then blocks out all other OS-side operations which want to get the private data, and any ZFS-side control operations that would take the write half of the lock. It's even worse if ZFS-side operations induce OS-side calls back into the zvol (eg creating a zvol triggers a partition probe inside the kernel, and also a userspace access from udev to set up device links). And it gets even worse again if anything decides to defer those ops to a task and wait on them, which zvol_remove_minors_impl() will do under high load.

However, since the previous commit, we have a guarantee that the private data pointer will always be NULL'd out in zvol_os_remove_minor() before the zvol_state_t is made invalid, but it won't happen until all users are ejected. So, if we make access to the private object pointer atomic, we remove the need to take a global lockout to access it, and so we can remove all acquisitions of zvol_state_lock from the OS side.
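Here is a small C11-atomics sketch of that idea (a userland model, not the kernel's atomic ops or the actual zvol code): the OS side does an atomic load of the private pointer instead of taking the global lock, and removal atomically publishes NULL before the state is torn down.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct zv_state { int dummy; } zv_state_t;

typedef struct disk {
	_Atomic(zv_state_t *) private_data;	/* models gendisk->private_data */
} disk_t;

/* OS-side path: no global lock, just an atomic load of the pointer. */
static int
dev_open(disk_t *d)
{
	zv_state_t *zv = atomic_load(&d->private_data);
	if (zv == NULL)
		return (-ENXIO);
	/* ...then take the per-zvol locks on zv as needed... */
	return (0);
}

/* ZFS-side removal: atomically unpublish before tearing the state down. */
static void
remove_minor(disk_t *d, zv_state_t **zvp)
{
	*zvp = atomic_exchange(&d->private_data, NULL);
}

int
main(void)
{
	zv_state_t *zv = calloc(1, sizeof (*zv));
	disk_t d;
	atomic_init(&d.private_data, zv);

	printf("open while published:   %d\n", dev_open(&d));
	zv_state_t *old;
	remove_minor(&d, &old);
	printf("open after unpublished: %d\n", dev_open(&d));	/* -ENXIO */
	free(old);
	return (0);
}
```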
Use of async fallbacks in response to failed lock acquisitions

Since both ZFS- and OS-sides of a zvol now take care of their own locking and don't get in each other's way, there's no need for the very complicated removal code to fall back to async tasks if the locks needed at each stage can't be obtained right now.

Here we change it to be a linear three-step process: select zvols of interest and flag them for removal, then wait for them to shed activity and then remove them, and finally, free them.
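The sketch below (pthreads, illustrative names, not the kernel code) models that linear flow for a single zvol: flag it, wait for remaining users to drain, then free it, with no retries or deferral to async tasks.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct zv {
	pthread_mutex_t lock;
	pthread_cond_t cv;
	int removing;		/* step 1: models ZVOL_REMOVING */
	int users;		/* opens/suspends still in flight */
} zv_t;

static void
zv_release(zv_t *zv)	/* called by a user when it is done */
{
	pthread_mutex_lock(&zv->lock);
	if (--zv->users == 0)
		pthread_cond_broadcast(&zv->cv);
	pthread_mutex_unlock(&zv->lock);
}

static void
remove_one(zv_t *zv)
{
	/* Step 1: flag it so no new activity starts. */
	pthread_mutex_lock(&zv->lock);
	zv->removing = 1;

	/* Step 2: wait for existing activity to drain. */
	while (zv->users > 0)
		pthread_cond_wait(&zv->cv, &zv->lock);
	pthread_mutex_unlock(&zv->lock);

	/* Step 3: nothing can reach it now; free it. */
	free(zv);
}

static void *
user_thread(void *arg)
{
	zv_release(arg);	/* pretend some in-flight I/O finished */
	return (NULL);
}

int
main(void)
{
	zv_t *zv = calloc(1, sizeof (*zv));
	pthread_mutex_init(&zv->lock, NULL);
	pthread_cond_init(&zv->cv, NULL);
	zv->users = 1;

	pthread_t t;
	pthread_create(&t, NULL, user_thread, zv);
	remove_one(zv);		/* blocks until the user is gone, then frees */
	pthread_join(&t, NULL);
	printf("removed\n");
	return (0);
}
```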
How Has This Been Tested?
Imagine I wrote this a week ago. I would have said:
I was in the middle of preparing this patch set (as evidenced by the existence of #17596) when 0b6fd02 (via #17575) landed. And since then, I cannot make this break the same way.
As best I can tell, it's simply that creating zvols now fails faster, and we detect errors rather than pushing on, so there are vastly fewer async operations in flight, and so the remove process is far less likely to need to bounce work out to an async taskq.
Whether that means that we're now managing ordering much better, I'm not sure. But still, I feel pretty confident that these changes are still good: the remove process was too complicated (#16364) and the OS side really has no business taking zvol_state_lock, which is only supposed to protect the global name lookup lists. And just generally, the code is a lot easier to follow.
Types of changes
Checklist:
[Signed-off-by](https://github.com/openzfs/zfs/blob/master/.github/CONTRIBUTING.md#signed-off-by).