Skip to content

Fix inode eviction and sync writeback deadlock on Linux #17159

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tuxoko
Copy link
Contributor

@tuxoko tuxoko commented Mar 19, 2025

If inode eviction and sync writeback happen on the same inode at the same time, inode eviction will set I_FREEING and wait for sync writeback, and sync writeback may eventually calls zfs_get_data and loop in zfs_zget forever because igrab cannot succeed with I_FREEING, thus causing deadlock.

To fix this, in zfs_get_data we call a variant of zfs_zget where we bailout on loop if I_SYNC flag is set, and force the caller to wait for txg sync.

Signed-off-by: Chunwei Chen [email protected]
Fixes #7964
Fixes #9430

Motivation and Context

Description

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

If inode eviction and sync writeback happen on the same inode at the
same time, inode eviction will set I_FREEING and wait for sync
writeback, and sync writeback may eventually calls zfs_get_data and loop
in zfs_zget forever because igrab cannot succeed with I_FREEING, thus
causing deadlock.

To fix this, in zfs_get_data we call a variant of zfs_zget where we
bailout on loop if I_SYNC flag is set, and force the caller to wait for
txg sync.

Signed-off-by: Chunwei Chen <[email protected]>
Fixes openzfs#7964
Fixes openzfs#9430
@@ -1149,10 +1149,21 @@ zfs_get_data(void *arg, uint64_t gen, lr_write_t *lr, char *buf,
ASSERT3P(lwb, !=, NULL);
ASSERT3U(size, !=, 0);

error = zfs_zget_impl(zfsvfs, object, &zp, B_TRUE);
#if defined(__linux__)
Copy link
Contributor Author

@tuxoko tuxoko Mar 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added ifdef because I'm not sure if it's possible to return EAGAIN on freebsd or not.
But I guess looking at the caller, it doesn't really matter if we return EAGAIN or EIO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps it would be more clean to enhance zil_lwb_commit, which is the user of zfs_get_data, to recognize EAGAIN, I think

Copy link
Contributor

@tonyhutter tonyhutter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not an expert in the znode code, but I don't see any surface level issues.

@@ -1149,10 +1149,21 @@ zfs_get_data(void *arg, uint64_t gen, lr_write_t *lr, char *buf,
ASSERT3P(lwb, !=, NULL);
ASSERT3U(size, !=, 0);

error = zfs_zget_impl(zfsvfs, object, &zp, B_TRUE);
#if defined(__linux__)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps it would be more clean to enhance zil_lwb_commit, which is the user of zfs_get_data, to recognize EAGAIN, I think

* path, i.e. zfs_get_data, if inode is being
* evicted and I_SYNC is also set.
*/
if (check_sync && (ZTOI(zp)->i_state & I_SYNC))
Copy link
Contributor

@snajpa snajpa Mar 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you can move this before the igrab, take the i_lock and check for I_FREEING and I_SYNC and if they are both set, bail out of the original zfs_zget without the need for zfs_zget_impl - have you already considered it? am I overlooking something?

Copy link
Contributor Author

@tuxoko tuxoko Mar 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only place that needs bail out is if you are in writeback context, either you are the one setting I_SYNC or whoever set it is blocked by you.

If you do bailout for other caller then you will get errors randomly in all sorts of places for no good reason.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I_SYNC is set in two kernel functions, writeback_sb_inodes, writeback_single_inode. So if you check for I_SYNC and I_FREEING and you use i_lock correctly as you're supposed to, you get more clean and also reliable solution. How is what am saying wrong specifically and why? Which tons of errors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@snajpa
I_FREEING and I_SYNC is set on the inode. Every one trying to do zfs_zget will see them. So if you don't have any restriction on caller then every one will bailout and got error as long as eviction and writeback is happening at the same time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well.. I tried playing around with it and it seems like it would be an interesting project to get to the bottom of maybe few other lurking bugs, forcing one to go around and look at how errors from zfs_zget are handled all over the codebase - I think there's a thing in direct IO I'll try to tackle if I find time; but that's not relevant to resolving the deadlock, so, yeah, no easy way other than the _impl split :)

@tonyhutter
Copy link
Contributor

@snajpa would you mind taking another look at this one when you get a chance?

@snajpa
Copy link
Contributor

snajpa commented May 2, 2025

@snajpa would you mind taking another look at this one when you get a chance?

@tonyhutter

I mean I don't like it, because now evicting inodes with dirty data will cause txg closing, as fallback of not being able to do zfs_get_data to generate the zil records for inodes open with O_SYNC at the very least -- in zil_commit_impl there's is txg_wait_synced for the failure case, if I understand the patch and those codepaths correctly. Great learning opportunity for me if I don't :D

I think we should do something specifically in case the inode is evicting and syncing both, when zfs_zget_impl returns EAGAIN.

@tuxoko how about this:

If there's EAGAIN and a flag is given to zfs_zget_impl, I think we could instead exit zfs_zget holding two locks - the zfs_znode_hold, but also z_lock is important here: z_unlinked should only be modified and checked under z_lock anyway - then the znode could be worked with for zfs_get_data purposes (and unlock and rele at the end)

or please demonstrate if I'm wrong in my expectation that this will cause txg flushing even in cases of memory cgroup triggered evictions, how am I wrong specifically - we can't be closing txgs on evict (in my very strongly held opinion, because I do have problems with closing txgs faster than 2-3 seconds on our setups as we rely on the writeback mechanisms in ZFS, also want them to reduce rewrites)

and also please consider that i_state is supposed to be accessed under i_lock spinlock, it might not matter specifically in this case but again in my opinion it's cheap to take it in the uncontended case so, what's the harm in doing it consistently with rest of Linux

@tuxoko
Copy link
Contributor Author

tuxoko commented May 2, 2025

@snajpa
I_SYNC writeback flag is for dirtied mmap pages. It's a separate thing from O_SYNC.
If you don't do mmap, then I don't think you will be affected by the deadlock or this change.

Also, if you are using O_SYNC, you are expecting your write(2) to be sync during the call, so there's no inode eviction in picture as you hold the fd as you do write(2).

@snajpa
Copy link
Contributor

snajpa commented May 3, 2025

Lots of software do mmap, we in particular do everything, because we run containers in containers on big boxes and OpenZFS so far has been the most performant option for our use-case in practice, even though microbenchmarks on otherwise idle boxes might not agree :) If this ends up triggering txg flushes on in-container inode evicts, I'm going to have to modify it in some way (if as I outlined works then that way)

@snajpa
Copy link
Contributor

snajpa commented May 3, 2025

But then of course currently, due to the very bug you're trying to solve, we have to run with sync=disabled... so, you do have my grattitude for uncovering what's behind the deadlock, but please let's solve it better

@tuxoko
Copy link
Contributor Author

tuxoko commented May 3, 2025

@snajpa
If your performance is affected by the patch, then wouldn't you have already affected by the deadlock as well?

@tuxoko
Copy link
Contributor Author

tuxoko commented May 3, 2025

Just to clarify, I don't disagree trying to gracefully get the znode and let zil commit continues normally would be better.
But either we have to make sure we are in writeback context or if we can somehow prevent inode being evicted under us.
Given this deadlock is not very common, I doubt this will have that much performance impact.
Though I'm welcome to be proven otherwise.

@snajpa
Copy link
Contributor

snajpa commented May 3, 2025

Depends, it's only the first case of this happening currently, that deadlocks the machine, so that's the one we know about :)

but if this happens continually in multiple cgroups in a loop and there isn't a deadlock, then we will still have to run with sync=disabled until a solution for that is in place, because the result would be a txg sync loop, not very keen on seeing those, loadavg goes up (usually doubles at least), meaning useful workload suffers by a few misbehaving workloads (out of my admin jurisdiction even as we're sort of a timesharing coop, members manage the first container level on the inside ;))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants