Fix inode eviction and sync writeback deadlock on Linux #17159

tuxoko · 2025-03-19T23:24:30Z

If inode eviction and sync writeback happen on the same inode at the same time, inode eviction will set I_FREEING and wait for sync writeback, and sync writeback may eventually calls zfs_get_data and loop in zfs_zget forever because igrab cannot succeed with I_FREEING, thus causing deadlock.

To fix this, in zfs_get_data we call a variant of zfs_zget where we bailout on loop if I_SYNC flag is set, and force the caller to wait for txg sync.

Signed-off-by: Chunwei Chen [email protected]
Fixes #7964
Fixes #9430

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

If inode eviction and sync writeback happen on the same inode at the same time, inode eviction will set I_FREEING and wait for sync writeback, and sync writeback may eventually calls zfs_get_data and loop in zfs_zget forever because igrab cannot succeed with I_FREEING, thus causing deadlock. To fix this, in zfs_get_data we call a variant of zfs_zget where we bailout on loop if I_SYNC flag is set, and force the caller to wait for txg sync. Signed-off-by: Chunwei Chen <[email protected]> Fixes openzfs#7964 Fixes openzfs#9430

tuxoko · 2025-03-20T22:22:30Z

module/zfs/zfs_vnops.c

@@ -1149,10 +1149,21 @@ zfs_get_data(void *arg, uint64_t gen, lr_write_t *lr, char *buf,
 	ASSERT3P(lwb, !=, NULL);
 	ASSERT3U(size, !=, 0);

+	error = zfs_zget_impl(zfsvfs, object, &zp, B_TRUE);
+#if defined(__linux__)


I added ifdef because I'm not sure if it's possible to return EAGAIN on freebsd or not.
But I guess looking at the caller, it doesn't really matter if we return EAGAIN or EIO.

perhaps it would be more clean to enhance zil_lwb_commit, which is the user of zfs_get_data, to recognize EAGAIN, I think

tonyhutter

I'm not an expert in the znode code, but I don't see any surface level issues.

snajpa · 2025-03-23T00:06:34Z

module/zfs/zfs_vnops.c

@@ -1149,10 +1149,21 @@ zfs_get_data(void *arg, uint64_t gen, lr_write_t *lr, char *buf,
 	ASSERT3P(lwb, !=, NULL);
 	ASSERT3U(size, !=, 0);

+	error = zfs_zget_impl(zfsvfs, object, &zp, B_TRUE);
+#if defined(__linux__)


perhaps it would be more clean to enhance zil_lwb_commit, which is the user of zfs_get_data, to recognize EAGAIN, I think

snajpa · 2025-03-23T00:08:21Z

module/os/linux/zfs/zfs_znode_os.c

+				 * path, i.e. zfs_get_data, if inode is being
+				 * evicted and I_SYNC is also set.
+				 */
+				if (check_sync && (ZTOI(zp)->i_state & I_SYNC))


maybe you can move this before the igrab, take the i_lock and check for I_FREEING and I_SYNC and if they are both set, bail out of the original zfs_zget without the need for zfs_zget_impl - have you already considered it? am I overlooking something?

The only place that needs bail out is if you are in writeback context, either you are the one setting I_SYNC or whoever set it is blocked by you.

If you do bailout for other caller then you will get errors randomly in all sorts of places for no good reason.

I_SYNC is set in two kernel functions, writeback_sb_inodes, writeback_single_inode. So if you check for I_SYNC and I_FREEING and you use i_lock correctly as you're supposed to, you get more clean and also reliable solution. How is what am saying wrong specifically and why? Which tons of errors?

@snajpa
I_FREEING and I_SYNC is set on the inode. Every one trying to do zfs_zget will see them. So if you don't have any restriction on caller then every one will bailout and got error as long as eviction and writeback is happening at the same time.

well.. I tried playing around with it and it seems like it would be an interesting project to get to the bottom of maybe few other lurking bugs, forcing one to go around and look at how errors from zfs_zget are handled all over the codebase - I think there's a thing in direct IO I'll try to tackle if I find time; but that's not relevant to resolving the deadlock, so, yeah, no easy way other than the _impl split :)

tonyhutter · 2025-05-02T22:34:28Z

@snajpa would you mind taking another look at this one when you get a chance?

snajpa · 2025-05-02T23:45:15Z

@snajpa would you mind taking another look at this one when you get a chance?

@tonyhutter

I mean I don't like it, because now evicting inodes with dirty data will cause txg closing, as fallback of not being able to do zfs_get_data to generate the zil records for inodes open with O_SYNC at the very least -- in zil_commit_impl there's is txg_wait_synced for the failure case, if I understand the patch and those codepaths correctly. Great learning opportunity for me if I don't :D

I think we should do something specifically in case the inode is evicting and syncing both, when zfs_zget_impl returns EAGAIN.

@tuxoko how about this:

If there's EAGAIN and a flag is given to zfs_zget_impl, I think we could instead exit zfs_zget holding two locks - the zfs_znode_hold, but also z_lock is important here: z_unlinked should only be modified and checked under z_lock anyway - then the znode could be worked with for zfs_get_data purposes (and unlock and rele at the end)

or please demonstrate if I'm wrong in my expectation that this will cause txg flushing even in cases of memory cgroup triggered evictions, how am I wrong specifically - we can't be closing txgs on evict (in my very strongly held opinion, because I do have problems with closing txgs faster than 2-3 seconds on our setups as we rely on the writeback mechanisms in ZFS, also want them to reduce rewrites)

and also please consider that i_state is supposed to be accessed under i_lock spinlock, it might not matter specifically in this case but again in my opinion it's cheap to take it in the uncontended case so, what's the harm in doing it consistently with rest of Linux

tuxoko · 2025-05-02T23:57:45Z

@snajpa
I_SYNC writeback flag is for dirtied mmap pages. It's a separate thing from O_SYNC.
If you don't do mmap, then I don't think you will be affected by the deadlock or this change.

Also, if you are using O_SYNC, you are expecting your write(2) to be sync during the call, so there's no inode eviction in picture as you hold the fd as you do write(2).

snajpa · 2025-05-03T00:05:58Z

Lots of software do mmap, we in particular do everything, because we run containers in containers on big boxes and OpenZFS so far has been the most performant option for our use-case in practice, even though microbenchmarks on otherwise idle boxes might not agree :) If this ends up triggering txg flushes on in-container inode evicts, I'm going to have to modify it in some way (if as I outlined works then that way)

snajpa · 2025-05-03T00:09:31Z

But then of course currently, due to the very bug you're trying to solve, we have to run with sync=disabled... so, you do have my grattitude for uncovering what's behind the deadlock, but please let's solve it better

tuxoko · 2025-05-03T00:11:14Z

@snajpa
If your performance is affected by the patch, then wouldn't you have already affected by the deadlock as well?

tuxoko · 2025-05-03T00:23:23Z

Just to clarify, I don't disagree trying to gracefully get the znode and let zil commit continues normally would be better.
But either we have to make sure we are in writeback context or if we can somehow prevent inode being evicted under us.
Given this deadlock is not very common, I doubt this will have that much performance impact.
Though I'm welcome to be proven otherwise.

snajpa · 2025-05-03T00:27:48Z

Depends, it's only the first case of this happening currently, that deadlocks the machine, so that's the one we know about :)

but if this happens continually in multiple cgroups in a loop and there isn't a deadlock, then we will still have to run with sync=disabled until a solution for that is in place, because the result would be a txg sync loop, not very keen on seeing those, loadavg goes up (usually doubles at least), meaning useful workload suffers by a few misbehaving workloads (out of my admin jurisdiction even as we're sort of a timesharing coop, members manage the first container level on the inside ;))

tuxoko commented Mar 20, 2025

View reviewed changes

tonyhutter approved these changes Mar 20, 2025

View reviewed changes

snajpa reviewed Mar 23, 2025

View reviewed changes

markwiederspahn mentioned this pull request May 5, 2025

simple reproducer for mmap zfs_flush hang #17303

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix inode eviction and sync writeback deadlock on Linux #17159

Fix inode eviction and sync writeback deadlock on Linux #17159

tuxoko commented Mar 19, 2025

tuxoko Mar 20, 2025 •

edited

Loading

snajpa Mar 23, 2025

tonyhutter left a comment

snajpa Mar 23, 2025

snajpa Mar 23, 2025 •

edited

Loading

tuxoko Mar 23, 2025 •

edited

Loading

snajpa Mar 23, 2025

tuxoko Mar 23, 2025

snajpa Mar 25, 2025

tonyhutter commented May 2, 2025

snajpa commented May 2, 2025

tuxoko commented May 2, 2025 •

edited

Loading

snajpa commented May 3, 2025

snajpa commented May 3, 2025

tuxoko commented May 3, 2025

tuxoko commented May 3, 2025

snajpa commented May 3, 2025 •

edited

Loading

Fix inode eviction and sync writeback deadlock on Linux #17159

Are you sure you want to change the base?

Fix inode eviction and sync writeback deadlock on Linux #17159

Conversation

tuxoko commented Mar 19, 2025

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

tuxoko Mar 20, 2025 • edited Loading

Choose a reason for hiding this comment

snajpa Mar 23, 2025

Choose a reason for hiding this comment

tonyhutter left a comment

Choose a reason for hiding this comment

snajpa Mar 23, 2025

Choose a reason for hiding this comment

snajpa Mar 23, 2025 • edited Loading

Choose a reason for hiding this comment

tuxoko Mar 23, 2025 • edited Loading

Choose a reason for hiding this comment

snajpa Mar 23, 2025

Choose a reason for hiding this comment

tuxoko Mar 23, 2025

Choose a reason for hiding this comment

snajpa Mar 25, 2025

Choose a reason for hiding this comment

tonyhutter commented May 2, 2025

snajpa commented May 2, 2025

tuxoko commented May 2, 2025 • edited Loading

snajpa commented May 3, 2025

snajpa commented May 3, 2025

tuxoko commented May 3, 2025

tuxoko commented May 3, 2025

snajpa commented May 3, 2025 • edited Loading

tuxoko Mar 20, 2025 •

edited

Loading

snajpa Mar 23, 2025 •

edited

Loading

tuxoko Mar 23, 2025 •

edited

Loading

tuxoko commented May 2, 2025 •

edited

Loading

snajpa commented May 3, 2025 •

edited

Loading