Skip to content

[2.51.0 Bug] Missing singleton objects in 'git repack -adf --path-walk' #1956

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

derrickstolee
Copy link

@derrickstolee derrickstolee commented Aug 20, 2025

Now that the --path-walk feature for git repack is out in the wild and getting more visibility than it did in the Git for Windows fork, the following issue was brought to my attention:

Some folks would report missing objects after git repack -adf --path-walk!

It turns out that this snuck through the cracks because it was pretty difficult to create a reproducing test case (patch 1) but it boils down to:

  1. A path has exactly one version across all of the history being repacked.
  2. That path is in the index.
  3. The object at that path is not a loose object.
  4. pack.useSparse=true in the config (this is the default)

It is also something where users don't necessarily notice the missing objects until they fetch and a missing object is used as a delta base. Doing normal checkouts doesn't cause changes to these files, so they are never opened by Git. Users hitting this issue can usually recover using git fetch --refetch to repopulate the missing objects from a remote (unless they never had a remote at all).

Patch 2 introduces the fix for this issue, which is related to forgetting to initialize a struct indicator when walking the pending objects.

When reflecting on the ways that I missed this when building the feature, I think the core issue was an overreliance on using bare repos in testing. I also think that the way that the UNINTERESTING object exploration was implemented was particularly fragile to missing updates to the initialization of the struct, so patch 3 adds a new initializer to reduce duplicate code and to help avoid this mistake in the future.

Thanks,
-Stolee

P.S. CC'ing all original reviewers of the series.

cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]

Users reported an issue where objects were missing from their local
enlistments after a full repack using 'git repack -adf --path-walk'.
This was alarming, but took a while to create a reproducer.

The root cause is that certain objects existed in the index and had no
second versions. These objects are usually blobs, though trees can be
included if a cache-tree exists. The issue is that the revision walk
adds these objects to the "pending" list and the path-walk API forgets
to mark the lists it creates at this point as "maybe_interesting". If
these paths only ever have a single version in the history of the repo
(including the current staged version) then the parent directory never
tries to add a new object to the list and mark the list as
"maybe_interesting". Thus, when walking the list later, the group is
skipped as it is expected that no objects are interesting. This happens
even when there are actually no UNINTERESTING objects at all! This is
based on the optimization enabled by the pack.useSparse=true config
option, which is the default.

Thus, we create a test case that demonstrates the many cases of this
issue for reproducibility:

 1. File a/b/c has only one committed version.
 2. Files a/i and x/y only exists as staged changes.
 3. Tree x/ only exists in the cache-tree.

After performing a non-path-walk repack to force all loose objects into
packfiles, run a --path-walk repack followed by 'git fsck'. This fsck is
what fails with the following errors:

  error: invalid object 100644 f2e41136... for 'a/b/c'

    This is the dropped instance of the single-versioned a/b/c file.

  broken link from    tree cfda31d8...
                to    tree 3f725fcd...

    This is the missing tree for the single-versioned a/b/ directory.

  missing blob 0ddf2bae... (a/i)
  missing blob 975fbec8... (x/y)
  missing blob a60d869d... (file)
  missing blob f2e41136... (a/b/c)

  missing tree 3f725fcd... (a/b/)

  dangling tree 5896d7e... (staged root tree)

Note that since the staged root tree is missing, the fsck output cannot
even report that the staged x/ tree is missing as well.

This bug will be fixed in the next change.

Signed-off-by: Derrick Stolee <[email protected]>
The previous change established a buggy instance of 'git repack -adf
--path-walk' when there exist paths that are tracked in the index and
that is the only instance of those paths in the history of the
repository. This change fixes that bug.

The core problem here is that the "maybe_interesting" member of 'struct
type_and_oid_list' is not initialized to '1'. This member was added in
6333e7a (path-walk: mark trees and blobs as UNINTERESTING,
2024-12-20) in a way to help when creating packfiles for a small commit
range using the sparse path algorithm (enabled by pack.useSparse=true).

The idea here is that the list is marked as "maybe_interesting" if an
object is added that does not have the UNINITERSTING flag on it. Later,
this is checked again in case all objects in the list were marked
UNINTERESTING after that point in time. In this case, the algorithm
skips the list as there is no reason to visit it.

This leads to the problem where the "maybe_interesting" member was not
appropriately initialized when the list is created from pending objects.
This is the fix for now.

To help avoid this from happening in the future, a follow-up change will
make initializing lists use a shared method instead of allowing for an
update to this initialization process to miss some existing copies.

Signed-off-by: Derrick Stolee <[email protected]>
The previous change fixed a bug in 'git repack -adf --path-walk' that
was due to an update to how path lists are initialized and missing some
important cases when processing the pending objects.

This change takes the three critical places where path lists are
initialized and combines them into a static method. This simplifies the
callers somewhat while also helping to avoid a missed update in the
future.

The other places where a path list (struct type_and_oid_list) is
initialized is for the following "fixed" lists:

 * Tag objects.
 * Commit objects.
 * Root trees.
 * Tagged trees.
 * Tagged blobs.

These lists are created and consumed in different ways, with only the
root trees being passed into the logic that cares about the
"maybe_interesting" bit. It is appropriate to keep these uses separate.

Signed-off-by: Derrick Stolee <[email protected]>
@derrickstolee derrickstolee self-assigned this Aug 20, 2025
@derrickstolee
Copy link
Author

/submit

Copy link

gitgitgadget bot commented Aug 20, 2025

Submitted as [email protected]

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1956/derrickstolee/path-walk-missing-objects-v1

To fetch this version to local tag pr-1956/derrickstolee/path-walk-missing-objects-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1956/derrickstolee/path-walk-missing-objects-v1

@derrickstolee
Copy link
Author

Mentioning @dscho in the PR as this bug also exists in Git for Windows (and has since October 2024).

Copy link

gitgitgadget bot commented Aug 20, 2025

Error: dd458e0 was already submitted

path-walk.c Outdated
@@ -385,6 +385,7 @@ static int setup_pending_objects(struct path_walk_info *info,
list->type = OBJ_TREE;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <[email protected]> writes:

> The core problem here is that the "maybe_interesting" member of 'struct
> type_and_oid_list' is not initialized to '1'. This member was added in
> 6333e7ae0b (path-walk: mark trees and blobs as UNINTERESTING,
> 2024-12-20) in a way to help when creating packfiles for a small commit
> range using the sparse path algorithm (enabled by pack.useSparse=true).

OK, in other words, the bug is fairly contained within the path-walk
traversal.  We treat things as reachable not just from ref tips and
reflogs (where path-walk code can use the tree object to compute on
what pathname each blob comes from) and the main index array (that
has paths, even though it needs separate way to compute than those
for trees), but also from places like REUC and TREE extensions that
make associations between pathnames and objects.  Are they also OK?

> To help avoid this from happening in the future, a follow-up change will
> make initializing lists use a shared method instead of allowing for an
> update to this initialization process to miss some existing copies.

Great.  Future-proofing is 100 times better than just a bugfix.

>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
>  path-walk.c       | 2 ++
>  t/t7700-repack.sh | 2 +-
>  2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/path-walk.c b/path-walk.c
> index 2d4ddbadd50f..1215ed398f4f 100644
> --- a/path-walk.c
> +++ b/path-walk.c
> @@ -385,6 +385,7 @@ static int setup_pending_objects(struct path_walk_info *info,
>  					list->type = OBJ_TREE;
>  					strmap_put(&ctx->paths_to_lists, path, list);
>  				}
> +				list->maybe_interesting = 1;
>  				oid_array_append(&list->oids, &obj->oid);
>  				free(path);
>  			} else {
> @@ -404,6 +405,7 @@ static int setup_pending_objects(struct path_walk_info *info,
>  					list->type = OBJ_BLOB;
>  					strmap_put(&ctx->paths_to_lists, path, list);
>  				}
> +				list->maybe_interesting = 1;
>  				oid_array_append(&list->oids, &obj->oid);
>  			} else {
>  				/* assume a root tree, such as a lightweight tag. */
> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
> index 1998d9bf291c..030e9e5b2dc7 100755
> --- a/t/t7700-repack.sh
> +++ b/t/t7700-repack.sh
> @@ -838,7 +838,7 @@ test_expect_success '-n overrides repack.updateServerInfo=true' '
>  	test_server_info_missing
>  '
>  
> -test_expect_failure 'pending objects are repacked appropriately' '
> +test_expect_success 'pending objects are repacked appropriately' '
>  	git init pending &&
>  
>  	(

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 8/20/2025 3:02 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <[email protected]> writes:
> 
>> The core problem here is that the "maybe_interesting" member of 'struct
>> type_and_oid_list' is not initialized to '1'. This member was added in
>> 6333e7ae0b (path-walk: mark trees and blobs as UNINTERESTING,
>> 2024-12-20) in a way to help when creating packfiles for a small commit
>> range using the sparse path algorithm (enabled by pack.useSparse=true).
> 
> OK, in other words, the bug is fairly contained within the path-walk
> traversal.  We treat things as reachable not just from ref tips and
> reflogs (where path-walk code can use the tree object to compute on
> what pathname each blob comes from) and the main index array (that
> has paths, even though it needs separate way to compute than those
> for trees), but also from places like REUC and TREE extensions that
> make associations between pathnames and objects.  Are they also OK?

The key integration point is the "pending" list operating a bit
different from walking directly from tags or commits. I was trying
to reproduce the issue from all of those other sources before unlocking
the "singleton" nature of the problem, and failed to do so.

The resolve-undo cache (REUC) is something that I had not tested
previously. Adding "git rm --cached x/y" to the test in the previous
case leads to the 'git fsck' call giving a "dangling blob" warning,
so that could be an interesting way to strengthen the test. Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Wed, Aug 20, 2025 at 03:42:11PM -0400, Derrick Stolee wrote:
> On 8/20/2025 3:02 PM, Junio C Hamano wrote:
> > "Derrick Stolee via GitGitGadget" <[email protected]> writes:
> > 
> >> The core problem here is that the "maybe_interesting" member of 'struct
> >> type_and_oid_list' is not initialized to '1'. This member was added in
> >> 6333e7ae0b (path-walk: mark trees and blobs as UNINTERESTING,
> >> 2024-12-20) in a way to help when creating packfiles for a small commit
> >> range using the sparse path algorithm (enabled by pack.useSparse=true).
> > 
> > OK, in other words, the bug is fairly contained within the path-walk
> > traversal.  We treat things as reachable not just from ref tips and
> > reflogs (where path-walk code can use the tree object to compute on
> > what pathname each blob comes from) and the main index array (that
> > has paths, even though it needs separate way to compute than those
> > for trees), but also from places like REUC and TREE extensions that
> > make associations between pathnames and objects.  Are they also OK?
> 
> The key integration point is the "pending" list operating a bit
> different from walking directly from tags or commits. I was trying
> to reproduce the issue from all of those other sources before unlocking
> the "singleton" nature of the problem, and failed to do so.
> 
> The resolve-undo cache (REUC) is something that I had not tested
> previously. Adding "git rm --cached x/y" to the test in the previous
> case leads to the 'git fsck' call giving a "dangling blob" warning,
> so that could be an interesting way to strengthen the test. Thanks,

I also wonder a bit about the future -- if we ever add a new source for
pending objects, would the author have to amend "path-walk.c" to take
this new pending source into account?

I guess the answer is "yes", which does make me feel a bit uneasy as
it is very easy to now corrupt the repository.

Patrick

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 8/21/2025 4:01 AM, Patrick Steinhardt wrote:
> On Wed, Aug 20, 2025 at 03:42:11PM -0400, Derrick Stolee wrote:
>> On 8/20/2025 3:02 PM, Junio C Hamano wrote:
>>> "Derrick Stolee via GitGitGadget" <[email protected]> writes:

>> The key integration point is the "pending" list operating a bit
>> different from walking directly from tags or commits. I was trying
>> to reproduce the issue from all of those other sources before unlocking
>> the "singleton" nature of the problem, and failed to do so.
>>
>> The resolve-undo cache (REUC) is something that I had not tested
>> previously. Adding "git rm --cached x/y" to the test in the previous
>> case leads to the 'git fsck' call giving a "dangling blob" warning,
>> so that could be an interesting way to strengthen the test. Thanks,
> 
> I also wonder a bit about the future -- if we ever add a new source for
> pending objects, would the author have to amend "path-walk.c" to take
> this new pending source into account?

I don't think the risk comes from new ways to add pending objects,
but that the path-walk algorithm added an "optimization" without
appropriately modifying how it handled the pending objects. New
sources of pending objects shouldn't cause any issues, unless the
revision API itself changed substantially.

For example, in my draft for v2 I add changes to my test to account
for the REUC in the repack. The new logic for the path-walk feature
worked just fine. It does point out that 'git fsck' doesn't follow
links in the REUC and so reports the blob as dangling.

If we have a new source for pending objects in the context of
pack-objects or repack, one would hope that tests would be added to
demonstrate the behavior and then that could be double-checked with
GIT_TEST_PACK_PATH_WALK=1. One problem in this situation is that
the tests were not substantial enough for these sources. We are
correcting for that now. 
> I guess the answer is "yes", which does make me feel a bit uneasy as
> it is very easy to now corrupt the repository.

Perhaps this has always been very easy to make a bug that leads to
losing objects, but we're purposefully wary to touch the logic around
the repacking processes.

One feature that could provide increased confidence is some step that
double-checks the object list before and after a step like this to
make sure that no objects are dropped (or only the set we expect is
dropped). Something like comparing the object lists in the pack-indexes
before and after a repack. This might be too problematic to enable in
all cases, but could be enabled in the test suite and certain critical
places. Azure DevOps has something similar in its backend to prevent
any change to the Git object database that can't be undone, but these
steps happen asynchronously to "production data" so it may not be
appropriate for all server architectures.

Thanks,
-Stolee

Copy link

gitgitgadget bot commented Aug 20, 2025

This patch series was integrated into seen via git@aa17243.

@gitgitgadget gitgitgadget bot added the seen label Aug 20, 2025
Copy link

gitgitgadget bot commented Aug 20, 2025

This patch series was integrated into seen via git@1c522a9.

@@ -838,4 +838,47 @@ test_expect_success '-n overrides repack.updateServerInfo=true' '
test_server_info_missing
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Wed, Aug 20, 2025 at 06:39:54PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
> index 611755cc139b..1998d9bf291c 100755
> --- a/t/t7700-repack.sh
> +++ b/t/t7700-repack.sh
> @@ -838,4 +838,47 @@ test_expect_success '-n overrides repack.updateServerInfo=true' '

Tiny nit: I would've probably squashed this patch into the second patch,
as we usually don't use the add-failing-test-and-then-fix-it-later
dance. On the other hand though it gives some nice context, so I
ultimately don't mind it all that much. So please feel free to ignore
this nit.

>  	test_server_info_missing
>  '
>  
> +test_expect_failure 'pending objects are repacked appropriately' '
> +	git init pending &&

We probably also want `test_when_finished "rm -rf pending"` before
calling git-init(1).

> +
> +	(
> +		cd pending &&
> +
> +		mkdir -p a/b &&
> +		echo singleton >file &&
> +		echo stuff >a/b/c &&
> +		echo more >a/d &&
> +		git add file a &&
> +		git commit -m "single blobs" &&
> +
> +		echo d >a/d &&
> +		echo e >a/e &&
> +		git add a &&
> +		git commit -m "more blobs" &&
> +
> +		# This use of a sparse index helps to force
> +		# test that the cache-tree is walked, too.
> +		git sparse-checkout set --sparse-index a x &&
> +
> +		# Just _stage_ the changes.
> +		echo f >a/d &&
> +		echo h >a/e &&
> +		echo i >a/i &&
> +		mkdir x &&
> +		echo y >x/y &&
> +		git add a x &&

Nit: I think I would've moved the explanations you have in the commit
message into these hunks so that the test becomes a bit more
self-explanatory.

> +		# Bring the loose objects into a packfile to avoid
> +		# leftovers in next test. Without this, the loose
> +		# objects persist and the test succeeds for other
> +		# reasons.
> +		git repack -adf &&
> +		git fsck &&
> +
> +		# Test path walk version with pack.useSparse.
> +		git -c pack.useSparse=true repack -adf --path-walk &&
> +		git fsck
> +	)
> +'

Patrick

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 8/21/2025 4:00 AM, Patrick Steinhardt wrote:
> On Wed, Aug 20, 2025 at 06:39:54PM +0000, Derrick Stolee via GitGitGadget wrote:
>> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
>> index 611755cc139b..1998d9bf291c 100755
>> --- a/t/t7700-repack.sh
>> +++ b/t/t7700-repack.sh
>> @@ -838,4 +838,47 @@ test_expect_success '-n overrides repack.updateServerInfo=true' '
> 
> Tiny nit: I would've probably squashed this patch into the second patch,
> as we usually don't use the add-failing-test-and-then-fix-it-later
> dance. On the other hand though it gives some nice context, so I
> ultimately don't mind it all that much. So please feel free to ignore
> this nit.

I'm probably the person who is always asking folks to create a test
that either fails or demonstrates the "before" behavior before making
the actual change that updates the case. This allows the ability to
"test the test" by checking it in place to confirm that it is indeed
failing.

Using test_expect_failure allows us to avoid breaking bisect. 
>>  	test_server_info_missing
>>  '
>>  
>> +test_expect_failure 'pending objects are repacked appropriately' '
>> +	git init pending &&
> 
> We probably also want `test_when_finished "rm -rf pending"` before
> calling git-init(1).

Good idea.

>> +
>> +	(
>> +		cd pending &&
>> +
>> +		mkdir -p a/b &&
>> +		echo singleton >file &&
>> +		echo stuff >a/b/c &&
>> +		echo more >a/d &&
>> +		git add file a &&
>> +		git commit -m "single blobs" &&
>> +
>> +		echo d >a/d &&
>> +		echo e >a/e &&
>> +		git add a &&
>> +		git commit -m "more blobs" &&
>> +
>> +		# This use of a sparse index helps to force
>> +		# test that the cache-tree is walked, too.
>> +		git sparse-checkout set --sparse-index a x &&
>> +
>> +		# Just _stage_ the changes.
>> +		echo f >a/d &&
>> +		echo h >a/e &&
>> +		echo i >a/i &&
>> +		mkdir x &&
>> +		echo y >x/y &&
>> +		git add a x &&
> 
> Nit: I think I would've moved the explanations you have in the commit
> message into these hunks so that the test becomes a bit more
> self-explanatory.

Hm. That seems to go against our typical pattern of leaving comments
sparse and having the longer explanation available in commit messages
but maybe I'm out of date or tests are a different beast. I'll see
what I can do to make the test more self-documented.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Derrick Stolee <[email protected]> writes:

> On 8/21/2025 4:00 AM, Patrick Steinhardt wrote:
>> On Wed, Aug 20, 2025 at 06:39:54PM +0000, Derrick Stolee via GitGitGadget wrote:
>>> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
>>> index 611755cc139b..1998d9bf291c 100755
>>> --- a/t/t7700-repack.sh
>>> +++ b/t/t7700-repack.sh
>>> @@ -838,4 +838,47 @@ test_expect_success '-n overrides repack.updateServerInfo=true' '
>> 
>> Tiny nit: I would've probably squashed this patch into the second patch,
>> as we usually don't use the add-failing-test-and-then-fix-it-later
>> dance. On the other hand though it gives some nice context, so I
>> ultimately don't mind it all that much. So please feel free to ignore
>> this nit.
>
> I'm probably the person who is always asking folks to create a test
> that either fails or demonstrates the "before" behavior before making
> the actual change that updates the case. This allows the ability to
> "test the test" by checking it in place to confirm that it is indeed
> failing.
> Using test_expect_failure allows us to avoid breaking bisect. 

Yes, you can develop that way, but on the reviewing and receiving
end, the second patch that shows only the change from expect_failure
to expect_success pushes the more important "what behaviour was this
thing testing?" out of the hunk context, if you split them into two.

If we really wanted to verify the claim that without the fix this
was broken and we have a test to demonstrate a failure on the
receiving end, we can "checkout" paths touched by that commit
outside t/ from HEAD^ to build to see how the system behaved without
the code change just fine, so such a split does not buy us much.

Unless there is a strong reason not to, please always present such
test in the same patch as the code change to fix that breakage.

Thanks.

path-walk.c Outdated
@@ -385,6 +385,7 @@ static int setup_pending_objects(struct path_walk_info *info,
list->type = OBJ_TREE;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Wed, Aug 20, 2025 at 06:39:55PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <[email protected]>
> 
> The previous change established a buggy instance of 'git repack -adf
> --path-walk' when there exist paths that are tracked in the index and
> that is the only instance of those paths in the history of the
> repository. This change fixes that bug.
> 
> The core problem here is that the "maybe_interesting" member of 'struct
> type_and_oid_list' is not initialized to '1'. This member was added in
> 6333e7ae0b (path-walk: mark trees and blobs as UNINTERESTING,
> 2024-12-20) in a way to help when creating packfiles for a small commit
> range using the sparse path algorithm (enabled by pack.useSparse=true).
> 
> The idea here is that the list is marked as "maybe_interesting" if an
> object is added that does not have the UNINITERSTING flag on it. Later,

s/UNINITERSTING/UNINTERESTING/

> this is checked again in case all objects in the list were marked
> UNINTERESTING after that point in time. In this case, the algorithm
> skips the list as there is no reason to visit it.
> 
> This leads to the problem where the "maybe_interesting" member was not
> appropriately initialized when the list is created from pending objects.
> This is the fix for now.
> 
> To help avoid this from happening in the future, a follow-up change will
> make initializing lists use a shared method instead of allowing for an
> update to this initialization process to miss some existing copies.

Yeah, I wanted to say that this feels quite fragile to me and very easy
to miss. Does this mechanism buy us a lot of performance in the first
place? Because if not we might as well just remove it entirely.

But if the answer is "yes" then adding APIs around it feels like a good
alternative.

Patrick

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 8/21/2025 4:01 AM, Patrick Steinhardt wrote:
> On Wed, Aug 20, 2025 at 06:39:55PM +0000, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <[email protected]>
>>
>> The previous change established a buggy instance of 'git repack -adf
>> --path-walk' when there exist paths that are tracked in the index and
>> that is the only instance of those paths in the history of the
>> repository. This change fixes that bug.
>>
>> The core problem here is that the "maybe_interesting" member of 'struct
>> type_and_oid_list' is not initialized to '1'. This member was added in
>> 6333e7ae0b (path-walk: mark trees and blobs as UNINTERESTING,
>> 2024-12-20) in a way to help when creating packfiles for a small commit
>> range using the sparse path algorithm (enabled by pack.useSparse=true).
>>
>> The idea here is that the list is marked as "maybe_interesting" if an
>> object is added that does not have the UNINITERSTING flag on it. Later,
> 
> s/UNINITERSTING/UNINTERESTING/

Thanks!

>> this is checked again in case all objects in the list were marked
>> UNINTERESTING after that point in time. In this case, the algorithm
>> skips the list as there is no reason to visit it.
>>
>> This leads to the problem where the "maybe_interesting" member was not
>> appropriately initialized when the list is created from pending objects.
>> This is the fix for now.
>>
>> To help avoid this from happening in the future, a follow-up change will
>> make initializing lists use a shared method instead of allowing for an
>> update to this initialization process to miss some existing copies.
> 
> Yeah, I wanted to say that this feels quite fragile to me and very easy
> to miss. Does this mechanism buy us a lot of performance in the first
> place? Because if not we might as well just remove it entirely.

The details for the space savings and moderate time cost are listed in
e5394794a5 (pack-objects: thread the path-based compression, 2025-05-16).
These improvements are on top of those from --name-hash-version=2 by
using a different way to focus delta calculations.

A larger, internal example for this can be seen in this table (based on
testing today):

Mode      |  Size    | Time
----------+----------+------
version 1 |  16.0 GB | 83min
version 2 |   9.9 GB | 77min
path walk |   6.4 GB | 74min

> But if the answer is "yes" then adding APIs around it feels like a good
> alternative.
Making the code less brittle to changes is good, but also I'm interested
in ways to improve our test infrastructure or adding defensive features.
I mentioned a couple of ideas in an earlier message.

Thanks,
-Stolee

Copy link

gitgitgadget bot commented Aug 21, 2025

This branch is now known as ds/path-walk-repack-fix.

Copy link

gitgitgadget bot commented Aug 21, 2025

This patch series was integrated into seen via git@bcf38e7.

@@ -838,4 +838,47 @@ test_expect_success '-n overrides repack.updateServerInfo=true' '
test_server_info_missing
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Elijah Newren wrote (reply to this):

On Wed, Aug 20, 2025 at 11:39 AM Derrick Stolee via GitGitGadget
<[email protected]> wrote:
>
> From: Derrick Stolee <[email protected]>
>
> Users reported an issue where objects were missing from their local
> enlistments after a full repack using 'git repack -adf --path-walk'.

What is an enlistment?

> This was alarming, but took a while to create a reproducer.

but => and ?

> The root cause is that certain objects existed in the index and had no
> second versions. These objects are usually blobs, though trees can be
> included if a cache-tree exists. The issue is that the revision walk
> adds these objects to the "pending" list and the path-walk API forgets
> to mark the lists it creates at this point as "maybe_interesting". If
> these paths only ever have a single version in the history of the repo
> (including the current staged version) then the parent directory never
> tries to add a new object to the list and mark the list as
> "maybe_interesting". Thus, when walking the list later, the group is
> skipped as it is expected that no objects are interesting. This happens
> even when there are actually no UNINTERESTING objects at all! This is
> based on the optimization enabled by the pack.useSparse=true config
> option, which is the default.
>
> Thus, we create a test case that demonstrates the many cases of this
> issue for reproducibility:
>
>  1. File a/b/c has only one committed version.
>  2. Files a/i and x/y only exists as staged changes.

exists => exist


I didn't have any questions or spot any issues on the rest of the patch.

path-walk.c Outdated
@@ -385,6 +385,7 @@ static int setup_pending_objects(struct path_walk_info *info,
list->type = OBJ_TREE;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Elijah Newren wrote (reply to this):

On Wed, Aug 20, 2025 at 11:40 AM Derrick Stolee via GitGitGadget
<[email protected]> wrote:
>
> From: Derrick Stolee <[email protected]>
>
> The previous change established a buggy instance of 'git repack -adf
> --path-walk' when there exist paths that are tracked in the index and
> that is the only instance of those paths in the history of the
> repository. This change fixes that bug.
>
> The core problem here is that the "maybe_interesting" member of 'struct
> type_and_oid_list' is not initialized to '1'. This member was added in
> 6333e7ae0b (path-walk: mark trees and blobs as UNINTERESTING,
> 2024-12-20) in a way to help when creating packfiles for a small commit
> range using the sparse path algorithm (enabled by pack.useSparse=true).
>
> The idea here is that the list is marked as "maybe_interesting" if an
> object is added that does not have the UNINITERSTING flag on it. Later,
> this is checked again in case all objects in the list were marked
> UNINTERESTING after that point in time. In this case, the algorithm
> skips the list as there is no reason to visit it.
>
> This leads to the problem where the "maybe_interesting" member was not
> appropriately initialized when the list is created from pending objects.
> This is the fix for now.

What is the fix for now?  I think I can deduce it from the above
paragraphs, but then either this sentence is unnecessary or it's just
confusing.  Perhaps change the last sentence to "Initialize it to fix
this problem." or something along those lines?

> To help avoid this from happening in the future, a follow-up change will

I appreciate the future-proofing, which I think you are alluding to,
but to me, "To help avoid this..." suggests your change might not
(fully?) fix the bug you are discussing.  Perhaps you mean "To help
avoid similar problems..." or "A follow-up change will add some
future-proofing to prevent similar problems by..."?  Or maybe I'm just
reading your sentence weird...

> make initializing lists use a shared method instead of allowing for an
> update to this initialization process to miss some existing copies.
>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
>  path-walk.c       | 2 ++
>  t/t7700-repack.sh | 2 +-
>  2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/path-walk.c b/path-walk.c
> index 2d4ddbadd50f..1215ed398f4f 100644
> --- a/path-walk.c
> +++ b/path-walk.c
> @@ -385,6 +385,7 @@ static int setup_pending_objects(struct path_walk_info *info,
>                                         list->type = OBJ_TREE;
>                                         strmap_put(&ctx->paths_to_lists, path, list);
>                                 }
> +                               list->maybe_interesting = 1;
>                                 oid_array_append(&list->oids, &obj->oid);
>                                 free(path);
>                         } else {
> @@ -404,6 +405,7 @@ static int setup_pending_objects(struct path_walk_info *info,
>                                         list->type = OBJ_BLOB;
>                                         strmap_put(&ctx->paths_to_lists, path, list);
>                                 }
> +                               list->maybe_interesting = 1;
>                                 oid_array_append(&list->oids, &obj->oid);
>                         } else {
>                                 /* assume a root tree, such as a lightweight tag. */
> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
> index 1998d9bf291c..030e9e5b2dc7 100755
> --- a/t/t7700-repack.sh
> +++ b/t/t7700-repack.sh
> @@ -838,7 +838,7 @@ test_expect_success '-n overrides repack.updateServerInfo=true' '
>         test_server_info_missing
>  '
>
> -test_expect_failure 'pending objects are repacked appropriately' '
> +test_expect_success 'pending objects are repacked appropriately' '
>         git init pending &&
>
>         (
> --
> gitgitgadget

Patch looks good.

@@ -105,6 +105,24 @@ static void push_to_stack(struct path_walk_context *ctx,
prio_queue_put(&ctx->path_stack, xstrdup(path));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Elijah Newren wrote (reply to this):

On Wed, Aug 20, 2025 at 11:40 AM Derrick Stolee via GitGitGadget
<[email protected]> wrote:
>
> From: Derrick Stolee <[email protected]>
>
> The previous change fixed a bug in 'git repack -adf --path-walk' that
> was due to an update to how path lists are initialized and missing some
> important cases when processing the pending objects.
>
> This change takes the three critical places where path lists are
> initialized and combines them into a static method. This simplifies the
> callers somewhat while also helping to avoid a missed update in the
> future.
>
> The other places where a path list (struct type_and_oid_list) is
> initialized is for the following "fixed" lists:
>
>  * Tag objects.
>  * Commit objects.
>  * Root trees.
>  * Tagged trees.
>  * Tagged blobs.
>
> These lists are created and consumed in different ways, with only the
> root trees being passed into the logic that cares about the
> "maybe_interesting" bit. It is appropriate to keep these uses separate.
>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
>  path-walk.c | 57 +++++++++++++++++++++++------------------------------
>  1 file changed, 25 insertions(+), 32 deletions(-)
>
> diff --git a/path-walk.c b/path-walk.c
> index 1215ed398f4f..f1ceed99e94c 100644
> --- a/path-walk.c
> +++ b/path-walk.c
> @@ -105,6 +105,24 @@ static void push_to_stack(struct path_walk_context *ctx,
>         prio_queue_put(&ctx->path_stack, xstrdup(path));
>  }
>
> +static void add_path_to_list(struct path_walk_context *ctx,
> +                            const char *path,
> +                            enum object_type type,
> +                            struct object_id *oid,
> +                            int interesting)
> +{
> +       struct type_and_oid_list *list = strmap_get(&ctx->paths_to_lists, path);
> +
> +       if (!list) {
> +               CALLOC_ARRAY(list, 1);
> +               list->type = type;
> +               strmap_put(&ctx->paths_to_lists, path, list);
> +       }
> +
> +       list->maybe_interesting |= interesting;
> +       oid_array_append(&list->oids, oid);
> +}
> +
>  static int add_tree_entries(struct path_walk_context *ctx,
>                             const char *base_path,
>                             struct object_id *oid)
> @@ -129,7 +147,6 @@ static int add_tree_entries(struct path_walk_context *ctx,
>
>         init_tree_desc(&desc, &tree->object.oid, tree->buffer, tree->size);
>         while (tree_entry(&desc, &entry)) {
> -               struct type_and_oid_list *list;
>                 struct object *o;
>                 /* Not actually true, but we will ignore submodules later. */
>                 enum object_type type = S_ISDIR(entry.mode) ? OBJ_TREE : OBJ_BLOB;
> @@ -190,17 +207,10 @@ static int add_tree_entries(struct path_walk_context *ctx,
>                                 continue;
>                 }
>
> -               if (!(list = strmap_get(&ctx->paths_to_lists, path.buf))) {
> -                       CALLOC_ARRAY(list, 1);
> -                       list->type = type;
> -                       strmap_put(&ctx->paths_to_lists, path.buf, list);
> -               }
> -               push_to_stack(ctx, path.buf);
> -
> -               if (!(o->flags & UNINTERESTING))
> -                       list->maybe_interesting = 1;
> +               add_path_to_list(ctx, path.buf, type, &entry.oid,
> +                                !(o->flags & UNINTERESTING));
>
> -               oid_array_append(&list->oids, &entry.oid);
> +               push_to_stack(ctx, path.buf);
>         }
>
>         free_tree_buffer(tree);
> @@ -377,16 +387,9 @@ static int setup_pending_objects(struct path_walk_info *info,
>                         if (!info->trees)
>                                 continue;
>                         if (pending->path) {
> -                               struct type_and_oid_list *list;
>                                 char *path = *pending->path ? xstrfmt("%s/", pending->path)
>                                                             : xstrdup("");
> -                               if (!(list = strmap_get(&ctx->paths_to_lists, path))) {
> -                                       CALLOC_ARRAY(list, 1);
> -                                       list->type = OBJ_TREE;
> -                                       strmap_put(&ctx->paths_to_lists, path, list);
> -                               }
> -                               list->maybe_interesting = 1;
> -                               oid_array_append(&list->oids, &obj->oid);
> +                               add_path_to_list(ctx, path, OBJ_TREE, &obj->oid, 1);
>                                 free(path);
>                         } else {
>                                 /* assume a root tree, such as a lightweight tag. */
> @@ -397,20 +400,10 @@ static int setup_pending_objects(struct path_walk_info *info,
>                 case OBJ_BLOB:
>                         if (!info->blobs)
>                                 continue;
> -                       if (pending->path) {
> -                               struct type_and_oid_list *list;
> -                               char *path = pending->path;
> -                               if (!(list = strmap_get(&ctx->paths_to_lists, path))) {
> -                                       CALLOC_ARRAY(list, 1);
> -                                       list->type = OBJ_BLOB;
> -                                       strmap_put(&ctx->paths_to_lists, path, list);
> -                               }
> -                               list->maybe_interesting = 1;
> -                               oid_array_append(&list->oids, &obj->oid);
> -                       } else {
> -                               /* assume a root tree, such as a lightweight tag. */
> +                       if (pending->path)
> +                               add_path_to_list(ctx, pending->path, OBJ_BLOB, &obj->oid, 1);
> +                       else
>                                 oid_array_append(&tagged_blobs->oids, &obj->oid);
> -                       }
>                         break;
>
>                 case OBJ_COMMIT:
> --
> gitgitgadget

Looks like a straightforward collapsing of common code into a new function.

Copy link

gitgitgadget bot commented Aug 22, 2025

This patch series was integrated into seen via git@6f7d3e1.

Copy link

gitgitgadget bot commented Aug 23, 2025

There was a status update in the "Cooking" section about the branch ds/path-walk-repack-fix on the Git mailing list:

"git repack --path-walk" lost objects in some corner cases, which
has been corrected.

Comments?
source: <[email protected]>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant