fix: Skip callbacks with dead weakrefs while iterating over callbacks #2310

wernerd-cern · 2023-08-31T09:34:50Z

Description

Resolves #2311

The bug fixed by this PR occured in my work in the repository https://gitlab.cern.ch/dawerner/statanabenchmark.
I am currently working on a summer student project at CERN benchmarking statistical analysis tools in python including pyhf.
When performing these benchmarks, pyhf is called multiple times in a row with different backends. In some cases an error occurs then as objects are called through weakrefs that are already dead and thus NoneType objects.

When events (such as 'tensorlib_changed') are called, a bug occured previously where objects (i.e. the TensorViewer) are already dead. In this case an error occured when member functions where using attributes such as self._sorted_indices in src/pyhf/tensor/common.py:33.

This bug is fixed by catching the cases, in which the object is invalidated by earlier callbacks.

Checklist Before Requesting Reviewer

Tests are passing
"WIP" removed from the title of the pull request
Selected an Assignee for the PR to be responsible for the log summary

Before Merging

For the PR Assignees:

Summarize commit messages into a comprehensive review of the PR

* Check refs while processing callbacks in case a callee is de-ref'd during
  the callback process.
* Additionally flush after processing callbacks in case any callback de-ref's
  a callee.
* Note that no additional tests are added for this case as the problem arises from
  and intermediate callback triggering a garbage-collect. It is unclear how to force
  this scenario deterministically in testing.
* Add Jonas Rembser to contributors list.

Co-authored-by: Jonas Rembser <[email protected]>

matthewfeickert

@wernerd-cern Can you please report what the actual bug is and give a reproducer as well for documentation? It would be better to make an Issue that clearly documents the bug and then have this PR close it. This PR will need tests as well.

codecov · 2023-08-31T09:47:48Z

Codecov Report

Patch coverage: 80.00% and project coverage change: +0.66% 🎉

Comparison is base (4954010) 97.61% compared to head (5cb11ae) 98.28%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2310      +/-   ##
==========================================
+ Coverage   97.61%   98.28%   +0.66%     
==========================================
  Files          69       69              
  Lines        4535     4538       +3     
  Branches      802      803       +1     
==========================================
+ Hits         4427     4460      +33     
+ Misses         65       45      -20     
+ Partials       43       33      -10

Flag	Coverage Δ
contrib	`97.86% <80.00%> (?)`
doctest	`60.99% <40.00%> (-0.07%)`	⬇️
unittests-3.10	`96.29% <80.00%> (?)`
unittests-3.11	`96.29% <80.00%> (?)`
unittests-3.8	`96.31% <80.00%> (?)`
unittests-3.9	`96.34% <80.00%> (-0.02%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed	Coverage Δ
src/pyhf/events.py	`98.61% <80.00%> (-1.39%)`	⬇️

... and 5 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

matthewfeickert · 2023-08-31T09:56:50Z

For reviewer reference, the last time this code was edited was in PR #1035.

When events (such as 'tensorlib_changed') are called, a bug occured previously where objects (i.e. the `TensorViewer`) are already dead. In this case an error occured when member functions where using attributes such as `self._sorted_indices`. This bug is fixed by catching the cases, in which the object is invalidated by earlier callbacks.

for more information, see https://pre-commit.ci

matthewfeickert

Seems like this is indeed needed, to be able to do what we do in our tests without providing all of conftest.py's machinery. So this should go in and then a patch release made from it. 👍

This code doesn't have full coverage though, so appropriate tests will need to be added.

Aside: Thanks for making https://gitlab.cern.ch/dawerner/statanabenchmark public. This is a nice project. :)

src/pyhf/events.py

kratsg

This PR doesn't seem to introduce any new functionality that does not already exist. __call__ here in the changes is including changes covered by the _flush function. The _flush() should be called before trying to access the callbacks to determine the valid callbacks first. So I'm not sure how you ran into a situation that's covered by this PR that isn't already covered by the existing code...

kratsg · 2023-09-05T22:47:45Z

In addition, the case you're describing is covered by our tests right now.

matthewfeickert · 2023-09-06T01:10:24Z

I'm going to put this into draft until we can determine if Issue #2310 is actually reproducible in any way that pyhf can support it.

guitargeek · 2023-09-06T08:48:03Z

This PR doesn't seem to introduce any new functionality that does not already exist. __call__ here in the changes is including changes covered by the _flush function. The _flush() should be called before trying to access the callbacks to determine the valid callbacks first. So I'm not sure how you ran into a situation that's covered by this PR that isn't already covered by the existing code...

No, it's actually not the same logic. Sorry, our bad, the comment should have been more precise in the code. Maybe:

        # Flush after calling all the callbacks, not before, as callbacks in the
        # beginning of the iteration might cause new dead arg weakrefs in
        # callbacks that are iterated over later.
        # That's why we check for dead weakrefs in each iteration and
        # flush at the end to avoid redundant dead weakref checking in
        # subsequent calls.

I suspect the problem is that some callbacks delete objects that owns the arguments of other callbacks, which is why flushing only at the beginning is not sufficient.

kratsg · 2023-09-06T14:49:05Z

I suspect the problem is that some callbacks delete objects that owns the arguments of other callbacks, which is why flushing only at the beginning is not sufficient.

Again, this doesn't explain or change the logic. All callbacks are called through this function. Anytime you trigger the callbacks, the _flush() function is called, whose logic was just copied into the __call__() part which is already called by accessing self.callbacks (this got changed in the PR to access self._callbacks instead directly).

Without the changes, the flow is

switch backend
trigger callbacks through __call__(self)
access self.callbacks which triggers self._flush()
self._flush() removes dead refs
return self.callbacks and trigger callbacks on individual refs

With the changes, the flow is

switch backend
trigger callbacks through self.__call__()
access self._callbacks
trigger callbacks on individual refs that are not dead
self._flush() removes dead refs

Either way, __call__ only runs on non-dead-refs. In before these changes, if a callback removes a reference to a weak ref, it will get garbage collected the next time callbacks are triggered. Which is why I'm saying that the changes aren't making a difference. The only thing that could be happening is that garbage-collection runs in between steps 4 and 5 before these changes -- which isn't currently caught, and would explain why you'd only see it if you run heavily enough to force GC to not run as frequently as it might typically do so.

guitargeek · 2023-09-06T15:17:25Z

Thanks for making the bullet point list, maybe it's now easier for me to explain what I think is the crucial difference.

To quote the list you made and annotate with what I think is the flaw in the logic:

switch backend
trigger callbacks through __call__(self)
access self.callbacks which triggers self._flush()
self._flush() removes dead refs
return self.callbacks and iteratively trigger many callbacks on individual refs, where some of the first callbacks in the loop might drop the references on the arguments of callbacks later in the loop. The earlier call to flush has no chance of flushing these

After the change, the flush will be called after iterating over the callbacks, catching any new dead refs that were the results of calling the callbacks.

kratsg · 2023-09-06T15:42:44Z

After the change, the flush will be called after iterating over the callbacks, catching any new dead refs that were the results of calling the callbacks.

Yes, after iterating over all callbacks, not after each one... so before the change is identical to after the change. They will get caught by the next access for callbacks before the change (delayed), but that's still ok. This is why I'm still confused. I don't see a scenario not currently caught by the original code.

guitargeek · 2023-09-06T15:52:38Z

Here is a scenario that would not be caught by the original code:

There are two callbacks in the list:
- func_1 with arg_1. Calling func_1(arg_1) will delete all references to arg_2
- func_2 with arg_2
  Let's assume both arg_1 and arg_2 are not dead initially
We are calling Callables.__call__(), flush does nothing as there are no dead refs to any args
func_1(arg_1) is called and there are no references to arg_2 anymore as a result. arg_2 is dead.
func_2 is called, but arg_2 is now dead, resulting in an error.

I didn't dig into the pyhf code deep enough to exactly tell you what func_1, func_2, arg_1, and arg_2 are, but since this scenario was the only was I could think of to get dead references even after flushing, I assumed that this must be happening somewhere. And indeed, when @wernerd-cern implemented the new logic that catches this case, the errors in the reproducer in #2311 go away.

kratsg · 2023-09-06T15:54:57Z

4. func_2 is called, but arg_2 is now dead, resulting in an error.

Ok, but again, you've been stating that moving the flush to the end catches this case when the original code does not -- this is an incorrect statement.

The difference is that you're skipping deadrefs as you iterate through the callbacks depending on if the ordering of the callbacks causes some refs to go dead during processing.

I bet if you just simply add in logic to skip refs that are dead, and remove all other changes -- that you'll catch exactly this case. Moving around the flush will do nothing though.

guitargeek · 2023-09-06T15:59:21Z

Ok, but again, you've been stating that moving the flush to the end catches this case when the original code does not -- this is an incorrect statement.

No, I never intended to say that (sorry if it came across like that). The position of the flush actually doesn't matter, it's all about the explicit dead arg check in the loop. We only put the flush at the end now to be a bit more efficient: like this, the flush() will also flush the new dead refs that came from calling the callbacks, so that the next time __call__ is called it has to iterate over less elements.

You might as well put the flush also back on top.

But I'm happy we are now on the same page about skipping dead refs 👍 Sorry for not explaining explicitly why we moved the flush to the end.

kratsg

Approved, but with the caveat that this is somewhat impossible to properly force a test force, as it is only catching the scenario of an intermediate callback triggering a garbage-collect.

Co-authored-by: Jonas Rembser <[email protected]>

… debugging

guitargeek · 2023-09-06T19:38:32Z

Thanks, @kratsg and @matthewfeickert !

matthewfeickert

Approved, but with the caveat that this is somewhat impossible to properly force a test force, as it is only catching the scenario of an intermediate callback triggering a garbage-collect.

@kratsg if you are also happy to not attempt to figure out a way to deterministically test this then we can just note it in the commit message (I've updated it) and accept this.

Thanks @wernerd-cern and @guitargeek. I'll prepare pyhf v0.7.4 and get that out before the end of the week.

guitargeek · 2023-09-06T19:42:42Z

As pointed out by @kratsg, also the new title is maybe not completely appropriate because it's about the skipping of dead weakrefs and not the flushing. Maybe fix: Skip callbacks with dead weakrefs while iterating over callbacks would be more appropriate. But not that important after all :)

… callbacks (#2321) * Backport PR #2310 * Check refs while processing callbacks in case a callee is de-ref'd during the callback process. * Additionally flush after processing callbacks in case any callback de-ref's a callee. * Note that no additional tests are added for this case as the problem arises from and intermediate callback triggering a garbage-collect. It is unclear how to force this scenario deterministically in testing. * Add Jonas Rembser to contributors list. Co-authored-by: Daniel Werner <[email protected]> Co-authored-by: Jonas Rembser <[email protected]>

matthewfeickert · 2023-09-07T04:01:08Z

@wernerd-cern @guitargeek pyhf v0.7.4 is now out on PyPI with this fix. Thanks!

wernerd-cern changed the title ~~Fix: Dead objects being called through weakrefs~~ fix: Dead objects being called through weakrefs Aug 31, 2023

matthewfeickert assigned wernerd-cern Aug 31, 2023

matthewfeickert added the fix A bug fix label Aug 31, 2023

matthewfeickert requested a review from kratsg August 31, 2023 09:43

matthewfeickert requested changes Aug 31, 2023

View reviewed changes

wernerd-cern mentioned this pull request Aug 31, 2023

weakrefs to dead objects occuring when changing backends in pyhf benchmark #2311

Closed

1 task

wernerd-cern marked this pull request as ready for review August 31, 2023 14:51

wernerd-cern and others added 2 commits September 5, 2023 12:07

[pre-commit.ci] auto fixes from pre-commit.com hooks

b138897

for more information, see https://pre-commit.ci

matthewfeickert force-pushed the fix_dead_weakrefs branch from 08cf346 to b138897 Compare September 5, 2023 17:07

shorten summary

6f95bca

matthewfeickert requested changes Sep 5, 2023

View reviewed changes

src/pyhf/events.py Show resolved Hide resolved

kratsg requested changes Sep 5, 2023

View reviewed changes

matthewfeickert marked this pull request as draft September 6, 2023 01:10

kratsg marked this pull request as ready for review September 6, 2023 18:46

Merge branch 'main' into fix_dead_weakrefs

708fab7

kratsg approved these changes Sep 6, 2023

View reviewed changes

kratsg self-assigned this Sep 6, 2023

Add more detailed comment

bc288e9

Co-authored-by: Jonas Rembser <[email protected]>

Add Jonas to contributors given the level of detailed explentation in…

5cb11ae

… debugging

matthewfeickert changed the title ~~fix: Dead objects being called through weakrefs~~ fix: Flush after calling callbacks to avoid dead weakrefs Sep 6, 2023

matthewfeickert approved these changes Sep 6, 2023

View reviewed changes

matthewfeickert changed the title ~~fix: Flush after calling callbacks to avoid dead weakrefs~~ fix: Skip callbacks with dead weakrefs while iterating over callbacks Sep 6, 2023

matthewfeickert merged commit 1d43c7e into scikit-hep:main Sep 6, 2023
20 of 21 checks passed

matthewfeickert added need-to-backport tmp label until can be backported to patch release branch docs Documentation related labels Sep 6, 2023

matthewfeickert mentioned this pull request Sep 6, 2023

fix(backport): Skip callbacks with dead weakrefs while iterating over callbacks #2321

Merged

4 tasks

matthewfeickert removed the need-to-backport tmp label until can be backported to patch release branch label Sep 6, 2023

wernerd-cern deleted the fix_dead_weakrefs branch September 8, 2023 07:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Skip callbacks with dead weakrefs while iterating over callbacks #2310

fix: Skip callbacks with dead weakrefs while iterating over callbacks #2310

wernerd-cern commented Aug 31, 2023 •

edited by matthewfeickert

Loading

matthewfeickert left a comment •

edited

Loading

codecov bot commented Aug 31, 2023 •

edited

Loading

matthewfeickert commented Aug 31, 2023

matthewfeickert left a comment •

edited

Loading

kratsg left a comment

kratsg commented Sep 5, 2023

matthewfeickert commented Sep 6, 2023 •

edited

Loading

guitargeek commented Sep 6, 2023

kratsg commented Sep 6, 2023

guitargeek commented Sep 6, 2023 •

edited

Loading

kratsg commented Sep 6, 2023

guitargeek commented Sep 6, 2023 •

edited

Loading

kratsg commented Sep 6, 2023

guitargeek commented Sep 6, 2023 •

edited

Loading

kratsg left a comment

guitargeek commented Sep 6, 2023

matthewfeickert left a comment

guitargeek commented Sep 6, 2023

matthewfeickert commented Sep 7, 2023

fix: Skip callbacks with dead weakrefs while iterating over callbacks #2310

fix: Skip callbacks with dead weakrefs while iterating over callbacks #2310

Conversation

wernerd-cern commented Aug 31, 2023 • edited by matthewfeickert Loading

Description

Checklist Before Requesting Reviewer

Before Merging

matthewfeickert left a comment • edited Loading

Choose a reason for hiding this comment

codecov bot commented Aug 31, 2023 • edited Loading

Codecov Report

matthewfeickert commented Aug 31, 2023

matthewfeickert left a comment • edited Loading

Choose a reason for hiding this comment

kratsg left a comment

Choose a reason for hiding this comment

kratsg commented Sep 5, 2023

matthewfeickert commented Sep 6, 2023 • edited Loading

guitargeek commented Sep 6, 2023

kratsg commented Sep 6, 2023

guitargeek commented Sep 6, 2023 • edited Loading

kratsg commented Sep 6, 2023

guitargeek commented Sep 6, 2023 • edited Loading

kratsg commented Sep 6, 2023

guitargeek commented Sep 6, 2023 • edited Loading

kratsg left a comment

Choose a reason for hiding this comment

guitargeek commented Sep 6, 2023

matthewfeickert left a comment

Choose a reason for hiding this comment

guitargeek commented Sep 6, 2023

matthewfeickert commented Sep 7, 2023

wernerd-cern commented Aug 31, 2023 •

edited by matthewfeickert

Loading

matthewfeickert left a comment •

edited

Loading

codecov bot commented Aug 31, 2023 •

edited

Loading

matthewfeickert left a comment •

edited

Loading

matthewfeickert commented Sep 6, 2023 •

edited

Loading

guitargeek commented Sep 6, 2023 •

edited

Loading

guitargeek commented Sep 6, 2023 •

edited

Loading

guitargeek commented Sep 6, 2023 •

edited

Loading