Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Skip callbacks with dead weakrefs while iterating over callbacks #2310

Merged
merged 6 commits into from
Sep 6, 2023

Conversation

wernerd-cern
Copy link
Contributor

@wernerd-cern wernerd-cern commented Aug 31, 2023

Description

Resolves #2311

The bug fixed by this PR occured in my work in the repository https://gitlab.cern.ch/dawerner/statanabenchmark.
I am currently working on a summer student project at CERN benchmarking statistical analysis tools in python including pyhf.
When performing these benchmarks, pyhf is called multiple times in a row with different backends. In some cases an error occurs then as objects are called through weakrefs that are already dead and thus NoneType objects.

When events (such as 'tensorlib_changed') are called, a bug occured previously where objects (i.e. the TensorViewer) are already dead. In this case an error occured when member functions where using attributes such as self._sorted_indices in src/pyhf/tensor/common.py:33.

This bug is fixed by catching the cases, in which the object is invalidated by earlier callbacks.

Checklist Before Requesting Reviewer

  • Tests are passing
  • "WIP" removed from the title of the pull request
  • Selected an Assignee for the PR to be responsible for the log summary

Before Merging

For the PR Assignees:

  • Summarize commit messages into a comprehensive review of the PR
* Check refs while processing callbacks in case a callee is de-ref'd during
  the callback process.
* Additionally flush after processing callbacks in case any callback de-ref's
  a callee.
* Note that no additional tests are added for this case as the problem arises from
  and intermediate callback triggering a garbage-collect. It is unclear how to force
  this scenario deterministically in testing.
* Add Jonas Rembser to contributors list.

Co-authored-by: Jonas Rembser <[email protected]>

@wernerd-cern wernerd-cern changed the title Fix: Dead objects being called through weakrefs fix: Dead objects being called through weakrefs Aug 31, 2023
@matthewfeickert matthewfeickert added the fix A bug fix label Aug 31, 2023
@matthewfeickert matthewfeickert requested a review from kratsg August 31, 2023 09:43
Copy link
Member

@matthewfeickert matthewfeickert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wernerd-cern Can you please report what the actual bug is and give a reproducer as well for documentation? It would be better to make an Issue that clearly documents the bug and then have this PR close it. This PR will need tests as well.

@codecov
Copy link

codecov bot commented Aug 31, 2023

Codecov Report

Patch coverage: 80.00% and project coverage change: +0.66% 🎉

Comparison is base (4954010) 97.61% compared to head (5cb11ae) 98.28%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2310      +/-   ##
==========================================
+ Coverage   97.61%   98.28%   +0.66%     
==========================================
  Files          69       69              
  Lines        4535     4538       +3     
  Branches      802      803       +1     
==========================================
+ Hits         4427     4460      +33     
+ Misses         65       45      -20     
+ Partials       43       33      -10     
Flag Coverage Δ
contrib 97.86% <80.00%> (?)
doctest 60.99% <40.00%> (-0.07%) ⬇️
unittests-3.10 96.29% <80.00%> (?)
unittests-3.11 96.29% <80.00%> (?)
unittests-3.8 96.31% <80.00%> (?)
unittests-3.9 96.34% <80.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
src/pyhf/events.py 98.61% <80.00%> (-1.39%) ⬇️

... and 5 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@matthewfeickert
Copy link
Member

For reviewer reference, the last time this code was edited was in PR #1035.

wernerd-cern and others added 2 commits September 5, 2023 12:07
When events (such as 'tensorlib_changed') are called, a bug occured
previously where objects (i.e. the `TensorViewer`) are already dead.
In this case an error occured when member functions where using
attributes such as `self._sorted_indices`.

This bug is fixed by catching the cases, in which the object is
invalidated by earlier callbacks.
Copy link
Member

@matthewfeickert matthewfeickert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this is indeed needed, to be able to do what we do in our tests without providing all of conftest.py's machinery. So this should go in and then a patch release made from it. 👍

This code doesn't have full coverage though, so appropriate tests will need to be added.


Aside: Thanks for making https://gitlab.cern.ch/dawerner/statanabenchmark public. This is a nice project. :)

src/pyhf/events.py Show resolved Hide resolved
Copy link
Contributor

@kratsg kratsg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR doesn't seem to introduce any new functionality that does not already exist. __call__ here in the changes is including changes covered by the _flush function. The _flush() should be called before trying to access the callbacks to determine the valid callbacks first. So I'm not sure how you ran into a situation that's covered by this PR that isn't already covered by the existing code...

@kratsg
Copy link
Contributor

kratsg commented Sep 5, 2023

In addition, the case you're describing is covered by our tests right now.

@matthewfeickert
Copy link
Member

matthewfeickert commented Sep 6, 2023

I'm going to put this into draft until we can determine if Issue #2310 is actually reproducible in any way that pyhf can support it.

@matthewfeickert matthewfeickert marked this pull request as draft September 6, 2023 01:10
@guitargeek
Copy link
Contributor

This PR doesn't seem to introduce any new functionality that does not already exist. __call__ here in the changes is including changes covered by the _flush function. The _flush() should be called before trying to access the callbacks to determine the valid callbacks first. So I'm not sure how you ran into a situation that's covered by this PR that isn't already covered by the existing code...

No, it's actually not the same logic. Sorry, our bad, the comment should have been more precise in the code. Maybe:

        # Flush after calling all the callbacks, not before, as callbacks in the
        # beginning of the iteration might cause new dead arg weakrefs in
        # callbacks that are iterated over later.
        # That's why we check for dead weakrefs in each iteration and
        # flush at the end to avoid redundant dead weakref checking in
        # subsequent calls.

I suspect the problem is that some callbacks delete objects that owns the arguments of other callbacks, which is why flushing only at the beginning is not sufficient.

@kratsg
Copy link
Contributor

kratsg commented Sep 6, 2023

I suspect the problem is that some callbacks delete objects that owns the arguments of other callbacks, which is why flushing only at the beginning is not sufficient.

Again, this doesn't explain or change the logic. All callbacks are called through this function. Anytime you trigger the callbacks, the _flush() function is called, whose logic was just copied into the __call__() part which is already called by accessing self.callbacks (this got changed in the PR to access self._callbacks instead directly).

Without the changes, the flow is

  1. switch backend
  2. trigger callbacks through __call__(self)
  3. access self.callbacks which triggers self._flush()
  4. self._flush() removes dead refs
  5. return self.callbacks and trigger callbacks on individual refs

With the changes, the flow is

  1. switch backend
  2. trigger callbacks through self.__call__()
  3. access self._callbacks
  4. trigger callbacks on individual refs that are not dead
  5. self._flush() removes dead refs

Either way, __call__ only runs on non-dead-refs. In before these changes, if a callback removes a reference to a weak ref, it will get garbage collected the next time callbacks are triggered. Which is why I'm saying that the changes aren't making a difference. The only thing that could be happening is that garbage-collection runs in between steps 4 and 5 before these changes -- which isn't currently caught, and would explain why you'd only see it if you run heavily enough to force GC to not run as frequently as it might typically do so.

@guitargeek
Copy link
Contributor

guitargeek commented Sep 6, 2023

Thanks for making the bullet point list, maybe it's now easier for me to explain what I think is the crucial difference.

To quote the list you made and annotate with what I think is the flaw in the logic:

  • switch backend

  • trigger callbacks through __call__(self)

  • access self.callbacks which triggers self._flush()

  • self._flush() removes dead refs

  • return self.callbacks and iteratively trigger many callbacks on individual refs, where some of the first callbacks in the loop might drop the references on the arguments of callbacks later in the loop. The earlier call to flush has no chance of flushing these

After the change, the flush will be called after iterating over the callbacks, catching any new dead refs that were the results of calling the callbacks.

@kratsg
Copy link
Contributor

kratsg commented Sep 6, 2023

After the change, the flush will be called after iterating over the callbacks, catching any new dead refs that were the results of calling the callbacks.

Yes, after iterating over all callbacks, not after each one... so before the change is identical to after the change. They will get caught by the next access for callbacks before the change (delayed), but that's still ok. This is why I'm still confused. I don't see a scenario not currently caught by the original code.

@guitargeek
Copy link
Contributor

guitargeek commented Sep 6, 2023

Here is a scenario that would not be caught by the original code:

  1. There are two callbacks in the list:

    • func_1 with arg_1. Calling func_1(arg_1) will delete all references to arg_2
    • func_2 with arg_2
      Let's assume both arg_1 and arg_2 are not dead initially
  2. We are calling Callables.__call__(), flush does nothing as there are no dead refs to any args

  3. func_1(arg_1) is called and there are no references to arg_2 anymore as a result. arg_2 is dead.

  4. func_2 is called, but arg_2 is now dead, resulting in an error.

I didn't dig into the pyhf code deep enough to exactly tell you what func_1, func_2, arg_1, and arg_2 are, but since this scenario was the only was I could think of to get dead references even after flushing, I assumed that this must be happening somewhere. And indeed, when @wernerd-cern implemented the new logic that catches this case, the errors in the reproducer in #2311 go away.

@kratsg
Copy link
Contributor

kratsg commented Sep 6, 2023

4. func_2 is called, but arg_2 is now dead, resulting in an error.

Ok, but again, you've been stating that moving the flush to the end catches this case when the original code does not -- this is an incorrect statement.

The difference is that you're skipping deadrefs as you iterate through the callbacks depending on if the ordering of the callbacks causes some refs to go dead during processing.

I bet if you just simply add in logic to skip refs that are dead, and remove all other changes -- that you'll catch exactly this case. Moving around the flush will do nothing though.

@guitargeek
Copy link
Contributor

guitargeek commented Sep 6, 2023

Ok, but again, you've been stating that moving the flush to the end catches this case when the original code does not -- this is an incorrect statement.

No, I never intended to say that (sorry if it came across like that). The position of the flush actually doesn't matter, it's all about the explicit dead arg check in the loop. We only put the flush at the end now to be a bit more efficient: like this, the flush() will also flush the new dead refs that came from calling the callbacks, so that the next time __call__ is called it has to iterate over less elements.

You might as well put the flush also back on top.

But I'm happy we are now on the same page about skipping dead refs 👍 Sorry for not explaining explicitly why we moved the flush to the end.

@kratsg kratsg marked this pull request as ready for review September 6, 2023 18:46
Copy link
Contributor

@kratsg kratsg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved, but with the caveat that this is somewhat impossible to properly force a test force, as it is only catching the scenario of an intermediate callback triggering a garbage-collect.

@kratsg kratsg self-assigned this Sep 6, 2023
Co-authored-by: Jonas Rembser <[email protected]>
@guitargeek
Copy link
Contributor

Thanks, @kratsg and @matthewfeickert !

@matthewfeickert matthewfeickert changed the title fix: Dead objects being called through weakrefs fix: Flush after calling callbacks to avoid dead weakrefs Sep 6, 2023
Copy link
Member

@matthewfeickert matthewfeickert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved, but with the caveat that this is somewhat impossible to properly force a test force, as it is only catching the scenario of an intermediate callback triggering a garbage-collect.

@kratsg if you are also happy to not attempt to figure out a way to deterministically test this then we can just note it in the commit message (I've updated it) and accept this.

Thanks @wernerd-cern and @guitargeek. I'll prepare pyhf v0.7.4 and get that out before the end of the week.

@guitargeek
Copy link
Contributor

As pointed out by @kratsg, also the new title is maybe not completely appropriate because it's about the skipping of dead weakrefs and not the flushing. Maybe fix: Skip callbacks with dead weakrefs while iterating over callbacks would be more appropriate. But not that important after all :)

@matthewfeickert matthewfeickert changed the title fix: Flush after calling callbacks to avoid dead weakrefs fix: Skip callbacks with dead weakrefs while iterating over callbacks Sep 6, 2023
@matthewfeickert matthewfeickert merged commit 1d43c7e into scikit-hep:main Sep 6, 2023
20 of 21 checks passed
@matthewfeickert matthewfeickert added need-to-backport tmp label until can be backported to patch release branch docs Documentation related labels Sep 6, 2023
@matthewfeickert matthewfeickert removed the need-to-backport tmp label until can be backported to patch release branch label Sep 6, 2023
matthewfeickert added a commit that referenced this pull request Sep 6, 2023
… callbacks (#2321)

* Backport PR #2310
* Check refs while processing callbacks in case a callee is de-ref'd during
  the callback process.
* Additionally flush after processing callbacks in case any callback de-ref's
  a callee.
* Note that no additional tests are added for this case as the problem arises from
  and intermediate callback triggering a garbage-collect. It is unclear how to force
  this scenario deterministically in testing.
* Add Jonas Rembser to contributors list.

Co-authored-by: Daniel Werner <[email protected]>
Co-authored-by: Jonas Rembser <[email protected]>
@matthewfeickert
Copy link
Member

@wernerd-cern @guitargeek pyhf v0.7.4 is now out on PyPI with this fix. Thanks!

@wernerd-cern wernerd-cern deleted the fix_dead_weakrefs branch September 8, 2023 07:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs Documentation related fix A bug fix
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

weakrefs to dead objects occuring when changing backends in pyhf benchmark
4 participants