Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change lldb breakpoint and stepping algorithm #10026

Open
wants to merge 8 commits into
base: stable/20240723
Choose a base branch
from

Conversation

jasonmolenda
Copy link

No description provided.

jasonmolenda and others added 6 commits February 13, 2025 13:41
xusheng added support for swbreak/hwbreak a month ago, and no special
support was needed in ProcessGDBRemote when they're received because
lldb already marks a thread as having hit a breakpoint when it stops at
a breakpoint site. However, with changes I am working on, we need to
know the real stop reason a thread stopped or the breakpoint hit will
not be recognized.

This is similar to how lldb processes the "watch/rwatch/awatch" keys in
a thread stop packet -- we set the `reason` to `watchpoint`, and these
set it to `breakpoint` so we set the stop reason correctly later in
these methods.

(cherry picked from commit 65a4d11)
lldb-server built with NativeProcessLinux.cpp and
NativeProcessFreeBSD.cpp can use breakpoints to implement instruction
stepping on cores where there is no native instruction-step primitive.
Currently these set a breakpoint, continue, and if we hit the breakpoint
with the original thread, set the stop reason to be "trace".

I am wrapping up a change to lldb's breakpoint algorithm where I change
its current behavior of

"if a thread stops at a breakpoint site, we set
the thread's stop reason to breakpoint-hit, even if the breakpoint
hasn't been executed" +
"when resuming any thread at a breakpoint site, instruction-step past
the breakpoint before resuming"

to a behavior of

"when a thread executes a breakpoint, set the stop reason to
breakpoint-hit" +
"when a thread has hit a breakpoint, when the thread resumes, we
silently step past the breakpoint and then resume the thread".

For these lldb-server targets doing breakpoint stepping, this means that
if we are sitting on a breakpoint that has not yet executed, and
instruction-step the thread, we will execute the breakpoint instruction
at $pc (instead of $next-pc where it meant to go), and stop again -- at
the same pc value. Then we will rewrite the stop reason to 'trace'. The
higher level logic will see that we haven't hit the breakpoint
instruction again, so it will try to instruction step again, hitting the
breakpoint again forever.

To fix this, I'm checking that the thread matches the one we are
instruction-stepping-by-breakpoint AND that we've stopped at the
breakpoint address we are stepping to. Only in that case will the stop
reason be rewritten to "trace" hiding the implementation detail that the
step was done by breakpoints.

(cherry picked from commit 213c59d)
…lvm#109643)

Apparently a typo is causing compile error, added by llvm#108504.

(cherry picked from commit 85220a0)
lldb will change how it reports stop reasons around breakpoints in the
near future. I landed an earlier version of this change and noticed
debuginfo test failures on the CI bots due to the changes. I'm
addressing the issues found by CI at
llvm#105594 and will re-land once
I've done all of them.

Currently, when lldb stops at a breakpoint instruction -- but has not
yet executed the instruction -- it will overwrite the thread's Stop
Reason with "breakpoint-hit". This caused bugs when the original stop
reason was important to the user - for instance, a watchpoint on an
AArch64 system where we have to instruction-step past the watchpoint to
find the new value. Normally we would instruction step, fetch the new
value, then report the user that a watchpoint has been hit with the old
and new values. But if the instruction after this access is a breakpoint
site, we overwrite the "watchpoint hit" stop reason (and related
actions) with "breakpoint hit".

dexter sets breakpoints on all source lines, then steps line-to-line,
hitting the breakpoints. But with this new behavior, we see two steps
per source line: The first step gets us to the start of the next line,
with a "step completed" stop reason. Then we step again and we execute
the breakpoint instruction, stop with the pc the same, and report
"breakpoint hit". Now we can step a second time and move past the
breakpoint.

I've changed the `step` method in LLDB.py to check if we step to a
breakpoint site but have a "step completed" stop reason -- in which case
we have this new breakpoint behavior, and we need to step a second time
to actually hit the breakpoint like the debuginfo tests expect.

(cherry picked from commit 93e45a6)
lldb today has two rules: When a thread stops at a BreakpointSite, we
set the thread's StopReason to be "breakpoint hit" (regardless if we've
actually hit the breakpoint, or if we've merely stopped *at* the
breakpoint instruction/point and haven't tripped it yet). And second,
when resuming a process, any thread sitting at a BreakpointSite is
silently stepped over the BreakpointSite -- because we've already
flagged the breakpoint hit when we stopped there originally.

In this patch, I change lldb to only set a thread's stop reason to
breakpoint-hit when we've actually executed the instruction/triggered
the breakpoint. When we resume, we only silently step past a
BreakpointSite that we've registered as hit. We preserve this state
across inferior function calls that the user may do while stopped, etc.

Also, when a user adds a new breakpoint at $pc while stopped, or changes
$pc to be the address of a BreakpointSite, we will silently step past
that breakpoint when the process resumes. This is purely a UX call, I
don't think there's any person who wants to set a breakpoint at $pc and
then hit it immediately on resuming.

One non-intuitive UX from this change, butt is necessary: If you're
stopped at a BreakpointSite that has not yet executed, you `stepi`, you
will hit the breakpoint and the pc will not yet advance. This thread has
not completed its stepi, and the ThreadPlanStepInstruction is still on
the stack. If you then `continue` the thread, lldb will now stop and
say, "instruction step completed", one instruction past the
BreakpointSite. You can continue a second time to resume execution.

The bugs driving this change are all from lldb dropping the real stop
reason for a thread and setting it to breakpoint-hit when that was not
the case. Jim hit one where we have an aarch64 watchpoint that triggers
one instruction before a BreakpointSite. On this arch we are notified of
the watchpoint hit after the instruction has been unrolled -- we disable
the watchpoint, instruction step, re-enable the watchpoint and collect
the new value. But now we're on a BreakpointSite so the watchpoint-hit
stop reason is lost.

Another was reported by ZequanWu in
https://discourse.llvm.org/t/lldb-unable-to-break-at-start/78282 we
attach to/launch a process with the pc at a BreakpointSite and
misbehave. Caroline Tice mentioned it is also a problem they've had with
putting a breakpoint on _dl_debug_state.

The change to each Process plugin that does execution control is that

1. If we've stopped at a BreakpointSite that has not been executed yet,
we will call Thread::SetThreadStoppedAtUnexecutedBP(pc) to record that.
When the thread resumes, if the pc is still at the same site, we will
continue, hit the breakpoint, and stop again.

2. When we've actually hit a breakpoint (enabled for this thread or
not), the Process plugin should call
Thread::SetThreadHitBreakpointSite(). When we go to resume the thread,
we will push a step-over-breakpoint ThreadPlan before resuming.

The biggest set of changes is to StopInfoMachException where we
translate a Mach Exception into a stop reason. The Mach exception codes
differ in a few places depending on the target (unambiguously), and I
didn't want to duplicate the new code for each target so I've tested
what mach exceptions we get for each action on each target, and
reorganized StopInfoMachException::CreateStopReasonWithMachException to
document these possible values, and handle them without specializing
based on the target arch.

I first landed this patch in July 2024 via
llvm#96260

but the CI bots and wider testing found a number of test case failures
that needed to be updated, I reverted it. I've fixed all of those issues
in separate PRs and this change should run cleanly on all the CI bots
now.

rdar://123942164
(cherry picked from commit b666ac3)
@jasonmolenda jasonmolenda requested a review from a team as a code owner February 13, 2025 23:34
@jasonmolenda
Copy link
Author

@swift-ci test

@jasonmolenda
Copy link
Author

Windows CI had one failure on TestConsecutiveBreakpoints,

======================================================================

FAIL: test_single_step_thread_specific (TestConsecutiveBreakpoints.ConsecutiveBreakpointsTestCase)
   Test that single step stops, even though the second breakpoint is not valid.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\swift-ci\jenkins\workspace\apple-llvm-project-pull-request-windows\llvm-project\lldb\packages\Python\lldbsuite\test\decorators.py", line 452, in wrapper
    return func(self, *args, **kwargs)
  File "C:\Users\swift-ci\jenkins\workspace\apple-llvm-project-pull-request-windows\llvm-project\lldb\test\API\functionalities\breakpoint\consecutive_breakpoints\TestConsecutiveBreakpoints.py", line 121, in test_single_step_thread_specific
    self.finish_test()
  File "C:\Users\swift-ci\jenkins\workspace\apple-llvm-project-pull-request-windows\llvm-project\lldb\test\API\functionalities\breakpoint\consecutive_breakpoints\TestConsecutiveBreakpoints.py", line 42, in finish_test
    self.assertState(self.process.GetState(), lldb.eStateExited)
  File "C:\Users\swift-ci\jenkins\workspace\apple-llvm-project-pull-request-windows\llvm-project\lldb\packages\Python\lldbsuite\test\lldbtest.py", line 2590, in assertState
    self.fail(self._formatMessage(msg, error))
AssertionError: stopped (5) != exited (10)
Config=x86_64-T:\5\bin\clang.exe

FAILED (failures=1)

which is the kind of test that would detect an actual regression, will try re-running once again but I might have to stare at ProcessWindowsNative and this test and see if I can't imagine what is happening.

@jasonmolenda
Copy link
Author

@swift-ci test windows

@jasonmolenda
Copy link
Author

@swift-ci test windows

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants