
Crash reports sent only after second launch, not immediately after app restart #1114

Open · 1 of 3 tasks

dalnoki opened this issue Jan 14, 2025 · 8 comments



dalnoki commented Jan 14, 2025

Description

A customer is encountering an issue with the timing of crash report submissions in a Qt-based application using Sentry Native with Breakpad as the backend. When the application crashes, they use the on_crash hook to restart the app immediately. However, crash reports are only appearing in Sentry after the second launch of the application, not after the app is restarted due to the crash. This delay in sending crash reports is problematic, as the app runs as a background service and relying on users to manually restart it would result in unacceptable delays in receiving crash reports.

When does the problem happen

  • During build
  • During run-time
  • When capturing a hard crash

Environment

  • OS: macOS Sequoia 15.2 (24C101)
  • Compiler: Clang (C++, arm64)
  • CMake version and config: 3.28.1 Debug config

Steps To Reproduce

Please find a video recording of the issue in the shadow Jira ticket. The issue is reproducible on the latest SDK version as well.

@supervacuus
Collaborator

Hi @dalnoki. Thanks for the report!

> When the application crashes, they use the on_crash hook to restart the app immediately.

No crash report will be written to disk if the user terminates/restarts the program from the on_crash hook. This means the Native SDK's transport won't have a report to send on the next start. The on_crash callback must return so the crash handler can finish processing. At the point where the handler invokes the on_crash hook, no minidump has been generated yet, and only a preliminary crash event exists in memory. I am surprised the user can see any crash report at all.
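For illustration, here is a minimal sketch of an on_crash hook against the sentry-native API (the DSN is a placeholder, and the hook signature is as documented for recent SDK versions). The key point is that the callback returns the event instead of terminating the process:

```c
#include <sentry.h>

/* Sketch: an on_crash hook must NOT exec/exit; it should return the event
 * so the Breakpad backend can write the minidump and the envelope. */
static sentry_value_t
on_crash_cb(const sentry_ucontext_t *uctx, sentry_value_t event, void *closure)
{
    (void)uctx;
    (void)closure;
    /* Do only minimal, async-signal-safe work here (e.g. set a flag or
     * touch a marker file). Let an external supervisor such as launchd
     * handle the restart instead of restarting from inside the hook. */
    return event; /* returning lets the crash handler finish processing */
}

int main(void)
{
    sentry_options_t *options = sentry_options_new();
    sentry_options_set_dsn(options, "https://examplePublicKey@o0.ingest.sentry.io/0");
    sentry_options_set_on_crash(options, on_crash_cb, NULL);
    sentry_init(options);
    /* ... application code ... */
    sentry_close();
    return 0;
}
```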

I recommend using macOS' service manager, launchd, to restart the application whenever it crashes, and decoupling that mechanism from the on_crash hook, which cannot be used to terminate the application. In particular, the KeepAlive job property will restart the application whenever it terminates.
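For reference, a minimal launchd job sketch using KeepAlive (the label and program path are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.example.myservice</string> <!-- placeholder label -->
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/myservice</string> <!-- placeholder path -->
    </array>
    <!-- Restart the job whenever it exits, including after a crash -->
    <key>KeepAlive</key>
    <true/>
</dict>
</plist>
```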

@dalnoki
Author

dalnoki commented Jan 15, 2025

Hey @supervacuus, the customer shared the following: "Also worth mentioning that we can reproduce the issue on Ubuntu as well. I want to highlight that disabling the reboot does not solve the issue. We tried this at an early stage and it did not work. And we are spawning a completely new instance of our app, so it should not interfere with the Sentry SDK's work."

@oa-mega

oa-mega commented Jan 15, 2025

Hi @supervacuus ,

I want to highlight that I can see a ".envelope" file created in the database path after the crash. However, for some reason it does not get picked up and sent on the run immediately after the crash (regardless of whether our reboot hook is enabled or disabled), but rather on the one after it.

@supervacuus
Collaborator

> I want to highlight that I can see a ".envelope" file created in the database path after the crash. However, for some reason it does not get picked up and sent on the run immediately after the crash (regardless of whether our reboot hook is enabled or disabled), but rather on the one after it.

Thanks. If an envelope file was created, the crash handler was able to finish processing. Otherwise, you'd probably only see a .dmp file in the run directory.

> And we are spawning a completely new instance of our app, so it should not interfere with the Sentry SDK's work.

One reason for the described behavior could be that a previous run's lock on the crashed run is still held. We introduced file locks so that multiple processes can share a database path without interfering with each other's runs (i.e., each run directory stays locked as long as a process using the Native SDK keeps its file descriptor alive).

Is it possible that the crashed process is still running while you initialize the Native SDK in the newly started process? This would explain why the crashed run doesn't get picked up on the start "following" the crash but does get picked up on the next one. Keep in mind that the file locks depend on the file descriptors being released. This means that any forked/spawned process that inherits file descriptors from the main process could hold on to them longer than you'd expect.

@oa-mega

oa-mega commented Jan 23, 2025

Thank you for the reply!
After some investigation, we found that we do have a process that outlives the crashed process, and it is what's causing the issue.

Does it mean there's no way to reboot our application from the crashed process? Can we force the Sentry SDK to process these envelopes after the initialization?

@supervacuus
Collaborator

> After some investigation, we do have a process that outlives the crashed process, and it is what's causing the issue.

I am glad you found the cause.

> Does it mean there's no way to reboot our application from the crashed process?

The problem is not rebooting your application, but keeping the crashed process running while you initialize the Native SDK in another process that shares the same database path, and expecting that process to handle the crashed run while the crashed process is still alive. That is a race condition. You can either delay initialization of the Native SDK until the crashed process has fully terminated, or accept that the crash will be sent on the next start.

> Can we force the Sentry SDK to process these envelopes after the initialization?

That is currently not possible, and exposing it could also change the semantics of last_crash, so I am unsure whether it will ever be exposed: sentry_init() acts as a transactional boundary for all previous runs and for last_crash.

@oa-mega

oa-mega commented Feb 4, 2025

It was indeed the case, and as a workaround we managed to identify the file descriptor causing the issue and change its flags to add FD_CLOEXEC using fcntl. This worked perfectly on macOS. However, our application targets all major desktop platforms, and the same solution did not work on Ubuntu. Are there other descriptors/locks that Sentry uses on Linux?
Any idea what might cause this behavior?

@supervacuus
Collaborator

> Are there other descriptors/locks that Sentry uses on Linux?

No, the entire underlying file-locking implementation is shared across all supported UNIXes. It is also the only file lock in the SDK; no other file descriptors should block the sending of persisted envelopes. Do you have any debug logs, or a quick backtrace taken in a debugger against the blocked process, that could verify you are stuck in the same place?

> Any idea what might cause this behavior?

If there is still an active file lock on Ubuntu, then without knowing anything about your application, I can only imagine two causes:

  • since the FD_CLOEXEC bit only affects the exec() family of functions, maybe your Ubuntu implementation uses other means of starting processes (fork or clone without a subsequent exec()) that inherit file descriptors?
  • if you duplicate file descriptors in your Linux implementation for some reason, those duplicates don't inherit FD_CLOEXEC (dup3 with O_CLOEXEC being the exception), leaving the file lock active via the duplicate without further intervention.
