@untitaker
Member

@untitaker untitaker commented Dec 2, 2025

Implement a background thread that dumps all stacktraces when the main
thread gets stuck. Log as a warning, so that the message makes it to
Sentry too.

Variations of this are:

  • Dump stack trace during consumer shutdown sentry#100857 -- unlike that PR,
    this one can run enabled in all consumers, since it only reports
    stacktraces when we're actually stuck.
  • wip: Stuck detector #442 -- this is a previous
    version that only reported on the main thread, and in an overly
    complicated manner. We cannot use faulthandler because that one can
    only report to files (that have a real fileno, so no StringIO), and
    I want to report the stuck consumers to logging/Sentry.

@untitaker untitaker marked this pull request as ready for review December 2, 2025 12:29
@untitaker untitaker requested review from a team as code owners December 2, 2025 12:29
@untitaker
Member Author

doc CI fix: #513

Comment on lines 394 to 398
if self.__stuck:
    i += 1
else:
    i = 0
    self.__stuck = True

This comment was marked as outdated.

Contributor

@fpacifici fpacifici left a comment

Please address the comments before merging

i = 0
self.__stuck = True

if i >= stuck_detector_timeout:
Contributor

Why are you comparing a counter to a timeout in milliseconds? I think you should get the current time and check whether it is greater than the previous time you logged the threads.

Member Author

I sleep one second per iteration, so the counter is roughly counting the number of seconds since the last time stuck was set to false. I can adjust it to use time instead.

    i += 1
else:
    i = 0
    self.__stuck = True
Contributor

I'd recommend using the timestamp of the last call to _run to decide whether the consumer is stuck, rather than repeatedly resetting the stuck flag to true and false. I think the behavior would be more obvious:

  • Set last_run to time.time() at initialization
  • Reset it every time _run is called
  • in this thread just check every second whether last_run + timeout < time()
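
The three steps above can be sketched roughly as follows. This is an illustration only, not the code that was merged: the `StuckDetector` class, `mark_run`, `check_once`, and the injectable `clock` parameter are all hypothetical names chosen for the example.

```python
import threading
import time
from typing import Callable


class StuckDetector:
    """Hypothetical helper; names are illustrative and not taken from the PR."""

    def __init__(
        self,
        timeout: float,
        on_stuck: Callable[[float], None],
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._timeout = timeout
        self._on_stuck = on_stuck
        self._clock = clock
        self._last_run = clock()  # step 1: set last_run at initialization
        self._stop = threading.Event()

    def mark_run(self) -> None:
        # Step 2: the main loop (the `_run` mentioned above) calls this on every iteration.
        self._last_run = self._clock()

    def check_once(self) -> bool:
        # Step 3: stuck if last_run + timeout < now.
        now = self._clock()
        if self._last_run + self._timeout < now:
            self._on_stuck(now - self._last_run)
            return True
        return False

    def start(self, interval: float = 1.0) -> threading.Thread:
        def loop() -> None:
            # Event.wait returns True once stop() has been called.
            while not self._stop.wait(interval):
                self.check_once()

        thread = threading.Thread(target=loop, name="stuck-detector", daemon=True)
        thread.start()
        return thread

    def stop(self) -> None:
        self._stop.set()
```

Note that this sketch reports on every check while the main loop stays stuck; whether to throttle repeated reports is a separate design choice that the comment above does not settle.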

Member Author

done

Comment on lines +53 to +59
for thread_id, frame in frames.items():
    thread = threads_by_id.get(thread_id)
    thread_name = thread.name if thread else f"Unknown-{thread_id}"
    stack = "".join(traceback.format_stack(frame))
    stacks.append(f"Thread {thread_name} ({thread_id}):\n{stack}")

return "\n\n".join(stacks)
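
For context, the snippet above is only a fragment. A self-contained version might look like the sketch below. It assumes that `frames` comes from `sys._current_frames()` and `threads_by_id` from `threading.enumerate()`, which the diff excerpt does not show; the function name `dump_all_stacks` and the logger call are likewise hypothetical.

```python
import logging
import sys
import threading
import traceback

logger = logging.getLogger(__name__)


def dump_all_stacks() -> str:
    # Assumption: sys._current_frames() maps each thread ident to its topmost frame.
    frames = sys._current_frames()
    # Assumption: thread names are looked up from the live threads.
    threads_by_id = {t.ident: t for t in threading.enumerate()}

    stacks = []
    for thread_id, frame in frames.items():
        thread = threads_by_id.get(thread_id)
        thread_name = thread.name if thread else f"Unknown-{thread_id}"
        stack = "".join(traceback.format_stack(frame))
        stacks.append(f"Thread {thread_name} ({thread_id}):\n{stack}")

    return "\n\n".join(stacks)


def report_stuck() -> None:
    # Logged as a warning so that a logging-to-Sentry integration can pick it up.
    logger.warning("main thread appears stuck:\n%s", dump_all_stacks())
```

Because the result is an ordinary string, it can go to any logging handler, which is the property the PR description asks for.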
Contributor

Why not just

from io import StringIO

f = StringIO()
faulthandler.dump_traceback(file=f, all_threads=True)
return f.getvalue()

Member Author

faulthandler needs a real file descriptor: it calls f.fileno(), which StringIO does not support.
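
A small sketch of the limitation, using hypothetical helper names. On CPython, `StringIO.fileno()` raises `io.UnsupportedOperation`, so the in-memory variant fails. A temporary file does work, but it goes through the filesystem, which is not the approach this PR takes.

```python
import faulthandler
import io
import tempfile


def dump_via_stringio() -> str:
    # Expected to fail: faulthandler asks the file object for a file descriptor.
    f = io.StringIO()
    faulthandler.dump_traceback(file=f, all_threads=True)
    return f.getvalue()


def dump_via_tempfile() -> str:
    # Works because a temporary file is backed by a real file descriptor.
    with tempfile.TemporaryFile() as f:
        faulthandler.dump_traceback(file=f, all_threads=True)
        f.seek(0)
        return f.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    try:
        dump_via_stringio()
    except io.UnsupportedOperation as exc:
        print(f"StringIO rejected: {exc!r}")
    print(dump_via_tempfile())
```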

@untitaker untitaker merged commit 28847a6 into main Dec 2, 2025
15 of 16 checks passed
@untitaker untitaker deleted the ref/stuck-detector-2 branch December 2, 2025 14:47
3 participants