ref: Stuck detector #512

untitaker · 2025-12-02T12:28:59Z

Implement a background thread that dumps all stacktraces when the main
thread gets stuck. Log as a warning, so that the message makes it to
Sentry too.

Variations of this are:

* Dump stack trace during consumer shutdown sentry#100857 -- unlike that PR,
  this one can run enabled in all consumers, since it only reports
  stacktraces when we're actually stuck.
* wip: Stuck detector #442 -- this is a previous
  version that only reported on the main thread, and in an overly
  complicated manner.

We cannot use faulthandler because that one can
only report to files (that have a real fileno, so no StringIO), and
I want to report the stuck consumers to logging/Sentry.

untitaker · 2025-12-02T12:36:28Z

doc CI fix: #513

arroyo/processing/processor.py

+                if self.__stuck:
+                    i += 1
+                else:
+                    i = 0
+                    self.__stuck = True


fpacifici

Please address the comments before merging

fpacifici · 2025-12-02T14:10:22Z

arroyo/processing/processor.py

+                    i = 0
+                    self.__stuck = True
+
+                if i >= stuck_detector_timeout:


Why are you comparing a counter to a timeout in milliseconds? I think you should get the current time and check whether it is greater than the previous time you logged the threads.

I sleep one second per iteration, so the counter is roughly counting the number of seconds since the last time stuck was set to false. I can adjust it to use time instead.
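
For readers following along, here is a minimal sketch of the counter-based scheme the reply describes, assuming the main loop clears a flag on every iteration and the watcher thread wakes about once per second. The names `CounterWatchdog`, `mark_progress`, and `tick` are hypothetical and are not taken from arroyo; only the branch logic mirrors the diff above.

```python
class CounterWatchdog:
    """Counts watcher ticks since the main loop last made progress."""

    def __init__(self, timeout_ticks: int) -> None:
        self.timeout_ticks = timeout_ticks
        self.stuck = False  # set by the watcher, cleared by the main loop
        self.ticks = 0

    def mark_progress(self) -> None:
        # Called by the main loop on every iteration (assumption).
        self.stuck = False

    def tick(self) -> bool:
        # Called by the watcher thread after each sleep of about one second.
        if self.stuck:
            # The main loop has not cleared the flag since the last tick.
            self.ticks += 1
        else:
            self.ticks = 0
            self.stuck = True
        return self.ticks >= self.timeout_ticks
```

The counter only approximates seconds if every tick really takes one second. Sleeps can overshoot and the tick body itself takes time, which is one reason to compare timestamps instead.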

fpacifici · 2025-12-02T14:15:33Z

arroyo/processing/processor.py

+                    i += 1
+                else:
+                    i = 0
+                    self.__stuck = True


I'd recommend using the timestamp of the last call to _run to decide whether the consumer is stuck, rather than repeatedly resetting the stuck flag to true and false. I think the behavior would be more obvious:

* Set last_run to time.time() at initialization.
* Reset it every time _run is called.
* In this thread, just check every second whether last_run + timeout < time().
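
The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code that was merged: `StuckDetector`, `mark_run`, `is_stuck`, and `watch` are hypothetical names, and the injectable `clock` parameter exists only to make the check testable without sleeping.

```python
import threading
import time
from typing import Callable


class StuckDetector:
    def __init__(self, timeout: float, clock: Callable[[], float] = time.time) -> None:
        self.timeout = timeout
        self.clock = clock
        self.last_run = clock()  # step 1: set at initialization

    def mark_run(self) -> None:
        # Step 2: reset on every call to the main loop body.
        self.last_run = self.clock()

    def is_stuck(self) -> bool:
        # Step 3: stuck once the last run is older than the timeout.
        return self.last_run + self.timeout < self.clock()


def watch(
    detector: StuckDetector,
    on_stuck: Callable[[], None],
    stop: threading.Event,
    interval: float = 1.0,
) -> None:
    # Event.wait returns True once `stop` is set, which ends the loop,
    # so the thread can be shut down cleanly (for example in tests).
    while not stop.wait(interval):
        if detector.is_stuck():
            on_stuck()
```

The comment suggests `time.time()`. Passing `time.monotonic` as the clock is an alternative that is not affected by system clock adjustments.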

fpacifici · 2025-12-02T14:22:41Z

arroyo/processing/processor.py

+    for thread_id, frame in frames.items():
+        thread = threads_by_id.get(thread_id)
+        thread_name = thread.name if thread else f"Unknown-{thread_id}"
+        stack = "".join(traceback.format_stack(frame))
+        stacks.append(f"Thread {thread_name} ({thread_id}):\n{stack}")
+
+    return "\n\n".join(stacks)


Why not just

    from io import StringIO
    f = StringIO()
    faulthandler.dump_traceback(file=f, all_threads=True)
    return f.getvalue()

faulthandler needs a real file descriptor: it calls f.fileno(), which StringIO does not support.
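
The limitation mentioned in the reply can be checked directly. The sketch below is self-contained and only loosely follows the diff above: `dump_all_stacks` and `faulthandler_accepts_stringio` are hypothetical names, and the `threads_by_id` lookup is an assumption about how the surrounding code builds that mapping.

```python
import faulthandler
import io
import sys
import threading
import traceback


def dump_all_stacks() -> str:
    """Return the current stack of every thread as one string."""
    threads_by_id = {t.ident: t for t in threading.enumerate()}
    stacks = []
    # sys._current_frames() maps each thread id to its topmost frame.
    for thread_id, frame in sys._current_frames().items():
        thread = threads_by_id.get(thread_id)
        name = thread.name if thread else f"Unknown-{thread_id}"
        stack = "".join(traceback.format_stack(frame))
        stacks.append(f"Thread {name} ({thread_id}):\n{stack}")
    return "\n\n".join(stacks)


def faulthandler_accepts_stringio() -> bool:
    """Report whether faulthandler can write into an in-memory buffer."""
    try:
        faulthandler.dump_traceback(file=io.StringIO(), all_threads=True)
    except (AttributeError, io.UnsupportedOperation):
        # StringIO has no underlying file descriptor, so fileno() fails.
        return False
    return True
```

Because `dump_all_stacks` returns a plain string, the result can be passed to something like `logger.warning(...)`, which is what lets the report reach logging and Sentry.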

untitaker marked this pull request as ready for review December 2, 2025 12:29

untitaker requested review from a team as code owners December 2, 2025 12:29

shut down thread to prevent leaks in tests

1b61a37

sentry bot reviewed Dec 2, 2025

arroyo/processing/processor.py Outdated

Comment on lines 394 to 398

+                if self.__stuck:
+                    i += 1
+                else:
+                    i = 0
+                    self.__stuck = True

This comment was marked as outdated.

shutdown properly again in tests

879fa7a

fpacifici approved these changes Dec 2, 2025

make it timestamp based, make test faster

19f98f7

untitaker merged commit 28847a6 into main Dec 2, 2025
15 of 16 checks passed

untitaker deleted the ref/stuck-detector-2 branch December 2, 2025 14:47
