-
-
Notifications
You must be signed in to change notification settings - Fork 4
ref: Stuck detector #512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ref: Stuck detector #512
Conversation
Implement a background thread that dumps all stacktraces when the main thread gets stuck. variations of this are: * getsentry/sentry#100857 -- unlike that PR, this one can run enabled in all consumers, since it only reports stacktraces when we're actually stuck. * #442 -- this is a previous version that only reported on the main thread, and in an overly complicated manner. We cannot use faulthandler because that one can only report to "real" files, and I want to report the stuck consumers to logging/Sentry.
|
doc CI fix: #513 |
arroyo/processing/processor.py
Outdated
| if self.__stuck: | ||
| i += 1 | ||
| else: | ||
| i = 0 | ||
| self.__stuck = True |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
fpacifici
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please address the comments before merging
arroyo/processing/processor.py
Outdated
| i = 0 | ||
| self.__stuck = True | ||
|
|
||
| if i >= stuck_detector_timeout: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you comparing a counter to a timeout in milliseconds ? I think you should get the current time and check if that is greater than the previous time you logged the threads
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i sleep one second per iteration so the counter is rougly counting the number of seconds since the last time stuck was set to false. i can adjust it to use time instead
arroyo/processing/processor.py
Outdated
| i += 1 | ||
| else: | ||
| i = 0 | ||
| self.__stuck = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend using the timestap of the last call to _run to decide whether the consumer is stuck rather than keeping resetting the stuck flag to true and false. I think it the behavior would be more obvious:
- Set last_run to time.time() at initialization
- Reset it every time
_runis called - in this thread just check every second whether
last_run + timeout < time()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| for thread_id, frame in frames.items(): | ||
| thread = threads_by_id.get(thread_id) | ||
| thread_name = thread.name if thread else f"Unknown-{thread_id}" | ||
| stack = "".join(traceback.format_stack(frame)) | ||
| stacks.append(f"Thread {thread_name} ({thread_id}):\n{stack}") | ||
|
|
||
| return "\n\n".join(stacks) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just
from io import StringIO
f = StringIO()
faulthandler.dump_traceback(file=, all_threads=True)
return f.getvalue()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
faulthandler needs a real file descriptor, it will try to access f.fileno which StringIO does not have
Implement a background thread that dumps all stacktraces when the main
thread gets stuck. Log as a warning, so that the message makes it to
Sentry too.
variations of this are:
this one can run enabled in all consumers, since it only reports
stacktraces when we're actually stuck.
version that only reported on the main thread, and in an overly
complicated manner. We cannot use faulthandler because that one can
only report to files (that have a real fileno, so no StringIO), and
I want to report the stuck consumers to logging/Sentry.