-
Notifications
You must be signed in to change notification settings - Fork 220
Fix: Handle dead consumer threads in Google Cloud Functions environment #1473
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix: Handle dead consumer threads in Google Cloud Functions environment #1473
Conversation
- Implement health checks for media upload consumer threads to determine activity status. - Add fallback logic to process media uploads synchronously if threads are unhealthy or queue is not drained within the timeout. - Update media upload consumer to track the last activity timestamp and add error handling in the processing loop.
langfuse/_client/resource_manager.py
Outdated
| # Check if threads are alive AND healthy (recently active) | ||
| healthy_threads = [ | ||
| c for c in self._media_upload_consumers | ||
| if c.is_alive() and c.is_healthy(timeout_seconds=5.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid magic numbers: consider extracting the 5.0s health check and 30s timeout into configurable constants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 9a3506c
| self._media_manager.process_next_media_upload() | ||
| try: | ||
| # Update activity timestamp before processing | ||
| self.last_activity = time.time() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updating last_activity both before and after processing may misrepresent actual work; consider updating only after successful processing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If an upload takes long, I think we'd rather have a signature of when the task starts so that it serves the purpose of knowing the thread is_healthy
| # Update after successful processing | ||
| self.last_activity = time.time() | ||
| except Exception as e: | ||
| self._log.error( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use _log.exception() in the exception block to capture the full stack trace for debugging.
| self._log.error( | |
| self._log.exception( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 9a3506c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 files reviewed, no comments
…igurable for media consumer threads
Issue discussion langfuse/langfuse#11104
Problem
The
flush()method would hang indefinitely in serverless environments like Google Cloud Functions when consumer threads died before flush was called. This occurred because:queue.join()waits for all queued items to be processed viatask_done()callstask_done()on remaining queue items, causingjoin()to wait foreverRelated issue: The flush would successfully process OTEL and score ingestion queues, but hang on the media upload queue, never reaching the final log statements.
Root Cause
In serverless environments like Google Cloud Functions, daemon threads can be killed at any point during execution. As documented in the official Google Cloud documentation:
A function has access to its allocated resources (memory and CPU) only for the duration of function execution. Code run outside of the execution period is not guaranteed to execute, and it can be stopped at any time. ([Source](https://cloud.google.com/functions/1stgendocs/concepts/execution-environment))
For background activities specifically:
Background activity is anything that happens after your function has terminated. Any code run after graceful termination cannot access the CPU and will not make any progress. ([Source](https://cloud.google.com/run/docs/tips/functions-best-practices))
From Google's blog on avoiding Cloud Functions anti-patterns:
A background task started by a Cloud Function is not guaranteed to complete. As soon as the Functions completes, e.g. the Function returns or a timeout error occurs, the Function instance can be terminated at any time. ([Source](https://cloud.google.com/blog/topics/developers-practitioners/avoiding-gcf-anti-patterns-part-5-how-run-background-processes-correctly-python))
This behavior is not unique to Cloud Functions - AWS Lambda has similar restrictions as noted in their [Lambda execution environment documentation](https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtime-environment.html), where background processes may be frozen when the execution context is frozen for reuse.
Solution
This PR implements a graceful fallback mechanism:
is_healthy()method toMediaUploadConsumerthat tracks thread activity vialast_activitytimestampflush()now checks if consumer threads are alive AND recently active before waitingChanges
last_activitytimestamp tracking toMediaUploadConsumeris_healthy()method to detect stalled/frozen threadsqueue.join()with conditional logic that falls back to synchronous processingTesting
Tested in Google Cloud Functions environment where the issue was consistently reproducible. After this change:
Important
Fixes hanging
flush()in serverless environments by adding consumer thread health checks and synchronous fallback processing inresource_manager.py.flush()inresource_manager.pynow checks consumer thread health usingis_healthy()before waiting on the queue.is_healthy()toMediaUploadConsumerto check thread activity based onlast_activitytimestamp.last_activityinMediaUploadConsumerto monitor thread health.flush()to indicate when synchronous processing is triggered and to log errors during synchronous processing.MediaUploadConsumer.run()loop.This description was created by
for 7893a44. You can customize this summary. It will automatically update as commits are pushed.
Disclaimer: Experimental PR review
Greptile Overview
Greptile Summary
This PR implements a graceful fallback mechanism to handle dead consumer threads in serverless environments like Google Cloud Functions, preventing infinite hangs during
flush()operations.Key Changes
last_activitytimestamp tracking toMediaUploadConsumerto detect stalled threadsis_healthy()method that checks both thread liveness and recent activityqueue.join()with conditional logic that checks thread health before waitingCritical Issue
Missing Import: The code uses
time.time()andtime.sleep()inresource_manager.pybut thetimemodule is not imported at the top of the file. This will cause aNameErrorat runtime when the new code path is executed.Architecture
The solution correctly addresses the root cause: daemon threads in serverless environments can be terminated by the runtime, leaving queued items unprocessed. The fallback to synchronous processing ensures no data loss while maintaining backward compatibility with normal environments.
Confidence Score: 1/5
timemodule import inresource_manager.pyis a critical issue that will cause an immediateNameErrorwhen the new flush logic executes. While the overall design and approach are sound, this syntax error makes the code non-functional. The PR cannot be safely merged until this import is added.langfuse/_client/resource_manager.py- add the missingimport timestatement at the top of the moduleImportant Files Changed
File Analysis
timeimport will cause runtime errorSequence Diagram
sequenceDiagram participant Main as Main Thread participant RM as ResourceManager participant Consumer as MediaUploadConsumer Thread participant Queue as Media Upload Queue participant MM as MediaManager Note over Main,MM: Normal Operation Main->>RM: flush() RM->>Consumer: Check is_alive() && is_healthy(5s) Consumer-->>RM: True (healthy threads exist) RM->>Queue: Check empty() with 30s timeout loop While queue not empty and timeout not reached Consumer->>Queue: get(block=True, timeout=1) Queue-->>Consumer: UploadMediaJob Consumer->>Consumer: Update last_activity Consumer->>MM: process_next_media_upload() MM->>Queue: task_done() Consumer->>Consumer: Update last_activity end Queue-->>RM: Queue empty RM-->>Main: Success Note over Main,MM: Serverless Environment (Dead Threads) Main->>RM: flush() RM->>Consumer: Check is_alive() && is_healthy(5s) Note over Consumer: Thread killed by runtime Consumer-->>RM: False (threads dead/unhealthy) RM->>Queue: Check qsize() loop While queue not empty RM->>MM: process_next_media_upload() MM->>Queue: get(block=True, timeout=1) Queue-->>MM: UploadMediaJob MM->>MM: Upload media to storage MM->>Queue: task_done() end RM-->>Main: Success (synchronous fallback)