@cfsmp3 (Contributor) commented Dec 29, 2025

Summary

  • Fixed 502 timeout errors in /progress-reporter/ endpoint that occurred when tests completed
  • Made GCP VM deletion fire-and-forget instead of blocking synchronously
  • Root cause: wait_for_operation() could block for 60+ seconds, exceeding nginx's 60s proxy timeout

Investigation Findings

Analysis of production server (ssh ccextractor) revealed:

  • 11% webhook failure rate (11 out of 100 requests today returned 502)
  • nginx error logs showed: upstream timed out (110: Unknown error) while reading response header
  • Pattern in application logs: 60-70 seconds between "Test completed" and "Test <completed>" due to wait_for_operation blocking

Example from logs:

2025/12/29 21:01:17 [error] upstream timed out ... request: "POST /progress-reporter/7424/..."

Corresponding app logs:

[2025-12-29 21:00:18] [Test: 7424] Test completed: 0 crashes, 3 results
[2025-12-29 21:01:27] [Test: 7424] Test <completed>  # 69 seconds later!
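
The gap between the two log lines can be checked directly from their timestamps. The following is a minimal sketch, assuming the bracketed timestamp format shown above; the helper name `seconds_between` and the constant `NGINX_PROXY_TIMEOUT_SECONDS` are hypothetical and not part of the codebase.

```python
from datetime import datetime

# Timestamp layout assumed from the application log excerpt above.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# The 60s nginx proxy timeout stated in the summary.
NGINX_PROXY_TIMEOUT_SECONDS = 60


def seconds_between(first_line: str, second_line: str) -> float:
    """Return the seconds elapsed between two log lines that start with "[timestamp]"."""

    def parse(line: str) -> datetime:
        # Take the text between the leading "[" and the first "]".
        stamp = line[1:line.index("]")]
        return datetime.strptime(stamp, LOG_TIME_FORMAT)

    return (parse(second_line) - parse(first_line)).total_seconds()


gap = seconds_between(
    "[2025-12-29 21:00:18] [Test: 7424] Test completed: 0 crashes, 3 results",
    "[2025-12-29 21:01:27] [Test: 7424] Test <completed>",
)
exceeds_timeout = gap > NGINX_PROXY_TIMEOUT_SECONDS
print(gap, exceeds_timeout)
```

With the two lines from test 7424, the request handler was still busy after the proxy had already given up on the upstream.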

Changes

  • Removed wait_for_operation() call after delete_instance()
  • Added informational log message to track initiated deletions
  • The deletion will complete asynchronously - we don't need confirmation since:
    1. All test results are already saved to the database
    2. GitHub status is already updated
    3. The VM will be cleaned up eventually
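
The shape of the change can be sketched as follows. This is an illustration under stated assumptions, not the actual diff: `FakeCompute` is a hypothetical stand-in for the GCP client, `release_vm` is a hypothetical wrapper name, and the method signatures here are simplified and may not match the real `delete_instance()` and `wait_for_operation()` helpers.

```python
import logging

log = logging.getLogger("fire_and_forget_sketch")


class FakeCompute:
    """Hypothetical stand-in for the GCP client used by the platform."""

    def __init__(self) -> None:
        self.delete_calls = []
        self.wait_calls = 0

    def delete_instance(self, vm_name: str) -> dict:
        # Returns an operation handle right away; the VM is removed later.
        self.delete_calls.append(vm_name)
        return {"name": f"operation-delete-{vm_name}", "status": "RUNNING"}

    def wait_for_operation(self, operation_name: str) -> dict:
        # In production this is the call that could block for 60+ seconds.
        self.wait_calls += 1
        return {"name": operation_name, "status": "DONE"}


def release_vm(compute: FakeCompute, vm_name: str) -> dict:
    """Start VM deletion and return immediately, without waiting for completion."""
    operation = compute.delete_instance(vm_name)
    # Informational log so initiated deletions can still be traced.
    log.info("Initiated deletion of VM %s (operation %s)", vm_name, operation["name"])
    # Deliberately no compute.wait_for_operation(...) call here.
    return operation


compute = FakeCompute()
operation = release_vm(compute, "vm-7424")
```

The trade-off is that a failed deletion is no longer observed in the request path, which the reasons listed above accept because results and GitHub status are already persisted by that point.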

Test plan

  • Existing test_progress_type_request passes
  • Existing test_progress_type_request_empty_token passes
  • Deploy and verify no more 502 errors on production

🤖 Generated with Claude Code

The progress_reporter endpoint was timing out (502 errors) when tests
completed because it synchronously waited for GCP VM deletion, which
can take 60+ seconds. This exceeded nginx's default 60s proxy timeout.

The fix makes VM deletion fire-and-forget:
- Initiate deletion but don't block waiting for completion
- The VM will be deleted eventually - we don't need confirmation
- All critical work (test results, GitHub status) completes first
- Added logging to track initiated deletions

Investigation on production server showed:
- 11% of webhook requests returned 502 errors (11 out of 100 today)
- nginx error logs showed "upstream timed out" for /progress-reporter/
- Pattern: Test completion logs showed ~60-70s between "Test completed"
  and "Test <completed>" due to wait_for_operation blocking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Break long log line into two lines to comply with 120 char limit.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@canihavesomecoffee canihavesomecoffee merged commit 4be5eee into master Dec 29, 2025
6 checks passed