Skip to content

Conversation

Copy link

Copilot AI commented Nov 19, 2025

Multipart uploads fail intermittently because S3 returns quoted ETags ("abc123") but CompleteMultipartUpload expects unquoted values. The existing mutex protection for the parts slice was already correct.

Changes

  • ETag normalization: Trim quotes from S3 ETag responses before storing in parts array

    eTag = resp.Header.Get("ETag")
    eTag = strings.Trim(eTag, `"`)  // Normalize S3's quoted format
  • Debug logging: Added logging when appending parts and before completion to aid diagnosis of future failures

    • Log each part append with part number and ETag
    • Log full parts manifest (count + all ETags) before CompleteMultipartUpload
  • Import: Added strings package for ETag processing

Alternative approaches considered

  1. Per-worker part buffers: Eliminates lock contention but requires channel coordination and more memory
  2. Channel-based notifications: Cleaner concurrency model but adds complexity for marginal benefit given short critical section
  3. Part number validation: Check for gaps/duplicates before completion, but existing count check at line 162 is sufficient

The minimal mutex+trim approach fixes the immediate ETag mismatch with lowest risk.

Original prompt

Problem summary

Users report intermittent failures during multipart uploads with logs like "error uploading file to bucket: multipart upload failed". Investigation shows the multipart upload implementation (gen3-client/g3cmd/upload-multipart.go) appends to a shared parts slice from multiple goroutines without synchronization, which is a data race and can lead to corrupted or missing part entries. Additionally, ETag values returned by S3-like services are often quoted (e.g., ""abc123""), and the code does not normalize/trims ETag quotes before sending the CompleteMultipartUpload request, which can cause completion to fail.

Goal

Create a small, focused patch that:

  • Adds proper synchronization when worker goroutines add entries to the shared parts slice.
  • Normalizes/trims ETag values (remove surrounding quotes) before storing them in parts.
  • Adds brief debug logging and a small pre-completion validation/log to aid future debugging.

Files to change

  1. gen3-client/g3cmd/upload-multipart.go
  • Add import of strings.
  • After receiving eTag from the HTTP PUT response, normalize it: eTag = strings.Trim(eTag, ").
  • Ensure concurrent appends to the shared parts slice are protected with multipartUploadLock (use the existing global variable).
  • Add a small log.Printf when appending parts and before calling CompleteMultipartUpload to show part count, and optionally display part numbers+ETags at debug/verbose level.

Change rationale

  • The fundamental bug is a race on parts, which can corrupt in-memory slice state or lose part entries. Protecting appends with a mutex eliminates the race with minimal changes.
  • Trimming ETag quotes avoids mismatches between ETag formats expected by the server when calling CompleteMultipartUpload.
  • Additional logging provides visibility into what is sent to the server, so future failures can be diagnosed faster.

Patch requirements

  • Keep changes minimal and local to upload-multipart.go.
  • No dependency or API changes.
  • Do not enable Shepherd multipart upload — leave existing behavior.
  • Follow existing code patterns where possible.

Other alternatives (analysis)

Below is a writeup of a more defensive rework that reduces locking contention and adds validation before completing the multipart upload.

  1. Per-worker part buffers and single merge
  • Each worker maintains a local slice of MultipartPartObject for parts it uploads. After all workers finish, the main goroutine merges these per-worker slices under a single lock or by concatenation. This reduces lock contention (no lock on each append) and avoids global shared mutation during upload. Implementation steps:
    • Create partsCh := make(chan []MultipartPartObject, numOfWorkers).
    • Each worker collects its own localParts := make([]MultipartPartObject, 0) and appends without locking.
    • After finishing, a worker sends localParts into partsCh.
    • Main goroutine drains partsCh numOfWorkers times and concatenates all parts into the global parts slice, then sorts.
    • Optionally validate that the merged set contains exactly the expected PartNumbers 1..numOfChunks.

Pros: reduces lock contention and potential for contention under many small parts; cleaner semantics.
Cons: requires more code changes and memory for per-worker buffers, but memory is proportional to number of parts, not chunk size.

  1. Use a concurrency-safe data structure or channel for part notifications
  • Instead of appending to a slice, workers can send their part metadata into a buffered channel partsNotifyCh. The main goroutine reads from that channel and appends into parts slice (only one goroutine mutates parts). Implementation steps:
    • partsNotifyCh := make(chan MultipartPartObject, numOfChunks)
    • Workers send parts via partsNotifyCh when upload succeeds.
    • A separate goroutine or main goroutine reads from partsNotifyCh and appends into parts until it has numOfChunks entries.

Pros: simple to reason about; avoids locks entirely.
Cons: requires channel read loop and coordination to close the channel when all workers are done.

  1. Additional validations before CompleteMultipartUpload
  • Validate that the merged parts list contains unique PartNumber values and that every PartNumber from 1..numOfChunks is present. If missing parts are detected, fail early and optionally retry those missing parts.
  • Validate ETag format (non-empty, expected length/charset) and log discrepancies.

Why the minimal mutex+ETag trim is chosen

  • Fastest to implement with lowest risk: it fixes the immediate data race and common ETag formatting mismatch.
  • Keeps behavior nearly identical and addresses the most likely causes of the "multipart upload failed" errors reported.

PR contents & testing notes

  • Modify gen3-client/g3cmd/upload-multipart.go to:

    • import strings
    • normalize eTag via eTag = strings.Trim(eTag, ")
    • wrap parts append and bar.Add(n) under multipartUploadLock
    • add debug logging showing appended part and pre-completion part count
  • Run go vet a...

This pull request was created as a result of the following prompt from Copilot chat.

Problem summary

Users report intermittent failures during multipart uploads with logs like "error uploading file to bucket: multipart upload failed". Investigation shows the multipart upload implementation (gen3-client/g3cmd/upload-multipart.go) appends to a shared parts slice from multiple goroutines without synchronization, which is a data race and can lead to corrupted or missing part entries. Additionally, ETag values returned by S3-like services are often quoted (e.g., ""abc123""), and the code does not normalize/trims ETag quotes before sending the CompleteMultipartUpload request, which can cause completion to fail.

Goal

Create a small, focused patch that:

  • Adds proper synchronization when worker goroutines add entries to the shared parts slice.
  • Normalizes/trims ETag values (remove surrounding quotes) before storing them in parts.
  • Adds brief debug logging and a small pre-completion validation/log to aid future debugging.

Files to change

  1. gen3-client/g3cmd/upload-multipart.go
  • Add import of strings.
  • After receiving eTag from the HTTP PUT response, normalize it: eTag = strings.Trim(eTag, ").
  • Ensure concurrent appends to the shared parts slice are protected with multipartUploadLock (use the existing global variable).
  • Add a small log.Printf when appending parts and before calling CompleteMultipartUpload to show part count, and optionally display part numbers+ETags at debug/verbose level.

Change rationale

  • The fundamental bug is a race on parts, which can corrupt in-memory slice state or lose part entries. Protecting appends with a mutex eliminates the race with minimal changes.
  • Trimming ETag quotes avoids mismatches between ETag formats expected by the server when calling CompleteMultipartUpload.
  • Additional logging provides visibility into what is sent to the server, so future failures can be diagnosed faster.

Patch requirements

  • Keep changes minimal and local to upload-multipart.go.
  • No dependency or API changes.
  • Do not enable Shepherd multipart upload — leave existing behavior.
  • Follow existing code patterns where possible.

Other alternatives (analysis)

Below is a writeup of a more defensive rework that reduces locking contention and adds validation before completing the multipart upload.

  1. Per-worker part buffers and single merge
  • Each worker maintains a local slice of MultipartPartObject for parts it uploads. After all workers finish, the main goroutine merges these per-worker slices under a single lock or by concatenation. This reduces lock contention (no lock on each append) and avoids global shared mutation during upload. Implementation steps:
    • Create partsCh := make(chan []MultipartPartObject, numOfWorkers).
    • Each worker collects its own localParts := make([]MultipartPartObject, 0) and appends without locking.
    • After finishing, a worker sends localParts into partsCh.
    • Main goroutine drains partsCh numOfWorkers times and concatenates all parts into the global parts slice, then sorts.
    • Optionally validate that the merged set contains exactly the expected PartNumbers 1..numOfChunks.

Pros: reduces lock contention and potential for contention under many small parts; cleaner semantics.
Cons: requires more code changes and memory for per-worker buffers, but memory is proportional to number of parts, not chunk size.

  1. Use a concurrency-safe data structure or channel for part notifications
  • Instead of appending to a slice, workers can send their part metadata into a buffered channel partsNotifyCh. The main goroutine reads from that channel and appends into parts slice (only one goroutine mutates parts). Implementation steps:
    • partsNotifyCh := make(chan MultipartPartObject, numOfChunks)
    • Workers send parts via partsNotifyCh when upload succeeds.
    • A separate goroutine or main goroutine reads from partsNotifyCh and appends into parts until it has numOfChunks entries.

Pros: simple to reason about; avoids locks entirely.
Cons: requires channel read loop and coordination to close the channel when all workers are done.

  1. Additional validations before CompleteMultipartUpload
  • Validate that the merged parts list contains unique PartNumber values and that every PartNumber from 1..numOfChunks is present. If missing parts are detected, fail early and optionally retry those missing parts.
  • Validate ETag format (non-empty, expected length/charset) and log discrepancies.

Why the minimal mutex+ETag trim is chosen

  • Fastest to implement with lowest risk: it fixes the immediate data race and common ETag formatting mismatch.
  • Keeps behavior nearly identical and addresses the most likely causes of the "multipart upload failed" errors reported.

PR contents & testing notes

  • Modify gen3-client/g3cmd/upload-multipart.go to:

    • import strings
    • normalize eTag via eTag = strings.Trim(eTag, ")
    • wrap parts append and bar.Add(n) under multipartUploadLock
    • add debug logging showing appended part and pre-completion part count
  • Run go vet and go test as applicable; run the client with -race enabled while exercising multipart uploads to verify the race is gone.

Please apply this small change as a focused PR. The PR should include the change summary and the "other alternatives" section from above (so it's captured in the PR description). Also include brief instructions for manual testing (run with -race; test multipart upload of a file split into multiple parts).

Do the patch now.


✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Copilot AI and others added 2 commits November 19, 2025 22:47
- Add strings import for ETag processing
- Normalize ETag values by trimming quotes from S3 responses
- Add debug logging when appending parts
- Add pre-completion validation logging showing part count and ETags
- Mutex protection already in place for parts slice append

Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com>
Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix multipart upload failures and ETag normalization Fix multipart upload ETag normalization and add debug logging Nov 19, 2025
Copilot AI requested a review from bwalsh November 19, 2025 22:52
@bwalsh bwalsh changed the title Fix multipart upload ETag normalization and add debug logging chore: multipart upload ETag normalization and add debug logging Nov 19, 2025
@bwalsh
Copy link
Contributor

bwalsh commented Nov 19, 2025

@copilot Read UploadSingleMultipart is it thread safe?

Copy link
Author

Copilot AI commented Nov 19, 2025

@copilot Read UploadSingleMultipart is it thread safe?

The multipartUpload function is thread-safe for concurrent part uploads within a single file (internal goroutines are properly synchronized with the mutex). However, it uses a global multipartUploadLock mutex, which means:

  • Currently safe: The function is only called sequentially in the codebase (in processMultipartUpload loop and retry logic)
  • ⚠️ Design issue: If multipartUpload were called concurrently for different files, they would unnecessarily block each other since they share the same global lock

The mutex should ideally be per-upload (local variable) rather than global. The current global mutex works because:

  1. It's only locked for brief append operations (lines 144-148)
  2. The function is called sequentially, not concurrently

Would you like me to refactor to use a per-upload mutex instead of the global one?

@matthewpeterkort
Copy link
Collaborator

matthewpeterkort commented Dec 17, 2025

added in another branch see 4089807

@matthewpeterkort matthewpeterkort deleted the copilot/fix-multipart-upload-issues branch December 17, 2025 19:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants