
sync: RWMutex Readers can starve writers for many seconds #76808

@evanj

Description

Go version

go version go1.25.5 darwin/arm64

Output of go env in your module/workspace:

AR='ar'
CC='cc'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='1'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='c++'
GCCGO='gccgo'
GO111MODULE=''
GOARCH='arm64'
GOARM64='v8.0'
GOAUTH='netrc'
GOBIN=''
GOCACHE='/Users/evan.jones/Library/Caches/go-build'
GOCACHEPROG=''
GODEBUG=''
GOENV='/Users/evan.jones/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/pp/tvwz4y2x2qz97pf8bftqxhrw0000gp/T/go-build2627804322=/tmp/go-build -gno-record-gcc-switches -fno-common'
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMOD='/Users/evan.jones/unfairlocks/go.mod'
GOMODCACHE='/Users/evan.jones/go/pkg/mod'
GOOS='darwin'
GOPATH='/Users/evan.jones/go'
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/opt/homebrew/Cellar/go/1.25.5/libexec'
GOSUMDB='sum.golang.org'
GOTELEMETRY='on'
GOTELEMETRYDIR='/Users/evan.jones/Library/Application Support/go/telemetry'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/opt/homebrew/Cellar/go/1.25.5/libexec/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.25.5'
GOWORK=''
PKG_CONFIG='pkg-config'

What did you do?

A contended RWMutex allows readers to starve waiting writers for an extremely long time (e.g. >10 seconds). I believe this is because RWMutex.Unlock() first unblocks all readers, before checking whether there are any waiting writers. I think this disagrees with the documentation of RLock: "a blocked Lock call excludes new readers from acquiring the lock".

We observed this causing some goroutines in an "overloaded" server to be blocked for a very long time (at least >1 second). For this scenario, I think it would be better if Unlock() did not unblock readers when there is another writer waiting. This would then allow continuously arriving writers to starve readers instead, but I think that would better match the package documentation.

The attached program simulates how Datadog's metrics package datadog-go uses an RWMutex for a map of counters in aggregator.count (a minimal sketch of the pattern follows this list):

  • RLock()
  • Read the map. If the counter key exists: increment it.
  • RUnlock()
  • If the key did not exist:
    • Lock()
    • Check for the key again, increment or insert the new key
    • Unlock()
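
To make the pattern concrete, here is a minimal self-contained sketch of it. This is my own illustration, not the actual datadog-go code: the countStore type and its method names are invented, and the atomic add under the read lock is my assumption about how the increment stays race-free while holding only RLock.

```go
package counters

import (
	"sync"
	"sync/atomic"
)

// countStore guards a map of counters with an RWMutex, mimicking the
// aggregator.count pattern described above (illustrative sketch only).
type countStore struct {
	mu     sync.RWMutex
	counts map[string]*int64
}

func newCountStore() *countStore {
	return &countStore{counts: make(map[string]*int64)}
}

// increment uses the read-then-write-lock pattern: the common case (key
// already present) only takes the read lock; the rare case (new key) takes
// the write lock and re-checks before inserting.
func (s *countStore) increment(key string) {
	s.mu.RLock()
	if c, ok := s.counts[key]; ok {
		atomic.AddInt64(c, 1) // a shared lock is enough because the add is atomic
		s.mu.RUnlock()
		return
	}
	s.mu.RUnlock()

	s.mu.Lock()
	defer s.mu.Unlock()
	if c, ok := s.counts[key]; ok {
		atomic.AddInt64(c, 1) // another goroutine inserted the key first
		return
	}
	v := int64(1)
	s.counts[key] = &v
}
```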

I think the following is happening in both the attached demo program and the production server (a paraphrased sketch of RWMutex.Unlock follows this list):

  • A writer acquires the RWMutex.Lock().
  • Many more writers block in the internal Mutex.Lock().
  • Many readers block in Mutex.RLock().
  • The writer calls RWMutex.Unlock(). It first announces that there is no active writer (// Announce to readers there is no active writer. in RWMutex.Unlock).
  • New readers can now acquire the RLock(). Since requests are continuously arriving, there are always running readers.
  • The writer unblocks all blocked readers with a for loop in Unlock.
  • The writer unblocks the next waiting writer (// Allow other writers to proceed.)
  • The unblocked writer gets scheduled, then finally blocks new readers.
  • The next writer now must wait for all readers to finish, and the writer can finally enter the critical section.
  • Repeat this for the next writer. The result is it takes ~100 ms to get to each writer in the queue.
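
For reference, here is my paraphrase of the structure of RWMutex.Unlock in sync/rwmutex.go (Go 1.25), with the race annotations and the unlocked-mutex check omitted. It is a simplified sketch rather than the exact source, and it will not compile on its own because it uses sync-internal helpers; the point is the ordering of the three steps.

```go
// Paraphrased sketch of sync.RWMutex.Unlock (simplified; not the exact
// standard-library source). Readers are released before the next writer.
func (rw *RWMutex) Unlock() {
	// Announce to readers there is no active writer. From this point on,
	// *new* readers can acquire RLock immediately.
	r := rw.readerCount.Add(rwmutexMaxReaders)

	// Unblock blocked readers, if any: every reader that queued up while
	// the writer held the lock is released here.
	for i := 0; i < int(r); i++ {
		runtime_Semrelease(&rw.readerSem, false, 0)
	}

	// Allow other writers to proceed. Only now can the next writer wake
	// up, acquire the inner mutex, and start excluding new readers again.
	rw.w.Unlock()
}
```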

Demo program: https://github.com/evanj/unfairlocks/blob/main/unfairlocks.go#L47

What did you see happen?

In servers that use this in a very hot loop, we occasionally see "stuck" goroutines that are blocked for > 1 second.

The attached demo program prints "slow" increment requests. When I run it with increasing numbers of requests, the slowest increment time continues to increase. The waiting time appears to be essentially unbounded: I can make an increment block more or less forever by continuously adding more simulated requests.

$ go run . -requests=50000
...
Shard shard-35 increment duration: 598.276583ms
$ go run . -requests=500000
...
Shard shard-32 increment duration: 6.82352725s
$ go run . -requests=5000000
...
Shard shard-31 increment duration: 17.665173458s

This output shows the waiting time increases as I add more requests. The last line shows some goroutines were blocked for up to ~17 seconds. With this particular program and configuration, this is the worst delay I can observe: after that, the queue of blocked writers is gone.
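
For completeness, here is a much smaller self-contained sketch of the same contention pattern. It is my own illustration, not the linked unfairlocks program; the goroutine counts and sleep duration are arbitrary, and how extreme the measured writer wait gets will depend on the machine.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex
	stop := make(chan struct{})

	// Continuously arriving readers: some goroutines are (almost) always
	// holding the read lock.
	for i := 0; i < 64; i++ {
		go func() {
			for {
				select {
				case <-stop:
					return
				default:
				}
				mu.RLock()
				time.Sleep(100 * time.Microsecond) // simulated read work
				mu.RUnlock()
			}
		}()
	}

	// A queue of writers, each measuring how long Lock() blocks.
	var (
		wg      sync.WaitGroup
		worstMu sync.Mutex
		worst   time.Duration
	)
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			start := time.Now()
			mu.Lock()
			wait := time.Since(start)
			mu.Unlock()

			worstMu.Lock()
			if wait > worst {
				worst = wait
			}
			worstMu.Unlock()
		}()
	}
	wg.Wait()
	close(stop)
	fmt.Println("worst writer wait:", worst)
}
```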

What did you expect to see?

The mutex should sometimes be unfair, but not exceptionally unfair, and it should not starve writers essentially forever. The demo program also prints the timing of a version that only uses a Mutex, and that version only shows waits up to ~200 ms.
