@gbuisson gbuisson commented Dec 10, 2025

Fix stale connections and add connection pool configuration

Problem: In production, NoHttpResponseException errors were caused by stale pooled
connections that the server had already closed but the client still tried to reuse.

Solution: Add a :validate-after-inactivity option (default: 5000ms) that checks idle
connections before reuse, preventing stale-connection errors.
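
The idle check can be pictured with a small sketch. This is not the code of ductile or of the underlying HTTP client; `needs-validation?` and its arguments are hypothetical names, used only to illustrate the rule "check a pooled connection before reuse if it has been idle for at least the configured interval".

```clojure
;; Hypothetical helper, for illustration only (not part of ductile).
(defn needs-validation?
  "True when a pooled connection has been idle for at least
  validate-after-ms and should be checked before it is reused."
  [last-used-ms now-ms validate-after-ms]
  (>= (- now-ms last-used-ms) validate-after-ms))

(needs-validation? 1000 7000 5000) ; idle for 6000 ms => true
(needs-validation? 1000 3000 5000) ; idle for 2000 ms => false
```

Under this model, a connection reused within 5000ms of its last use skips the check, so busy pools pay little overhead.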

Additional improvements:

  • Connection pool tuning: :threads, :default-per-route, :insecure? options
  • Request timeouts: :connection-timeout (default: 10s), :socket-timeout (no
    default, for long-running operations)
  • BREAKING: Renamed :timeout → :connection-ttl for clarity (in seconds, default:
    60s)

Migration:
;; Before

  (conn/connect {:host "localhost" :port 9200})

;; After (same behavior, just explicit)

  (conn/connect {:host "localhost"
                 :port 9200
                 :connection-ttl 60              ; seconds
                 :validate-after-inactivity 5000 ; ms
                 :connection-timeout 10000})     ; ms

BREAKING CHANGE: Renamed :timeout to :connection-ttl for clarity.

Connection pool options:
- :connection-ttl (default: 60s) - how long connections live in the pool
- :validate-after-inactivity (default: 5000ms) - checks idle connections
  before reuse, preventing NoHttpResponseException from stale connections
- :threads (default: 100) - max total connections in pool
- :default-per-route (default: 100) - max connections per route
- :insecure? (default: false) - allow self-signed SSL certificates

Request timeout options (applied to every request):
- :connection-timeout (default: 10000ms) - time to establish TCP connection
- :socket-timeout (default: none) - time to wait for response data

Also removes the deprecated PoolingClientConnectionManager from the schema.
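
For reference, the options listed above could be collected in a single map along these lines. This is a sketch rather than a recommended configuration: the var name `es-conn-opts` is hypothetical, the `:socket-timeout` value is an arbitrary example (the option has no default), and `:connection-ttl` follows the 60-second value used in the migration snippet.

```clojure
;; Hypothetical options map covering every option named in this PR.
(def es-conn-opts
  {:host "localhost"
   :port 9200
   ;; connection pool options
   :connection-ttl 60               ; seconds
   :validate-after-inactivity 5000  ; ms
   :threads 100                     ; max total connections
   :default-per-route 100           ; max connections per route
   :insecure? false                 ; reject self-signed certificates
   ;; request timeout options
   :connection-timeout 10000        ; ms
   :socket-timeout 600000})         ; ms, example value only (no default)

;; Assuming `conn` aliases the namespace that provides `connect`:
;; (conn/connect es-conn-opts)
```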
@gbuisson gbuisson force-pushed the fix-stale-connections branch from 8f23af1 to 15a34ed Compare December 10, 2025 00:48
@gbuisson gbuisson merged commit f65762b into master Dec 10, 2025
2 checks passed
sayerada added a commit to threatgrid/ctia that referenced this pull request Dec 15, 2025
Addresses socket timeout errors occurring at exactly 10 seconds for
long-running Elasticsearch queries (e.g., queries with 1000+ sub-requests
that take several minutes).

Root cause: After ductile PR #45 was merged, the new connection management
defaults include a 10-second connection-timeout that is being reused as
socket-timeout when not explicitly set. This causes intermittent failures
for requests that take longer than 10 seconds.
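
The fallback described above can be modeled in a few lines. This is a hypothetical model of the reported behavior, not ductile's actual implementation; `effective-socket-timeout` is an illustrative name.

```clojure
;; Hypothetical model of the observed behavior (not ductile's code):
;; when :socket-timeout is absent, :connection-timeout is used in its place.
(defn effective-socket-timeout
  [{:keys [socket-timeout connection-timeout]}]
  (or socket-timeout connection-timeout))

(effective-socket-timeout {:connection-timeout 10000})
;; => 10000

(effective-socket-timeout {:connection-timeout 10000
                           :socket-timeout 600000})
;; => 600000
```

With only the 10000ms connection timeout set, any wait for response data longer than 10 seconds would fail, which matches the symptom.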

Solution: Explicitly set timeout parameters when creating ES connections:
- socket-timeout: 600000ms (10 minutes) - allows long-running queries
- connection-timeout: 10000ms (10 seconds) - reasonable for establishing connection
- validate-after-inactivity: 5000ms (5 seconds) - prevents NoHttpResponseException

This is a temporary workaround until ctia's properties schema is updated
to support these new ductile parameters (socket-timeout, connection-timeout,
validate-after-inactivity) as configurable properties.
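
A minimal sketch of the workaround, assuming the connection options are built as a plain map before being passed to `conn/connect`. The names `es-timeouts` and `with-es-timeouts` are hypothetical and do not come from the ctia codebase.

```clojure
;; Explicit timeouts from the commit message above.
(def es-timeouts
  {:socket-timeout 600000            ; ms (10 minutes)
   :connection-timeout 10000         ; ms (10 seconds)
   :validate-after-inactivity 5000}) ; ms (5 seconds)

;; Hypothetical helper: adds the explicit timeouts to a connection
;; options map. Values already present in opts take precedence.
(defn with-es-timeouts
  [opts]
  (merge es-timeouts opts))

(with-es-timeouts {:host "localhost" :port 9200})
;; => a map containing :host, :port and the three timeout keys

;; Assuming `conn` aliases the namespace that provides `connect`:
;; (conn/connect (with-es-timeouts {:host "localhost" :port 9200}))
```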

Related:
- ductile PR #45: threatgrid/ductile#45
- Symptom: Requests failing at exactly 10s with socket timeout errors
- Evidence: Some requests succeed at 16s, 24s, 28s while others fail at 10s