Description
Observation
We are seeing the number of file descriptors held by the plane proxy process grow steadily over time. Eventually it hits a tipping point, presumably when the count exceeds the soft ulimit set on the host. At that point the host goes into an accelerated decline where CPU and disk writes spike (presumably due to the plane proxy logging error messages).
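To confirm the growth, the proxy's descriptor count can be watched directly via /proc. A minimal sketch, using the current shell's own PID as a stand-in for the proxy's PID (which you would substitute in):

```shell
# Count open file descriptors for a PID; $$ (this shell) stands in
# for the plane proxy's actual PID.
pid=$$
echo "open fds: $(ls /proc/"$pid"/fd | wc -l)"
# The soft limit that count is climbing toward:
ulimit -Sn
```

Sampling this periodically (e.g. from cron) would show whether the growth is linear or tied to traffic bursts.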
We see two specific error messages once the host gets into this bad state:
Failed to accept connection.
Os { code: 24, kind: Uncategorized, message: "Too many open files" }

Upstream request failed
RequestFailed(hyper_util::client::legacy::Error(Connect, Boxed(ConnectError("tcp open error", Os { code: 24, kind: Uncategorized, message: "Too many open files" }))))
Theory
After some investigation, the working theory is that in some cases the connection pool (and its underlying file descriptors) is not cleaned up, perhaps when the drone doesn't close the connection cleanly. Once the process's FD limit is exhausted, the proxy can no longer accept new incoming connections or open outbound connections to the upstream drone service.
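One way to check this theory: connections whose peer closed but which the local process never reaped typically linger in CLOSE_WAIT. A count of those sockets growing in step with the FD count would support the unclean-close hypothesis. A sketch reading /proc/net/tcp directly (field 4 is the socket state; 08 is CLOSE_WAIT in the kernel's encoding):

```shell
# Count TCP sockets in CLOSE_WAIT (state code 08 in /proc/net/tcp).
# A steadily growing number alongside the proxy's fd count would point
# at connections the proxy is failing to close.
awk 'NR > 1 && $4 == "08"' /proc/net/tcp | wc -l
```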
Potential Changes
When constructing the client, we can configure the connection pool to manage connections better and reap stale ones. Note that in hyper_util's legacy client (the one in the stack trace above), the pool settings live on the Client builder rather than on HttpConnector; only TCP keepalive is set on the connector. Roughly:
let mut connector = HttpConnector::new();
connector.set_keepalive(Some(Duration::from_secs(60))); // TCP keepalive probes
let client: Client<_, Full<Bytes>> = Client::builder(TokioExecutor::new())
    .pool_idle_timeout(Duration::from_secs(90)) // reap connections idle > 90s
    .pool_max_idle_per_host(10)                 // cap idle connections per host
    .build(connector);