Open
Description
This is unfortunately a little vague at the moment; however, it seems that when we put PG001 (a host) into downtime on 5.11.20230318, naemon ends up locking up or breaking in some way.
Under normal circumstances, lsof on the socket returns this:
Every 2.0s: lsof /omd/sites/default/tmp/run/live OMD002: Wed Mar 29 10:43:51 2023
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
naemon 444013 default 12u unix 0x000000003aced7f1 0t0 1806884 /omd/sites/default/tmp/run/live type=STREAM
naemon 444027 default 12u unix 0x000000003aced7f1 0t0 1806884 /omd/sites/default/tmp/run/live type=STREAM
However, when it is broken (i.e. Thruk is timing out communicating with the socket), lsof shows:
OMD[default@OMD002]:~$ lsof /omd/sites/default/tmp/run/live
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
naemon 348464 default 12u unix 0x00000000106e0b91 0t0 1394514 /omd/sites/default/tmp/run/live type=STREAM
naemon 348464 default 19u unix 0x000000003ea4cdc5 0t0 1772423 /omd/sites/default/tmp/run/live type=STREAM
naemon 348477 default 12u unix 0x00000000106e0b91 0t0 1394514 /omd/sites/default/tmp/run/live type=STREAM
It looks like the same naemon process (PID 348464) has opened another file handle to the socket (FD 19u in addition to 12u), or something similar.
Thruk logs the following errors:
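The difference between the two listings is that one PID (348464) holds two descriptors on the socket. A minimal sketch for spotting that from captured lsof output is below. It assumes the default lsof column layout shown above; the function name `pids_with_extra_handles` is made up for illustration and is not part of naemon, Thruk or OMD.

```python
from collections import defaultdict
from typing import Dict, List

SOCKET_PATH = "/omd/sites/default/tmp/run/live"

# lsof output copied from the broken state described in this report.
BROKEN = """\
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
naemon 348464 default 12u unix 0x00000000106e0b91 0t0 1394514 /omd/sites/default/tmp/run/live type=STREAM
naemon 348464 default 19u unix 0x000000003ea4cdc5 0t0 1772423 /omd/sites/default/tmp/run/live type=STREAM
naemon 348477 default 12u unix 0x00000000106e0b91 0t0 1394514 /omd/sites/default/tmp/run/live type=STREAM
"""


def pids_with_extra_handles(lsof_output: str, socket_path: str) -> Dict[int, List[str]]:
    """Return {pid: [fd, ...]} for processes holding more than one FD on socket_path."""
    fds_by_pid = defaultdict(list)
    for line in lsof_output.splitlines():
        fields = line.split()
        # Assumed columns: COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ...
        if len(fields) < 9 or not fields[1].isdigit():
            continue  # skip the header row and anything that is not a data row
        if fields[8] != socket_path:
            continue  # not the socket we care about
        fds_by_pid[int(fields[1])].append(fields[3])
    # Keep only the processes with more than one descriptor on the socket.
    return {pid: fds for pid, fds in fds_by_pid.items() if len(fds) > 1}


if __name__ == "__main__":
    print(pids_with_extra_handles(BROKEN, SOCKET_PATH))
```

Run against the broken listing, this reports PID 348464 with descriptors 12u and 19u; against the healthy listing it reports nothing. Whether a second descriptor always means naemon is stuck is not established here, so treat it as a hint rather than a health check.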
[2023/03/29 10:31:06][OMD002][ERROR] 491: failed to connect - failed to connect to /omd/sites/default/tmp/run/live: Resource temporarily unavailable at /omd/sites/default/share/thruk/lib/Thruk/Backend/Manager.pm line 1631.
[2023/03/29 10:31:07][OMD002][ERROR] 491: failed to connect - failed to connect to /omd/sites/default/tmp/run/live: Resource temporarily unavailable at /omd/sites/default/share/thruk/lib/Thruk/Backend/Manager.pm line 1631.
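"Resource temporarily unavailable" is the EAGAIN error text. As far as I know, on Linux a connect() to a Unix domain socket in non-blocking mode fails with EAGAIN when the connection cannot be completed immediately, which would fit naemon no longer accepting connections on the socket. A rough sketch for probing the socket outside of Thruk follows. The helper name `probe_unix_socket` is hypothetical, and the default query (`GET status` with the `program_start` column) is an assumption about the livestatus tables available on this site.

```python
import socket
from typing import Tuple

# Assumed livestatus query; adjust the table/column if they do not exist on your site.
DEFAULT_QUERY = b"GET status\nColumns: program_start\n\n"


def probe_unix_socket(path: str, query: bytes = DEFAULT_QUERY,
                      timeout: float = 5.0) -> Tuple[bool, str]:
    """Connect to a Unix stream socket, send one query, and read the first reply chunk.

    Returns (True, reply_text) on success, or (False, error_text) if connecting,
    sending or receiving fails or times out.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        sock.sendall(query)
        sock.shutdown(socket.SHUT_WR)  # signal that the request is complete
        reply = sock.recv(4096)
        return True, reply.decode(errors="replace").strip()
    except OSError as exc:  # includes timeouts and EAGAIN (BlockingIOError)
        return False, f"{type(exc).__name__}: {exc}"
    finally:
        sock.close()


if __name__ == "__main__":
    print(probe_unix_socket("/omd/sites/default/tmp/run/live"))
```

Running this in a loop while setting the downtime might help confirm whether the socket stops answering at that moment or at some unrelated time.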
There is nothing significant, and nothing that looks like an error, in either naemon.log or livestatus.log.
Our resolution for the problem is:
killall -9 naemon; omd restart naemon
We are not convinced that the downtime action is actually what is causing it; it may just be that it has coincided with the event multiple times.