OMD 5.11.20230318-labs-edition seems to freeze/block the livestatus socket #163

@infraweavers

Description

This is unfortunately a little vague at the moment, but it seems that when we put PG001 (a host) into downtime on 5.11.20230318, naemon locks up or otherwise breaks.

Under normal circumstances, lsof on the livestatus socket returns this:

Every 2.0s: lsof /omd/sites/default/tmp/run/live                                                                                                                                                             OMD002: Wed Mar 29 10:43:51 2023
COMMAND    PID    USER   FD   TYPE             DEVICE SIZE/OFF    NODE NAME
naemon  444013 default   12u  unix 0x000000003aced7f1      0t0 1806884 /omd/sites/default/tmp/run/live type=STREAM
naemon  444027 default   12u  unix 0x000000003aced7f1      0t0 1806884 /omd/sites/default/tmp/run/live type=STREAM

However, when it is broken (i.e. Thruk is timing out communicating with the socket), lsof shows:

OMD[default@OMD002]:~$ lsof /omd/sites/default/tmp/run/live
COMMAND    PID    USER   FD   TYPE             DEVICE SIZE/OFF    NODE NAME
naemon  348464 default   12u  unix 0x00000000106e0b91      0t0 1394514 /omd/sites/default/tmp/run/live type=STREAM
naemon  348464 default   19u  unix 0x000000003ea4cdc5      0t0 1772423 /omd/sites/default/tmp/run/live type=STREAM
naemon  348477 default   12u  unix 0x00000000106e0b91      0t0 1394514 /omd/sites/default/tmp/run/live type=STREAM

It looks as though the same naemon process (PID 348464) has opened another file handle to the socket (19u in addition to 12u), or something similar.
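To compare the two lsof listings automatically, something like the following Python sketch could be used. It only reflects the pattern seen in the two samples above (one FD per naemon PID when healthy, two FDs on one PID when broken); treating "more than one FD on the socket" as a sign of trouble is an assumption, not a confirmed diagnosis. The function names are hypothetical.

```python
from collections import defaultdict


def fds_per_pid(lsof_text):
    """Map each PID to the set of FD column values found in lsof output."""
    fds = defaultdict(set)
    for line in lsof_text.splitlines():
        parts = line.split()
        # Skip blank lines, the header row, and shell prompt / watch banner lines:
        # a data row has a numeric PID in the second column.
        if len(parts) < 4 or not parts[1].isdigit():
            continue
        fds[int(parts[1])].add(parts[3])
    return dict(fds)


def pids_with_extra_handles(lsof_text):
    """Return PIDs holding more than one FD on the socket (assumed heuristic)."""
    return sorted(pid for pid, fds in fds_per_pid(lsof_text).items() if len(fds) > 1)
```

Feeding it the output of `lsof /omd/sites/default/tmp/run/live` from a healthy and a broken state should make the difference between the two listings explicit.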

Thruk logs the following errors:

[2023/03/29 10:31:06][OMD002][ERROR] 491: failed to connect - failed to connect to /omd/sites/default/tmp/run/live: Resource temporarily unavailable at /omd/sites/default/share/thruk/lib/Thruk/Backend/Manager.pm line 1631.
[2023/03/29 10:31:07][OMD002][ERROR] 491: failed to connect - failed to connect to /omd/sites/default/tmp/run/live: Resource temporarily unavailable at /omd/sites/default/share/thruk/lib/Thruk/Backend/Manager.pm line 1631.
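To check whether the socket itself is unresponsive independently of Thruk, a small probe along these lines might help. This is a minimal sketch: it assumes the standard Livestatus query format and that the `status` table exposes a `program_start` column; the function name and the 5 second default timeout are arbitrary choices for illustration.

```python
import socket


def probe_livestatus(path, timeout=5.0):
    """Send one small Livestatus query and report (ok, response_or_error)."""
    # A Livestatus query is plain text terminated by an empty line.
    query = b"GET status\nColumns: program_start\n\n"
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)  # applies to connect, send and recv
    try:
        sock.connect(path)
        sock.sendall(query)
        sock.shutdown(socket.SHUT_WR)  # signal that the query is complete
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
        return True, b"".join(chunks).decode("utf-8", "replace").strip()
    except OSError as exc:
        # Covers timeouts, EAGAIN ("Resource temporarily unavailable"),
        # a missing socket file, and refused connections.
        return False, str(exc)
    finally:
        sock.close()
```

Calling `probe_livestatus("/omd/sites/default/tmp/run/live")` as the site user while the problem is occurring would show whether the connect or the read is the step that fails.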

Nothing significant or error-like appears in naemon.log or livestatus.log.

Our resolution for the problem is:

killall -9 naemon; omd restart naemon

We are not convinced that the downtime action is actually the cause; it may just be that it has correlated with the event multiple times.
