avoid crash/assert on periodic events with too large a random spread #19

hmh · 2021-01-08T22:16:34Z

On production builds where assert() does nothing, a large random spread can cause event_gaga() to schedule a libevent2 "event->fire_event" past the time where the next libevent2 "event->trigger_event" will run.

When that trigger event runs, it schedules a new fire_event (which might come earlier or later than the one already schedule, depending on the two spreads). libevent2 suports this, and will callback fire_cb (the callback for event->fire_event) twice. But fire_cb does not, and lmapd will lose track of the situation. Eventually, one fire_cb will run with event->fire_event already set to NULL, and crash.

Also, events could be reordered by this situation.

On debug builds, you hit an assert at runtime when the situation first happens and event_gaga is called with event->fire_event already pending. The assert() will kill lmapd with a signal, and it will die without any exit cleanups (atexit handlers do not run in this situation).

The proposed fix attempts to turn such a situation (arguably caused by user error, but it can be hard to predict in complex calendar events) into properly logged warnings/errors and recover sanely, so that lmapd isn't stopped and measurements can continue, etc.

Internal errors (bugs) could cause lmapd to call event_free() on a NULL pointer. event_free() is *not* documented to handle NULL pointers, unlike free(), and in fact it does not handle them well at all (at least on libevent2 2.1). That could cause a NULL pointer derreference, and a crash. It is better to cope with it: log the problem, and keep the daemon alive. This issue has been triggered in production, it can happen. CEPTRO.br-issue: MDC-1016

Should the random spread mechanism cause two firings of an event to accumulate, we'd call event_gaga() on an event that already has event->fire_event set and active. This either causes an assertion failure on event_gaga() when trying to add the second event while the first is still active, or if assertions are disabled (production build), it overwrites event->fire_event, forgetting about the already active event. This causes two "fire" events to be active, something the current code is not prepared to support. The obvious issue is that fire_cb() will run twice for the same (lmapd) struct event, which would crash before the change of that codepath to safe_event_free(). But there could be other issues too, such as incorrect event timestamp metadata. Explicitly detect that an event is already active in event_gaga() and output an error message (and cope gracefully, by losing the new event). This ensures that the event that does trigger will be handled correctly and have correct metadata, etc.

Refuse configurations that would knowingly cause periodic event reordering due to a too large random spread. It would be nice to detect this also for the calendar event type, but that's non-trivial, and not worth the hassle now that lmapd has been changed to handle this gracefully (instead of triggering a crash or assertion failure). CEPTRO.br-issue: MDC-1016

hmh added 3 commits January 8, 2021 18:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

avoid crash/assert on periodic events with too large a random spread #19

avoid crash/assert on periodic events with too large a random spread #19

Uh oh!

hmh commented Jan 8, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

avoid crash/assert on periodic events with too large a random spread #19

Are you sure you want to change the base?

avoid crash/assert on periodic events with too large a random spread #19

Uh oh!

Conversation

hmh commented Jan 8, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant