Skip to content

Conversation

@hmh
Copy link
Contributor

@hmh hmh commented Jan 8, 2021

On production builds where assert() does nothing, a large random spread can cause event_gaga() to schedule a libevent2 "event->fire_event" past the time where the next libevent2 "event->trigger_event" will run.

When that trigger event runs, it schedules a new fire_event (which might come earlier or later than the one already schedule, depending on the two spreads). libevent2 suports this, and will callback fire_cb (the callback for event->fire_event) twice. But fire_cb does not, and lmapd will lose track of the situation. Eventually, one fire_cb will run with event->fire_event already set to NULL, and crash.

Also, events could be reordered by this situation.

On debug builds, you hit an assert at runtime when the situation first happens and event_gaga is called with event->fire_event already pending. The assert() will kill lmapd with a signal, and it will die without any exit cleanups (atexit handlers do not run in this situation).

The proposed fix attempts to turn such a situation (arguably caused by user error, but it can be hard to predict in complex calendar events) into properly logged warnings/errors and recover sanely, so that lmapd isn't stopped and measurements can continue, etc.

hmh added 3 commits January 8, 2021 18:50
Internal errors (bugs) could cause lmapd to call event_free() on a NULL
pointer.  event_free() is *not* documented to handle NULL pointers,
unlike free(), and in fact it does not handle them well at all (at least
on libevent2 2.1).

That could cause a NULL pointer derreference, and a crash.  It is better
to cope with it: log the problem, and keep the daemon alive.

This issue has been triggered in production, it can happen.

CEPTRO.br-issue: MDC-1016
Should the random spread mechanism cause two firings of an event to
accumulate, we'd call event_gaga() on an event that already has
event->fire_event set and active.

This either causes an assertion failure on event_gaga() when trying to
add the second event while the first is still active, or if assertions
are disabled (production build), it overwrites event->fire_event,
forgetting about the already active event.

This causes two "fire" events to be active, something the current code
is not prepared to support.  The obvious issue is that fire_cb() will
run twice for the same (lmapd) struct event, which would crash before
the change of that codepath to safe_event_free().  But there could be
other issues too, such as incorrect event timestamp metadata.

Explicitly detect that an event is already active in event_gaga() and
output an error message (and cope gracefully, by losing the new event).
This ensures that the event that does trigger will be handled correctly
and have correct metadata, etc.
Refuse configurations that would knowingly cause periodic event
reordering due to a too large random spread.

It would be nice to detect this also for the calendar event type, but
that's non-trivial, and not worth the hassle now that lmapd has been
changed to handle this gracefully (instead of triggering a crash or
assertion failure).

CEPTRO.br-issue: MDC-1016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant