avoid crash/assert on periodic events with too large a random spread #19
On production builds, where assert() does nothing, a large random spread can cause event_gaga() to schedule a libevent2 "event->fire_event" past the time at which the next libevent2 "event->trigger_event" will run.
When that trigger event runs, it schedules a new fire_event (which might come earlier or later than the one already scheduled, depending on the two spreads). libevent2 supports this, and will call fire_cb (the callback for event->fire_event) twice. But fire_cb does not support it, and lmapd will lose track of the situation. Eventually, one fire_cb will run with event->fire_event already set to NULL, and crash.
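The timing condition described above can be written down as plain arithmetic. The following is a minimal sketch, not lmapd code: `struct overlap` and `check_overlap()` are hypothetical names, the first trigger is assumed to run at t = 0, and all times are in the same arbitrary unit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of two consecutive periodic triggers; not lmapd code. */
struct overlap {
    bool still_pending; /* first fire has not run yet when the next trigger runs */
    bool reordered;     /* second fire would run before the first one */
};

/*
 * period: time between two trigger events
 * delay1: random delay picked for the fire scheduled by the first trigger
 * delay2: random delay picked for the fire scheduled by the second trigger
 */
static struct overlap check_overlap(uint64_t period, uint64_t delay1,
                                    uint64_t delay2)
{
    uint64_t fire1 = delay1;          /* first trigger assumed at t = 0 */
    uint64_t trigger2 = period;       /* next trigger */
    uint64_t fire2 = period + delay2; /* fire scheduled by the next trigger */
    struct overlap o;

    o.still_pending = fire1 > trigger2;
    o.reordered = o.still_pending && fire2 < fire1;
    return o;
}
```

With a period of 60 and delays of 90 and 10, the first fire (t = 90) is still pending when the second trigger runs (t = 60), and the second fire (t = 70) overtakes it, which is the reordering case.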
Also, events could be reordered by this situation.
On debug builds, you hit an assert at runtime the first time the situation occurs, when event_gaga() is called with event->fire_event already pending. The failed assert() kills lmapd with a signal, so it dies without any exit cleanup (atexit handlers do not run in this situation).
The proposed fix attempts to turn such a situation (arguably caused by user error, but it can be hard to predict in complex calendar events) into properly logged warnings/errors and recover sanely, so that lmapd isn't stopped and measurements can continue, etc.
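As an illustration of the general idea only (the actual patch may well differ), a guard of this kind can replace the assert with a logged warning and a recoverable return value. Everything in this sketch is hypothetical: `struct sketch_event`, `sketch_schedule_fire()` and `sketch_fire_done()` are made-up names, and a plain flag stands in for the libevent2 event.

```c
#include <stdio.h>

/* Hypothetical stand-in for an lmapd event; not the real structure. */
struct sketch_event {
    const char *name;
    int fire_pending; /* non-zero while a fire is scheduled but has not run */
};

/*
 * Try to schedule a fire for ev.
 * Returns 0 when a fire was scheduled, -1 when one was already pending.
 * In the pending case the new fire is skipped and a warning is logged,
 * instead of asserting or scheduling a second callback.
 */
static int sketch_schedule_fire(struct sketch_event *ev)
{
    if (ev->fire_pending) {
        fprintf(stderr,
                "warning: event '%s': previous fire still pending, "
                "skipping this one (random spread too large?)\n",
                ev->name ? ev->name : "<unnamed>");
        return -1;
    }
    ev->fire_pending = 1;
    return 0;
}

/* Called from the (hypothetical) fire callback once the fire has run. */
static void sketch_fire_done(struct sketch_event *ev)
{
    ev->fire_pending = 0;
}
```

Skipping the second fire is only one possible policy; replacing the pending fire with the new one would be another. Either way, the daemon keeps running and the problem shows up in the log.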