Skip to content

Potential overflow in touch_cache #7028

@jc-harrison

Description

@jc-harrison

The FlowKit cache scoring mechanism (borrowed from https://github.com/dask/cachey) aims to retain cache entries that are frequently and/or recently used, by calculating a cache score with a contribution from each cache touch that decays exponentially over time, with more recent touches weighted more heavily.

This is achieved by adding to a cache entry's score a contribution proportional to $(1+\varepsilon)^t$ each time the cache entry is accessed (where $t$ is the "time" of the cache touch, with "time" defined as an incrementing counter of all cache touches across all cache entries).

The problem with this approach is that the value added to cache_score_multiplier grows exponentially with the number of cache touches, so in a long-running FlowKit instance the cache scores can get very large. In the worst case, if the number of cache touches exceeds 1024 * cache_half_life(), attempting to touch a cache record will result in a double-precision overflow.

This can be mitigated by resetting the cache via flowmachine.core.cache.reset_cache(), which resets the counter. But ideally we would want to be able to keep the cache active indefinitely without resetting.

An alternative would be to scale down the previous cache score multiplier each time the cache record is touched, instead of increasing the value added for each cache touch. I.e instead of

UPDATE cache.cached
SET cache_score_multiplier = cache_score_multiplier + power( 1 + ln(2)/cache_half_life(),
                                                             nextval('cache.cache_touches') - 2
                                                           );

we could record the last cache touch counter value, and then do

SELECT nextval('cache.cache_touches') INTO this_cache_touch_number;
UPDATE cache.cached
SET cache_score_multiplier = cache_score_multiplier*power( 1-ln(2)/cache_half_life(),
                                                           this_cache_touch_number-cache_touch_number
                                                         ) + 1,
    cache_touch_number = this_cache_touch_number;

(or equivalently,

cache_score_multiplier = cache_score_multiplier*power( 2,
                                                       (cache_touch_number-this_cache_touch_number)/cache_half_life()
                                                     ) + 1

).

This would require adding a cache_touch_number column to cache.cached, and also we would need to scale down the cache score here:

RETURN cache_score_multiplier*((compute_time/1000)/tablesize);
by a factor of power(2, (cache_touch_number - currval('cache.cache_touches'))/cache_half_life()). Alternatively we could replace (cache_touch_number-this_cache_touch_number) with EXTRACT('seconds' FROM last_accessed - NOW()), and do away with the cache_touches counter altogether, so that the contribution from each cache touch decays exponentially with real time rather than total number of cache touches.

This approach is equivalent to the current one, but by normalising the cache score multipliers at each touch we prevent them from growing excessively large. There is then the possibility of underflow, instead of overflow, if a cache entry has been previously touched but not for a very long time - but in the event this occurs we could simply ignore the contribution from previous cache touches because it will be negligible compared to that from the new cache touch (i.e. set cache_score_multiplier=1).

Metadata

Metadata

Assignees

No one assigned

    Labels

    FlowDBIssues related to FlowDBbugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions