-
Notifications
You must be signed in to change notification settings - Fork 25
Streaming Aggregates for Queries #1178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…emental evaluation
…in streaming mode
…reaming aggregates
…enhance aggregate function handling in select and group processing
…d streamline group updates; improve finalization of aggregate states
zonotope
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to start a discussion about the approach and whether or not my suggestions are feasible.
src/fluree/db/query/exec/eval.cljc
Outdated
|
|
||
| (declare compare*) | ||
|
|
||
| (def streaming-aggregate-registry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having these grouped in a map like this implies that they are dynamic and will change, but these are a static, closed list as far as I can tell.
I think we're reinventing facilities the language already has. I think we could implement the same functionality with a StreamingAggregator protocol whose methods are initialize!, step!, and finalize, and then records that implement that protocol for all the keys in this registry map. Those records could also encapsulate their state .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…enhance selector handling; improve error logging for aggregate selectors
…factor related functions in eval and group modules
| #?(:clj (defmulti ->offset-date-time | ||
| #(when-let [t (#{OffsetDateTime LocalDateTime LocalDate} (type %))] | ||
| t))) | ||
| #?(:clj (defmethod ->offset-date-time OffsetDateTime | ||
| [^OffsetDateTime datetime] | ||
| datetime)) | ||
| #?(:clj (defmethod ->offset-date-time LocalDateTime | ||
| [^LocalDateTime datetime] | ||
| (.atOffset datetime ZoneOffset/UTC))) | ||
| #?(:clj (defmethod ->offset-date-time LocalDate | ||
| [^LocalDate date] | ||
| (.atOffset (.atStartOfDay date) ZoneOffset/UTC))) | ||
| #?(:clj (defmethod ->offset-date-time :default | ||
| [x] | ||
| (throw (ex-info "Cannot convert value to OffsetDateTime." | ||
| {:value x | ||
| :status 400 | ||
| :error :db/invalid-fn-call})))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this fits better as a protocol since we're dealing with predefined types directly. I like to reserve multimethods for flat clojure data like maps and vectors, but protocols get much better performance if we have the option to let the jvm dispatch on type.
Here's how I would define and implement a protocol to convert to an offset datetime. I haven't run this code, so there might be some typos.
| #?(:clj (defmulti ->offset-date-time | |
| #(when-let [t (#{OffsetDateTime LocalDateTime LocalDate} (type %))] | |
| t))) | |
| #?(:clj (defmethod ->offset-date-time OffsetDateTime | |
| [^OffsetDateTime datetime] | |
| datetime)) | |
| #?(:clj (defmethod ->offset-date-time LocalDateTime | |
| [^LocalDateTime datetime] | |
| (.atOffset datetime ZoneOffset/UTC))) | |
| #?(:clj (defmethod ->offset-date-time LocalDate | |
| [^LocalDate date] | |
| (.atOffset (.atStartOfDay date) ZoneOffset/UTC))) | |
| #?(:clj (defmethod ->offset-date-time :default | |
| [x] | |
| (throw (ex-info "Cannot convert value to OffsetDateTime." | |
| {:value x | |
| :status 400 | |
| :error :db/invalid-fn-call})))) | |
| #?(:clj | |
| (defprotocol OffsetDateTimeConverter | |
| (->offset-date-time [this]))) | |
| #?(:clj | |
| (extend-protocol OffsetDateTimeConverter | |
| OffsetDateTime | |
| (->offset-date-time | |
| [^OffsetDateTime this] | |
| this) | |
| LocalDateTime | |
| (->offset-date-time | |
| [^LocalDateTime this] | |
| (.atOffset this ZoneOffset/UTC)) | |
| LocalDate | |
| (->offset-date-time | |
| [^LocalDate this] | |
| (-> this .atStartOfDay (.atOffset ZoneOffset/UTC))) | |
| Object | |
| (->offset-date-time | |
| [this] | |
| (throw (ex-info "Cannot convert value to OffsetDateTime." | |
| {:value this | |
| :status 400 | |
| :error :db/invalid-fn-call}))))) |
| #?(:clj (defmulti ->offset-time | ||
| #(when-let [t (#{OffsetTime LocalTime} (type %))] | ||
| t))) | ||
| #?(:clj (defmethod ->offset-time OffsetTime | ||
| [^OffsetTime time] | ||
| time)) | ||
| #?(:clj (defmethod ->offset-time LocalTime | ||
| [^LocalTime time] | ||
| (.atOffset time ZoneOffset/UTC))) | ||
| #?(:clj (defmethod ->offset-time :default | ||
| [x] | ||
| (throw (ex-info "Cannot convert value to OffsetTime." | ||
| {:value x | ||
| :status 400 | ||
| :error :db/invalid-fn-call})))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, here's how I would define and implement an offset time converter protocol.
| #?(:clj (defmulti ->offset-time | |
| #(when-let [t (#{OffsetTime LocalTime} (type %))] | |
| t))) | |
| #?(:clj (defmethod ->offset-time OffsetTime | |
| [^OffsetTime time] | |
| time)) | |
| #?(:clj (defmethod ->offset-time LocalTime | |
| [^LocalTime time] | |
| (.atOffset time ZoneOffset/UTC))) | |
| #?(:clj (defmethod ->offset-time :default | |
| [x] | |
| (throw (ex-info "Cannot convert value to OffsetTime." | |
| {:value x | |
| :status 400 | |
| :error :db/invalid-fn-call})))) | |
| #?(:clj | |
| (defprotocol OffsetTimeConverter | |
| (->offset-time [this]))) | |
| #?(:clj | |
| (extend-protocol OffsetTimeConverter | |
| OffsetTime | |
| (->offset-time | |
| [^OffsetTime this] | |
| this) | |
| LocalTime | |
| (->offset-time | |
| [^LocalTime this] | |
| (.atOffset time ZoneOffset/UTC)) | |
| Object | |
| (->offset-time | |
| [this] | |
| (throw (ex-info "Cannot convert value to OffsetTime." | |
| {:value this | |
| :status 400 | |
| :error :db/invalid-fn-call}))))) |
src/fluree/db/query/exec/group.cljc
Outdated
| (defn- streaming-agg-selector? | ||
| "Returns true if selector supports streaming aggregation." | ||
| [sel] | ||
| (or (instance? fluree.db.query.exec.select.StreamingAggregateSelector sel) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could require this type to avoid this long line
src/fluree/db/query/exec/group.cljc
Outdated
| (every? streaming-agg-selector? selectors) | ||
| (and (seq group-vars) | ||
| (every? (fn [sel] | ||
| (or (and (instance? fluree.db.query.exec.select.VariableSelector sel) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could require this type to avoid this long line
src/fluree/db/query/exec/group.cljc
Outdated
| selectors))))) | ||
|
|
||
| (defn- update-streaming-groups | ||
| "Reducer function that updates streaming aggregate states for each solution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I read "reducer function", I think a function you pass to reduce. Then the function signature with four arguments is immediately confusing. Perhaps the docstring could be tweaked so that the first sentence is "Updates streaming aggregae states for a solution". I think that's clearer.
| (defrecord CountAggregator [] | ||
| StreamingAggregator | ||
| (initialize [_] 0) | ||
| (step [_ state tv] | ||
| (if (some-> tv :value some?) | ||
| (inc state) | ||
| state)) | ||
| (finalize [_ state] | ||
| (where/->typed-val state))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the code would be simpler if the aggregator records encapsulated their state. Then you could use the aggregators themselves to keep track of the running tally. I also would remove the initialize method in favor of a constructor. Here's how I'd change the CountAggregator:
| (defrecord CountAggregator [] | |
| StreamingAggregator | |
| (initialize [_] 0) | |
| (step [_ state tv] | |
| (if (some-> tv :value some?) | |
| (inc state) | |
| state)) | |
| (finalize [_ state] | |
| (where/->typed-val state))) | |
| (defrecord CountAggregator [tally] | |
| StreamingAggregator | |
| (step [this tv] | |
| (if (some-> tv :value some?) | |
| (update this :tally inc))) | |
| (finalize [_] | |
| (where/->typed-val tally))) | |
| (defn count-aggregator | |
| [] | |
| (->CountAggregator 0)) |
I'd change the other records similarly.
src/fluree/db/query/exec/select.cljc
Outdated
| (cond-> (->AggregateSelector agg-function) | ||
| (->AggregateSelector agg-function)) | ||
| ([agg-function streaming-agg] | ||
| (if streaming-agg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this conditional should be lifted to the caller of this constructor function, and this function should be replaced with two different functions. In general I think conditionals should be evaluated as soon as possible. This simplifies the code by making clear execution branches instead of putting everything together and transmitting conditionals without context. If there's shared functionality between the two branches, then that could still be implemented with helper functions.
src/fluree/db/query/fql/parse.cljc
Outdated
| (let [code (parse-code f) | ||
| fn-name (when (seq? code) (first code)) | ||
| agg-vars (variables code) | ||
| agg-fn (eval/compile code context) | ||
| streaming-agg (build-streaming-agg code) | ||
| agg-info {:fn-name fn-name | ||
| :vars agg-vars}] | ||
| (select/aggregate-selector agg-fn streaming-agg agg-info))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With respect to my previous comment about lifting the conditional, I would rewrite this function this way:
| (let [code (parse-code f) | |
| fn-name (when (seq? code) (first code)) | |
| agg-vars (variables code) | |
| agg-fn (eval/compile code context) | |
| streaming-agg (build-streaming-agg code) | |
| agg-info {:fn-name fn-name | |
| :vars agg-vars}] | |
| (select/aggregate-selector agg-fn streaming-agg agg-info))) | |
| (let [code (parse-code f) | |
| fn-name (when (seq? code) (first code)) | |
| agg-vars (variables code) | |
| agg-fn (eval/compile code context) | |
| agg-info {:fn-name fn-name | |
| :vars agg-vars}] | |
| (if-let [streaming-agg (build-streaming-agg code)] | |
| (select/streaming-aggregate-selector agg-fn streaming-agg agg-info) | |
| (select/aggregate-selector agg-fn agg-info)))) |
I'd do the same thing to parse-select-as-fn too, although a lot of what I'd change was here before this pr.
| streaming-aggs (->> selectors | ||
| (filter streaming-agg-selector?) | ||
| (mapv :streaming-agg)) | ||
| implicit? (and (empty? group-vars) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is necessary, then I think it's a bug unless I'm missing something. select/implicit-grouping?. If I remember correctly, select/implicit-grouping? should only return a truthy value if there are no grouped variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
select/implicit-grouping? only checks if a selector is an aggregate type, not whether there are group variables. Both conditions are needed for implicit grouping detection
src/fluree/db/query/exec/group.cljc
Outdated
| (mapv :streaming-agg)) | ||
| implicit? (and (empty? group-vars) | ||
| (some select/implicit-grouping? selectors)) | ||
| streaming? (streaming-eligible? having streaming-aggs implicit? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this needs to be a separate binding. I think (if (streaming-eligible? having ...) is just as if not more readable.
|
@zonotope ready for re-review: Changes Made
|
|
Note on ci/cd errors, this is fixed with #1190 -- we use 'latest' version of clj-kondo and I think a new version is flagging these. |
Summary
Implements incremental streaming aggregation for queries, reducing memory overhead for aggregate queries over large result sets. Instead of collecting all solutions into groups before computing aggregates, this approach updates aggregate state incrementally as solutions flow through the query pipeline.
Key Changes
src/fluree/db/query/exec/eval.cljc) - Registry of{:init :step! :final}functions forcount,count-distinct,sum,avg,min,maxsrc/fluree/db/query/exec/group.cljc) - Automatically uses streaming path when query structure permits (no HAVING clause, all selectors are group vars or streamable aggregates)src/fluree/db/query/exec/select.cljc) -AggregateSelectorandAsSelectornow carry optionalstreaming-aggdescriptorsrc/fluree/db/query/fql/parse.cljc) - Builds streaming descriptors for simple aggregate forms like(count ?x),(sum ?y)How It Works
Traditional grouping collects all solutions, then applies aggregate functions:
solutions → [collect all] → group → apply agg-fn → results
Streaming aggregation updates state incrementally:
solutions → step!(state, value) per solution → final(state) → results
Supported Aggregates
countcountwith*count-distinctsumavgminmaxLimitations
Falls back to traditional grouping when:
HAVINGclause(sum (+ ?x ?y)))sample,group_concat)