feat: Add GenAI/LLM telemetry support for ClickHouse #259
base: main
Conversation
- Add --materialize-genai-fields flag to DDL tool for creating materialized columns (GenAIOperationName, GenAIProviderName, token counts, etc.)
- Add --partition-by-service-name flag for multi-tenant optimization
- Add genai_redaction_processor.py for PII redaction in LLM messages
- Update documentation with usage examples
rjenkins
left a comment
Thanks @brightsparc, did a quick skim and this looks pretty good! I'm going to dig into the Python code tomorrow and should be able to finish review.
Question on the DDL: do you know if there have been any changes on the otel collector side regarding the schema for the CH exporter with the GenAI changes? I checked here https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/clickhouseexporter/internal/sqltemplates, but it doesn't look like anyone has done any work there yet either. Either way, excited to get this in.
Can you merge from main to include this PR: #256? That should allow us to schedule the status checks.
I'm not aware of any specific new fields at the table level; most of these changes are under the span attributes AFAIK.
rjenkins
left a comment
@brightsparc, I left a comment on the processor regarding overlap with the existing redaction processor and would like to hear your thoughts on that. Additionally, I'm curious how you are using/querying the data with these changes today. Are you using ClickStack, or are you directly querying ClickHouse in some other manner?
The other big change here is the schema changes to CH. Nice job on making them "opt-in" via args on clickhouse-ddl. I'm reaching out to our friends at ClickHouse to get their thoughts on these, so give me a day or two to get back to you.
At a high level, there are essentially two things we should consider.
- What is the performance impact on insert for the materialized columns and partition/order by changes. As mentioned it's opt-in, so not necessarily a blocker but there is likely additional load on the CH server and theoretically a drop in insert performance. Ideally we'd quantify that and make a note in the README.md about the perf impact. On the flip side the query performance will be better for aligned queries.
We can likely help with some of the perf analysis if you can share some code to generate synthetic spans with the related genai fields or perhaps even more useful we could modify our load generator https://github.com/streamfold/otel-loadgen to include these new fields and we can run some benchmarks.
- Does this have an impact on ClickStack? I suspect ClickStack will ignore these additional columns, so it might not be an issue, but we should test and verify we don't break anything with their UI.
If the ClickHouse folks find these additional columns compelling, we may want to consider more closely how we generate them (materialized, or transformed natively in Rotel's CH exporter) and whether we should add them to the default OTel traces table or whether they're better suited to a separate GenAI traces table.
@brightsparc - Thanks for this processor. It seems like the bulk of the improvements here are some DX around handling redaction as it relates to items with the tool_call prefix within "gen_ai.input.messages" and "gen_ai.output.messages"? This seems useful to me, and we definitely want to add more out-of-the-box processors, so we might land this as-is; however, I'm curious if you looked at the generic redaction processor here: https://github.com/streamfold/rotel/blob/main/rotel_python_processor_sdk/processors/redaction_processor.py.
It seems like there is a fair amount of overlap. The redaction_processor and its configuration arguments are based loosely on the similar Go processor. Would some of these changes slot into the existing redaction processor as additional configuration options? WDYT about the overlap, and what was missing from the redaction processor that makes this genai_redaction_processor easier to use?
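To make the distinction concrete, here is a minimal sketch of structure-aware redaction for gen_ai.input.messages. A generic key-pattern redactor treats the whole attribute as one opaque string; this instead walks the parsed message list and scrubs only content and tool_call argument values, preserving the message/tool-call structure. Function names, the message shape, and the single email pattern are illustrative assumptions, not the actual genai_redaction_processor API.

```python
import json
import re

# Illustrative PII pattern; a real processor would support a configurable set.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact_messages(raw: str) -> str:
    """Redact PII inside a serialized gen_ai.*.messages attribute.

    Assumes the (hypothetical) shape: a list of messages, each with "parts"
    that are either text parts or tool_call parts.
    """
    messages = json.loads(raw)
    for msg in messages:
        for part in msg.get("parts", []):
            if part.get("type") == "text":
                part["content"] = EMAIL_RE.sub("[REDACTED]", part.get("content", ""))
            elif part.get("type") == "tool_call":
                # Keep the tool_call structure (name, argument keys) intact;
                # scrub only string argument values.
                args = part.get("arguments", {})
                for k, v in args.items():
                    if isinstance(v, str):
                        args[k] = EMAIL_RE.sub("[REDACTED]", v)
    return json.dumps(messages)
```

The generic redaction processor could likely cover the text parts via key/value patterns; the tool_call-aware traversal is the piece that seems GenAI-specific.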
Yes, I wasn't sure if there was a way to handle the same sort of GenAI-specific redaction with the existing out-of-the-box solution. I implemented this as more of a POC to see if I could include it as part of the broader GenAI materialization of properties. Also, I could see value in having an option to use Presidio for more sophisticated scrubbing.
I'm happy to drop this from the PR if it's not adding value.
Presidio looks neat. We don't need to drop this, and agreed it's probably a good idea to keep it separate for now (like a POC, as you say) so we can iterate on it without worrying about impact or accidental breakage to the generic redaction processor. I'm just wondering if longer term the capabilities will overlap and we can consolidate. With that said, we want to grow the processor library, so overlap is not a major concern at the moment, and if we later find users are confused as to "which processor to use" we can consolidate.
```rust
        "ORDER_BY",
        &get_order_by("(ServiceName, SpanName, toDateTime(Timestamp))", engine),
    ),
    ("PARTITION_BY", &get_partition_by(partition_expr, engine)),
```
Note: We need to review performance impact on insert
The goal here was to be able to support multi-tenant querying of otel records. I was going to do this by service.name, but I'm open to other/better ways of adding partitioning/indexing.
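The clause swap behind --partition-by-service-name could be sketched as below. This mirrors the ORDER_BY/PARTITION_BY substitution visible in the quoted ddl_traces.rs excerpt; the default branch and the function name are assumptions for illustration, as the real logic is in Rust.

```python
def traces_table_clauses(partition_by_service_name: bool) -> dict:
    """Sketch of DDL clause selection for the traces table.

    With the flag on, ServiceName leads both the partition key and the
    sorting key, so per-tenant queries prune whole partitions and seek
    within the primary index.
    """
    if partition_by_service_name:
        return {
            "PARTITION_BY": "(ServiceName, toDate(Timestamp))",
            "ORDER_BY": "(ServiceName, SpanName, toDateTime(Timestamp))",
        }
    # Assumed default: time-based partitioning only.
    return {
        "PARTITION_BY": "toDate(Timestamp)",
        "ORDER_BY": "(ServiceName, SpanName, toDateTime(Timestamp))",
    }
```

The trade-off discussed above applies: more partitions (one per service per day) means more parts to merge on insert, in exchange for cheaper per-tenant reads.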
```rust
// Uses SpanAttributes['key'] syntax and casts for numeric values
const GENAI_MATERIALIZED_MAP_SQL: &str = r#"
-- GenAI MATERIALIZED columns (extracted from SpanAttributes Map)
GenAIConversationId String MATERIALIZED SpanAttributes['gen_ai.conversation.id'],
```
Need to review performance impact on insert.
I wasn't aware of this project; I think it could be good to simulate a combination of various multi-turn, tool-calling scenarios, etc. The idea of adding the indexes was to be able to quickly get to the last message in a multi-turn conversation instead of having to query the whole large JSON object.
I initially started this PR by looking to transform the objects with a GenAI-specific exporter, but decided to go this route instead, which put the heavy lifting back onto ClickHouse. That felt like a more optimal approach, but it would be interesting to benchmark.
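The "last message" materialization described above can be mirrored in a few lines of Python. In the real schema the extraction happens once at insert time inside the MATERIALIZED column expression in ClickHouse SQL; this sketch only shows the intent (the function name is illustrative).

```python
import json

def last_message(raw_messages: str) -> str:
    """Extract the final turn from a serialized gen_ai.*.messages array.

    This is what a GenAILastInputMessage / GenAILastOutputMessage column
    precomputes, so queries avoid parsing the full conversation JSON.
    """
    try:
        msgs = json.loads(raw_messages)
    except ValueError:
        return ""
    return json.dumps(msgs[-1]) if isinstance(msgs, list) and msgs else ""
```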
Thanks for the context. I will follow up with more thoughts on the schema changes in the coming days. RE: the otel-loadgen, if you want to open a PR to simulate a combination of various multi-turn, tool-calling scenarios, that would be 👍
Hey @brightsparc, we're +1 on landing this but would like to get some benchmarks on our side first. We have a harness to run the benches, but would you be willing to open a PR for https://github.com/streamfold/otel-loadgen that generates synthetic data for the materialized columns? We can then run benchmarks on our end to see what the performance looks like, update the README.md with any relevant data, and get this merged. Also, it might be nice to do a blog post about it if you're interested. We should be able to configure ClickStack to view the new materialized columns, and we could discuss some use cases if you're interested in collaborating on a post.
It might be worth using a public open dataset that has a number of multi-turn inputs/outputs. A good example is the Salesforce APIGen dataset, which has 5k examples for multi-turn. They also have an older dataset of 60k examples that were typically single-turn, although with multiple tool calls. You could use this as a source to create a decent column of data that has input and output messages.
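A loadgen adapter for such a dataset might look like the sketch below: one multi-turn example becomes the GenAI span attributes this PR materializes. The "conversation" field name is an assumption about the dataset layout; the attribute keys follow the GenAI semantic conventions used in this PR.

```python
import json

def example_to_span_attrs(example: dict) -> dict:
    """Map one multi-turn dataset example to GenAI span attributes.

    Treats all but the last turn as input and the final turn as output,
    which is a simplifying assumption for load generation.
    """
    turns = example["conversation"]  # assumed dataset field name
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.input.messages": json.dumps(turns[:-1]),
        "gen_ai.output.messages": json.dumps(turns[-1:]),
    }
```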
This is now merged, going to load test this branch tomorrow |
rjenkins
left a comment
Hi @brightsparc.
I did some end-to-end and some light perf testing on this Friday with the https://huggingface.co/datasets/Salesforce/APIGen-MT-5k data and our load testing tool. Everything worked so we can get this merged, but I'd like us to make a few changes before we do.
Could you implement the following change requests?
- Support the --enable-json option. When I tested the DDL, --materialize-genai-fields is incompatible with the JSON column type. Ideally these should work together, as the JSON column type is a big performance boost. It looks like you need to change items like the following:
  SpanAttributes['gen_ai.operation.name'] -> **SpanAttributes.gen_ai.operation.name**
- Write a unit test for the processor. In our last PR I added commits that demonstrate how to write unit tests for processors from the Rust side. Could you take a crack at writing one for the new processor you have here? Here's an example from the metadata PR: 474c907#diff-19c63172d2682534b1050188b0486b25c5fca22d5af028e630b8f7367e76b9a2R2866
- Add a note to the README.md for clickhouse-ddl marking the --materialize-genai-fields and --partition-by-service-name flags as experimental. In particular, note that the feature is "currently experimental" and we haven't tested it extensively yet for performance or production usage. However, we are interested in hearing from users who want better support for these GenAI use cases and want to discuss performance and requirements.
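The Map-vs-JSON syntax switch in the first request above could be handled by a small expression helper. This is a hypothetical sketch (the real change belongs in ddl_traces.rs, in Rust): with the default Map column type, attributes are read as SpanAttributes['key'], while the JSON column type uses dotted-path syntax.

```python
def attr_expr(key: str, enable_json: bool) -> str:
    """Build the ClickHouse expression to read a span attribute.

    Map column type  -> SpanAttributes['gen_ai.operation.name']
    JSON column type -> SpanAttributes.gen_ai.operation.name
    """
    if enable_json:
        return f"SpanAttributes.{key}"
    return f"SpanAttributes['{key}']"
```

The materialized-column templates would then call this helper instead of hard-coding the Map subscript syntax.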
Finally, we'd love to hear more about your use cases for LLM input/output and observability. If you have some time in the coming weeks to hop on a zoom for a quick chat we'd enjoy discussing.
Finally, some perf testing notes...
One thing that made it slightly difficult for us to test at scale on Friday was that our load testing pipeline includes Kafka, and we ran into some oversized message issues. Essentially our test harness looks like this:
loadgen->edge rotel->Kafka->gateway rotel->ClickHouse
Due to the large size of the Spans (with the LLM input/output) we needed to tune down the batch sizes in Rotel. After some fiddling and an increase in the Kafka max.message.size we could push the data through. Essentially there is batching in Rotel and also batching in the rdkafka library at the Kafka exporter, and with really large spans it's possible Rotel's batches will exceed the Kafka message size in bytes; neither rdkafka nor Rotel's Kafka exporter currently has the ability to "re-split" these messages after batching.
The issue is not caused by this PR; however, the large Spans induce the scenario, so it's something we're looking into, and testing this was helpful. One potential approach is that we might disable Rotel's batching when using the Kafka exporter and allow rdkafka to just handle batching. This could also likely improve performance.
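The "re-split after batching" idea could be sketched as below: if an encoded batch would exceed the broker's message size limit, split it into sub-batches whose encoded sizes fit. Sizing each span independently is a simplification (a real exporter would size the actual serialized protobuf payload, including framing overhead), so treat this as illustrative only.

```python
def split_batch(spans, encode, max_bytes):
    """Split spans into sub-batches whose total encoded size fits max_bytes.

    A single span larger than max_bytes is still emitted alone, since it
    cannot be split further at this layer.
    """
    batches, current, current_size = [], [], 0
    for span in spans:
        size = len(encode(span))
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(span)
        current_size += size
    if current:
        batches.append(current)
    return batches
```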
Glad the load test was helpful. I'm wondering whether there is also a configuration that doesn't use Kafka that could be optional? I can look at the other changes you mentioned, but it might take a week. So if you want to push the changes into the branch like the other PR, I'm also happy to review.
re: Kafka, there's no requirement to use Kafka; it's just the way we've set up our load testing framework to ensure reliable delivery. re: this week, no worries, there is no rush. I'm pretty slammed this week too, so I likely won't get a chance to push any changes. Whoever gets to it first, just drop a comment here and we'll get it landed in the coming weeks.
This PR adds support for GenAI/LLM observability workloads by extending the existing ClickHouse traces exporter with materialized fields optimized for LLM telemetry queries.
Changes
ClickHouse DDL Tool (clickhouse-ddl)
Added two new flags for traces table creation:
- --materialize-genai-fields: Creates MATERIALIZED columns that extract GenAI semantic convention attributes from SpanAttributes, enabling efficient querying without parsing JSON on every query.
- --partition-by-service-name: Optimizes multi-tenant deployments by partitioning data by (ServiceName, toDate(Timestamp)) and ordering by (ServiceName, Timestamp).
Materialized GenAI Fields
When --materialize-genai-fields is enabled, the following columns are automatically extracted: GenAIOperationName, GenAIProviderName, GenAIRequestModel, GenAIResponseModel, GenAIInputTokens, GenAIOutputTokens, GenAISystemFingerprint, GenAILastInputMessage, GenAILastOutputMessage.
Python Processors
Added genai_redaction_processor.py for redacting PII in the GenAI message attributes gen_ai.input.messages and gen_ai.output.messages. Matched values are replaced with [REDACTED].
Usage
Creating Tables
```shell
# Create traces table with GenAI materialized fields and multi-tenant partitioning
clickhouse-ddl create \
  --endpoint http://localhost:8123 \
  --database otel \
  --traces \
  --materialize-genai-fields \
  --partition-by-service-name
```
Running with PII Redaction
Querying GenAI Data
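An illustrative query sketch (hedged): with --materialize-genai-fields enabled, token usage can be aggregated straight from the materialized columns instead of parsing SpanAttributes at query time. The SQL is shown as a Python string for use with any ClickHouse client; the table name otel.otel_traces is an assumption — substitute your actual traces table.

```python
# Aggregate token usage per model from the materialized GenAI columns.
TOKEN_USAGE_SQL = """
SELECT
    GenAIRequestModel,
    sum(GenAIInputTokens)  AS input_tokens,
    sum(GenAIOutputTokens) AS output_tokens
FROM otel.otel_traces
WHERE GenAIOperationName = 'chat'
GROUP BY GenAIRequestModel
ORDER BY input_tokens DESC
"""
```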
Files Changed
- .gitignore - Added Python cache patterns
- src/bin/clickhouse-ddl/ddl_traces.rs - Added GenAI materialized columns generation
- src/bin/clickhouse-ddl/main.rs - Added CLI flags
- src/bin/clickhouse-ddl/README.md - Updated documentation
- src/init/args.rs - Minor updates
- src/exporters/clickhouse/mod.rs - Minor updates
- rotel_python_processor_sdk/processors/genai_redaction_processor.py - New PII redaction processor
- rotel_python_processor_sdk/python_tests/genai_redaction_processor_test.py - Tests for redaction processor