Added gRPC interface definition for data cache #46
Conversation
beauremus
left a comment
Should we consider off-the-shelf solutions?
https://github.com/aklivity/zilla
https://github.com/mailgun/kafka-pixy
rpc CreateTopic(CreateTopicRequest) returns (CreateTopicResponse);

// Send a single message to a topic
rpc Produce(ProduceRequest) returns (ProduceResponse);
Our data logger process/service is likely a good example of why a streaming producer is a good idea.
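For illustration only, a client-streaming variant is one way to support that. This is just a sketch, not part of this PR; the ProduceStream name and the single summary-style response are assumptions:

// Hypothetical client-streaming variant: a long-running producer such as the
// data logger could push many messages over one open call instead of issuing
// a separate Produce call per message, and receive one summary response when
// the stream ends.
rpc ProduceStream(stream ProduceRequest) returns (ProduceResponse);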
rpc Produce(ProduceRequest) returns (ProduceResponse);

// Stream messages from a topic (server streaming)
rpc Consume(ConsumeRequest) returns (stream ConsumeResponse);
I'm not familiar with Kafka standard operations. Is there a way to request multiple topics? This seems to imply a socket connection per device, which I think is a single device at a single rate.
Most implementations of a Kafka consumer support one-consumer-many-topics setups.
Seems like the question is where we want the complexity to emerge. If each consumer is on one topic, we could have data from a single consumer be streamed to any requesting external client, and thereby limit the maximum number of active consumers. Allowing arbitrary combinations of topics means it gets harder to reuse consumers for different clients.
On the other hand, we have the concern you bring up, of a single client now needing many separate connections to listen on many topics, instead of a few connections.
But there's also the question of why we'd want data from many topics to be mangled into one stream. Usually a topic contains a specific kind of data. The data can come from many sources, but the idea is each topic is its own little pool of things that can be operated on or reasoned about in the same way. Allowing consumers to stream many topics back in one connection kinda breaks this pattern, making it harder to know what the data is that we're getting, and forcing external clients to implement a bunch of logic to disambiguate the data that comes in.
Normally, a consumer has the ability to consume messages from different topics. We can make an interface that can consume messages from different topics with a single gRPC call.
Is this the requirement?
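For illustration, a multi-topic request could be an extension of the existing ConsumeRequest (quoted just below). This is a sketch only; the repeated topics field and the response fields shown are assumptions, not what this PR defines:

message ConsumeRequest {
  // Hypothetical: one or more topics to subscribe to with a single call.
  repeated string topics = 1;
}

message ConsumeResponse {
  // Hypothetical: the source topic, so a client reading a combined stream
  // can tell which topic each message came from.
  string topic = 1;
  bytes payload = 2;
}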
message ConsumeRequest {
  string topic = 1;
  string group_id = 2;
What is group_id? It doesn't mirror the ProduceRequest.
This is a Kafka-ism. Each consumer of a topic can optionally be in a group. If it is in a group, Kafka will spread the group's consumers across the topic's partitions as evenly as possible. If there are fewer consumers in the group than partitions, some consumers will get multiple partitions. If there are more consumers than partitions, some consumers will not get any data at all. Probably better not to use groups unless we're sure we're OK with some clients only getting some of the messages from a topic.
EDIT: Wanted to clarify: if a consumer is not in a group, or if it is the only consumer in its group, it will get all the partitions of a topic (unless the consumer has been configured to listen on a specific subset of partitions, which is also a possibility). And you can have as many consumers listening to the topic as you want. They only get throttled when they're in the same group.
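If the field were to stay, one way to make that trade-off explicit in the .proto (a hypothetical sketch, not what the PR currently defines) would be to mark it optional and document the group semantics:

message ConsumeRequest {
  string topic = 1;
  // Hypothetical: Kafka consumer-group id. Leave it unset to receive every
  // message on the topic; set it to share the topic's partitions with other
  // consumers using the same group id (each message then goes to only one
  // member of the group).
  optional string group_id = 2;
}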
Regarding group_id, I should not have included it (thank you, Beau, nice catch).
This is more Kafka-specific, and it may not be present if we change from Kafka to something else.
Therefore, I should remove it.
rneswold
left a comment
I'm guilty of not putting enough comments in my .proto files. But a lot of the fields in these messages have very generic names: topic, key, name, group_id, etc. (maybe these are recognized in the Kafka world?). Some comments would help us understand what their purpose is.
For instance, in the response messages, you have a success field and a message field. Is the message field to describe an error if success is false? If so, what happens if success is true, but an error message is sent?
If those fields are paired like that, I'd get rid of them and make one optional field: errMessage. If the error message is missing, then it's success. If it's there, something went wrong. There's no way to have an invalid state.
If they're not tied together, then comments would be nice to understand why they're needed and what they indicate.
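For example, a response shaped along those lines might look like this. It is a sketch only; the ProduceResponse name comes from the excerpts earlier in the thread, and the err_message field is an assumption:

message ProduceResponse {
  // Hypothetical: set only when the produce failed; if unset, it succeeded.
  // This removes the invalid "success is true but an error message is
  // present" state that separate success/message fields would allow.
  optional string err_message = 1;
}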
I guess this is more for @beauremus, though maybe you also have an answer, Amol, but what's the gain from putting Kafka behind a gRPC service, again? I'd heard an argument that "it keeps all our internal services speaking gRPC", but is that all? For example, we have several services that know how to speak Postgres, and that seems to work ok. In fact, we had a whole meeting where it was explicitly determined that we would have many Postgres-fluent services instead of one Postgres-Service-To-Rule-Them-All. I see the move to put Kafka behind a service in a similar vein.
Out of the box, Kafka is designed to handle enterprise-scale messaging, at thousands or even millions of messages a second. It seems like a big waste to hide that capability behind another network hop, just so we can say we limited the number of things that need to know how to talk to Kafka. I'm not here to say there aren't good reasons out there, but I haven't heard anything concrete about what exactly is motivating this architectural decision, when we've had a similar discussion in the past about Postgres that went the other way.
As an alternative idea, what if we wrote up a library for talking to Postgres and another for talking to Kafka? Microservices that needed a connection to one or the other could just add the library as a dependency. It would keep each microservice in charge of its own data connections, rather than demanding that everyone go through a central service. But it still buys us the benefit of only writing the nitty-gritty implementation of Postgres/Kafka logic once. And we avoid accidentally creating a bottleneck in the control system that doesn't need to be there.
Again, I just wanted to throw this out there as a means of sparking discussion - not meant to be construed as a demand for going any particular direction. I appreciate anyone who takes the time to engage!
Very true. Our services should use gRPC. But we're not trying to gRPC-ize the APIs of products. All gRPC services that need Postgres can use it directly. All gRPC services that leverage Kafka should use it directly. The gRPC APIs are providing a control system service -- not an alternate, generic API for these products.
I think I agree with @rneswold and @jacob-curley-fnal. The benefit I see of our own gRPC layer wrapping Kafka is that it keeps us decoupled from Kafka as a technology choice. We can swap out Kafka for another message queue technology at a later date, should we choose, without needing to modify all of the downstream services. But TBH, I think the cost-benefit isn't there. This API would need to scale at the same level as Kafka - is that a reasonable expectation? How much effort will that take? Instead, given that Kafka is a mature and popular open-source product, we should just embrace its API.
And, as @jacob-curley-fnal suggests, I think a better way to mitigate the risk of abandoning Kafka in the future would be to use native adapters rather than a microservice.
But we're not decoupled if the gRPC API uses Kafka terms and data structures. Because if we move away from Kafka, then we're trying to make our Kafka-compatible gRPC API fit whatever new backend we choose.
Thank you all for expressing your opinion about it.
Thanks for your comments, Amol! I think there's a good amount to talk about, so we might benefit from getting folks in a room. I've been working on a graphQL -> Kafka endpoint as part of extapi-acsys, specifically for grabbing alarms, and I know you've been working on data caching for device readings. This raises the question of whether we're talking about one Kafka instance servicing all the control system's needs, or whether there will be several instances for different purposes. That might impact how we decide to talk to it, or it might not. Either way, I had a whole long thing written out, but it might just be more efficient if we have a meeting to be sure everyone's on the same page. Thoughts?
@jacob-curley-fnal I like the idea of having a meeting to discuss this too.
Definitely, I would love to join the application team meeting to discuss the interfaces.
Added gRPC interface for data cache (Kafka)