[core] ECS waits until DCS operation completes in case a detector fails #675

knopers8 · 2025-03-25T14:14:29Z

This changes how we consume reply streams of requests to DCS. In particular, we do not stop consuming events in case we receive a detector error and instead we wait to receive the status of all the other detectors.

As it has been clarified with the DCS team, Ecs2DcsGateway never sends "a pre-closure summary of the current operation" with detector == DCS, despite the documentation in the dcs.proto file. The latest version of this proto file in the Ecs2DcsGateway repo has the comment removed.

With this commit, we treat an EOF from DCS as the only correct stream termination and do not ever expect a detector == DCS event.

We also remove some reduntant operations, but more refactoring will have to come.

Closes OCTRL-711.

This changes how we consume reply streams of requests to DCS. In particular, we do not stop consuming events in case we receive a detector error and instead we wait to receive the status of all the other detectors. As it has been clarified with the DCS team, Ecs2DcsGateway never sends "a pre-closure summary of the current operation" with detector == DCS, despite the documentation in the dcs.proto file. The latest version of this proto file in the Ecs2DcsGateway repo has the comment removed. With this commit, we treat an EOF from DCS as the only correct stream termination and do not ever expect a detector == DCS event. We also remove some reduntant operations, but more refactoring will have to come. Closes OCTRL-711.

knopers8 · 2025-03-25T14:16:48Z

Tested manually with Michele's mock system (ECS2DCS2ECS) with the following cases:

SOR, many detectors RUN_OK
SOR, many detectors, DCS takes too much time to reply with one detector (DCS receives the request, but a detector times out)
SOR, many detectors, DCS takes too much time to reply with all detectors (DCS receives the request, but all detectors time out)
SOR, many detectors, DCS takes too much time to reply (DCS does not receive the request, e.g. networking issue. Tested by blocking the firewall)
SOR, many detectors, gRPC returns unknown error (tested by replacing err with debugger)
SOR, many detectors, one returns SOR_FAILURE (SOR_UNAVAILABLE and SOR_TIMEOUT are handled similarly, no need to test separately)

justonedev1

so if I understand correctly: you wait here until all of the detectors respond (the return -> continue changes) and that is mostly it? How is timeout handled here?

knopers8 · 2025-03-25T15:31:05Z

so if I understand correctly: you wait here until all of the detectors respond (the return -> continue changes) and that is mostly it? How is timeout handled here?

Yes, that's mostly it.

The timeout is handled as before, a context with timeout is used in event loops and all calls to DCS.

knopers8 requested a review from justonedev1 March 25, 2025 14:14

justonedev1 approved these changes Mar 25, 2025

View reviewed changes

justonedev1 merged commit cd537c2 into AliceO2Group:master Mar 25, 2025
2 checks passed

knopers8 deleted the dcs-wait-failed-sor branch March 25, 2025 15:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[core] ECS waits until DCS operation completes in case a detector fails #675

[core] ECS waits until DCS operation completes in case a detector fails #675

Uh oh!

knopers8 commented Mar 25, 2025

Uh oh!

knopers8 commented Mar 25, 2025

Uh oh!

justonedev1 left a comment

Uh oh!

knopers8 commented Mar 25, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

[core] ECS waits until DCS operation completes in case a detector fails #675

[core] ECS waits until DCS operation completes in case a detector fails #675

Uh oh!

Conversation

knopers8 commented Mar 25, 2025

Uh oh!

knopers8 commented Mar 25, 2025

Uh oh!

justonedev1 left a comment

Choose a reason for hiding this comment

Uh oh!

knopers8 commented Mar 25, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants