Skip to content

Conversation

@knopers8
Copy link
Collaborator

This changes how we consume reply streams of requests to DCS. In particular, we do not stop consuming events in case we receive a detector error and instead we wait to receive the status of all the other detectors.

As it has been clarified with the DCS team, Ecs2DcsGateway never sends "a pre-closure summary of the current operation" with detector == DCS, despite the documentation in the dcs.proto file. The latest version of this proto file in the Ecs2DcsGateway repo has the comment removed.

With this commit, we treat an EOF from DCS as the only correct stream termination and do not ever expect a detector == DCS event.

We also remove some reduntant operations, but more refactoring will have to come.

Closes OCTRL-711.

This changes how we consume reply streams of requests to DCS.
In particular, we do not stop consuming events in case we receive a detector error and instead we wait to receive the status of all the other detectors.

As it has been clarified with the DCS team, Ecs2DcsGateway never sends "a pre-closure summary of the current operation" with detector == DCS, despite the documentation in the dcs.proto file.
The latest version of this proto file in the Ecs2DcsGateway repo has the comment removed.

With this commit, we treat an EOF from DCS as the only correct stream termination and do not ever expect a detector == DCS event.

We also remove some reduntant operations, but more refactoring will have to come.

Closes OCTRL-711.
@knopers8 knopers8 requested a review from justonedev1 March 25, 2025 14:14
@knopers8
Copy link
Collaborator Author

Tested manually with Michele's mock system (ECS2DCS2ECS) with the following cases:

  • SOR, many detectors RUN_OK
  • SOR, many detectors, DCS takes too much time to reply with one detector (DCS receives the request, but a detector times out)
  • SOR, many detectors, DCS takes too much time to reply with all detectors (DCS receives the request, but all detectors time out)
  • SOR, many detectors, DCS takes too much time to reply (DCS does not receive the request, e.g. networking issue. Tested by blocking the firewall)
  • SOR, many detectors, gRPC returns unknown error (tested by replacing err with debugger)
  • SOR, many detectors, one returns SOR_FAILURE (SOR_UNAVAILABLE and SOR_TIMEOUT are handled similarly, no need to test separately)

Copy link
Collaborator

@justonedev1 justonedev1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if I understand correctly: you wait here until all of the detectors respond (the return -> continue changes) and that is mostly it? How is timeout handled here?

@knopers8
Copy link
Collaborator Author

so if I understand correctly: you wait here until all of the detectors respond (the return -> continue changes) and that is mostly it? How is timeout handled here?

Yes, that's mostly it.

The timeout is handled as before, a context with timeout is used in event loops and all calls to DCS.

@justonedev1 justonedev1 merged commit cd537c2 into AliceO2Group:master Mar 25, 2025
2 checks passed
@knopers8 knopers8 deleted the dcs-wait-failed-sor branch March 25, 2025 15:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants