@knopers8 knopers8 commented Jan 15, 2025

  • "task '/opt/o2/bin/o2-readout-exe' on alio2-cr1-mvs11 (id 2sBwA3Z8yWU, name alio2-cr1-hv-gw01.cern.ch:/opt/git/ControlWorkflows/tasks/readout@12b11ac4bb652e1835e3e94806a688c951691d5f#2sBwA3Z8yWU) failed with error..." displayed by COG becomes "task 'readout' on alio2-cr1-mvs11 (id 2sBwA3Z8yWU) failed with error...". I believe the removed information is not useful for typical operators. The details can always be found in the logs.
  • "MesosCommand MesosCommand_Transition timed out for task 2sBwYFZ82wn" becomes "MesosCommand_Transition timed out for task 2sBwYFZ82wn" because, by convention, all MesosCommands already have "MesosCommand" in their names.
  • If a transition request times out, a more accurate error is reported in COG: "nil response" becomes "CONFIGURE could not complete for critical tasks, errors: task 'readout' on alio2-cr1-mvs11 (id 2sBwA3Z8yWU) failed with error: MesosCommand_Transition timed out for task 2sBwA3Z8yWU".
  • If a task transition times out, any other errors that happened at the same time (e.g. a task crashed during the transition) are no longer omitted in COG.
  • "nil response" in task/manager.go becomes "no response from Mesos to CONFIGURE transition request within 120s timeout", but it will typically not be printed for tasks timing out during a transition.
  • "nil response" in commandqueue.go becomes "did not receive neither response nor error for MesosCommand_Transition"
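
To illustrate the shortened task description from the first item, here is a minimal sketch of how the long task class identifier could be reduced to the short name. It is not the code from this PR: `shortTaskName` and `describeTask` are hypothetical helpers, and the assumed identifier layout (`<host>:<path>/<name>@<revision>#<id>`) is inferred only from the example message above.

```go
package main

import (
	"fmt"
	"strings"
)

// shortTaskName is a hypothetical helper (not taken from the PR). It assumes
// the identifier looks like "<host>:<path>/<name>@<revision>#<id>" and
// returns only "<name>".
func shortTaskName(class string) string {
	name := class
	// Drop the "#<task id>" suffix, if present.
	if i := strings.LastIndex(name, "#"); i >= 0 {
		name = name[:i]
	}
	// Drop the "@<revision>" suffix, if present.
	if i := strings.LastIndex(name, "@"); i >= 0 {
		name = name[:i]
	}
	// Keep only the last path component.
	if i := strings.LastIndex(name, "/"); i >= 0 {
		name = name[i+1:]
	}
	return name
}

// describeTask builds the operator-facing description used in the new messages.
func describeTask(class, host, id string) string {
	return fmt.Sprintf("task '%s' on %s (id %s)", shortTaskName(class), host, id)
}

func main() {
	class := "alio2-cr1-hv-gw01.cern.ch:/opt/git/ControlWorkflows/tasks/readout" +
		"@12b11ac4bb652e1835e3e94806a688c951691d5f#2sBwA3Z8yWU"
	// Prints: task 'readout' on alio2-cr1-mvs11 (id 2sBwA3Z8yWU)
	fmt.Println(describeTask(class, "alio2-cr1-mvs11", "2sBwA3Z8yWU"))
}
```

The full identifier is dropped only from the operator-facing message; it can still be written to the logs.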

One could consider further simplification of some of these messages, but let's first see how this goes.

OCTRL-975

@knopers8 knopers8 requested review from justonedev1 and teo January 15, 2025 09:48
@knopers8 knopers8 merged commit 0a82858 into AliceO2Group:master Jan 16, 2025
2 checks passed
@knopers8 knopers8 deleted the nil-response branch January 16, 2025 12:49