
Conversation

@knopers8 (Collaborator)

The issue described in OCTRL-953 occurs when we schedule the kill of a task and, while waiting for an acknowledgment, we receive two Mesos updates:
```
[2024-11-11T12:14:02+01:00] TRACE scheduler:   task status update received detector=TRG message= srcHost=alio2-cr1-flp163 state=TASK_FAILED task=2qwA9EYEgnY
[2024-11-11T12:14:02+01:00] TRACE scheduler:   task status update received detector=TRG message=Reconciliation: Task is unknown to the agent srcHost=alio2-cr1-flp163 state=TASK_LOST task=2qwA9EYEgnY
```
Both of these then trigger the discussed ack.

Since it's unclear to me whether we can safely ignore TASK_LOST and trust that we will always receive either TASK_FAILED or TASK_FINISHED, I went for the approach of improving SafeAcks to produce an error when subsequent acks are sent, instead of blocking some goroutines.
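
A minimal sketch of that approach, with hypothetical names and signatures (the real code lives around `task/manager.go` and may differ in detail): the map entry for a task is removed under the lock before the channel send, so a second ack for the same task returns an error instead of blocking its sender forever.
```go
package safeacks

import (
	"fmt"
	"sync"
)

// ackChannels holds the channel used to deliver one acknowledgment.
type ackChannels struct {
	ack chan struct{}
}

// SafeAcks tracks pending kill acknowledgments, keyed by task ID.
type SafeAcks struct {
	mu   sync.RWMutex
	acks map[string]ackChannels
}

// TryAck delivers the acknowledgment for taskId. Because the map entry
// is deleted in the same critical section that looks it up, a duplicate
// ack (e.g. TASK_FAILED followed by TASK_LOST for the same task) finds
// no entry and fails fast instead of blocking a goroutine on a channel
// that nobody will ever read again.
func (a *SafeAcks) TryAck(taskId string) error {
	a.mu.Lock()
	channels, ok := a.acks[taskId]
	if ok {
		delete(a.acks, taskId)
	}
	a.mu.Unlock()
	if !ok {
		return fmt.Errorf("no acknowledgment expected for task %s", taskId)
	}
	channels.ack <- struct{}{}
	return nil
}
```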
knopers8 requested review from justonedev1 and teo on November 20, 2024, 10:50.
@knopers8 (Collaborator, Author)

Tested on staging, it seems ok.

@justonedev1 (Collaborator) left a comment:

It seems fine, I am just curious about the sync.Map.

```go
// At the moment we utilize SafeAcks to acknowledge that all the requested tasks were killed by mesos (task/manager.go).
type SafeAcks struct {
	mu   sync.RWMutex
	acks map[string]ackChannels
}
```
@justonedev1 (Collaborator)

It seems to me that you are using the mutex only to guard reads/writes into the map. Why not use sync.Map?

@knopers8 (Collaborator, Author)

I just wasn't aware it exists. However, I'm not sure I could achieve the expected behaviour in RegisterAck with sync.Map, as I need to perform a read and a write ideally within one mutex acquisition.
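
For context, the pattern being defended might look like the following sketch (continuing the hypothetical `SafeAcks` shown above; the method body is illustrative, not the exact implementation). The existence check and the insertion share one critical section, so two concurrent registrations for the same task cannot both succeed:
```go
// RegisterAck creates an ack channel for taskId, failing if one is
// already registered. Doing the check and the write under a single
// Lock/Unlock pair closes the race window that separate Load and
// Store calls on a sync.Map would leave open.
func (a *SafeAcks) RegisterAck(taskId string) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if _, ok := a.acks[taskId]; ok {
		return fmt.Errorf("an acknowledgment is already expected for task %s", taskId)
	}
	a.acks[taskId] = ackChannels{ack: make(chan struct{})}
	return nil
}
```
(For what it's worth, `sync.Map.LoadOrStore` can combine the two steps atomically, but the explicit mutex keeps the duplicate-registration error path direct.)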

@justonedev1 (Collaborator)

Fair enough, thanks

teo merged commit 3f4a6f7 into AliceO2Group:master on Nov 25, 2024 (2 checks passed).
knopers8 deleted the safe-safeacks branch on November 25, 2024, 10:43.