Description
An out-of-band migration can cause VMs to be stopped that should keep running.
ISSUE TYPE
- Bug Report
- Improvement Request
COMPONENT NAME
Orchestration, PowerReport
CLOUDSTACK VERSION
4.0-4.11.3 (not confirmed in newer versions, but the issue appears to be present there as well)
CONFIGURATION
Any configuration that can do live migrations. The ping interval can be important.
OS / ENVIRONMENT
SUMMARY
The issue in the title occurs when an out-of-band migration takes place, for instance one initiated by a third-party DRS system (VMware's built-in DRS being the culprit in this case).
The sequence of events is:
- the VM migrates
- its old host reports its VMs
- CloudStack marks the VM as stopped
- the new host reports its VMs
The logic is not specific to a hypervisor, so I think it needs a generic solution. An easy solution would be to add a "missing" state to the FSM for VMs, but that creates extra problems for HA-enabled VMs.
Addendum: after inspecting more code, the VM has to be missing for more than two ping cycles for this to become a problem. There is a long milliSecondsGracefullPeriod = mgmtServiceConf.getPingInterval() * 2000L;. However, the customer's logs reveal that the new host actually updates the power state a few milliseconds before the old one reports the VM as missing. This is a race condition.
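As a rough illustration of the grace-period arithmetic in the addendum, here is a minimal sketch. The class and method names are hypothetical and not CloudStack code, the ping interval is assumed to be expressed in seconds (which is what the factor 2000L suggests), and the 60-second value is an assumption based on the roughly one-minute spacing of the reports in the log. The elapsed time is the value printed in the log below.

```java
public class GracePeriodSketch {

    // Two ping cycles in milliseconds, mirroring getPingInterval() * 2000L.
    // Assumes the ping interval is given in seconds.
    static long gracePeriodMillis(int pingIntervalSeconds) {
        return pingIntervalSeconds * 2000L;
    }

    // True when the time since the last state update exceeds the grace period.
    static boolean pastGracePeriod(long millisSinceLastUpdate, int pingIntervalSeconds) {
        return millisSinceLastUpdate > gracePeriodMillis(pingIntervalSeconds);
    }

    public static void main(String[] args) {
        int assumedPingIntervalSeconds = 60;   // assumption, not taken from the report
        long elapsedFromLog = 34380045L;       // "time since last state update" in the log
        System.out.println("grace period (ms): " + gracePeriodMillis(assumedPingIntervalSeconds));
        System.out.println("past grace period: " + pastGracePeriod(elapsedFromLog, assumedPingIntervalSeconds));
    }
}
```

Under these assumptions the grace period is 120000 ms, far below the 34380045 ms in the log, so the grace period alone does not protect the VM in this case.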
Sanitised example:
2020-01-01 20:11:23,918 DEBUG [cloud.vm.VmWorkJobDispatcher] (Work-Job-Executor-26:ctx-fedcba98 job-2345678/job-2345679) Done with run of VM work job: com.cloud.vm.VmWorkStart for VM 1234, job origin: 2345678
2020-01-01 20:11:41,035 DEBUG [cloud.vm.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-111:ctx-ffffffff) VM state report. host: 555, vm id: 1234, power state: PowerOn
2020-01-01 20:11:41,039 DEBUG [cloud.vm.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-111:ctx-ffffffff) VM state report is updated. host: 555, vm id: 1234, power state: PowerOn
2020-01-01 20:11:41,049 DEBUG [cloud.vm.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-222:ctx-eeeeeeee) Detected missing VM. host: 666, vm id: 1234, power state: PowerReportMissing, last state update: 1562822531000
2020-01-01 20:11:41,049 DEBUG [cloud.vm.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-222:ctx-eeeeeeee) vm id: 1234 - time since last state update(34380045ms) has passed graceful period
2020-01-01 20:11:41,054 DEBUG [cloud.vm.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-222:ctx-eeeeeeee) VM state report is updated. host: 666, vm id: 1234, power state: PowerReportMissing
2020-01-01 20:12:40,773 DEBUG [cloud.vm.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-333:ctx-dddddddd) VM state report. host: 555, vm id: 1234, power state: PowerOn
2020-01-01 20:12:40,777 DEBUG [cloud.vm.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-333:ctx-dddddddd) VM state report is updated. host: 555, vm id: 1234, power state: PowerOn
2020-01-01 20:13:40,859 DEBUG [cloud.vm.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-444:ctx-cccccccc) VM state report. host: 555, vm id: 1234, power state: PowerOn
2020-01-01 20:13:40,873 DEBUG [cloud.vm.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-444:ctx-cccccccc) VM state report is updated. host: 555, vm id: 1234, power state: PowerOn
2020-01-01 20:14:40,786 DEBUG [cloud.vm.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-55:ctx-bbbbbbbb) VM state report. host: 555, vm id: 1234, power state: PowerOn
2020-01-01 20:14:40,791 DEBUG [cloud.vm.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-55:ctx-bbbbbbbb) VM state report is updated. host: 555, vm id: 1234, power state: PowerOn
2020-01-01 20:15:40,781 DEBUG [cloud.vm.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-66:ctx-aaaaaaaa) VM state report. host: 555, vm id: 1234, power state: PowerOn
STEPS TO REPRODUCE
The issue is timing-dependent and therefore hard to reproduce reliably. The sequence of events is shown in the summary above.
EXPECTED RESULTS
The VM keeps running and reappears in CloudStack.
ACTUAL RESULTS
The VM gets marked as stopped and an unnecessary StopCommand is sent to the hypervisor. In theory this can be either the old or the new hypervisor host, depending on timing.