Created attachment 1090154 [details]
logs

Description of problem:
After blocking the connection to storage on the host running the engine VM, the VM dropped to a paused state and another engine VM started on the second host. So when I restore the connection to storage from the first host, I have one host with the engine VM up and one host with the engine VM paused.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-1.3.2.1-1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy hosted engine on two hosts
2. Block the connection to the storage domain from the host with the engine VM
3. Wait until the engine VM starts on the second host
4. Restore the connection to storage from the first host

Actual results:

--== Host 1 status ==--

Status up-to-date       : True
Hostname                : cyan-vdsf.qa.lab.tlv.redhat.com
Host ID                 : 1
Engine status           : {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "paused"}
Score                   : 3105
stopped                 : False
Local maintenance       : False
crc32                   : 4be1a2c7
Host timestamp          : 171518

--== Host 2 status ==--

Status up-to-date       : True
Hostname                : rose05.qa.lab.tlv.redhat.com
Host ID                 : 2
Engine status           : {"health": "good", "vm": "up", "detail": "up"}
Score                   : 3400
stopped                 : False
Local maintenance       : False
crc32                   : 5d1fea8a
Host timestamp          : 2075

Expected results:
I expect that when the connection is restored and the agent sees that the VM is up on the second host, it will power off the paused VM.

Additional info:
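For reference, a minimal detection sketch, not part of ovirt-hosted-engine-ha: it shells out to hosted-engine --vm-status and parses the plain-text output in the format shown under "Actual results" above to find hosts whose engine VM is paused. The helper name is made up, and the parsing assumes the JSON-style "Engine status" line shown in this report.

    # Hypothetical helper; assumes the plain-text --vm-status layout above.
    import json
    import subprocess


    def paused_engine_hosts():
        out = subprocess.run(
            ["hosted-engine", "--vm-status"],
            capture_output=True, text=True, check=True,
        ).stdout
        hosts = []
        hostname = None
        for line in out.splitlines():
            if line.startswith("Hostname"):
                hostname = line.split(":", 1)[1].strip()
            elif line.startswith("Engine status"):
                # The status value is the JSON object after the first colon.
                status = json.loads(line.split(":", 1)[1].strip())
                if status.get("detail") == "paused":
                    hosts.append(hostname)
        return hosts


    if __name__ == "__main__":
        print("Hosts with a paused engine VM:", paused_engine_hosts())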
This bug is not marked for z-stream, yet the milestone is for a z-stream version, therefore the milestone has been reset. Please set the correct milestone or add the z-stream flag.
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Unfortunately we can't clean a VM in Paused state, because it might be an incoming migration too. We might be able to improve that once the vdsm events are exposed and we can listen for them.
This is degrading one HA host from the cluster. As long as we have that VM, we can't use the host to run the engine VM. We must clear it.

msivak - regarding comment 11, we have a status reason on the VM, so we can know whether the VM is an incoming migration; by the way, an incoming migration is in status migrationDst or something similar.
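To make that concrete, a rough sketch of the guard this would allow (the names, structure and decision rule are mine, based only on the comments in this bug, not the agent's actual code; the status strings are assumptions):

    # Illustrative only; not ovirt-hosted-engine-ha code.
    INCOMING_MIGRATION_STATUSES = {"Migration Destination"}  # "migrationDst or similar"


    def should_clean_local_engine_vm(local_vm_status, engine_up_elsewhere):
        """Decide whether the HA agent may destroy the stale local engine VM.

        local_vm_status: VDSM status string of the local engine VM
            (e.g. "Paused", "Down", "Migration Destination").
        engine_up_elsewhere: True when the shared metadata reports the engine
            VM healthy ({"health": "good", "vm": "up"}) on another host.
        """
        if local_vm_status in INCOMING_MIGRATION_STATUSES:
            # Never touch an incoming migration.
            return False
        # A leftover Paused or Down VM can be cleared once the engine VM is
        # confirmed up on another host.
        return engine_up_elsewhere and local_vm_status in ("Paused", "Down")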
- How did we end up in DOWN state in 3.5? BTW, a Down VM could also be a problem if it's not cleaned.
- After Sanlock killed the resource we should have cleaned it. A Down VM won't interfere with restarting this VM, but a paused VM probably will.
Hosted engine never cleaned up any VM. It was always handled by the engine. We might change that, but we have to be very careful as the engine relies on being the only one who cleans VMs (VDSM removes the record once the result is collected).
Artyom, will that prevent us from restarting the VM on host 1? If yes, it should be urgent; if not, this is low.
It will prevent the HE VM from starting on host 1, but only once; after the VM start fails, the host will have score 0 and no VM will exist on it (the score is 0 due to the unexpected VM shutdown).
This means that for the next 10 minutes the unclean host's score will be 0, and after that it will start again from 3400. Lowering priority; we should still handle this cleanup quickly, as there is no reason for the HA cluster to be degraded during that 10-minute period.
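For clarity, a tiny illustration of the score behaviour described above (the 3400 base score and the roughly 10-minute penalty window come from this thread; the function and constant names are made up and this is not the agent's scoring code):

    import time

    BASE_SCORE = 3400                       # full score mentioned above
    UNEXPECTED_SHUTDOWN_PENALTY_SECS = 600  # the ~10-minute window mentioned above


    def effective_score(last_unexpected_shutdown_ts, now=None):
        """Score of the unclean host: 0 inside the penalty window, 3400 after."""
        now = time.time() if now is None else now
        if (last_unexpected_shutdown_ts is not None
                and now - last_unexpected_shutdown_ts < UNEXPECTED_SHUTDOWN_PENALTY_SECS):
            return 0  # the host cannot be chosen to run the engine VM
        return BASE_SCORE


    now = time.time()
    assert effective_score(now - 300, now) == 0           # 5 minutes in: still penalized
    assert effective_score(now - 900, now) == BASE_SCORE  # window expired: back to 3400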
Bug tickets must have version flags set prior to targeting them to a release. Please ask the maintainer to set the correct version flags, and only then set the target milestone.
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has already been released and the bug is not ON_QA.
oVirt 4.0 beta has been released, moving to RC milestone.
This bug was reported against 3.6 in November 2015 and has had no updates since November 2017. I'm closing this bug as wontfix. If you think this bug needs attention, please reopen it, providing a fresh reproducer on the latest released version.