Bug 1278481

Summary: After a problem with the connection to a storage domain, one of the hosts has a paused engine VM
Product: [oVirt] ovirt-hosted-engine-ha
Reporter: Artyom <alukiano>
Component: Agent
Assignee: Sandro Bonazzola <sbonazzo>
Status: CLOSED WONTFIX
QA Contact: Artyom <alukiano>
Severity: high
Docs Contact:
Priority: low
Version: 1.3.1
CC: alukiano, bugs, dfediuck, mavital, mgoldboi, msivak, ylavi
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-24 10:01:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments: logs (flags: none)

Description Artyom 2015-11-05 15:09:25 UTC
Created attachment 1090154 [details]
logs

Description of problem:
After blocking the connection to storage on the host running the engine VM, the VM dropped to paused state and another engine VM started on the second host. When I then restored the connection to storage from the first host, I ended up with one host running the engine VM up and one host with the engine VM paused.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-1.3.2.1-1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy hosted engine on two hosts
2. Block the connection to the storage domain from the host running the engine VM
3. Wait until the engine VM starts on the second host
4. Restore the connection to storage from the first host (a scripted sketch of steps 2-4 follows below)
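
A minimal, hypothetical sketch of steps 2-4, assuming the storage domain is reachable at a single address, root access on the first host, and that dropping outbound traffic with iptables is an acceptable way to simulate the outage; the address and the wait are placeholders, not the exact procedure used here:

#!/usr/bin/env python
# Hypothetical reproducer sketch (not the reporter's exact procedure).
# Blocks and later restores traffic to the storage domain on the host that
# currently runs the engine VM, then checks the HA status.
import subprocess
import time

STORAGE_ADDR = "10.0.0.10"  # placeholder: address of the storage domain server

def run(*cmd):
    return subprocess.check_output(cmd).decode()

# Step 2: block the connection to the storage domain on this host.
run("iptables", "-I", "OUTPUT", "-d", STORAGE_ADDR, "-j", "DROP")

# Step 3: wait for the failover. While storage is blocked here, the HA
# metadata is unreachable, so `hosted-engine --vm-status` is best run on
# the second host to see the engine VM come up there.
time.sleep(600)  # crude stand-in for polling the second host

# Step 4: restore the connection and check the status of both hosts.
run("iptables", "-D", "OUTPUT", "-d", STORAGE_ADDR, "-j", "DROP")
print(run("hosted-engine", "--vm-status"))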

Actual results:
--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : cyan-vdsf.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "paused"}
Score                              : 3105
stopped                            : False
Local maintenance                  : False
crc32                              : 4be1a2c7
Host timestamp                     : 171518


--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : rose05.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 5d1fea8a
Host timestamp                     : 2075


Expected results:
I expect that when the connection is restored and the agent sees that the VM is up on the second host, it will power off the paused VM.

Additional info:

Comment 9 Red Hat Bugzilla Rules Engine 2015-11-27 05:40:03 UTC
This bug is not marked for z-stream, yet the milestone is for a z-stream version, therefore the milestone has been reset.
Please set the correct milestone or add the z-stream flag.

Comment 10 Red Hat Bugzilla Rules Engine 2015-11-27 05:40:03 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 11 Martin Sivák 2015-12-09 13:18:27 UTC
Unfortunately we can't clean a VM in Paused state, because it might be an incoming migration too. We might be able to improve that once the vdsm events are exposed and we can listen for them.

Comment 12 Roy Golan 2016-03-02 15:07:04 UTC
This is degrading 1 HA host from the cluster. As long as we have that VM we can't use the host to run the engine VM. We must clear it.

msivak - regarding comment 11, we have a status reason on the VM, so we can know if the VM is an incoming migration; by the way, an incoming migration is in status migrationDst or something similar.
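
A rough sketch of the check described above, assuming the vdsm-client JSON-RPC CLI is available to the agent and that VDSM reports the VM statuses "Paused" and "Migration Destination"; the helper names and the hard-coded VM UUID are hypothetical, not agent code:

# Hypothetical sketch (not the agent's code): clean up the local HE VM only
# if it is really paused, not an incoming migration. Uses the vdsm-client
# CLI; the VM UUID below is a placeholder.
import json
import subprocess

HE_VM_ID = "00000000-0000-0000-0000-000000000000"  # placeholder UUID

def vm_status(vm_id):
    out = subprocess.check_output(
        ["vdsm-client", "VM", "getStats", "vmID=%s" % vm_id])
    data = json.loads(out)
    stats = data[0] if isinstance(data, list) else data
    return stats.get("status")

def cleanup_if_stale_paused(vm_id):
    status = vm_status(vm_id)
    if status == "Migration Destination":
        return False  # incoming migration: never touch it
    if status == "Paused":
        # The engine VM is already up on the other host, so this copy is stale.
        subprocess.check_call(
            ["vdsm-client", "VM", "destroy", "vmID=%s" % vm_id])
        return True
    return False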

Comment 14 Roy Golan 2016-03-16 11:34:48 UTC
- How did we end up in DOWN state in 3.5? BTW, a Down VM could also be a problem if it's not cleaned.
- After Sanlock killed the resource we should have cleaned it. A Down VM won't interfere with restarting this VM, but a paused VM probably will.

Comment 15 Martin Sivák 2016-03-16 12:37:52 UTC
Hosted engine never cleaned up any VM. It was always handled by the engine. We might change that, but we have to be very careful as the engine relies on being the only one who cleans VMs (VDSM removes the record once the result is collected).

Comment 16 Roy Golan 2016-04-13 10:13:24 UTC
Artyom, will that prevent us from restarting the VM on host 1? If yes, it should be urgent; if not, this is low.

Comment 17 Artyom 2016-04-13 15:58:02 UTC
It will prevent the HE VM from starting on host 1, but only once: after the VM start fails, the host will have a score of 0 and no VM will exist on it (the score is 0 due to the unexpected VM shutdown).

Comment 18 Roy Golan 2016-04-17 07:47:43 UTC
This means that for the next 10 minutes the un-clean host's score will be 0, and after that it will start again from 3400. Lowering priority. We should still handle this cleanup quickly; there is no reason the HA cluster should be degraded during that 10-minute period.
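
An illustrative model of the behaviour described above, not the agent's actual scoring code; the 10-minute window and the 3400 base score come from this comment and comment 17, everything else (names, structure) is an assumption:

# Illustrative model only: an unexpected VM shutdown zeroes the host score,
# and the penalty expires after roughly 10 minutes, returning the host to
# the base score.
import time

BASE_SCORE = 3400          # score reported by a healthy HA host
PENALTY_TIMEOUT = 10 * 60  # seconds the "unexpected shutdown" penalty lasts

class HostScore(object):
    def __init__(self):
        self.bad_shutdown_at = None

    def record_unexpected_shutdown(self):
        self.bad_shutdown_at = time.time()

    def score(self):
        if (self.bad_shutdown_at is not None
                and time.time() - self.bad_shutdown_at < PENALTY_TIMEOUT):
            return 0       # host cannot take the engine VM during this window
        return BASE_SCORE  # penalty expired, host is eligible again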

Comment 19 Red Hat Bugzilla Rules Engine 2016-04-17 07:47:48 UTC
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

Comment 20 Sandro Bonazzola 2016-05-02 09:55:58 UTC
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has already been released and the bug is not ON_QA.

Comment 21 Yaniv Lavi 2016-05-23 13:17:26 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 22 Yaniv Lavi 2016-05-23 13:24:00 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 23 Sandro Bonazzola 2018-09-24 10:01:51 UTC
This bug was reported against 3.6 in November 2015 and has had no updates since November 2017. I'm closing this bug as WONTFIX.
If you think this bug needs attention, please reopen it and provide a fresh reproducer on the latest released version.