Bug 1278481 - After problem with connection to storage domain one of hosts have paused vm
Status: ASSIGNED
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Agent
Version: 1.3.1
Hardware: x86_64 Linux
Priority: low  Severity: high
Target Milestone: ovirt-4.3.0
Target Release: ---
Assigned To: Phillip Bailey
QA Contact: Artyom
Depends On:
Blocks:
Reported: 2015-11-05 10:09 EST by Artyom
Modified: 2017-11-27 09:04 EST (History)
CC List: 7 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt‑4.3+
mgoldboi: planning_ack+
rgolan: devel_ack+
mavital: testing_ack+


Attachments
logs (4.98 MB, application/x-gzip)
2015-11-05 10:09 EST, Artyom

Description Artyom 2015-11-05 10:09:25 EST
Created attachment 1090154 [details]
logs

Description of problem:
After blocking the connection to storage on the host running the engine VM, the VM dropped to the Paused state and another engine VM started on the second host. So when I restored the connection to storage from the first host, I had one host with the engine VM up and one host with the engine VM paused.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-1.3.2.1-1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy hosted engine on two hosts
2. Block the connection to the storage domain from the host running the engine VM
3. Wait until the engine VM starts on the second host
4. Restore the connection to storage from the first host

Actual results:
--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : cyan-vdsf.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "paused"}
Score                              : 3105
stopped                            : False
Local maintenance                  : False
crc32                              : 4be1a2c7
Host timestamp                     : 171518


--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : rose05.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 5d1fea8a
Host timestamp                     : 2075


Expected results:
I expect that when the connection is restored and the agent sees that the VM is up on the second host, it will power off the paused VM.
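The expected behavior can be sketched as a simple decision over the host statuses shown in "Actual results" above. This is a minimal illustration of the check the reporter asks for, not agent code: the function name and dict-based status representation are assumptions, though the status fields (`vm`, `detail`, `health`) match the agent's own output.

```python
# Hypothetical sketch of the cleanup decision requested in this report.
# All names are invented; only the status fields mirror the agent output.

def should_power_off_local_vm(local_engine_status, other_hosts):
    """True when the local engine VM is paused while another host
    already runs a healthy engine VM."""
    local_paused = (
        local_engine_status.get("vm") == "up"
        and local_engine_status.get("detail") == "paused"
    )
    other_up = any(
        h.get("health") == "good" and h.get("vm") == "up"
        for h in other_hosts
    )
    return local_paused and other_up

# The statuses from "Actual results" above:
host1 = {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "paused"}
host2 = {"health": "good", "vm": "up", "detail": "up"}
print(should_power_off_local_vm(host1, [host2]))  # → True
```

With host 2 healthy and host 1 paused, the sketch says the paused VM should be cleaned; if no other host had a healthy engine VM, it would return False and the paused VM would be left alone.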

Additional info:
Comment 9 Red Hat Bugzilla Rules Engine 2015-11-27 00:40:03 EST
This bug is not marked for z-stream, yet the milestone is for a z-stream version, therefore the milestone has been reset.
Please set the correct milestone or add the z-stream flag.
Comment 10 Red Hat Bugzilla Rules Engine 2015-11-27 00:40:03 EST
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Comment 11 Martin Sivák 2015-12-09 08:18:27 EST
Unfortunately we can't clean a VM in Paused state, because it might be an incoming migration too. We might be able to improve that once the vdsm events are exposed and we can listen for them.
Comment 12 Roy Golan 2016-03-02 10:07:04 EST
This is degrading one HA host in the cluster. As long as we have that VM we can't use the host to run the engine VM. We must clear it.

msivak - regarding comment 11, we have a status reason on the VM, so we can know whether the VM is an incoming migration; besides, an incoming migration is in status migrationDst or something similar.
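The distinction comment 12 proposes can be sketched as a status check before cleanup. The exact vdsm status string is an assumption here ("Migration Destination" stands in for the "migrationDst or something similar" mentioned above); the point is only that an incoming migration is distinguishable from a storage-error pause.

```python
# Sketch of the safety check from comment 12. The status strings are
# assumptions, not the verified vdsm vocabulary.

INCOMING_MIGRATION_STATUSES = {"Migration Destination"}

def is_safe_to_clean(vm_status):
    """A paused VM may be cleaned only when it is not an incoming
    migration, which can also look like a not-yet-running VM."""
    return vm_status not in INCOMING_MIGRATION_STATUSES

print(is_safe_to_clean("Paused"))                 # → True
print(is_safe_to_clean("Migration Destination"))  # → False
```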
Comment 14 Roy Golan 2016-03-16 07:34:48 EDT
- How did we end up in the DOWN state in 3.5? BTW, a Down VM could also be a problem if it's not cleaned.
- After Sanlock killed the resource we should have cleaned it. A Down VM won't interfere with restarting this VM, but a paused VM probably will.
Comment 15 Martin Sivák 2016-03-16 08:37:52 EDT
Hosted engine never cleaned up any VM. It was always handled by the engine. We might change that, but we have to be very careful as the engine relies on being the only one who cleans VMs (VDSM removes the record once the result is collected).
Comment 16 Roy Golan 2016-04-13 06:13:24 EDT
Artyom, will that prevent us from restarting the VM on host 1? If yes, it should be urgent; if not, this is low.
Comment 17 Artyom 2016-04-13 11:58:02 EDT
It will prevent the HE VM from starting on host 1, but only once. After the VM start fails, the host will have score 0 and no VM will exist on it (the score is 0 due to the unexpected VM shutdown).
Comment 18 Roy Golan 2016-04-17 03:47:43 EDT
This means that for the next 10 minutes the unclean host's score will be 0, and after that it will start from 3400. Lowering priority; we should handle this cleanup quickly, as there is no reason the HA cluster should be degraded during that 10-minute period.
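The score behavior described in comments 17-18 can be modeled roughly as a penalty window: after an unexpected VM shutdown the host scores 0 until the window elapses, then returns to the base score. The 10-minute window and base score 3400 come from the comments above; the function itself is an illustration, not the agent's actual scoring code.

```python
# Rough model of the penalty window from comments 17-18. Times are in
# seconds; this is an illustration, not ovirt-hosted-engine-ha code.

BASE_SCORE = 3400
PENALTY_SECONDS = 10 * 60  # the 10-minute window from comment 18

def host_score(now, shutdown_time):
    """Score 0 while inside the unexpected-shutdown penalty window,
    the base score otherwise."""
    if shutdown_time is not None and now - shutdown_time < PENALTY_SECONDS:
        return 0
    return BASE_SCORE

print(host_score(now=100, shutdown_time=0))  # → 0 (inside the window)
print(host_score(now=700, shutdown_time=0))  # → 3400 (window elapsed)
```

This makes the cost concrete: for the whole window the host cannot host the engine VM, which is exactly the degraded-cluster period the comment wants to avoid by cleaning up promptly.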
Comment 19 Red Hat Bugzilla Rules Engine 2016-04-17 03:47:48 EDT
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.
Comment 20 Sandro Bonazzola 2016-05-02 05:55:58 EDT
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has been already released and bug is not ON_QA.
Comment 21 Yaniv Lavi 2016-05-23 09:17:26 EDT
oVirt 4.0 beta has been released, moving to RC milestone.
Comment 22 Yaniv Lavi 2016-05-23 09:24:00 EDT
oVirt 4.0 beta has been released, moving to RC milestone.
