Created attachment 1090154 [details]
logs

Description of problem:
After blocking the connection to storage on the host running the engine VM, the VM dropped to a paused state and another engine VM started on the second host. So when I restore the connection to storage from the first host, I have one host with the engine VM up and one host with the engine VM paused.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-1.3.2.1-1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy hosted engine on two hosts
2. Block the connection to the storage domain from the host with the engine VM
3. Wait until the engine VM starts on the second host
4. Restore the connection to storage from the first host

Actual results:

--== Host 1 status ==--

Status up-to-date       : True
Hostname                : cyan-vdsf.qa.lab.tlv.redhat.com
Host ID                 : 1
Engine status           : {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "paused"}
Score                   : 3105
stopped                 : False
Local maintenance       : False
crc32                   : 4be1a2c7
Host timestamp          : 171518

--== Host 2 status ==--

Status up-to-date       : True
Hostname                : rose05.qa.lab.tlv.redhat.com
Host ID                 : 2
Engine status           : {"health": "good", "vm": "up", "detail": "up"}
Score                   : 3400
stopped                 : False
Local maintenance       : False
crc32                   : 5d1fea8a
Host timestamp          : 2075

Expected results:
I expect that when the connection is restored and the agent sees that the VM is up on the second host, it will power off the paused VM.

Additional info:
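For reference, a minimal detection sketch, not part of ovirt-hosted-engine-ha: it shells out to hosted-engine --vm-status and parses the plain-text output in the format shown under "Actual results" above to find hosts whose engine VM is paused. The helper name is made up, and the parsing assumes the JSON-style "Engine status" line shown in this report.

    # Hypothetical helper; assumes the plain-text --vm-status layout above.
    import json
    import subprocess


    def paused_engine_hosts():
        out = subprocess.run(
            ["hosted-engine", "--vm-status"],
            capture_output=True, text=True, check=True,
        ).stdout
        hosts = []
        hostname = None
        for line in out.splitlines():
            if line.startswith("Hostname"):
                hostname = line.split(":", 1)[1].strip()
            elif line.startswith("Engine status"):
                # The status value is the JSON object after the first colon.
                status = json.loads(line.split(":", 1)[1].strip())
                if status.get("detail") == "paused":
                    hosts.append(hostname)
        return hosts


    if __name__ == "__main__":
        print("Hosts with a paused engine VM:", paused_engine_hosts())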
This bug is not marked for z-stream, yet the milestone is for a z-stream version, therefore the milestone has been reset. Please set the correct milestone or add the z-stream flag.
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Unfortunately we can't clean a VM in Paused state, because it might be an incoming migration too. We might be able to improve that once the vdsm events are exposed and we can listen for them.
This is degrading one HA host from the cluster. As long as we have that VM, we can't use the host to run the engine VM. We must clear it.

msivak - regarding comment 11, we have a status reason on the VM, so we can know whether the VM is an incoming migration; by the way, an incoming migration is in status migrationDst or something similar.
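To make that concrete, a rough sketch of the guard this would allow (the names, structure and decision rule are mine, based only on the comments in this bug, not the agent's actual code; the status strings are assumptions):

    # Illustrative only; not ovirt-hosted-engine-ha code.
    INCOMING_MIGRATION_STATUSES = {"Migration Destination"}  # "migrationDst or similar"


    def should_clean_local_engine_vm(local_vm_status, engine_up_elsewhere):
        """Decide whether the HA agent may destroy the stale local engine VM.

        local_vm_status: VDSM status string of the local engine VM
            (e.g. "Paused", "Down", "Migration Destination").
        engine_up_elsewhere: True when the shared metadata reports the engine
            VM healthy ({"health": "good", "vm": "up"}) on another host.
        """
        if local_vm_status in INCOMING_MIGRATION_STATUSES:
            # Never touch an incoming migration.
            return False
        # A leftover Paused or Down VM can be cleared once the engine VM is
        # confirmed up on another host.
        return engine_up_elsewhere and local_vm_status in ("Paused", "Down")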
- How did we end up in DOWN state in 3.5? BTW, a Down VM could also be a problem if it's not cleaned.
- After Sanlock killed the resource we should have cleaned it. A Down VM won't interfere with restarting this VM, but a paused VM probably will.
Hosted engine never cleaned up any VM. It was always handled by the engine. We might change that, but we have to be very careful as the engine relies on being the only one who cleans VMs (VDSM removes the record once the result is collected).
Artyom, will that prevent us from restarting the VM on host 1? If yes, it should be urgent; if not, this is low.
It will prevent the HE VM from starting on host 1, but only once; after the VM start fails, the host will have score 0 and no VM will exist on it (the score is 0 due to the unexpected VM shutdown).
This means that for the next 10 minutes the unclean host's score will be 0, and after that it will start again from 3400. Lowering priority; we should still handle this cleanup quickly, as there is no reason for the HA cluster to be degraded during that 10-minute period.
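For clarity, a tiny illustration of the score behaviour described above (the 3400 base score and the roughly 10-minute penalty window come from this thread; the function and constant names are made up and this is not the agent's scoring code):

    import time

    BASE_SCORE = 3400                       # full score mentioned above
    UNEXPECTED_SHUTDOWN_PENALTY_SECS = 600  # the ~10-minute window mentioned above


    def effective_score(last_unexpected_shutdown_ts, now=None):
        """Score of the unclean host: 0 inside the penalty window, 3400 after."""
        now = time.time() if now is None else now
        if (last_unexpected_shutdown_ts is not None
                and now - last_unexpected_shutdown_ts < UNEXPECTED_SHUTDOWN_PENALTY_SECS):
            return 0  # the host cannot be chosen to run the engine VM
        return BASE_SCORE


    now = time.time()
    assert effective_score(now - 300, now) == 0           # 5 minutes in: still penalized
    assert effective_score(now - 900, now) == BASE_SCORE  # window expired: back to 3400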
Bug tickets must have version flags set prior to targeting them to a release. Please ask the maintainer to set the correct version flags, and only then set the target milestone.
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has already been released and the bug is not ON_QA.
oVirt 4.0 beta has been released, moving to RC milestone.
This bug was reported against 3.6 in November 2015 and has had no updates since November 2017. I'm closing this bug as wontfix. If you think this bug needs attention, please reopen it, providing a fresh reproducer on the latest released version.