1593568 – Unexpected behaviour of HA VM when host VM was running ended up Non-responsive.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	vdsm
Sub Component:
Version:	4.2.3
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	ovirt-4.3.0
Target Release:	4.3.0
Assignee:	Milan Zamazal
QA Contact:	Polina
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1594793

Reported:	2018-06-21 06:14 UTC by Ribu Tho
Modified:	2021-09-09 15:03 UTC (History)
CC List:	11 users (show)
Fixed In Version:	v4.30.3
Doc Type:	Bug Fix
Doc Text:	Previously, if a CD-ROM was ejected from a virtual machine and VDSM was fenced or restarted, the virtual machine became unresponsive and/or the Manager reported its status as "Unknown." In the current release, a virtual machine with an ejected CD-ROM recovers after restarting VDSM.
Clone Of:
Clones:	1594793 (view as bug list)
Environment:
Last Closed:	2019-05-08 12:36:02 UTC
oVirt Team:	Virt
Target Upstream Version:
Embargoed:

Attachments
logs: engine, vdsm, libvirt, qemu (2.11 MB, application/x-gzip) 2018-08-09 13:57 UTC, Polina

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	RHV-43533	None	None	None	2021-09-09 14:43:28 UTC
Red Hat Knowledge Base (Solution)	3496561	None	None	None	2018-06-25 03:58:27 UTC
Red Hat Product Errata	RHBA-2019:1077	None	None	None	2019-05-08 12:36:24 UTC
oVirt gerrit	92445	'None'	MERGED	virt: Don't fail on missing <source> in a CD-ROM device	2021-02-08 12:12:32 UTC
oVirt gerrit	92481	'None'	MERGED	virt: Don't fail on missing <source> in a CD-ROM device	2021-02-08 12:12:32 UTC
oVirt gerrit	92493	'None'	MERGED	virt: Don't fail on missing <source> in a CD-ROM device	2021-02-08 12:12:32 UTC
oVirt gerrit	92566	'None'	MERGED	virt: Ignore device initialization errors in recovery	2021-02-08 12:12:32 UTC

Description Ribu Tho 2018-06-21 06:14:49 UTC

Description of problem:


When the host running an HA VM goes non-responsive, the HA VM is force-started on another host. The original host then returns to the Up state, after which the VM ends up running on two different hosts at the same time, leading to filesystem corruption.

Version-Release number of selected component (if applicable):

ovirt-engine-4.2.3.8-0.1.el7.noarch
vdsm-4.20.27.2-1.el7ev.x86_64

How reproducible:

Steps to Reproduce:
1. Start the VM on host H1.
2. Host H1 goes non-responsive.
3. The VM is started on host H2.
4. Host H1 returns to the Up state.
5. The VM ends up running on two different hosts.
6. Engine reports the VM switching between Paused and Up from time to time.


Actual results:
The VM ended up with filesystem corruption on the boot drive. 

Expected results:
The VM never runs on two different hosts at the same time.

Additional info:

The issue was encountered with VMs in HA mode.

Comment 5 Michal Skrivanek 2018-06-22 06:28:44 UTC

please take a look at rh04 log 2018-06-18 11:57:13,281+1000
Seems the incoming migration VM disk has no "source" attribute at all in _srcDomXML, though there is one in regular params with empty file=''

Comment 6 Michal Skrivanek 2018-06-22 07:15:36 UTC

rh01:
09:45:03 VDSM restart
apparently not recovered correctly (failed due to same issue as in comment #5)
09:48:40 shutdown, but VM possibly not correctly undefined
09:49:11 start fails with VM machine already exists
09:49:21 again a destroy attempt
10:55:05,220 hmm? - "INFO  (periodic/1) [vds] Recovered new external domain"
11:49:16 VDSM restart
11:49:18 VM recovered and detected as "Changed state to Down: VM terminated with error (code=1)"
11:52:28 start VM succeeds
11:52:43 VM guest reboot

rh04:
09:49:22 VM start succeeded
10:33:27 VM guest reboot
10:37:15 VM guest reboot
11:44:57 shutdown, succeeded
13:10:59 VM start
13:40:25 shutdown

engine.log:
starts a day later at 2018-06-19 03:37:01
Please attach a correct log.

Comment 7 Michal Skrivanek 2018-06-22 07:19:27 UTC

please also attach an earlier log from rh01 capturing the VM start prior to 2018-06-18 07:01:01, same for engine.log

Comment 8 Michal Skrivanek 2018-06-22 07:44:04 UTC

(In reply to Michal Skrivanek from comment #5)
> please take a look at rh04 log 2018-06-18 11:57:13,281+1000
> Seems the incoming migration VM disk has no "source" attribute at all in
> _srcDomXML, though there is one in regular params with empty file=''

@mzamazal: the one on rh01 at 09:45:03 is the first occurrence, perhaps look there instead.
Worth noting: there were a lot of snapshot manipulations in the previous days (live storage merge).

Comment 9 Milan Zamazal 2018-06-22 09:58:25 UTC

Recovery fails due to a missing `file' attribute. Because recovery fails, the VM startup domain wrapper is never replaced with a running VM wrapper, so most libvirt operations are rejected even though the VM is running.

I can't investigate why the `file' attribute is missing until Vdsm logs covering the time since the VM start are provided.

Comment 10 Milan Zamazal 2018-06-22 12:35:10 UTC

Actually, the whole <source> element of the CD-ROM drive is missing. This happens after CD-ROM ejection and is not handled in Vdsm. I'm looking for a fix.
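
For illustration only (this is not the merged Vdsm change): a minimal Python sketch of reading a CD-ROM's media path from libvirt disk XML while tolerating an absent <source> element. The XML snippets, the ISO path and the cdrom_path helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical disk XML for a CD-ROM whose media was ejected:
# there is no <source> element at all.
EJECTED_CDROM_XML = """
<disk type='file' device='cdrom'>
  <driver name='qemu' type='raw'/>
  <target dev='hdc' bus='ide'/>
  <readonly/>
</disk>
"""

# Hypothetical disk XML for the same drive with media inserted.
LOADED_CDROM_XML = """
<disk type='file' device='cdrom'>
  <driver name='qemu' type='raw'/>
  <source file='/path/to/image.iso'/>
  <target dev='hdc' bus='ide'/>
  <readonly/>
</disk>
"""


def cdrom_path(disk_xml):
    """Return the media path of a disk, or '' if no media is present."""
    disk = ET.fromstring(disk_xml)
    source = disk.find('source')
    if source is None:
        # Ejected CD-ROM: treat a missing <source> as an empty drive
        # instead of raising an error.
        return ''
    return source.get('file', '')
```

Under these assumptions, cdrom_path(EJECTED_CDROM_XML) gives an empty string rather than an exception, which is the behaviour the gerrit change titled "virt: Don't fail on missing <source> in a CD-ROM device" points at.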

Comment 12 Milan Zamazal 2018-06-25 10:50:48 UTC

(In reply to Ribu Tho from comment #11)
> Please can you specify the root cause to this issue

The problem occurs when a CD previously inserted in a VM (either on VM startup or while the VM is running) is ejected and Vdsm is then fenced/restarted. In that case the running VM can't be fully recovered and can no longer be managed from Engine, which leads to Engine confusion and possibly split brain. The only workaround I know of once the situation occurs is to shut down the VM from within the guest OS.
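
As a rough sketch of the idea named by the linked gerrit change "virt: Ignore device initialization errors in recovery" (the patch itself is not reproduced here), recovery could skip a device that fails to initialize instead of aborting the whole VM recovery. Every name below (DeviceInitError, make_device, init_devices, the device specs) is hypothetical and not taken from Vdsm.

```python
import logging

log = logging.getLogger("recovery_sketch")


class DeviceInitError(Exception):
    """Hypothetical error: a device could not be built from its spec."""


def make_device(spec):
    """Hypothetical device factory that insists on a 'source' key."""
    if "source" not in spec:
        raise DeviceInitError("missing source in %r" % (spec,))
    return dict(spec)


def init_devices(specs, recovering):
    """Build devices; on recovery, skip broken ones instead of failing."""
    devices = []
    for spec in specs:
        try:
            devices.append(make_device(spec))
        except DeviceInitError:
            if not recovering:
                # Normal VM start: a broken device is still a hard error.
                raise
            log.warning("Ignoring device %r during recovery", spec)
    return devices


specs = [
    {"name": "vda", "source": "/path/to/disk.img"},
    {"name": "hdc"},  # ejected CD-ROM: no source
]
recovered = init_devices(specs, recovering=True)
```

The design choice in this sketch is that a VM which is already running should stay manageable even if one of its devices cannot be reconstructed, while a fresh start still fails loudly.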

Comment 16 Polina 2018-08-09 13:56:16 UTC

Hi,

I've tried to reproduce the scenario on the downstream version rhvm-4.2.6-0.1.el7ev.noarch according to the above description, before getting the fixed version. Could you please look at the scenario and advise how to reproduce the situation to verify the bug?
My steps:
1. Run the HA VM with a lease (CD attached) (associated with an iSCSI SD). Please see the logs; the scenario starts at line 2425, 2018-08-09 16:17:57,971+03, in engine.log.
2. Cause the host to become Non Responsive (by blocking access from Engine to the host with iptables).

Actual result: the VM is not migrated to another host; it stays in Unknown status.

Comment 17 Polina 2018-08-09 13:57:18 UTC

Created attachment 1474695 [details]
logs: engine, vdsm, libvirt, qemu

Comment 18 Milan Zamazal 2018-08-10 19:17:07 UTC

Hi Polina, the actual result is the correct behaviour and no split brain occurred. However, for proper verification Vdsm must be restarted while the host is having problems, e.g. by Engine soft-fencing, which didn't happen in your scenario. Perhaps Ribu Tho can clarify exactly how the originally reported scenario was triggered.

Comment 19 Polina 2018-08-12 07:13:42 UTC

Hi Ribu Tho, could you please clarify the scenario? What caused the host problem, so that I can reproduce it? Thank you.

Comment 20 Michal Skrivanek 2018-08-15 09:25:00 UTC

(In reply to Polina from comment #16)
> Hi,
> 
> I've tried to reproduce the scenario on the downstream version
> rhvm-4.2.6-0.1.el7ev.noarch according the above description before getting
> the fixed version.

as you can see in bug 1594793 it was verified by Israel in 4.2.4.5-0.1.
Why do you expect it is not fixed in 4.2.6?

> Could you please look at the scenario and advice how to
> reproduce the situation to verify the bug?
> My steps:
> 1. Run the HA VM with lease (CD attached) (associated with iscsi SD). Please
> see the logs . the scenario starts at the line 2425 2018-08-09
> 16:17:57,971+03 in the engine.log:
> 2. Cause the host to be Non Responsive  (by blocking access by iptables from
> Engine to the Host).

Why do you induce an unresponsive host? Please read the root cause in comment #12.

> Actual result: the VM is not migrated to other host. stayed in Unknown
> status.

That's not related to this bug at all

Comment 21 Polina 2018-08-16 09:13:07 UTC

Verified on upstream version:
ovirt-release-master-4.3.0-0.1.master.20180815000055.gitdd598f0.el7.noarch
vdsm-4.30.0-527.gitcec1054.el7.x86_64
Steps for verification are from https://bugzilla.redhat.com/show_bug.cgi?id=1594793#c16

Comment 25 errata-xmlrpc 2019-05-08 12:36:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1077

Comment 26 Daniel Gur 2019-08-28 13:12:17 UTC

sync2jira
