+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1593568 +++
======================================================================

Description of problem:
An issue with HA VMs where the host goes non-responsive and the HA VM is force-started on another host. When the original host then resumes to the Up state, the VM ends up running on two different hosts at the same time, leading to filesystem corruption.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.3.8-0.1.el7.noarch
vdsm-4.20.27.2-1.el7ev.x86_64

How reproducible:

Steps to Reproduce:
1. Start VM on host H1
2. Host H1 goes non-responsive
3. VM starts on host H2
4. Host H1 resumes to Up state
5. VM ends up existing on two different hosts at the same time (see the check sketched below)
6. VM is switched between Paused/Up from time to time by the Engine server

Actual results:
The VM ended up with filesystem corruption on the boot drive.

Expected results:
The VM should never run on two different hosts at the same time.

Additional info:
The issue was seen with VMs in HA mode.

(Originally by Ribu Abraham)
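The double-run condition in steps 4-5 can be confirmed from outside the Engine with a read-only libvirt check on both hosts. A minimal sketch, assuming the libvirt Python bindings are installed; the host URIs and the VM name are illustrative placeholders, not values taken from this report:

import libvirt

# Hypothetical read-only URIs for the two hosts and a hypothetical VM name.
HOSTS = ["qemu+ssh://host1/system", "qemu+ssh://host2/system"]
VM_NAME = "ha-vm"

def hosts_running(vm_name, host_uris):
    """Return the URIs of the hosts on which vm_name is currently active."""
    running_on = []
    for uri in host_uris:
        conn = libvirt.openReadOnly(uri)
        try:
            for dom in conn.listAllDomains():
                if dom.name() == vm_name and dom.isActive():
                    running_on.append(uri)
        finally:
            conn.close()
    return running_on

active = hosts_running(VM_NAME, HOSTS)
if len(active) > 1:
    print("WARNING: %s is active on more than one host: %s"
          % (VM_NAME, ", ".join(active)))

If the list contains more than one host, the split-brain state described above has occurred.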
please take a look at rh04 log, 2018-06-18 11:57:13,281+1000
Seems the incoming migration VM disk has no "source" attribute at all in _srcDomXML, though there is one in the regular params with an empty file=''

(Originally by michal.skrivanek)
rh01:
09:45:03 VDSM restart, apparently not recovered correctly (failed due to same issue as in comment #5)
09:48:40 shutdown, but VM possibly not correctly undefined
09:49:11 start fails with "VM machine already exists"
09:49:21 again a destroy attempt
10:55:05,220 hmm? - INFO (periodic/1) [vds] "Recovered new external domain"
11:49:16 VDSM restart
11:49:18 VM recovered and detected as "Changed state to Down: VM terminated with error (code=1)"
11:52:28 start VM succeeds
11:52:43 VM guest reboot

rh04:
09:49:22 VM start succeeded
10:33:27 VM guest reboot
10:37:15 VM guest reboot
11:44:57 shutdown, succeeded
13:10:59 VM start
13:40:25 shutdown

engine.log starts a day later at 2018-06-19 03:37:01. Please attach a correct log.

(Originally by michal.skrivanek)
please also attach an earlier log from rh01 capturing the VM start prior to 2018-06-18 07:01:01, same for engine.log

(Originally by michal.skrivanek)
(In reply to Michal Skrivanek from comment #5)
> please take a look at rh04 log 2018-06-18 11:57:13,281+1000
> Seems the incoming migration VM disk has no "source" attribute at all in
> _srcDomXML, though there is one in regular params with empty file=''

@mzamazal: the one on rh01 at 09:45:03 is the first occurrence, perhaps rather look there.
Worth noting there were a lot of snapshot manipulations in the previous days (live storage merge).

(Originally by michal.skrivanek)
Recovery fails due to the missing `file' attribute. The failed recovery means the VM startup domain wrapper is never replaced with a running VM wrapper, so most libvirt operations are rejected even though the VM is actually running. I can't investigate why the `file' attribute is missing until Vdsm logs covering the VM start are provided.

(Originally by Milan Zamazal)
Actually, the <source> element of the CD-ROM drive is missing. This happens after CD-ROM ejection and is not handled in Vdsm. I'm looking for a fix.

(Originally by Milan Zamazal)
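To illustrate the condition described above, here is a minimal standalone sketch (not the actual Vdsm patch), using only the Python standard library: after the media is ejected, the <disk device='cdrom'> element in the domain XML can carry no <source> child at all, so any code reading the ISO path has to treat <source> as optional. The domain XML below is an invented example of such a drive.

import xml.etree.ElementTree as ET

# Illustrative domain XML fragment: after ejection the CD-ROM disk has
# no <source> child at all.
DOM_XML_AFTER_EJECT = """
<domain type='kvm'>
  <devices>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
    </disk>
  </devices>
</domain>
"""

def cdrom_paths(dom_xml):
    """Return the ISO path of each CD-ROM drive, or None for an empty drive."""
    root = ET.fromstring(dom_xml)
    paths = []
    for disk in root.findall("./devices/disk[@device='cdrom']"):
        source = disk.find("source")
        # <source> may be absent after ejection; do not assume it exists.
        paths.append(source.get("file") if source is not None else None)
    return paths

print(cdrom_paths(DOM_XML_AFTER_EJECT))  # prints [None]

Code that unconditionally does disk.find("source").get("file") on such XML raises an AttributeError, which matches the kind of failure seen during recovery here.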
Verify with:

Engine:
Software Version: 4.2.4.5-0.1 (rhv-release-4.2.4-7-001.noarch)

Hosts:
OS Version: RHEL - 7.5 - 8.el7
Kernel Version: 3.10.0 - 862.6.3.el7.x86_64
KVM Version: 2.10.0 - 21.el7_5.4
LIBVIRT Version: libvirt-3.9.0-14.el7_5.6
VDSM Version: vdsm-4.20.32-1.el7ev

Steps:
1. Create 2 HA VMs and attach a CD to each VM:
   VM_1 with a lease and resume behavior "KILL"
   VM_2 without a lease and resume behavior "KILL"
   Both VMs run with an iSCSI disk. NFS was not tested because of https://bugzilla.redhat.com/show_bug.cgi?id=1481022
2. Start the VMs on Host_1 and eject the CD
3. Block the connection to the iSCSI storage with iptables on Host_1
4. Both VMs switch to Paused
5. The VMs start on Host_2
6. Check on Host_1 that no VM is running (a Python equivalent is sketched below):
   # virsh -r list --all
    Id    Name                           State
   ----------------------------------------------------

Results: PASS
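For reference, the check in step 6 can also be done through the libvirt Python bindings instead of virsh; a small sketch, assuming a local read-only connection with the usual qemu:///system URI:

import libvirt

# Read-only connection to the local hypervisor, same as `virsh -r`.
conn = libvirt.openReadOnly("qemu:///system")
try:
    running = [dom.name() for dom in conn.listAllDomains() if dom.isActive()]
finally:
    conn.close()

if running:
    print("FAIL: VMs still running on this host: %s" % ", ".join(running))
else:
    print("PASS: no running VMs on this host")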
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2118
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days