Bug 1593568

Summary: Unexpected behaviour of an HA VM when the host it was running on ended up Non-responsive.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Ribu Tho <rabraham>
Component: vdsm
Assignee: Milan Zamazal <mzamazal>
Status: CLOSED ERRATA
QA Contact: Polina <pagranat>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 4.2.3
CC: dfediuck, lsurette, mavital, mgoldboi, michal.skrivanek, mzamazal, rabraham, Rhev-m-bugs, sgoodman, srevivo, ycui
Target Milestone: ovirt-4.3.0
Keywords: ZStream
Target Release: 4.3.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: v4.30.3
Doc Type: Bug Fix
Doc Text:
Previously, if a CD-ROM was ejected from a virtual machine and VDSM was fenced or restarted, the virtual machine became unresponsive and/or the Manager reported its status as "Unknown." In the current release, a virtual machine with an ejected CD-ROM recovers after restarting VDSM.
Story Points: ---
Clone Of:
: 1594793 (view as bug list)
Environment:
Last Closed: 2019-05-08 12:36:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1594793    
Attachments:
logs: engine, vdsm, libvirt, qemu (flags: none)

Description Ribu Tho 2018-06-21 06:14:49 UTC
Description of problem:


An issue with HA VMs: the host goes non-responsive and the HA VM is force-started on another host. The original host then resumes to Up state, after which the VM ends up running on two different hosts at the same time, leading to filesystem corruption.

Version-Release number of selected component (if applicable):

ovirt-engine-4.2.3.8-0.1.el7.noarch
vdsm-4.20.27.2-1.el7ev.x86_64
How reproducible:


Steps to Reproduce:
1. Start VM on host H1 

2. Host H1 goes non-responsive
 
3. VM starts on host H2

4. Host H1 resumes to Up State

5. VM ends up running on two different hosts at the same time.

6. The Engine switches the VM between Paused and Up states from time to time.


Actual results:
The VM ended up with filesystem corruption on the boot drive. 

Expected results:
The VM should never run on two different hosts at the same time.

Additional info:

The issue was encountered with VMs in HA mode.

Comment 5 Michal Skrivanek 2018-06-22 06:28:44 UTC
please take a look at rh04 log 2018-06-18 11:57:13,281+1000
It seems the incoming migration VM disk has no "source" attribute at all in _srcDomXML, though there is one in the regular params with an empty file=''
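
For illustration, a CD-ROM drive in libvirt domain XML normally carries a <source> element with a file attribute; after ejection that element can be absent entirely. The sketch below is not Vdsm code -- the path is hypothetical and plain ElementTree stands in for the real parsing -- it only shows why a naive lookup of the file attribute breaks once <source> disappears:

import xml.etree.ElementTree as ET

# CD-ROM with an ISO attached (the path is hypothetical)
with_source = """\
<disk type='file' device='cdrom'>
  <target dev='hdc' bus='ide'/>
  <source file='/var/lib/libvirt/images/boot.iso'/>
</disk>
"""

# The same drive after the CD has been ejected: no <source> element at all
after_eject = """\
<disk type='file' device='cdrom'>
  <target dev='hdc' bus='ide'/>
</disk>
"""

for xml_text in (with_source, after_eject):
    disk = ET.fromstring(xml_text)
    source = disk.find('source')
    # After ejection find() returns None, so source.get('file') would raise
    # AttributeError -- the class of failure hit during VM recovery.
    print(source.get('file') if source is not None else '<no source element>')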

Comment 6 Michal Skrivanek 2018-06-22 07:15:36 UTC
rh01:
09:45:03 VDSM restart
apparently not recovered correctly (failed due to same issue as in comment #5)
09:48:40 shutdown, but VM possibly not correctly undefined
start on 09:49:11 fails with VM machine already exists
09:49:21 again a destroy attempt
10:55:05,220 hmm? - INFO  (periodic/1) [vds] "Recovered new external domain"
11:49:16 VDSM restart
11:49:18 VM recovered and detected as "Changed state to Down: VM terminated with error (code=1)"
11:52:28 start VM succeeds
11:52:43 VM guest reboot

rh04:
09:49:22 VM start succeeded
10:33:27 VM guest reboot
10:37:15 VM guest reboot
11:44:57 shutdown, succeeded
13:10:59 VM start
13:40:25 shutdown

engine.log
starts a day later at 2018-06-19 03:37:01
Please attach a correct log.

Comment 7 Michal Skrivanek 2018-06-22 07:19:27 UTC
please also attach an earlier log from rh01 capturing the VM start prior to 2018-06-18 07:01:01, same for engine.log

Comment 8 Michal Skrivanek 2018-06-22 07:44:04 UTC
(In reply to Michal Skrivanek from comment #5)
> please take a look at rh04 log 2018-06-18 11:57:13,281+1000
> Seems the incoming migration VM disk has no "source" attribute at all in
> _srcDomXML, though there is one in regular params with empty file=''

@mzamazal: the one on rh01 at 09:45:03 is the first occurrence, perhaps rather look there
worth noting there were a lot of snapshot manipulations in previous days (live storage merge)

Comment 9 Milan Zamazal 2018-06-22 09:58:25 UTC
Recovery fails due to the missing `file' attribute. The failed recovery means the VM startup domain wrapper is never replaced with a running VM wrapper, so most libvirt operations are rejected even though the VM is running.

I can't investigate why the `file' attribute is missing until Vdsm logs covering the VM start are provided.

Comment 10 Milan Zamazal 2018-06-22 12:35:10 UTC
Actually, the <source> element of the CD-ROM drive is missing entirely. This happens after CD-ROM ejection and is not handled in Vdsm. I'm looking for a fix.
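
A minimal sketch of the kind of defensive handling needed -- this is not the actual Vdsm patch, and the helper name cdrom_path is made up for illustration -- is to treat a CD-ROM drive whose <source> element is missing as an empty drive instead of letting recovery fail:

import xml.etree.ElementTree as ET

def cdrom_path(disk_xml):
    """Return the ISO path of a CD-ROM <disk> element, or None if ejected."""
    disk = ET.fromstring(disk_xml)
    source = disk.find('source')
    if source is None:                 # CD ejected: element missing entirely
        return None
    return source.get('file') or None  # attribute may also be present but empty

# An ejected drive now yields None instead of raising AttributeError during recovery
print(cdrom_path("<disk type='file' device='cdrom'><target dev='hdc' bus='ide'/></disk>"))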

Comment 12 Milan Zamazal 2018-06-25 10:50:48 UTC
(In reply to Ribu Tho from comment #11)
> Please can you specify the root cause to this issue

The problem occurs when a CD previously inserted into a VM (either at VM startup or while the VM is running) is ejected and Vdsm is fenced/restarted. The running VM can't be fully recovered in such a case; it can no longer be handled from the Engine, which leads to Engine confusion and possibly a split brain. The only workaround I know of once the situation occurs is to shut down the VM from the guest OS.

Comment 16 Polina 2018-08-09 13:56:16 UTC
Hi,

I've tried to reproduce the scenario on the downstream version rhvm-4.2.6-0.1.el7ev.noarch according to the above description, before getting the fixed version. Could you please look at the scenario and advise how to reproduce the situation to verify the bug?
My steps:
1. Run the HA VM with a lease (CD attached, associated with an iSCSI SD). Please see the logs; the scenario starts at line 2425, 2018-08-09 16:17:57,971+03, in engine.log.
2. Cause the host to become Non Responsive (by blocking access from the Engine to the host with iptables).

Actual result: the VM is not migrated to another host and stays in Unknown status.

Comment 17 Polina 2018-08-09 13:57:18 UTC
Created attachment 1474695 [details]
logs: engine, vdsm, libvirt, qemu

Comment 18 Milan Zamazal 2018-08-10 19:17:07 UTC
Hi Polina, the actual result is correct and no split brain occurred. However, for proper verification Vdsm must be restarted during the host problems, e.g. by Engine soft-fencing, which didn't happen in your scenario. Perhaps Ribu Tho can clarify how exactly the originally reported scenario was triggered.

Comment 19 Polina 2018-08-12 07:13:42 UTC
Hi Ribu Tho, could you please clarify the scenario? What caused the host problem, so that I can reproduce it? Thank you.

Comment 20 Michal Skrivanek 2018-08-15 09:25:00 UTC
(In reply to Polina from comment #16)
> Hi,
> 
> I've tried to reproduce the scenario on the downstream version
> rhvm-4.2.6-0.1.el7ev.noarch according the above description before getting
> the fixed version.

As you can see in bug 1594793, it was verified by Israel in 4.2.4.5-0.1.
Why do you expect it is not fixed in 4.2.6?

> Could you please look at the scenario and advice how to
> reproduce the situation to verify the bug?
> My steps:
> 1. Run the HA VM with lease (CD attached) (associated with iscsi SD). Please
> see the logs . the scenario starts at the line 2425 2018-08-09
> 16:17:57,971+03 in the engine.log:
> 2. Cause the host to be Non Responsive  (by blocking access by iptables from
> Engine to the Host).

Why do you induce an unresponsive host? Please read the root-cause steps in comment #12.

> Actual result: the VM is not migrated to other host. stayed in Unknown
> status.

That's not related to this bug at all.

Comment 21 Polina 2018-08-16 09:13:07 UTC
Verified on upstream version:
ovirt-release-master-4.3.0-0.1.master.20180815000055.gitdd598f0.el7.noarch
vdsm-4.30.0-527.gitcec1054.el7.x86_64
Steps for verification are from
https://bugzilla.redhat.com/show_bug.cgi?id=1594793#c16

Comment 25 errata-xmlrpc 2019-05-08 12:36:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1077

Comment 26 Daniel Gur 2019-08-28 13:12:17 UTC
sync2jira

Comment 27 Daniel Gur 2019-08-28 13:16:30 UTC
sync2jira