+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1593568 +++
======================================================================

Description of problem:
An issue with HA VMs where the host goes non-responsive and the HA VM is force-started on another host. When the original host then resumes to the Up state, the VM ends up running on two different hosts at the same time, leading to filesystem corruption.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.3.8-0.1.el7.noarch
vdsm-4.20.27.2-1.el7ev.x86_64

How reproducible:

Steps to Reproduce:
1. Start VM on host H1
2. Host H1 goes non-responsive
3. VM starts on host H2
4. Host H1 resumes to Up state
5. VM ends up existing on two different hosts at the same time (see the check sketched below)
6. VM is switched between Paused/Up from time to time by the Engine server

Actual results:
The VM ended up with filesystem corruption on the boot drive.

Expected results:
The VM should never run on two different hosts at the same time.

Additional info:
The issue was seen with VMs in HA mode.

(Originally by Ribu Abraham)
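The double-run condition in steps 4-5 can be confirmed from outside the Engine with a read-only libvirt check on both hosts. A minimal sketch, assuming the libvirt Python bindings are installed; the host URIs and the VM name are illustrative placeholders, not values taken from this report:

import libvirt

# Hypothetical read-only URIs for the two hosts and a hypothetical VM name.
HOSTS = ["qemu+ssh://host1/system", "qemu+ssh://host2/system"]
VM_NAME = "ha-vm"

def hosts_running(vm_name, host_uris):
    """Return the URIs of the hosts on which vm_name is currently active."""
    running_on = []
    for uri in host_uris:
        conn = libvirt.openReadOnly(uri)
        try:
            for dom in conn.listAllDomains():
                if dom.name() == vm_name and dom.isActive():
                    running_on.append(uri)
        finally:
            conn.close()
    return running_on

active = hosts_running(VM_NAME, HOSTS)
if len(active) > 1:
    print("WARNING: %s is active on more than one host: %s"
          % (VM_NAME, ", ".join(active)))

If the list contains more than one host, the split-brain state described above has occurred.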
please take a look at rh04 log, 2018-06-18 11:57:13,281+1000
Seems the incoming migration VM disk has no "source" attribute at all in _srcDomXML, though there is one in the regular params with an empty file=''

(Originally by michal.skrivanek)
rh01:
09:45:03 VDSM restart, apparently not recovered correctly (failed due to same issue as in comment #5)
09:48:40 shutdown, but VM possibly not correctly undefined
09:49:11 start fails with "VM machine already exists"
09:49:21 again a destroy attempt
10:55:05,220 hmm? - INFO (periodic/1) [vds] "Recovered new external domain"
11:49:16 VDSM restart
11:49:18 VM recovered and detected as "Changed state to Down: VM terminated with error (code=1)"
11:52:28 start VM succeeds
11:52:43 VM guest reboot

rh04:
09:49:22 VM start succeeded
10:33:27 VM guest reboot
10:37:15 VM guest reboot
11:44:57 shutdown, succeeded
13:10:59 VM start
13:40:25 shutdown

engine.log starts a day later at 2018-06-19 03:37:01. Please attach a correct log.

(Originally by michal.skrivanek)
please also attach an earlier log from rh01 capturing the VM start prior to 2018-06-18 07:01:01, same for engine.log

(Originally by michal.skrivanek)
(In reply to Michal Skrivanek from comment #5)
> please take a look at rh04 log 2018-06-18 11:57:13,281+1000
> Seems the incoming migration VM disk has no "source" attribute at all in
> _srcDomXML, though there is one in regular params with empty file=''

@mzamazal: the one on rh01 at 09:45:03 is the first occurrence, perhaps rather look there.
Worth noting there were a lot of snapshot manipulations in the previous days (live storage merge).

(Originally by michal.skrivanek)
Recovery fails due to the missing `file' attribute. The failed recovery means the VM startup domain wrapper is never replaced with a running VM wrapper, so most libvirt operations are rejected even though the VM is actually running. I can't investigate why the `file' attribute is missing until Vdsm logs covering the VM start are provided.

(Originally by Milan Zamazal)
Actually, the <source> element of the CD-ROM drive is missing. This happens after CD-ROM ejection and is not handled in Vdsm. I'm looking for a fix.

(Originally by Milan Zamazal)
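To illustrate the condition described above, here is a minimal standalone sketch (not the actual Vdsm patch), using only the Python standard library: after the media is ejected, the <disk device='cdrom'> element in the domain XML can carry no <source> child at all, so any code reading the ISO path has to treat <source> as optional. The domain XML below is an invented example of such a drive.

import xml.etree.ElementTree as ET

# Illustrative domain XML fragment: after ejection the CD-ROM disk has
# no <source> child at all.
DOM_XML_AFTER_EJECT = """
<domain type='kvm'>
  <devices>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
    </disk>
  </devices>
</domain>
"""

def cdrom_paths(dom_xml):
    """Return the ISO path of each CD-ROM drive, or None for an empty drive."""
    root = ET.fromstring(dom_xml)
    paths = []
    for disk in root.findall("./devices/disk[@device='cdrom']"):
        source = disk.find("source")
        # <source> may be absent after ejection; do not assume it exists.
        paths.append(source.get("file") if source is not None else None)
    return paths

print(cdrom_paths(DOM_XML_AFTER_EJECT))  # prints [None]

Code that unconditionally does disk.find("source").get("file") on such XML raises an AttributeError, which matches the kind of failure seen during recovery here.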
Verify with:

Engine:
Software Version: 4.2.4.5-0.1 (rhv-release-4.2.4-7-001.noarch)

Hosts:
OS Version: RHEL - 7.5 - 8.el7
Kernel Version: 3.10.0 - 862.6.3.el7.x86_64
KVM Version: 2.10.0 - 21.el7_5.4
LIBVIRT Version: libvirt-3.9.0-14.el7_5.6
VDSM Version: vdsm-4.20.32-1.el7ev

Steps:
1. Create 2 HA VMs and attach a CD to each VM:
   VM_1 with a lease and resume behavior "KILL"
   VM_2 without a lease and resume behavior "KILL"
   Both VMs run with an iSCSI disk. NFS was not tested because of https://bugzilla.redhat.com/show_bug.cgi?id=1481022
2. Start the VMs on Host_1 and eject the CD
3. Block the connection to the iSCSI storage with iptables on Host_1
4. Both VMs switch to Paused
5. The VMs start on Host_2
6. Check on Host_1 that no VM is running (a Python equivalent is sketched below):
   # virsh -r list --all
    Id    Name                           State
   ----------------------------------------------------

Results: PASS
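For reference, the check in step 6 can also be done through the libvirt Python bindings instead of virsh; a small sketch, assuming a local read-only connection with the usual qemu:///system URI:

import libvirt

# Read-only connection to the local hypervisor, same as `virsh -r`.
conn = libvirt.openReadOnly("qemu:///system")
try:
    running = [dom.name() for dom in conn.listAllDomains() if dom.isActive()]
finally:
    conn.close()

if running:
    print("FAIL: VMs still running on this host: %s" % ", ".join(running))
else:
    print("PASS: no running VMs on this host")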
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2118
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days