Bug 1593568
Summary: | Unexpected behaviour of HA VM when host VM was running ended up Non-responsive. | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Ribu Tho <rabraham> | ||||
Component: | vdsm | Assignee: | Milan Zamazal <mzamazal> | ||||
Status: | CLOSED ERRATA | QA Contact: | Polina <pagranat> | ||||
Severity: | urgent | Docs Contact: | |||||
Priority: | urgent | ||||||
Version: | 4.2.3 | CC: | dfediuck, lsurette, mavital, mgoldboi, michal.skrivanek, mzamazal, rabraham, Rhev-m-bugs, sgoodman, srevivo, ycui | ||||
Target Milestone: | ovirt-4.3.0 | Keywords: | ZStream | ||||
Target Release: | 4.3.0 | ||||||
Hardware: | x86_64 | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | v4.30.3 | Doc Type: | Bug Fix | ||||
Doc Text: |
Previously, if a CD-ROM was ejected from a virtual machine and VDSM was fenced or restarted, the virtual machine became unresponsive and/or the Manager reported its status as "Unknown." In the current release, a virtual machine with an ejected CD-ROM recovers after restarting VDSM.
|
Story Points: | --- | ||||
Clone Of: | |||||||
: | 1594793 (view as bug list) | Environment: | |||||
Last Closed: | 2019-05-08 12:36:02 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 1594793 | ||||||
Attachments: |
|
Description
Ribu Tho
2018-06-21 06:14:49 UTC
please take a look at rh04 log 2018-06-18 11:57:13,281+1000
Seems the incoming migration VM disk has no "source" attribute at all in _srcDomXML, though there is one in regular params with empty file=''

rh01:
09:45:03 VDSM restart apparently not recovered correctly (failed due to same issue as in comment #5)
09:48:40 shutdown, but VM possibly not correctly undefined
start on 09:49:11 fails with VM machine already exists
09:49:21 again a destroy attempt
10:55:05,220 hmm? - "INFO (periodic/1) [vds] Recovered new external domain"
11:49:16 VDSM restart
11:49:18 VM recovered and detected as "Changed state to Down: VM terminated with error (code=1)"
11:52:28 start VM succeeds
11:52:43 VM guest reboot

rh04:
09:49:22 VM start succeeded
10:33:27 VM guest reboot
10:37:15 VM guest reboot
11:44:57 shutdown, succeeded
13:10:59 VM start
13:40:25 shutdown

engine.log starts a day later at 2018-06-19 03:37:01
Please attach a correct log.

please also attach an earlier log from rh01 capturing the VM start prior to 2018-06-18 07:01:01, same for engine.log

(In reply to Michal Skrivanek from comment #5)
> please take a look at rh04 log 2018-06-18 11:57:13,281+1000
> Seems the incoming migration VM disk has no "source" attribute at all in
> _srcDomXML, though there is one in regular params with empty file=''

@mzamazal: the one on rh01 at 09:45:03 is the first occurrence, perhaps rather look there

worth noting there were a lot of snapshot manipulations in previous days (live storage merge)

Recovery fails due to missing `file' attribute. The failed recovery means the VM startup domain wrapper is never replaced with a running VM wrapper and most libvirt operations are rejected, while the VM is running. I can't inspect why `file' attribute is missing until Vdsm logs since the VM start are provided.

Actually <source> element of the CD-ROM drive is missing. This happens after CD-ROM ejection and is not handled in Vdsm. I'm looking for a fix.
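For illustration only, the failure mode described above (a CD-ROM `<disk>` that has no `<source>` element after ejection) can be sketched in Python. This is a minimal sketch and not the actual Vdsm fix: the `DOMAIN_XML` fragment, the file path in it, and the `drive_paths` helper are all hypothetical and were made up for this example. It only shows that code reading drive paths from libvirt domain XML has to tolerate a missing `<source>` instead of assuming a `file` attribute is always present.

```python
import xml.etree.ElementTree as ET

# Hypothetical domain XML fragment (not taken from the logs in this bug):
# one regular disk with a <source>, and one CD-ROM whose media was ejected,
# so its <disk> element carries no <source> child at all.
DOMAIN_XML = """
<domain type='kvm'>
  <devices>
    <disk type='file' device='disk'>
      <source file='/rhev/data-center/example/disk.img'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <disk type='file' device='cdrom'>
      <target dev='hdc' bus='ide'/>
      <readonly/>
    </disk>
  </devices>
</domain>
"""


def drive_paths(domain_xml):
    """Map each disk target device to its backing path ('' when empty)."""
    paths = {}
    for disk in ET.fromstring(domain_xml).iter('disk'):
        target = disk.find('target')
        if target is None:
            continue
        source = disk.find('source')
        # An ejected CD-ROM has no <source>; treat it as an empty drive
        # rather than failing on the missing 'file' attribute.
        path = source.get('file', '') if source is not None else ''
        paths[target.get('dev')] = path
    return paths


print(drive_paths(DOMAIN_XML))
```

With the guard in place the empty CD-ROM drive is reported with an empty path, while the regular disk keeps its path. Without the `source is not None` check, the lookup would raise on the ejected drive.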
(In reply to Ribu Tho from comment #11)
> Please can you specify the root cause to this issue

The problem occurs when a CD previously inserted in a VM (either on VM startup or during VM run) is ejected and Vdsm is fenced/restarted. The running VM can't be fully recovered in such a case and can't be handled from Engine anymore. That leads to Engine confusion, up to a possible split brain.

The only workaround I know of to handle the situation once it occurs is to shut down the VM from the guest OS.

Hi,

I've tried to reproduce the scenario on the downstream version rhvm-4.2.6-0.1.el7ev.noarch according to the above description before getting the fixed version. Could you please look at the scenario and advise how to reproduce the situation to verify the bug?

My steps:
1. Run the HA VM with lease (CD attached) (associated with iscsi SD). Please see the logs. The scenario starts at line 2425 (2018-08-09 16:17:57,971+03) in the engine.log.
2. Cause the host to be Non Responsive (by blocking access by iptables from Engine to the Host).

Actual result: the VM is not migrated to another host. It stayed in Unknown status.

Created attachment 1474695
logs: engine, vdsm, libvirt, qemu
Hi Polina, the actual result would be correct and no split brain occurred. However, for proper verification Vdsm must be restarted during the host problems, e.g. by Engine soft-fencing, which didn't happen in your scenario. Perhaps Ribu Tho can clarify how exactly the originally reported scenario was invoked.

Hi Ribu Tho, could you please clarify the scenario? What caused the host problem, so that I can reproduce this? Thank you.

(In reply to Polina from comment #16)
> Hi,
>
> I've tried to reproduce the scenario on the downstream version
> rhvm-4.2.6-0.1.el7ev.noarch according the above description before getting
> the fixed version.

As you can see in bug 1594793, it was verified by Israel in 4.2.4.5-0.1. Why do you expect it is not fixed in 4.2.6?

> Could you please look at the scenario and advice how to
> reproduce the situation to verify the bug?
> My steps:
> 1. Run the HA VM with lease (CD attached) (associated with iscsi SD). Please
> see the logs . the scenario starts at the line 2425 2018-08-09
> 16:17:57,971+03 in the engine.log:
> 2. Cause the host to be Non Responsive (by blocking access by iptables from
> Engine to the Host).

Why do you induce an unresponsive host? Please read the root cause steps in comment #12.

> Actual result: the VM is not migrated to other host. stayed in Unknown
> status.

That's not related to this bug at all.

Verified on upstream version:
ovirt-release-master-4.3.0-0.1.master.20180815000055.gitdd598f0.el7.noarch
vdsm-4.30.0-527.gitcec1054.el7.x86_64

Steps for verification are from https://bugzilla.redhat.com/show_bug.cgi?id=1594793#c16

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1077