Description of problem:
The host running an HA VM goes non-responsive and the HA VM is force-started on another host. When the original host returns to the Up state, the VM ends up running on two different hosts at the same time, which leads to filesystem corruption.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Start VM on host H1
2. Host H1 goes non-responsive
3. VM starts on host H2
4. Host H1 resumes to Up State
5. VM ends up existing on two different hosts.
6. Engine switches the VM between Paused and Up status from time to time.
The VM ended up with filesystem corruption on the boot drive.
The VM should not run on two different hosts at the same time.
The issue occurred with VMs in HA mode.
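The symptom in step 5 can be checked mechanically. The following is a minimal sketch, not Engine's actual logic: it assumes you have already collected, by whatever means, the list of VM names each host reports as running. The function name `vms_on_multiple_hosts` and the host and VM names are hypothetical.

```python
from collections import defaultdict


def vms_on_multiple_hosts(running):
    """Return {vm: sorted hosts} for VMs reported running on more than one host.

    `running` maps a host name to the VM names that host reports as running.
    """
    hosts_by_vm = defaultdict(set)
    for host, vms in running.items():
        for vm in vms:
            hosts_by_vm[vm].add(host)
    # Keep only VMs seen on at least two hosts (the split-brain symptom).
    return {vm: sorted(hosts) for vm, hosts in hosts_by_vm.items() if len(hosts) > 1}


# Hypothetical snapshot matching the reproduction steps: the HA VM is on H1 and H2.
print(vms_on_multiple_hosts({"H1": ["ha-vm"], "H2": ["ha-vm", "other-vm"]}))
# prints {'ha-vm': ['H1', 'H2']}
```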
please take a look at rh04 log 2018-06-18 11:57:13,281+1000
Seems the incoming migration VM disk has no "source" attribute at all in _srcDomXML, though there is one in regular params with empty file=''
09:45:03 VDSM restart
apparently not recovered correctly (failed due to same issue as in comment #5)
09:48:40 shutdown, but VM possibly not correctly undefined
start on 09:49:11 fails with VM machine already exists
09:49:21 again a destroy attempt
10:55:05,220 hmm? - "INFO (periodic/1) [vds] Recovered new external domain"
11:49:16 VDSM restart
11:49:18 VM recovered and detected as "Changed state to Down: VM terminated with error (code=1)"
11:52:28 start VM succeeds
11:52:43 VM guest reboot
09:49:22 VM start succeeded
10:33:27 VM guest reboot
10:37:15 VM guest reboot
11:44:57 shutdown, succeeded
13:10:59 VM start
starts a day later at 2018-06-19 03:37:01
Please attach a correct log.
please also attach an earlier log from rh01 capturing the VM start prior to 2018-06-18 07:01:01, same for engine.log
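Whether an attached log reaches back far enough can be checked before uploading it. The sketch below assumes each log line starts with a timestamp in the form quoted in this report (for example `2018-06-18 11:57:13,281+1000`, or with an hours-only offset such as `+03`). The helper names are hypothetical, and the `+1000` offset used for the cutoff is an assumption, since the time requested above carries no time zone.

```python
import re
from datetime import datetime

# Leading timestamp: date, time, milliseconds, then a UTC offset given either
# as +HHMM or as hours only (+HH).
_TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})([+-]\d{2})(\d{2})?")


def parse_log_timestamp(line):
    """Return an aware datetime for the line's leading timestamp, or None."""
    match = _TS.match(line)
    if match is None:
        return None
    stamp, millis, tz_hours, tz_minutes = match.groups()
    # Pad an hours-only offset such as "+03" to "+0300" so that %z accepts it.
    text = f"{stamp},{millis}{tz_hours}{tz_minutes or '00'}"
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f%z")


def log_starts_before(lines, cutoff):
    """True if the first timestamped line is earlier than the cutoff."""
    for line in lines:
        when = parse_log_timestamp(line)
        if when is not None:
            return when < cutoff
    return False


# Assumed cutoff: the requested time, interpreted in the +1000 zone.
cutoff = parse_log_timestamp("2018-06-18 07:01:01,000+1000")
sample = ["2018-06-18 11:57:13,281+1000 INFO (hypothetical line)"]
print(log_starts_before(sample, cutoff))
```

A log whose first timestamped line is later than the cutoff, like the sample here, does not cover the VM start and would need to be replaced by an earlier rotation.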
(In reply to Michal Skrivanek from comment #5)
> please take a look at rh04 log 2018-06-18 11:57:13,281+1000
> Seems the incoming migration VM disk has no "source" attribute at all in
> _srcDomXML, though there is one in regular params with empty file=''
@mzamazal: the one on rh01 at 09:45:03 is the first occurrence, perhaps rather look there
Worth noting: there were a lot of snapshot manipulations in the previous days (live storage merge).
Recovery fails due to the missing `file' attribute. Because recovery fails, the VM startup domain wrapper is never replaced with a running VM wrapper, so most libvirt operations are rejected even though the VM is running.
I can't investigate why the `file' attribute is missing until Vdsm logs covering the time since the VM start are provided.
Actually, the <source> element of the CD-ROM drive is missing. This happens after CD-ROM ejection and is not handled in Vdsm. I'm looking for a fix.
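To illustrate the failure mode, here is a minimal sketch of reading disk sources from a libvirt domain XML while tolerating a drive that has no <source> element. This is not Vdsm's code and not the actual fix; the function name `disk_sources`, the sample XML and the image path are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed domain XML: the CD-ROM has been ejected, so its <disk>
# element carries no <source> child at all.
SAMPLE_DOMAIN_XML = """
<domain type='kvm'>
  <devices>
    <disk type='file' device='cdrom'>
      <target dev='hdc' bus='ide'/>
    </disk>
    <disk type='file' device='disk'>
      <source file='/path/to/image'/>
      <target dev='vda' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""


def disk_sources(domain_xml):
    """Map each disk target name to its source path, or '' if it has none."""
    sources = {}
    for disk in ET.fromstring(domain_xml).iter("disk"):
        target = disk.find("target")
        if target is None:
            continue
        source = disk.find("source")
        if source is None:
            # Empty drive: report an empty path instead of raising.
            path = ""
        else:
            # File-backed disks use `file`, block-backed disks use `dev`.
            path = source.get("file") or source.get("dev") or ""
        sources[target.get("dev")] = path
    return sources


print(disk_sources(SAMPLE_DOMAIN_XML))
```

Code that instead assumes every disk has a source would fail on the ejected CD-ROM, which is the kind of unhandled case described above.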
(In reply to Ribu Tho from comment #11)
> Please can you specify the root cause to this issue
The problem occurs when a CD previously inserted in a VM (either on VM startup or while the VM is running) is ejected and Vdsm is then fenced or restarted. In that case the running VM can't be fully recovered and can no longer be managed from Engine, which leads to Engine confusion and possibly to split brain. The only workaround I know of once the situation occurs is to shut down the VM from the guest OS.
I've tried to reproduce the scenario on the downstream version rhvm-4.2.6-0.1.el7ev.noarch according to the above description before getting the fixed version. Could you please look at the scenario and advise how to reproduce the situation to verify the bug?
1. Run the HA VM with lease (CD attached) (associated with iscsi SD). Please see the logs; the scenario starts at line 2425 (2018-08-09 16:17:57,971+03) in engine.log.
2. Cause the host to be Non Responsive (by blocking access by iptables from Engine to the Host).
Actual result: the VM is not migrated to another host; it stays in Unknown status.
Created attachment 1474695
logs: engine, vdsm, libvirt, qemu
Hi Polina, the actual result would be correct and no split brain occurred. However, for proper verification, Vdsm must be restarted during the host problems, e.g. by Engine soft-fencing, which didn't happen in your scenario. Perhaps Ribu Tho can clarify exactly how the originally reported scenario was invoked.
Hi Ribu Tho, could you please clarify the scenario? What caused the host problem, so that I can reproduce it? Thank you.
(In reply to Polina from comment #16)
> I've tried to reproduce the scenario on the downstream version
> rhvm-4.2.6-0.1.el7ev.noarch according to the above description before getting
> the fixed version.
As you can see in bug 1594793 it was verified by Israel in 188.8.131.52-0.1.
Why do you expect it is not fixed in 4.2.6?
> Could you please look at the scenario and advise how to
> reproduce the situation to verify the bug?
> My steps:
> 1. Run the HA VM with lease (CD attached) (associated with iscsi SD). Please
> see the logs . the scenario starts at the line 2425 2018-08-09
> 16:17:57,971+03 in the engine.log:
> 2. Cause the host to be Non Responsive (by blocking access by iptables from
> Engine to the Host).
Why do you induce an unresponsive host? Please read the root cause steps in comment #12.
> Actual result: the VM is not migrated to other host. stayed in Unknown
That's not related to this bug at all.
verified on upstream version:
steps for verification are from
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.