Bug 1344075

Summary: [3.5] VM split brain during networking issues
Product: Red Hat Enterprise Virtualization Manager
Reporter: Michal Skrivanek <michal.skrivanek>
Component: vdsm
Assignee: Francesco Romani <fromani>
Status: CLOSED ERRATA
QA Contact: Nisim Simsolo <nsimsolo>
Severity: high
Priority: high
Version: 3.5.7
Docs Contact:
CC: adahms, agk, ahadas, bazulay, fromani, jentrena, lsurette, mgoldboi, michal.skrivanek, mkalinin, mtessun, nsimsolo, pkliczew, pstehlik, rbalakri, rhev-integ, Rhev-m-bugs, rhodain, srevivo, stirabos, tdosek, ycui, ykaul
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Previously, when the VDSM service was restarted on a host, the host would still respond to queries from the Manager over the JSON-RPC protocol, which could lead to an incorrect virtual machine status being recorded in the engine database. For highly available virtual machines, this could under certain circumstances cause the Manager to restart a virtual machine that was in fact still running. This issue has now been resolved: API calls are correctly blocked while the VDSM service is starting.
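In outline, the fix gates API dispatch on a recovery flag. A minimal sketch of that idea, using hypothetical names (ServiceState, dispatch) rather than VDSM's actual internals:

```python
# Minimal sketch of refusing API calls during recovery.
# Class and method names here are illustrative, not VDSM's real code.
class ServiceState(object):
    def __init__(self):
        self.recovering = True  # True until VM state is fully restored

    def dispatch(self, handler, *args):
        if self.recovering:
            # Fail fast so the engine never records a stale/empty VM list.
            raise RuntimeError('recovering from crash or initializing')
        return handler(*args)
```

A caller that receives the error retries later instead of concluding the VM is gone, which is what prevents the spurious HA restart.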
Story Points: ---
Clone Of: 1342388
Environment:
Last Closed: 2016-06-27 12:42:45 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1339291, 1342388
Bug Blocks:

Description Michal Skrivanek 2016-06-08 16:54:18 UTC
+++ This bug was initially created as a clone of Bug #1342388 +++

+++ This bug is a RHEV-M zstream clone. The original bug is: +++
+++   https://bugzilla.redhat.com/show_bug.cgi?id=1339291. +++
+++ Requested by "mskrivan" +++

See the parent bug for details.

Comment 2 Francesco Romani 2016-06-13 13:31:05 UTC
https://gerrit.ovirt.org/#/c/58892/ merged -> MODIFIED

Comment 4 Nisim Simsolo 2016-06-22 15:09:07 UTC
Verification build:
rhevm-3.5.8-0.1.el6ev.noarch
qemu-kvm-rhev-0.12.1.2-2.491.el6_8.1.x86_64
libvirt-0.10.2-60.el6.x86_64
vdsm-4.16.37-1.el6ev.x86_64
sanlock-2.8-2.el6_5.x86_64

Verification scenarios:

# Add a 60-second sleep to /usr/share/vdsm/clientIf.py (the scenario that reproduced this bug before the fix):
1. Use two hosts in the same cluster. On the SPM host, edit /usr/share/vdsm/clientIf.py and add time.sleep(60) at the beginning of _recoverExistingVms().
2. Enable HA on a VM.
3. Run the VM.
4. Restart the VDSM service.
5. Verify the VM does not migrate to the second host.
6. After the VDSM service restarts, verify the same qemu-kvm process is still running on the SPM host and that there is no qemu-kvm process for that VM on the second host.
7. Verify the VM continues to run properly.
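The edit described in step 1 above can be sketched as follows. This is a hypothetical, standalone illustration: the real _recoverExistingVms() body is VDSM's VM-recovery logic, and only the injected delay is the change under test.

```python
import time

# Hypothetical stand-in for VDSM's clientIF class; the real recovery
# logic is omitted, only the injected delay matters for the test.
class clientIF(object):
    def _recoverExistingVms(self):
        time.sleep(60)  # injected delay: keeps VDSM in recovery for 60 s
        # ... the original VM recovery logic would continue here ...
```

With the fix in place, the engine's queries are rejected for the whole 60-second window instead of being answered with an empty, and therefore misleading, VM list.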

# Stop the VDSM service:
1. Stop the VDSM service on the host running the VM.
2. Wait for the host to become non-responsive and the VM to enter the Unknown state.
3. Verify soft fencing starts on the host and the VM status is restored to Up.
4. Verify the VM continues to run properly.

# Power off the host:
1. Power off the host with the VM running on it.
2. Wait for the host to become non-responsive and the VM to enter the Unknown state.
3. From the webadmin portal, confirm 'Host has been rebooted'.
4. Verify the VM moves to the active host and is restarted there.

Comment 6 errata-xmlrpc 2016-06-27 12:42:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1342

Comment 7 Marina Kalinin 2016-12-05 21:34:20 UTC
Fixed in vdsm-4.16.37-1.el6ev.x86_64, prior to 3.5.9.
Engine bug for 3.5.9:
https://bugzilla.redhat.com/show_bug.cgi?id=1352612

Comment 8 Marina Kalinin 2016-12-05 21:35:15 UTC
Sorry, the 3.5.9 bug is still vdsm-hostdeploy.