Bug 1342388 - VM split brain during networking issues
Summary: VM split brain during networking issues
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-3.6.7
Target Release: 3.6.7
Assignee: Arik
QA Contact: Nisim Simsolo
URL:
Whiteboard:
Duplicates: 1337203 1452393
Depends On: 1339291
Blocks: 1344075
 
Reported: 2016-06-03 07:22 UTC by rhev-integ
Modified: 2021-06-10 11:20 UTC
CC List: 22 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously during vdsm restart, the host would still respond to queries over JSON-RPC protocol from the Manager, which could result in the Manager reporting the incorrect virtual machine state. This could cause a highly available virtual machine to restart despite it already running. This has been fixed and the API calls are blocked during the vdsm service startup.
Clone Of: 1339291
Clones: 1344075
Environment:
Last Closed: 2016-06-29 16:20:35 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1452393 0 urgent CLOSED RHEV Guests are corrupted regularly 2021-06-10 12:24:42 UTC
Red Hat Knowledge Base (Solution) 2356491 0 None None None 2016-06-08 07:20:44 UTC
Red Hat Product Errata RHBA-2016:1364 0 normal SHIPPED_LIVE Red Hat Enterprise Virtualization Manager (rhevm) bug fix 3.6.7 2016-06-29 20:18:44 UTC
oVirt gerrit 58409 0 master MERGED core: log vms retrieved on statistics polling 2021-01-03 17:10:06 UTC
oVirt gerrit 58462 0 master MERGED jsonrpc: Fix log level overriding of some methods 2021-01-03 17:10:08 UTC
oVirt gerrit 58463 0 master MERGED rpc: Lower logging priority just for getAllVmStats 2021-01-03 17:10:06 UTC
oVirt gerrit 58464 0 master MERGED rpc: Log calls of API methods with possibly large results 2021-01-03 17:10:06 UTC
oVirt gerrit 58465 0 master MERGED rpc: Log important info from VM stats 2021-01-03 17:10:09 UTC
oVirt gerrit 58518 0 master MERGED ignore incoming requests during recovery with json-rpc 2021-01-03 17:10:45 UTC
oVirt gerrit 58567 0 master MERGED core: refine log for retrieved vms on statistics cycle 2021-01-03 17:10:10 UTC
oVirt gerrit 58737 0 None MERGED ignore incoming requests during recovery with json-rpc 2021-01-03 17:10:10 UTC
oVirt gerrit 58738 0 None MERGED ignore incoming requests during recovery with json-rpc 2021-01-03 17:10:10 UTC
oVirt gerrit 58772 0 ovirt-engine-3.6 MERGED core: log vms retrieved on statistics polling 2021-01-03 17:10:10 UTC
oVirt gerrit 58776 0 ovirt-engine-3.6.7 MERGED core: log vms retrieved on statistics polling 2021-01-03 17:10:08 UTC

Internal Links: 1452393
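
The Doc Text above summarizes the fix: while vdsm is still recovering existing VMs after a restart, incoming JSON-RPC API calls are no longer answered with incomplete state. The following minimal Python sketch illustrates that idea only; the class, method names, and error message are assumptions for illustration, not the actual vdsm code:

    # Illustrative sketch only -- not the actual vdsm implementation.
    # While recovery of existing VMs is still in progress after a restart,
    # API calls are refused instead of being answered with incomplete
    # (and therefore misleading) VM state.

    import threading
    import time


    class RecoveryError(Exception):
        """Raised for API calls received before recovery has finished."""


    class ApiServer:
        def __init__(self):
            self._recovered = threading.Event()
            self._vms = {}

        def start(self):
            # Recovery runs in the background; until it completes, the
            # server is reachable but must not report VM state.
            threading.Thread(target=self._recover_existing_vms).start()

        def _recover_existing_vms(self):
            time.sleep(2)                      # stand-in for the real recovery work
            self._vms = {"vm1": "Up"}          # hypothetical recovered state
            self._recovered.set()

        def get_all_vm_stats(self):
            if not self._recovered.is_set():
                # Before the fix, an empty or partial result would be returned
                # here, letting the engine conclude the HA VM was down.
                raise RecoveryError("recovery in progress")
            return self._vms


    if __name__ == "__main__":
        server = ApiServer()
        server.start()
        try:
            server.get_all_vm_stats()
        except RecoveryError as exc:
            print("call rejected during recovery:", exc)
        time.sleep(3)
        print("after recovery:", server.get_all_vm_stats())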

Comment 2 Francesco Romani 2016-06-07 12:18:48 UTC
58738 merged -> MODIFIED

Comment 3 Francesco Romani 2016-06-07 12:20:19 UTC
The Vdsm changes do not require doc_string updates.

Comment 4 Michal Skrivanek 2016-06-07 13:14:10 UTC
One more patch needs to get in :) https://gerrit.ovirt.org/#/c/58465/

Comment 5 Arik 2016-06-13 22:02:13 UTC
*** Bug 1337203 has been marked as a duplicate of this bug. ***

Comment 6 Nisim Simsolo 2016-06-27 14:04:22 UTC
Verification builds:
rhevm-3.6.7.5-0.1.el6
libvirt-client-1.2.17-13.el7_2.5.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.16.x86_64
vdsm-4.17.31-0.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64

Verification scenarios:

# Add a 60-second sleep to /usr/share/vdsm/clientIf.py (the scenario for reproducing this bug before the fix; a sketch is shown after this scenario):
1. Use 2 hosts under the same cluster. On the SPM host, edit /usr/share/vdsm/clientIf.py and add time.sleep(60) under def _recoverExistingVms(self):
2. Enable HA on the VM.
3. Run the VM.
4. Restart the vdsmd service (look for "VM is running in db and not running in VDS 'hostname'" in engine.log).
5. Verify the VM is not migrated to the second host.
After the VDSM service has restarted, verify the same qemu-kvm process is still running on the SPM host and that no qemu-kvm process for the same VM exists on the second host.
Verify the VM continues to run properly.
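
For reference, a minimal sketch of the step-1 tweak; the class and method below are only a skeleton standing in for vdsm's real clientIf.py, and the added time.sleep(60) line is the only intended change:

    # /usr/share/vdsm/clientIf.py -- reproduction aid only (illustrative skeleton).
    # Only the added time.sleep(60) reflects the actual instruction; it widens
    # the window in which the engine polls the host while recovery is ongoing.

    import time


    class clientIF(object):
        def _recoverExistingVms(self):
            time.sleep(60)  # added for the reproducer: delay VM recovery
            # ... original recovery logic continues here ...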

# Stop the VDSM service:
1. Stop the VDSM service on the host with the running VM.
2. Wait for the host to become non-responsive and the VM to go to Unknown state.
3. Verify soft fencing is started on the host and the VM status is restored to Up.
4. Verify the VM continues to run properly.

# Power off the host:
1. Power off the host with the VM running on it.
2. Wait for the host to become non-responsive and the VM to go to Unknown state.
3. From the webadmin, confirm 'Host has been rebooted'.
4. Verify the VM is restarted on the other active host.

Comment 8 errata-xmlrpc 2016-06-29 16:20:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1364

Comment 9 Adam Litke 2017-07-31 15:48:59 UTC
*** Bug 1452393 has been marked as a duplicate of this bug. ***

