Apparently, under problematic network conditions a split brain can happen for HA VMs.
Unfortunately we do not have enough information about how this can happen. We can only add more logging to both VDSM and the engine so we can get closer the next time it happens. I suggest adding INFO logging of _all_ verbs, including "list", on both sides. As an optimization we can log a verb only when its result has changed since the last call; that way the volume should still be reasonable without flooding the log too much. In addition, we should log events that are discarded due to a concurrent periodic poll at INFO level. It shouldn't happen too often.
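A minimal sketch of the change-suppression idea in Python (VDSM's language); the helper name log_verb_if_changed and the module-level cache are hypothetical, not existing VDSM code:

import logging

log = logging.getLogger(__name__)

_last_results = {}  # verb name -> last result that was logged for it

def log_verb_if_changed(verb, result):
    # Log at INFO only when the result differs from the previous call,
    # so chatty verbs such as 'list' do not flood the log.
    if _last_results.get(verb) != result:
        _last_results[verb] = result
        log.info('verb %s returned: %r', verb, result)

A verb handler would call this with its return value; an unchanged result costs one dictionary lookup and produces no log line.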
(In reply to Michal Skrivanek from comment #3)
> In addition, we should log events that are discarded due to a concurrent
> periodic poll at INFO level. It shouldn't happen too often.
If we can have some periodic timer in VDSM that sends the same data that is returned by the polling via an event, it would be better IMO. All we need then is to store the timestamp of the last event and, in the host monitoring, act when no event has arrived within some threshold. Then we won't need to discard events at all (except those that arrive out of order).
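A rough sketch of that fallback, written in Python for brevity (the host monitoring actually lives in the Java engine); the class and method names are made up for illustration:

import time

POLL_FALLBACK_THRESHOLD = 30.0  # seconds of event silence before polling again

class HostMonitor:
    def __init__(self):
        self._last_event_ts = 0.0

    def on_stats_event(self, stats):
        # The event carries the same data a poll would return; record
        # when it arrived and consume it directly.
        self._last_event_ts = time.time()
        self._handle_stats(stats)

    def on_monitoring_cycle(self):
        # Fall back to polling only when the event stream has gone quiet.
        if time.time() - self._last_event_ts > POLL_FALLBACK_THRESHOLD:
            self._handle_stats(self._poll_host())

    def _handle_stats(self, stats):
        pass  # placeholder: update VM state from the stats

    def _poll_host(self):
        return {}  # placeholder: synchronous getAllVmStats-style poll

With events timestamped this way, discarding is only needed for the out-of-order case mentioned above.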
We will add more logging to both the engine and VDSM so we can correlate what is going on in these situations. Unfortunately we have nothing concrete about the mentioned issue yet; we will need to reproduce it afterwards.
Seems like I figured out the problem. In short: the _recovery flag that blocks API calls is honored by XML-RPC but is not used by JSON-RPC, so JSON-RPC calls are not blocked during recovery.

More detailed information:
I hacked VDSM with a 1-minute sleep right before the recovery. Then I started one VM on a host and restarted VDSM. The log in the engine looks like this:

2016-06-01 23:17:54,480 Poll 962d01e9-c069-4daf-adfd-831e762147f8: 6b307527:Up(1 VMs)
2016-06-01 23:18:10,373 Poll 962d01e9-c069-4daf-adfd-831e762147f8: 6b307527:Up(1 VMs)
2016-06-01 23:18:25,495 Poll 962d01e9-c069-4daf-adfd-831e762147f8: 6b307527:Up(1 VMs)
2016-06-01 23:18:40,605 Poll 962d01e9-c069-4daf-adfd-831e762147f8:(1 VMs)
2016-06-01 23:18:55,715 Poll 962d01e9-c069-4daf-adfd-831e762147f8:(0 VMs)
2016-06-01 23:19:10,780 Poll 962d01e9-c069-4daf-adfd-831e762147f8:(0 VMs)
2016-06-01 23:19:25,898 Poll 962d01e9-c069-4daf-adfd-831e762147f8:(0 VMs)
2016-06-01 23:19:40,972 Poll 962d01e9-c069-4daf-adfd-831e762147f8: 6b307527:Up(1 VMs)

The calls before 23:18:40,605 were made before the restart. The calls from 23:18:40,605 to 23:19:25,898 were made after the restart, during the recovery (they should have been blocked!). The call at 23:19:40,972 was made after the recovery. So the engine can get VMs in WaitForLaunch if their 'wrapper' was initialized but the query from libvirt had not finished, and the engine might not get VMs whose 'wrappers' were not initialized yet, if I understand the VDSM code correctly.
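To illustrate the asymmetry, a schematic sketch (not actual VDSM code) where one dispatch path consults the recovery flag and the other does not:

class Api:
    def __init__(self):
        self._recovery = True  # set while VM state is being recovered

    def list(self):
        return []  # during recovery this returns a partial VM set

class XmlRpcDispatcher:
    def __init__(self, api):
        self._api = api

    def call(self, verb, *args):
        if self._api._recovery:
            raise RuntimeError('recovering')  # blocked until recovery ends
        return getattr(self._api, verb)(*args)

class JsonRpcDispatcher:
    def __init__(self, api):
        self._api = api

    def call(self, verb, *args):
        # The bug: no _recovery check here, so 'list' is answered during
        # recovery with a partial VM set, which the engine reads as VMs
        # having gone down.
        return getattr(self._api, verb)(*args)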
(In reply to Arik from comment #12)
> More detailed information:
> I hacked VDSM with a 1-minute sleep right before the recovery.
'Right before the recovery' -> before the actual queries of libvirt and the local recovery files, but after the _recovery flag has been set and 'recovery: started' has been logged.
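Schematically, the placement of the injected sleep (a sketch; the method body and the _queryLibvirtAndRecoveryFiles name are stand-ins for the real code in clientIf.py):

import time

class ClientIf:
    def _recoverExistingVms(self):
        # At this point self._recovery is already True and
        # 'recovery: started' has already been logged.
        time.sleep(60)  # injected delay: any JSON-RPC call answered in
                        # this window sees a partial VM list
        self._queryLibvirtAndRecoveryFiles()  # stand-in for the real work
        self._recovery = False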
We do want the additional logging enhancements ASAP, but they are no longer holding back the actual fix.
Sorry, automation, but this is MODIFIED.
This is not ON_QA; the change is in the final stages of review and not yet committed on the branch.
SORRY! Wrong bug! This is indeed ON_QA.
Verified on:
rhevm-4.0.0.6-0.1.el7ev
vdsm-4.18.4-2.el7ev.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.16.x86_64
libvirt-daemon-1.2.17-13.el7_2.5.x86_64
sanlock-3.2.4-2.el7_2.x86_64

Verification scenarios:

# Add a 60-second sleep to /usr/share/vdsm/clientIf.py (the scenario that reproduced this bug before the fix):
1. Use 2 hosts under the same cluster; on the SPM host edit /usr/share/vdsm/clientIf.py and add time.sleep(60) under def _recoverExistingVms(self):
2. Enable HA on a VM.
3. Run the VM.
4. Restart the VDSM service.
5. Verify the VM does not migrate to the second host: after the VDSM service has restarted, verify the same qemu-kvm process is still running on the SPM host and that there is no qemu-kvm process for the same VM on the second host. Verify the VM continues to run properly.

# Stop the VDSM service:
1. Stop the VDSM service on the host with the running VM.
2. Wait for the host to become non-responsive and the VM to enter the Unknown state.
3. Verify soft fencing starts on the host and the VM status is restored to Up.
4. Verify the VM continues to run properly.
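For step 5 of the first scenario, a small helper one might run on each host; it assumes the VM name appears on the qemu-kvm command line and is illustrative, not part of any verification tooling:

import subprocess

def qemu_process_running(vm_name):
    # True if a qemu-kvm process for the given VM is running locally.
    result = subprocess.run(
        ['pgrep', '-f', 'qemu-kvm.*%s' % vm_name],
        stdout=subprocess.PIPE)
    return result.returncode == 0

Expected outcome: True on the SPM host throughout the VDSM restart, False on the second host.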
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-1743.html