Bug 1339291 - VM split brain during networking issues
Summary: VM split brain during networking issues
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.0.0-rc
Target Release: 4.0.0
Assignee: Arik
QA Contact: Nisim Simsolo
URL:
Whiteboard:
Depends On:
Blocks: 1337203 1342388 1344075
 
Reported: 2016-05-24 15:08 UTC by Michal Skrivanek
Modified: 2020-03-11 15:07 UTC
CC List: 23 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, during a VDSM restart the host would still respond to queries over the JSON-RPC protocol from the Manager, which could result in the Manager reporting an incorrect virtual machine state. This could cause a highly available virtual machine to be restarted even though it was already running. This has been fixed so that API calls are now blocked during VDSM service startup.
Clone Of:
: 1342388 (view as bug list)
Environment:
Last Closed: 2016-08-23 20:40:32 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1337203 0 urgent CLOSED VM is running on more than one Hypervisor after network outage 2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution) 2356491 0 None None None 2016-06-08 07:19:50 UTC
Red Hat Product Errata RHEA-2016:1743 0 normal SHIPPED_LIVE Red Hat Virtualization Manager 4.0 GA Enhancement (ovirt-engine) 2016-09-02 21:54:01 UTC
oVirt gerrit 58409 0 master MERGED core: log vms retrieved on statistics polling 2020-12-21 21:28:30 UTC
oVirt gerrit 58462 0 master MERGED jsonrpc: Fix log level overriding of some methods 2020-12-21 21:28:28 UTC
oVirt gerrit 58463 0 master MERGED rpc: Lower logging priority just for getAllVmStats 2020-12-21 21:29:01 UTC
oVirt gerrit 58464 0 master MERGED rpc: Log calls of API methods with possibly large results 2020-12-21 21:28:30 UTC
oVirt gerrit 58465 0 master MERGED rpc: Log important info from VM stats 2020-12-21 21:29:01 UTC
oVirt gerrit 58518 0 master MERGED ignore incoming requests during recovery with json-rpc 2020-12-21 21:28:30 UTC
oVirt gerrit 58567 0 master MERGED core: refine log for retrieved vms on statistics cycle 2020-12-21 21:29:02 UTC
oVirt gerrit 58638 0 ovirt-engine-4.0 MERGED core: log vms retrieved on statistics polling 2020-12-21 21:28:29 UTC
oVirt gerrit 58686 0 master DRAFT rpc: Log important info from VM stats 2016-06-06 15:02:56 UTC
oVirt gerrit 58707 0 ovirt-4.0 MERGED jsonrpc: Fix log level overriding of some methods 2020-12-21 21:28:31 UTC
oVirt gerrit 58708 0 ovirt-4.0 MERGED rpc: Lower logging priority just for getAllVmStats 2020-12-21 21:28:31 UTC
oVirt gerrit 58709 0 ovirt-4.0 MERGED rpc: Log calls of API methods with possibly large results 2020-12-21 21:28:31 UTC
oVirt gerrit 58737 0 ovirt-4.0 MERGED ignore incoming requests during recovery with json-rpc 2020-12-21 21:28:29 UTC
oVirt gerrit 58738 0 ovirt-3.6 MERGED ignore incoming requests during recovery with json-rpc 2020-12-21 21:28:32 UTC
oVirt gerrit 58770 0 ovirt-4.0 MERGED Revert "rpc: Log calls of API methods with possibly large results" 2020-12-21 21:28:32 UTC
oVirt gerrit 58771 0 ovirt-4.0 MERGED Revert "rpc: Lower logging priority just for getAllVmStats" 2020-12-21 21:28:32 UTC
oVirt gerrit 58796 0 ovirt-3.6 ABANDONED json: use forzenset for filtered methods 2020-12-21 21:28:30 UTC
oVirt gerrit 58797 0 ovirt-3.6 ABANDONED jsonrpc: Fix log level overriding of some methods 2020-12-21 21:28:32 UTC
oVirt gerrit 58804 0 ovirt-4.0 ABANDONED rpc: Log calls of API methods with possibly large results 2020-12-21 21:28:33 UTC
oVirt gerrit 58833 0 master MERGED xmlrpc: Hide fenceNode password in the log 2020-12-21 21:29:04 UTC
oVirt gerrit 58892 0 ovirt-3.5 MERGED ignore incoming requests during recovery with json-rpc 2020-12-21 21:28:31 UTC
oVirt gerrit 59230 0 ovirt-4.0 MERGED jsonrpc: events: do not send events on recovery 2020-12-21 21:29:05 UTC
oVirt gerrit 59235 0 ovirt-3.6 MERGED jsonrpc: events: do not send events on recovery 2020-12-21 21:28:31 UTC

Internal Links: 1337203 1398664

Description Michal Skrivanek 2016-05-24 15:08:55 UTC
Apparently, under problematic network conditions a split brain can occur for HA VMs.

Comment 3 Michal Skrivanek 2016-05-25 16:02:58 UTC
Unfortunately we do not have enough information about how this can happen.
We can only add more logging to both vdsm and engine to be able to get closer the next time it happens.

Suggesting to add INFO logging of _all_ verbs, including "list", on both sides. As an optimization we can log only when there is a change since the last call; that way it should still be reasonable and not flood the logs too much.
In addition we should log events which are discarded due to a concurrent periodic poll at INFO level. It shouldn't be too often.
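
A minimal sketch of the "log only on change" part of this suggestion; all names below are hypothetical and this is not the actual vdsm or engine code:

import logging

logger = logging.getLogger('api.audit')
_last_result_per_verb = {}

def log_if_changed(verb, result):
    """Emit an INFO line for `verb` only when its result differs from the
    previously logged one, to keep the log volume reasonable."""
    if _last_result_per_verb.get(verb) != result:
        logger.info('%s returned: %s', verb, result)
        _last_result_per_verb[verb] = result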

Comment 4 Arik 2016-05-25 16:09:46 UTC
(In reply to Michal Skrivanek from comment #3)
> In addition we should log events which are discarded due to a concurrent
> periodic poll at INFO level. It shouldn't be too often.

If we can have a periodic timer in VDSM that sends, via an event, the same data that is returned by the polling, that would be better IMO. All we need then is to store the timestamp of the last event, and in the host monitoring act when no event has arrived within some threshold. Then we won't need to discard events (except those that arrive out of order).
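
A hedged sketch of that idea, in Python-style pseudo-code for illustration only (the engine itself is Java, and the threshold and names below are made up):

import time

EVENT_TIMEOUT = 30                 # hypothetical threshold, in seconds
_last_event_time = {}              # host_id -> time of last VM-stats event

def on_vm_stats_event(host_id, stats):
    # Remember when the last event arrived and apply the reported VM states.
    _last_event_time[host_id] = time.monotonic()
    # ... update VM state from the event payload ...

def host_monitoring_cycle(host_id, poll_vm_stats):
    # Fall back to polling only when events have stopped arriving for too long.
    if time.monotonic() - _last_event_time.get(host_id, 0.0) > EVENT_TIMEOUT:
        stats = poll_vm_stats(host_id)
        # ... reconcile VM state from the poll result ...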

Comment 11 Michal Skrivanek 2016-06-01 07:52:06 UTC
We will add more logging to both the engine and VDSM to be able to correlate what is going on in these situations.
Unfortunately there is nothing concrete about the mentioned issue yet; we will need to reproduce it afterwards.

Comment 12 Arik 2016-06-01 20:23:28 UTC
It seems like I figured out the problem.
In short: the _recovery flag that blocks API calls is honored on the XML-RPC path but is not used by JSON-RPC to block API calls.

More detailed information:
I hacked VDSM with a 1-minute sleep right before the recovery.
Then I started one VM on a host and restarted VDSM.
The log in the engine looks like this:
2016-06-01 23:17:54,480 Poll 962d01e9-c069-4daf-adfd-831e762147f8: 6b307527:Up(1 VMs)
2016-06-01 23:18:10,373 Poll 962d01e9-c069-4daf-adfd-831e762147f8: 6b307527:Up(1 VMs)
2016-06-01 23:18:25,495 Poll 962d01e9-c069-4daf-adfd-831e762147f8: 6b307527:Up(1 VMs)
2016-06-01 23:18:40,605 Poll 962d01e9-c069-4daf-adfd-831e762147f8:(1 VMs)
2016-06-01 23:18:55,715 Poll 962d01e9-c069-4daf-adfd-831e762147f8:(0 VMs)
2016-06-01 23:19:10,780 Poll 962d01e9-c069-4daf-adfd-831e762147f8:(0 VMs)
2016-06-01 23:19:25,898 Poll 962d01e9-c069-4daf-adfd-831e762147f8:(0 VMs)
2016-06-01 23:19:40,972 Poll 962d01e9-c069-4daf-adfd-831e762147f8: 6b307527:Up(1 VMs)

The calls before 23:18:40,605 were done before the restart.
The calls between 23:18:40,605 and 23:19:25,898 were done after the restart, during the recovery (they should have been blocked!).
The call at 23:19:40,972 was done after the recovery.

So the engine can get VMs in the WaitForLaunch state if their 'wrapper' was initialized but the query from libvirt has not finished yet. And the engine might not get VMs whose 'wrappers' were not initialized yet, if I understand the VDSM code correctly.
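
For reference, the eventual fix tracked by the "ignore incoming requests during recovery with json-rpc" patches above makes the JSON-RPC path honor the recovery state as well. A hedged sketch of the idea with hypothetical names (the real change lives in VDSM's JSON-RPC server and looks different, and the error payload below is only a placeholder):

class JsonRpcDispatcher(object):
    """Illustrative only: reject API calls while VDSM is still recovering."""

    def __init__(self, api, is_recovering):
        self._api = api                      # object exposing the API verbs
        self._is_recovering = is_recovering  # callable: True during recovery

    def dispatch(self, method, params):
        if self._is_recovering():
            # Answering now could return an incomplete VM list, so tell the
            # caller to retry once recovery has finished.
            return {'error': {'code': 99,
                              'message': 'Recovering from crash or Initializing'}}
        return {'result': getattr(self._api, method)(**params)}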

Comment 13 Arik 2016-06-01 20:28:01 UTC
(In reply to Arik from comment #12)
> More detailed information:
> I hacked VDSM with a sleep right before the recovery for 1 minute.

'before the recovery' -> before the actual queries from libvirt and the local recovery files, but after the flag has been set and the 'recovery: started' message has been logged

Comment 23 Michal Skrivanek 2016-06-08 10:36:45 UTC
We do want more logging enhancements ASAP, but that is not holding back the actual fix anymore.

Comment 24 Francesco Romani 2016-06-08 11:02:32 UTC
Sorry, automation, but this is MODIFIED.

Comment 25 Francesco Romani 2016-06-10 16:22:29 UTC
This is not ON_QA; the change is in the final stages of review and not yet committed on the branch.

Comment 26 Francesco Romani 2016-06-10 16:23:13 UTC
SORRY! wrong bug! this is indeed ON_QA

Comment 30 Nisim Simsolo 2016-06-23 14:58:08 UTC
Verified: 
rhevm-4.0.0.6-0.1.el7ev
vdsm-4.18.4-2.el7ev.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.16.x86_64
libvirt-daemon-1.2.17-13.el7_2.5.x86_64
sanlock-3.2.4-2.el7_2.x86_64

Verification scenarios:

# Add a 60-second sleep to /usr/share/vdsm/clientIf.py (the scenario that reproduced this bug before the fix; see the sketch after this scenario):
1. Use 2 hosts under the same cluster; on the SPM host, edit /usr/share/vdsm/clientIf.py and add time.sleep(60) under def _recoverExistingVms(self):
2. Enable HA on the VM.
3. Run the VM.
4. Restart the VDSM service.
5. Verify the VM is not migrated to the second host.
After the VDSM service has restarted, verify the same qemu-kvm process is still running on the SPM host and verify there is no qemu-kvm process for the same VM on the second host.
Verify the VM continues to run properly.
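
The sketch referenced in step 1: a rough approximation of the edit to /usr/share/vdsm/clientIf.py, assuming the clientIF class layout; this is not the verbatim VDSM source.

import time

class clientIF(object):
    # ... rest of the class as shipped by VDSM ...

    def _recoverExistingVms(self):
        time.sleep(60)  # artificial delay added only for this reproduction
        # ... original recovery logic follows: query libvirt and the local
        # recovery files, then clear the recovery state ...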

# Stop the VDSM service:
1. Stop the VDSM service on the host with the running VM.
2. Wait for the host to become non-responsive and the VM to go into the Unknown state.
3. Verify soft fencing is started on the host and the VM status is restored to Up.
4. Verify the VM continues to run properly.

Comment 32 errata-xmlrpc 2016-08-23 20:40:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1743.html

