Bug 1339291

Summary: VM split brain during networking issues
Product: Red Hat Enterprise Virtualization Manager
Reporter: Michal Skrivanek <michal.skrivanek>
Component: ovirt-engine
Assignee: Arik <ahadas>
Status: CLOSED ERRATA
QA Contact: Nisim Simsolo <nsimsolo>
Severity: high
Docs Contact:
Priority: high
Version: 3.6.0
CC: ahadas, amarchuk, bgraveno, dmoessne, fromani, gklein, jentrena, lsurette, mgoldboi, michal.skrivanek, mjankula, mkalinin, mtessun, mzamazal, nsimsolo, pablo.iranzo, pkliczew, rbalakri, Rhev-m-bugs, rhodain, sfroemer, srevivo, ykaul
Target Milestone: ovirt-4.0.0-rc
Keywords: ZStream
Target Release: 4.0.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, during a VDSM restart, the host would still respond to queries over the JSON-RPC protocol from the Manager, which could result in the Manager reporting an incorrect virtual machine state. This could cause a highly available virtual machine to be restarted even though it was already running. This has been fixed: API calls are now blocked during VDSM service startup.
Story Points: ---
Clone Of:
: 1342388 (view as bug list)
Environment:
Last Closed: 2016-08-23 20:40:32 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1337203, 1342388, 1344075    

Description Michal Skrivanek 2016-05-24 15:08:55 UTC
Apparently, during problematic network conditions, a split brain can happen for HA VMs.

Comment 3 Michal Skrivanek 2016-05-25 16:02:58 UTC
Unfortunately we do not have enough information about how this can happen.
We can only add more logging to both vdsm and the engine to be able to get closer the next time it happens.

Suggesting to add INFO-level logging of _all_ verbs, including "list", on both sides. As an optimization we can log a verb only when its result has changed since the last call; that way the volume should still be reasonable and not flood the logs.
In addition we should log, at INFO level, events which are discarded due to a concurrent periodic poll. It shouldn't happen too often.
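
A minimal sketch of the "log only on change" idea, with a hypothetical helper and logger name (illustrative only, not actual vdsm/engine code):

import logging

log = logging.getLogger("virt.poll")

_last_results = {}  # verb name -> last result that was logged


def log_if_changed(verb, result):
    # Log the result of a polled verb at INFO level only when it differs from
    # the previous response, so frequent "list" calls do not flood the log
    # while every state change is still recorded.
    if _last_results.get(verb) != result:
        _last_results[verb] = result
        log.info("%s returned: %s", verb, result)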

Comment 4 Arik 2016-05-25 16:09:46 UTC
(In reply to Michal Skrivanek from comment #3)
> In addition we should log, at INFO level, events which are discarded due to
> a concurrent periodic poll. It shouldn't happen too often.

If we can have some periodic timer in VDSM that sends, via an event, the same data that is returned by the polling, it would be better IMO. All we need then is to store the timestamp of the last event and have the host monitoring act when no event arrives within some threshold. Then we won't need to discard events (except those that arrive out of order).
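
A rough sketch of that idea on the engine side, assuming an injected poll callable and apply callable (names are made up, not the actual engine code):

import time

EVENT_TIMEOUT = 30  # seconds without an event before falling back to polling


class VmEventMonitor:
    # Engine-side sketch: apply VM state from events as they arrive, and
    # poll explicitly only when events stop arriving for too long.

    def __init__(self, poll_fn, apply_fn):
        self._poll_fn = poll_fn    # callable returning the full VM status list
        self._apply_fn = apply_fn  # callable applying VM status to the engine
        self._last_event_ts = 0.0

    def on_event(self, vm_status):
        self._last_event_ts = time.monotonic()
        self._apply_fn(vm_status)

    def monitoring_cycle(self):
        if time.monotonic() - self._last_event_ts > EVENT_TIMEOUT:
            self._apply_fn(self._poll_fn())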

Comment 11 Michal Skrivanek 2016-06-01 07:52:06 UTC
we will add more logging to both the engine and vdsm to be able to correlate what is going on in these situations.
Unfortunately we have nothing concrete about the mentioned issue; we will need to reproduce it afterwards.

Comment 12 Arik 2016-06-01 20:23:28 UTC
Seems like I figured out the problem.
In short, the _recovery flag blocks API calls with XML-RPC, but it is not used by JSON-RPC to block API calls.

More detailed information:
I hacked VDSM with a 1-minute sleep right before the recovery.
Then I started one VM on a host and restarted VDSM.
The log in the engine looks like this:
2016-06-01 23:17:54,480 Poll 962d01e9-c069-4daf-adfd-831e762147f8: 6b307527:Up(1 VMs)
2016-06-01 23:18:10,373 Poll 962d01e9-c069-4daf-adfd-831e762147f8: 6b307527:Up(1 VMs)
2016-06-01 23:18:25,495 Poll 962d01e9-c069-4daf-adfd-831e762147f8: 6b307527:Up(1 VMs)
2016-06-01 23:18:40,605 Poll 962d01e9-c069-4daf-adfd-831e762147f8:(1 VMs)
2016-06-01 23:18:55,715 Poll 962d01e9-c069-4daf-adfd-831e762147f8:(0 VMs)
2016-06-01 23:19:10,780 Poll 962d01e9-c069-4daf-adfd-831e762147f8:(0 VMs)
2016-06-01 23:19:25,898 Poll 962d01e9-c069-4daf-adfd-831e762147f8:(0 VMs)
2016-06-01 23:19:40,972 Poll 962d01e9-c069-4daf-adfd-831e762147f8: 6b307527:Up(1 VMs)

The calls before 23:18:40,605 were done before the restart.
The calls between 23:18:40,605 and 23:19:25,898 were done after the restart, during the recovery (they should have been blocked!).
The call at 23:19:40,972 was done after the recovery.

So the engine can get VMs in WaitForLaunch if their 'wrapper' was initialized but the query from libvirt did not finish, and the engine might not get VMs whose 'wrappers' were not initialized yet, if I understand the VDSM code correctly.
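
A conceptual sketch of what honoring that flag on the JSON-RPC path could look like (class and method names are illustrative, not the actual VDSM code):

class RecoveryInProgress(Exception):
    # Raised while VDSM has not yet finished recovering existing VMs.
    pass


class JsonRpcDispatcher:
    # Sketch: make the JSON-RPC binding honor the same recovery flag that the
    # XML-RPC binding already checks, instead of answering with partial data.

    def __init__(self, cif, methods):
        self._cif = cif          # object exposing the _recovery flag
        self._methods = methods  # mapping of verb name -> callable

    def dispatch(self, method, params):
        if self._cif._recovery:
            raise RecoveryInProgress("VDSM is recovering, try again later")
        return self._methods[method](*params)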

Comment 13 Arik 2016-06-01 20:28:01 UTC
(In reply to Arik from comment #12)
> More detailed information:
> I hacked VDSM with a 1-minute sleep right before the recovery.

'before the recovery' -> before the actual queries to libvirt and the local recovery files, but after the flag has been set and the 'recovery: started' message has been logged

Comment 23 Michal Skrivanek 2016-06-08 10:36:45 UTC
we do want the additional logging enhancements ASAP, but they are not holding back the actual fix anymore

Comment 24 Francesco Romani 2016-06-08 11:02:32 UTC
sorry automation, but this is MODIFIED.

Comment 25 Francesco Romani 2016-06-10 16:22:29 UTC
This is not ON_QA; the change is in the final stages of review, not yet committed on the branch.

Comment 26 Francesco Romani 2016-06-10 16:23:13 UTC
SORRY! wrong bug! this is indeed ON_QA

Comment 30 Nisim Simsolo 2016-06-23 14:58:08 UTC
Verified: 
rhevm-4.0.0.6-0.1.el7ev
vdsm-4.18.4-2.el7ev.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.16.x86_64
libvirt-daemon-1.2.17-13.el7_2.5.x86_64
sanlock-3.2.4-2.el7_2.x86_64

Verification scenarios:

# Add a 60-second sleep to /usr/share/vdsm/clientIf.py (the scenario that reproduced this bug before the fix):
1. Use 2 hosts under the same cluster. On the SPM host, edit /usr/share/vdsm/clientIf.py and add time.sleep(60) under def _recoverExistingVms(self): (see the sketch after this list).
2. Enable HA on the VM.
3. Run the VM.
4. Restart the VDSM service.
5. Verify the VM is not migrated to the second host.
After the VDSM service has restarted, verify the same qemu-kvm process is still running on the SPM host and that there is no qemu-kvm process for the same VM on the second host.
Verify the VM continues to run properly.
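
The edit referred to in step 1 looks roughly like this (a simplified illustration based on the step above, not a verbatim copy of /usr/share/vdsm/clientIf.py):

import time


class clientIF(object):
    def _recoverExistingVms(self):
        time.sleep(60)  # artificial 60-second delay to keep VDSM in recovery
        # ... the original recovery logic continues here ...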

# Stop the VDSM service:
1. Stop the VDSM service on the host with the running VM.
2. Wait for the host to become non-responsive and the VM to go into Unknown state.
3. Verify soft fencing is initiated on the host and the VM status is restored to Up.
4. Verify the VM continues to run properly.

Comment 32 errata-xmlrpc 2016-08-23 20:40:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1743.html