Apparently, under problematic network conditions a split brain can happen for HA VMs.
Unfortunately we do not have enough information about how this can happen. We can only add more logging to both VDSM and the engine so we can get closer the next time it happens. I suggest adding INFO logging of _all_ verbs, including "list", on both sides. As an optimization we can log a verb only when its result has changed since the last call; that way the volume should still be reasonable without flooding the log too much. In addition, we should log events that are discarded due to a concurrent periodic poll at INFO level. It shouldn't happen too often.
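A minimal sketch of the change-suppression idea in Python (VDSM's language); the helper name log_verb_if_changed and the module-level cache are hypothetical, not existing VDSM code:

import logging

log = logging.getLogger(__name__)

_last_results = {}  # verb name -> last result that was logged for it

def log_verb_if_changed(verb, result):
    # Log at INFO only when the result differs from the previous call,
    # so chatty verbs such as 'list' do not flood the log.
    if _last_results.get(verb) != result:
        _last_results[verb] = result
        log.info('verb %s returned: %r', verb, result)

A verb handler would call this with its return value; an unchanged result costs one dictionary lookup and produces no log line.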
(In reply to Michal Skrivanek from comment #3)
> In addition, we should log events that are discarded due to a concurrent
> periodic poll at INFO level. It shouldn't happen too often.
If we can have some periodic timer in VDSM that sends the same data that is returned by the polling via an event, it would be better IMO. All we need then is to store the timestamp of the last event and, in the host monitoring, act when no event has arrived within some threshold. Then we won't need to discard events at all (except those that arrive out of order).
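A rough sketch of that fallback, written in Python for brevity (the host monitoring actually lives in the Java engine); the class and method names are made up for illustration:

import time

POLL_FALLBACK_THRESHOLD = 30.0  # seconds of event silence before polling again

class HostMonitor:
    def __init__(self):
        self._last_event_ts = 0.0

    def on_stats_event(self, stats):
        # The event carries the same data a poll would return; record
        # when it arrived and consume it directly.
        self._last_event_ts = time.time()
        self._handle_stats(stats)

    def on_monitoring_cycle(self):
        # Fall back to polling only when the event stream has gone quiet.
        if time.time() - self._last_event_ts > POLL_FALLBACK_THRESHOLD:
            self._handle_stats(self._poll_host())

    def _handle_stats(self, stats):
        pass  # placeholder: update VM state from the stats

    def _poll_host(self):
        return {}  # placeholder: synchronous getAllVmStats-style poll

With events timestamped this way, discarding is only needed for the out-of-order case mentioned above.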
We will add more logging to both the engine and VDSM so we can correlate what is going on in these situations. Unfortunately we have nothing concrete about the mentioned issue yet; we will need to reproduce it afterwards.
Seems like I figured out the problem. In short: the _recovery flag that blocks API calls is honored by XML-RPC but is not used by JSON-RPC, so JSON-RPC calls are not blocked during recovery.

More detailed information:
I hacked VDSM with a 1-minute sleep right before the recovery. Then I started one VM on a host and restarted VDSM. The log in the engine looks like this:

2016-06-01 23:17:54,480 Poll 962d01e9-c069-4daf-adfd-831e762147f8: 6b307527:Up(1 VMs)
2016-06-01 23:18:10,373 Poll 962d01e9-c069-4daf-adfd-831e762147f8: 6b307527:Up(1 VMs)
2016-06-01 23:18:25,495 Poll 962d01e9-c069-4daf-adfd-831e762147f8: 6b307527:Up(1 VMs)
2016-06-01 23:18:40,605 Poll 962d01e9-c069-4daf-adfd-831e762147f8:(1 VMs)
2016-06-01 23:18:55,715 Poll 962d01e9-c069-4daf-adfd-831e762147f8:(0 VMs)
2016-06-01 23:19:10,780 Poll 962d01e9-c069-4daf-adfd-831e762147f8:(0 VMs)
2016-06-01 23:19:25,898 Poll 962d01e9-c069-4daf-adfd-831e762147f8:(0 VMs)
2016-06-01 23:19:40,972 Poll 962d01e9-c069-4daf-adfd-831e762147f8: 6b307527:Up(1 VMs)

The calls before 23:18:40,605 were made before the restart. The calls from 23:18:40,605 to 23:19:25,898 were made after the restart, during the recovery (they should have been blocked!). The call at 23:19:40,972 was made after the recovery. So the engine can get VMs in WaitForLaunch if their 'wrapper' was initialized but the query from libvirt had not finished, and the engine might not get VMs whose 'wrappers' were not initialized yet, if I understand the VDSM code correctly.
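To illustrate the asymmetry, a schematic sketch (not actual VDSM code) where one dispatch path consults the recovery flag and the other does not:

class Api:
    def __init__(self):
        self._recovery = True  # set while VM state is being recovered

    def list(self):
        return []  # during recovery this returns a partial VM set

class XmlRpcDispatcher:
    def __init__(self, api):
        self._api = api

    def call(self, verb, *args):
        if self._api._recovery:
            raise RuntimeError('recovering')  # blocked until recovery ends
        return getattr(self._api, verb)(*args)

class JsonRpcDispatcher:
    def __init__(self, api):
        self._api = api

    def call(self, verb, *args):
        # The bug: no _recovery check here, so 'list' is answered during
        # recovery with a partial VM set, which the engine reads as VMs
        # having gone down.
        return getattr(self._api, verb)(*args)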
(In reply to Arik from comment #12)
> More detailed information:
> I hacked VDSM with a 1-minute sleep right before the recovery.
'Right before the recovery' -> before the actual queries of libvirt and the local recovery files, but after the _recovery flag has been set and 'recovery: started' has been logged.
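Schematically, the placement of the injected sleep (a sketch; the method body and the _queryLibvirtAndRecoveryFiles name are stand-ins for the real code in clientIf.py):

import time

class ClientIf:
    def _recoverExistingVms(self):
        # At this point self._recovery is already True and
        # 'recovery: started' has already been logged.
        time.sleep(60)  # injected delay: any JSON-RPC call answered in
                        # this window sees a partial VM list
        self._queryLibvirtAndRecoveryFiles()  # stand-in for the real work
        self._recovery = False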
We do want the additional logging enhancements ASAP, but they are no longer holding back the actual fix.
Sorry, automation, but this is MODIFIED.
This is not ON_QA; the change is in the final stages of review and not yet committed on the branch.
SORRY! Wrong bug! This is indeed ON_QA.
Verified on:
rhevm-4.0.0.6-0.1.el7ev
vdsm-4.18.4-2.el7ev.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.16.x86_64
libvirt-daemon-1.2.17-13.el7_2.5.x86_64
sanlock-3.2.4-2.el7_2.x86_64

Verification scenarios:

# Add a 60-second sleep to /usr/share/vdsm/clientIf.py (the scenario that reproduced this bug before the fix):
1. Use 2 hosts under the same cluster; on the SPM host edit /usr/share/vdsm/clientIf.py and add time.sleep(60) under def _recoverExistingVms(self):
2. Enable HA on a VM.
3. Run the VM.
4. Restart the VDSM service.
5. Verify the VM does not migrate to the second host: after the VDSM service has restarted, verify the same qemu-kvm process is still running on the SPM host and that there is no qemu-kvm process for the same VM on the second host. Verify the VM continues to run properly.

# Stop the VDSM service:
1. Stop the VDSM service on the host with the running VM.
2. Wait for the host to become non-responsive and the VM to enter the Unknown state.
3. Verify soft fencing starts on the host and the VM status is restored to Up.
4. Verify the VM continues to run properly.
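For step 5 of the first scenario, a small helper one might run on each host; it assumes the VM name appears on the qemu-kvm command line and is illustrative, not part of any verification tooling:

import subprocess

def qemu_process_running(vm_name):
    # True if a qemu-kvm process for the given VM is running locally.
    result = subprocess.run(
        ['pgrep', '-f', 'qemu-kvm.*%s' % vm_name],
        stdout=subprocess.PIPE)
    return result.returncode == 0

Expected outcome: True on the SPM host throughout the VDSM restart, False on the second host.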
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-1743.html