+++ This bug was initially created as a clone of Bug #949192 +++

Created attachment 732223
## Logs vdsm, rhevm, libvirt

Description of problem:
The VDSM service doesn't restart when the libvirt service fails.

Version-Release number of selected component (if applicable):
RHEVM 3.2 - SF11 environment:

RHEVM: rhevm-3.2.0-10.14.beta1.el6ev.noarch
VDSM: vdsm-4.10.2-11.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.2.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64

Relies on BZ948216

How reproducible:
unknown

Steps to Reproduce:
1. Create a DC with 3 hosts connected to one SD. Initially the hosts are in the following state:
   Host-A – Maintenance
   Host-B – Maintenance
   Host-C – Active (SPM)
2. Second step
   Host-A – Maintenance
   Host-B – Active (HSM)
   Host-C – Active (SPM)
3. Step 3
   Host-A – Maintenance
   Host-B – Active (SPM)
   Host-C – Maintenance
4. Step 4
   Host-A – Active (HSM)
   Host-B – Active (SPM)
   Host-C – Maintenance
5. Step 5
   Host-A – Active (SPM)
   Host-B – Maintenance
   Host-C – Maintenance
6. Step 6, activate Host-B & Host-C
   Host-A – Active (SPM)
   Host-B – Active (HSM)
   Host-C – Unassigned - Failed to activate

Actual results:
Failed to activate host

Expected results:
Host is activated successfully

Additional info:
Impact on user: Failed to activate host

/var/log/ovirt-engine/engine.log
/var/log/vdsm/vdsm.log

--- Additional comment from Yaniv Bronhaim on 2013-04-09 05:21:45 EDT ---

1. Currently we restart vdsm only if libvirt throws VIR_ERR_SYSTEM_ERROR. This might have changed in the libvirt implementation. When we try to reproduce it now, vdsm receives VIR_ERR_INTERNAL_ERROR from libvirt. I am considering using virConnectRegisterCloseCallback or also catching INTERNAL_ERROR.

2. When the connection is broken we perform prepareForShutdown, which takes time when a large number of VMs are running on the host (Bug 924801). We are considering a better flow to make it quicker.

The steps to reproduce can be much simpler: just "kill -s SIGABRT [libvirt pid]" or "rm /var/run/libvirt/libvirt-sock". In both cases vdsm should perform self fencing.

--- Additional comment from Yaniv Bronhaim on 2013-04-09 12:24:54 EDT ---

libvirt has a registration method for a callback that is signaled when the connection with libvirt is closed (registerCloseCallback - http://www.libvirt.org/html/libvirt-libvirt.html#virConnectRegisterCloseCallback).
We should use this callback to distinguish connectivity errors.

This callback is not available in libvirt < 1.0.1, and libvirt for rhel6.4 doesn't contain it.
First it should be backported to rhel6.4.z, and then the vdsm patch that uses it can be merged.

--- Additional comment from Barak on 2013-04-09 12:59:49 EDT ---

Dave,

Can registerCloseCallback be back-ported to 6.4.z ?
The above will also solve the SIGABRT not caught by VDSM.

--- Additional comment from Dave Allan on 2013-04-09 13:13:36 EDT ---

(In reply to comment #3)
> Dave,
>
> Can registerCloseCallback be back-ported to 6.4.z ?
> The above will also solve the SIGABRT not caught by VDSM.

No, unfortunately that's a new API call and cannot be backported.

--- Additional comment from Simon Grinberg on 2013-04-09 14:15:03 EDT ---

(In reply to comment #4)
> (In reply to comment #3)
> > Dave,
> >
> > Can registerCloseCallback be back-ported to 6.4.z ?
> > The above will also solve the SIGABRT not caught by VDSM.
>
> No, unfortunately that's a new API call and cannot be backported.

Because of compatibility? Because the changes are just too big? Or just because we do not do features in a Z release?

--- Additional comment from Yaniv Bronhaim on 2013-04-11 04:04:28 EDT ---

*** Bug 948216 has been marked as a duplicate of this bug. ***

--- Additional comment from Yaniv Bronhaim on 2013-04-18 04:36:42 EDT ---

When a connectivity failure is raised by libvirt, vdsm starts self fencing. When vdsm restarts, it also restarts the libvirt service. At large scale and under high load on the host, the connection to libvirt takes more than a minute.
Contrary to the bug description, the issue is that the host doesn't respond until the connection to libvirt is back. This leads to host fencing (Bug 924801). To avoid that, we move the first connection to libvirt into a separate thread. This way vdsm will be able to respond to API calls and report its status.
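
For illustration only, the idea of opening the first libvirt connection off the main thread could be sketched roughly as below. This is not the actual vdsm patch: the `LibvirtConnector` class, the `open_connection` callable, and the "initializing"/"up" status strings are all hypothetical names chosen for the sketch.

```python
import threading
import time


class LibvirtConnector:
    """Hypothetical helper: open the libvirt connection in a background thread."""

    def __init__(self, open_connection, retry_delay=1.0):
        # open_connection is an assumed callable supplied by the caller,
        # e.g. lambda: libvirt.open("qemu:///system") in a real deployment.
        self._open = open_connection
        self._retry_delay = retry_delay
        self._conn = None
        self._ready = threading.Event()
        self._thread = threading.Thread(target=self._run, name="libvirt-connect")
        self._thread.daemon = True

    def start(self):
        """Begin connecting without blocking the caller."""
        self._thread.start()

    def _run(self):
        # Keep retrying until libvirt answers; on a loaded host this may
        # take more than a minute, which is why it runs off the main thread.
        while not self._ready.is_set():
            try:
                self._conn = self._open()
                self._ready.set()
            except Exception:
                # A real implementation would catch libvirt.libvirtError only.
                time.sleep(self._retry_delay)

    def status(self):
        """Answer status queries immediately, even while still connecting."""
        return "up" if self._ready.is_set() else "initializing"

    def wait(self, timeout=None):
        """Block until the connection is available or the timeout expires."""
        return self._ready.wait(timeout)
```

With something like this, the API layer can call `status()` right after `start()` and get an answer while the connection is still being established, instead of blocking until libvirt is reachable.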
Verified on RHEVM - 3.1 - SI28.1
vdsm-4.10.2-1.13.el6ev.x86_64
libvirt-0.10.2-18.el6_4.4.x86_64

15:38:14,214: libvirtError: internal error client socket is closed
15:38:43,316: logUtils::37::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)
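
The log above shows an internal error ("client socket is closed") followed by prepareForShutdown. A minimal sketch of that kind of check, treating both VIR_ERR_SYSTEM_ERROR and VIR_ERR_INTERNAL_ERROR as possible connection loss, might look like the following. The helper names `is_connectivity_error` and `call_guarded` are hypothetical and this is not the shipped vdsm code. The fallback numeric values are taken from libvirt's virErrorNumber enum and are only used when the libvirt Python bindings are not installed.

```python
try:
    import libvirt
    VIR_ERR_INTERNAL_ERROR = libvirt.VIR_ERR_INTERNAL_ERROR
    VIR_ERR_SYSTEM_ERROR = libvirt.VIR_ERR_SYSTEM_ERROR
except ImportError:
    # Fallback values from libvirt's virErrorNumber enum (assumption:
    # used only so the sketch runs without the bindings installed).
    VIR_ERR_INTERNAL_ERROR = 1
    VIR_ERR_SYSTEM_ERROR = 38

# Error codes this sketch treats as "the connection to libvirt may be gone".
CONNECTIVITY_CODES = frozenset([VIR_ERR_SYSTEM_ERROR, VIR_ERR_INTERNAL_ERROR])


def is_connectivity_error(exc, codes=CONNECTIVITY_CODES):
    """Return True if exc looks like a libvirtError with one of the given codes."""
    get_code = getattr(exc, "get_error_code", None)
    return get_code is not None and get_code() in codes


def call_guarded(func, on_disconnect):
    """Run func(); on a connectivity error, call on_disconnect() and re-raise."""
    try:
        return func()
    except Exception as exc:
        if is_connectivity_error(exc):
            # In vdsm this is where self fencing (prepareForShutdown) would start.
            on_disconnect()
        raise
```

Note that INTERNAL_ERROR is a broad code, so a check like this can also fire on failures unrelated to connectivity; a stricter version might additionally inspect the error domain or message.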
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-0774.html