Bug 953645

Summary: [vdsm] [scale] After libvirt failure vdsm restarts and starts responding to XML-RPC after a big delay
Product: Red Hat Enterprise Virtualization Manager Reporter: Idith Tal-Kohen <italkohe>
Component: vdsm Assignee: Yaniv Bronhaim <ybronhei>
Status: CLOSED ERRATA QA Contact: Elad <ebenahar>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.2.0 CC: abaron, alonbl, bazulay, cpelland, dallan, danken, hateya, iheim, jentrena, lpeer, sgrinber, ykaul, zdover
Target Milestone: --- Keywords: ZStream
Target Release: 3.1.4   
Hardware: x86_64   
OS: Linux   
Whiteboard: infra
Fixed In Version: vdsm-4.10.2-1.11.el6ev Doc Type: Bug Fix
Doc Text:
Previously, when libvirt failed, the host would not respond until the connection between VDSM and libvirt had been re-established. This caused the environment to begin fencing, which would lead to an unusable environment. This was because when libvirt raised a connectivity failure, VDSM would begin fencing. When VDSM restarted, it would also restart the libvirt service. In large scale environments with high host loads, restarting the connection to libvirt took a long time. The long time it took to re-establish a connection to libvirt was the reason fencing started. VDSM now handles connectivity to libvirt in an external thread, so that VDSM is able to respond to API calls and report its status. The condition that resulted in fencing which in turn resulted in an unusable environment no longer occurs.
Story Points: ---
Clone Of: 949192 Environment:
Last Closed: 2013-05-01 18:26:25 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 948216, 949192    
Bug Blocks:    

Description Idith Tal-Kohen 2013-04-18 19:06:46 UTC
+++ This bug was initially created as a clone of Bug #949192 +++

Created attachment 732223
Logs vdsm, rhevm, libvirt

Description of problem:
The VDSM service doesn't restart when the libvirt service fails

Version-Release number of selected component (if applicable):
RHEVM 3.2 - SF11 environment:

RHEVM: rhevm-3.2.0-10.14.beta1.el6ev.noarch     
VDSM: vdsm-4.10.2-11.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.2.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64

Relies on BZ948216

How reproducible:
unknown

Steps to Reproduce:
1. Create a DC with 3 hosts connected to one SD; initially they are in the following state
Host-A – Maintenance 
Host-B – Maintenance 
Host-C – Active (SPM)
2. Second step
Host-A – Maintenance 
Host-B – Active (HSM) 
Host-C – Active (SPM)
3. Step 3
Host-A – Maintenance 
Host-B – Active (SPM) 
Host-C – Maintenance
4. Step 4
Host-A – Active (HSM) 
Host-B – Active (SPM) 
Host-C – Maintenance
5. Step 5
Host-A – Active (SPM) 
Host-B – Maintenance
Host-C – Maintenance
6. Step 6, activate Host-B & Host-C
Host-A – Active (SPM) 
Host-B – Active (HSM) 
Host-C – Unassigned - Failed to activate
Actual results:
Host activation failed

Expected results:
Host activation succeeds
Additional info:

Impact on user:
Host activation failed
/var/log/ovirt-engine/engine.log

/var/log/vdsm/vdsm.log

--- Additional comment from Yaniv Bronhaim on 2013-04-09 05:21:45 EDT ---

1. Currently we restart vdsm only if libvirt throws VIR_ERR_SYSTEM_ERROR. This might have changed in the libvirt implementation: when we try to reproduce it now, vdsm receives VIR_ERR_INTERNAL_ERROR from libvirt. I am considering using virConnectRegisterCloseCallback or also catching INTERNAL_ERROR.

2. When the connection is broken we perform prepareForShutdown, which takes time when a large number of VMs are running on the host (Bug 924801). We are considering a better flow to make it quicker.

The steps to reproduce can be much easier: just "kill -s SIGABRT [libvirt pid]" or "rm /var/run/libvirt/libvirt-sock"

In both cases vdsm should perform self fencing.
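
Item 1 above (treating both error codes as a lost connection) could be sketched roughly as follows. This is only an illustration, not the vdsm patch. `FakeLibvirtError` is a hypothetical stand-in for `libvirt.libvirtError` so the example runs without libvirt installed, and the constants are local placeholders for the real `libvirt.VIR_ERR_*` values. `call_libvirt` and `on_disconnect` are made-up names.

```python
# Placeholders: real code would use libvirt.VIR_ERR_SYSTEM_ERROR and
# libvirt.VIR_ERR_INTERNAL_ERROR from the libvirt Python module.
VIR_ERR_SYSTEM_ERROR = "VIR_ERR_SYSTEM_ERROR"
VIR_ERR_INTERNAL_ERROR = "VIR_ERR_INTERNAL_ERROR"

# Error codes that this sketch treats as "connection to libvirt is broken".
CONNECTIVITY_ERRORS = frozenset([VIR_ERR_SYSTEM_ERROR, VIR_ERR_INTERNAL_ERROR])


class FakeLibvirtError(Exception):
    """Hypothetical stand-in for libvirt.libvirtError."""

    def __init__(self, code, message=""):
        super().__init__(message)
        self._code = code

    def get_error_code(self):
        # libvirt.libvirtError exposes the error code the same way.
        return self._code


def is_connectivity_failure(err):
    """Return True if the error code is one we treat as a lost connection."""
    return err.get_error_code() in CONNECTIVITY_ERRORS


def call_libvirt(func, on_disconnect):
    """Run func(); notify on_disconnect for connectivity errors, then re-raise."""
    try:
        return func()
    except FakeLibvirtError as err:
        if is_connectivity_failure(err):
            on_disconnect(err)
        raise
```

Note that INTERNAL_ERROR is a broad code, so matching on it alone may also catch failures that have nothing to do with connectivity.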

--- Additional comment from Yaniv Bronhaim on 2013-04-09 12:24:54 EDT ---

libvirt has a registration method for a callback that is signaled when the connection with libvirt is closed (registerCloseCallback - http://www.libvirt.org/html/libvirt-libvirt.html#virConnectRegisterCloseCallback)

We should use this callback to distinguish connectivity errors. This callback is not available in libvirt < 1.0.1

libvirt for rhel6.4 doesn't contain it. It should first be backported to rhel6.4.z, and then the vdsm patch that uses it can be merged.
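
A rough sketch of how code could use the close callback where it exists and fall back otherwise. It assumes that older Python bindings simply lack the `registerCloseCallback` method on the connection object. `OldConn` and `NewConn` are hypothetical stand-ins for `libvirt.virConnect`, and `watch_connection` is a made-up helper name.

```python
def watch_connection(conn, on_close):
    """Register on_close(reason) if the binding supports close callbacks.

    Returns True when the callback was registered, False when the caller
    has to fall back to inspecting error codes instead.
    """
    register = getattr(conn, "registerCloseCallback", None)
    if register is None:
        return False

    def _closed(conn, reason, opaque):
        # The Python binding invokes the callback as cb(conn, reason, opaque).
        on_close(reason)

    register(_closed, None)
    return True


class OldConn:
    """Stand-in for a connection from bindings without the close callback."""


class NewConn:
    """Stand-in for a connection that supports registerCloseCallback."""

    def __init__(self):
        self._cb = None
        self._opaque = None

    def registerCloseCallback(self, cb, opaque):
        self._cb, self._opaque = cb, opaque

    def simulate_close(self, reason):
        # Test helper only: pretend libvirt dropped the connection.
        if self._cb is not None:
            self._cb(self, reason, self._opaque)
```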

--- Additional comment from Barak on 2013-04-09 12:59:49 EDT ---

Dave,

Can registerCloseCallback be back-ported to 6.4.z ?
The above will also solve the SIGABRT not caught by VDSM.

--- Additional comment from Dave Allan on 2013-04-09 13:13:36 EDT ---

(In reply to comment #3)
> Dave,
> 
> Can registerCloseCallback be back-ported to 6.4.z ?
> The above will also solve the SIGABRT not caught by VDSM.

No, unfortunately that's a new API call and cannot be backported.

--- Additional comment from Simon Grinberg on 2013-04-09 14:15:03 EDT ---

(In reply to comment #4)
> (In reply to comment #3)
> > Dave,
> > 
> > Can registerCloseCallback be back-ported to 6.4.z ?
> > The above will also solve the SIGABRT not caught by VDSM.
> 
> No, unfortunately that's a new API call and cannot be backported.

Because of compatibility? Because the changes are just too big? Or just because we do not do features in a Z release?

--- Additional comment from Yaniv Bronhaim on 2013-04-11 04:04:28 EDT ---

*** Bug 948216 has been marked as a duplicate of this bug. ***

--- Additional comment from Yaniv Bronhaim on 2013-04-18 04:36:42 EDT ---

When a connectivity failure is raised by libvirt, vdsm starts self-fencing. When vdsm restarts, it also restarts the libvirt service. At large scale and under high load on the host, the connection to libvirt takes more than a minute.

Contrary to the bug description, the issue is that the host doesn't respond until the connection to libvirt is back.

This leads to host fencing (Bug 924801). To avoid that, we move the first connection to libvirt into an external thread. This way vdsm is able to respond to API calls and report its status.
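
The idea of connecting in an external thread can be illustrated with a minimal sketch. It is not the actual vdsm change: `HostService`, the `connect` parameter, and the "recovering"/"up" status strings are all hypothetical names chosen for the example.

```python
import threading


class HostService:
    """Answers status queries while the libvirt connection is still pending."""

    def __init__(self, connect):
        # connect: a callable that blocks until a connection is available.
        self._connect = connect
        self._conn = None
        self._ready = threading.Event()
        self._thread = threading.Thread(
            target=self._connect_in_background,
            name="libvirt-connect",
            daemon=True,
        )

    def start(self):
        # Returns immediately; the slow connect runs off the main thread.
        self._thread.start()

    def _connect_in_background(self):
        self._conn = self._connect()
        self._ready.set()

    def get_status(self):
        # API calls can be answered before the connection is established.
        return "up" if self._ready.is_set() else "recovering"

    def wait_ready(self, timeout=None):
        """Block until connected; returns False if the timeout expires."""
        return self._ready.wait(timeout)
```

With this shape, a caller that polls `get_status()` gets an immediate answer during a slow reconnect instead of timing out.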

Comment 2 Elad 2013-04-30 15:01:30 UTC
Verified on RHEVM - 3.1 - SI28.1

vdsm-4.10.2-1.13.el6ev.x86_64
libvirt-0.10.2-18.el6_4.4.x86_64



15:38:14,214:
libvirtError: internal error client socket is closed


15:38:43,316:
logUtils::37::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)

Comment 3 errata-xmlrpc 2013-05-01 18:26:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0774.html