Bug 953645 - [vdsm] [scale] After libvirt failure vdsm restarts and starts responding to XML-RPC after a big delay
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.2.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.1.4
Assignee: Yaniv Bronhaim
QA Contact: Elad
URL:
Whiteboard: infra
Depends On: 948216 949192
Blocks:
Reported: 2013-04-18 19:06 UTC by Idith Tal-Kohen
Modified: 2018-12-01 15:37 UTC (History)
CC List: 13 users

Fixed In Version: vdsm-4.10.2-1.11.el6ev
Doc Type: Bug Fix
Doc Text:
Previously, when libvirt failed, the host would not respond until the connection between VDSM and libvirt had been re-established. This caused the environment to begin fencing, which would lead to an unusable environment. This was because when libvirt raised a connectivity failure, VDSM would begin fencing. When VDSM restarted, it would also restart the libvirt service. In large scale environments with high host loads, restarting the connection to libvirt took a long time. The long time it took to re-establish a connection to libvirt was the reason fencing started. VDSM now handles connectivity to libvirt in an external thread, so that VDSM is able to respond to API calls and report its status. The condition that resulted in fencing which in turn resulted in an unusable environment no longer occurs.
Clone Of: 949192
Environment:
Last Closed: 2013-05-01 18:26:25 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Links
System                  ID              Private  Priority  Status        Summary                           Last Updated
Red Hat Product Errata  RHBA-2013:0774  0        normal    SHIPPED_LIVE  rhev 3.1.4 - vdsm bug fix update  2013-05-01 22:23:56 UTC
oVirt gerrit            14018           0        None      None          None                              Never

Description Idith Tal-Kohen 2013-04-18 19:06:46 UTC
+++ This bug was initially created as a clone of Bug #949192 +++

Created attachment 732223
Logs: vdsm, rhevm, libvirt

Description of problem:
The VDSM service does not restart when the libvirt service fails.

Version-Release number of selected component (if applicable):
RHEVM 3.2 - SF11 environment:

RHEVM: rhevm-3.2.0-10.14.beta1.el6ev.noarch     
VDSM: vdsm-4.10.2-11.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.2.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64

Depends on BZ 948216

How reproducible:
unknown

Steps to Reproduce:
1. Create a DC with 3 hosts connected to one SD. Initially the hosts are in the following states:
Host-A – Maintenance 
Host-B – Maintenance 
Host-C – Active (SPM)
2. Step 2
Host-A – Maintenance 
Host-B – Active (HSM) 
Host-C – Active (SPM)
3. Step 3
Host-A – Maintenance 
Host-B – Active (SPM) 
Host-C – Maintenance
4. Step 4
Host-A – Active (HSM) 
Host-B – Active (SPM) 
Host-C – Maintenance
5. Step 5
Host-A – Active (SPM) 
Host-B – Maintenance
Host-C – Maintenance
6. Step 6: activate Host-B and Host-C
Host-A – Active (SPM) 
Host-B – Active (HSM) 
Host-C – Unassigned (failed to activate)

Actual results:
Host activation fails

Expected results:
Host activation succeeds

Additional info:

Impact on user:
Host activation fails
/var/log/ovirt-engine/engine.log

/var/log/vdsm/vdsm.log

--- Additional comment from Yaniv Bronhaim on 2013-04-09 05:21:45 EDT ---

1. Currently we restart vdsm only if libvirt throws VIR_ERR_SYSTEM_ERROR. This might have changed in the libvirt implementation: when we try to reproduce the failure now, vdsm receives VIR_ERR_INTERNAL_ERROR from libvirt. I am considering using virConnectRegisterCloseCallback, or also catching INTERNAL_ERROR.

2. When the connection is broken we perform prepareForShutdown, which takes time when a large number of VMs are running on the host (Bug 924801). We are considering a better flow to make it quicker.

The steps to reproduce can be much simpler: just run "kill -s SIGABRT [libvirt pid]" or "rm /var/run/libvirt/libvirt-sock".

In both cases vdsm should perform self-fencing.
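
To illustrate point 1, here is a minimal sketch of treating both error codes as a lost connection. It is not the vdsm code: FakeLibvirtError, guard_libvirt_call and on_disconnect are hypothetical names, and the two constants are copied from libvirt's virErrorNumber enum so that the example runs without the libvirt bindings. With the real bindings the exception would be libvirt.libvirtError, which also provides get_error_code().

```python
import functools

# Values of virErrorNumber in libvirt's virterror.h, copied here so the
# sketch runs without the libvirt bindings installed.
VIR_ERR_INTERNAL_ERROR = 1
VIR_ERR_SYSTEM_ERROR = 38

# Assumption: these are the codes that should count as "connection lost".
CONNECTIVITY_CODES = frozenset([VIR_ERR_SYSTEM_ERROR, VIR_ERR_INTERNAL_ERROR])


class FakeLibvirtError(Exception):
    """Stand-in for libvirt.libvirtError; only get_error_code() is modelled."""

    def __init__(self, code, message=""):
        super().__init__(message)
        self._code = code

    def get_error_code(self):
        return self._code


def is_connectivity_failure(error):
    """Return True if the error code is one we treat as a broken connection."""
    return error.get_error_code() in CONNECTIVITY_CODES


def guard_libvirt_call(on_disconnect, error_type=FakeLibvirtError):
    """Decorator: run on_disconnect() on a connectivity failure, then re-raise."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except error_type as e:
                if is_connectivity_failure(e):
                    on_disconnect()
                raise

        return wrapper

    return decorator
```

INTERNAL_ERROR is a generic code, so matching on it alone may be too broad. Checking the error domain as well (get_error_domain() in the Python bindings) is one possible way to narrow it.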

--- Additional comment from Yaniv Bronhaim on 2013-04-09 12:24:54 EDT ---

libvirt has a method for registering a callback that is signaled when the connection with libvirt is closed (registerCloseCallback - http://www.libvirt.org/html/libvirt-libvirt.html#virConnectRegisterCloseCallback).

We should use this callback to identify connectivity errors. The callback is not available in libvirt < 1.0.1.

libvirt for rhel6.4 doesn't contain it. It should first be backported to rhel6.4.z, and then the vdsm patch that uses it can be merged.
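
A rough sketch of how the callback could be used, with a fallback for bindings that lack it. watch_connection and on_closed are hypothetical names, not vdsm code. The sketch assumes the Python binding's registerCloseCallback(cb, opaque) method on the connection object, which invokes cb(conn, reason, opaque). It works on any object with that method, so it can be tried without libvirt installed.

```python
def watch_connection(conn, on_closed):
    """Register a close callback on a libvirt connection-like object.

    Returns True if the callback was registered, or False if the binding
    does not offer registerCloseCallback (older libvirt), in which case the
    caller has to keep detecting failures from error codes.
    """
    register = getattr(conn, "registerCloseCallback", None)
    if register is None:
        return False

    def _closed(connection, reason, opaque):
        # libvirt passes the connection, an integer close reason and the
        # opaque value given at registration; only the reason is forwarded.
        on_closed(reason)

    register(_closed, None)
    return True
```

With real libvirt, close callbacks are only delivered while an event loop implementation is registered and running, so this would have to be combined with one.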

--- Additional comment from Barak on 2013-04-09 12:59:49 EDT ---

Dave,

Can registerCloseCallback be back-ported to 6.4.z?
The above will also solve the SIGABRT not caught by VDSM.

--- Additional comment from Dave Allan on 2013-04-09 13:13:36 EDT ---

(In reply to comment #3)
> Dave,
> 
> Can registerCloseCallback be back-ported to 6.4.z?
> The above will also solve the SIGABRT not caught by VDSM.

No, unfortunately that's a new API call and cannot be backported.

--- Additional comment from Simon Grinberg on 2013-04-09 14:15:03 EDT ---

(In reply to comment #4)
> (In reply to comment #3)
> > Dave,
> > 
> > Can registerCloseCallback be back-ported to 6.4.z?
> > The above will also solve the SIGABRT not caught by VDSM.
> 
> No, unfortunately that's a new API call and cannot be backported.

Because of compatibility? Because the changes are just too big? Or just because we do not do features in a Z release?

--- Additional comment from Yaniv Bronhaim on 2013-04-11 04:04:28 EDT ---

*** Bug 948216 has been marked as a duplicate of this bug. ***

--- Additional comment from Yaniv Bronhaim on 2013-04-18 04:36:42 EDT ---

When libvirt raises a connectivity failure, vdsm starts self-fencing. When vdsm restarts, it also restarts the libvirt service. At large scale and under high load on the host, connecting to libvirt takes more than a minute.

Contrary to the bug description, the issue is that the host doesn't respond until the connection to libvirt is back.

This leads to host fencing (Bug 924801). To avoid that, we move the initial connection to libvirt into an external thread. This way vdsm is able to respond to API calls and report its status.
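
The idea can be sketched as follows. This is a simplified Python 3 illustration, not the actual vdsm patch: LibvirtConnector, its methods and the status strings are hypothetical, and connect stands for whatever callable opens the libvirt connection.

```python
import threading
import time


class LibvirtConnector(object):
    """Open the libvirt connection in a background thread.

    The caller (for example an API server) can keep answering requests and
    report "initializing" through status() until the connection is ready.
    """

    def __init__(self, connect, retry_delay=1.0):
        self._connect = connect          # callable that returns a connection
        self._retry_delay = retry_delay  # seconds to wait between attempts
        self._ready = threading.Event()
        self._conn = None
        self._thread = threading.Thread(target=self._run)
        self._thread.daemon = True

    def start(self):
        self._thread.start()

    def _run(self):
        while True:
            try:
                conn = self._connect()
            except Exception:
                # Broad catch for the sketch only: any failure means
                # "libvirt is not ready yet", so wait and retry.
                time.sleep(self._retry_delay)
                continue
            self._conn = conn
            self._ready.set()
            return

    def status(self):
        return "up" if self._ready.is_set() else "initializing"

    def wait(self, timeout=None):
        """Block until connected; returns False if the timeout expires."""
        return self._ready.wait(timeout)

    def connection(self):
        if not self._ready.is_set():
            raise RuntimeError("libvirt connection is not ready yet")
        return self._conn
```

A status call served while the thread is still connecting returns "initializing" instead of blocking, which is what lets the engine see that the host is alive.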

Comment 2 Elad 2013-04-30 15:01:30 UTC
Verified on RHEVM - 3.1 - SI28.1

vdsm-4.10.2-1.13.el6ev.x86_64
libvirt-0.10.2-18.el6_4.4.x86_64



15:38:14,214:
libvirtError: internal error client socket is closed


15:38:43,316:
logUtils::37::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)

Comment 3 errata-xmlrpc 2013-05-01 18:26:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0774.html

