Bug 949192 - [vdsm] [scale] After libvirt failure vdsm restarts and starts responding to XML-RPC after a big delay
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.2.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.2.0
Assignee: Yaniv Bronhaim
QA Contact: Elad
URL:
Whiteboard: infra
Duplicates: 948216
Depends On: 948216
Blocks: 953645
 
Reported: 2013-04-06 23:42 UTC by vvyazmin@redhat.com
Modified: 2022-07-09 05:59 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, when a connectivity failure was raised by libvirt, VDSM began self-fencing. When VDSM restarted, it restarted the libvirt service. In large environments with high host loads, establishing the connection between VDSM and libvirt took quite a long time. Because the host doesn't respond until the connection to libvirt is back, this meant that host fencing would begin before the connection between VDSM and libvirt was established. An upgrade allows VDSM to respond to API calls and report its status, which prevents the condition that previously caused premature and unwanted host fencing.
Clone Of:
Clones: 953645
Environment:
Last Closed: 2013-06-10 20:48:06 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments
## Logs vdsm, rhevm, libvirt (54.24 MB, application/x-gzip)
2013-04-06 23:42 UTC, vvyazmin@redhat.com


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-47083 0 None None None 2022-07-09 05:59:21 UTC
Red Hat Product Errata RHSA-2013:0886 0 normal SHIPPED_LIVE Moderate: rhev 3.2 - vdsm security and bug fix update 2013-06-11 00:25:02 UTC
oVirt gerrit 14018 0 None None None Never

Description vvyazmin@redhat.com 2013-04-06 23:42:23 UTC
Created attachment 732223
## Logs vdsm, rhevm, libvirt

Description of problem:
The VDSM service doesn't restart when the libvirt service fails.

Version-Release number of selected component (if applicable):
RHEVM 3.2 - SF11 environment:

RHEVM: rhevm-3.2.0-10.14.beta1.el6ev.noarch     
VDSM: vdsm-4.10.2-11.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.2.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64

Relies on BZ948216

How reproducible:
unknown

Steps to Reproduce:
1. Create a DC with 3 hosts connected to one SD. Initially the hosts are in the following state:
Host-A – Maintenance 
Host-B – Maintenance 
Host-C – Active (SPM)
2. Second step
Host-A – Maintenance 
Host-B – Active (HSM) 
Host-C – Active (SPM)
3. Step 3
Host-A – Maintenance 
Host-B – Active (SPM) 
Host-C – Maintenance
4. Step 4
Host-A – Active (HSM) 
Host-B – Active (SPM) 
Host-C – Maintenance
5. Step 5
Host-A – Active (SPM) 
Host-B – Maintenance
Host-C – Maintenance
6. Step 6, activate Host-B & Host-C
Host-A – Active (SPM) 
Host-B – Active (HSM) 
Host-C – Unassigned (failed to activate)

Actual results:
Host activation fails.

Expected results:
Host activation succeeds.

Additional info:

Impact on user:
The host fails to activate.
/var/log/ovirt-engine/engine.log

/var/log/vdsm/vdsm.log

Comment 1 Yaniv Bronhaim 2013-04-09 09:21:45 UTC
1. Currently we restart vdsm only if libvirt throws VIR_ERR_SYSTEM_ERROR. This might have changed in the libvirt implementation: when we try to reproduce the issue now, vdsm receives VIR_ERR_INTERNAL_ERROR from libvirt. I am considering using virConnectRegisterCloseCallback, or also catching INTERNAL_ERROR.

2. When the connection is broken we perform prepareForShutdown, which takes time when a large number of VMs are running on the host (Bug 924801). We are considering a better flow to make it quicker.

The steps to reproduce can be much simpler: just run "kill -s SIGABRT [libvirt pid]" or "rm /var/run/libvirt/libvirt-sock".

In both cases vdsm should perform self-fencing.
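
Point 1 above could be sketched roughly as follows. This is a minimal illustration, not the vdsm patch: FakeLibvirtError stands in for libvirt.libvirtError (which exposes get_error_code()), the numeric constants are stand-ins for the libvirt module's VIR_ERR_* values, and is_connection_lost and call_guarded are hypothetical helper names. Treating every INTERNAL_ERROR as a lost connection is an assumption made only for the sake of the example.

```python
# Stand-in values; real code should use libvirt.VIR_ERR_SYSTEM_ERROR and
# libvirt.VIR_ERR_INTERNAL_ERROR instead of hard-coded numbers.
VIR_ERR_INTERNAL_ERROR = 1
VIR_ERR_SYSTEM_ERROR = 38

# Assumption: both codes are treated as "connection to libvirt is gone".
CONNECTION_LOST_CODES = frozenset([VIR_ERR_SYSTEM_ERROR,
                                   VIR_ERR_INTERNAL_ERROR])


class FakeLibvirtError(Exception):
    """Stand-in for libvirt.libvirtError, so the sketch runs without libvirt."""

    def __init__(self, code):
        Exception.__init__(self, "libvirt error %d" % code)
        self._code = code

    def get_error_code(self):
        return self._code


def is_connection_lost(err):
    """Return True if the error code is one we treat as a broken connection."""
    return err.get_error_code() in CONNECTION_LOST_CODES


def call_guarded(func, on_connection_lost):
    """Call func(); if it fails with a connection-lost error, run the handler.

    The original exception is always re-raised so callers still see it.
    """
    try:
        return func()
    except FakeLibvirtError as err:
        if is_connection_lost(err):
            on_connection_lost()
        raise
```

In such a scheme, on_connection_lost would be the place to trigger the self-fencing path (prepareForShutdown and exit).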

Comment 2 Yaniv Bronhaim 2013-04-09 16:24:54 UTC
libvirt has a registration method for a callback that is signaled when the connection with libvirt is closed (registerCloseCallback - http://www.libvirt.org/html/libvirt-libvirt.html#virConnectRegisterCloseCallback)

We should use this callback to detect connectivity errors. This callback is not available in libvirt < 1.0.1

The libvirt in rhel6.4 doesn't contain it. It should first be backported to rhel6.4.z, and then the vdsm patch that uses it can be merged.
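
A rough sketch of how the callback could be used while still coping with bindings that lack it. watch_connection is a hypothetical helper; it assumes the Python binding exposes registerCloseCallback(cb, opaque) on the connection object and invokes cb(conn, reason, opaque), and it falls back gracefully when the method is missing.

```python
def watch_connection(conn, on_closed):
    """Register on_closed for connection-close events when supported.

    Returns True if the close callback was registered, and False if the
    binding does not offer it, in which case the caller has to keep
    detecting broken connections from error codes.
    """
    register = getattr(conn, "registerCloseCallback", None)
    if register is None:
        return False

    def _closed(conn, reason, opaque):
        # Forward only the close reason; the handler decides what to do.
        on_closed(reason)

    register(_closed, None)
    return True
```

With this shape the same vdsm code can run against both older and newer libvirt builds, and only the detection mechanism differs.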

Comment 3 Barak 2013-04-09 16:59:49 UTC
Dave,

Can registerCloseCallback be back-ported to 6.4.z?
The above will also solve the SIGABRT not caught by VDSM.

Comment 4 Dave Allan 2013-04-09 17:13:36 UTC
(In reply to comment #3)
> Dave,
> 
> Can registerCloseCallback be back-ported to 6.4.z?
> The above will also solve the SIGABRT not caught by VDSM.

No, unfortunately that's a new API call and cannot be backported.

Comment 5 Simon Grinberg 2013-04-09 18:15:03 UTC
(In reply to comment #4)
> (In reply to comment #3)
> > Dave,
> > 
> > Can registerCloseCallback be back-ported to 6.4.z?
> > The above will also solve the SIGABRT not caught by VDSM.
> 
> No, unfortunately that's a new API call and cannot be backported.

Because of compatibility? Because the changes are just too big? Or just because we do not do features in a Z release?

Comment 6 Yaniv Bronhaim 2013-04-11 08:04:28 UTC
*** Bug 948216 has been marked as a duplicate of this bug. ***

Comment 7 Yaniv Bronhaim 2013-04-18 08:36:42 UTC
When a connectivity failure is raised by libvirt, vdsm starts self-fencing. When vdsm restarts, it also restarts the libvirt service. At large scale and under high load on the host, the connection to libvirt takes more than a minute.

Contrary to what the bug description says, the issue is that the host doesn't respond until the connection to libvirt is back.

This leads to host fencing (Bug 924801). To avoid that, we move the first connection to libvirt into a separate thread. This way vdsm will be able to respond to API calls and report its status.
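
The idea can be illustrated with a small sketch. This is not the actual vdsm change: LibvirtLink, its method names, and the "initializing"/"up" status strings are hypothetical, and connect is any callable that returns a connection or raises on failure.

```python
import threading
import time


class LibvirtLink(object):
    """Establish the first libvirt connection in a background thread."""

    def __init__(self, connect, retry_delay=1.0):
        self._connect = connect          # callable returning a connection
        self._retry_delay = retry_delay  # seconds between failed attempts
        self._conn = None
        self._ready = threading.Event()
        self._thread = threading.Thread(target=self._run,
                                        name="libvirt-connect")
        self._thread.daemon = True

    def start(self):
        """Return immediately; the connection is attempted in the background."""
        self._thread.start()

    def _run(self):
        # Keep retrying until the connection succeeds.
        while True:
            try:
                self._conn = self._connect()
            except Exception:
                time.sleep(self._retry_delay)
                continue
            self._ready.set()
            return

    def status(self):
        """Status an API handler can report without blocking on libvirt."""
        return "up" if self._ready.is_set() else "initializing"

    def wait(self, timeout=None):
        """Block until connected; returns False if the timeout expires first."""
        return self._ready.wait(timeout)
```

Because start() does not block, the XML-RPC server can begin answering requests right away and report that the host is still initializing, rather than staying silent until libvirt is reachable.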

Comment 10 Elad 2013-05-01 12:16:43 UTC
Verified on RHEVM - 3.2 - SF14

vdsm-4.10.2-16.0.el6ev.x86_64
libvirt-0.10.2-18.el6_4.4.x86_64
rhevm-3.2.0-10.20.master.el6ev.noarch

vdsm initiates prepareForShutdown immediately after the connection to libvirt breaks.

 

15:11:11,036::libvirtconnection::123::vds::(wrapper) connection to libvirt broken. taking vdsm down.

15:11:12,036::logUtils::40::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)

Comment 12 errata-xmlrpc 2013-06-10 20:48:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0886.html

