A critical issue has been found with libvirt relating to keepalive functionality.
If a customer does multiple migrations - for example, when putting a host into maintenance mode during an upgrade (e.g. 6.2->6.3) - this issue is encountered: the migration fails and we also seem to lose the connection between vdsm and libvirt.
Two relevant BZs:
Request: change /etc/libvirt/libvirtd.conf (where the keepalive settings live) to set keepalive_interval = -1
Since keepalive is a new feature, I suggest we disable it by default in 6.3 and move to enable it in 6.4.
Proposing as a blocker - but a zero-day fix would be acceptable.
Dallan - can this be disabled in the config file, since RHEV already modifies libvirt's config, or does this need to be disabled in the code?
A config file edit to keepalive_interval=-1 will correctly disable keepalive, if you'd like to go that route for the bare minimum patch necessary for quick turnaround.
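For reference, a minimal sketch of that edit (assuming the setting goes in the stock /etc/libvirt/libvirtd.conf and that libvirtd is restarted afterwards so the change takes effect):

    # /etc/libvirt/libvirtd.conf
    # a non-positive interval disables libvirt's keepalive protocol
    keepalive_interval = -1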
(In reply to comment #3)
> A config file edit to keepalive_interval=-1 will correctly disable
> keepalive, if you'd like to go that route for the bare minimum patch
> necessary for quick turnaround.
Eric, but if the config file was already modified, would the RPM update still fix this?
Which RPM are you proposing to modify? Libvirt to ship the config file with keepalive_interval already off, or VDSM, since installing VDSM already causes modifications to libvirt's config and this would be one more modification?
Created attachment 591838
Proposed patch to disable keepalive
Scratch build is available at http://brewweb.devel.redhat.com/brew/taskinfo?taskID=4513493
In POST: http://post-office.corp.redhat.com/archives/rhvirt-patches/2012-June/msg00178.html
Updated libvirt to libvirt-0.9.10-21.el6_3.no.ka.x86_64 (as in the brewweb link above) on all the nodes in the cluster.
The mentioned problem still persists, i.e. I can't migrate many (tried 18) VMs at once.
In the gui I get the following:
Migration failed due to Error: Fatal error during migration (VM : <vmname>, Source Host: <host name>)
Furthermore, it now seems like I can't migrate even a single VM.
vdsm log shows:
Thread-2281::DEBUG::2012-06-14 18:43:11,253::libvirtvm::307::vm.Vm::(run) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::migration downtime thread started
Thread-2282::DEBUG::2012-06-14 18:43:11,254::libvirtvm::335::vm.Vm::(run) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::starting migration monitor thread
Thread-2280::DEBUG::2012-06-14 18:43:11,382::libvirtvm::322::vm.Vm::(cancel) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::canceling migration downtime thread
Thread-2280::DEBUG::2012-06-14 18:43:11,383::libvirtvm::372::vm.Vm::(stop) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::stopping migration monitor thread
Thread-2281::DEBUG::2012-06-14 18:43:11,383::libvirtvm::319::vm.Vm::(run) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::migration downtime thread exiting
Thread-2280::ERROR::2012-06-14 18:43:11,383::vm::177::vm.Vm::(_recover) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::invalid argument: negative or zero interval make no sense
Thread-2161::ERROR::2012-06-14 18:39:54,425::vm::232::vm.Vm::(run) vmId=`f90f69d2-7212-4cff-90ce-85554ee378c7`::Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 224, in run
  File "/usr/share/vdsm/libvirtvm.py", line 423, in _startUnderlyingMigration
    libvirt.VIR_MIGRATE_PEER2PEER, None, maxBandwidth)
  File "/usr/share/vdsm/libvirtvm.py", line 445, in f
    ret = attr(*args, **kwargs)
  File "/usr/share/vdsm/libvirtconnection.py", line 63, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1039, in migrateToURI
    if ret == -1: raise libvirtError ('virDomainMigrateToURI() failed', dom=self)
libvirtError: invalid argument: negative or zero interval make no sense
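That error string comes from virConnectSetKeepAlive(), which in this libvirt version rejects any non-positive interval, so the keepalive_interval = -1 setting apparently gets passed along to a connection opened for the peer-to-peer migration. A minimal standalone sketch (not vdsm code) that reproduces the same libvirtError from Python 2:

    import libvirt

    conn = libvirt.open(None)
    try:
        # libvirt 0.9.10 rejects any interval <= 0 with this exact error
        conn.setKeepAlive(-1, 5)
    except libvirt.libvirtError as e:
        print e   # invalid argument: negative or zero interval make no sense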
re: 6.4 - I think we should look at rolling this back and setting the default to enabled, if we can get enough testing done.
For RHEL 6.4, there is a patch series that should fix this and allow enabling keepalive, by virtue of rebasing to newer libvirt. 13 patches, ending in:
commit 5d490603a6d60298162cbd32ec45f736b58929fb
Author: Daniel P. Berrange <firstname.lastname@example.org>
Date: Wed Jun 13 10:54:02 2012 +0100
client rpc: Fix error checking after poll()
First 'poll' can't return EWOULDBLOCK, and second, we're checking errno
so far away from the poll() call that we've probably already trashed the
original errno value.
Only 6.3.z needs the disable hack (that is, the thread mentioned in comment 8 should NOT be applied to 6.4)
(In reply to comment #14)
> For RHEL 6.4, there is a patch series that should fix this to allow enabling
> keepalive, by virtue of rebasing to newer libvirt. 13 patches, ending in:
> commit 5d490603a6d60298162cbd32ec45f736b58929fb
> Author: Daniel P. Berrange <email@example.com>
> Date: Wed Jun 13 10:54:02 2012 +0100
> client rpc: Fix error checking after poll()
> First 'poll' can't return EWOULDBLOCK, and second, we're checking errno
> so far away from the poll() call that we've probably already trashed the
> original errno value.
> Only 6.3.z needs the disable hack (that is, the thread mentioned in comment
> 8 should NOT be applied to 6.4)
I'm trying to verify this bug with libvirt-0.9.13-2.el6, but I'm not sure how to verify it. There is also a blocking bug 837485 where vdsmd cannot start with this version of libvirt.
So could you kindly give some suggestions?
We disabled keepalive in 6.3 because the issues were found late in the release process, but for 6.4 we are actually fixing them...
Vdsm is not required to verify this bug. However, this bug report covers several fixes in libvirt's keepalive code. I'll try to cover them all and provide steps for testing them.
See comment 18.
NEEDINFO from dev.
Issues fixed with the keepalive patches mentioned in comment 14:
- libvirt connection would be treated as dead when lots of stream calls are
  being sent through it
  - this is covered by bug 807907
- libvirt daemon would close a connection to a client when the client calls a
  long-running API from its event loop
  - to test this, use the following python script (the import, the event loop
    registration, and the dom.suspend() call are reconstructed here;
    dom.suspend() is the long-running API referenced below):

        import libvirt

        libvirt.virEventRegisterDefaultImpl()   # keepalive needs an event loop
        conn = libvirt.open(None)
        dom = conn.lookupByName("DOMAIN")
        dom.suspend()   # blocks while DOMAIN's qemu-kvm is stopped
  - testing steps:
    1. start DOMAIN
    2. kill -STOP the qemu-kvm process associated with DOMAIN
    3. run the script above
  - with unfixed libvirt, libvirtd would close the connection after a while,
    because the client cannot answer keepalive probes while it is blocked in
    the long-running dom.suspend() call
- libvirt client would not detect a dead connection when it calls a
  long-running API from its event loop
  - in the script above, add the following line after "conn = libvirt.open"
    (the 5-second/3-probe values are an assumption consistent with the 15
    seconds mentioned below):

        conn.setKeepAlive(5, 3)

    this tells the script to consider the connection dead after 15 seconds of
    no response from the server; a consolidated sketch of the modified script
    follows these steps
  - testing steps:
    1. start DOMAIN
    2. run the modified script
    3. kill -STOP the libvirtd process
  - with unfixed libvirt, the script would not detect that the connection died
    and would keep waiting for the result of dom.suspend()
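For convenience, the modified script for this second test, pieced together from the fragments above (the setKeepAlive values are assumed, as noted):

    import libvirt

    libvirt.virEventRegisterDefaultImpl()   # event loop required for keepalive
    conn = libvirt.open(None)
    conn.setKeepAlive(5, 3)   # probe every 5s; dead after 3 unanswered probes
    dom = conn.lookupByName("DOMAIN")
    dom.suspend()   # with fixed libvirt, this call should fail once keepalive
                    # declares the stopped libvirtd connection dead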
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.