Bug 832081
| Summary: | Fix keepalive issues in libvirt | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Andrew Cathrow <acathrow> |
| Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> |
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | urgent | Priority: | urgent |
| Version: | 6.3 | Target Milestone: | rc |
| Hardware: | Unspecified | OS: | Unspecified |
| Keywords: | Regression, ZStream | Doc Type: | Bug Fix |
| Fixed In Version: | libvirt-0.9.13-2.el6 | Last Closed: | 2013-02-21 07:17:23 UTC |
| CC: | acathrow, borgan, dallan, dyasny, dyuan, eblake, mitian, mkalinin, mzhan, rbalakri, rwu, syeghiay, whuang, ydu, zhpeng | Type: | Bug |
| Bug Blocks: | 804821, 832184, 836196, 838924 | | |

Doc Text

Cause: The implementation of the keep-alive messages libvirt sends to detect broken connections or dead peers had several bugs that surfaced in various corner cases.

Consequence: Libvirt connections could wrongly be considered broken, so keep-alive messages were disabled in the default RHEL 6.3 configuration.

Fix: The keep-alive implementation was fixed and simplified, which resolved all of the issues seen in the past.

Result: Libvirt connections are now reliable even when keep-alive messages are turned on. This allowed the default configuration to be changed: keep-alive messages are now sent on all libvirt connections unless explicitly disabled in the libvirt configuration.
Description
Andrew Cathrow
2012-06-14 13:19:08 UTC
Proposing blocker - but a zero day would be acceptable.

Dallan - can this be disabled in the config file, since RHEV already modifies libvirt's config, or does this need to be disabled in the code?

A config file edit setting keepalive_interval=-1 will correctly disable keepalive, if you'd like to go that route as the bare minimum patch for quick turnaround.

(In reply to comment #3)
> A config file edit to keepalive_interval=-1 will correctly disable
> keepalive, if you'd like to go that route for the bare minimum patch
> necessary for quick turnaround.

Eric, but if the config file was already modified, will the RPM update fix this?

Which RPM are you proposing to modify? Libvirt, to ship the config file with keepalive_interval already off, or VDSM, since installing VDSM already causes modifications to libvirt's config and this would be one more modification?

Created attachment 591838 [details]
Proposed patch to disable keepalive
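For reference, the config-file route discussed above amounts to a one-line change in the libvirt daemon configuration. A sketch of the relevant knobs is below; the option names match libvirtd.conf, but the commented-out default values shown are an assumption for this libvirt version:

```
# /etc/libvirt/libvirtd.conf

# A negative interval disables keepalive messages entirely,
# which is the minimal workaround discussed in this bug:
keepalive_interval = -1

# With keepalive enabled, a connection is considered broken after
# keepalive_count messages sent every keepalive_interval seconds
# go unanswered (assumed defaults shown):
#keepalive_interval = 5
#keepalive_count = 5
```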
Scratch build is available at http://brewweb.devel.redhat.com/brew/taskinfo?taskID=4513493

Updated libvirt to libvirt-0.9.10-21.el6_3.no.ka.x86_64 (as in the brewweb link above) on all the nodes in the cluster.
The problem mentioned above still persists, i.e. I can't migrate many VMs (tried 18) at once.
In the gui I get the following:
Migration failed due to Error: Fatal error during migration (VM : <vmname>, Source Host: <host name>)
Furthermore, now it seems like I can't migrate even 1 VM.
vdsm log shows:
Thread-2281::DEBUG::2012-06-14 18:43:11,253::libvirtvm::307::vm.Vm::(run) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::migration downtime thread started
Thread-2282::DEBUG::2012-06-14 18:43:11,254::libvirtvm::335::vm.Vm::(run) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::starting migration monitor thread
Thread-2280::DEBUG::2012-06-14 18:43:11,382::libvirtvm::322::vm.Vm::(cancel) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::canceling migration downtime thread
Thread-2280::DEBUG::2012-06-14 18:43:11,383::libvirtvm::372::vm.Vm::(stop) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::stopping migration monitor thread
Thread-2281::DEBUG::2012-06-14 18:43:11,383::libvirtvm::319::vm.Vm::(run) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::migration downtime thread exiting
Thread-2280::ERROR::2012-06-14 18:43:11,383::vm::177::vm.Vm::(_recover) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::invalid argument: negative or zero interval make no sense
Thread-2161::ERROR::2012-06-14 18:39:54,425::vm::232::vm.Vm::(run) vmId=`f90f69d2-7212-4cff-90ce-85554ee378c7`::Traceback (most recent call last):
File "/usr/share/vdsm/vm.py", line 224, in run
self._startUnderlyingMigration()
File "/usr/share/vdsm/libvirtvm.py", line 423, in _startUnderlyingMigration
libvirt.VIR_MIGRATE_PEER2PEER, None, maxBandwidth)
File "/usr/share/vdsm/libvirtvm.py", line 445, in f
ret = attr(*args, **kwargs)
File "/usr/share/vdsm/libvirtconnection.py", line 63, in wrapper
ret = f(*args, **kwargs)
File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1039, in migrateToURI
if ret == -1: raise libvirtError ('virDomainMigrateToURI() failed', dom=self)
libvirtError: invalid argument: negative or zero interval make no sense
re: 6.4 - I think we should look at rolling this back and setting the default to enabled, if we can get enough testing done.

For RHEL 6.4, there is a patch series that should fix this and allow enabling keepalive, by virtue of rebasing to a newer libvirt. 13 patches, ending in:

    commit 5d490603a6d60298162cbd32ec45f736b58929fb
    Author: Daniel P. Berrange <berrange>
    Date:   Wed Jun 13 10:54:02 2012 +0100

        client rpc: Fix error checking after poll()

        First 'poll' can't return EWOULDBLOCK, and second, we're checking errno
        so far away from the poll() call that we've probably already trashed the
        original errno value.

Only 6.3.z needs the disable hack (that is, the thread mentioned in comment 8 should NOT be applied to 6.4).
(In reply to comment #14)

Hi, I'm trying to verify this bug with libvirt-0.9.13-2.el6, but I'm not sure how to verify it. And there is a blocking bug 837485 where vdsmd cannot be started with this version of libvirt. So can you kindly give some suggestions?

We disabled keepalive in 6.3 because the issues were found late in the release process, but for 6.4 we are actually fixing them.

Vdsm is not required to verify this bug. However, this bug report covers several fixes in libvirt's keepalive code. I'll try to cover them all and provide steps for testing them.

As per comment 18, NEEDINFO from dev.

Issues fixed by the keepalive patches mentioned in comment 14:

- A libvirt connection would be treated as dead when lots of stream calls are being sent through it. This is covered by bug 807907.

- The libvirt daemon would close the connection to a client when the client calls a long-running API from its event loop. To test this, use the following python script:

      import libvirt

      libvirt.virEventRegisterDefaultImpl()
      conn = libvirt.open(None)
      dom = conn.lookupByName("DOMAIN")
      while True:
          libvirt.virEventRunDefaultImpl()
          dom.suspend()

  Testing steps:
  1. Start DOMAIN.
  2. kill -STOP the qemu-kvm process associated with DOMAIN.
  3. Run the script above.

  With unfixed libvirt, libvirtd would close the connection after the keepalive timeout.

- A libvirt client would not detect a dead connection when it calls a long-running API from its event loop. In the script above, add the following line after "conn = libvirt.open":

      conn.setKeepAlive(5, 3)

  This tells the script to consider the connection dead after 15 seconds of no response from the server.

  Testing steps:
  1. Start DOMAIN.
  2. Run the modified script.
  3. kill -STOP the libvirtd process.

  With unfixed libvirt, the script would not detect that the connection died and would keep waiting for the result of dom.suspend().

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0276.html
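As a closing aside on the verification steps above: the keepalive timeout arithmetic (setKeepAlive(5, 3) meaning the connection is considered dead after 15 seconds of silence) can be modeled in a few lines of pure Python. This is an illustrative sketch of the rule, not libvirt code:

```python
def keepalive_deadline(interval, count):
    """Seconds of silence after which the peer is considered dead.

    Models libvirt's keepalive rule: a keepalive message is sent every
    `interval` seconds, and the connection is declared broken once
    `count` messages in a row go unanswered.  A non-positive interval
    means keepalive is disabled (as with keepalive_interval = -1 in
    the config-file workaround for 6.3).
    """
    if interval <= 0:
        return None  # keepalive disabled
    return interval * count

# setKeepAlive(5, 3) from the test script: dead after 15 seconds of silence
```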