832081 – Fix keepalive issues in libvirt


Summary: Fix keepalive issues in libvirt

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	libvirt
Sub Component:
Version:	6.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Jiri Denemark
QA Contact:	Virtualization Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	804821 832184 836196 838924

Reported:	2012-06-14 13:19 UTC by Andrew Cathrow
Modified:	2014-09-07 22:54 UTC
CC List:	15 users
Fixed In Version:	libvirt-0.9.13-2.el6
Doc Type:	Bug Fix
Doc Text:	Cause: The implementation of the keep-alive messages that libvirt sends to detect broken connections or dead peers had several bugs that surfaced in various corner cases. Consequence: Libvirt connections could be wrongly considered broken, so keep-alive messages were disabled in the default configuration in RHEL 6.3. Fix: The keep-alive implementation was fixed and simplified, which resolved all the issues that had appeared in the past. Result: Libvirt connections are now reliable even when keep-alive messages are turned on. This allowed the default configuration to be changed, and keep-alive messages are now sent through all libvirt connections (unless this is explicitly disabled in the libvirt configuration).
Clone Of:
Environment:
Last Closed:	2013-02-21 07:17:23 UTC
Target Upstream Version:
Embargoed:

Attachments
Proposed patch to disable keepalive (2.21 KB, patch) 2012-06-14 13:58 UTC, Jiri Denemark

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	285813	0	None	None	None	Never
Red Hat Product Errata	RHSA-2013:0276	0	normal	SHIPPED_LIVE	Moderate: libvirt security, bug fix, and enhancement update	2013-02-20 21:18:26 UTC

Description Andrew Cathrow 2012-06-14 13:19:08 UTC

A critical issue has been found with libvirt relating to keepalive functionality.
If a customer performs multiple migrations, for example when putting a host into maintenance mode during an upgrade (e.g. 6.2 -> 6.3), this issue is encountered: the migration fails and we also seem to lose the connection between vdsm and libvirt.

Two relevant BZs:

https://bugzilla.redhat.com/show_bug.cgi?id=816451
https://bugzilla.redhat.com/show_bug.cgi?id=807907


Request: change /etc/libvirt/qemu.conf to set keepalive_interval = -1

Since keepalive is a new feature, I suggest we disable it by default in 6.3 and move to enable it in 6.4.

Comment 1 Andrew Cathrow 2012-06-14 13:19:58 UTC

Proposing blocker - but a zero day would be acceptable

Comment 2 Andrew Cathrow 2012-06-14 13:22:35 UTC

Dallan - can this be disabled in the config file since RHEV already modifies libvirt's config or does this need to be disabled in the code?

Comment 3 Eric Blake 2012-06-14 13:31:10 UTC

A config file edit to keepalive_interval=-1 will correctly disable keepalive, if you'd like to go that route for the bare minimum patch necessary for quick turnaround.
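
As an illustration only (not part of any proposed patch), a small Python helper could check whether a config file carries the setting discussed here. It assumes the plain `key = value` layout with `#` comments, and the helper names `read_setting` and `keepalive_disabled` are hypothetical.

```python
def read_setting(conf_text, key):
    """Return the raw value assigned to `key` in `key = value` text, or None."""
    value = None
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        name, raw = line.split("=", 1)
        if name.strip() == key:
            value = raw.strip()  # assumption: the last assignment wins
    return value


def keepalive_disabled(conf_text):
    """True when keepalive_interval is explicitly set to -1."""
    return read_setting(conf_text, "keepalive_interval") == "-1"


# Example with inline text; reading the real file is left to the caller.
sample = "# keepalive_interval = 5\nkeepalive_interval = -1\n"
print(keepalive_disabled(sample))
```

A commented-out line is ignored, so a file that only contains `#keepalive_interval = -1` is reported as not disabled.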

Comment 4 Andrew Cathrow 2012-06-14 13:32:50 UTC

(In reply to comment #3)
> A config file edit to keepalive_interval=-1 will correctly disable
> keepalive, if you'd like to go that route for the bare minimum patch
> necessary for quick turnaround.

Eric, but if the config file was already modified, would the RPM update fix this?

Comment 5 Eric Blake 2012-06-14 13:41:46 UTC

Which RPM are you proposing to modify? Libvirt to ship the config file with keepalive_interval already off, or VDSM, since installing VDSM already causes modifications to libvirt's config and this would be one more modification?

Comment 6 Jiri Denemark 2012-06-14 13:58:00 UTC

Created attachment 591838
Proposed patch to disable keepalive

Comment 7 Jiri Denemark 2012-06-14 14:35:19 UTC

Scratch build is available at http://brewweb.devel.redhat.com/brew/taskinfo?taskID=4513493

Comment 8 Jiri Denemark 2012-06-14 14:57:06 UTC

In POST: http://post-office.corp.redhat.com/archives/rhvirt-patches/2012-June/msg00178.html

Comment 9 Alexander Chuzhoy 2012-06-14 15:53:59 UTC

Updated the libvirt to libvirt-0.9.10-21.el6_3.no.ka.x86_64 (as in the brewweb link above) on all the nodes in the cluster.
The mentioned problem still persists, i.e. can't migrate many (tried 18) VMs at once.
In the gui I get the following:
Migration failed due to Error: Fatal error during migration (VM : <vmname>, Source Host: <host name>)

Furthermore, now it seems like I can't migrate even 1 VM.



vdsm log shows:
Thread-2281::DEBUG::2012-06-14 18:43:11,253::libvirtvm::307::vm.Vm::(run) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::migration downtime thread started                                                             
Thread-2282::DEBUG::2012-06-14 18:43:11,254::libvirtvm::335::vm.Vm::(run) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::starting migration monitor thread                                                             
Thread-2280::DEBUG::2012-06-14 18:43:11,382::libvirtvm::322::vm.Vm::(cancel) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::canceling migration downtime thread                                                        
Thread-2280::DEBUG::2012-06-14 18:43:11,383::libvirtvm::372::vm.Vm::(stop) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::stopping migration monitor thread                                                            
Thread-2281::DEBUG::2012-06-14 18:43:11,383::libvirtvm::319::vm.Vm::(run) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::migration downtime thread exiting                                                             

Thread-2280::ERROR::2012-06-14 18:43:11,383::vm::177::vm.Vm::(_recover) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::invalid argument: negative or zero interval make no sense                                       

Thread-2161::ERROR::2012-06-14 18:39:54,425::vm::232::vm.Vm::(run) vmId=`f90f69d2-7212-4cff-90ce-85554ee378c7`::Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 224, in run
    self._startUnderlyingMigration()
  File "/usr/share/vdsm/libvirtvm.py", line 423, in _startUnderlyingMigration
    libvirt.VIR_MIGRATE_PEER2PEER, None, maxBandwidth)
  File "/usr/share/vdsm/libvirtvm.py", line 445, in f
    ret = attr(*args, **kwargs)
  File "/usr/share/vdsm/libvirtconnection.py", line 63, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1039, in migrateToURI
    if ret == -1: raise libvirtError ('virDomainMigrateToURI() failed', dom=self)
libvirtError: invalid argument: negative or zero interval make no sense
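
For triage, the ERROR records can be pulled out of a vdsm log like the one above with a short script. This is a sketch that assumes the `::`-separated layout visible in the excerpt (thread, level, timestamp, module, line, logger, context, message); the names `VdsmRecord`, `parse_vdsm_line` and `error_messages` are hypothetical.

```python
from collections import namedtuple

# Field names are inferred from the log excerpt, not from vdsm itself.
VdsmRecord = namedtuple(
    "VdsmRecord",
    "thread level timestamp module line logger context message",
)


def parse_vdsm_line(line):
    """Split one log line into a VdsmRecord; return None for continuation lines."""
    parts = line.rstrip().split("::", 7)
    if len(parts) != 8:
        return None  # e.g. traceback lines, which carry no '::' fields
    return VdsmRecord(*(part.strip() for part in parts))


def error_messages(lines):
    """Return the message text of every ERROR record in `lines`."""
    records = (parse_vdsm_line(line) for line in lines)
    return [r.message for r in records if r is not None and r.level == "ERROR"]
```

Traceback continuation lines are skipped, so only the first line of each ERROR record is reported.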

Comment 12 Andrew Cathrow 2012-06-14 17:12:30 UTC

Re 6.4: I think we should look at rolling this back and setting the default to enabled, if we can get enough testing done.

Comment 14 Eric Blake 2012-06-14 20:57:58 UTC

For RHEL 6.4, there is a patch series that should fix this to allow enabling keepalive, by virtue of rebasing to newer libvirt.  13 patches, ending in:

commit 5d490603a6d60298162cbd32ec45f736b58929fb
Author: Daniel P. Berrange <berrange>
Date:   Wed Jun 13 10:54:02 2012 +0100

    client rpc: Fix error checking after poll()
    
    First 'poll' can't return EWOULDBLOCK, and second, we're checking errno
    so far away from the poll() call that we've probably already trashed the
    original errno value.

Only 6.3.z needs the disable hack (that is, the thread mentioned in comment 8 should NOT be applied to 6.4)

Comment 16 yanbing du 2012-07-06 07:07:18 UTC

(In reply to comment #14)

Hi,
    I'm trying to verify this bug with libvirt-0.9.13-2.el6, but I'm not sure how to verify it. There is also a blocking bug, 837485: vdsmd cannot be started with this version of libvirt.
    Could you kindly give some suggestions?

Comment 17 Jiri Denemark 2012-08-06 14:40:19 UTC

We disabled keepalive in 6.3 because the issues were found late in the release process, but for 6.4 we are actually fixing them...

Comment 18 Jiri Denemark 2012-08-06 14:45:37 UTC

Vdsm is not required to verify this bug. However, this bug report covers several fixes in libvirt's keepalive code. I'll try to cover them all and provide steps for testing them.

Comment 19 zhpeng 2012-11-27 05:39:00 UTC

As per comment 18, NEEDINFO from dev.

Comment 21 Jiri Denemark 2012-12-20 15:45:44 UTC

Issues fixed with the keepalive patches mentioned in comment 14:

- libvirt connection would be treated as dead when lots of stream calls are
  being sent through it
  - this is covered by bug 807907

- libvirt daemon would close a connection to a client when the client calls a
  long-running API from its event loop
  - to test this, use the following python script:

      import libvirt

      libvirt.virEventRegisterDefaultImpl()
      conn = libvirt.open(None)
      dom = conn.lookupByName("DOMAIN")

      while True:
          libvirt.virEventRunDefaultImpl()
          dom.suspend()

  - testing steps:
    1. start DOMAIN
    2. kill -STOP the qemu-kvm process associated with DOMAIN
    3. run the script above
  - with unfixed libvirt, the libvirtd would close the connection after
    keepalive timeout

- libvirt client would not detect dead connection when it calls a long-running
  API from its event loop
  - in the script above, add the following line after "conn = libvirt.open":

      conn.setKeepAlive(5, 3)

    this tells the script to consider the connection dead after 15 seconds of
    no response from the server

  - testing steps:
    1. start DOMAIN
    2. run the modified script
    3. kill -STOP libvirtd process
  - with unfixed libvirt, the script should not detect the connection died and
    should keep waiting for the result of dom.suspend()
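
The two script variants above can be folded into one parameterised helper. This is a sketch, not the script used for verification: the function name `suspend_loop` and the `iterations` bound are hypothetical additions, and the libvirt module is passed in as an argument so the file can be loaded on a machine without the bindings.

```python
def suspend_loop(libvirt, domain_name, keepalive=None, iterations=None):
    """Call dom.suspend() from the event loop, as in the test script above.

    libvirt     -- the libvirt Python module (or a stand-in with the same calls)
    keepalive   -- optional (interval, count) tuple passed to conn.setKeepAlive()
    iterations  -- hypothetical bound for dry runs; None loops until an error
    """
    libvirt.virEventRegisterDefaultImpl()
    conn = libvirt.open(None)
    if keepalive is not None:
        # Placed right after libvirt.open(), as the instructions above say.
        conn.setKeepAlive(*keepalive)
    dom = conn.lookupByName(domain_name)

    done = 0
    while iterations is None or done < iterations:
        libvirt.virEventRunDefaultImpl()
        dom.suspend()
        done += 1
    return done


# Intended use on a host with the bindings installed:
#   import libvirt
#   suspend_loop(libvirt, "DOMAIN")           # daemon-side scenario
#   suspend_loop(libvirt, "DOMAIN", (5, 3))   # client-side scenario
```

Leaving `keepalive` unset reproduces the first script; passing `(5, 3)` reproduces the modified one.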

Comment 24 errata-xmlrpc 2013-02-21 07:17:23 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0276.html
