Bug 832081
| Summary: | Fix keepalive issues in libvirt | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Andrew Cathrow <acathrow> |
| Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> |
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | urgent | Priority: | urgent |
| Version: | 6.3 | Target Milestone: | rc |
| Hardware: | Unspecified | OS: | Unspecified |
| Keywords: | Regression, ZStream | Doc Type: | Bug Fix |
| Fixed In Version: | libvirt-0.9.13-2.el6 | Last Closed: | 2013-02-21 07:17:23 UTC |
| CC: | acathrow, borgan, dallan, dyasny, dyuan, eblake, mitian, mkalinin, mzhan, rbalakri, rwu, syeghiay, whuang, ydu, zhpeng | Type: | Bug |
| Bug Blocks: | 804821, 832184, 836196, 838924 | | |

Doc Text

Cause: The implementation of the keep-alive messages libvirt sends to detect broken connections or dead peers had several bugs that surfaced in various corner cases.

Consequence: Libvirt connections could wrongly be considered broken, so keep-alive messages were disabled in the default RHEL 6.3 configuration.

Fix: The keep-alive implementation was fixed and simplified, which resolved all of the issues seen in the past.

Result: Libvirt connections are now reliable even when keep-alive messages are turned on. This allowed the default configuration to be changed: keep-alive messages are now sent on all libvirt connections unless explicitly disabled in the libvirt configuration.
Description
Andrew Cathrow
2012-06-14 13:19:08 UTC
Proposing blocker - but a zero day would be acceptable.

Dallan - can this be disabled in the config file, since RHEV already modifies libvirt's config, or does this need to be disabled in the code?

A config file edit setting keepalive_interval=-1 will correctly disable keepalive, if you'd like to go that route as the bare minimum patch for quick turnaround.

(In reply to comment #3)
> A config file edit to keepalive_interval=-1 will correctly disable
> keepalive, if you'd like to go that route for the bare minimum patch
> necessary for quick turnaround.

Eric, but if the config file was already modified, will the RPM update fix this?

Which RPM are you proposing to modify? Libvirt, to ship the config file with keepalive_interval already off, or VDSM, since installing VDSM already causes modifications to libvirt's config and this would be one more modification?

Created attachment 591838 [details]
Proposed patch to disable keepalive
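For reference, the config-file route discussed above amounts to a one-line change in the libvirt daemon configuration. A sketch of the relevant knobs is below; the option names match libvirtd.conf, but the commented-out default values shown are an assumption for this libvirt version:

```
# /etc/libvirt/libvirtd.conf

# A negative interval disables keepalive messages entirely,
# which is the minimal workaround discussed in this bug:
keepalive_interval = -1

# With keepalive enabled, a connection is considered broken after
# keepalive_count messages sent every keepalive_interval seconds
# go unanswered (assumed defaults shown):
#keepalive_interval = 5
#keepalive_count = 5
```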
Scratch build is available at http://brewweb.devel.redhat.com/brew/taskinfo?taskID=4513493

Updated libvirt to libvirt-0.9.10-21.el6_3.no.ka.x86_64 (as in the brewweb link above) on all the nodes in the cluster.
The problem mentioned above still persists, i.e. I can't migrate many VMs (tried 18) at once.
In the gui I get the following:
Migration failed due to Error: Fatal error during migration (VM : <vmname>, Source Host: <host name>)
Furthermore, now it seems like I can't migrate even 1 VM.
vdsm log shows:
Thread-2281::DEBUG::2012-06-14 18:43:11,253::libvirtvm::307::vm.Vm::(run) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::migration downtime thread started
Thread-2282::DEBUG::2012-06-14 18:43:11,254::libvirtvm::335::vm.Vm::(run) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::starting migration monitor thread
Thread-2280::DEBUG::2012-06-14 18:43:11,382::libvirtvm::322::vm.Vm::(cancel) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::canceling migration downtime thread
Thread-2280::DEBUG::2012-06-14 18:43:11,383::libvirtvm::372::vm.Vm::(stop) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::stopping migration monitor thread
Thread-2281::DEBUG::2012-06-14 18:43:11,383::libvirtvm::319::vm.Vm::(run) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::migration downtime thread exiting
Thread-2280::ERROR::2012-06-14 18:43:11,383::vm::177::vm.Vm::(_recover) vmId=`68dd2aaf-735c-49cc-82ca-f290cb4baae1`::invalid argument: negative or zero interval make no sense
Thread-2161::ERROR::2012-06-14 18:39:54,425::vm::232::vm.Vm::(run) vmId=`f90f69d2-7212-4cff-90ce-85554ee378c7`::Traceback (most recent call last):
File "/usr/share/vdsm/vm.py", line 224, in run
self._startUnderlyingMigration()
File "/usr/share/vdsm/libvirtvm.py", line 423, in _startUnderlyingMigration
libvirt.VIR_MIGRATE_PEER2PEER, None, maxBandwidth)
File "/usr/share/vdsm/libvirtvm.py", line 445, in f
ret = attr(*args, **kwargs)
File "/usr/share/vdsm/libvirtconnection.py", line 63, in wrapper
ret = f(*args, **kwargs)
File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1039, in migrateToURI
if ret == -1: raise libvirtError ('virDomainMigrateToURI() failed', dom=self)
libvirtError: invalid argument: negative or zero interval make no sense
re: 6.4 - I think we should look at rolling this back and setting the default to enabled, if we can get enough testing done.

For RHEL 6.4, there is a patch series that should fix this and allow enabling keepalive, by virtue of rebasing to a newer libvirt. 13 patches, ending in:

    commit 5d490603a6d60298162cbd32ec45f736b58929fb
    Author: Daniel P. Berrange <berrange>
    Date:   Wed Jun 13 10:54:02 2012 +0100

        client rpc: Fix error checking after poll()

        First 'poll' can't return EWOULDBLOCK, and second, we're checking errno
        so far away from the poll() call that we've probably already trashed the
        original errno value.

Only 6.3.z needs the disable hack (that is, the thread mentioned in comment 8 should NOT be applied to 6.4).
(In reply to comment #14)

Hi, I'm trying to verify this bug with libvirt-0.9.13-2.el6, but I'm not sure how to verify it. And there is a blocking bug 837485 where vdsmd cannot be started with this version of libvirt. So can you kindly give some suggestions?

We disabled keepalive in 6.3 because the issues were found late in the release process, but for 6.4 we are actually fixing them.

Vdsm is not required to verify this bug. However, this bug report covers several fixes in libvirt's keepalive code. I'll try to cover them all and provide steps for testing them.

As per comment 18, NEEDINFO from dev.

Issues fixed by the keepalive patches mentioned in comment 14:

- A libvirt connection would be treated as dead when lots of stream calls are being sent through it. This is covered by bug 807907.

- The libvirt daemon would close the connection to a client when the client calls a long-running API from its event loop. To test this, use the following python script:

      import libvirt

      libvirt.virEventRegisterDefaultImpl()
      conn = libvirt.open(None)
      dom = conn.lookupByName("DOMAIN")
      while True:
          libvirt.virEventRunDefaultImpl()
          dom.suspend()

  Testing steps:
  1. Start DOMAIN.
  2. kill -STOP the qemu-kvm process associated with DOMAIN.
  3. Run the script above.

  With unfixed libvirt, libvirtd would close the connection after the keepalive timeout.

- A libvirt client would not detect a dead connection when it calls a long-running API from its event loop. In the script above, add the following line after "conn = libvirt.open":

      conn.setKeepAlive(5, 3)

  This tells the script to consider the connection dead after 15 seconds of no response from the server.

  Testing steps:
  1. Start DOMAIN.
  2. Run the modified script.
  3. kill -STOP the libvirtd process.

  With unfixed libvirt, the script would not detect that the connection died and would keep waiting for the result of dom.suspend().

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0276.html
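As a closing aside on the verification steps above: the keepalive timeout arithmetic (setKeepAlive(5, 3) meaning the connection is considered dead after 15 seconds of silence) can be modeled in a few lines of pure Python. This is an illustrative sketch of the rule, not libvirt code:

```python
def keepalive_deadline(interval, count):
    """Seconds of silence after which the peer is considered dead.

    Models libvirt's keepalive rule: a keepalive message is sent every
    `interval` seconds, and the connection is declared broken once
    `count` messages in a row go unanswered.  A non-positive interval
    means keepalive is disabled (as with keepalive_interval = -1 in
    the config-file workaround for 6.3).
    """
    if interval <= 0:
        return None  # keepalive disabled
    return interval * count

# setKeepAlive(5, 3) from the test script: dead after 15 seconds of silence
```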