Bug 857256

Summary: nova-compute regularly disconnects from qpidd
Product: Red Hat OpenStack
Component: openstack-nova
Version: 1.0 (Essex)
Reporter: Russell Bryant <rbryant>
Assignee: Alan Pevec <apevec>
QA Contact: Nir Magnezi <nmagnezi>
CC: jkt, kchamart, pbrady, sclewis
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-nova-2012.1.3-1.el6
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-10-31 17:03:22 UTC

Description Russell Bryant 2012-09-13 23:19:02 UTC
This is a problem observed in a lab deployment of nova.


qpidd was regularly disconnecting the nova-compute service due to a
timeout (as seen in the log).  A review of a packet capture with
wireshark showed that the cause was nova-compute failing to send a
heartbeat message on time.  By default, we set the heartbeat interval to
5 seconds.
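
For context, the heartbeat is set on the qpid connection itself.  Here's
a minimal sketch of the wiring using the python-qpid messaging API (the
broker address is illustrative):

    import qpid.messaging

    conn = qpid.messaging.Connection("qpidd.example.com:5672")
    conn.heartbeat = 5  # the Essex-era default; qpidd drops the
                        # connection if heartbeats stop arriving on time
    conn.open()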

The timeouts were happening at a regular interval.  Comparing the timing
against the nova-compute log showed that they coincided with
nova-compute running its periodic tasks (which, by default, happens 60
seconds after the previous run of periodic tasks completes).
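
In other words, the next run starts a fixed interval after the previous
run finishes, rather than on a fixed wall-clock schedule.  Roughly (a
simplified sketch, not the actual manager code):

    import eventlet

    def periodic_loop(run_periodic_tasks, interval=60):
        while True:
            run_periodic_tasks()      # may itself block for many seconds
            eventlet.sleep(interval)  # 60s measured from completion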

Digging further, the update_available_resource() periodic task was
taking roughly 15 seconds.  Here's the relevant code:


https://github.com/openstack/nova/blob/stable/essex/nova/compute/manager.py#L2425


https://github.com/openstack/nova/blob/stable/essex/nova/virt/libvirt/connection.py#L1945

In the libvirt driver we have this code gathering a bunch of stats:

        # Updating host information
        dic = {'vcpus': self.get_vcpu_total(),
               'memory_mb': self.get_memory_mb_total(),
               'local_gb': self.get_local_gb_total(),
               'vcpus_used': self.get_vcpu_used(),
               'memory_mb_used': self.get_memory_mb_used(),
               'local_gb_used': self.get_local_gb_used(),
               'hypervisor_type': self.get_hypervisor_type(),
               'hypervisor_version': self.get_hypervisor_version(),
               'cpu_info': self.get_cpu_info(),
               'service_id': service_ref['id'],
               'disk_available_least': self.get_disk_available_least()}

self.get_vcpu_total() was taking about 5 seconds and
self.get_disk_available_least() was taking about 10 seconds.  (You can
take a look at the driver to see what libvirt calls are being used in
each of these.)

When we block in native code (such as a libvirt call), other
greenthreads cannot be scheduled.  This is because eventlet relies on
hooking into various parts of the Python stdlib:

    http://eventlet.net/doc/patching.html

It appears that this periodic task blocked for long enough that the
greenthread handling the qpid heartbeat didn't get to run in time,
causing qpidd to drop the connection.
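
The effect is easy to reproduce outside of nova.  In this standalone
sketch (illustrative, not nova code), a busy loop stands in for a
blocking native call and starves the heartbeat greenthread for its
duration:

    import time
    import eventlet

    def heartbeat():
        while True:
            print("heartbeat at %.1f" % time.time())
            eventlet.sleep(1)         # cooperative yield back to the hub

    def blocking_native_work(seconds):
        end = time.time() + seconds   # busy loop: never yields, just
        while time.time() < end:      # like blocking C library code
            pass

    eventlet.spawn(heartbeat)
    eventlet.sleep(0)                 # let the heartbeat thread start
    blocking_native_work(15)          # ~15s with no heartbeats at all
    eventlet.sleep(3)                 # heartbeats resume afterwards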

There is a bit of support for making libvirt calls non-blocking, but it
appears to only apply when connecting to libvirt, not when making other
calls.
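
The usual way around this in eventlet is to push a blocking call into a
real OS thread with eventlet.tpool, which is roughly what the
connect-time support does.  A hedged sketch of applying the same idea to
an individual libvirt call (the connection URI is illustrative):

    import libvirt
    from eventlet import tpool

    conn = libvirt.open("qemu:///system")

    # tpool.execute() runs the blocking native call in a worker OS
    # thread, so the eventlet hub keeps scheduling other greenthreads
    # (including the one sending qpid heartbeats) in the meantime.
    host_info = tpool.execute(conn.getInfo)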


The fix involves two changes:

1) Raise the default qpid_heartbeat to a more reasonable value:

    https://review.openstack.org/#/c/12981/

This one needs to go into openstack-common, then nova master, and then be backported to stable/essex.

2) Yield to other threads during time-consuming periodic tasks:

    https://review.openstack.org/#/c/12985

This one is proposed for nova stable/essex; a sketch of the idea follows.
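
The second change amounts to an explicit cooperative yield between
tasks.  A simplified sketch (the helper name _get_periodic_tasks() is
hypothetical; the real code differs in detail):

    import eventlet

    class SomeManager(object):
        def periodic_tasks(self, context):
            for task in self._get_periodic_tasks():  # hypothetical helper
                task(context)
                eventlet.sleep(0)  # zero-second sleep = explicit yield,
                                   # letting the heartbeat greenthread run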

Comment 2 Kashyap Chamarthy 2013-03-13 07:39:11 UTC
VERIFIED on Folsom.


Version:
python-nova-2012.2.3-4.el6ost.noarch


1/ Ensure the default value for qpid_heartbeat is 60
=============
# grep -i -A2 qpid_heartbeat /usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py
    cfg.IntOpt('qpid_heartbeat',
               default=60,
               help='Seconds between connection keepalive heartbeats'),
--
        self.connection.heartbeat = self.conf.qpid_heartbeat
        self.connection.protocol = self.conf.qpid_protocol
        self.connection.tcp_nodelay = self.conf.qpid_tcp_nodelay
=============


2/ Verify the change that yields to other threads during time-consuming periodic tasks:

=============
# grep eventlet.sleep /usr/lib/python2.6/site-packages/nova/manager.py
                eventlet.sleep(0)
=============
# grep greenthread /usr/lib/python2.6/site-packages/nova/virt/libvirt/*.py
/usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py:from eventlet import greenthread
/usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py:            greenthread.sleep(1)
/usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py:            greenthread.sleep(0)
/usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py:            time_module = greenthread
/usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py:        greenthread.spawn(self._live_migration, ctxt, instance_ref, dest,
/usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py:                    greenthread.sleep(1)
/usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py:            greenthread.sleep(0)
/usr/lib/python2.6/site-packages/nova/virt/libvirt/firewall.py:        # execute in a native thread and block current greenthread until done
=============

Sanity check: all nova services are running fine with the above fix on Folsom:

=============
# for j in `for i in $(ls -1 /etc/init.d/openstack-nova-*) ; do $i status | grep running ; done | awk '{print $1}'` ; do service $j status ; done
openstack-nova-cert (pid  29558) is running...
openstack-nova-compute (pid  29580) is running...
openstack-nova-console (pid  29621) is running...
openstack-nova-consoleauth (pid  29656) is running...
openstack-nova-network (pid  29698) is running...
openstack-nova-novncproxy (pid  29714) is running...
=============