Bug 857256 - nova-compute regularly disconnects from qpidd
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 1.0 (Essex)
Hardware/OS: Unspecified
Priority: unspecified
Severity: high
Assigned To: Alan Pevec
QA Contact: Nir Magnezi
Reported: 2012-09-13 19:19 EDT by Russell Bryant
Modified: 2013-10-31 13:03 EDT
CC: 4 users

Fixed In Version: openstack-nova-2012.1.3-1.el6
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-10-31 13:03:22 EDT

Attachments: None
Description Russell Bryant 2012-09-13 19:19:02 EDT
This is a problem observed in a lab deployment of nova.

qpidd was regularly disconnecting the nova-compute service due to a
timeout (as seen in the log).  A review of a packet capture with
wireshark showed that the cause was nova-compute failing to send a
heartbeat message on time.  By default, we set the heartbeat interval to
5 seconds.
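
For illustration only (this is a sketch, not qpidd's actual code), the broker-side behavior amounts to a deadline test: if no heartbeat traffic arrives within roughly the negotiated interval, the connection is dropped. The function name and the grace factor of 2 below are assumptions:

```python
import time

def heartbeat_expired(last_traffic, interval=5.0, grace=2.0, now=None):
    # Hypothetical sketch of the broker-side check: the connection is
    # considered dead when no heartbeat has arrived within the
    # negotiated interval (a ~2x grace factor is typical; the exact
    # factor here is an assumption, not qpidd's implementation).
    now = time.monotonic() if now is None else now
    return now - last_traffic > grace * interval
```

With a 5-second interval, a periodic task that blocks the process for ~15 seconds easily blows past any reasonable grace window.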

The timeouts were happening at a regular interval.  When compared with
the nova-compute log, it was happening when nova-compute ran its
periodic tasks (which is done 60 seconds after the last run of periodic
tasks completes by default).
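
The scheduling described above is fixed-delay, not fixed-rate. A minimal sketch of that loop (illustrative, not nova's actual periodic-task code):

```python
import time

def run_periodic(task, interval=60.0, iterations=None):
    """Fixed-delay loop: the next run starts `interval` seconds after
    the previous run *completes*, so a 15-second task produces an
    effective period of 75 seconds, not 60."""
    n = 0
    while iterations is None or n < iterations:
        task()
        n += 1
        time.sleep(interval)  # sleep starts only after the task returns
```

The important property for this bug is not the period itself but that the task body runs uninterrupted each cycle.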

Digging further, the update_available_resource() periodic task was
taking roughly 15 seconds.  In the libvirt driver we have this code
gathering a bunch of stats:

        # Updating host information
        dic = {'vcpus': self.get_vcpu_total(),
               'memory_mb': self.get_memory_mb_total(),
               'local_gb': self.get_local_gb_total(),
               'vcpus_used': self.get_vcpu_used(),
               'memory_mb_used': self.get_memory_mb_used(),
               'local_gb_used': self.get_local_gb_used(),
               'hypervisor_type': self.get_hypervisor_type(),
               'hypervisor_version': self.get_hypervisor_version(),
               'cpu_info': self.get_cpu_info(),
               'service_id': service_ref['id'],
               'disk_available_least': self.get_disk_available_least()}

self.get_vcpu_total() was taking about 5 seconds and
self.get_disk_available_least() was taking about 10 seconds.  (You can
take a look at the driver to see what libvirt calls are being used in
each of these.)

When we block in native code (such as libvirt), other greenthreads
cannot be scheduled.  This is because eventlet can only switch
greenthreads at cooperative yield points, which it provides by hooking
into various parts of the Python stdlib.

It appears that there was enough blocking done in this periodic task
that the greenthread handling the qpid heartbeat didn't get to run in
time and caused qpidd to drop the connection.
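
The starvation can be demonstrated with a toy cooperative scheduler, using plain generators to stand in for greenthreads (a sketch of the mechanism, not eventlet itself):

```python
import time

def heartbeat(log):
    # Stands in for the greenthread sending qpid heartbeats: always
    # ready to run, and records the time whenever it actually runs.
    while True:
        log.append(time.monotonic())
        yield

def blocking_periodic_task(work_seconds):
    # Stands in for update_available_resource(): the sleep models a
    # native call (libvirt) that holds the interpreter without yielding.
    time.sleep(work_seconds)  # no yield until the whole call finishes
    yield

def run_round_robin(tasks, duration):
    # Toy cooperative scheduler: like eventlet, it can only switch
    # tasks at a yield point, never mid-way through a blocking call.
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        for t in list(tasks):
            try:
                next(t)
            except StopIteration:
                tasks.remove(t)

log = []
run_round_robin([heartbeat(log), blocking_periodic_task(0.2)], 0.25)
gaps = [b - a for a, b in zip(log, log[1:])]
# The largest gap between heartbeats approaches the blocking time,
# even though the heartbeat task was ready to run the whole while.
```

Scale 0.2 seconds up to the 15-second periodic task and a 5-second heartbeat interval, and the disconnect follows.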

There is a bit of support for making libvirt calls non-blocking, but it
appears to only apply when connecting to libvirt, not when making other
calls.

The fix should include a couple of changes:

1) Increase the default for qpid_heartbeat to a more reasonable value.


This one needs to go into openstack-common, then nova master, and then backported to stable/essex.
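
On a deployed node the option can also be pinned explicitly. A hedged example (60 seconds matches the new default shipped with the fix; the [DEFAULT] section placement is an assumption for Essex/Folsom-era nova.conf):

```ini
# nova.conf
[DEFAULT]
# Seconds between connection keepalive heartbeats
qpid_heartbeat = 60
```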

2) Yield to other threads during time-consuming periodic tasks.


This one is proposed for nova stable/essex.
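
In the same toy-scheduler spirit, the second change amounts to adding explicit yield points inside the long-running work (nova does this with greenthread.sleep(0) / greenthread.sleep(1); the scheduler below is an illustrative sketch, not the real patch):

```python
import time

def heartbeat(log):
    while True:
        log.append(time.monotonic())
        yield

def chunked_periodic_task(work_seconds, chunk=0.02):
    # The fix in spirit: slice the long operation and yield between
    # slices -- the role greenthread.sleep(0) plays in the real patch --
    # so the heartbeat greenthread gets scheduled in between.
    remaining = work_seconds
    while remaining > 0:
        time.sleep(min(chunk, remaining))
        remaining -= chunk
        yield  # cooperative yield point

def run_round_robin(tasks, duration):
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        for t in list(tasks):
            try:
                next(t)
            except StopIteration:
                tasks.remove(t)

log = []
run_round_robin([heartbeat(log), chunked_periodic_task(0.2)], 0.25)
gaps = [b - a for a, b in zip(log, log[1:])]
# The longest heartbeat gap is now on the order of one chunk,
# not the full 0.2 seconds of work.
```

The total work still takes the same wall-clock time; it is simply interleaved so the heartbeat never misses its deadline.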
Comment 2 Kashyap Chamarthy 2013-03-13 03:39:11 EDT
VERIFIED on Folsom.


1/ Ensure the default value for qpid_heartbeat is 60
# grep -i -A2 qpid_heartbeat /usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py
    cfg.IntOpt('qpid_heartbeat',
               default=60,
               help='Seconds between connection keepalive heartbeats'),
--
        self.connection.heartbeat = self.conf.qpid_heartbeat
        self.connection.protocol = self.conf.qpid_protocol
        self.connection.tcp_nodelay = self.conf.qpid_tcp_nodelay

2/ Ensure the change to yield to other threads during time-consuming periodic tasks is present:

# grep eventlet.sleep /usr/lib/python2.6/site-packages/nova/manager.py
# grep greenthread /usr/lib/python2.6/site-packages/nova/virt/libvirt/*.py
/usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py:from eventlet import greenthread
/usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py:            greenthread.sleep(1)
/usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py:            greenthread.sleep(0)
/usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py:            time_module = greenthread
/usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py:        greenthread.spawn(self._live_migration, ctxt, instance_ref, dest,
/usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py:                    greenthread.sleep(1)
/usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py:            greenthread.sleep(0)
/usr/lib/python2.6/site-packages/nova/virt/libvirt/firewall.py:        # execute in a native thread and block current greenthread until done

Sanity: All nova services running just fine, with the above fix, on Folsom:

# for j in `for i in $(ls -1 /etc/init.d/openstack-nova-*) ; do $i status | grep running ; done | awk '{print $1}'` ; do service $j status ; done
openstack-nova-cert (pid  29558) is running...
openstack-nova-compute (pid  29580) is running...
openstack-nova-console (pid  29621) is running...
openstack-nova-consoleauth (pid  29656) is running...
openstack-nova-network (pid  29698) is running...
openstack-nova-novncproxy (pid  29714) is running...
