This is a problem observed in a lab deployment of nova. qpidd was regularly disconnecting the nova-compute service due to a timeout (as seen in the log). A review of a packet capture with wireshark showed that the cause was nova-compute failing to send a heartbeat message on time. By default, we set the heartbeat interval to 5 seconds. The timeouts were happening at a regular interval. When compared with the nova-compute log, it was happening when nova-compute ran its periodic tasks (which is done 60 seconds after the last run of periodic tasks completes by default). Digging further, the update_available_resource() periodic task was taking roughly 15 seconds. Here's the relevant code: https://github.com/openstack/nova/blob/stable/essex/nova/compute/manager.py#L2425 https://github.com/openstack/nova/blob/stable/essex/nova/virt/libvirt/connection.py#L1945 In the libvirt driver we have this code gathering a bunch of stats: # Updating host information dic = {'vcpus': self.get_vcpu_total(), 'memory_mb': self.get_memory_mb_total(), 'local_gb': self.get_local_gb_total(), 'vcpus_used': self.get_vcpu_used(), 'memory_mb_used': self.get_memory_mb_used(), 'local_gb_used': self.get_local_gb_used(), 'hypervisor_type': self.get_hypervisor_type(), 'hypervisor_version': self.get_hypervisor_version(), 'cpu_info': self.get_cpu_info(), 'service_id': service_ref['id'], 'disk_available_least': self.get_disk_available_least()} self.get_vcpu_total() was taking about 5 seconds and self.get_disk_available_least() was taking about 10 seconds. (You can take a look at the driver to see what libvirt calls are being used in each of these.) When we block in native code (such as libvirt), other greenthreads can not be scheduled. This is because eventlet relies on hooking into various parts of the Python stdlib: http://eventlet.net/doc/patching.html It appears that there was enough blocking done in this periodic task that the greenthread handling the qpid heartbeat didn't get to run in time and caused qpidd to drop the connection. There is a bit of support for making libvirt calls non-blocking, but it appears to only apply when connecting to libvirt, not when making other calls. The fix should be to include a couple of changes: 1) Include the default for qpid_heartbeat to a more reasonable value: https://review.openstack.org/#/c/12981/ This one needs to go into openstack-common, then nova master, and then backported to stable/essex. 2) Yield to other threads during time consuming periodic tasks: https://review.openstack.org/#/c/12985 This one is proposed for nova stable/essex.
VERIFIED on Folsom. Version: python-nova-2012.2.3-4.el6ost.noarch 1/ Ensure the default value for qpid_heartbeat is 60 ============= # grep -i -A2 qpid_heartbeat /usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py cfg.IntOpt('qpid_heartbeat', default=60, help='Seconds between connection keepalive heartbeats'), -- self.connection.heartbeat = self.conf.qpid_heartbeat self.connection.protocol = self.conf.qpid_protocol self.connection.tcp_nodelay = self.conf.qpid_tcp_nodelay ============= 2/ Yield to other threads during time consuming periodic tasks change: ============= # grep eventlet.sleep /usr/lib/python2.6/site-packages/nova/manager.py eventlet.sleep(0) ============= # grep greenthread /usr/lib/python2.6/site-packages/nova/virt/libvirt/*.py /usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py:from eventlet import greenthread /usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py: greenthread.sleep(1) /usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py: greenthread.sleep(0) /usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py: time_module = greenthread /usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py: greenthread.spawn(self._live_migration, ctxt, instance_ref, dest, /usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py: greenthread.sleep(1) /usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py: greenthread.sleep(0) /usr/lib/python2.6/site-packages/nova/virt/libvirt/firewall.py: # execute in a native thread and block current greenthread until done ============= Sanity: All nova services running just fine, with the above fix, on Folsom: ============= # for j in `for i in $(ls -1 /etc/init.d/openstack-nova-*) ; do $i status | grep running ; done | awk '{print $1}'` ; do service $j status ; done openstack-nova-cert (pid 29558) is running... openstack-nova-compute (pid 29580) is running... openstack-nova-console (pid 29621) is running... openstack-nova-consoleauth (pid 29656) is running... openstack-nova-network (pid 29698) is running... openstack-nova-novncproxy (pid 29714) is running... =============