Bug 1085997

Summary: Internal Error from python-qpid can cause qpid connection to never recover
Product: Red Hat OpenStack
Component: openstack-neutron
Version: 4.0
Target Release: 4.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: unspecified
Priority: unspecified
Reporter: Russell Bryant <rbryant>
Assignee: RHOS Maint <rhos-maint>
QA Contact: Ofer Blaut <oblaut>
CC: chrisw, ndipanov, nyechiel, yeylon
Keywords: ZStream
Doc Type: Bug Fix
Type: Bug
Clone Of: 1085006
Clones: 1086004 1086011
Last Closed: 2014-04-22 10:40:34 UTC

Description Russell Bryant 2014-04-09 20:35:51 UTC
+++ This bug was initially created as a clone of Bug #1085006 +++

While working with a partner on some problems in their system, I have observed two instances where the qpid client library gets into a bad state and the qpid connection thread in nova never recovers.  An example of the exception is:


 Traceback (most recent call last):
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/excutils.py", line 78, in inner_func
     return infunc(*args, **kwargs)
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 698, in _consumer_thread
     self.consume()
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 689, in consume
     it.next()
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 606, in iterconsume
     yield self.ensure(_error_callback, _consume)
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 540, in ensure
     return method(*args, **kwargs)
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 597, in _consume
     nxt_receiver = self.session.next_receiver(timeout=timeout)
   File "<string>", line 6, in next_receiver
   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 665, in next_receiver
     if self._ecwait(lambda: self.incoming, timeout):
   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
     result = self._ewait(lambda: self.closed or predicate(), timeout)
   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 571, in _ewait
     result = self.connection._ewait(lambda: self.error or predicate(), timeout)
   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 214, in _ewait
     self.check_error()
   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 207, in check_error
     raise self.error
 InternalError: Traceback (most recent call last):
   File "/usr/lib/python2.6/site-packages/qpid/messaging/driver.py", line 667, in write
     self._op_dec.write(*self._seg_dec.read())
   File "/usr/lib/python2.6/site-packages/qpid/framing.py", line 269, in write
     if self.op.headers is None:
 AttributeError: 'NoneType' object has no attribute 'headers'
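
The inner traceback explains why the consumer can never recover: the qpid driver thread stores the AttributeError on the connection object, and check_error() in endpoints.py re-raises that stored error on every subsequent wait.  Below is a minimal, self-contained sketch of that "sticky error" pattern (names simplified for illustration; this is not the actual qpid source):

class Connection(object):
    def __init__(self):
        self.error = None  # set once by the driver thread, never cleared

    def check_error(self):
        # mirrors qpid/messaging/endpoints.py: re-raise the stored error
        if self.error:
            raise self.error

    def _ewait(self, predicate, timeout=None):
        self.check_error()  # raises immediately once an error is recorded
        return predicate()

conn = Connection()
conn.error = AttributeError("'NoneType' object has no attribute 'headers'")

for attempt in range(3):
    try:
        conn._ewait(lambda: False)  # stands in for next_receiver()
    except AttributeError as exc:
        print("attempt %d failed: %s" % (attempt, exc))

Every attempt fails with the same stored exception, so retrying against the same connection object is pointless; only replacing the connection clears the error.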

There is some code that automatically detects if the thread dies with an exception.  It will sleep for a second and retry.  The code will sit in this loop forever, because every time it tries to run again it hits this error immediately.  As a result, you see a message like this every minute or so:

2014-04-06 09:03:49.014 125211 ERROR root [-] Unexpected exception occurred 60 time(s)... retrying.
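
That message comes from the retry wrapper visible at the top of the traceback (inner_func in nova/openstack/common/excutils.py).  Roughly, it behaves like the sketch below (details may differ from the actual oslo-incubator code): it restarts the wrapped function forever, sleeping one second between failures and rate-limiting the log message to about once a minute.

import logging
import time

def forever_retry_uncaught_exceptions(infunc):
    def inner_func(*args, **kwargs):
        exc_count = 0
        last_log_time = 0
        while True:
            try:
                return infunc(*args, **kwargs)
            except Exception:
                exc_count += 1
                now = time.time()
                if now - last_log_time > 60:  # log at most ~once a minute
                    logging.exception(
                        'Unexpected exception occurred %d time(s)... '
                        'retrying.' % exc_count)
                    last_log_time = now
                    exc_count = 0
                time.sleep(1)  # then immediately try again
    return inner_func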

Part of the issue is that I don't think this should ever happen.  However, if it does, Nova should be more tolerant and reset the connection instead of being stuck in this error state forever.
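
One possible shape for that hardening (a hypothetical sketch, not the merged patch): treat any unexpected exception from the consume loop as fatal to the current connection and re-establish it, rather than retrying the same broken session forever.  It assumes the connection object exposes consume() and reconnect(), as the Connection class in impl_qpid.py does.

import logging
import time

LOG = logging.getLogger(__name__)

def consume_forever(connection):
    while True:
        try:
            connection.consume()
        except Exception:
            LOG.exception('Consumer failed; resetting the qpid connection')
            try:
                connection.reconnect()  # replace the broken connection/session
            except Exception:
                LOG.exception('Reconnect failed; retrying in 1 second')
                time.sleep(1)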

--- Additional comment from Russell Bryant on 2014-04-09 16:21:12 EDT ---

This patch has been merged into both oslo.messaging and the rpc library in oslo-incubator.  In RHOS 4.0, nothing had been converted to oslo.messaging, so this fix needs to be backported to all of the projects that include rpc from oslo-incubator.  I will be cloning this bug to all affected projects.