Bug 1086001 - Internal Error from python-qpid can cause qpid connection to never recover
Summary: Internal Error from python-qpid can cause qpid connection to never recover
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-heat
Version: 5.0 (RHEL 7)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 5.0 (RHEL 7)
Assignee: Jeff Peeler
QA Contact: Ami Jeain
URL:
Whiteboard:
Depends On: 1085996 1086004 1086010
Blocks: 1085995 1086009
 
Reported: 2014-04-09 20:46 UTC by Russell Bryant
Modified: 2019-09-09 13:52 UTC
CC: 9 users

Fixed In Version: openstack-heat-2014.1.1-2.2.el7ost
Doc Type: Bug Fix
Doc Text:
Clone Of: 1085996
Environment:
Last Closed: 2014-07-24 17:24:24 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1303890 0 None None None Never
OpenStack gerrit 85750 0 None None None Never
OpenStack gerrit 86368 0 None None None Never
OpenStack gerrit 86370 0 None None None Never
OpenStack gerrit 86371 0 None None None Never
Red Hat Product Errata RHBA-2014:0935 0 normal SHIPPED_LIVE openstack-heat bug-fix advisory 2014-07-24 21:22:09 UTC

Description Russell Bryant 2014-04-09 20:46:40 UTC
+++ This bug was initially created as a clone of Bug #1085996 +++

+++ This bug was initially created as a clone of Bug #1085006 +++

While working with a partner on some problems in their system, I have observed two instances where the qpid client library gets into a bad state and the qpid connection thread in nova never recovers.  An example of the exception is:


 Traceback (most recent call last):
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/excutils.py", line 78, in inner_func
     return infunc(*args, **kwargs)
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 698, in _consumer_thread
     self.consume()
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 689, in consume
     it.next()
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 606, in iterconsume
     yield self.ensure(_error_callback, _consume)
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 540, in ensure
     return method(*args, **kwargs)
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 597, in _consume
     nxt_receiver = self.session.next_receiver(timeout=timeout)
   File "<string>", line 6, in next_receiver
   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 665, in next_receiver
     if self._ecwait(lambda: self.incoming, timeout):
   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
     result = self._ewait(lambda: self.closed or predicate(), timeout)
   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 571, in _ewait
     result = self.connection._ewait(lambda: self.error or predicate(), timeout)
   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 214, in _ewait
     self.check_error()
   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 207, in check_error
     raise self.error
 InternalError: Traceback (most recent call last):
   File "/usr/lib/python2.6/site-packages/qpid/messaging/driver.py", line 667, in write
     self._op_dec.write(*self._seg_dec.read())
   File "/usr/lib/python2.6/site-packages/qpid/framing.py", line 269, in write
     if self.op.headers is None:
 AttributeError: 'NoneType' object has no attribute 'headers'

There is some code that automatically detects if the thread dies with an exception. It will sleep for a second and retry. The code will sit in this loop forever, because every time it tries to run again it hits this error immediately. As a result, you see a message like this every minute or so:

2014-04-06 09:03:49.014 125211 ERROR root [-] Unexpected exception occurred 60 time(s)... retrying.

Part of the issue is that I don't think this should ever happen.  However, if it does, Nova should be more tolerant and reset the connection instead of being stuck in this error state forever.
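For illustration only, here is a minimal Python sketch of the retry loop described above and of the kind of connection reset being proposed. The names used here (consumer_loop, consume, connection.reset) are placeholders for this sketch, not the actual oslo-incubator, oslo.messaging, or python-qpid API.

# Sketch only: placeholder names, not the real oslo-incubator code.
import logging
import time

LOG = logging.getLogger(__name__)

def consumer_loop(connection, consume, reset_on_failure=True):
    """Run consume() forever, retrying after unexpected exceptions.

    With reset_on_failure=False this mirrors the broken behaviour: the
    poisoned qpid session is reused, every retry raises the same
    InternalError immediately, and the loop never recovers.
    """
    failures = 0
    while True:
        try:
            consume(connection)
            failures = 0  # a successful pass clears the failure counter
        except Exception:
            failures += 1
            LOG.exception('Unexpected exception occurred %d time(s)... '
                          'retrying.', failures)
            if reset_on_failure:
                # The proposed behaviour: discard the connection state that
                # raised the error instead of reusing it on the next pass.
                connection.reset()
            time.sleep(1)

The actual fix (see the Gerrit links above) applies this idea inside the oslo qpid RPC code rather than in a standalone wrapper like this.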

--- Additional comment from Russell Bryant on 2014-04-09 16:21:12 EDT ---

This patch has been merged into both oslo.messaging and the rpc library in oslo-incubator.  In RHOS 4.0, nothing had been converted to oslo.messaging, so this fix needs to be backported to all of the projects that include rpc from oslo-incubator.  I will be cloning this bug to all affected projects.

Comment 1 Russell Bryant 2014-04-09 20:47:11 UTC
Cloned to 5.0 as Heat uses rpc from oslo-incubator in RHOS 5.0.

Comment 2 Jeff Peeler 2014-07-17 17:59:10 UTC
Also fixed here: openstack-heat-2014.1.1-2.2.el6ost

Comment 5 Ami Jeain 2014-07-21 09:25:15 UTC
Verified as part of Heat regression testing. Fully exercising this failure requires heavy load, which is outside the scope of our testing.

Comment 7 errata-xmlrpc 2014-07-24 17:24:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0935.html

