Bug 1085996 - Internal Error from python-qpid can cause qpid connection to never recover
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-heat
Version: 4.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z5
Target Release: 4.0
Assignee: Jeff Peeler
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On: 1086004
Blocks: 1085995 1086001 1086009 1086010
 
Reported: 2014-04-09 20:34 UTC by Russell Bryant
Modified: 2022-07-09 07:08 UTC
CC List: 9 users

Fixed In Version: openstack-heat-2013.2.3-2.el6ost
Doc Type: Bug Fix
Doc Text:
Prior to this update, certain Qpid exceptions were not properly handled by the Qpid driver. As a result, the Qpid connection would fail and stop processing subsequent messages. With this update, all possible exceptions are handled to ensure the Qpid driver does not enter an unrecoverable failure loop. Consequently, Orchestration (heat) will continue to process Qpid messages, even after major exceptions occur.
Clone Of: 1085006
Clones: 1086001 1086010
Environment:
Last Closed: 2014-10-22 17:52:41 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1303890 0 None None None Never
OpenStack gerrit 85750 0 None None None Never
OpenStack gerrit 86368 0 None None None Never
OpenStack gerrit 86370 0 None None None Never
OpenStack gerrit 86371 0 None None None Never
Red Hat Issue Tracker OSP-16522 0 None None None 2022-07-09 07:08:21 UTC
Red Hat Product Errata RHSA-2014:1687 0 normal SHIPPED_LIVE Moderate: openstack-heat security, bug fix, and enhancement update 2014-10-22 21:10:51 UTC

Description Russell Bryant 2014-04-09 20:34:21 UTC
+++ This bug was initially created as a clone of Bug #1085006 +++

While working with a partner on some problems in their system, I have observed two instances where the qpid client library gets into a bad state and the qpid connection thread in nova never recovers.  An example of the exception is:


 Traceback (most recent call last):
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/excutils.py", line 78, in inner_func
     return infunc(*args, **kwargs)
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 698, in _consumer_thread
     self.consume()
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 689, in consume
     it.next()
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 606, in iterconsume
     yield self.ensure(_error_callback, _consume)
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 540, in ensure
     return method(*args, **kwargs)
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 597, in _consume
     nxt_receiver = self.session.next_receiver(timeout=timeout)
   File "<string>", line 6, in next_receiver
   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 665, in next_receiver
     if self._ecwait(lambda: self.incoming, timeout):
   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
     result = self._ewait(lambda: self.closed or predicate(), timeout)
   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 571, in _ewait
     result = self.connection._ewait(lambda: self.error or predicate(), timeout)
   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 214, in _ewait
     self.check_error()
   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 207, in check_error
     raise self.error
 InternalError: Traceback (most recent call last):
   File "/usr/lib/python2.6/site-packages/qpid/messaging/driver.py", line 667, in write
     self._op_dec.write(*self._seg_dec.read())
   File "/usr/lib/python2.6/site-packages/qpid/framing.py", line 269, in write
     if self.op.headers is None:
 AttributeError: 'NoneType' object has no attribute 'headers'

There is some code that automatically detects if the thread dies with an exception.  It will sleep for a second and retry.  The code will sit in this loop forever.  Every time it tries to run again it will hit this error immediately.  As a result, you see a message like this every minute or so:

2014-04-06 09:03:49.014 125211 ERROR root [-] Unexpected exception occurred 60 time(s)... retrying.

Part of the issue is that I don't think this should ever happen.  However, if it does, Nova should be more tolerant and reset the connection instead of being stuck in this error forever.
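
The eventual fix takes roughly this shape: instead of only retrying the consume call against the same broken session, any exception raised while consuming causes the connection itself to be rebuilt. The sketch below is illustrative only; the Connection, consume_one, and reconnect names are hypothetical helpers, not the actual oslo-incubator impl_qpid code.

# Minimal sketch of the reconnect-on-error pattern described above.
# The class and method names here are hypothetical, not the real driver code.
import logging
import time

LOG = logging.getLogger(__name__)


class Connection(object):
    """Hypothetical wrapper around a qpid.messaging connection."""

    def reconnect(self):
        # Tear down the old qpid.messaging.Connection and open a fresh one,
        # re-declaring sessions and receivers (details omitted in this sketch).
        pass

    def consume_one(self):
        # Wait for and dispatch the next message; in the broken case this is
        # where errors such as InternalError surface from the qpid session.
        pass

    def consumer_loop(self):
        while True:
            try:
                self.consume_one()
            except Exception:
                # Catch any exception, not only connection errors: log it,
                # back off briefly, and rebuild the connection so a single
                # InternalError cannot wedge the consumer thread forever.
                LOG.exception("Unexpected error while consuming; "
                              "re-establishing the Qpid connection")
                time.sleep(1)
                self.reconnect()

The key difference from the behavior described above is the reconnect() call in the exception handler: the existing retry loop kept reusing the session whose error state was already set, so every retry re-raised the same InternalError.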

--- Additional comment from Russell Bryant on 2014-04-09 16:21:12 EDT ---

This patch has been merged into both oslo.messaging and the rpc library in oslo-incubator.  In RHOS 4.0, nothing had been converted to oslo.messaging, so this fix needs to be backported to all of the projects that include rpc from oslo-incubator.  I will be cloning this bug to all affected projects.

Comment 5 Amit Ugol 2014-10-11 18:05:56 UTC
The fix is in there; tested on 2013.2.4-1.el6ost.

Comment 7 errata-xmlrpc 2014-10-22 17:52:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2014-1687.html

