Bug 1085994 - Internal Error from python-qpid can cause qpid connection to never recover
Summary: Internal Error from python-qpid can cause qpid connection to never recover
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ceilometer
Version: 4.0
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: z6
: 4.0
Assignee: Eoghan Glynn
QA Contact: Yurii Prokulevych
URL:
Whiteboard:
Depends On: 1085998
Blocks: 1086008
TreeView+ depends on / blocked
 
Reported: 2014-04-09 20:31 UTC by Russell Bryant
Modified: 2019-09-09 13:29 UTC (History)
8 users (show)

Fixed In Version: openstack-ceilometer-2013.2.4-1.el6ost
Doc Type: Bug Fix
Doc Text:
Clone Of: 1085006
: 1085998 1086008 (view as bug list)
Environment:
Last Closed: 2015-04-23 13:03:40 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1303890 0 None None None Never
OpenStack gerrit 86371 0 None None None Never
OpenStack gerrit 109993 0 None None None Never
Red Hat Product Errata RHBA-2015:0885 0 normal SHIPPED_LIVE Red Hat Enterprise Linux OpenStack Platform Bug Fix and Enhancement Advisory 2015-04-23 17:03:29 UTC

Description Russell Bryant 2014-04-09 20:31:31 UTC
+++ This bug was initially created as a clone of Bug #1085006 +++

While working with a partner on some problems in their system, I have observed two instances where the qpid client library gets into a bad state and the qpid connection thread in nova never recovers.  An example of the exception is:


 Traceback (most recent call last):
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/excutils.py", line 78, in inner_func
     return infunc(*args, **kwargs)
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 698, in _consumer_thread
     self.consume()
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 689, in consume
     it.next()
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 606, in iterconsume
     yield self.ensure(_error_callback, _consume)
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 540, in ensure
     return method(*args, **kwargs)
   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 597, in _consume
     nxt_receiver = self.session.next_receiver(timeout=timeout)
   File "<string>", line 6, in next_receiver
   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 665, in next_receiver
     if self._ecwait(lambda: self.incoming, timeout):
   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
     result = self._ewait(lambda: self.closed or predicate(), timeout)
   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 571, in _ewait
     result = self.connection._ewait(lambda: self.error or predicate(), timeout)
   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 214, in _ewait
     self.check_error()
   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 207, in check_error
     raise self.error
 InternalError: Traceback (most recent call last):
   File "/usr/lib/python2.6/site-packages/qpid/messaging/driver.py", line 667, in write
     self._op_dec.write(*self._seg_dec.read())
   File "/usr/lib/python2.6/site-packages/qpid/framing.py", line 269, in write
     if self.op.headers is None:
 AttributeError: 'NoneType' object has no attribute 'headers'

There is some code that automatically detects if the thread dies with an exception.  It will sleep for a second a retry.  The code will sit in this loop forever.  Every time it tries to run again it will hit this error immediately.  As a result, you see a message like this every minute or so:

2014-04-06 09:03:49.014 125211 ERROR root [-] Unexpected exception occurred 60 time(s)... retrying.

Part of the issue is that I don't think this should ever happen.  However, if it does, Nova should be more tolerant and reset the connection instead being stuck in this error for forever.

--- Additional comment from Russell Bryant on 2014-04-09 16:21:12 EDT ---

This patch has been merged into both oslo.messaging and the rpc library in oslo-incubator.  In RHOS 4.0, nothing had been converted to oslo.messaging, so this fix needs to be backported to all of the projects that include rpc from oslo-incubator.  I will be cloning this bug to all affected projects.

Comment 2 Eoghan Glynn 2014-11-24 17:37:44 UTC
This patch was already brought in by the rebase onto ceilometer-2013.2.4:

  http://pkgs.devel.redhat.com/cgit/rpms/openstack-ceilometer/commit/?h=rhos-4.0-rhel-6&id=ed933998

Comment 3 Yurii Prokulevych 2015-04-20 08:04:57 UTC
Verified running ceilometer sanity tests.

Comment 6 errata-xmlrpc 2015-04-23 13:03:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0885.html

Comment 7 Andrew Dahms 2017-08-28 23:02:04 UTC
Cancelling old needinfo request.


Note You need to log in before you can comment on or make changes to this bug.