Bug 1085006
| Summary: | Internal Error from python-qpid can cause qpid connection to never recover | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Russell Bryant <rbryant> |
| Component: | openstack-nova | Assignee: | Russell Bryant <rbryant> |
| Status: | CLOSED ERRATA | QA Contact: | Toure Dunnon <tdunnon> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.0 | CC: | ajeain, breeler, cpelland, ebarrera, ggillies, iwienand, ndipanov, sgordon, stoner, tdunnon, vpopovic, yeylon |
| Target Milestone: | async | Keywords: | OtherQA, ZStream |
| Target Release: | 4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-nova-2013.2.3-9.el6ost | Doc Type: | Bug Fix |
| Doc Text: | An internal error in the python-qpid library was not handled gracefully by Compute, leaving Qpid communication broken. This has been fixed so that Compute now handles the failure gracefully and restarts Qpid communication. As a result, Compute services recover after an internal error in the python-qpid library. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| Clones: | 1085994 1085995 1085996 1085997 1086006 (view as bug list) | Environment: | |
| Last Closed: | 2014-08-21 00:40:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1040649, 1086006 | | |
This patch has been merged into both oslo.messaging and the rpc library in oslo-incubator. In RHOS 4.0, nothing had been converted to oslo.messaging, so this fix needs to be backported to all of the projects that include rpc from oslo-incubator. I will be cloning this bug to all affected projects.

*** Bug 1098827 has been marked as a duplicate of this bug. ***

*** Bug 1108959 has been marked as a duplicate of this bug. ***

How do we repro this condition? Is there a way to determine which of the qpid child threads is the connection thread? If so, I could try to kill the connection thread and see if the problem persists.

Actually, as I am writing this, I see Russell posted this as an internal error in the python-qpid library. Is there a version of python-qpid with a fix for this?

(In reply to Sean Toner from comment #6)
> How do we repro this condition? Is there a way to determine which of the
> qpid child threads is the connection thread? If so, I could try to kill the
> connection thread and see if the problem persists.

Reproducing this will be very difficult. I honestly wouldn't bother. Time is better spent elsewhere.

> Actually, as I am writing this, I see Russell posted this as an internal
> error in the python-qpid library. Is there a version of python-qpid with a
> fix for this?

Yes, this was triggered when we hit this bug:

https://issues.apache.org/jira/browse/QPID-5700
https://bugzilla.redhat.com/show_bug.cgi?id=1088004

The fix was in python-qpid-0.18-10.el7.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-1084.html
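For reference, a minimal sketch of the behaviour the merged patch introduces. This is not the actual oslo-incubator/oslo.messaging code; the broker URL, the `nova_topic` address, and the `consume_forever` helper are hypothetical. The point it illustrates: any unexpected exception from the qpid library causes the consumer to close the connection and build a new one before retrying, instead of looping on the same broken connection object.

```python
import logging
import time

from qpid.messaging import Connection

LOG = logging.getLogger(__name__)


def consume_forever(broker="localhost:5672", address="nova_topic",
                    retry_delay=1):
    """Sketch of the fixed behaviour: on any unexpected error, throw the
    whole connection away and rebuild it rather than retrying on the same
    (permanently broken) connection object."""
    conn = None
    while True:
        try:
            if conn is None:
                conn = Connection(broker)
                conn.open()
                session = conn.session()
                receiver = session.receiver(address)
            msg = receiver.fetch()          # blocks until a message arrives
            session.acknowledge(msg)
        except Exception:
            LOG.exception("Unexpected exception occurred, "
                          "resetting the qpid connection")
            try:
                if conn is not None:
                    conn.close()
            except Exception:
                pass
            conn = None                     # forces a full reconnect
            time.sleep(retry_delay)
```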
While working with a partner on some problems in their system, I have observed two instances where the qpid client library gets into a bad state and the qpid connection thread in nova never recovers. An example of the exception is:

    Traceback (most recent call last):
      File "/usr/lib/python2.6/site-packages/nova/openstack/common/excutils.py", line 78, in inner_func
        return infunc(*args, **kwargs)
      File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 698, in _consumer_thread
        self.consume()
      File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 689, in consume
        it.next()
      File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 606, in iterconsume
        yield self.ensure(_error_callback, _consume)
      File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 540, in ensure
        return method(*args, **kwargs)
      File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 597, in _consume
        nxt_receiver = self.session.next_receiver(timeout=timeout)
      File "<string>", line 6, in next_receiver
      File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 665, in next_receiver
        if self._ecwait(lambda: self.incoming, timeout):
      File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
        result = self._ewait(lambda: self.closed or predicate(), timeout)
      File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 571, in _ewait
        result = self.connection._ewait(lambda: self.error or predicate(), timeout)
      File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 214, in _ewait
        self.check_error()
      File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 207, in check_error
        raise self.error
    InternalError: Traceback (most recent call last):
      File "/usr/lib/python2.6/site-packages/qpid/messaging/driver.py", line 667, in write
        self._op_dec.write(*self._seg_dec.read())
      File "/usr/lib/python2.6/site-packages/qpid/framing.py", line 269, in write
        if self.op.headers is None:
    AttributeError: 'NoneType' object has no attribute 'headers'

There is some code that automatically detects if the thread dies with an exception. It will sleep for a second and retry. The code will sit in this loop forever: every time it tries to run again, it hits this error immediately. As a result, you see a message like this every minute or so:

    2014-04-06 09:03:49.014 125211 ERROR root [-] Unexpected exception occurred 60 time(s)... retrying.

Part of the issue is that I don't think this should ever happen. However, if it does, Nova should be more tolerant and reset the connection instead of being stuck in this error forever.
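The reason the retry loop can never make progress is visible at the bottom of the traceback: the qpid connection object stores the driver's error, and `check_error()` re-raises that stored error on every later call, so any retry on the same connection fails immediately. Below is a simplified toy model of that behaviour, not the real `qpid.messaging` or `impl_qpid.py` code; the class and function names are made up for illustration.

```python
import logging
import time

LOG = logging.getLogger(__name__)


class LatchedConnection(object):
    """Toy stand-in for the error latching seen in the traceback.

    Once the driver thread records an error, every subsequent call on the
    same connection re-raises it via check_error().
    """

    def __init__(self):
        self.error = None        # set once by the driver thread, never cleared

    def check_error(self):
        if self.error:
            raise self.error

    def next_receiver(self, timeout=None):
        self.check_error()       # re-raises the latched error forever
        # ... normal receive path elided ...


def broken_consumer_thread(conn, retry_delay=1):
    """Reproduces the observed behaviour: sleep and retry on the same
    connection object, hitting the latched error again every time."""
    failures = 0
    while True:
        try:
            conn.next_receiver(timeout=1)
        except Exception:
            failures += 1
            LOG.error("Unexpected exception occurred %d time(s)... retrying.",
                      failures)
            time.sleep(retry_delay)   # never resets `conn`, so this loops forever
```

Once the error is latched, only replacing the connection object clears the condition, which is why the fix resets the connection rather than simply retrying.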