+++ This bug was initially created as a clone of Bug #1085997 +++
+++ This bug was initially created as a clone of Bug #1085006 +++

While working with a partner on some problems in their system, I have observed two instances where the qpid client library gets into a bad state and the qpid connection thread in nova never recovers. An example of the exception is:

Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/nova/openstack/common/excutils.py", line 78, in inner_func
    return infunc(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 698, in _consumer_thread
    self.consume()
  File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 689, in consume
    it.next()
  File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 606, in iterconsume
    yield self.ensure(_error_callback, _consume)
  File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 540, in ensure
    return method(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 597, in _consume
    nxt_receiver = self.session.next_receiver(timeout=timeout)
  File "<string>", line 6, in next_receiver
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 665, in next_receiver
    if self._ecwait(lambda: self.incoming, timeout):
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
    result = self._ewait(lambda: self.closed or predicate(), timeout)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 571, in _ewait
    result = self.connection._ewait(lambda: self.error or predicate(), timeout)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 214, in _ewait
    self.check_error()
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 207, in check_error
    raise self.error
InternalError: Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/qpid/messaging/driver.py", line 667, in write
    self._op_dec.write(*self._seg_dec.read())
  File "/usr/lib/python2.6/site-packages/qpid/framing.py", line 269, in write
    if self.op.headers is None:
AttributeError: 'NoneType' object has no attribute 'headers'

There is some code that automatically detects when the thread dies with an exception. It will sleep for a second and retry. The code will sit in this loop forever: every time it tries to run again, it hits this error immediately. As a result, you see a message like this every minute or so:

2014-04-06 09:03:49.014 125211 ERROR root [-] Unexpected exception occurred 60 time(s)... retrying.

Part of the issue is that I don't think this should ever happen. However, if it does, Nova should be more tolerant and reset the connection instead of being stuck in this error forever.

--- Additional comment from Russell Bryant on 2014-04-09 16:21:12 EDT ---

This patch has been merged into both oslo.messaging and the rpc library in oslo-incubator. In RHOS 4.0, nothing had been converted to oslo.messaging, so this fix needs to be backported to all of the projects that include rpc from oslo-incubator. I will be cloning this bug to all affected projects.
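For illustration only (this is not the actual oslo patch), here is a minimal sketch of the recovery behavior being argued for above: when the consumer thread hits an unexpected error, re-establish the connection instead of retrying immediately against the same poisoned session. The names (connection, consume, reconnect) loosely mirror impl_qpid.py but are hypothetical.

    # Illustrative sketch only -- not the actual oslo-incubator patch.
    # Assumes a hypothetical connection wrapper with consume()/reconnect()
    # methods loosely modeled on nova/openstack/common/rpc/impl_qpid.py.
    import logging
    import time

    LOG = logging.getLogger(__name__)


    def consumer_loop(connection):
        """Keep consuming; on an unexpected error, reset the connection
        rather than retrying forever against a broken session."""
        failures = 0
        while True:
            try:
                connection.consume()   # blocks, raising on fatal errors
                failures = 0
            except Exception:
                failures += 1
                LOG.exception("Unexpected exception occurred %d time(s)... "
                              "retrying.", failures)
                time.sleep(1)
                # Key difference from the broken behavior: re-establish the
                # qpid connection so the next iteration does not hit the
                # same broken session immediately.
                connection.reconnect()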
Cloned to 5.0 as Neutron uses rpc from oslo-incubator in RHOS 5.0.
We have the patch merged into the downstream (d/s) package as of the initial Icehouse builds. Putting the latest build into Fixed In Version.
Ofer, yes, as far as I know, we're going to support Qpid, though RabbitMQ will be the default and recommended option. As for verification steps: the failure showed up under high load, and we don't know how we got into that situation, so all we can do is make sure the neutron regression tests pass. We did the same at https://bugzilla.redhat.com/show_bug.cgi?id=1085995#c3 when doing verification.
> Scale will not work, so why to support it ?

Sorry, I didn't get this part. What's not to be supported, specifically?
As far as I know, there is a PoC running in the lab that runs multiple neutron instances on a single machine and load-balances them through a local haproxy. This could solve the scale issues.
It looks like the issue is very difficult to reproduce and will require significant effort.
There is still a possibility to trigger this with a focused reproducer outside of RHOS, which seems to be a smaller effort and, in my view, worth trying. QA automation testing proposal (outside of RHOS); see the sketch below:
* a single-core VM
* multiple clients running simultaneously
* all with low heartbeats
* all looping over a longer period
* using multiple messaging patterns (receivers on queues as well as on exchanges/topics)
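A rough sketch of what such a reproducer could look like, assuming python-qpid's qpid.messaging API and a broker on localhost:5672; the addresses, heartbeat value, client count, and duration are illustrative choices, not part of the original proposal.

    # Rough reproducer sketch (assumptions: python-qpid installed, broker on
    # localhost:5672, addresses below are illustrative).
    import threading
    import time

    from qpid.messaging import Connection, Message


    def client(address, duration=3600):
        # Low heartbeat to force frequent heartbeat traffic under load.
        conn = Connection("localhost:5672", heartbeat=1)
        conn.open()
        try:
            ssn = conn.session()
            snd = ssn.sender(address)
            rcv = ssn.receiver(address)
            end = time.time() + duration
            while time.time() < end:
                snd.send(Message("ping"))
                rcv.fetch(timeout=10)
                ssn.acknowledge()
        finally:
            conn.close()


    if __name__ == "__main__":
        # Mix queue and exchange/topic receivers, many clients at once.
        addresses = ["repro-queue; {create: always}",
                     "amq.topic/repro.key"] * 10
        threads = [threading.Thread(target=client, args=(a,))
                   for a in addresses]
        for t in threads:
            t.start()
        for t in threads:
            t.join()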
It looks like this defect is most probably a duplicate of bug 1088004 (which was already QAed). The key is to verify that the bug 1088004 backtrace is identical and to analyze the testing scenarios. Getting Ken's or Gordon's opinion on whether it is a duplicate would also help.
I've checked the backtrace and the underlying code and I confirm that it is the same scenario. But I have to disagree with marking it a duplicate. Bug 1088004 is the underlying bug filed against qpidd; this bug, on the other hand, is filed against openstack-neutron (with clones for other components), and all of those introduced their own patches, so we have to check whether the openstack patches are valid now that the underlying issue has been fixed.
(In reply to Zdenek Kraus from comment #23)
> I've checked the backtrace and the underlying code and I confirm that it is
> the same scenario. But I have to disagree with marking it a duplicate. Bug
> 1088004 is the underlying bug filed against qpidd; this bug, on the other
> hand, is filed against openstack-neutron (with clones for other components),
> and all of those introduced their own patches, so we have to check whether
> the openstack patches are valid now that the underlying issue has been fixed.

Thanks, that sounds like the plan. I believe we can skip the comment 17 proposal, as the QA work was already done on the qpid side (and is tracked as bug 1088004).
The patches were reviewed by me and gsim, and the broadened exception catching is correct. Since the underlying problem was fixed and verified (see Bug 1088004) -> VERIFIED
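For context, a hedged sketch of what "broadening the exception catching" means in an impl_qpid.py-style ensure loop: catching qpid's broader MessagingError (the common base of ConnectionError and InternalError) instead of only ConnectionError, so an InternalError like the one in the traceback above also triggers a reconnect. The structure and the reconnect() helper here are illustrative, not the literal downstream patch.

    # Illustrative only -- shows the shape of the fix, not the literal patch.
    # Assumes a connection wrapper with a reconnect() method, similar in
    # spirit to nova/openstack/common/rpc/impl_qpid.py.
    import logging

    from qpid.messaging import exceptions as qpid_exceptions

    LOG = logging.getLogger(__name__)


    def ensure(connection, error_callback, method):
        """Run method(), re-establishing the connection on qpid errors.

        Catching qpid_exceptions.MessagingError instead of only
        ConnectionError is the "broadened" catch: InternalError (seen in
        the traceback above) is also a MessagingError, so it now resets
        the connection instead of looping forever.
        """
        while True:
            try:
                return method()
            except (qpid_exceptions.Empty,
                    qpid_exceptions.MessagingError) as e:
                if error_callback:
                    error_callback(e)
                LOG.info("Re-establishing the qpid connection after: %s", e)
                connection.reconnect()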
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-0848.html