+++ This bug was initially created as a clone of Bug #1085997 +++
+++ This bug was initially created as a clone of Bug #1085006 +++

While working with a partner on some problems in their system, I have observed two instances where the qpid client library gets into a bad state and the qpid connection thread in nova never recovers. An example of the exception is:

Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/nova/openstack/common/excutils.py", line 78, in inner_func
    return infunc(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 698, in _consumer_thread
    self.consume()
  File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 689, in consume
    it.next()
  File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 606, in iterconsume
    yield self.ensure(_error_callback, _consume)
  File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 540, in ensure
    return method(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 597, in _consume
    nxt_receiver = self.session.next_receiver(timeout=timeout)
  File "<string>", line 6, in next_receiver
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 665, in next_receiver
    if self._ecwait(lambda: self.incoming, timeout):
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
    result = self._ewait(lambda: self.closed or predicate(), timeout)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 571, in _ewait
    result = self.connection._ewait(lambda: self.error or predicate(), timeout)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 214, in _ewait
    self.check_error()
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 207, in check_error
    raise self.error
InternalError: Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/qpid/messaging/driver.py", line 667, in write
    self._op_dec.write(*self._seg_dec.read())
  File "/usr/lib/python2.6/site-packages/qpid/framing.py", line 269, in write
    if self.op.headers is None:
AttributeError: 'NoneType' object has no attribute 'headers'

There is some code that automatically detects when the thread dies with an exception. It will sleep for a second and retry. The code will sit in this loop forever: every time it tries to run again, it hits this error immediately. As a result, you see a message like this every minute or so:

2014-04-06 09:03:49.014 125211 ERROR root [-] Unexpected exception occurred 60 time(s)... retrying.

Part of the issue is that I don't think this should ever happen. However, if it does, Nova should be more tolerant and reset the connection instead of being stuck in this error forever.

--- Additional comment from Russell Bryant on 2014-04-09 16:21:12 EDT ---

This patch has been merged into both oslo.messaging and the rpc library in oslo-incubator. In RHOS 4.0, nothing had been converted to oslo.messaging, so this fix needs to be backported to all of the projects that include rpc from oslo-incubator. I will be cloning this bug to all affected projects.
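For illustration only (this is not the actual oslo patch), here is a minimal sketch of the recovery behavior being argued for above: when the consumer thread hits an unexpected error, re-establish the connection instead of retrying immediately against the same poisoned session. The names (connection, consume, reconnect) loosely mirror impl_qpid.py but are hypothetical.

    # Illustrative sketch only -- not the actual oslo-incubator patch.
    # Assumes a hypothetical connection wrapper with consume()/reconnect()
    # methods loosely modeled on nova/openstack/common/rpc/impl_qpid.py.
    import logging
    import time

    LOG = logging.getLogger(__name__)


    def consumer_loop(connection):
        """Keep consuming; on an unexpected error, reset the connection
        rather than retrying forever against a broken session."""
        failures = 0
        while True:
            try:
                connection.consume()   # blocks, raising on fatal errors
                failures = 0
            except Exception:
                failures += 1
                LOG.exception("Unexpected exception occurred %d time(s)... "
                              "retrying.", failures)
                time.sleep(1)
                # Key difference from the broken behavior: re-establish the
                # qpid connection so the next iteration does not hit the
                # same broken session immediately.
                connection.reconnect()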
Cloned to 5.0 as Neutron uses rpc from oslo-incubator in RHOS 5.0.
We have the patch merged into the downstream (d/s) package as of the initial Icehouse builds. Putting the latest build into Fixed In Version.
Ofer, yes, as far as I know, we're going to support Qpid, though RabbitMQ will be the default and recommended option. As for verification steps: the failure showed up under high load, and we don't know how we got into that situation, so all we can do is make sure the neutron regression tests pass. We did the same at https://bugzilla.redhat.com/show_bug.cgi?id=1085995#c3 when doing verification.
> Scale will not work, so why to support it ?

Sorry, I didn't get this part. What's not to be supported, specifically?
As far as I know, there is a PoC running in the lab that runs multiple neutron instances on a single machine and load-balances them through a local haproxy. This could solve the scale issues.
It looks like the issue is very difficult to reproduce and will require significant effort.
There is still a possibility to trigger this with a focused reproducer outside of RHOS, which seems to be a smaller effort and, in my view, worth trying. QA automation testing proposal (outside of RHOS); see the sketch below:
* a single-core VM
* multiple clients running simultaneously
* all with low heartbeats
* all looping over a longer period
* using multiple messaging patterns (receivers on queues as well as on exchanges/topics)
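A rough sketch of what such a reproducer could look like, assuming python-qpid's qpid.messaging API and a broker on localhost:5672; the addresses, heartbeat value, client count, and duration are illustrative choices, not part of the original proposal.

    # Rough reproducer sketch (assumptions: python-qpid installed, broker on
    # localhost:5672, addresses below are illustrative).
    import threading
    import time

    from qpid.messaging import Connection, Message


    def client(address, duration=3600):
        # Low heartbeat to force frequent heartbeat traffic under load.
        conn = Connection("localhost:5672", heartbeat=1)
        conn.open()
        try:
            ssn = conn.session()
            snd = ssn.sender(address)
            rcv = ssn.receiver(address)
            end = time.time() + duration
            while time.time() < end:
                snd.send(Message("ping"))
                rcv.fetch(timeout=10)
                ssn.acknowledge()
        finally:
            conn.close()


    if __name__ == "__main__":
        # Mix queue and exchange/topic receivers, many clients at once.
        addresses = ["repro-queue; {create: always}",
                     "amq.topic/repro.key"] * 10
        threads = [threading.Thread(target=client, args=(a,))
                   for a in addresses]
        for t in threads:
            t.start()
        for t in threads:
            t.join()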
It looks like this defect is most probably a duplicate of bug 1088004 (which was already QAed). The key is to verify that the bug 1088004 backtrace is identical and to analyze the testing scenarios. Getting Ken's or Gordon's opinion on whether it is a duplicate would also help.
I've checked the backtrace and the underlying code and I confirm that it is the same scenario. But I have to disagree with marking it a duplicate. Bug 1088004 is the underlying bug filed against qpidd; this bug, on the other hand, is filed against openstack-neutron (with clones for other components), and all of those introduced their own patches, so we have to check whether the openstack patches are valid now that the underlying issue has been fixed.
(In reply to Zdenek Kraus from comment #23)
> I've checked the backtrace and the underlying code and I confirm that it is
> the same scenario. But I have to disagree with marking it a duplicate. Bug
> 1088004 is the underlying bug filed against qpidd; this bug, on the other
> hand, is filed against openstack-neutron (with clones for other components),
> and all of those introduced their own patches, so we have to check whether
> the openstack patches are valid now that the underlying issue has been fixed.

Thanks, that sounds like the plan. I believe we can skip the comment 17 proposal, as the QA work was already done on the qpid side (and is tracked as bug 1088004).
The patches were reviewed by me and gsim, and the broadened exception catching is correct. Since the underlying problem was fixed and verified (see Bug 1088004) -> VERIFIED
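For context, a hedged sketch of what "broadening the exception catching" means in an impl_qpid.py-style ensure loop: catching qpid's broader MessagingError (the common base of ConnectionError and InternalError) instead of only ConnectionError, so an InternalError like the one in the traceback above also triggers a reconnect. The structure and the reconnect() helper here are illustrative, not the literal downstream patch.

    # Illustrative only -- shows the shape of the fix, not the literal patch.
    # Assumes a connection wrapper with a reconnect() method, similar in
    # spirit to nova/openstack/common/rpc/impl_qpid.py.
    import logging

    from qpid.messaging import exceptions as qpid_exceptions

    LOG = logging.getLogger(__name__)


    def ensure(connection, error_callback, method):
        """Run method(), re-establishing the connection on qpid errors.

        Catching qpid_exceptions.MessagingError instead of only
        ConnectionError is the "broadened" catch: InternalError (seen in
        the traceback above) is also a MessagingError, so it now resets
        the connection instead of looping forever.
        """
        while True:
            try:
                return method()
            except (qpid_exceptions.Empty,
                    qpid_exceptions.MessagingError) as e:
                if error_callback:
                    error_callback(e)
                LOG.info("Re-establishing the qpid connection after: %s", e)
                connection.reconnect()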
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-0848.html