Bug 1038638

Summary: [oslo] On restart of QPID or reconnect to QPID, fanout RPC no longer works
Product: Red Hat OpenStack Reporter: Ken Giusti <kgiusti>
Component: openstack-nova    Assignee: Vladan Popovic <vpopovic>
Status: CLOSED ERRATA QA Contact: Jakub Ruzicka <jruzicka>
Severity: high Docs Contact:
Priority: high    
Version: unspecified    CC: breeler, dallan, ddomingo, fpercoco, hateya, jhenner, ndipanov, sclewis, sdake, vpopovic, xqueralt, yeylon
Target Milestone: rc    Keywords: OtherQA, TestOnly
Target Release: 4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-nova-2013.2-9.el6ost Doc Type: Bug Fix
Doc Text:
When the QPID broker is restarted, RPC servers attempt to reconnect. However, a bug in the QPID topic consumer's reconnection logic (under the v2 topology) caused the consumer to present a malformed subscriber address after the restart, which in turn led qpidd to establish multiple subscriptions on the same fanout address. Only one subscription per consumer is needed; each extra subscription delivered a duplicate copy of every fanout RPC notification to the affected service (e.g. Nova). This release removes the special-case reconnect logic that handled UUID-based addresses, which avoids the incorrect creation of multiple subscriptions on the same fanout address; the QPID broker simply generates a unique queue name automatically when a client reconnects.
Story Points: ---
Clone Of:
: 1038709 1038710 1038711 1038712 1045065 (view as bug list) Environment:
Last Closed: 2013-12-20 00:41:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1045065    
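The Doc Text above describes the failure in prose; the toy model below sketches the same mechanism. It is plain Python with no real QPID involved — the broker, consumer, and topic names are invented for illustration — showing why keeping a stale UUID-based subscription around on reconnect yields duplicate fanout deliveries, and why letting the broker hand out a fresh unique queue name avoids it.

```python
import uuid


class FanoutBroker:
    """Toy stand-in for a qpidd fanout exchange (illustration only)."""

    def __init__(self):
        self.queues = {}  # queue name -> undelivered messages

    def subscribe(self, name=None):
        # With no explicit name, the broker invents a unique queue name,
        # mirroring qpidd's auto-generated temporary queues.
        name = name or str(uuid.uuid4())
        self.queues.setdefault(name, [])
        return name

    def unsubscribe(self, name):
        self.queues.pop(name, None)

    def publish(self, msg):
        # Fanout semantics: every bound queue receives a copy.
        for q in self.queues.values():
            q.append(msg)


class FanoutConsumer:
    def __init__(self, broker):
        self.broker = broker
        self.names = [broker.subscribe()]

    def reconnect_buggy(self):
        # Pre-fix sketch: the old UUID address is kept and a second
        # subscription is added, so this one service now owns two
        # queues on the same fanout address.
        self.names.append(self.broker.subscribe())

    def reconnect_fixed(self):
        # Post-fix sketch: drop the stale subscription and let the
        # broker generate a fresh unique queue name.
        for n in self.names:
            self.broker.unsubscribe(n)
        self.names = [self.broker.subscribe()]

    def drain(self):
        msgs = []
        for n in self.names:
            msgs.extend(self.broker.queues[n])
            self.broker.queues[n] = []
        return msgs


broker = FanoutBroker()
consumer = FanoutConsumer(broker)

consumer.reconnect_buggy()
broker.publish("compute.ping")
print(len(consumer.drain()))   # 2 -- the duplicate delivery from the bug

consumer.reconnect_fixed()
broker.publish("compute.ping")
print(len(consumer.drain()))   # 1 -- a single copy after the fix
```

This is only a model of the symptom; the actual fix lands in oslo's impl_qpid.py, per the upstream bug linked below.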

Description Ken Giusti 2013-12-05 14:34:24 UTC
Description of problem:

   See upstream bug report:

   https://bugs.launchpad.net/oslo/+bug/1251757



Comment 3 Perry Myers 2013-12-05 16:29:29 UTC
This seems important enough to be an RC blocker for 4.0.  But doesn't this bug need to be cloned to Nova, Ceilometer, Neutron, etc. (every component that copies and pastes the RPC code from Oslo)?

Comment 4 Russell Bryant 2013-12-05 17:01:06 UTC
(In reply to Perry Myers from comment #3)
> This seems important enough to be an RC blocker for 4.0.  But doesn't this
> bug need to be cloned to Nova, Ceilometer, Neutron, etc (every component
> that copies and pastes the code from Oslo for RPC?)

It's an important fix, for sure.  Note that it doesn't affect the default configuration; it only applies when the v2 topology is enabled.  So it doesn't necessarily have to be a blocker, but we'd need a release note saying not to use the v2 topology at all yet.  We should get this in if we can and avoid that.

Comment 5 Steven Dake 2013-12-05 22:25:15 UTC
Rather than possibly introducing regressions at this time, a release note seems to make more sense to me.  Even though the patch is small, folks have been using the various OpenStack services with QPID for six months during development, and changes to the base impl_qpid.py, however minimal, could introduce subtle changes to what the dev team has been working on during the Havana cycle.

What is the specific advantage of the v2 topology?  Is it something anyone even configures and tests?

Comment 6 Russell Bryant 2013-12-06 01:10:11 UTC
(In reply to Steven Dake from comment #5)
> Rather then possibly introducing regressions at this time, a release note
> seems to make more sense to me.  Even though the patch is small, folks have
> been using the various OpenStack services with QPID for 6 months during
> development, and changes to the base qpid_impl.py, even though minimal,
> could possibly introduce subtle changes to what the dev team has been
> working on during the Havana cycle.
> 
> What is the specific advantage of the v2 topology?

Qpid does not support automatically deleting exchanges.  The original code, which works just like the kombu (RabbitMQ) driver, assumes that it does.  This led to leaking short-lived exchanges.  We've already changed things so that we don't use short-lived exchanges anymore.  We still leak an exchange on shutdown of a service, but the impact of that is quite small compared to before, when we leaked an exchange on every use of rpc.call().

The v2 topology completely gets rid of the use of custom exchanges and uses the broker's default exchanges instead.  The end result is that the leak is completely resolved.
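The difference between the two topologies can be sketched with the subscriber address each one uses. This is a hedged paraphrase of oslo's impl_qpid.py, not the driver's actual wire-level addresses — the real code builds richer addresses with node and link options — but it captures why v1 leaks exchanges and v2 does not.

```python
def fanout_address(topic, topology_version):
    """Illustrative fanout subscriber address for each topology.

    Loosely modelled on oslo's impl_qpid.py; treat these strings as
    a sketch, not the exact addresses the real driver emits.
    """
    if topology_version == 1:
        # v1: a dedicated "<topic>_fanout" exchange is declared on the
        # broker per topic; qpid never auto-deletes exchanges, so each
        # exchange declared this way leaks.
        return "%s_fanout" % topic
    # v2: traffic rides on the pre-existing amq.topic exchange, so no
    # per-topic exchange is ever created -- nothing left to leak.
    return "amq.topic/fanout/%s" % topic


print(fanout_address("compute", 1))  # compute_fanout
print(fanout_address("compute", 2))  # amq.topic/fanout/compute
```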

> Is it something anyone even configures and tests?

Based on the bugs coming up, I think it's obvious that nobody has actually tested this beyond a simple, short-lived, one-node setup.  It still seems like a good idea to fix up the v2 stuff, because we really do want to make it *the* way to use qpid at some point.  I'm not terribly confident about making it the default this close to release, given the lack of testing it has seen.

Comment 7 Steven Dake 2013-12-06 02:29:26 UTC
One option that comes to mind to limit risk is to cherry-pick the oslo change into the various upstream projects and release them in our first maintenance release of Havana, or even earlier, as a zstream, rather than in GA.  This would provide the engineering team time to use the cherry-picked patches in a devstack environment without destabilizing GA.

Regards
-steve

Comment 8 Flavio Percoco 2013-12-06 10:34:11 UTC
(In reply to Steven Dake from comment #7)
> One option that comes to mind to limit risk is to cherry-pick the oslo
> change into the various upstream projects and release them in our first
> maintenance release of Havana, or even earlier, as a zstream, rather than in
> GA.  This would provide the engineering team time to use the cherry-picked
> patches in a devstack environment without destabilizing GA.

Patches are already being proposed on projects upstream. Regardless of whether we use v2 as the default topology for GA, I think the patches should be backported right away.

@Nikola, I submitted a patch upstream for Nova

Comment 14 errata-xmlrpc 2013-12-20 00:41:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2013-1859.html