Bug 1311593

Summary: RFE: Amqpdriver uses one connection for reply queues
Product: Red Hat OpenStack
Component: python-oslo-messaging
Version: 10.0 (Newton)
Reporter: Flavio Percoco <fpercoco>
Assignee: Victor Stinner <vstinner>
QA Contact: Udi Shkalim <ushkalim>
CC: abeekhof, apevec, fdinitto, jeckersb, lhh, royoung, srevivo, vstinner
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
Keywords: FutureFeature
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2018-09-18 15:45:49 UTC

Comment 3 Flavio Percoco 2016-02-24 14:35:15 UTC
This was found in the context of a customer case. During the debug session, we noticed that whenever a timeout error was raised [0], it was propagated by [1] and waiters were correctly removed by [2], but the connection waiting for replies was still in use and was never put back into the connection pool [3][4].

In other words, there is just one connection waiting for replies, and it is dedicated to that task, which is not ideal. As shown in [3], this issue exists on master as well.
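To make the pattern concrete, here is a minimal sketch of what the linked amqpdriver.py code does: the driver lazily creates one reply queue and takes one connection from the pool to serve it, and that connection is never returned. Class and attribute names loosely mirror the linked source, but this is an illustration under stated assumptions, not the real oslo.messaging code.

```python
# Hedged sketch of the "one reply queue, one dedicated connection"
# pattern described above. FakeConnection/ConnectionPool are stand-ins
# for the real kombu-backed objects.
import threading
import uuid


class FakeConnection:
    """Stand-in for an AMQP connection."""
    def declare_queue(self, name):
        pass  # the real driver declares a direct reply queue here


class ConnectionPool:
    """Stand-in pool; the real driver checks connections in and out."""
    def get(self):
        return FakeConnection()

    def put(self, conn):
        pass


class AMQPDriverSketch:
    def __init__(self, pool):
        self._pool = pool
        self._reply_q = None
        self._reply_q_conn = None
        self._reply_q_lock = threading.Lock()

    def _get_reply_q(self):
        with self._reply_q_lock:
            if self._reply_q is None:
                # One reply queue and one connection for the whole
                # driver: the connection is taken from the pool here
                # and never put() back -- the issue noted in [3]/[4].
                conn = self._pool.get()
                reply_q = 'reply_' + uuid.uuid4().hex
                conn.declare_queue(reply_q)
                self._reply_q = reply_q
                self._reply_q_conn = conn
            return self._reply_q


pool = ConnectionPool()
driver = AMQPDriverSketch(pool)
# Every RPC call reuses the same reply queue and dedicated connection.
assert driver._get_reply_q() == driver._get_reply_q()
```

Because `_reply_q_conn` lives on the driver instance for its whole lifetime, a timeout on an individual call removes only that call's waiter; the dedicated connection itself is never released back to the pool.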

This issue also explains some of the behaviors we were seeing in the customer's environment. Here's a short log of a conversation between John and myself about this issue:

2016-02-09 10:39:57     eck     [15:09:39] flaper87: every time i read o.m code i learn something new and important
2016-02-09 10:39:57     eck     [15:10:21] i just realized there's only one connection in the pool that is waiting for replies (and it's dedicated to doing so)
2016-02-09 10:46:15     flaper87        eck: is that OSP5 ?
2016-02-09 10:57:01     eck     flaper87: no on master
2016-02-09 10:57:11     eck     flaper87: i'm guessing it's the same back in icehouse though?
2016-02-09 11:00:07     flaper87        eck: I don't think so, that's why I'm asking. I think it's newish stuff but I can't recall
2016-02-09 11:02:01      *      eck looks
2016-02-09 11:04:23     eck     flaper87: seems to still be the case in icehouse if i'm reading correctly
2016-02-09 11:06:25     eck     basically this... https://github.com/openstack/oslo.messaging/blob/icehouse-eol/oslo/messaging/_drivers/amqpdriver.py#L350-L365
2016-02-09 11:07:05     eck     it just sets one _reply_q and one _reply_q_conn for the whole driver
2016-02-09 11:07:58     eck     and the ReplyWaiter just listens on that one connection
2016-02-09 11:08:42     eck     i think it could also explain some of the bottleneck too
2016-02-09 11:08:55     flaper87        eck: ah yeah, thought you were talking about something else
2016-02-09 11:09:09     eck     if you've got ~30 greenthreads context switching on rpc connections
2016-02-09 11:09:09     flaper87        yeah, that's also one of the causes for those leaks we were seeing 
2016-02-09 11:09:20     flaper87        There's a card for it and I'm supposed to create a BZ today 
2016-02-09 11:09:21     eck     and only one of them is able to read replies
2016-02-09 11:09:26     flaper87        with a more detailed explanation
2016-02-09 11:09:31     eck     then it's not going to get "scheduled" very often
2016-02-09 11:09:35     eck     flaper87: cool
2016-02-09 11:09:37     flaper87        right
2016-02-09 11:10:18     eck     and just to finish my thought... :)
2016-02-09 11:10:41     eck     you've got metadata workers with big pools submitting lots of messages
2016-02-09 11:11:22     eck     and over in conductor, you've only got two listening greenthreads
2016-02-09 11:11:32     eck     one is consuming from the conductor queue
2016-02-09 11:11:39     eck     and the other is consuming from the reply queue
2016-02-09 11:12:38     eck     the only semi-good news for conductor in that scenario is it's not waiting for a whole lot of replies itself
2016-02-09 11:12:46     eck     at least i don't think so
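The bottleneck eck describes above can be sketched as follows: a single listener greenthread (or thread) consumes from the one reply queue and routes replies to per-call waiters by msg_id, so however many callers are outstanding, only one consumer ever reads replies. Names here are illustrative, not the real oslo.messaging ReplyWaiter API.

```python
# Hedged sketch of the single-listener ReplyWaiter idea: one poll loop
# consumes every reply and fans them out to per-call queues by msg_id.
import queue
import threading


class ReplyWaiterSketch:
    def __init__(self):
        self._waiters = {}  # msg_id -> per-call queue
        self._lock = threading.Lock()

    def listen(self, msg_id):
        q = queue.Queue()
        with self._lock:
            self._waiters[msg_id] = q
        return q

    def unlisten(self, msg_id):
        # On timeout the waiter is removed (as in [2]), but the shared
        # connection behind the poll loop stays dedicated either way.
        with self._lock:
            self._waiters.pop(msg_id, None)

    def poll_loop(self, incoming):
        # Exactly one thread runs this loop, consuming from the single
        # reply queue; None is used here as a shutdown sentinel.
        while True:
            msg = incoming.get()
            if msg is None:
                return
            with self._lock:
                waiter = self._waiters.get(msg['msg_id'])
            if waiter is not None:
                waiter.put(msg)


incoming = queue.Queue()  # stands in for the one reply-queue connection
waiter = ReplyWaiterSketch()
reply_q = waiter.listen('abc')
t = threading.Thread(target=waiter.poll_loop, args=(incoming,))
t.start()
incoming.put({'msg_id': 'abc', 'result': 42})
result = reply_q.get(timeout=5)
assert result['result'] == 42
incoming.put(None)
t.join()
```

Under cooperative scheduling, this one consumer competes with every other greenthread for scheduling time, which matches the conductor scenario above: many workers submitting RPCs, one greenthread draining all their replies.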


[0] https://github.com/openstack/oslo.messaging/blob/icehouse-eol/oslo/messaging/_drivers/amqpdriver.py#L217-L221 

[1] https://github.com/openstack/oslo.messaging/blob/icehouse-eol/oslo/messaging/_drivers/amqpdriver.py#L410 

[2] https://github.com/openstack/oslo.messaging/blob/icehouse-eol/oslo/messaging/_drivers/amqpdriver.py#L416 

[3] Icehouse: https://github.com/openstack/oslo.messaging/blob/icehouse-eol/oslo/messaging/_drivers/amqpdriver.py#L350-L365

[3] Master: https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/amqpdriver.py#L374-L389

[4] https://github.com/openstack/oslo.messaging/blob/icehouse-eol/oslo/messaging/_drivers/amqpdriver.py#L184

Comment 5 John Eckersberg 2016-09-23 15:43:22 UTC
I think this is still a valid bug on master. It still needs investigation. Updating the version to Newton (10).

Comment 6 Victor Stinner 2016-10-03 14:23:26 UTC
Retarget to RHOS 10.

Comment 10 Red Hat Bugzilla Rules Engine 2017-06-04 01:53:06 UTC
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.

Comment 12 John Eckersberg 2018-09-18 15:45:49 UTC
I'm going to go ahead and close this.  In theory, we might be able to get a *very* small performance improvement here if we can parallelize the reply waiters, but it's really not worth the effort and would risk introducing new bugs.