Bug 1311593 - RFE: Amqpdriver uses one connection for reply queues
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-oslo-messaging
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Victor Stinner
QA Contact: Udi Shkalim
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-02-24 14:29 UTC by Flavio Percoco
Modified: 2018-09-18 15:45 UTC
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-18 15:45:49 UTC
Target Upstream Version:



Comment 3 Flavio Percoco 2016-02-24 14:35:15 UTC
This was found in the context of a customer. During the debug session, we noticed that whenever a timeout error was raised[0], it was propagated by[1] and waiters were correctly removed by[2], but the connection waiting for replies remains in use and is never put back into the connection pool[3][4].

In other words, there is just one connection waiting for replies, and it is dedicated to that task, which is not ideal. As [3] shows, this issue exists in master as well as in Icehouse.
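To make the pattern concrete, here is a minimal sketch (hypothetical class and attribute names, not the actual oslo.messaging code) of a driver that lazily creates a single reply queue and a single dedicated connection; because the connection is held by the driver for its lifetime, it never goes back into the pool:

```python
import threading
import uuid

class FakeConnection:
    """Stand-in for a broker connection (hypothetical)."""
    def __init__(self, purpose):
        self.purpose = purpose

class AMQPDriverSketch:
    """Sketch of the pattern described above: one reply queue and one
    dedicated connection per driver, created lazily on first use."""

    def __init__(self):
        self._reply_q_lock = threading.Lock()
        self._reply_q = None
        self._reply_q_conn = None

    def _get_reply_q(self):
        with self._reply_q_lock:
            if self._reply_q is None:
                # One reply queue and one connection for the whole driver;
                # the connection is dedicated to listening for replies and
                # is never returned to the connection pool.
                self._reply_q = 'reply_' + uuid.uuid4().hex
                self._reply_q_conn = FakeConnection(purpose='listen')
            return self._reply_q
```

Every RPC call in the process then shares this one queue and connection, which is why a timed-out call cannot release the connection: it was never the caller's to release.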

This issue also explains some of the behaviors we were seeing in the customer's environment. Here's a short log of a conversation between John and myself about this issue:

2016-02-09 10:39:57     eck     [15:09:39] flaper87: every time i read o.m code i learn something new and important
2016-02-09 10:39:57     eck     [15:10:21] i just realized there's only one connection in the pool that is waiting for replies (and it's dedicated to doing so)
2016-02-09 10:46:15     flaper87        eck: is that OSP5 ?
2016-02-09 10:57:01     eck     flaper87: no on master
2016-02-09 10:57:11     eck     flaper87: i'm guessing it's the same back in icehouse though?
2016-02-09 11:00:07     flaper87        eck: I don't think so, that's why I'm asking. I think it's newish stuff but I can't recall
2016-02-09 11:02:01      *      eck looks
2016-02-09 11:04:23     eck     flaper87: seems to still be the case in icehouse if i'm reading correctly
2016-02-09 11:06:25     eck     basically this... https://github.com/openstack/oslo.messaging/blob/icehouse-eol/oslo/messaging/_drivers/amqpdriver.py#L350-L365
2016-02-09 11:07:05     eck     it just sets one _reply_q and one _reply_q_conn for the whole driver
2016-02-09 11:07:58     eck     and the ReplyWaiter just listens on that one connection
2016-02-09 11:08:42     eck     i think it could also explain some of the bottleneck too
2016-02-09 11:08:55     flaper87        eck: ah yeah, thought you were talking about something else
2016-02-09 11:09:09     eck     if you've got ~30 greenthreads context switching on rpc connections
2016-02-09 11:09:09     flaper87        yeah, that's also one of the causes for those leaks we were seeing 
2016-02-09 11:09:20     flaper87        There's a card for it and I'm supposed to create a BZ today 
2016-02-09 11:09:21     eck     and only one of them is able to read replies
2016-02-09 11:09:26     flaper87        with a more detailed explanation
2016-02-09 11:09:31     eck     then it's not going to get "scheduled" very often
2016-02-09 11:09:35     eck     flaper87: cool
2016-02-09 11:09:37     flaper87        right
2016-02-09 11:10:18     eck     and just to finish my thought... :)
2016-02-09 11:10:41     eck     you've got metadata workers with big pools submitting lots of messages
2016-02-09 11:11:22     eck     and over in conductor, you've only got two listening greenthreads
2016-02-09 11:11:32     eck     one is consuming from the conductor queue
2016-02-09 11:11:39     eck     and the other is consuming from the reply queue
2016-02-09 11:12:38     eck     the only semi-good news for conductor in that scenario is it's not waiting for a whole lot of replies itself
2016-02-09 11:12:46     eck     at least i don't think so
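The single-listener arrangement from the conversation above can be sketched as follows (hypothetical names; threads stand in for greenthreads, and a plain queue stands in for the one reply connection). One poller consumes every reply and routes each to the waiter registered under its msg_id; when a call times out, only the waiter entry is removed, while the listener and its dedicated connection keep running:

```python
import queue
import threading

class ReplyWaiterSketch:
    """Sketch of the single-listener pattern: one thread reads all
    replies off one connection and dispatches them to per-call waiters
    keyed by msg_id."""

    def __init__(self, incoming):
        self._waiters = {}          # msg_id -> per-call queue
        self._incoming = incoming   # stands in for the one reply connection
        threading.Thread(target=self._poll, daemon=True).start()

    def listen(self, msg_id):
        # Register a waiter before the RPC call is sent.
        self._waiters[msg_id] = queue.Queue()

    def wait(self, msg_id, timeout):
        try:
            # Raises queue.Empty on timeout.
            return self._waiters[msg_id].get(timeout=timeout)
        finally:
            # On success *or* timeout the waiter is removed, but the
            # listener thread and its connection are unaffected.
            del self._waiters[msg_id]

    def _poll(self):
        while True:
            msg_id, reply = self._incoming.get()  # blocks on the one connection
            waiter = self._waiters.get(msg_id)
            if waiter is not None:
                waiter.put(reply)
```

This also illustrates the scheduling concern raised in the chat: with many caller threads and only one thread able to read replies, reply dispatch competes for scheduling with all the senders.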


[0] https://github.com/openstack/oslo.messaging/blob/icehouse-eol/oslo/messaging/_drivers/amqpdriver.py#L217-L221 

[1] https://github.com/openstack/oslo.messaging/blob/icehouse-eol/oslo/messaging/_drivers/amqpdriver.py#L410 

[2] https://github.com/openstack/oslo.messaging/blob/icehouse-eol/oslo/messaging/_drivers/amqpdriver.py#L416 

[3] Icehouse: https://github.com/openstack/oslo.messaging/blob/icehouse-eol/oslo/messaging/_drivers/amqpdriver.py#L350-L365

[3] Master: https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/amqpdriver.py#L374-L389

[4] https://github.com/openstack/oslo.messaging/blob/icehouse-eol/oslo/messaging/_drivers/amqpdriver.py#L184

Comment 5 John Eckersberg 2016-09-23 15:43:22 UTC
I think this is still a valid bug on master.  Still needs investigation.  Updating version to Newton (10).

Comment 6 Victor Stinner 2016-10-03 14:23:26 UTC
Retarget to RHOS 10.

Comment 10 Red Hat Bugzilla Rules Engine 2017-06-04 01:53:06 UTC
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.

Comment 12 John Eckersberg 2018-09-18 15:45:49 UTC
I'm going to go ahead and close this.  In theory, we might be able to get a *very* small performance improvement here if we can parallelize the reply waiters, but it's really not worth the effort and would risk introducing new bugs.

