Bug 1302861 - Utilize QoS to prevent excessive client side queueing of messages
Status: CLOSED WONTFIX
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-oslo-messaging
Version: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: async
Target Release: 6.0 (Juno)
Assigned To: John Eckersberg
QA Contact: Udi Shkalim
Keywords: ZStream
Depends On: 1295896 1302873 1310807
Blocks:
Reported: 2016-01-28 14:43 EST by John Eckersberg
Modified: 2016-04-26 23:33 EDT
CC List: 8 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1295896
Environment:
Last Closed: 2016-03-03 12:40:55 EST
Type: Bug
Regression: ---


External Trackers
Launchpad 1531222 (Last Updated: 2016-01-28 14:43 EST)

Description John Eckersberg 2016-01-28 14:43:38 EST
+++ This bug was initially created as a clone of Bug #1295896 +++

https://bugs.launchpad.net/oslo.messaging/+bug/1531222

--- Additional comment from John Eckersberg on 2016-01-05 13:22:42 EST ---

This is problematic with the way we've got RabbitMQ configured, particularly on older versions without AMQP heartbeat support. We set TCP_USER_TIMEOUT to 5s in order to quickly notice failed connections [1]. What happens is roughly:

- There are a bunch of messages in a queue

- Because no QoS (prefetch limit) is set, they all get flushed to the consumer(s) at once

- The consumer(s) can't process them fast enough, so they stop calling recv() on the socket

- Messages buffer in the kernel on the consumer side, filling the socket receive buffer until it's full and the TCP window drops to zero

- The server probes the zero window for 5 seconds, hits TCP_USER_TIMEOUT, and closes the connection
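
For illustration, here is a minimal sketch of the kind of fix being discussed, using kombu (the library underneath oslo.messaging's rabbit driver). The broker URL, queue name, and prefetch value are illustrative assumptions, not values from the actual patch:

    # Cap unacknowledged deliveries per consumer with basic.qos, so the
    # broker stops flushing the entire queue backlog to the client at once.
    from kombu import Connection, Consumer, Queue

    queue = Queue('notifications.info')  # assumed queue name

    def on_message(body, message):
        # Stand-in for slow application-side processing.  With a prefetch
        # limit set, the broker pauses after 10 unacked messages instead
        # of filling the client's socket buffer.
        print(body)
        message.ack()

    with Connection('amqp://guest:guest@localhost:5672//') as conn:
        channel = conn.channel()
        # prefetch_count issues basic.qos(prefetch_count=10) on the channel
        with Consumer(channel, queues=[queue], callbacks=[on_message],
                      prefetch_count=10):
            while True:
                conn.drain_events()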


[1] This was set primarily to work around weird behavior during VIP failover, which we don't even use presently. See http://john.eckersberg.com/improving-ha-failures-with-tcp-timeouts.html. Maybe we should just turn this off for now...
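
For reference, a minimal sketch of how a TCP_USER_TIMEOUT like the 5s one above gets applied to a socket on Linux. The host and port here are assumptions (RabbitMQ's default AMQP port); the constant itself is only exposed by Python's socket module from 3.6 onward:

    import socket

    # On older Pythons the Linux constant (18) has to be supplied by hand.
    TCP_USER_TIMEOUT = getattr(socket, 'TCP_USER_TIMEOUT', 18)

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Abort the connection if transmitted data (including zero-window
    # probes) goes unacknowledged for 5000 ms, matching the 5s above.
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, 5000)
    sock.connect(('localhost', 5672))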

--- Additional comment from John Eckersberg on 2016-01-07 13:31:21 EST ---

https://review.openstack.org/#/c/264911/

--- Additional comment from Perry Myers on 2016-01-12 10:01:11 EST ---

@eck: Do we need this on OSP5 and OSP6 as well? If so we need this cloned 3 more times for those releases (OSP5/RHEL6, OSP5/RHEL7, OSP6)

--- Additional comment from John Eckersberg on 2016-01-12 14:33:35 EST ---

(In reply to Perry Myers from comment #3)
> @eck: Do we need this on OSP5 and OSP6 as well? If so we need this cloned 3
> more times for those releases (OSP5/RHEL6, OSP5/RHEL7, OSP6)

Yeah it'd be nice to get it everywhere.  I guess hold for now until the upstream patch gets accepted, because it looks like it may be slightly more invasive than I originally thought and maybe the backport won't be so straightforward/feasible.  We'll see.  I'll keep the needinfo? to keep it on my radar for clones.

/me goes off to amend the review.
Comment 1 John Eckersberg 2016-01-28 15:34:00 EST
https://code.engineering.redhat.com/gerrit/#/c/66702/
Comment 2 John Eckersberg 2016-03-03 12:40:55 EST
Closing this because the backport from upstream is more involved and probably not worth the effort for now.
