Bug 1302861

Summary: Utilize QoS to prevent excessive client side queueing of messages
Product: Red Hat OpenStack
Reporter: John Eckersberg <jeckersb>
Component: python-oslo-messaging
Assignee: John Eckersberg <jeckersb>
Status: CLOSED WONTFIX
QA Contact: Udi Shkalim <ushkalim>
Severity: high
Priority: high
Version: 6.0 (Juno)
CC: apevec, dmaley, ggillies, jeckersb, lhh, mfuruta, plemenko, yeylon
Target Milestone: async
Keywords: ZStream
Target Release: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Clone Of: 1295896
Last Closed: 2016-03-03 17:40:55 UTC
Type: Bug
Bug Depends On: 1295896, 1302873, 1310807

Description John Eckersberg 2016-01-28 19:43:38 UTC
+++ This bug was initially created as a clone of Bug #1295896 +++

https://bugs.launchpad.net/oslo.messaging/+bug/1531222

--- Additional comment from John Eckersberg on 2016-01-05 13:22:42 EST ---

This is problematic with the way we've got rabbitmq configured, particularly on older versions without AMQP heartbeat support.  We set TCP_USER_TIMEOUT to 5s in order to quickly notice failed connections [1].  What happens is roughly:

- There are a bunch of messages in a queue

- Because of no QoS, they all get flushed to the consumer(s)

- The consumer(s) can't process them fast enough, meaning they don't call recv() on the socket

- Messages buffer in the kernel on the consumer, filling the socket's receive buffer until the advertised TCP window drops to zero

- The server probes the zero window for 5 seconds, hits TCP_USER_TIMEOUT, and closes the connection
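
The sequence above hinges on the broker's prefetch behavior: with no QoS limit, the broker pushes every ready message to the consumer at once, regardless of how fast the consumer acknowledges them. A minimal, self-contained simulation of that windowing behavior (this is illustrative only, not oslo.messaging or broker code; `Broker`, `deliver`, and `ack` are invented names):

```python
# Illustrative simulation of AMQP consumer prefetch (QoS).  Without a
# prefetch limit the broker flushes every queued message to the consumer;
# with prefetch_count=N it stops after N unacknowledged messages until the
# consumer acks some of them.
from collections import deque

class Broker:
    def __init__(self, messages, prefetch_count=None):
        self.queue = deque(messages)
        self.prefetch_count = prefetch_count  # None = no QoS limit
        self.unacked = 0
        self.delivered = []

    def deliver(self):
        # Push messages until the queue is empty or the prefetch window closes.
        while self.queue:
            if self.prefetch_count is not None and self.unacked >= self.prefetch_count:
                break
            self.delivered.append(self.queue.popleft())
            self.unacked += 1

    def ack(self, n=1):
        # The consumer acknowledges n messages, reopening the prefetch window.
        self.unacked -= n
        self.deliver()

msgs = list(range(100))

no_qos = Broker(msgs)
no_qos.deliver()
print(len(no_qos.delivered))   # all 100 flushed to the consumer at once

qos = Broker(msgs, prefetch_count=10)
qos.deliver()
print(len(qos.delivered))      # only 10 in flight
qos.ack(5)
print(len(qos.delivered))      # 5 acked, so 5 more delivered: 15 total
```

In real AMQP terms, the upstream fix amounts to setting a consumer prefetch limit (e.g. `channel.basic_qos(prefetch_count=N)` in pika, or the equivalent kombu setting) so the broker stops pushing once N messages are unacknowledged.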


[1] This is primarily due to weird behavior during VIP failover which we don't even use presently.  See http://john.eckersberg.com/improving-ha-failures-with-tcp-timeouts.html.  Maybe we should just turn this off for now...
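
For reference, the 5-second failure detection described in [1] comes from the TCP_USER_TIMEOUT socket option. A hedged sketch of how that option is applied to a socket (Linux-specific; the raw constant value 18 is used as a fallback where Python's socket module does not expose it, which it does as of 3.6):

```python
import socket

# Sketch of the 5-second TCP_USER_TIMEOUT described above (not the actual
# oslo.messaging configuration code).  The option is Linux-specific; Python
# exposes it as socket.TCP_USER_TIMEOUT, and its raw Linux value is 18.
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Value is in milliseconds: if transmitted data (including zero-window
# probes) stays unacknowledged for 5s, the kernel aborts the connection.
sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, 5000)
val = sock.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT)
sock.close()
print(val)   # 5000 on Linux
```

This is what makes the zero-window condition fatal: the server's window probes go unacknowledged past the 5s budget, so the kernel tears down an otherwise healthy connection.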

--- Additional comment from John Eckersberg on 2016-01-07 13:31:21 EST ---

https://review.openstack.org/#/c/264911/

--- Additional comment from Perry Myers on 2016-01-12 10:01:11 EST ---

@eck: Do we need this on OSP5 and OSP6 as well? If so we need this cloned 3 more times for those releases (OSP5/RHEL6, OSP5/RHEL7, OSP6)

--- Additional comment from John Eckersberg on 2016-01-12 14:33:35 EST ---

(In reply to Perry Myers from comment #3)
> @eck: Do we need this on OSP5 and OSP6 as well? If so we need this cloned 3
> more times for those releases (OSP5/RHEL6, OSP5/RHEL7, OSP6)

Yeah it'd be nice to get it everywhere.  I guess hold for now until the upstream patch gets accepted, because it looks like it may be slightly more invasive than I originally thought and maybe the backport won't be so straightforward/feasible.  We'll see.  I'll keep the needinfo? to keep it on my radar for clones.

/me goes off to amend the review.

Comment 1 John Eckersberg 2016-01-28 20:34:00 UTC
https://code.engineering.redhat.com/gerrit/#/c/66702/

Comment 2 John Eckersberg 2016-03-03 17:40:55 UTC
Closing this because the backport from upstream is more involved and probably not worth the effort for now.