1302861 – Utilize QoS to prevent excessive client side queueing of messages

Summary: Utilize QoS to prevent excessive client side queueing of messages

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	python-oslo-messaging
Sub Component:
Version:	6.0 (Juno)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	async
Target Release:	6.0 (Juno)
Assignee:	John Eckersberg
QA Contact:	Udi Shkalim
Docs Contact:
URL:
Whiteboard:
Depends On:	1295896 1302873 1310807
Blocks:

Reported:	2016-01-28 19:43 UTC by John Eckersberg
Modified:	2016-04-27 03:33 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:	1295896
Environment:
Last Closed:	2016-03-03 17:40:55 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Launchpad	1531222	0	None	None	None	2016-01-28 19:43:38 UTC

Description John Eckersberg 2016-01-28 19:43:38 UTC

+++ This bug was initially created as a clone of Bug #1295896 +++

https://bugs.launchpad.net/oslo.messaging/+bug/1531222

--- Additional comment from John Eckersberg on 2016-01-05 13:22:42 EST ---

This is problematic with the way we've got RabbitMQ configured, particularly on older versions without AMQP heartbeat support.  We set TCP_USER_TIMEOUT to 5s in order to notice failed connections quickly [1].  What happens is roughly:

- There are a bunch of messages in a queue

- Because no QoS (prefetch limit) is set, they all get flushed to the consumer(s)

- The consumer(s) can't process them fast enough, meaning they don't call recv() on the socket

- Messages buffer in the kernel on the consumer, filling the socket's receive buffer until the advertised TCP window drops to zero

- The server probes the zero window for 5 seconds, hits TCP_USER_TIMEOUT, and closes the connection
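
The effect of a prefetch limit on the sequence above can be sketched with a toy model. This is not the oslo.messaging patch and uses no real AMQP library; the `Broker` class and `max_client_backlog` helper are hypothetical names made up for illustration. The only protocol fact assumed is that an AMQP 0-9-1 prefetch count of 0 means "no limit".

```python
from collections import deque


class Broker:
    """Toy broker: pushes messages until the unacked count hits the prefetch limit."""

    def __init__(self, messages, prefetch_count=0):
        self.queue = deque(messages)
        self.prefetch_count = prefetch_count  # 0 means unlimited, as in AMQP basic.qos
        self.unacked = 0

    def deliver(self):
        """Return the messages the broker would push to the consumer right now."""
        pushed = []
        while self.queue and (
            self.prefetch_count == 0 or self.unacked < self.prefetch_count
        ):
            pushed.append(self.queue.popleft())
            self.unacked += 1
        return pushed

    def ack(self):
        """The consumer acknowledges one message, freeing one prefetch slot."""
        if self.unacked > 0:
            self.unacked -= 1


def max_client_backlog(n_messages, prefetch_count):
    """Peak number of messages waiting on the consumer side (hypothetical helper)."""
    broker = Broker(range(n_messages), prefetch_count)
    backlog = deque()
    peak = 0
    while True:
        backlog.extend(broker.deliver())
        if not backlog:
            break
        peak = max(peak, len(backlog))
        backlog.popleft()  # the consumer processes one message...
        broker.ack()       # ...and acknowledges it
    return peak


if __name__ == "__main__":
    print(max_client_backlog(1000, 0))   # no QoS: the whole queue lands on the client
    print(max_client_backlog(1000, 10))  # with QoS: at most 10 in flight at a time
```

In this model the client-side backlog is bounded by the prefetch count, so a slow consumer holds only a few messages and the rest stay queued on the broker instead of piling up in the consumer's receive buffer.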


[1] This is primarily due to weird behavior during VIP failover which we don't even use presently.  See http://john.eckersberg.com/improving-ha-failures-with-tcp-timeouts.html.  Maybe we should just turn this off for now...
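
For reference, TCP_USER_TIMEOUT is a Linux per-socket option expressed in milliseconds. The sketch below only illustrates the option from Python; in the setup described above it is configured on the RabbitMQ side, and the `set_user_timeout` helper is a hypothetical name, not part of any library.

```python
import socket

# Python exposes this constant only on platforms that support it (Linux).
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", None)


def set_user_timeout(sock, seconds):
    """Set TCP_USER_TIMEOUT on sock; return the value in ms read back, or None if unsupported."""
    if TCP_USER_TIMEOUT is None:
        return None
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, int(seconds * 1000))
    return sock.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        print(set_user_timeout(s, 5))
    finally:
        s.close()
```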

--- Additional comment from John Eckersberg on 2016-01-07 13:31:21 EST ---

https://review.openstack.org/#/c/264911/

--- Additional comment from Perry Myers on 2016-01-12 10:01:11 EST ---

@eck: Do we need this on OSP5 and OSP6 as well? If so we need this cloned 3 more times for those releases (OSP5/RHEL6, OSP5/RHEL7, OSP6)

--- Additional comment from John Eckersberg on 2016-01-12 14:33:35 EST ---

(In reply to Perry Myers from comment #3)
> @eck: Do we need this on OSP5 and OSP6 as well? If so we need this cloned 3
> more times for those releases (OSP5/RHEL6, OSP5/RHEL7, OSP6)

Yeah it'd be nice to get it everywhere.  I guess hold for now until the upstream patch gets accepted, because it looks like it may be slightly more invasive than I originally thought and maybe the backport won't be so straightforward/feasible.  We'll see.  I'll keep the needinfo? to keep it on my radar for clones.

/me goes off to amend the review.

Comment 1 John Eckersberg 2016-01-28 20:34:00 UTC

https://code.engineering.redhat.com/gerrit/#/c/66702/

Comment 2 John Eckersberg 2016-03-03 17:40:55 UTC

Closing this because the backport from upstream is more involved and probably not worth the effort for now.
