1295896 – Utilize QoS to prevent excessive client side queueing of messages

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	python-oslo-messaging
Sub Component:
Version:	8.0 (Liberty)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	async
Target Release:	7.0 (Kilo)
Assignee:	John Eckersberg
QA Contact:	Udi Shkalim
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1342629
Depends On:
Blocks:	1302861 1302873 1310807

Reported:	2016-01-05 17:46 UTC by John Eckersberg
Modified:	2016-07-21 21:21 UTC
CC List:	12 users
Fixed In Version:	python-oslo-messaging-1.8.3-4.el7ost
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1302861 1302873 1310807
Environment:
Last Closed:	2016-03-10 17:06:15 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1531222	None	None	None	2016-01-05 17:46:57 UTC
OpenStack gerrit	279834	None	MERGED	rabbit: improvements to QoS	2020-02-26 16:40:41 UTC
Red Hat Product Errata	RHBA-2016:0437	normal	SHIPPED_LIVE	Red Hat Enterprise Linux OpenStack Platform 7 Bug Fix and Enhancement Advisory	2016-03-10 22:05:30 UTC

Description John Eckersberg 2016-01-05 17:46:57 UTC

https://bugs.launchpad.net/oslo.messaging/+bug/1531222

Comment 1 John Eckersberg 2016-01-05 18:22:42 UTC

This is problematic with the way we've got rabbitmq configured, particularly on older versions without AMQP heartbeat support.  We set TCP_USER_TIMEOUT to 5s in order to quickly notice failed connections [1].  What happens is roughly:

- There are a bunch of messages in a queue

- Because of no QoS, they all get flushed to the consumer(s)

- The consumer(s) can't process them fast enough, meaning they don't call recv() on the socket

- Messages buffer in the kernel on the consumer, using up the size of the recv buffer until it's full and the window drops to zero

- The server probes the zero window for 5 seconds, hits TCP_USER_TIMEOUT, and closes the connection


[1] This is primarily due to weird behavior during VIP failover which we don't even use presently.  See http://john.eckersberg.com/improving-ha-failures-with-tcp-timeouts.html.  Maybe we should just turn this off for now...
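
The sequence above can be illustrated with a toy model. This is only a sketch, not oslo.messaging or RabbitMQ code: the `simulate` function and all of its parameters are hypothetical, the receive buffer is counted in messages rather than bytes, and time advances in abstract ticks. It compares unlimited delivery (prefetch 0) with a prefetch limit.

```python
def simulate(queue_depth, prefetch_count, recv_buffer_msgs, consume_per_tick=1):
    """Toy model of broker-to-consumer delivery.

    Returns (peak_outstanding, zero_window_ticks), where zero_window_ticks
    counts ticks in which the broker had more to send than the consumer's
    receive buffer could accept.
    """
    if recv_buffer_msgs < 1 or consume_per_tick < 1:
        raise ValueError("buffer size and consume rate must be at least 1")

    remaining = queue_depth   # messages still sitting in the broker queue
    outstanding = 0           # delivered to the consumer but not yet processed
    peak = 0
    zero_window_ticks = 0

    while remaining or outstanding:
        if prefetch_count == 0:
            # No QoS: the broker tries to flush the whole queue.
            allowed = remaining
        else:
            # QoS: at most prefetch_count unacknowledged messages in flight.
            allowed = min(remaining, max(prefetch_count - outstanding, 0))

        free = recv_buffer_msgs - outstanding
        if allowed > free:
            # The broker wants to send more than the buffer can take.
            zero_window_ticks += 1
        sent = min(allowed, free)

        remaining -= sent
        outstanding += sent
        peak = max(peak, outstanding)

        # The slow consumer processes (and acknowledges) a few messages.
        outstanding -= min(consume_per_tick, outstanding)

    return peak, zero_window_ticks


if __name__ == "__main__":
    # Made-up numbers: 300 queued messages, room for 120 in the buffer.
    print("no QoS:     ", simulate(300, 0, 120))
    print("prefetch=50:", simulate(300, 50, 120))
```

With these made-up numbers the unlimited case fills the buffer and stalls, while a prefetch limit of 50 never has more than 50 messages outstanding. In the real failure it is the stall lasting longer than the 5 second timeout that kills the connection; the model does not attempt to represent wall-clock time.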

Comment 2 John Eckersberg 2016-01-07 18:31:21 UTC

https://review.openstack.org/#/c/264911/

Comment 3 Perry Myers 2016-01-12 15:01:11 UTC

@eck: Do we need this on OSP5 and OSP6 as well? If so we need this cloned 3 more times for those releases (OSP5/RHEL6, OSP5/RHEL7, OSP6)

Comment 4 John Eckersberg 2016-01-12 19:33:35 UTC

(In reply to Perry Myers from comment #3)
> @eck: Do we need this on OSP5 and OSP6 as well? If so we need this cloned 3
> more times for those releases (OSP5/RHEL6, OSP5/RHEL7, OSP6)

Yeah it'd be nice to get it everywhere.  I guess hold for now until the upstream patch gets accepted, because it looks like it may be slightly more invasive than I originally thought and maybe the backport won't be so straightforward/feasible.  We'll see.  I'll keep the needinfo? to keep it on my radar for clones.

/me goes off to amend the review.

Comment 5 John Eckersberg 2016-01-28 20:33:17 UTC

https://code.engineering.redhat.com/gerrit/#/c/66698/

Comment 6 John Eckersberg 2016-02-13 02:17:04 UTC

https://review.openstack.org/#/c/279834/

Comment 11 John Eckersberg 2016-02-26 21:35:57 UTC

Note that this also requires a config change: set rabbit_qos_prefetch_count=N in nova.conf etc.  The default value of 0 disables QoS and retains the old behavior.  I'd suggest trying 1 at first so that each consumer takes only one message from the queue at a time, although this might hurt performance if the client can't queue up at least a few messages locally.  IIRC from looking at packet captures of this problem, there were something like 100-150 unacknowledged messages when the recv buffer filled up, so to be safe I'd suggest a maximum of maybe 50.
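
As a rough way to check what a given config file would yield, the sketch below reads the option from INI-style text using Python's standard configparser. The helper name `effective_prefetch_count` is hypothetical, and the `[oslo_messaging_rabbit]` section name is an assumption: this report only says the option goes in nova.conf. The fallback to 0 mirrors the default described above.

```python
import configparser

OPTION = "rabbit_qos_prefetch_count"


def effective_prefetch_count(conf_text, section="oslo_messaging_rabbit"):
    """Return the configured prefetch count, or 0 (QoS disabled) if unset.

    The section name is an assumption; pass a different one if the option
    lives elsewhere in your deployment.
    """
    # interpolation=None: treat '%' literally. strict=False: tolerate
    # repeated options, which these config files may contain.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)

    if parser.has_section(section):
        # getint also sees values inherited from [DEFAULT].
        return parser.getint(section, OPTION, fallback=0)
    return int(parser.defaults().get(OPTION, 0))


if __name__ == "__main__":
    sample = (
        "[DEFAULT]\n"
        "debug = false\n"
        "\n"
        "[oslo_messaging_rabbit]\n"
        "rabbit_qos_prefetch_count = 50\n"
    )
    print(effective_prefetch_count(sample))
```

This only inspects the file; the services still need a restart to pick up a changed value.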

Comment 15 Udi Shkalim 2016-03-08 15:08:56 UTC

I was unable to recreate the problem.
Version: python-oslo-messaging-1.8.3-4.el7ost.noarch

Steps I took to try to recreate the problem:
1. Deployed a setup with 1 controller and 1 compute node
2. Restarted the pcs service
3. Disabled the nova-conductor service and watched messages accumulate in the queue: rabbitmqctl list_queues name messages | egrep '^conductor\s'
4. Once the queue reached 300 messages, re-enabled the nova-conductor service

No tcp.window.zero was captured on port 5672 on the controller.
No error/disconnected messages in rabbit or nova-conductor log. 

Repeated the same steps after adding the configuration fix:
 - "rabbit_qos_prefetch_count=50" to /etc/nova/nova.conf on the controller

No change.

Comment 16 John Eckersberg 2016-03-08 15:41:08 UTC

Let's take a step back here.  We don't really need to reproduce the original circumstances.  There are a lot of timing issues that make it hard to reproduce.

Instead, we know we want to enable QoS, so let's just focus on that.

To reproduce the original case, just query channel prefetch_count:

# rabbitmqctl list_channels name prefetch_count

You will see the prefetch_count is 0 for all channels.

To verify, set rabbit_qos_prefetch_count to some value other than 0, restart services, and re-run the above command to see that the prefetch_count is changed for those channels to reflect the configured value.

Comment 17 Udi Shkalim 2016-03-08 16:23:12 UTC

Got it.

Changed the value of rabbit_qos_prefetch_count to 50.
 
rabbitmqctl list_channels name prefetch_count  | grep 50
172.17.1.12:52630 -> 172.17.1.12:5672 (1)	50
172.17.1.12:52795 -> 172.17.1.12:5672 (1)	50
172.17.1.12:52855 -> 172.17.1.12:5672 (1)	50
172.17.1.12:52856 -> 172.17.1.12:5672 (1)	50
172.17.1.12:52857 -> 172.17.1.12:5672 (1)	50
172.17.1.12:52859 -> 172.17.1.12:5672 (1)	50
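
The same check could be scripted rather than eyeballed with grep. The sketch below assumes the tab-separated `name<TAB>prefetch_count` layout shown in the listing above; the function names are hypothetical, and any line that does not fit that layout (such as a header or footer that some rabbitmqctl versions print) is skipped.

```python
def parse_prefetch_counts(output):
    """Map channel name -> prefetch_count from list_channels output text."""
    counts = {}
    for line in output.splitlines():
        # Split on the last tab: the name itself contains spaces and '->'.
        name, sep, value = line.rpartition("\t")
        value = value.strip()
        if not sep or not value.isdigit():
            continue  # not a "name<TAB>number" row
        counts[name.strip()] = int(value)
    return counts


def channels_not_matching(output, expected):
    """Return the sorted names of channels whose prefetch_count differs."""
    counts = parse_prefetch_counts(output)
    return sorted(name for name, count in counts.items() if count != expected)


if __name__ == "__main__":
    sample = (
        "172.17.1.12:52630 -> 172.17.1.12:5672 (1)\t50\n"
        "172.17.1.12:52795 -> 172.17.1.12:5672 (1)\t50\n"
        "172.17.1.12:52855 -> 172.17.1.12:5672 (1)\t0\n"
    )
    print(channels_not_matching(sample, expected=50))
```

Channels that still report 0 after the config change would typically belong to services that were not restarted or not reconfigured.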

Comment 19 errata-xmlrpc 2016-03-10 17:06:15 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0437.html

Comment 20 Peter Lemenkov 2016-07-21 21:21:17 UTC

*** Bug 1342629 has been marked as a duplicate of this bug. ***