Bug 1295896 - Utilize QoS to prevent excessive client side queueing of messages
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-oslo-messaging
Version: 8.0 (Liberty)
Hardware/OS: Unspecified / Unspecified
Priority: high, Severity: high
Target Milestone: async
Target Release: 7.0 (Kilo)
Assigned To: John Eckersberg
QA Contact: Udi Shkalim
Keywords: ZStream
Duplicates: 1342629
Depends On: (none)
Blocks: 1302861 1302873 1310807
Reported: 2016-01-05 12:46 EST by John Eckersberg
Modified: 2016-07-21 17:21 EDT
Fixed In Version: python-oslo-messaging-1.8.3-4.el7ost
Doc Type: Bug Fix
Clones: 1302861 1302873 1310807
Last Closed: 2016-03-10 12:06:15 EST
Type: Bug

External Trackers:
Launchpad 1531222 (last updated 2016-01-05 12:46 EST)
OpenStack gerrit 279834 (last updated 2016-02-12 21:17 EST)
Comment 1 John Eckersberg 2016-01-05 13:22:42 EST
This is problematic with the way we've got RabbitMQ configured, particularly on older versions without AMQP heartbeat support.  We set TCP_USER_TIMEOUT to 5s in order to notice failed connections quickly [1].  What happens is roughly:

- There are a bunch of messages in a queue

- Because there is no QoS prefetch limit, they all get flushed to the consumer(s) at once

- The consumer(s) can't process them fast enough, so they stop calling recv() on the socket

- Messages buffer in the kernel on the consumer side, filling the receive buffer until it's full and the advertised TCP window drops to zero

- The server probes the zero window for 5 seconds, hits TCP_USER_TIMEOUT, and closes the connection

[1] This is primarily due to weird behavior during VIP failover which we don't even use presently.  See http://john.eckersberg.com/improving-ha-failures-with-tcp-timeouts.html.  Maybe we should just turn this off for now...
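The mechanism above can be illustrated with a toy simulation (not oslo.messaging code; the queue depth and consumption rate are made-up numbers): with a prefetch of 0 the broker pushes the entire backlog to the consumer immediately, while a nonzero prefetch caps how many unacknowledged messages are in flight on the client side.

```python
from collections import deque

def deliver(queue_depth, prefetch_count, consume_per_tick, ticks):
    """Simulate broker->consumer delivery under an AMQP basic.qos prefetch.

    Returns the peak number of unacknowledged messages buffered on the
    consumer side (roughly what ends up in the client's recv buffer).
    """
    queue = deque(range(queue_depth))  # messages waiting on the broker
    unacked = 0                        # delivered but not yet acked
    peak_unacked = 0
    for _ in range(ticks):
        # The broker flushes as much as the prefetch window allows
        # (prefetch_count == 0 means "no limit" in AMQP).
        while queue and (prefetch_count == 0 or unacked < prefetch_count):
            queue.popleft()
            unacked += 1
        peak_unacked = max(peak_unacked, unacked)
        # The consumer acks a few messages per tick.
        unacked -= min(unacked, consume_per_tick)
    return peak_unacked

# No QoS: the whole backlog lands on the consumer at once.
print(deliver(queue_depth=300, prefetch_count=0, consume_per_tick=5, ticks=100))   # 300
# prefetch=50: at most 50 unacked messages in flight at any time.
print(deliver(queue_depth=300, prefetch_count=50, consume_per_tick=5, ticks=100))  # 50
```

With no limit, a slow consumer sits on the full backlog (and its kernel recv buffer fills); with a prefetch, the broker stops sending once the window of unacked messages is full.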
Comment 2 John Eckersberg 2016-01-07 13:31:21 EST
Comment 3 Perry Myers 2016-01-12 10:01:11 EST
@eck: Do we need this on OSP5 and OSP6 as well? If so we need this cloned 3 more times for those releases (OSP5/RHEL6, OSP5/RHEL7, OSP6)
Comment 4 John Eckersberg 2016-01-12 14:33:35 EST
(In reply to Perry Myers from comment #3)
> @eck: Do we need this on OSP5 and OSP6 as well? If so we need this cloned 3
> more times for those releases (OSP5/RHEL6, OSP5/RHEL7, OSP6)

Yeah it'd be nice to get it everywhere.  I guess hold for now until the upstream patch gets accepted, because it looks like it may be slightly more invasive than I originally thought and maybe the backport won't be so straightforward/feasible.  We'll see.  I'll keep the needinfo? to keep it on my radar for clones.

/me goes off to amend the review.
Comment 5 John Eckersberg 2016-01-28 15:33:17 EST
Comment 6 John Eckersberg 2016-02-12 21:17:04 EST
Comment 11 John Eckersberg 2016-02-26 16:35:57 EST
Note that this also requires a config change: set rabbit_qos_prefetch_count=N in nova.conf etc.  The default value of 0 disables QoS and retains the old behavior.  I'd suggest trying 1 at first, so that each consumer takes only one message from the queue at a time, although this might hurt performance if the client can't queue up at least a few messages locally.  IIRC from looking at packet captures of this problem, there were something like 100-150 unacknowledged messages when the recv buffer would fill up, so to be safe I'd suggest a maximum of maybe 50.
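A minimal sketch of the config change (the section name is an assumption based on where the rabbit driver options normally live; on some releases the option may instead be read from [DEFAULT]):

```ini
[oslo_messaging_rabbit]
# 0 (the default) disables QoS entirely; values in the 1-50 range
# are the suggestion from the comment above.
rabbit_qos_prefetch_count = 50
```

Restart the affected services afterwards so the consumers re-declare their channels with the new prefetch.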
Comment 15 Udi Shkalim 2016-03-08 10:08:56 EST
I was unable to recreate the problem.
Version: python-oslo-messaging-1.8.3-4.el7ost.noarch

Steps I took to try to recreate the problem:
1. Deployed a 1 controller / 1 compute setup
2. Restarted the pcs service
3. Disabled the nova-conductor service and watched messages accumulate in the queue: rabbitmqctl list_queues name messages | egrep '^conductor\s'
4. Once it reached 300 messages, re-enabled the nova-conductor service

No tcp.window.zero was captured on port 5672 on the controller.
No error/disconnect messages in the rabbit or nova-conductor logs.

Tried the same steps above after adding the configuration fix:
 - "rabbit_qos_prefetch_count=50" in /etc/nova/nova.conf on the controller

No change.
Comment 16 John Eckersberg 2016-03-08 10:41:08 EST
Let's take a step back here.  We don't really need to reproduce the original circumstances.  There are a lot of timing issues that can make it hard to reproduce.

Instead, we know we want to enable QoS, so let's just focus on that.

To reproduce the original case, just query channel prefetch_count:

# rabbitmqctl list_channels name prefetch_count

You will see the prefetch_count is 0 for all channels.

To verify, set rabbit_qos_prefetch_count to some value other than 0, restart services, and re-run the above command to see that the prefetch_count is changed for those channels to reflect the configured value.
Comment 17 Udi Shkalim 2016-03-08 11:23:12 EST
Got it.

Changed the value of rabbit_qos_prefetch_count to 50.
# rabbitmqctl list_channels name prefetch_count | grep 50
 -> (1)	50
 -> (1)	50
 -> (1)	50
 -> (1)	50
 -> (1)	50
 -> (1)	50
Comment 19 errata-xmlrpc 2016-03-10 12:06:15 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

Comment 20 Peter Lemenkov 2016-07-21 17:21:17 EDT
*** Bug 1342629 has been marked as a duplicate of this bug. ***
