Red Hat Bugzilla – Bug 1295896
Utilize QoS to prevent excessive client side queueing of messages
Last modified: 2016-07-21 17:21:17 EDT
This is problematic with the way we've got rabbitmq configured, particularly on older versions without AMQP heartbeat support. We set TCP_USER_TIMEOUT to 5s in order to quickly notice failed connections. What happens is roughly:
- There are a bunch of messages in a queue
- Because of no QoS, they all get flushed to the consumer(s)
- The consumer(s) can't process them fast enough, meaning they don't call recv() on the socket
- Messages buffer in the kernel on the consumer, using up the size of the recv buffer until it's full and the window drops to zero
- The server probes the zero window for 5 seconds, hits TCP_USER_TIMEOUT, and closes the connection
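The stall in the steps above can be sketched with a plain loopback TCP connection (a hypothetical stand-in for the RabbitMQ server-to-consumer link; the buffer size is illustrative, not what rabbitmq actually uses). The "consumer" side never calls recv(), so the kernel buffers fill up and the "server" can no longer write:

```python
import socket

# "Server" side: a loopback listener standing in for the broker.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)

# "Consumer" side: connects but will never call recv(), like a consumer
# too busy processing messages to drain its socket.
consumer = socket.create_connection(listener.getsockname())
consumer.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)  # small on purpose

server, _ = listener.accept()
server.setblocking(False)

sent = 0
try:
    while True:  # flood "messages" with no QoS limit
        sent += server.send(b"x" * 1024)
except BlockingIOError:
    # Send buffer and peer recv buffer are full; on a real TCP link this is
    # where the window drops to zero and the server starts probing.
    pass

print("bytes buffered before the sender stalled:", sent)
server.close()
consumer.close()
listener.close()
```

Once the send raises, nothing moves until the consumer reads; with a 5s TCP_USER_TIMEOUT the real server gives up instead of waiting.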
The 5s timeout is primarily there for weird behavior during VIP failover, which we don't even use at present. See http://john.eckersberg.com/improving-ha-failures-with-tcp-timeouts.html. Maybe we should just turn it off for now...
@eck: Do we need this on OSP5 and OSP6 as well? If so we need this cloned 3 more times for those releases (OSP5/RHEL6, OSP5/RHEL7, OSP6)
(In reply to Perry Myers from comment #3)
> @eck: Do we need this on OSP5 and OSP6 as well? If so we need this cloned 3
> more times for those releases (OSP5/RHEL6, OSP5/RHEL7, OSP6)
Yeah, it'd be nice to get it everywhere. I guess hold for now until the upstream patch gets accepted; it looks like it may be slightly more invasive than I originally thought, so the backport may not be so straightforward (or feasible). We'll see. I'll keep the needinfo? set to keep the clones on my radar.
/me goes off to amend the review.
Note that this requires a config change as well, by setting rabbit_qos_prefetch_count=N in nova.conf etc. The default value of 0 disables QoS and retains the old behavior. I would suggest trying 1 at first, so that each consumer takes only one message from the queue at a time, although this might hurt performance if the client can't queue up at least a few messages locally. IIRC from looking at packet captures of this problem, there were something like 100-150 unacknowledged messages when the recv buffer filled up, so to be safe I'd suggest a maximum of maybe 50.
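A minimal sketch of what the prefetch count buys us, as a simplified model of AMQP basic.qos (the deliver() helper is hypothetical, not broker code): with prefetch 0 the broker flushes the whole backlog to the consumer at once, while prefetch N caps in-flight deliveries at N until the consumer acks.

```python
from collections import deque

def deliver(queue, prefetch, unacked):
    """Simplified broker delivery loop: push messages to a consumer until
    the queue is empty or the prefetch window of unacked messages is full.
    prefetch == 0 means QoS is disabled (no limit)."""
    delivered = []
    while queue and (prefetch == 0 or len(unacked) < prefetch):
        msg = queue.popleft()
        unacked.append(msg)   # message is now in flight, awaiting ack
        delivered.append(msg)
    return delivered

# No QoS: all 300 backlogged messages are pushed to the consumer at once,
# which is what overflows the recv buffer in this bug.
backlog = deque(range(300))
assert len(deliver(backlog, prefetch=0, unacked=[])) == 300

# rabbit_qos_prefetch_count=50: at most 50 messages are in flight at a time.
backlog = deque(range(300))
assert len(deliver(backlog, prefetch=50, unacked=[])) == 50
```

With the cap in place, client-side buffering is bounded regardless of how deep the queue gets.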
I was unable to recreate the problem.
Steps I took to try to recreate the problem:
1. Deployed a 1 controller / 1 compute setup
2. Restarted the pcs service
3. Disabled the nova-conductor service and watched messages accumulate in the queue: rabbitmqctl list_queues name messages | egrep '^conductor\s'
4. Once the queue reached 300 messages, re-enabled the nova-conductor service
No tcp.window.zero was captured on port 5672 on the controller.
No error/disconnected messages in rabbit or nova-conductor log.
Tried the same steps above after adding the configuration fix:
- "rabbit_qos_prefetch_count=50" in /etc/nova/nova.conf on the controller
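As a nova.conf fragment, the change would look something like the following (a sketch; the section name is an assumption based on releases where the rabbit driver options live under [oslo_messaging_rabbit], and older releases may read the option from [DEFAULT] instead):

```ini
[oslo_messaging_rabbit]
# 0 (the default) disables QoS; 50 caps unacked deliveries per consumer
rabbit_qos_prefetch_count = 50
```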
Let's take a step back here. We don't really need to reproduce the original circumstances; there are a lot of timing issues that can make it hard to reproduce.
Instead, we know we want to enable QoS, so let's just focus on that.
To reproduce the original case, just query channel prefetch_count:
# rabbitmqctl list_channels name prefetch_count
You will see the prefetch_count is 0 for all channels.
To verify, set rabbit_qos_prefetch_count to some value other than 0, restart services, and re-run the above command to see that the prefetch_count is changed for those channels to reflect the configured value.
Changed the value of rabbit_qos_prefetch_count to 50.
rabbitmqctl list_channels name prefetch_count | grep 50
172.17.1.12:52630 -> 172.17.1.12:5672 (1) 50
172.17.1.12:52795 -> 172.17.1.12:5672 (1) 50
172.17.1.12:52855 -> 172.17.1.12:5672 (1) 50
172.17.1.12:52856 -> 172.17.1.12:5672 (1) 50
172.17.1.12:52857 -> 172.17.1.12:5672 (1) 50
172.17.1.12:52859 -> 172.17.1.12:5672 (1) 50
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
*** Bug 1342629 has been marked as a duplicate of this bug. ***