Bug 1295896

Summary: Utilize QoS to prevent excessive client side queueing of messages
Product: Red Hat OpenStack
Reporter: John Eckersberg <jeckersb>
Component: python-oslo-messaging
Assignee: John Eckersberg <jeckersb>
Status: CLOSED ERRATA
QA Contact: Udi Shkalim <ushkalim>
Severity: high
Priority: high
Version: 8.0 (Liberty)
CC: apevec, dmaley, fpercoco, ggillies, jeckersb, lhh, mfuruta, mwagner, plemenko, rhosp-bugs-internal, ushkalim, yeylon
Target Milestone: async
Keywords: ZStream
Target Release: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: python-oslo-messaging-1.8.3-4.el7ost
Doc Type: Bug Fix
Cloned to: 1302861, 1302873, 1310807
Last Closed: 2016-03-10 17:06:15 UTC
Type: Bug
Bug Blocks: 1302861, 1302873, 1310807    

Comment 1 John Eckersberg 2016-01-05 18:22:42 UTC
This is problematic with the way we've got rabbitmq configured, particularly on older versions without AMQP heartbeat support.  We set TCP_USER_TIMEOUT to 5s in order to quickly notice failed connections [1].  What happens is roughly:

- There are a bunch of messages in a queue

- Because of no QoS, they all get flushed to the consumer(s)

- The consumer(s) can't process them fast enough, meaning they don't call recv() on the socket

- Messages buffer in the kernel on the consumer side, filling the socket's recv buffer until it is full and the TCP window drops to zero

- The server probes the zero window for 5 seconds, hits TCP_USER_TIMEOUT, and closes the connection
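
The mechanism that prevents this flood is per-channel AMQP QoS (basic.qos).  As a rough illustration only (not the actual oslo.messaging code, which is built on kombu; the host and queue names here are placeholders), a consumer with a prefetch limit looks something like this in pika 1.x:

    import pika

    # Hypothetical local broker; adjust host/credentials for a real deployment.
    connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
    channel = connection.channel()

    # The key call: with prefetch_count set, the broker stops pushing messages
    # once this many are unacknowledged, instead of flushing the whole queue
    # into the consumer's socket buffer.
    channel.basic_qos(prefetch_count=50)

    def on_message(ch, method, properties, body):
        # ... process the message ...
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue='conductor', on_message_callback=on_message)
    channel.start_consuming()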


[1] This is primarily due to weird behavior during VIP failover which we don't even use presently.  See http://john.eckersberg.com/improving-ha-failures-with-tcp-timeouts.html.  Maybe we should just turn this off for now...
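
For reference, TCP_USER_TIMEOUT is a per-socket option (value 18 in linux/tcp.h); at 5000 ms the kernel aborts a connection once transmitted data has gone unacknowledged for that long.  A generic Python sketch of the option itself (not the RabbitMQ-side configuration described in the linked post):

    import socket

    # Fall back to the raw constant on Pythons that don't expose it.
    TCP_USER_TIMEOUT = getattr(socket, 'TCP_USER_TIMEOUT', 18)

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Abort the connection if sent data stays unacknowledged for 5000 ms,
    # instead of waiting out the full retransmission backoff.
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, 5000)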

Comment 2 John Eckersberg 2016-01-07 18:31:21 UTC
https://review.openstack.org/#/c/264911/

Comment 3 Perry Myers 2016-01-12 15:01:11 UTC
@eck: Do we need this on OSP5 and OSP6 as well? If so we need this cloned 3 more times for those releases (OSP5/RHEL6, OSP5/RHEL7, OSP6)

Comment 4 John Eckersberg 2016-01-12 19:33:35 UTC
(In reply to Perry Myers from comment #3)
> @eck: Do we need this on OSP5 and OSP6 as well? If so we need this cloned 3
> more times for those releases (OSP5/RHEL6, OSP5/RHEL7, OSP6)

Yeah it'd be nice to get it everywhere.  I guess hold for now until the upstream patch gets accepted, because it looks like it may be slightly more invasive than I originally thought and maybe the backport won't be so straightforward/feasible.  We'll see.  I'll keep the needinfo? to keep it on my radar for clones.

/me goes off to amend the review.

Comment 5 John Eckersberg 2016-01-28 20:33:17 UTC
https://code.engineering.redhat.com/gerrit/#/c/66698/

Comment 6 John Eckersberg 2016-02-13 02:17:04 UTC
https://review.openstack.org/#/c/279834/

Comment 11 John Eckersberg 2016-02-26 21:35:57 UTC
Note that this also requires a config change: set rabbit_qos_prefetch_count=N in nova.conf etc.  The default value of 0 disables QoS and retains the old behavior.  I would suggest trying 1 at first so that each consumer only takes one thing from the queue at a time, although that might hurt performance if it can't queue up at least a few things on the client side.  IIRC from looking at packet captures of this problem, there were something like 100-150 unacknowledged messages when the recv buffer would fill up, so to be safe I'd suggest a maximum of maybe 50.
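
For example, assuming the new option is read from the same [oslo_messaging_rabbit] section as the other rabbit_* options, the change in nova.conf would look like:

    [oslo_messaging_rabbit]
    # 0 (the default) disables QoS; a small value caps the number of
    # unacknowledged messages the broker will push to each consumer.
    rabbit_qos_prefetch_count = 50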

Comment 15 Udi Shkalim 2016-03-08 15:08:56 UTC
I was unable to recreate the problem.
Version: python-oslo-messaging-1.8.3-4.el7ost.noarch

Steps I took to try to recreate the problem:
1. Deployed a setup with 1 controller and 1 compute node
2. Restarted the pcs services
3. Disabled the nova-conductor service and watched messages accumulate in the queue: rabbitmqctl list_queues name messages | egrep '^conductor\s'
4. Once the queue reached 300 messages, re-enabled the nova-conductor service

No tcp.window.zero was captured on port 5672 on the controller, and there were no error/disconnect messages in the rabbit or nova-conductor logs.

Tried the same steps above after adding the configuration fix:
 - rabbit_qos_prefetch_count=50 in /etc/nova/nova.conf on the controller

No change.
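
For reference, the zero-window check was a packet capture on port 5672, along these lines (the exact invocation isn't recorded here; interface name and tshark flags vary by version):

    # flag zero-window TCP segments to/from the AMQP port (5672)
    tshark -i any -f 'tcp port 5672' -Y 'tcp.analysis.zero_window'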

Comment 16 John Eckersberg 2016-03-08 15:41:08 UTC
Let's take a step back here.  We don't really need to reproduce the original circumstances.  There are a lot of timing issues that can make it hard to reproduce.

Instead, we know we want to enable QoS, so let's just focus on that.

To see the original (pre-fix) behavior, just query the channel prefetch_count:

# rabbitmqctl list_channels name prefetch_count

You will see the prefetch_count is 0 for all channels.

To verify the fix, set rabbit_qos_prefetch_count to some value other than 0, restart services, and re-run the above command to confirm that the prefetch_count for those channels now reflects the configured value.
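
Concretely, on the controller this boils down to something like the following (the config section and service name are examples; in an HA setup the services may be pacemaker-managed):

    # set the option (assuming the oslo_messaging_rabbit section)
    crudini --set /etc/nova/nova.conf oslo_messaging_rabbit rabbit_qos_prefetch_count 50

    # restart an affected consumer, e.g. nova-conductor
    systemctl restart openstack-nova-conductor

    # channels opened by the restarted service should now report the new value
    rabbitmqctl list_channels name prefetch_count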

Comment 17 Udi Shkalim 2016-03-08 16:23:12 UTC
Got it.

Changed the value of rabbit_qos_prefetch_count to 50.
 
rabbitmqctl list_channels name prefetch_count  | grep 50
172.17.1.12:52630 -> 172.17.1.12:5672 (1)	50
172.17.1.12:52795 -> 172.17.1.12:5672 (1)	50
172.17.1.12:52855 -> 172.17.1.12:5672 (1)	50
172.17.1.12:52856 -> 172.17.1.12:5672 (1)	50
172.17.1.12:52857 -> 172.17.1.12:5672 (1)	50
172.17.1.12:52859 -> 172.17.1.12:5672 (1)	50

Comment 19 errata-xmlrpc 2016-03-10 17:06:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0437.html

Comment 20 Peter Lemenkov 2016-07-21 21:21:17 UTC
*** Bug 1342629 has been marked as a duplicate of this bug. ***