+++ This bug was initially created as a clone of Bug #1295896 +++

https://bugs.launchpad.net/oslo.messaging/+bug/1531222

--- Additional comment from John Eckersberg on 2016-01-05 13:22:42 EST ---

This is problematic with the way we've got rabbitmq configured, particularly on older versions without AMQP heartbeat support. We set TCP_USER_TIMEOUT to 5s in order to quickly notice failed connections [1]. What happens is roughly:

- There are a bunch of messages in a queue
- Because of no QoS, they all get flushed to the consumer(s)
- The consumer(s) can't process them fast enough, meaning they don't call recv() on the socket
- Messages buffer in the kernel on the consumer, using up the size of the recv buffer until it's full and the window drops to zero
- The server probes the zero window for 5 seconds, hits the timeout, and closes the connection due to timeout

[1] This is primarily due to weird behavior during VIP failover which we don't even use presently. See http://john.eckersberg.com/improving-ha-failures-with-tcp-timeouts.html. Maybe we should just turn this off for now...

--- Additional comment from John Eckersberg on 2016-01-07 13:31:21 EST ---

https://review.openstack.org/#/c/264911/

--- Additional comment from Perry Myers on 2016-01-12 10:01:11 EST ---

@eck: Do we need this on OSP5 and OSP6 as well? If so we need this cloned 3 more times for those releases (OSP5/RHEL6, OSP5/RHEL7, OSP6)

--- Additional comment from John Eckersberg on 2016-01-12 14:33:35 EST ---

(In reply to Perry Myers from comment #3)
> @eck: Do we need this on OSP5 and OSP6 as well? If so we need this cloned 3
> more times for those releases (OSP5/RHEL6, OSP5/RHEL7, OSP6)

Yeah it'd be nice to get it everywhere. I guess hold for now until the upstream patch gets accepted, because it looks like it may be slightly more invasive than I originally thought and maybe the backport won't be so straightforward/feasible. We'll see. I'll keep the needinfo? to keep it on my radar for clones.
/me goes off to amend the review.
https://code.engineering.redhat.com/gerrit/#/c/66702/
Closing this because the backport from upstream is more involved and probably not worth the effort for now.
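
For anyone landing here later, the failure mode from the 2016-01-05 comment can be sketched in Python. This is an illustration only, not the upstream patch and not how rabbitmq is actually configured (the thread does not show that). The helper names `set_user_timeout` and `sender_gives_up_after` are hypothetical, and the second function is a toy model that ignores how long the buffer takes to fill. On Linux, TCP_USER_TIMEOUT is given in milliseconds and covers both unacknowledged data and data left untransmitted because of a zero window.

```python
import socket

# 5s, as in the comment above. The Linux option takes milliseconds.
USER_TIMEOUT_MS = 5000


def set_user_timeout(sock, timeout_ms=USER_TIMEOUT_MS):
    """Set TCP_USER_TIMEOUT on a TCP socket, if this platform exposes it.

    Returns True when the option was set, False when it is unavailable.
    """
    opt = getattr(socket, "TCP_USER_TIMEOUT", None)  # Linux only
    if opt is None:
        return False
    sock.setsockopt(socket.IPPROTO_TCP, opt, timeout_ms)
    return True


def sender_gives_up_after(backlog_bytes, recv_buffer_bytes,
                          consumer_stall_s, user_timeout_s=5.0):
    """Toy model of the zero-window timeout described in the bug.

    Returns the number of seconds after which the sender closes the
    connection, or None if the connection survives.
    """
    # Everything fits in the consumer's kernel recv buffer, so the
    # sender has nothing left untransmitted and no timer applies.
    if backlog_bytes <= recv_buffer_bytes:
        return None
    # The buffer is full and the window is zero, but the consumer
    # calls recv() again before the timeout expires.
    if consumer_stall_s < user_timeout_s:
        return None
    # The window stays at zero for the whole timeout: connection closed.
    return user_timeout_s


if __name__ == "__main__":
    # Large backlog, consumer busy for 30s: dropped after 5s.
    print(sender_gives_up_after(10_000_000, 200_000, 30))
    # Same backlog, consumer only stalls for 2s: connection survives.
    print(sender_gives_up_after(10_000_000, 200_000, 2))
```

In this model, a prefetch/QoS limit corresponds to keeping `backlog_bytes` under `recv_buffer_bytes`, so the window is never held at zero with data still waiting to be sent.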