Bug 1434593 - rabbitmqctl list_queues/list_connections hangs indefinitely in OSP10 due to missing iptables rules.
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: z4
Target Release: 10.0 (Newton)
Assignee: Peter Lemenkov
QA Contact: Udi Shkalim
Duplicates: 1466803 1474507 1640455 1742842
Reported: 2017-03-21 21:20 UTC by Vincent S. Cojot
Modified: 2023-10-06 17:36 UTC

Fixed In Version: rabbitmq-server-3.6.3-7.el7ost
Last Closed: 2017-09-06 17:06:29 UTC

Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | rabbitmq rabbitmq-server pull 683 | 0 | None | closed | Listing items in parallel | 2020-07-04 14:28:47 UTC
Red Hat Issue Tracker | OSP-4614 | 0 | None | None | None | 2022-03-13 14:43:44 UTC
Red Hat Knowledge Base (Solution) | 3121151 | 0 | None | None | None | 2017-07-20 12:52:02 UTC
Red Hat Product Errata | RHBA-2017:2653 | 0 | normal | SHIPPED_LIVE | Red Hat OpenStack Platform 10 Bug Fix and Enhancement Advisory | 2017-09-06 20:54:38 UTC

Description Vincent S. Cojot 2017-03-21 21:20:25 UTC
Description of problem:

On OSP10 overcloud controllers, rabbitmqctl hangs indefinitely after printing some output.
I noted that flushing the iptables rules (iptables -F) on the controllers fixes the problem.

We have seen hangs of over 30 minutes without any activity:

[root@krynn-ctrl-0 ~]# rabbitmqctl list_connections
Listing connections ...
guest   10.0.0.18       48032   running

^C
Session terminated, killing shell... ...killed.

One thing I noticed is that running rabbitmqctl seems to put 'beam.smp' into a strange mode where it opens a LISTENing socket on a 3rd (ephemeral) port. Of course, we don't have this port in our iptables rules, hence the hang.

Here's rabbitmq/beam under normal activity:
[root@krynn-ctrl-0 ~]# netstat -anp|grep beam|grep LISTEN
tcp        0      0 10.0.0.24:5672          0.0.0.0:*               LISTEN      469305/beam.smp     
tcp        0      0 0.0.0.0:25672           0.0.0.0:*               LISTEN      469305/beam.smp     

Here's the same rabbit when I'm running 'rabbitmqctl list_connections' in another window:
[root@krynn-ctrl-0 ~]# netstat -anp|grep beam|grep LISTEN
tcp        0      0 0.0.0.0:34484           0.0.0.0:*               LISTEN      789613/beam.smp     
tcp        0      0 10.0.0.24:5672          0.0.0.0:*               LISTEN      469305/beam.smp     
tcp        0      0 0.0.0.0:25672           0.0.0.0:*               LISTEN      469305/beam.smp     

See the listener on port 34484? The port changes every time. Here's another run (ctrl-c the previous rabbitmqctl and run it again):
[root@krynn-ctrl-0 ~]# netstat -anp|grep beam|grep LISTEN
tcp        0      0 0.0.0.0:46356           0.0.0.0:*               LISTEN      793007/beam.smp     
tcp        0      0 10.0.0.24:5672          0.0.0.0:*               LISTEN      469305/beam.smp     
tcp        0      0 0.0.0.0:25672           0.0.0.0:*               LISTEN      469305/beam.smp     

This is OSP10 with the 20170228 RHOSP images.

Comment 1 Vincent S. Cojot 2017-03-21 21:24:42 UTC
This happens on a freshly deployed OSP10 with the latest patches (20170321).

4369, 5672, 33239 and 25672 are the only RabbitMQ ports in the iptables rules.

[root@krynn-ctrl-0 ~]# iptables -L -nvv|grep 4369
 2545  153K ACCEPT     tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            multiport dports 4369,5672,33239,25672 /* 109 rabbitmq */ state NEW
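
The comparison between the listening sockets and the allowed ports can be scripted. The following is a minimal sketch, not part of the original report: the helper name unexpected_beam_ports is hypothetical, and it assumes the `netstat -anp` column layout shown in the description. Note that in the netstat output above the extra listener belongs to a different PID (789613, then 793007) than the broker (469305), which suggests it is a separate beam.smp process started for the rabbitmqctl run.

```shell
#!/bin/sh
# Hypothetical helper: read `netstat -anp` output on stdin and print every
# port on which a beam.smp process is LISTENing that is NOT in the list of
# allowed ports passed as arguments.
unexpected_beam_ports() {
    allowed=" $* "
    awk -v allowed="$allowed" '
        $6 == "LISTEN" && $7 ~ /\/beam\.smp$/ {
            # Field 4 is the local address, e.g. 0.0.0.0:34484
            n = split($4, parts, ":")
            port = parts[n]
            if (index(allowed, " " port " ") == 0) print port
        }'
}

# Example (run as root on a controller while rabbitmqctl is hanging):
#   netstat -anp | unexpected_beam_ports 4369 5672 33239 25672
```

Any port printed by this helper is a beam.smp listener that the "109 rabbitmq" rule does not cover.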

Comment 2 Matt Flusche 2017-05-15 17:51:25 UTC
Hi,

I'm working with a customer who hit this issue also; any update on a fix or work-around?

Comment 3 David Hill 2017-05-16 13:57:09 UTC
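As a workaround, delete all the INPUT -j REJECT rules until we have a proper fix. This looks like an FTP-DATA-style issue (the client opens a second, dynamically chosen port for the peer to connect back to), and so far I haven't found a configuration parameter that would change this behavior in the rabbitmqctl client.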

Comment 4 Vincent S. Cojot 2017-05-18 14:37:25 UTC
[stack@instack ~]$ ansible -i hosts -m command -a 'sudo /sbin/iptables --line-numbers -n -v -L INPUT' ctrl |grep REJECT
75    3597  160K REJECT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            reject-with icmp-host-prohibited
75    4421  208K REJECT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            reject-with icmp-host-prohibited
75    5847  282K REJECT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            reject-with icmp-host-prohibited

[stack@instack ~]$ ansible -i hosts -m command -a 'sudo /sbin/iptables --line-numbers -n -v -L FORWARD' ctrl |grep REJECT
3        0     0 REJECT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            reject-with icmp-host-prohibited
3        0     0 REJECT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            reject-with icmp-host-prohibited
3        0     0 REJECT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            reject-with icmp-host-prohibited

So the REJECT rules are line 75 in INPUT and line 3 in FORWARD on each controller.

Comment 5 Vincent S. Cojot 2017-05-18 14:37:56 UTC
Deleted those lines:
[stack@instack ~]$ ansible -i hosts -m command -a 'sudo /sbin/iptables -v -D INPUT 75' ctrl 
krynn-ctrl-1 | SUCCESS | rc=0 >>


krynn-ctrl-0 | SUCCESS | rc=0 >>


krynn-ctrl-2 | SUCCESS | rc=0 >>


[stack@instack ~]$ ansible -i hosts -m command -a 'sudo /sbin/iptables -v -D FORWARD 3' ctrl 
krynn-ctrl-1 | SUCCESS | rc=0 >>


krynn-ctrl-0 | SUCCESS | rc=0 >>


krynn-ctrl-2 | SUCCESS | rc=0 >>

Comment 6 Vincent S. Cojot 2017-05-18 15:18:12 UTC
Even with the REJECT rules removed, I am still seeing the hang.

Comment 7 Vincent S. Cojot 2017-05-18 18:06:04 UTC
When I "strace -f -s1024 rabbitmqctl list_queues", as it starts hanging, I'm seeing a repeated pattern of:

156189 sched_yield()                    = 0
156189 sched_yield()                    = 0
156189 sched_yield()                    = 0
156189 sched_yield()                    = 0
156189 sched_yield()                    = 0
156189 sched_yield()                    = 0
156189 sched_yield()                    = 0
156189 sched_yield()                    = 0
156189 sched_yield()                    = 0
156189 sched_yield()                    = 0
156189 futex(0x7ff5638c0450, FUTEX_WAIT_PRIVATE, 4294967295, {14, 992556869} <unfinished ...>
156191 <... ppoll resumed> )            = 1 ([{fd=44, revents=POLLIN|POLLRDNORM}], left {3, 795857414})
156191 recvfrom(44, "\0\0\0\0", 1460, 0, NULL, NULL) = 4
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {3, 792331959}, NULL, 8) = 0 (Timeout)
156191 writev(44, [{"\0\0\0\0", 4}], 1) = 4
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {14, 999423541}, NULL, 8^C

Comment 8 Vincent S. Cojot 2017-05-18 18:15:23 UTC
As it's hanging, I'm seeing a steady flow of the following in /var/log/messages:

May 18 14:08:08 krynn-ctrl-1 kernel: IN=vlan10 OUT= MAC=f6:38:95:25:65:1d:de:c6:92:2e:91:62:08:00 SRC=10.0.0.15 DST=10.0.0.21 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=16489 DF PROTO=TCP SPT=60166 DPT=41192 WINDOW=65535 RES=0x00 SYN URGP=0 
May 18 14:08:08 krynn-ctrl-1 kernel: IN=vlan10 OUT= MAC=f6:38:95:25:65:1d:2a:1c:44:a7:30:53:08:00 SRC=10.0.0.18 DST=10.0.0.21 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=15229 DF PROTO=TCP SPT=41427 DPT=41192 WINDOW=65535 RES=0x00 SYN URGP=0 
May 18 14:08:09 krynn-ctrl-1 kernel: IN=vlan10 OUT= MAC=f6:38:95:25:65:1d:de:c6:92:2e:91:62:08:00 SRC=10.0.0.15 DST=10.0.0.21 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=16490 DF PROTO=TCP SPT=60166 DPT=41192 WINDOW=65535 RES=0x00 SYN URGP=0 
May 18 14:08:09 krynn-ctrl-1 kernel: IN=vlan10 OUT= MAC=f6:38:95:25:65:1d:2a:1c:44:a7:30:53:08:00 SRC=10.0.0.18 DST=10.0.0.21 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=15230 DF PROTO=TCP SPT=41427 DPT=41192 WINDOW=65535 RES=0x00 SYN URGP=0 
May 18 14:08:11 krynn-ctrl-1 kernel: IN=vlan10 OUT= MAC=f6:38:95:25:65:1d:de:c6:92:2e:91:62:08:00 SRC=10.0.0.15 DST=10.0.0.21 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=16491 DF PROTO=TCP SPT=60166 DPT=41192 WINDOW=65535 RES=0x00 SYN URGP=0 
May 18 14:08:11 krynn-ctrl-1 kernel: IN=vlan10 OUT= MAC=f6:38:95:25:65:1d:2a:1c:44:a7:30:53:08:00 SRC=10.0.0.18 DST=10.0.0.21 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=15231 DF PROTO=TCP SPT=41427 DPT=41192 WINDOW=65535 RES=0x00 SYN URGP=0 
^C

Port 41192 is rabbitmq:
[root@krynn-ctrl-1 log]# netstat -anp|grep 41192
tcp        0      0 0.0.0.0:41192           0.0.0.0:*               LISTEN      156024/beam.smp     

10.0.0.15, 10.0.0.18 and 10.0.0.21 are the IPs of my controllers on the internal API VLAN:
# grep 10.0.0 /etc/hosts|grep interna
10.0.0.15 krynn-ctrl-0.internalapi.localdomain krynn-ctrl-0.internalapi
10.0.0.21 krynn-ctrl-1.internalapi.localdomain krynn-ctrl-1.internalapi
10.0.0.18 krynn-ctrl-2.internalapi.localdomain krynn-ctrl-2.internalapi
10.0.0.19 krynn-cmpt-0.internalapi.localdomain krynn-cmpt-0.internalapi
10.0.0.14       overcloud.internalapi.localdomain       # FQDN of the internal api VIP
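
The log lines above show the other two controllers (10.0.0.15 and 10.0.0.18) repeatedly sending SYN packets to the ephemeral beam.smp listener on this node. A minimal sketch for summarizing such entries follows; the helper name blocked_syn_summary is hypothetical, and it assumes the kernel log format shown in this comment.

```shell
#!/bin/sh
# Hypothetical helper: read kernel firewall log lines on stdin and print
# "SRC DPT count" for each logged TCP SYN, sorted by source address.
blocked_syn_summary() {
    awk '
        / SYN / && /PROTO=TCP/ {
            src = ""; dpt = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^SRC=/) src = substr($i, 5)
                if ($i ~ /^DPT=/) dpt = substr($i, 5)
            }
            if (src != "" && dpt != "") count[src " " dpt]++
        }
        END { for (k in count) print k, count[k] }' | sort
}

# Example:
#   grep 'kernel: IN=' /var/log/messages | blocked_syn_summary
```

A destination port in this summary that also shows up as a beam.smp listener in netstat, but not in the iptables rules, points at the connection that is being blocked.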

Comment 9 Vincent S. Cojot 2017-05-18 18:20:06 UTC
If I add a simple iptables rule that accepts all traffic between hosts on the internal API network, list_queues does not hang anymore:
ansible -i hosts -m command -a 'sudo /sbin/iptables -I INPUT 1 -i vlan10 -d 10.0.0.0/24 -s 10.0.0.0/24 -j ACCEPT' ctrl

In my env file for this deploy, I have:

InternalApiNetCidr: 10.0.0.0/24
InternalApiNetworkVlanID: 10
InternalApiAllocationPools: [{'start': '10.0.0.10', 'end': '10.0.0.200'}]

Result:
[root@krynn-ctrl-1 log]# time rabbitmqctl list_connections > /dev/null 

real    0m1.587s
user    0m0.646s
sys     0m0.485s
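
The rule above opens all traffic within the internal API subnet. A narrower variant might pin the port range used by the temporary rabbitmqctl node and allow only that range. The sketch below is an untested assumption, not something verified in this report: RABBITMQ_CTL_ERL_ARGS and the Erlang kernel settings inet_dist_listen_min / inet_dist_listen_max exist, but the range 35672-35682, the helper names, and whether this is effective with the rabbitmq-server build shipped in OSP10 are all assumptions. The helpers only print strings; nothing touches the firewall.

```shell
#!/bin/sh
# Assumed port range for the temporary rabbitmqctl node (not from this report).
CTL_PORT_MIN=35672
CTL_PORT_MAX=35682

# Hypothetical helper: Erlang kernel flags that restrict the distribution
# listener to the range MIN..MAX.
ctl_erl_args() {
    printf '%s' "-kernel inet_dist_listen_min $1 -kernel inet_dist_listen_max $2"
}

# Hypothetical helper: print (do not run) an iptables rule that accepts TCP
# connections to the range MIN..MAX from the given source CIDR.
ctl_iptables_rule() {
    printf 'iptables -I INPUT 1 -p tcp -s %s --dport %s:%s -j ACCEPT\n' "$3" "$1" "$2"
}

# Example (review the printed rule before applying it by hand):
#   ctl_iptables_rule "$CTL_PORT_MIN" "$CTL_PORT_MAX" 10.0.0.0/24
#   RABBITMQ_CTL_ERL_ARGS="$(ctl_erl_args "$CTL_PORT_MIN" "$CTL_PORT_MAX")" \
#       rabbitmqctl list_connections
```

If this works in a given environment, the firewall only needs to admit a small fixed range from the cluster peers instead of the whole subnet.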

Comment 10 Peter Lemenkov 2017-05-22 14:04:08 UTC
(In reply to Vincent S. Cojot from comment #9)
> If I add a simple iptables rule, list_queues does not hang anymore:
> ansible -i hosts -m command -a 'sudo /sbin/iptables -I INPUT 1 -i vlan10 -d
> 10.0.0.0/24 -s 10.0.0.0/24 -j ACCEPT' ctrl
> 
> In my env file for this deploy, I have:
> 
> InternalApiNetCidr: 10.0.0.0/24
> InternalApiNetworkVlanID: 10
> InternalApiAllocationPools: [{'start': '10.0.0.10', 'end': '10.0.0.200'}]

Thanks for debugging this! So we should add/modify an extra iptables deployment rule while building an OpenStack cluster. I guess this should be reassigned to the Director then.

Comment 11 Peter Lemenkov 2017-05-22 14:35:03 UTC
(In reply to Vincent S. Cojot from comment #8)
> As it's hanging, I'm seeing a steady flow of the following in
> /var/log/messages:

> Port 41192 is rabbitmq:
> [root@krynn-ctrl-1 log]# netstat -anp|grep 41192
> tcp        0      0 0.0.0.0:41192           0.0.0.0:*               LISTEN      156024/beam.smp

Honestly, I can't remember where port 41192 comes from. My guess is that the Erlang distribution port was reassigned once again (?).

Comment 12 Peter Lemenkov 2017-05-22 14:46:35 UTC
Sorry for reassigning. I'll keep it assigned to rabbitmq-server.

Comment 13 Benjamin Schmaus 2017-05-25 11:52:24 UTC
Customer confirmed this issue present in OSP11 as well:

puppet-tripleo-6.3.0-12.el7ost.noarch
openstack-tripleo-heat-templates-6.0.0-10.el7ost.noarch

Completely removing the DROP rules works around the issue.

Comment 19 Peter Lemenkov 2017-06-30 13:56:43 UTC
*** Bug 1466803 has been marked as a duplicate of this bug. ***

Comment 23 Shinobu KINJO 2017-08-01 03:32:19 UTC
Which upstream commit caused this regression?

Comment 31 Udi Shkalim 2017-08-17 12:59:49 UTC
Verified on rabbitmq-server-3.6.3-7.el7ost.noarch

Ran rabbitmqctl list_connections multiple times under traffic (while booting instances); the command no longer gets stuck.

Comment 33 errata-xmlrpc 2017-09-06 17:06:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2653

Comment 34 Peter Lemenkov 2017-09-12 15:01:25 UTC
*** Bug 1474507 has been marked as a duplicate of this bug. ***

Comment 35 Peter Lemenkov 2019-08-19 14:33:17 UTC
*** Bug 1742842 has been marked as a duplicate of this bug. ***

Comment 36 Luca Miccini 2019-10-25 09:47:10 UTC
*** Bug 1640455 has been marked as a duplicate of this bug. ***

