Description of problem:

On OSP10 overcloud controllers, rabbitmqctl hangs indefinitely after producing some output. I noticed that flushing the iptables rules (iptables -F) on the controllers fixes the problem. We have seen hangs of over 30 minutes without any activity:

[root@krynn-ctrl-0 ~]# rabbitmqctl list_connections
Listing connections ...
guest	10.0.0.18	48032	running
^C
Session terminated, killing shell... ...killed.

One thing I noticed is that running rabbitmqctl puts 'beam.smp' into a strange mode where it opens a LISTENing socket on a third (ephemeral) port. Of course, we don't have this port in our iptables rules, hence the hang.

Here's rabbitmq/beam under normal activity:

[root@krynn-ctrl-0 ~]# netstat -anp|grep beam|grep LISTEN
tcp        0      0 10.0.0.24:5672          0.0.0.0:*               LISTEN      469305/beam.smp
tcp        0      0 0.0.0.0:25672           0.0.0.0:*               LISTEN      469305/beam.smp

Here's the same rabbit while I'm running 'rabbitmqctl list_connections' in another window:

[root@krynn-ctrl-0 ~]# netstat -anp|grep beam|grep LISTEN
tcp        0      0 0.0.0.0:34484           0.0.0.0:*               LISTEN      789613/beam.smp
tcp        0      0 10.0.0.24:5672          0.0.0.0:*               LISTEN      469305/beam.smp
tcp        0      0 0.0.0.0:25672           0.0.0.0:*               LISTEN      469305/beam.smp

See the socket on port 34484? It changes every time. Here's another run (Ctrl-C the previous rabbitmqctl and run it again):

[root@krynn-ctrl-0 ~]# netstat -anp|grep beam|grep LISTEN
tcp        0      0 0.0.0.0:46356           0.0.0.0:*               LISTEN      793007/beam.smp
tcp        0      0 10.0.0.24:5672          0.0.0.0:*               LISTEN      469305/beam.smp
tcp        0      0 0.0.0.0:25672           0.0.0.0:*               LISTEN      469305/beam.smp

This is OSP10 with the 20170228 RHOSP images.
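The extra LISTEN socket looks like the Erlang distribution listener that the short-lived rabbitmqctl node opens on a random ephemeral port. One possible mitigation, sketched below and not verified on this deployment (the port range and values are my assumptions), would be to pin that range via the Erlang kernel options so a fixed firewall rule can cover it:

# /etc/rabbitmq/rabbitmq-env.conf -- hypothetical values, pick a free range
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="-kernel inet_dist_listen_min 25672 -kernel inet_dist_listen_max 25682"
RABBITMQ_CTL_ERL_ARGS="-kernel inet_dist_listen_min 25672 -kernel inet_dist_listen_max 25682"

# then allow the same range (only 25672 is open today):
iptables -I INPUT -p tcp --dport 25672:25682 -m state --state NEW -j ACCEPT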
This happens on a freshly deployed OSP10 with the latest patches (20170321). Ports 4369, 5672, 33239 and 25672 are the only RabbitMQ ports allowed by the iptables rules:

[root@krynn-ctrl-0 ~]# iptables -L -nvv|grep 4369
 2545  153K ACCEPT     tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            multiport dports 4369,5672,33239,25672 /* 109 rabbitmq */ state NEW
Hi, I'm working with a customer who also hit this issue; any update on a fix or workaround?
Workaround: delete all the INPUT -j REJECT rules until we have a proper fix. This looks like an FTP-DATA-style problem (a dynamically allocated ephemeral port), and so far I haven't found a configuration parameter that changes this behavior of the rabbitmqctl client.
[stack@instack ~]$ ansible -i hosts -m command -a 'sudo /sbin/iptables --line-numbers -n -v -L INPUT' ctrl |grep REJECT
75    3597  160K REJECT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            reject-with icmp-host-prohibited
75    4421  208K REJECT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            reject-with icmp-host-prohibited
75    5847  282K REJECT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            reject-with icmp-host-prohibited

[stack@instack ~]$ ansible -i hosts -m command -a 'sudo /sbin/iptables --line-numbers -n -v -L FORWARD' ctrl |grep REJECT
3        0     0 REJECT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            reject-with icmp-host-prohibited
3        0     0 REJECT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            reject-with icmp-host-prohibited
3        0     0 REJECT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            reject-with icmp-host-prohibited

So: line 75 in INPUT and line 3 in FORWARD.
Deleted those lines:

[stack@instack ~]$ ansible -i hosts -m command -a 'sudo /sbin/iptables -v -D INPUT 75' ctrl
krynn-ctrl-1 | SUCCESS | rc=0 >>
krynn-ctrl-0 | SUCCESS | rc=0 >>
krynn-ctrl-2 | SUCCESS | rc=0 >>

[stack@instack ~]$ ansible -i hosts -m command -a 'sudo /sbin/iptables -v -D FORWARD 3' ctrl
krynn-ctrl-1 | SUCCESS | rc=0 >>
krynn-ctrl-0 | SUCCESS | rc=0 >>
krynn-ctrl-2 | SUCCESS | rc=0 >>
Even with the REJECT rules removed, I am still seeing the hang.
When I "strace -f -s1024 rabbitmqctl list_queues", as it starts hanging, I'm seeing a repeated pattern of: 156189 sched_yield() = 0 156189 sched_yield() = 0 156189 sched_yield() = 0 156189 sched_yield() = 0 156189 sched_yield() = 0 156189 sched_yield() = 0 156189 sched_yield() = 0 156189 sched_yield() = 0 156189 sched_yield() = 0 156189 sched_yield() = 0 156189 futex(0x7ff5638c0450, FUTEX_WAIT_PRIVATE, 4294967295, {14, 992556869} <unfinished ...> 156191 <... ppoll resumed> ) = 1 ([{fd=44, revents=POLLIN|POLLRDNORM}], left {3, 795857414}) 156191 recvfrom(44, "\0\0\0\0", 1460, 0, NULL, NULL) = 4 156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout) 156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout) 156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout) 156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout) 156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout) 156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout) 156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout) 156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout) 156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout) 156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout) 156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {3, 792331959}, NULL, 8) = 0 (Timeout) 156191 writev(44, [{"\0\0\0\0", 4}], 1) = 4 156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout) 156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout) 
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout) 156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout) 156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout) 156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout) 156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout) 156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout) 156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout) 156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout) 156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {14, 999423541}, NULL, 8^C
As it's hanging, I'm seeing a steady flow of the following in /var/log/messages:

May 18 14:08:08 krynn-ctrl-1 kernel: IN=vlan10 OUT= MAC=f6:38:95:25:65:1d:de:c6:92:2e:91:62:08:00 SRC=10.0.0.15 DST=10.0.0.21 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=16489 DF PROTO=TCP SPT=60166 DPT=41192 WINDOW=65535 RES=0x00 SYN URGP=0
May 18 14:08:08 krynn-ctrl-1 kernel: IN=vlan10 OUT= MAC=f6:38:95:25:65:1d:2a:1c:44:a7:30:53:08:00 SRC=10.0.0.18 DST=10.0.0.21 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=15229 DF PROTO=TCP SPT=41427 DPT=41192 WINDOW=65535 RES=0x00 SYN URGP=0
May 18 14:08:09 krynn-ctrl-1 kernel: IN=vlan10 OUT= MAC=f6:38:95:25:65:1d:de:c6:92:2e:91:62:08:00 SRC=10.0.0.15 DST=10.0.0.21 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=16490 DF PROTO=TCP SPT=60166 DPT=41192 WINDOW=65535 RES=0x00 SYN URGP=0
May 18 14:08:09 krynn-ctrl-1 kernel: IN=vlan10 OUT= MAC=f6:38:95:25:65:1d:2a:1c:44:a7:30:53:08:00 SRC=10.0.0.18 DST=10.0.0.21 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=15230 DF PROTO=TCP SPT=41427 DPT=41192 WINDOW=65535 RES=0x00 SYN URGP=0
May 18 14:08:11 krynn-ctrl-1 kernel: IN=vlan10 OUT= MAC=f6:38:95:25:65:1d:de:c6:92:2e:91:62:08:00 SRC=10.0.0.15 DST=10.0.0.21 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=16491 DF PROTO=TCP SPT=60166 DPT=41192 WINDOW=65535 RES=0x00 SYN URGP=0
May 18 14:08:11 krynn-ctrl-1 kernel: IN=vlan10 OUT= MAC=f6:38:95:25:65:1d:2a:1c:44:a7:30:53:08:00 SRC=10.0.0.18 DST=10.0.0.21 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=15231 DF PROTO=TCP SPT=41427 DPT=41192 WINDOW=65535 RES=0x00 SYN URGP=0
^C

Port 41192 is rabbitmq:

[root@krynn-ctrl-1 log]# netstat -anp|grep 41192
tcp        0      0 0.0.0.0:41192           0.0.0.0:*               LISTEN      156024/beam.smp

10.0.0.15 to .21 are the IPs of my controllers on the internal API VLAN:

# grep 10.0.0 /etc/hosts|grep interna
10.0.0.15 krynn-ctrl-0.internalapi.localdomain krynn-ctrl-0.internalapi
10.0.0.21 krynn-ctrl-1.internalapi.localdomain krynn-ctrl-1.internalapi
10.0.0.18 krynn-ctrl-2.internalapi.localdomain krynn-ctrl-2.internalapi
10.0.0.19 krynn-cmpt-0.internalapi.localdomain krynn-cmpt-0.internalapi
10.0.0.14 overcloud.internalapi.localdomain # FQDN of the internal api VIP
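A quick way to confirm that these SYNs are being eaten by the firewall (and not lost elsewhere) is to watch the packet counters on the catch-all rules at the tail of the INPUT chain while rabbitmqctl hangs. This is just a sketch and assumes the LOG/REJECT rules are still the last entries in the chain:

watch -n1 "iptables -L INPUT -n -v --line-numbers | tail -n 5"

The counters on the LOG/REJECT lines should climb in step with the kernel messages above.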
If I add a simple iptables rule, list_queues does not hang anymore:

ansible -i hosts -m command -a 'sudo /sbin/iptables -I INPUT 1 -i vlan10 -d 10.0.0.0/24 -s 10.0.0.0/24 -j ACCEPT' ctrl

In my env file for this deploy, I have:

InternalApiNetCidr: 10.0.0.0/24
InternalApiNetworkVlanID: 10
InternalApiAllocationPools: [{'start': '10.0.0.10', 'end': '10.0.0.200'}]

Result:

[root@krynn-ctrl-1 log]# time rabbitmqctl list_connections > /dev/null

real    0m1.587s
user    0m0.646s
sys     0m0.485s
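To keep a rule like this across redeploys rather than inserting it by hand, something along these lines in a TripleO environment file might work. This is only a sketch: the rule name/number is made up, and I haven't verified that puppet-tripleo's tripleo::firewall::firewall_rules hiera key behaves exactly this way on this version:

parameter_defaults:
  ExtraConfig:
    tripleo::firewall::firewall_rules:
      '108 allow internal api subnet (hypothetical)':
        proto: 'tcp'
        source: '10.0.0.0/24'
        destination: '10.0.0.0/24'
        action: 'accept'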
(In reply to Vincent S. Cojot from comment #9)
> If I add a simple iptables rule, list_queues does not hang anymore:
> ansible -i hosts -m command -a 'sudo /sbin/iptables -I INPUT 1 -i vlan10 -d
> 10.0.0.0/24 -s 10.0.0.0/24 -j ACCEPT' ctrl
>
> In my env file for this deploy, I have:
>
> InternalApiNetCidr: 10.0.0.0/24
> InternalApiNetworkVlanID: 10
> InternalApiAllocationPools: [{'start': '10.0.0.10', 'end': '10.0.0.200'}]

Thanks for debugging this! So we should add or modify an extra iptables rule when deploying an OpenStack cluster. I guess this should be reassigned to the Director then.
(In reply to Vincent S. Cojot from comment #8)
> As it's hanging, I'm seeing a steady flow of the following in
> /var/log/messages:
> Port 41192 is rabbitmq:
> [root@krynn-ctrl-1 log]# netstat -anp|grep 41192
> tcp        0      0 0.0.0.0:41192           0.0.0.0:*               LISTEN      156024/beam.smp

Honestly, I can't remember where port 41192 comes from. My guess is that the Erlang distribution port was reassigned once again (?).
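One way to check whether a given port belongs to an Erlang distribution listener (as opposed to AMQP) is to ask epmd, which records the distribution port of every node registered on the host. The output shape below is illustrative, and the exact name of rabbitmqctl's temporary node varies by version:

[root@krynn-ctrl-1 ~]# epmd -names
epmd: up and running on port 4369 with data:
name rabbit at port 25672

While rabbitmqctl is running, its short-lived node should also show up here, with the extra ephemeral port.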
Sorry for reassigning - I'll keep it assigned against rabbitmq-server
The customer confirmed this issue is present in OSP11 as well:

puppet-tripleo-6.3.0-12.el7ost.noarch
openstack-tripleo-heat-templates-6.0.0-10.el7ost.noarch

Completely removing the DROP rules works around the issue.
*** Bug 1466803 has been marked as a duplicate of this bug. ***
Which upstream commit caused this regression?
Verified on rabbitmq-server-3.6.3-7.el7ost.noarch.

Multiple runs of 'rabbitmqctl list_connections' during traffic (booting instances) show that the command no longer gets stuck.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2653
*** Bug 1474507 has been marked as a duplicate of this bug. ***
*** Bug 1742842 has been marked as a duplicate of this bug. ***
*** Bug 1640455 has been marked as a duplicate of this bug. ***