Bug 1434593
| Summary: | rabbitmqctl list_queues/list_connections hangs indefinitely in OSP10 due to missing iptables rules. | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Vincent S. Cojot <vcojot> |
| Component: | rabbitmq-server | Assignee: | Peter Lemenkov <plemenko> |
| Status: | CLOSED ERRATA | QA Contact: | Udi Shkalim <ushkalim> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 10.0 (Newton) | CC: | apevec, bschmaus, cfields, dbecker, dhill, fdinitto, hfukumot, jeckersb, jmelvin, lhh, mburns, mflusche, mircea.vutcovici, morazi, mschuppe, plemenko, rhel-osp-director-maint, rhosp-bugs-internal, rlondhe, rrubins, skinjo, srevivo |
| Target Milestone: | z4 | Keywords: | Triaged, ZStream |
| Target Release: | 10.0 (Newton) | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | rabbitmq-server-3.6.3-7.el7ost | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-09-06 17:06:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Vincent S. Cojot
2017-03-21 21:20:25 UTC
This happens on a freshly deployed OSP10 with the latest patches (20170321). 4369, 5672, 33239 and 25672 are the only ports in the iptables rules.

[root@krynn-ctrl-0 ~]# iptables -L -nvv|grep 4369
2545 153K ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 multiport dports 4369,5672,33239,25672 /* 109 rabbitmq */ state NEW

Hi, I'm working with a customer who hit this issue also; any update on a fix or work-around?

Delete all the INPUT -j REJECT rules until we have a proper fix. This looks like an FTP-DATA issue and so far, I haven't found a configuration parameter that would change this behavior with the rabbitmqctl client.

[stack@instack ~]$ ansible -i hosts -m command -a 'sudo /sbin/iptables --line-numbers -n -v -L INPUT' ctrl |grep REJECT
75 3597 160K REJECT all -- * * 0.0.0.0/0 0.0.0.0/0 reject-with icmp-host-prohibited
75 4421 208K REJECT all -- * * 0.0.0.0/0 0.0.0.0/0 reject-with icmp-host-prohibited
75 5847 282K REJECT all -- * * 0.0.0.0/0 0.0.0.0/0 reject-with icmp-host-prohibited
[stack@instack ~]$ ansible -i hosts -m command -a 'sudo /sbin/iptables --line-numbers -n -v -L FORWARD' ctrl |grep REJECT
3 0 0 REJECT all -- * * 0.0.0.0/0 0.0.0.0/0 reject-with icmp-host-prohibited
3 0 0 REJECT all -- * * 0.0.0.0/0 0.0.0.0/0 reject-with icmp-host-prohibited
3 0 0 REJECT all -- * * 0.0.0.0/0 0.0.0.0/0 reject-with icmp-host-prohibited

So the REJECT rules are line 75 in INPUT and line 3 in FORWARD. Deleted those lines:

[stack@instack ~]$ ansible -i hosts -m command -a 'sudo /sbin/iptables -v -D INPUT 75' ctrl
krynn-ctrl-1 | SUCCESS | rc=0 >>
krynn-ctrl-0 | SUCCESS | rc=0 >>
krynn-ctrl-2 | SUCCESS | rc=0 >>
[stack@instack ~]$ ansible -i hosts -m command -a 'sudo /sbin/iptables -v -D FORWARD 3' ctrl
krynn-ctrl-1 | SUCCESS | rc=0 >>
krynn-ctrl-0 | SUCCESS | rc=0 >>
krynn-ctrl-2 | SUCCESS | rc=0 >>

Even with the REJECT rules removed, I am still seeing the hang. When I "strace -f -s1024 rabbitmqctl list_queues", as it starts hanging, I'm seeing a repeated pattern of:
156189 sched_yield() = 0
156189 sched_yield() = 0
156189 sched_yield() = 0
156189 sched_yield() = 0
156189 sched_yield() = 0
156189 sched_yield() = 0
156189 sched_yield() = 0
156189 sched_yield() = 0
156189 sched_yield() = 0
156189 sched_yield() = 0
156189 futex(0x7ff5638c0450, FUTEX_WAIT_PRIVATE, 4294967295, {14, 992556869} <unfinished ...>
156191 <... ppoll resumed> ) = 1 ([{fd=44, revents=POLLIN|POLLRDNORM}], left {3, 795857414})
156191 recvfrom(44, "\0\0\0\0", 1460, 0, NULL, NULL) = 4
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {3, 792331959}, NULL, 8) = 0 (Timeout)
156191 writev(44, [{"\0\0\0\0", 4}], 1) = 4
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {0, 0}, NULL, 8) = 0 (Timeout)
156191 ppoll([{fd=4, events=POLLIN|POLLRDNORM}, {fd=42, events=POLLIN|POLLRDNORM}, {fd=41, events=POLLIN|POLLRDNORM}, {fd=43, events=POLLIN|POLLRDNORM}, {fd=44, events=POLLIN|POLLRDNORM}], 5, {14, 999423541}, NULL, 8^C
As it's hanging, I'm seeing a steady flow of the following in /var/log/messages:

May 18 14:08:08 krynn-ctrl-1 kernel: IN=vlan10 OUT= MAC=f6:38:95:25:65:1d:de:c6:92:2e:91:62:08:00 SRC=10.0.0.15 DST=10.0.0.21 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=16489 DF PROTO=TCP SPT=60166 DPT=41192 WINDOW=65535 RES=0x00 SYN URGP=0
May 18 14:08:08 krynn-ctrl-1 kernel: IN=vlan10 OUT= MAC=f6:38:95:25:65:1d:2a:1c:44:a7:30:53:08:00 SRC=10.0.0.18 DST=10.0.0.21 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=15229 DF PROTO=TCP SPT=41427 DPT=41192 WINDOW=65535 RES=0x00 SYN URGP=0
May 18 14:08:09 krynn-ctrl-1 kernel: IN=vlan10 OUT= MAC=f6:38:95:25:65:1d:de:c6:92:2e:91:62:08:00 SRC=10.0.0.15 DST=10.0.0.21 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=16490 DF PROTO=TCP SPT=60166 DPT=41192 WINDOW=65535 RES=0x00 SYN URGP=0
May 18 14:08:09 krynn-ctrl-1 kernel: IN=vlan10 OUT= MAC=f6:38:95:25:65:1d:2a:1c:44:a7:30:53:08:00 SRC=10.0.0.18 DST=10.0.0.21 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=15230 DF PROTO=TCP SPT=41427 DPT=41192 WINDOW=65535 RES=0x00 SYN URGP=0
May 18 14:08:11 krynn-ctrl-1 kernel: IN=vlan10 OUT= MAC=f6:38:95:25:65:1d:de:c6:92:2e:91:62:08:00 SRC=10.0.0.15 DST=10.0.0.21 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=16491 DF PROTO=TCP SPT=60166 DPT=41192 WINDOW=65535 RES=0x00 SYN URGP=0
May 18 14:08:11 krynn-ctrl-1 kernel: IN=vlan10 OUT= MAC=f6:38:95:25:65:1d:2a:1c:44:a7:30:53:08:00 SRC=10.0.0.18 DST=10.0.0.21 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=15231 DF PROTO=TCP SPT=41427 DPT=41192 WINDOW=65535 RES=0x00 SYN URGP=0
^C

Port 41192 is rabbitmq:

[root@krynn-ctrl-1 log]# netstat -anp|grep 41192
tcp 0 0 0.0.0.0:41192 0.0.0.0:* LISTEN 156024/beam.smp

10.0.0.15, 10.0.0.18 and 10.0.0.21 are the IPs of my controllers on the internal API VLAN:

# grep 10.0.0 /etc/hosts|grep interna
10.0.0.15 krynn-ctrl-0.internalapi.localdomain krynn-ctrl-0.internalapi
10.0.0.21 krynn-ctrl-1.internalapi.localdomain krynn-ctrl-1.internalapi
10.0.0.18 krynn-ctrl-2.internalapi.localdomain krynn-ctrl-2.internalapi
10.0.0.19 krynn-cmpt-0.internalapi.localdomain krynn-cmpt-0.internalapi
10.0.0.14 overcloud.internalapi.localdomain # FQDN of the internal api VIP

If I add a simple iptables rule, list_queues does not hang anymore:
ansible -i hosts -m command -a 'sudo /sbin/iptables -I INPUT 1 -i vlan10 -d 10.0.0.0/24 -s 10.0.0.0/24 -j ACCEPT' ctrl
In my env file for this deploy, I have:
InternalApiNetCidr: 10.0.0.0/24
InternalApiNetworkVlanID: 10
InternalApiAllocationPools: [{'start': '10.0.0.10', 'end': '10.0.0.200'}]
Result:
[root@krynn-ctrl-1 log]# time rabbitmqctl list_connections > /dev/null
real 0m1.587s
user 0m0.646s
sys 0m0.485s
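
The workaround rule above is built from two values in the env file, the internal API CIDR and the VLAN ID. A minimal Python sketch of that derivation follows, assuming the interface is named `vlan<ID>` as it is on these controllers; the helper name `build_accept_rule` is hypothetical and not part of any Red Hat tooling. It only assembles the argument list and does not run iptables.

```python
import ipaddress
import shlex
from typing import List


def build_accept_rule(cidr: str, vlan_id: int) -> List[str]:
    """Return iptables arguments accepting intra-subnet traffic on one VLAN interface."""
    # strict=True raises ValueError if host bits are set (e.g. 10.0.0.1/24).
    net = ipaddress.ip_network(cidr, strict=True)
    # Assumption: the interface is named "vlan<ID>", as with vlan10 in this thread.
    iface = f"vlan{vlan_id}"
    return [
        "/sbin/iptables", "-I", "INPUT", "1",
        "-i", iface,
        "-d", str(net),
        "-s", str(net),
        "-j", "ACCEPT",
    ]


if __name__ == "__main__":
    # InternalApiNetCidr and InternalApiNetworkVlanID from the env file quoted above.
    print(shlex.join(build_accept_rule("10.0.0.0/24", 10)))
```

Note that a rule like this accepts all traffic between hosts on the internal API subnet, so it is a diagnostic workaround rather than a targeted fix.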
(In reply to Vincent S. Cojot from comment #9)
> If I add a simple iptables rule, list_queues does not hang anymore:
> ansible -i hosts -m command -a 'sudo /sbin/iptables -I INPUT 1 -i vlan10 -d 10.0.0.0/24 -s 10.0.0.0/24 -j ACCEPT' ctrl
>
> In my env file for this deploy, I have:
>
> InternalApiNetCidr: 10.0.0.0/24
> InternalApiNetworkVlanID: 10
> InternalApiAllocationPools: [{'start': '10.0.0.10', 'end': '10.0.0.200'}]

Thanks for debugging this! So we should add/modify an extra iptables deployment rule while building an OpenStack cluster. I guess this should be reassigned to the Director then.

(In reply to Vincent S. Cojot from comment #8)
> As it's hanging, I'm seeing a steady flow of the following in /var/log/messages:
> Port 41192 is rabbitmq:
> [root@krynn-ctrl-1 log]# netstat -anp|grep 41192
> tcp 0 0 0.0.0.0:41192 0.0.0.0:* LISTEN 156024/beam.smp

Honestly I can't remember where port 41192 comes from. My guess is that the Erlang distribution port was reassigned once again (?).

Sorry for reassigning - I'll keep it assigned against rabbitmq-server

The customer confirmed this issue is present in OSP11 as well:
puppet-tripleo-6.3.0-12.el7ost.noarch
openstack-tripleo-heat-templates-6.0.0-10.el7ost.noarch
Completely removing the DROP rules works around the issue.

*** Bug 1466803 has been marked as a duplicate of this bug. ***

Which upstream commit caused this regression?

Verified on rabbitmq-server-3.6.3-7.el7ost.noarch
In multiple tests under traffic (booting instances), rabbitmqctl list_connections did not get stuck.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2653

*** Bug 1474507 has been marked as a duplicate of this bug. ***

*** Bug 1742842 has been marked as a duplicate of this bug. ***

*** Bug 1640455 has been marked as a duplicate of this bug. ***
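
The diagnosis in this thread comes down to comparing the destination port of the logged SYN packets with the ports allowed by the "109 rabbitmq" multiport rule. A rough Python sketch of that comparison is below. It assumes the kernel lines come from a netfilter LOG rule and have the `SRC=... DPT=...` layout shown in the /var/log/messages excerpt; the function name `blocked_ports` is hypothetical.

```python
import re
from collections import Counter

# Ports from the "109 rabbitmq" multiport rule quoted in the description.
ALLOWED_PORTS = {4369, 5672, 33239, 25672}

# Matches the source address and TCP destination port of a netfilter log line.
LOG_RE = re.compile(r"SRC=(?P<src>\S+) DST=\S+ .*?PROTO=TCP SPT=\d+ DPT=(?P<dpt>\d+)")


def blocked_ports(lines, allowed=ALLOWED_PORTS):
    """Count logged TCP SYNs per (source, destination port) not covered by `allowed`."""
    hits = Counter()
    for line in lines:
        match = LOG_RE.search(line)
        # Only connection attempts (SYN flag logged) are of interest here.
        if match and " SYN " in line:
            dpt = int(match.group("dpt"))
            if dpt not in allowed:
                hits[(match.group("src"), dpt)] += 1
    return hits


if __name__ == "__main__":
    # Two of the log lines reported in this bug.
    sample = [
        "May 18 14:08:08 krynn-ctrl-1 kernel: IN=vlan10 OUT= MAC=f6:38:95:25:65:1d:de:c6:92:2e:91:62:08:00 SRC=10.0.0.15 DST=10.0.0.21 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=16489 DF PROTO=TCP SPT=60166 DPT=41192 WINDOW=65535 RES=0x00 SYN URGP=0",
        "May 18 14:08:08 krynn-ctrl-1 kernel: IN=vlan10 OUT= MAC=f6:38:95:25:65:1d:2a:1c:44:a7:30:53:08:00 SRC=10.0.0.18 DST=10.0.0.21 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=15229 DF PROTO=TCP SPT=41427 DPT=41192 WINDOW=65535 RES=0x00 SYN URGP=0",
    ]
    for (src, dpt), count in blocked_ports(sample).items():
        print(f"{src} -> port {dpt}: {count} SYN(s) outside the allowed list")
```

On the lines reported here, both peer controllers show up trying to reach port 41192, which is the beam.smp listener from the netstat output and is absent from the allowed list.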