Description of problem:

In downstream OpenStack CI, we experience random connectivity issues: https://bugzilla.redhat.com/show_bug.cgi?id=1438662

A month-long investigation uncovered that in particular situations, ARP table entries may cycle through the STALE-DELAY-REACHABLE states without issuing a single ARP probe, even when no matching upper layer protocol traffic arrives from the entry's lladdr. This is because, until kernel 4.11, the following patch series was not included: https://www.spinics.net/lists/linux-rdma/msg45907.html

The patch series tackles the problem where confirmations happen on a dst_entry structure that can be reused for multiple ARP entries. So seemingly unrelated traffic actually "confirms" entries, which explains why not a single ARP probe to update the affected ARP entries is sent during failing test runs.

With (most of) those patches applied to the RHEL7 kernel, I was able to pass OpenStack scenario test runs. With the patches applied, I see (in a captured .pcap) that after 5 seconds of not being able to reach the lladdr, the kernel correctly issues an ARP probe, updates the lladdr in the ARP table with the new MAC address, and successfully establishes connections to the IP address.

An attempt to backport (most of) those patches from the series that is proven to work can be found at: https://github.com/booxter/centos-kernel (note it doesn't include the sctp patch and the last cleanup patch, since I didn't think those could affect my testing, and indeed they did not). Besides the missing patches, it would need some refinement to accommodate the KABI stability requirements set for RHEL kernels.

Version-Release number of selected component (if applicable):
3.10.0-514.22.1.el7

How reproducible:
I couldn't come up with a script that would show the issue in isolation from OpenStack and its test suite, but see the steps below.

Steps to Reproduce:
Though I don't have easy steps to reproduce, I will try to explain the state in which the system is stuck with broken ARP entries. The following should happen:

1. an existing stale ARP entry for IP1 with MAC1 should exist;
2. IP1 is moved to another device with MAC2;
3. the Linux kernel sends the first packet directed to IP1 with MAC1, and the ARP entry transitions to DELAY;
4. in the meantime, while the kernel waits for the delay timeout to expire (which is set to 5s), another upper layer protocol packet that is not directed to the lladdr triggers confirmation of the wrong ARP entry;
5. the kernel delay timer fires, but since the entry is now confirmed, the kernel bails out of ARP probes and just transitions the entry to REACHABLE;
6. after the next REACHABLE -> STALE transition, steps 3-5 repeat.

(A minimal sketch for observing these transitions on an affected node is included at the end of this description.)

Actual results:
A random ARP entry never gets updated with a new lladdr.

Expected results:
With no matching traffic coming from the old lladdr, an ARP probe should eventually trigger and update the entry with the new lladdr.

Additional info:
I collected a lot of information on the failure at: https://docs.google.com/document/d/1vmL2X1Cdu9yMmRaJBt5YsFuBvJw_eZj_YWu5SeqA_2U/edit?usp=sharing

I guess the "Diggin' the Kernel" section should have all the details needed to understand how the error manifests.

Also note that, unrelated to the OpenStack CI issue I was debugging, we got a customer report that after a failover of one of their services, invalid REACHABLE ARP entries stayed that way 24h+ after the failover: https://access.redhat.com/support/cases/#/case/01814594 I suspect this is the same issue.

The issue doesn't show up in -pegas builds because the series of patches is included there.
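To watch the failure mode on an affected node, here is a minimal observation sketch (illustrative only, not a reproducer; IP1 and IFACE are placeholders for the affected address and interface). It logs the neighbour entry cycling STALE -> DELAY -> REACHABLE while a parallel capture shows that no ARP who-has request for IP1 is ever sent:

IP1=192.168.1.2
IFACE=eth0

# Log every state change of the neighbour entry for IP1.
ip monitor neigh | grep --line-buffered "$IP1" &

# Capture ARP traffic for IP1; on an affected kernel the entry keeps cycling
# through the states above without a single who-has request showing up here.
tcpdump -ni "$IFACE" arp host "$IP1"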
Final note: this situation may render IP address roaming/failover ineffective "thanks" to specially crafted traffic that happens to arrive at the same node. That raises the question of whether there is a reason to track the bug as security related. That's why I am leaving the bug report closed from the public for now.
(In reply to Ihar Hrachyshka from comment #0)
> Final note: this situation may render IP address roaming/failover
> ineffective "thanks" to specially crafted traffic that happens to arrive at
> the same node. That raises the question of whether there is a reason to
> track the bug as security related. That's why I am leaving the bug report
> closed from the public for now.

I don't think this makes things any worse. I think the attacker would need to use address spoofing to execute such an attack, and the effect of this would be similar to that of other address spoofing attacks.
No, you don't need to spoof anything to break connectivity between a node X and a service IP1. Just make sure there is special traffic to node X that makes it confirm the ARP entry for IP1 over and over (that seems to be any traffic from the same subnet on the same L2 domain to node X), and then just wait for a failover of IP1 to occur. Once it occurs, X can never get out of the confirmation loop and restore connectivity to IP1. (A minimal sketch of such traffic follows below.)
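For illustration, a hypothetical sketch of such "confirming" traffic, based on the description above (NODE_X_IP is a placeholder for an address of node X; per the behaviour described, any steady upper layer traffic from a host on the same subnet/L2 domain should do):

# Run from any host on the same L2 segment as node X; no spoofing required.
NODE_X_IP=192.168.1.1
while true; do
    # One packet per second towards node X appears to be enough to keep the
    # shared dst_entry confirmed, so the stale entry for IP1 is never probed.
    ping -c 1 -W 1 "$NODE_X_IP" > /dev/null
    sleep 1
done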
It affects OpenStack CI and customers. We would like to ask for the series to be backported to 7.3 as well, if possible.
Marking the bug as a blocker for 7.4. The rationale is as follows:

1. the bug affects OpenStack CI. We patched the OpenStack Neutron L3 agent a bit to reduce the risk of failure, but it is not a complete solution.
2. the bug affects production environments (see the attached customer case) that are not even relying on Neutron L3 agent gratuitous ARPs.

The effect of the bug is that connectivity between two nodes on the same network segment may become broken and stay that way indefinitely, which is a big deal, and one may even argue it warrants special security consideration.

I understand we are late in the 7.4 timeframe. To explain why the bug is popping up only now: we set up and started debugging the OpenStack CI jobs that could trigger the failure mode several months ago, and spent a lot of time digging to the point where we realized it's a kernel issue and not an OpenStack one. (Actually, a set of issues that, combined, render the CI jobs totally broken.) Ideally, we would see the fix in 7.3 too, but at least 7.4 would be a good start.

I understand that the series of patches has significant impact, so we need to be cautious about the cost-benefit. That being said, the benefit of production environments not breaking on IP failover sounds like a significant one to me.
Posted: http://post-office.corp.redhat.com/archives/rhkernel-list/2017-May/msg01774.html

Corresponding Brew build: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13231449
There seems to be a decision that the bug doesn't justify special handling security-wise. For this reason, I am opening the description and relevant comments to the public.
set qa_ack based on comment 11
Patch(es) committed on kernel repository and an interim kernel build is undergoing testing
Patch(es) available on kernel-3.10.0-678.el7
Ofer, please advise on how we can test the new kernel within the scope of OSP11 CI.
reproducer:

[root@ibm-x3650m4-04 arp_test]# cat repo.sh
#!/bin/bash

ip netns add host1
ip netns add host2
ip netns add host3

brctl addbr br0

ip link add veth1 type veth peer name veth1_br
ip link add veth2 type veth peer name veth2_br
ip link add veth3 type veth peer name veth3_br

ip link set veth1 netns host1
ip link set veth2 netns host2
ip link set veth3 netns host3

brctl addif br0 veth1_br
brctl addif br0 veth2_br
brctl addif br0 veth3_br

ip netns exec host1 ip link set lo up
ip netns exec host1 ip link set veth1 up
ip netns exec host1 ip addr add 192.168.1.1/24 dev veth1
ip netns exec host1 ip addr add 2000::1/64 dev veth1

ip netns exec host2 ip link set lo up
ip netns exec host2 ip link set veth2 up
ip netns exec host2 ip addr add 192.168.1.2/24 dev veth2
ip netns exec host2 ip addr add 2000::2/64 dev veth2

ip netns exec host3 ip link set lo up
ip netns exec host3 ip link set veth3 up
ip netns exec host3 ip addr add 192.168.1.3/24 dev veth3
ip netns exec host3 ip addr add 2000::3/64 dev veth3

ip link set br0 up
ip link set veth1_br up
ip link set veth2_br up
ip link set veth3_br up

ip netns exec host3 nc -l -k 10010 &
sleep 1
ip netns exec host1 taskset --cpu-list 0 nc 192.168.1.3 10010 < /dev/zero &
sleep 5
echo "host3 neigh setup"
ip netns exec host1 ip neigh show

ip netns exec host2 nc -l 10010 &
sleep 1
ip netns exec host1 taskset --cpu-list 0 nc 192.168.1.2 10010 &
sleep 5
echo "host2 neigh setup"
ip netns exec host1 ip neigh show

echo "down host2 and change ip to host3"
ip netns exec host2 ip link set veth2 down
ip netns exec host3 ip addr add 192.168.1.2/24 dev veth3
ip netns exec host3 ip addr sh

sleep 60
echo "host2 neigh stale"
ip netns exec host1 ip neigh show

echo "send packet to host2 on host1"
#ip netns exec host1 taskset --cpu-list 1 nc 192.168.1.2 10010 &
ip netns exec host1 taskset --cpu-list 0 ping 192.168.1.2 -c 1 -W 1 -w 1
sleep 1

i=0
while [ $i -lt 15 ]
do
    ip netns exec host1 ip neigh show
    echo ""
    ip netns exec host1 taskset --cpu-list 0 ping 192.168.1.2 -c 1 -W 1 -w 1
    let i+=1
done

jobs -p | xargs kill -9
killall -9 nc

reproduced on 3.10.0-675:

[root@ibm-x3650m4-04 arp_test]# uname -a
Linux ibm-x3650m4-04.rhts.eng.pek2.redhat.com 3.10.0-675.el7.x86_64 #1 SMP Mon May 29 23:22:32 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
[root@ibm-x3650m4-04 arp_test]# ./repo.sh
host3 neigh setup
192.168.1.3 dev veth1 lladdr c6:1d:84:d1:09:54 REACHABLE
host2 neigh setup
192.168.1.2 dev veth1 lladdr 16:07:5a:37:06:2c REACHABLE
192.168.1.3 dev veth1 lladdr c6:1d:84:d1:09:54 REACHABLE
down host2 and change ip to host3
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
13: veth3@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether c6:1d:84:d1:09:54 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.1.3/24 scope global veth3
       valid_lft forever preferred_lft forever
    inet 192.168.1.2/24 scope global secondary veth3
       valid_lft forever preferred_lft forever
    inet6 2000::3/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::c41d:84ff:fed1:954/64 scope link
       valid_lft forever preferred_lft forever
host2 neigh stale
192.168.1.2 dev veth1 lladdr 16:07:5a:37:06:2c STALE
192.168.1.3 dev veth1 lladdr c6:1d:84:d1:09:54 REACHABLE
send packet to host2 on host1
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
--- 192.168.1.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

192.168.1.2 dev veth1 lladdr 16:07:5a:37:06:2c DELAY
192.168.1.3 dev veth1 lladdr c6:1d:84:d1:09:54 REACHABLE

PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.

--- 192.168.1.2 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

192.168.1.2 dev veth1 lladdr 16:07:5a:37:06:2c DELAY
192.168.1.3 dev veth1 lladdr c6:1d:84:d1:09:54 REACHABLE

PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

192.168.1.2 dev veth1 lladdr 16:07:5a:37:06:2c DELAY
192.168.1.3 dev veth1 lladdr c6:1d:84:d1:09:54 REACHABLE

PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

192.168.1.2 dev veth1 lladdr 16:07:5a:37:06:2c REACHABLE
192.168.1.3 dev veth1 lladdr c6:1d:84:d1:09:54 REACHABLE    <===== mac for 192.168.1.2 not updated

Verified on 3.10.0-679:

[root@ibm-x3650m4-04 arp_test]# uname -a
Linux ibm-x3650m4-04.rhts.eng.pek2.redhat.com 3.10.0-679.el7.x86_64 #1 SMP Mon Jun 5 23:13:08 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
[root@ibm-x3650m4-04 arp_test]# ./repo.sh
host3 neigh setup
192.168.1.3 dev veth1 lladdr b2:e4:26:21:6f:48 REACHABLE
host2 neigh setup
192.168.1.2 dev veth1 lladdr f6:ca:a5:2e:a1:79 REACHABLE
192.168.1.3 dev veth1 lladdr b2:e4:26:21:6f:48 REACHABLE
down host2 and change ip to host3
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
83: veth3@if82: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether b2:e4:26:21:6f:48 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.1.3/24 scope global veth3
       valid_lft forever preferred_lft forever
    inet 192.168.1.2/24 scope global secondary veth3
       valid_lft forever preferred_lft forever
    inet6 2000::3/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::b0e4:26ff:fe21:6f48/64 scope link
       valid_lft forever preferred_lft forever
host2 neigh stale
192.168.1.2 dev veth1 lladdr f6:ca:a5:2e:a1:79 STALE
192.168.1.3 dev veth1 lladdr b2:e4:26:21:6f:48 REACHABLE
send packet to host2 on host1
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

192.168.1.2 dev veth1 lladdr f6:ca:a5:2e:a1:79 DELAY
192.168.1.3 dev veth1 lladdr b2:e4:26:21:6f:48 REACHABLE

PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

192.168.1.2 dev veth1 lladdr f6:ca:a5:2e:a1:79 DELAY
192.168.1.3 dev veth1 lladdr b2:e4:26:21:6f:48 REACHABLE

PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

192.168.1.2 dev veth1 lladdr f6:ca:a5:2e:a1:79 DELAY
192.168.1.3 dev veth1 lladdr b2:e4:26:21:6f:48 REACHABLE

PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

192.168.1.2 dev veth1 lladdr f6:ca:a5:2e:a1:79 PROBE
192.168.1.3 dev veth1 lladdr b2:e4:26:21:6f:48 REACHABLE

PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
--- 192.168.1.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

192.168.1.2 dev veth1 lladdr f6:ca:a5:2e:a1:79 PROBE
192.168.1.3 dev veth1 lladdr b2:e4:26:21:6f:48 REACHABLE

PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

192.168.1.2 dev veth1 lladdr f6:ca:a5:2e:a1:79 PROBE
192.168.1.3 dev veth1 lladdr b2:e4:26:21:6f:48 REACHABLE

PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=0.049 ms

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 1 received, 50% packet loss, time 999ms
rtt min/avg/max/mdev = 0.049/0.049/0.049/0.000 ms

192.168.1.2 dev veth1 lladdr b2:e4:26:21:6f:48 REACHABLE
192.168.1.3 dev veth1 lladdr b2:e4:26:21:6f:48 REACHABLE    <==== mac for 192.168.1.2 updated
For the record, we tested the new kernel in an OSP environment, and it fixed the CI issue we experienced.
The first kernel version with the backported fix for this BZ was -678, which can be found here: http://download-node-02.eng.bos.redhat.com/brewroot/packages/kernel/3.10.0/678.el7/
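As a quick sanity check (an illustrative sketch, not an official procedure, assuming a RHEL7 3.10.0-based kernel), one can verify that the running kernel is at least the -678 build that carries the backport:

# Compare the numeric build id of the running RHEL7 kernel against 678.
rel=$(uname -r)            # e.g. 3.10.0-678.el7.x86_64
build=${rel#3.10.0-}
build=${build%%.*}
if [ "$build" -ge 678 ]; then
    echo "kernel $rel should include the neighbour confirmation fix"
else
    echo "kernel $rel predates -678; the fix is not included"
fi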
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:1842