Bug 1450203

Summary:	Irrelevant upper layer protocol traffic may erroneously "confirm" neigh entries
Product:	Red Hat Enterprise Linux 7	Reporter:	Ihar Hrachyshka <ihrachys>
Component:	kernel	Assignee:	Lance Richardson <lrichard>
kernel sub component:	arp/icmp	QA Contact:	Jianlin Shi <jishi>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	medium
Priority:	medium	CC:	aloughla, atragler, dgilbert, ealcaniz, ihrachys, jiji, lmiksik, lrichard, oblaut, sukulkar
Version:	7.3
Target Milestone:	rc
Target Release:	7.4
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-08-02 07:31:13 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1438662

Description Ihar Hrachyshka 2017-05-11 19:38:46 UTC

Description of problem: in downstream OpenStack CI, we experience random connectivity issues: https://bugzilla.redhat.com/show_bug.cgi?id=1438662

A month-long investigation uncovered that, in particular situations, ARP table entries may cycle through the STALE, DELAY, and REACHABLE states without the kernel issuing a single ARP probe, even when no matching upper layer protocol traffic arrives from the entry's lladdr. This happens because kernels before 4.11 do not include the following patch series:

https://www.spinics.net/lists/linux-rdma/msg45907.html

The patch series tackles the problem where confirmations happen on a dst_entry structure that can be shared by multiple ARP entries. As a result, seemingly unrelated traffic actually "confirms" entries, which explains why not a single ARP probe is sent to update the affected ARP entries during failing test runs.
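To illustrate the mechanism described above, here is a toy model in bash. It is not kernel code, and every function and variable name in it is made up for illustration. It assumes a single "pending confirmation" flag stored on a route entry that is shared by all neighbours on the subnet, and shows how progress reported for one peer ends up credited to a different neighbour entry.

```bash
#!/bin/bash
# Toy model of the behaviour described above. This is NOT kernel code;
# every name here is hypothetical.

declare -A confirmed     # neighbour IP -> logical time of last confirmation
pending_confirm=0        # one flag on the route entry shared by the subnet
now=0                    # logical clock

# Old behaviour (modelled): upper-layer progress marks only the shared entry.
confirm_via_shared_entry() {
    pending_confirm=1
}

# Old behaviour (modelled): the next packet sent through the shared entry
# credits the confirmation to whichever neighbour that packet targets.
transmit_via_shared_entry() {
    local daddr=$1
    if (( pending_confirm )); then
        confirmed[$daddr]=$now
        pending_confirm=0
    fi
}

# Fixed behaviour (modelled): the confirmation names its neighbour.
confirm_neighbour() {
    local daddr=$1
    confirmed[$daddr]=$now
}

# Traffic from 192.168.1.3 makes progress at t=10 ...
now=10
confirm_via_shared_entry
# ... and the next packet happens to go to 192.168.1.2.
transmit_via_shared_entry 192.168.1.2

echo "192.168.1.2 confirmed at: ${confirmed[192.168.1.2]:-never}"
echo "192.168.1.3 confirmed at: ${confirmed[192.168.1.3]:-never}"
```

In this model the script reports 192.168.1.2 as confirmed at 10 and 192.168.1.3 as never confirmed, even though only 192.168.1.3 made progress. Calling confirm_neighbour 192.168.1.3 instead would credit the right entry.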

With (most of) those patches applied to the RHEL7 kernel, I was able to pass OpenStack scenario test runs. With the patches applied, I see (in the captured .pcap) that after 5 seconds of not being able to reach the lladdr, the kernel correctly issues an ARP probe, updates the lladdr in the ARP table with the new MAC address, and successfully establishes connections to the IP address.
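The 5 second wait mentioned above appears to correspond to the neighbour table's delay_first_probe_time tunable, which is exposed under /proc/sys/net/ipv4/neigh/. A minimal sketch for reading such a tunable on a test node follows. The helper name and the optional third argument (an alternative sysctl root, used only to make the function testable) are assumptions of this sketch.

```bash
#!/bin/bash
# Print one neighbour-table tunable for an interface.
# Arguments: interface name (or "default"), tunable name,
# optional sysctl root (defaults to /proc/sys; overridable for testing).
neigh_param() {
    local dev=$1 name=$2 root=${3:-/proc/sys}
    local path="$root/net/ipv4/neigh/$dev/$name"
    if [[ -r $path ]]; then
        cat "$path"
    else
        echo "unreadable: $path" >&2
        return 1
    fi
}

# Example: show how long an entry stays in DELAY before probing starts.
# neigh_param default delay_first_probe_time
```

Comparing the value against the timestamps in a packet capture is one way to check whether a probe was sent when the delay timer expired.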

An attempt to backport (most of) the patches from the series, which is proven to work, can be found at: https://github.com/booxter/centos-kernel (note that it doesn't include the sctp patch and the last cleanup patch, since I didn't think they could affect my testing, and indeed they did not). Besides the missing patches, it would need some refinement to accommodate the KABI stability requirements set for RHEL kernels.

Version-Release number of selected component (if applicable): 3.10.0-514.22.1.el7

How reproducible: I couldn't come up with a script that would show the issue isolated from OpenStack and its test suite, but see steps below.

Steps to Reproduce:

I don't have easy steps to reproduce, but I will try to explain how the system gets stuck with broken ARP entries. The following should happen:

1. a stale ARP entry for IP1 with MAC1 exists;
2. IP1 is moved to another device with MAC2;
3. the Linux kernel sends the first packet directed to IP1 to MAC1, and the ARP entry transitions to DELAY;
4. while the kernel waits for the delay timeout to expire (it is set to 5s), another upper layer protocol packet that is not directed to the lladdr triggers confirmation of the wrong ARP entry;
5. the kernel delay timer fires, but since the entry is now confirmed, the kernel skips the ARP probes and just transitions the entry to REACHABLE;
6. after the next REACHABLE -> STALE transition, steps 3-5 repeat.
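The cycle in the steps above can be observed by polling the neighbour table and logging every state change. The following is a minimal sketch, assuming `ip neigh show` prints one entry per line with the address in the first field and the state in the last field. Both function names are made up for this example.

```bash
#!/bin/bash
# Extract the state (last field) of the entry for a given IP from
# `ip neigh show` style text read on stdin.
neigh_state() {
    local ip=$1
    awk -v ip="$ip" '$1 == ip { print $NF }'
}

# Poll the current namespace and print a line whenever the state changes.
# Runs until interrupted; the interval defaults to 1 second.
watch_neigh_state() {
    local ip=$1 interval=${2:-1} prev="" cur
    while true; do
        cur=$(ip neigh show | neigh_state "$ip")
        if [[ $cur != "$prev" ]]; then
            echo "$(date +%T) $ip ${prev:-<none>} -> ${cur:-<none>}"
            prev=$cur
        fi
        sleep "$interval"
    done
}
```

On an affected kernel one would expect the log to keep showing STALE, DELAY, and REACHABLE for the entry without ever showing PROBE.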

Actual results: a random ARP entry never gets updated with a new lladdr.

Expected results: with no matching traffic coming from the old lladdr, ARP probe should eventually trigger and update the entry with a new lladdr.

Additional info: I collected a lot of information on the failure at: https://docs.google.com/document/d/1vmL2X1Cdu9yMmRaJBt5YsFuBvJw_eZj_YWu5SeqA_2U/edit?usp=sharing The "Diggin' the Kernel" section should have all the details needed to understand how the error manifests.

Also note that, unrelated to the OpenStack CI issue I was debugging, a customer reported that after a failover of one of their services, invalid REACHABLE ARP entries remained in place for more than 24 hours after the failover: https://access.redhat.com/support/cases/#/case/01814594 I suspect this is the same issue.

The issue doesn't show up in -pegas builds because the series of patches is included there.

Final note: this situation may render IP address roaming/failover ineffective "thanks" to specially crafted traffic that happens to arrive at the same node. This raises the question of whether the bug should be tracked as security related. That's why I am leaving the bug report closed to the public for now.

Comment 2 Lance Richardson 2017-05-15 14:05:46 UTC

(In reply to Ihar Hrachyshka from comment #0)

> Final note: this situation may render IP address roaming/failover not
> effective "thanks" to a specially crafted traffic that happen to arrive the
> same node. Which begs the question whether there is a reason to track the
> bug as security related. That's why I am leaving the bug report closed from
> public for now.

I don't think this makes things any worse. I think the attacker would need
to use address spoofing to execute such an attack, and the effect
would be similar to that of other address spoofing attacks.

Comment 3 Ihar Hrachyshka 2017-05-16 03:07:11 UTC

No, you don't need to spoof anything to break connectivity between a node X and a service IP1. Just make sure that there is special traffic to node X that makes it confirm the ARP entry for IP1 over and over (that seems to be any traffic to node X from the same subnet on the same L2 domain), and then wait for a failover of IP1 to occur. Once it occurs, X can never get out of the confirmation loop and restore connectivity to IP1.

Comment 4 Ihar Hrachyshka 2017-05-16 16:32:17 UTC

It affects OpenStack CI and customers. We would like to ask for the series to be backported to 7.3 if possible.

Comment 6 Ihar Hrachyshka 2017-05-18 17:41:07 UTC

Marking the bug as a blocker for 7.4. The rationale is as follows:

1. the bug affects OpenStack CI. We patched OpenStack Neutron L3 agent a bit to reduce the risk of failure, but it is not a complete solution.

2. the bug affects production environments (see attached customer case) that do not even rely on Neutron L3 agent gratuitous ARPs. The effect of the bug is that connectivity between two nodes on the same network segment may break and stay broken indefinitely, which is a big deal, and one may even argue that it deserves special security consideration.

I understand we are late in the 7.4 timeframe. To explain why the bug pops up only now: we set up and started debugging the OpenStack CI jobs that could trigger the failure mode several months ago, and spent a lot of time digging to the point where we realized that it's a kernel issue and not an OpenStack one. (Actually, it's a set of issues that, combined, render the CI jobs totally broken.)

Ideally, we would see that in 7.3 too, but at least 7.4 would be a good start. I understand that the series of patches has significant impact and so we need to be cautious about cost-benefit. That being said, the benefit of production environments not breaking on IP failover sounds like a significant one to me.

Comment 7 Lance Richardson 2017-05-19 12:36:13 UTC

Posted:
  http://post-office.corp.redhat.com/archives/rhkernel-list/2017-May/msg01774.html

Corresponding Brew build:
  https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13231449

Comment 8 Ihar Hrachyshka 2017-05-24 17:13:36 UTC

There seems to be a decision that the bug doesn't justify special handling security-wise. For this reason, I am opening the description and relevant comments to the public.

Comment 12 Jianlin Shi 2017-05-27 00:29:21 UTC

set qa_ack based on comment 11

Comment 13 Rafael Aquini 2017-06-05 15:21:17 UTC

Patch(es) committed on kernel repository and an interim kernel build is undergoing testing

Comment 15 Rafael Aquini 2017-06-06 17:10:43 UTC

Patch(es) available on kernel-3.10.0-678.el7

Comment 18 Ihar Hrachyshka 2017-06-09 17:44:04 UTC

Ofer, please advise on how we can test the new kernel in scope of OSP11 CI.

Comment 19 Jianlin Shi 2017-06-15 02:05:34 UTC

reproducer:

[root@ibm-x3650m4-04 arp_test]# cat repo.sh 
#!/bin/bash

ip netns add host1
ip netns add host2
ip netns add host3
brctl addbr br0
ip link add veth1 type veth peer name veth1_br
ip link add veth2 type veth peer name veth2_br
ip link add veth3 type veth peer name veth3_br
ip link set veth1 netns host1
ip link set veth2 netns host2
ip link set veth3 netns host3
brctl addif br0 veth1_br
brctl addif br0 veth2_br
brctl addif br0 veth3_br
ip netns exec host1 ip link set lo up
ip netns exec host1 ip link set veth1 up
ip netns exec host1 ip addr add 192.168.1.1/24 dev veth1
ip netns exec host1 ip addr add 2000::1/64 dev veth1
ip netns exec host2 ip link set lo up
ip netns exec host2 ip link set veth2 up
ip netns exec host2 ip addr add 192.168.1.2/24 dev veth2
ip netns exec host2 ip addr add 2000::2/64 dev veth2
ip netns exec host3 ip link set lo up
ip netns exec host3 ip link set veth3 up
ip netns exec host3 ip addr add 192.168.1.3/24 dev veth3
ip netns exec host3 ip addr add 2000::3/64 dev veth3

ip link set br0 up
ip link set veth1_br up
ip link set veth2_br up
ip link set veth3_br up

ip netns exec host3 nc -l -k 10010 &
sleep 1
ip netns exec host1 taskset --cpu-list 0 nc 192.168.1.3 10010 < /dev/zero &
sleep 5
echo "host3 neigh setup"
ip netns exec host1 ip neigh show
ip netns exec host2 nc -l 10010 &
sleep 1
ip netns exec host1 taskset --cpu-list 0 nc 192.168.1.2 10010 &
sleep 5
echo "host2 neigh setup"
ip netns exec host1 ip neigh show
echo "down host2 and change ip to host3"
ip netns exec host2 ip link set veth2 down
ip netns exec host3 ip addr add 192.168.1.2/24 dev veth3
ip netns exec host3 ip addr sh
sleep 60
echo "host2 neigh stale"
ip netns exec host1 ip neigh show
echo "send packet to host2 on host1"
#ip netns exec host1 taskset --cpu-list 1 nc 192.168.1.2 10010 &
ip netns exec host1 taskset --cpu-list 0 ping 192.168.1.2  -c 1 -W 1 -w 1
sleep 1
i=0
while [ $i -lt 15 ]
do
ip netns exec host1 ip neigh show
echo ""
ip netns exec host1 taskset --cpu-list 0 ping 192.168.1.2  -c 1 -W 1 -w 1
let i+=1
done

jobs -p | xargs kill -9
killall -9 nc


reproduced on 3.10.0-675:

[root@ibm-x3650m4-04 arp_test]# uname -a
Linux ibm-x3650m4-04.rhts.eng.pek2.redhat.com 3.10.0-675.el7.x86_64 #1 SMP Mon May 29 23:22:32 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux


[root@ibm-x3650m4-04 arp_test]# ./repo.sh 
host3 neigh setup
192.168.1.3 dev veth1 lladdr c6:1d:84:d1:09:54 REACHABLE
host2 neigh setup
192.168.1.2 dev veth1 lladdr 16:07:5a:37:06:2c REACHABLE
192.168.1.3 dev veth1 lladdr c6:1d:84:d1:09:54 REACHABLE
down host2 and change ip to host3
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
13: veth3@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether c6:1d:84:d1:09:54 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.1.3/24 scope global veth3
       valid_lft forever preferred_lft forever
    inet 192.168.1.2/24 scope global secondary veth3
       valid_lft forever preferred_lft forever
    inet6 2000::3/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::c41d:84ff:fed1:954/64 scope link 
       valid_lft forever preferred_lft forever
host2 neigh stale
192.168.1.2 dev veth1 lladdr 16:07:5a:37:06:2c STALE
192.168.1.3 dev veth1 lladdr c6:1d:84:d1:09:54 REACHABLE
send packet to host2 on host1
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

192.168.1.2 dev veth1 lladdr 16:07:5a:37:06:2c DELAY
192.168.1.3 dev veth1 lladdr c6:1d:84:d1:09:54 REACHABLE

PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.

--- 192.168.1.2 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

192.168.1.2 dev veth1 lladdr 16:07:5a:37:06:2c DELAY
192.168.1.3 dev veth1 lladdr c6:1d:84:d1:09:54 REACHABLE

PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

192.168.1.2 dev veth1 lladdr 16:07:5a:37:06:2c DELAY
192.168.1.3 dev veth1 lladdr c6:1d:84:d1:09:54 REACHABLE

PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

192.168.1.2 dev veth1 lladdr 16:07:5a:37:06:2c REACHABLE
192.168.1.3 dev veth1 lladdr c6:1d:84:d1:09:54 REACHABLE

<===== mac for 192.168.1.2 not updated


Verified on 3.10.0-679:

[root@ibm-x3650m4-04 arp_test]# uname -a
Linux ibm-x3650m4-04.rhts.eng.pek2.redhat.com 3.10.0-679.el7.x86_64 #1 SMP Mon Jun 5 23:13:08 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux

[root@ibm-x3650m4-04 arp_test]# ./repo.sh 
host3 neigh setup
192.168.1.3 dev veth1 lladdr b2:e4:26:21:6f:48 REACHABLE
host2 neigh setup
192.168.1.2 dev veth1 lladdr f6:ca:a5:2e:a1:79 REACHABLE
192.168.1.3 dev veth1 lladdr b2:e4:26:21:6f:48 REACHABLE
down host2 and change ip to host3
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
83: veth3@if82: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether b2:e4:26:21:6f:48 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.1.3/24 scope global veth3
       valid_lft forever preferred_lft forever
    inet 192.168.1.2/24 scope global secondary veth3
       valid_lft forever preferred_lft forever
    inet6 2000::3/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::b0e4:26ff:fe21:6f48/64 scope link 
       valid_lft forever preferred_lft forever
host2 neigh stale
192.168.1.2 dev veth1 lladdr f6:ca:a5:2e:a1:79 STALE
192.168.1.3 dev veth1 lladdr b2:e4:26:21:6f:48 REACHABLE
send packet to host2 on host1
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

192.168.1.2 dev veth1 lladdr f6:ca:a5:2e:a1:79 DELAY
192.168.1.3 dev veth1 lladdr b2:e4:26:21:6f:48 REACHABLE

PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

192.168.1.2 dev veth1 lladdr f6:ca:a5:2e:a1:79 DELAY
192.168.1.3 dev veth1 lladdr b2:e4:26:21:6f:48 REACHABLE

PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

192.168.1.2 dev veth1 lladdr f6:ca:a5:2e:a1:79 DELAY
192.168.1.3 dev veth1 lladdr b2:e4:26:21:6f:48 REACHABLE

PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

192.168.1.2 dev veth1 lladdr f6:ca:a5:2e:a1:79 PROBE
192.168.1.3 dev veth1 lladdr b2:e4:26:21:6f:48 REACHABLE

PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

192.168.1.2 dev veth1 lladdr f6:ca:a5:2e:a1:79 PROBE
192.168.1.3 dev veth1 lladdr b2:e4:26:21:6f:48 REACHABLE

PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

192.168.1.2 dev veth1 lladdr f6:ca:a5:2e:a1:79 PROBE
192.168.1.3 dev veth1 lladdr b2:e4:26:21:6f:48 REACHABLE

PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=0.049 ms
--- 192.168.1.2 ping statistics ---
2 packets transmitted, 1 received, 50% packet loss, time 999ms
rtt min/avg/max/mdev = 0.049/0.049/0.049/0.000 ms
192.168.1.2 dev veth1 lladdr b2:e4:26:21:6f:48 REACHABLE
192.168.1.3 dev veth1 lladdr b2:e4:26:21:6f:48 REACHABLE

<==== mac for 192.168.1.2 updated
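The pass/fail criterion marked in the two runs above (whether the entry for 192.168.1.2 picked up the MAC address of host3) can also be checked mechanically. A minimal sketch follows, assuming the `ip neigh show` output format seen in the logs above. The function names are made up for this example.

```bash
#!/bin/bash
# Print the lladdr recorded for an IP in `ip neigh show` style text on stdin.
neigh_lladdr() {
    local ip=$1
    awk -v ip="$ip" '$1 == ip {
        for (i = 2; i < NF; i++) if ($i == "lladdr") { print $(i + 1); exit }
    }'
}

# Succeed if both IPs resolve to the same, non-empty lladdr.
# Arguments: neighbour table text, first IP, second IP.
same_lladdr() {
    local text=$1 a b
    a=$(printf '%s\n' "$text" | neigh_lladdr "$2")
    b=$(printf '%s\n' "$text" | neigh_lladdr "$3")
    [[ -n $a && $a == "$b" ]]
}

# Example, at the end of the reproducer:
# same_lladdr "$(ip netns exec host1 ip neigh show)" 192.168.1.2 192.168.1.3 \
#     && echo "PASS: entry updated" || echo "FAIL: entry not updated"
```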

Comment 20 Ihar Hrachyshka 2017-07-05 18:44:55 UTC

For the record, we tested the new kernel in an OSP environment, and it fixed the CI issue we experienced.

Comment 22 Lance Richardson 2017-07-05 19:11:13 UTC

The first kernel version with the backported fix for this BZ was -678, which
can be found here:

  http://download-node-02.eng.bos.redhat.com/brewroot/packages/kernel/3.10.0/678.el7/
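To check whether a given RHEL 7 kernel is at or past that first fixed build, the build number can be compared against 678. This is only a sketch: it assumes the release string follows the 3.10.0-<build> pattern seen in this report, the function name is made up, and it cannot tell whether a z-stream kernel with a lower build number carries the backport separately.

```bash
#!/bin/bash
# Succeed if a RHEL 7 kernel release string is at or past build 678.
# Returns 1 for an older 3.10.0 build and 2 for any string that does not
# match the assumed "3.10.0-<build>" pattern.
has_neigh_confirm_fix() {
    local release=$1 build
    if [[ $release =~ ^3\.10\.0-([0-9]+) ]]; then
        build=${BASH_REMATCH[1]}
        (( 10#$build >= 678 ))
    else
        return 2
    fi
}

# Example:
# has_neigh_confirm_fix "$(uname -r)" && echo "fix should be present"
```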

Comment 28 errata-xmlrpc 2017-08-02 07:31:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1842