Bug 1866495 - OpenShift 3.11.248 fix for CVE-2020-8558 has exposed RHEL 7 source IP bug
Summary: OpenShift 3.11.248 fix for CVE-2020-8558 has exposed RHEL 7 source IP bug
Keywords:
Status: CLOSED DUPLICATE of bug 1866132
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.11.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-05 17:50 UTC by Brad
Modified: 2023-12-15 18:43 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-10 16:31:59 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System: Red Hat Knowledge Base (Solution), ID: 5308541, Private: 0, Priority: None, Status: None, Summary: None, Last Updated: 2020-08-10 17:01:04 UTC

Description Brad 2020-08-05 17:50:24 UTC
Description of problem:

Using the latest OpenShift 3.11.248 (as part of Red Hat OpenShift Kubernetes Service on IBM Cloud, ROKS 3.11), we are seeing many cases where liveness and readiness probes are failing for pods.  The probes that are failing are ones that try to contact a port that is listening on localhost of a hostNetwork pod (calico-node is one example).  We have tracked down what we believe to be the cause, and it is a combination of:

1. A new rule in the iptables filter table, added to address CVE-2020-8558, is dropping these liveness/readiness check packets.  Here is the rule/chain on a node experiencing this problem (with many dropped packets); a rough reconstruction of this rule as an iptables command is included at the end of this description:

```
Chain KUBE-FIREWALL (2 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes firewall for dropping marked packets */ mark match 0x8000/0x8000
25377 2426K DROP       all  --  *      *      !127.0.0.0/8          127.0.0.0/8          /* block incoming localnet connections */ ! ctstate RELATED,ESTABLISHED,DNAT
```

And here is the rule on a node that is NOT experiencing this problem (no DROPs):

```
Chain KUBE-FIREWALL (2 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes firewall for dropping marked packets */ mark match 0x8000/0x8000
    0     0 DROP       all  --  *      *      !127.0.0.0/8          127.0.0.0/8          /* block incoming localnet connections */ ! ctstate RELATED,ESTABLISHED,DNAT
```

2. On the RHEL nodes that are hitting this problem, when I run curl from the local node against a liveness/readiness check port, doing exactly what kubelet does (e.g. `curl http://localhost:9099/`), it times out.  In tcpdump I can see that the source IP of the packets is NOT a localhost IP (i.e. not in 127.0.0.0/8); instead it is, for some reason, the private IP of the node, 10.138.184.112, and the SYN packets get no response because they are being dropped by the iptables rule shown above.

```
[root@kube-bskvlvjs041nuk5r1pb0-kubee2epvgx-default-00000144 ~]# tcpdump -lnei lo port 9099
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
10:02:43.732740 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 74: 10.138.184.112.36788 > 127.0.0.1.9099: Flags [S], seq 3331120788, win 65535, options [mss 65495,sackOK,TS val 52407966 ecr 0,nop,wscale 9], length 0
10:02:44.733879 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 74: 10.138.184.112.36788 > 127.0.0.1.9099: Flags [S], seq 3331120788, win 65535, options [mss 65495,sackOK,TS val 52408968 ecr 0,nop,wscale 9], length 0
10:02:46.737949 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 74: 10.138.184.112.36788 > 127.0.0.1.9099: Flags [S], seq 3331120788, win 65535, options [mss 65495,sackOK,TS val 52410972 ecr 0,nop,wscale 9], length 0
10:02:50.741899 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 74: 10.138.184.112.36788 > 127.0.0.1.9099: Flags [S], seq 3331120788, win 65535, options [mss 65495,sackOK,TS val 52414976 ecr 0,nop,wscale 9], length 0
10:07:25.852721 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 74: 10.138.184.112.41694 > 127.0.0.1.9099: Flags [S], seq 3430041411, win 65535, options [mss 65495,sackOK,TS val 52690086 ecr 0,nop,wscale 9], length 0
```

On a node that is working, tcpdump shows the source IP of this traffic as 127.0.0.1 (as expected), and the traffic is therefore not being dropped:

```
[root@kube-bskvlvjs041nuk5r1pb0-kubee2epvgx-default-00000210 ~]# tcpdump -lnei lo port 9099
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
10:05:02.824523 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 74: 127.0.0.1.19572 > 127.0.0.1.9099: Flags [S], seq 9456684, win 65535, options [mss 65495,sackOK,TS val 52541273 ecr 0,nop,wscale 9], length 0
10:05:02.824575 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 74: 127.0.0.1.9099 > 127.0.0.1.19572: Flags [S.], seq 2108984837, ack 9456685, win 65535, options [mss 65495,sackOK,TS val 52541274 ecr 52541273,nop,wscale 9], length 0
10:05:02.824598 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 66: 127.0.0.1.19572 > 127.0.0.1.9099: Flags [.], ack 1, win 256, options [nop,nop,TS val 52541274 ecr 52541274], length 0
10:05:02.824909 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 170: 127.0.0.1.19572 > 127.0.0.1.9099: Flags [P.], seq 1:105, ack 1, win 256, options [nop,nop,TS val 52541274 ecr 52541274], length 104
10:05:02.824934 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 66: 127.0.0.1.9099 > 127.0.0.1.19572: Flags [.], ack 105, win 256, options [nop,nop,TS val 52541274 ecr 52541274], length 0
10:05:02.825221 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 130: 127.0.0.1.9099 > 127.0.0.1.19572: Flags [P.], seq 1:65, ack 105, win 256, options [nop,nop,TS val 52541274 ecr 52541274], length 64
10:05:02.825242 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 66: 127.0.0.1.19572 > 127.0.0.1.9099: Flags [.], ack 65, win 256, options [nop,nop,TS val 52541274 ecr 52541274], length 0
10:05:02.826434 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 66: 127.0.0.1.19572 > 127.0.0.1.9099: Flags [F.], seq 105, ack 65, win 256, options [nop,nop,TS val 52541275 ecr 52541274], length 0
10:05:02.826556 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 66: 127.0.0.1.9099 > 127.0.0.1.19572: Flags [F.], seq 65, ack 106, win 256, options [nop,nop,TS val 52541276 ecr 52541275], length 0
10:05:02.826573 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 66: 127.0.0.1.19572 > 127.0.0.1.9099: Flags [.], ack 66, win 256, options [nop,nop,TS val 52541276 ecr 52541276], length 0
```

My guess is that this source IP inconsistency/bug has been around for a while, but it wasn't a problem for OpenShift until this security fix.
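
For reference, here is a rough way to confirm both pieces by hand on an affected node.  The iptables rule spec below is my reconstruction from the `-L` output above, so it may not match the exact invocation the platform uses; treat it as a sketch:

```
# Check whether the CVE-2020-8558 drop rule is present, using a spec
# reconstructed from the KUBE-FIREWALL listing above (if -C reports no
# match, compare against `iptables -S KUBE-FIREWALL`):
iptables -t filter -C KUBE-FIREWALL ! -s 127.0.0.0/8 -d 127.0.0.0/8 \
  -m conntrack ! --ctstate RELATED,ESTABLISHED,DNAT \
  -m comment --comment "block incoming localnet connections" -j DROP \
  && echo "localnet drop rule present"

# Reproduce the probe by hand and watch the drop counters; on an affected
# node the pkts/bytes counters on the "block incoming localnet connections"
# rule climb with every attempt:
iptables -t filter -L KUBE-FIREWALL -v -n
curl --max-time 5 http://localhost:9099/    # times out on an affected node
iptables -t filter -L KUBE-FIREWALL -v -n   # drop counter has increased
```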

Version-Release number of selected component (if applicable): OpenShift 3.11.248


How reproducible:
Not exactly sure; we see it in our regression testing on about 25% of the nodes we deploy.


Steps to Reproduce:
1. Deploy the latest OpenShift 3.11 with the calico-node pod (or any other hostNetwork pod whose liveness probe curls a localhost port); a rough sketch of such a pod follows these steps
2. Keep adding worker nodes until one of them has the pod repeatedly failing its liveness probe
3.
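
For illustration, here is a rough sketch of a pod that exercises the same code path; the name, image, and port are placeholders (not the actual calico-node spec), and the only parts that matter are hostNetwork plus a liveness probe aimed at 127.0.0.1:

```
# Hypothetical reproducer pod; the image must actually serve HTTP on
# 127.0.0.1:9099 in the host network namespace for the probe to mean anything.
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: localhost-probe-repro
spec:
  hostNetwork: true
  containers:
  - name: app
    image: registry.example.com/localhost-listener:latest   # placeholder image
    livenessProbe:
      httpGet:
        host: 127.0.0.1   # probe a port listening on localhost, like calico-node
        port: 9099
        path: /
      initialDelaySeconds: 10
      periodSeconds: 10
EOF
```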

Actual results:

On certain nodes, the probe traffic's source IP is not in 127.0.0.0/8, which causes the liveness probe packets to be dropped

Expected results:

Liveness probes from kubelet to localhost ports should succeed (or at least not be blocked by the k8s iptables rule listed above)


Additional info:

Comment 1 Ben Bennett 2020-08-05 18:24:27 UTC
This was fixed in origin by https://github.com/openshift/origin/pull/25141/files for the bug https://bugzilla.redhat.com/show_bug.cgi?id=1849175.  But that's just a backport of Casey's https://github.com/kubernetes/kubernetes/pull/91569 PR.

Comment 2 Brad 2020-08-05 20:59:29 UTC
Just to clarify, I don't believe this bug has been fixed yet.  The PRs mentioned above solve CVE-2020-8558, but I believe they CAUSE the case I detailed above.  This bug is about how the fix for CVE-2020-8558 seems to have blocked liveness/readiness probes that use network calls to localhost from the local kubelet.

Comment 3 Brad 2020-08-06 21:10:29 UTC
I was able to narrow this down even further.  The problem shows up only on nodes that have had a HostPort pod run on them: running a HostPort pod on a node causes the following iptables MASQUERADE rule to be added at the end of the POSTROUTING chain in the nat table:

```
Chain POSTROUTING (policy ACCEPT 55 packets, 3828 bytes)
 pkts bytes target     prot opt in     out     source               destination         
18772 1304K cali-POSTROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:O3lYWMrLQYEMJtB5 */
17249 1218K KUBE-POSTROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */
    0     0 MASQUERADE  all  --  *      !docker0  172.17.0.0/16        0.0.0.0/0           
  975  107K MASQUERADE  all  --  *      lo      127.0.0.0/8          0.0.0.0/0            /* SNAT for localhost access to hostports */
```

This `/* SNAT for localhost access to hostports */` rule is what is causing the behavior I described initially in this ticket.  When I remove this rule manually, the behavior goes back to normal and the liveness probe succeeds.
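
In case it helps anyone else reproducing this, here is a rough sketch of that manual removal, assuming the rule spec matches the listing above exactly; note the component that installed the rule may well re-add it on its next sync:

```
# Confirm the exact rule spec first:
iptables -t nat -S POSTROUTING | grep -i hostports

# Delete by spec; if this does not match exactly, delete by rule number
# instead (iptables -t nat -L POSTROUTING --line-numbers -v, then
# iptables -t nat -D POSTROUTING <num>):
iptables -t nat -D POSTROUTING -s 127.0.0.0/8 -o lo \
  -m comment --comment "SNAT for localhost access to hostports" -j MASQUERADE
```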

I think it is something in the OpenShift components that implement HostPorts that adds this rule, but I'm not sure about that.  I've looked at the "portmap" CNI plugin, which is used to implement HostPort on base k8s clusters, and it uses more specific iptables rules so that only traffic bound for a HostPort is MASQ'd.  I think something similar would fix this issue.
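
For comparison between the two implementations, dumping the nat table in rule-spec form is the easiest way I've found to see how narrowly a given node scopes its hostport SNAT; this is just a generic inspection command, not specific to any one plugin:

```
# List every hostport- or masquerade-related nat rule in -S (rule-spec) form,
# which makes it easy to diff a node using this platform's hostport handling
# against one using the portmap CNI plugin:
iptables -t nat -S | grep -iE 'hostport|masquerade'
```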

Comment 4 Casey Callendrello 2020-08-10 11:58:16 UTC
Brad,
Thanks for the excellent analysis. I think you're exactly right.

Comment 5 Casey Callendrello 2020-08-10 11:58:27 UTC
*** Bug 1866132 has been marked as a duplicate of this bug. ***

Comment 6 Casey Callendrello 2020-08-10 16:23:09 UTC
Update: we need to backport https://github.com/kubernetes/kubernetes/pull/80591 to CRI-O.

Comment 7 Casey Callendrello 2020-08-10 16:31:59 UTC
For bureaucratic reasons, marking this as a duplicate of bug 1866132. Your analysis was spot-on, and it turns out we already have a fix that needs to be backported.

*** This bug has been marked as a duplicate of bug 1866132 ***

