Bug 1747532

Summary: incorrect service iptables rules on one node
Product: OpenShift Container Platform
Reporter: Dan Winship <danw>
Component: Networking
Assignee: Dan Winship <danw>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED CURRENTRELEASE
Docs Contact:
Severity: high
Priority: high
CC: aos-bugs, bbennett, cdc, dcbw, deads, eparis, knewcome, rhowe
Version: 4.2.0
Target Milestone: ---
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Upstream had an error in the DeltaFIFO code. Consequence: Updates were dropped, causing us to miss events, so the corresponding rules weren't present. Fix: Rebase to a later upstream version. Result: Events aren't dropped.
Story Points: ---
Clone Of:
: 1812063 (view as bug list)
Environment:
Last Closed: 2019-11-08 11:27:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1812063

Description Dan Winship 2019-08-30 17:49:45 UTC
Eric Paris had a cluster where routes were inaccessible from one node. Investigation showed:

good node:
  -A KUBE-SERVICES -d 172.30.2.238/32 -p tcp -m comment --comment "openshift-ingress/router-default:http cluster IP" -m tcp --dport 80 -j KUBE-SVC-HEVFQXAKPPGAL4BV
  -A KUBE-SVC-HEVFQXAKPPGAL4BV -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-77P4SFRX7UPHHEOO
  -A KUBE-SVC-HEVFQXAKPPGAL4BV -j KUBE-SEP-X2EX327DV76UI2QJ
  -A KUBE-SEP-77P4SFRX7UPHHEOO -p tcp -m tcp -j DNAT --to-destination 10.128.2.4:80
  -A KUBE-SEP-X2EX327DV76UI2QJ -p tcp -m tcp -j DNAT --to-destination 10.131.0.6:80

bad node:
  -A KUBE-SERVICES -d 172.30.2.238/32 -p tcp -m comment --comment "openshift-ingress/router-default:http has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable

SDN logs showed that the bad node had seen the endpoints previously:

  I0830 01:44:22.612458    2535 service.go:332] Adding new service port "openshift-ingress/router-default:http" at 172.30.2.238:80/TCP
  I0830 01:44:22.618573    2535 roundrobin.go:310] LoadBalancerRR: Setting endpoints for openshift-ingress/router-default:http to [10.128.2.4:80 10.131.0.6:80]

and there was nothing after that suggesting that it had reason to believe the endpoints went away. We don't know whether it ever had the right iptables rules or not.

The bad node was also missing a bunch of other iptables rules. E.g., it had no rules at all for openshift-apiserver/api, and in fact, it seems to have never gotten a ServiceAdd for it:

good node:

  I0830 01:37:30.911095    4220 vnids.go:148] Associate netid 16577033 to namespace "openshift-apiserver" with mcEnabled false
  I0830 01:37:30.942157    4220 proxy.go:344] sdn proxy: add svc openshift-apiserver/api: &Service{...}
  I0830 01:37:30.942342    4220 service.go:332] Adding new service port "openshift-apiserver/api:https" at 172.30.223.112:443/TCP
  I0830 01:38:06.208773    4220 roundrobin.go:310] LoadBalancerRR: Setting endpoints for openshift-apiserver/api:https to [10.128.0.12:8443]
  ...

bad node:

  I0830 01:41:34.830345    2535 roundrobin.go:310] LoadBalancerRR: Setting endpoints for openshift-apiserver/api:https to [10.128.0.12:8443]
  ...
  I0830 01:41:35.832755    2535 vnids.go:148] Associate netid 16577033 to namespace "openshift-apiserver" with mcEnabled false

OK, so this is interesting... The "openshift-apiserver" Namespace was created at 01:37:30, and likewise the NetNamespace, and the "api" service. And the good node picked them up right away. But the bad node didn't see the NetNamespace until 4 minutes later. And it saw the "openshift-ingress" NetNamespace appear at the same time:

  I0830 01:41:35.832558    2535 vnids.go:148] Associate netid 13444890 to namespace "openshift-ingress" with mcEnabled false
  I0830 01:41:35.832635    2535 vnids.go:148] Associate netid 8158642 to namespace "openshift-node" with mcEnabled false
  I0830 01:41:35.832693    2535 vnids.go:148] Associate netid 13775210 to namespace "openshift" with mcEnabled false
  I0830 01:41:35.832755    2535 vnids.go:148] Associate netid 16577033 to namespace "openshift-apiserver" with mcEnabled false

so the bad node's VNID informer seems to have stalled for 5 minutes. (The last on-time update was at 01:34:46.)

(This doesn't explain why the proxy is missing updates though, since the proxy doesn't care about VNIDs. It just points to some sort of problem within the sdn process.)

This is not because of the bootstrap switchover, which appears to have happened at 01:43.

This looks a lot like bug 1668414, aka "Clusters that Eric creates end up missing Services/Endpoints except that no one else can reproduce it so probably OpenShift just hates Eric". But that was supposed to be fixed by the hybrid proxy updates that dcbw did. There may still be issues there though...

Comment 1 Casey Callendrello 2019-08-30 17:54:41 UTC
Are there metrics we can use to detect this?

If iptables itself is failing to run, we can catch that with a metric. However, if the change is simply ignored, then we're out of luck.

Comment 3 Dan Winship 2019-08-30 18:06:19 UTC
It's not that iptables is failing to run; if iptables were temporarily failing to run, then everything would be fixed once it started running again (since we resync the entire table when we resync). And if it were permanently failing to run, then we'd see a set of rules that had been correct at some point in the past but was no longer correct, whereas what we actually see is a set of rules that are mostly correct for the present, but randomly missing a few things.

It appears that the proxy's internal state does not reflect all of the notifications it should have received. Either because it didn't receive all of the notifications that it should have, or because it updated its internal state incorrectly at some point.
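
For reference, a rough sketch (illustrative Go only, not the actual kube-proxy proxier code; the type and field names are made up) of the "resync the entire table" behavior described above: each sync rebuilds the complete desired NAT ruleset from the proxy's internal service/endpoint maps and applies it in one iptables-restore transaction, so a transient iptables failure would self-heal on the next sync, and the rules only stay wrong if the internal maps themselves are missing events.

  package main

  import (
      "bytes"
      "fmt"
      "os/exec"
  )

  // proxyState stands in for the proxy's internal view of the world; the
  // field layout here is hypothetical.
  type proxyState struct {
      services  map[string]string   // service port name -> cluster IP
      endpoints map[string][]string // service port name -> endpoint IP:port list
  }

  func (p *proxyState) syncProxyRules() error {
      var buf bytes.Buffer
      buf.WriteString("*nat\n")
      for svc, clusterIP := range p.services {
          eps := p.endpoints[svc]
          if len(eps) == 0 {
              // This is where the bad node's REJECT rule comes from: the internal
              // map says "no endpoints", so every regenerated ruleset says so too.
              fmt.Fprintf(&buf, "-A KUBE-SERVICES -d %s/32 -p tcp -m comment --comment %q -j REJECT\n",
                  clusterIP, svc+" has no endpoints")
              continue
          }
          for _, ep := range eps {
              fmt.Fprintf(&buf, "-A KUBE-SEP-XXXXXXXXXXXXXXXX -p tcp -m tcp -j DNAT --to-destination %s\n", ep)
          }
      }
      buf.WriteString("COMMIT\n")

      // Applied as one transaction; a transient failure leaves the previous
      // (complete) ruleset in place and gets retried on the next sync.
      cmd := exec.Command("iptables-restore", "--noflush")
      cmd.Stdin = &buf
      return cmd.Run()
  }

  func main() {
      p := &proxyState{
          services:  map[string]string{"openshift-ingress/router-default:http": "172.30.2.238"},
          endpoints: map[string][]string{"openshift-ingress/router-default:http": {"10.128.2.4:80", "10.131.0.6:80"}},
      }
      _ = p.syncProxyRules()
  }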

Comment 5 Casey Callendrello 2019-09-02 15:52:15 UTC
I did a lot of debugging for a similar issue - it came up when we were working on disaster recovery, and the SDN watch failed to react to an endpoint update for default/kubernetes. The basic gist is this:

The sdn watches Endpoints. As with all watches, it gets an Update(old, new) event. As updates come in, it maintains a list of services/endpoints that need to be changed, which will be applied the next time the proxy sync loop runs. The goal is to minimize real changes.

As a shortcut, if it sees Update(A, A), then it short-circuits and throws the update away. Now, to be fair, this is "guaranteed" not to happen by the informer code. However, it does happen in times of etcd disruption. I did a bunch of debugging, and never could pin down the issue. I put it down to guarantees being forfeit when etcd is recovered.

However, this seems to show that perhaps this bug *does* exist in normal cases. We should consider getting a change upstream in the EndpointChangeTracker that compares New against the current known state, rather than the Old as generated by the informer.
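
To make that concrete, a minimal sketch (Go, not the actual EndpointChangeTracker code; the type and field names are made up for illustration) of the two comparison strategies:

  package tracker

  import "reflect"

  type endpoints struct {
      name  string
      addrs []string
  }

  type tracker struct {
      cache   map[string]*endpoints // what the proxy currently believes
      pending map[string]*endpoints // changes to apply on the next sync loop
  }

  // updateTrustingInformer is the short-circuit described above: it trusts the
  // informer's "old" object, so an Update(A, A) is thrown away even if the
  // proxy's own cache has drifted out of sync.
  func (t *tracker) updateTrustingInformer(oldObj, newObj *endpoints) {
      if reflect.DeepEqual(oldObj, newObj) {
          return
      }
      t.cache[newObj.name] = newObj
      t.pending[newObj.name] = newObj
  }

  // updateAgainstCache is the proposed alternative: compare the new object
  // against our cached state instead, so even a spurious Update(A, A) from a
  // disrupted informer re-converges the proxy's view.
  func (t *tracker) updateAgainstCache(oldObj, newObj *endpoints) {
      if cur, ok := t.cache[newObj.name]; ok && reflect.DeepEqual(cur, newObj) {
          return
      }
      t.cache[newObj.name] = newObj
      t.pending[newObj.name] = newObj
  }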

Comment 6 Dan Winship 2019-09-03 12:48:15 UTC
Based on the logs, that's not what happened in this case though. We got an Add for the Endpoints, and then did not log any further events for it, but then somehow ended up with no Endpoints.

Comment 7 Casey Callendrello 2019-09-04 16:36:06 UTC
I don't see the Add for the Service. Did we miss that, or did you just elide it from the logs?

Comment 8 Dan Winship 2019-09-04 18:58:34 UTC
I just didn't mention it:

I0830 01:44:22.612458    2535 service.go:332] Adding new service port "openshift-ingress/router-default:http" at 172.30.2.238:80/TCP
I0830 01:44:22.612483    2535 service.go:332] Adding new service port "openshift-ingress/router-default:https" at 172.30.2.238:443/TCP
I0830 01:44:22.618573    2535 roundrobin.go:310] LoadBalancerRR: Setting endpoints for openshift-ingress/router-default:http to [10.128.2.4:80 10.131.0.6:80]
I0830 01:44:22.618634    2535 roundrobin.go:310] LoadBalancerRR: Setting endpoints for openshift-ingress/router-default:https to [10.128.2.4:443 10.131.0.6:443]
I0830 01:44:22.774292    2535 proxier.go:1464] Opened local port "nodePort for openshift-ingress/router-default:http" (:31392/tcp)
I0830 01:44:22.774453    2535 proxier.go:1464] Opened local port "nodePort for openshift-ingress/router-default:https" (:30152/tcp)
I0830 01:44:22.806218    2535 healthcheck.go:151] Opening healthcheck "openshift-ingress/router-default" on port 31765

oh... huh, there's an "Adding new service port" from vendor/k8s.io/kubernetes/pkg/proxy/service.go, but no "sdn proxy: add svc ..." from pkg/network/proxy/proxy.go... So yeah, we didn't get an Add for the Service in this case. (Though we got an Update for it I guess?)
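
A simplified sketch (illustrative Go, not the actual pkg/network/proxy/proxy.go code, with made-up type names) of why those two log lines can diverge, assuming the vendored layer logs "Adding new service port" for any event that introduces a previously unseen port while the SDN wrapper only logs "sdn proxy: add svc" from its Add handler:

  package proxy

  import "k8s.io/klog"

  // service and serviceHandler are stand-ins for the real types, just to show
  // the logging difference between the two layers.
  type service struct {
      namespace, name string
  }

  type serviceHandler interface {
      OnServiceAdd(svc *service)
      OnServiceUpdate(oldSvc, newSvc *service)
  }

  type hybridProxy struct {
      inner serviceHandler // the vendored kube-proxy layer
  }

  func (p *hybridProxy) OnServiceAdd(svc *service) {
      // This is the log line that is missing on the bad node.
      klog.Infof("sdn proxy: add svc %s/%s", svc.namespace, svc.name)
      p.inner.OnServiceAdd(svc)
  }

  func (p *hybridProxy) OnServiceUpdate(oldSvc, newSvc *service) {
      // No "add svc" log here, but the inner layer will still print
      // "Adding new service port" if it has never seen this port before.
      p.inner.OnServiceUpdate(oldSvc, newSvc)
  }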

Comment 9 Casey Callendrello 2019-09-09 14:39:56 UTC
> (Though we got an Update for it I guess?)

Yeah, that's related to what's been on my mind: when we see Update(A1, A2), we should compare A2 against A_cached, not A1.

Of course, the informers should "never" have this happen. But clearly it does.

Comment 10 David Eads 2019-09-09 18:23:19 UTC
If you suspect that your informer isn't working, you could add debugging like https://github.com/openshift/origin/pull/21851 to the informers in question. That output would be enough to serve as a minimal reproducer of a bad notification.

We are fairly confident that informers aren't dropping notifications since every controller and operator is built on them, but that output could provide proof.
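
A minimal sketch of that kind of informer-side debugging (this is not the content of the linked PR, and the helper name is hypothetical): attach a logging event handler to the endpoints informer and record resource versions, so a dropped or collapsed notification shows up as a gap in the node's log.

  package debug

  import (
      v1 "k8s.io/api/core/v1"
      "k8s.io/client-go/informers"
      "k8s.io/client-go/tools/cache"
      "k8s.io/klog"
  )

  // addEndpointsDebugLogging is a hypothetical helper, not part of openshift-sdn.
  func addEndpointsDebugLogging(factory informers.SharedInformerFactory) {
      factory.Core().V1().Endpoints().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
          AddFunc: func(obj interface{}) {
              ep := obj.(*v1.Endpoints)
              klog.Infof("DEBUG informer add %s/%s rv=%s", ep.Namespace, ep.Name, ep.ResourceVersion)
          },
          UpdateFunc: func(oldObj, newObj interface{}) {
              o, n := oldObj.(*v1.Endpoints), newObj.(*v1.Endpoints)
              klog.Infof("DEBUG informer update %s/%s rv %s -> %s", n.Namespace, n.Name, o.ResourceVersion, n.ResourceVersion)
          },
          DeleteFunc: func(obj interface{}) {
              if ep, ok := obj.(*v1.Endpoints); ok {
                  klog.Infof("DEBUG informer delete %s/%s rv=%s", ep.Namespace, ep.Name, ep.ResourceVersion)
              }
          },
      })
  }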

Comment 11 Casey Callendrello 2019-09-10 09:33:53 UTC
Ah, that's a good suggestion.

Eric, any hints as to how to reproduce this?

Comment 12 Dan Winship 2019-11-08 11:27:51 UTC
Should be fixed by the DeltaFIFO fix from upstream.