Bug 1707006 - kube-proxy: etcd recovery causes endpoint updates to be missed
Summary: kube-proxy: etcd recovery causes endpoint updates to be missed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Casey Callendrello
QA Contact: Weibin Liang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-05-06 15:25 UTC by Luis Sanchez
Modified: 2022-08-22 02:56 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:03:45 UTC
Target Upstream Version:
Embargoed:


Attachments
logs (990.11 KB, application/gzip)
2019-05-09 15:48 UTC, Scott Dodson


Links
Red Hat Product Errata RHBA-2020:0062, last updated 2020-01-23 11:04:13 UTC

Description Luis Sanchez 2019-05-06 15:25:47 UTC
Description of problem:

The SDN operator was not indicating Degraded status while the network was down.

Comment 1 Luis Sanchez 2019-05-06 15:29:25 UTC
The service network was not up; for example, the iptables rules for kubernetes.default.svc did not match the cluster state.

Comment 2 Casey Callendrello 2019-05-06 17:14:46 UTC
Some background: this was a cluster being used to test disaster recovery. *Every single node* had a stale endpoint for the apiserver in iptables.

The relevant logs look like this:
I0506 13:48:30.789227    2122 roundrobin.go:310] LoadBalancerRR: Setting endpoints for default/kubernetes:https to [10.0.153.168:6443 10.0.173.211:6443]
I0506 13:48:30.789276    2122 roundrobin.go:240] Delete endpoint 10.0.132.206:6443 for service "default/kubernetes:https"
...
I0506 13:52:56.738151    2122 roundrobin.go:310] LoadBalancerRR: Setting endpoints for default/kubernetes:https to [10.0.173.211:6443]
I0506 13:52:56.738193    2122 roundrobin.go:240] Delete endpoint 10.0.153.168:6443 for service "default/kubernetes:https"
...
W0506 13:54:12.300749    2122 reflector.go:256] k8s.io/client-go/informers/factory.go:132: watch of *v1.NetworkPolicy ended with: too old resource version: 17428 (17711)
W0506 13:54:12.347245    2122 reflector.go:256] github.com/openshift/client-go/network/informers/externalversions/factory.go:101: watch of *v1.NetNamespace ended with: too old resource version: 15580 (17728)
W0506 13:54:17.285158    2122 reflector.go:256] k8s.io/client-go/informers/factory.go:132: watch of *v1.Service ended with: too old resource version: 17142 (17711)
W0506 13:54:17.285181    2122 reflector.go:256] github.com/openshift/client-go/network/informers/externalversions/factory.go:101: watch of *v1.HostSubnet ended with: too old resource version: 17379 (17728)
E0506 13:54:32.269844    2122 reflector.go:237] k8s.io/client-go/informers/factory.go:132: Failed to watch *v1.Namespace: Get https://api-int.vrutkovs.devcluster.openshift.com:6443/api/v1/namespaces?resourceVersion=15591&timeout=5m42s&timeoutSeconds=342&watch=true: dial tcp 10.0.146.189:6443: connect: connection refused
E0506 13:54:32.270097    2122 reflector.go:237] k8s.io/client-go/informers/factory.go:132: Failed to watch *v1.Pod: Get https://api-int.vrutkovs.devcluster.openshift.com:6443/api/v1/pods?resourceVersion=17652&timeout=6m57s&timeoutSeconds=417&watch=true: dial tcp 10.0.146.189:6443: i/o timeout
E0506 13:54:32.270628    2122 reflector.go:237] github.com/openshift/client-go/network/informers/externalversions/factory.go:101: Failed to watch *v1.EgressNetworkPolicy: Get https://api-int.vrutkovs.devcluster.openshift.com:6443/apis/network.openshift.io/v1/egressnetworkpolicies?resourceVersion=14733&timeout=6m55s&timeoutSeconds=415&watch=true: dial tcp 10.0.146.189:6443: i/o timeout
E0506 13:54:32.270766    2122 reflector.go:237] k8s.io/client-go/informers/factory.go:132: Failed to watch *v1.Endpoints: Get https://api-int.vrutkovs.devcluster.openshift.com:6443/api/v1/endpoints?resourceVersion=17708&timeout=9m0s&timeoutSeconds=540&watch=true: dial tcp 10.0.146.189:6443: i/o timeout
...
I0506 14:02:03.327985    2122 roundrobin.go:310] LoadBalancerRR: Setting endpoints for default/kubernetes:https to [10.0.153.168:6443]
I0506 14:02:03.328013    2122 roundrobin.go:240] Delete endpoint 10.0.153.168:6443 for service "default/kubernetes:https"
I0506 14:02:03.328026    2122 roundrobin.go:240] Delete endpoint 10.0.173.211:6443 for service "default/kubernetes:https"


However, the service only has one endpoint in iptables, and it's 10.0.173.211:6443.


Looking at the logs, every other service in iptables seems to reflect the most recent endpoint values exactly. It's just the apiserver that's wrong. Why?
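
A quick way to compare the two views is to check the Endpoints object against what kube-proxy actually programmed on a node. A rough sketch (the KUBE-SVC chain hash below is the one for default/kubernetes:https on this particular cluster and will differ elsewhere):

# What the apiserver reports
oc get endpoints kubernetes -n default

# What kube-proxy programmed (run on the node, e.g. via
# "oc debug node/<node-name>" followed by "chroot /host")
iptables-save -t nat | grep KUBE-SVC-NPX46M4PTMTKRN6Y    # jumps to the per-endpoint chains
iptables-save -t nat | grep -E 'KUBE-SEP-.* -j DNAT'     # endpoint IPs behind those chains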

Comment 3 Dan Williams 2019-05-06 17:40:15 UTC
Unidling is clearly active in this cluster, but is it being used? Are there any idled services? What are the metrics for the iptables proxy sync time?
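
For reference, one way to pull those numbers is to scrape kube-proxy's own metrics endpoint on a node; a sketch, assuming the upstream kube-proxy default metrics address 127.0.0.1:10249 (openshift-sdn may expose it differently):

# on the node (oc debug node/<node-name>; chroot /host)
curl -s http://127.0.0.1:10249/metrics | grep sync_proxy_rules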

Comment 4 Vadim Rutkovsky 2019-05-06 17:54:14 UTC
Steps to Reproduce:
1. Create an AWS cluster
2. Take down two masters (I deleted the machines via the Machine API; make sure to pick masters that are not running the machine-api controller). See the sketch after this list.
3. Restore the remaining master etcd using https://github.com/hexfusion/openshift-recovery/blob/master/bin/restore.sh
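
For step 2, a minimal sketch of taking the masters down through the Machine API (the machine names are examples; check which master hosts the machine-api controller first and leave that one alone):

oc get machines -n openshift-machine-api
oc get pods -n openshift-machine-api -o wide             # note which master runs the controllers
oc delete machine <cluster-id>-master-1 -n openshift-machine-api
oc delete machine <cluster-id>-master-2 -n openshift-machine-api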

Comment 5 Dan Williams 2019-05-07 01:56:49 UTC
The cluster eventually recovered, but perhaps only after I kicked it by creating a new service and some endpoints for that service. Note that the roundrobin.go logs aren't necessarily authoritative; e.g., after this message:

I0506 21:28:17.461949    2122 roundrobin.go:310] LoadBalancerRR: Setting endpoints for default/kubernetes:https to [10.0.138.21:6443 10.0.153.168:6443 10.0.167.122:6443]
I0506 21:28:17.461986    2122 roundrobin.go:240] Delete endpoint 10.0.138.21:6443 for service "default/kubernetes:https"

looking directly at iptables showed that 10.0.138.21 was still present:

sh-4.2# iptables-save | grep KUBE-SVC-NPX46M4PTMTKRN6Y
:KUBE-SVC-NPX46M4PTMTKRN6Y - [0:0]
-A KUBE-SERVICES -d 172.30.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m statistic --mode random --probability 0.33332999982 -j KUBE-SEP-KX4Q4YWDHKSSJL4E
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-EQLLAEROJ5B7CY76
-A KUBE-SVC-NPX46M4PTMTKRN6Y -j KUBE-SEP-UD5457WVRLS2I5GE
sh-4.2# iptables-save | grep 10.0.138.21
-A KUBE-SEP-KX4Q4YWDHKSSJL4E -s 10.0.138.21/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-KX4Q4YWDHKSSJL4E -p tcp -m tcp -j DNAT --to-destination 10.0.138.21:6443

So we can't necessarily trust even the log messages that do exist (and they come from the userspace proxy anyway, *not* the iptables proxy).

We really need more debug logging in the proxy, e.g. something like https://github.com/openshift/origin/pull/22785, just for a couple of days, to get more visibility into what's going on.

Comment 6 Casey Callendrello 2019-05-07 09:24:56 UTC
Taking a look at this now. I'm going to try reproducing with a custom image that has more logging.

Comment 13 Scott Dodson 2019-05-09 15:48:42 UTC
Created attachment 1566234 [details]
logs

files gathered prior to removal of master-1 and master-2
kubernetes-endpoints-pre-nuke
iptables-pre-nuke

files gathered after removal of master-1 master-2 and after etcd recovery
sdn-j4c7n-p.log
iptables-post-nuke

files gathered after `pkill openshift-sdn`
iptables-post-nuke-post-pkill
sdn-j4c7n.log

final end state after new master-1 and master-2 have been added
kubernetes-endpoints-post-recovery

Comment 14 Casey Callendrello 2019-05-09 20:35:11 UTC
Something is definitely wrong; getting closer. Kube-proxy had programmed iptables with 3 endpoints for the apiserver. Then, after a bit of informer disruption, we get:

I0509 14:19:46.573319   66896 proxy.go:252] hybrid proxy: update ep default/kubernetes in main proxy
I0509 14:19:46.573335   66896 endpoints.go:234] Setting endpoints for "default/kubernetes:https" to [10.0.143.152:6443]
I0509 14:19:46.573352   66896 endpoints.go:234] Setting endpoints for "default/kubernetes:https" to [10.0.143.152:6443]
I0509 14:19:46.573373   66896 config.go:141] Calling handler.OnEndpointsUpdate

Somehow this doesn't trigger an iptables resync!?!?

15 seconds later, we get a periodic iptables resync:
I0509 14:20:04.432419   66896 proxier.go:679] Syncing iptables rules

And kube api has 3 endpoints! 
:KUBE-SVC-NPX46M4PTMTKRN6Y - [0:0]
-A KUBE-SERVICES -m comment --comment "default/kubernetes:https cluster IP" -m tcp -p tcp -d 172.30.0.1/32 --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m statistic --mode random --probability 0.33333 -j KUBE-SEP-NTEYGCVTG6WBQS55
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m statistic --mode random --probability 0.50000 -j KUBE-SEP-BBW7QUGCULWEBL7V
-A KUBE-SVC-NPX46M4PTMTKRN6Y -j KUBE-SEP-THZTG7HLKH4H2IAI

Comment 15 Casey Callendrello 2019-05-09 20:38:07 UTC
There is a known issue in kube-proxy: if it receives two identical updates before it can sync, it will ignore them. I wonder if this is happening during the API disruption.

In fact, that would make a *lot* of sense.

Comment 16 Casey Callendrello 2019-05-13 14:57:22 UTC
I think I've figured it out. It's an upstream bug in the kube-proxy change tracker. I'll file a PR.

Comment 17 Casey Callendrello 2019-06-18 15:27:22 UTC
Update: never mind. Despite extensive testing, I don't know what the cause is. There is a workaround, so this isn't drastic.

Comment 18 Clayton Coleman 2019-09-22 22:46:36 UTC
How have we deferred this out of 4.2?  This is a pretty serious problem with our product if etcd restore and recovery don't actually fix the cluster.

I'd like a timeframe for when this gets investigated and fixed, and if necessary what work needs to be moved out in order to fix it.

Comment 19 Casey Callendrello 2019-09-23 11:52:47 UTC
The recovery scripts were updated to restart openshift-sdn, so it's not a disaster. The ultimate conclusion was that, in most disaster scenarios, we should not expect informer correctness.
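
For reference, the workaround amounts to restarting openshift-sdn after the etcd restore so kube-proxy re-lists services and endpoints from scratch; roughly (the app=sdn pod label is an assumption here):

oc -n openshift-sdn delete pods -l app=sdn     # restart the sdn/kube-proxy pods cluster-wide
# or, on a single node, as was done earlier in this bug:
pkill openshift-sdn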

Comment 20 Casey Callendrello 2019-11-04 19:05:15 UTC
Okay, it turns out there was another bug in informers that someone else found. This should probably be fixed.

https://github.com/openshift/sdn/pull/60

Comment 22 Anurag saxena 2019-11-06 02:01:57 UTC
@vadim 

<<Take down two masters (I deleted machines via machine API, make sure you pick masters which don't run machine api controller)

Is this the right way to bring down masters? Can you explain how you did it via the machine API?

oc debug node/<master>
chroot /host
cd /etc/kubernetes
mv manifests manifests.bak
systemctl restart kubelet

Comment 23 Vadim Rutkovsky 2019-11-06 09:20:39 UTC
(In reply to Anurag saxena from comment #22)
> @vadim 
> 
> <<Take down two masters (I deleted machines via machine API, make sure you
> pick masters which don't run machine api controller)
> 
> Is this the right way to bring down masters? Can you explain how you did it
> via the machine API?
> 
> oc debug node/<master>
> chroot /host
> cd /etc/kubernetes
> mv manifests manifests.bak
> systemctl restart kubelet

I don't think that would work; the machine has to be destroyed, e.g. powered off.

Comment 24 Weibin Liang 2019-12-03 18:40:11 UTC
Verifying is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1779321

Comment 25 Weibin Liang 2019-12-03 19:36:54 UTC
(In reply to Weibin Liang from comment #24)
> Verifying is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1779321

Also wait for https://bugzilla.redhat.com/show_bug.cgi?id=1773562 to be verified first

Comment 26 Weibin Liang 2019-12-03 20:18:23 UTC
(In reply to Weibin Liang from comment #25)
> (In reply to Weibin Liang from comment #24)
> > Verifying is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1779321
> 
> Also wait for https://bugzilla.redhat.com/show_bug.cgi?id=1773562 to be
> verified first

 https://github.com/openshift/machine-config-operator/pull/1287

Comment 27 Weibin Liang 2019-12-13 13:44:15 UTC
Thanks to @geliu, who helped verify this bug on 4.3.0-0.nightly-2019-12-11-204310.


etcd recreated three new endpoints after two masters were killed and two new masters were recovered.
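
For the record, a basic post-recovery check is that the default/kubernetes endpoints converge on the three new masters (and the nodes' iptables rules follow); e.g.:

oc get endpoints kubernetes -n default    # should list the three new master IPs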

Comment 29 errata-xmlrpc 2020-01-23 11:03:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

