Description of problem: The SDN operator was not reporting Degraded status when the network was down.
The service network was not functioning; for example, the iptables rules for kubernetes.default.svc did not match the cluster state.
Some background: This was a cluster where they were testing disaster recovery. *Every single node* had a stale endpoint for the apiserver in iptables. The relevant logs look like this:

I0506 13:48:30.789227 2122 roundrobin.go:310] LoadBalancerRR: Setting endpoints for default/kubernetes:https to [10.0.153.168:6443 10.0.173.211:6443]
I0506 13:48:30.789276 2122 roundrobin.go:240] Delete endpoint 10.0.132.206:6443 for service "default/kubernetes:https"
...
I0506 13:52:56.738151 2122 roundrobin.go:310] LoadBalancerRR: Setting endpoints for default/kubernetes:https to [10.0.173.211:6443]
I0506 13:52:56.738193 2122 roundrobin.go:240] Delete endpoint 10.0.153.168:6443 for service "default/kubernetes:https"
...
W0506 13:54:12.300749 2122 reflector.go:256] k8s.io/client-go/informers/factory.go:132: watch of *v1.NetworkPolicy ended with: too old resource version: 17428 (17711)
W0506 13:54:12.347245 2122 reflector.go:256] github.com/openshift/client-go/network/informers/externalversions/factory.go:101: watch of *v1.NetNamespace ended with: too old resource version: 15580 (17728)
W0506 13:54:17.285158 2122 reflector.go:256] k8s.io/client-go/informers/factory.go:132: watch of *v1.Service ended with: too old resource version: 17142 (17711)
W0506 13:54:17.285181 2122 reflector.go:256] github.com/openshift/client-go/network/informers/externalversions/factory.go:101: watch of *v1.HostSubnet ended with: too old resource version: 17379 (17728)
E0506 13:54:32.269844 2122 reflector.go:237] k8s.io/client-go/informers/factory.go:132: Failed to watch *v1.Namespace: Get https://api-int.vrutkovs.devcluster.openshift.com:6443/api/v1/namespaces?resourceVersion=15591&timeout=5m42s&timeoutSeconds=342&watch=true: dial tcp 10.0.146.189:6443: connect: connection refused
E0506 13:54:32.270097 2122 reflector.go:237] k8s.io/client-go/informers/factory.go:132: Failed to watch *v1.Pod: Get https://api-int.vrutkovs.devcluster.openshift.com:6443/api/v1/pods?resourceVersion=17652&timeout=6m57s&timeoutSeconds=417&watch=true: dial tcp 10.0.146.189:6443: i/o timeout
E0506 13:54:32.270628 2122 reflector.go:237] github.com/openshift/client-go/network/informers/externalversions/factory.go:101: Failed to watch *v1.EgressNetworkPolicy: Get https://api-int.vrutkovs.devcluster.openshift.com:6443/apis/network.openshift.io/v1/egressnetworkpolicies?resourceVersion=14733&timeout=6m55s&timeoutSeconds=415&watch=true: dial tcp 10.0.146.189:6443: i/o timeout
E0506 13:54:32.270766 2122 reflector.go:237] k8s.io/client-go/informers/factory.go:132: Failed to watch *v1.Endpoints: Get https://api-int.vrutkovs.devcluster.openshift.com:6443/api/v1/endpoints?resourceVersion=17708&timeout=9m0s&timeoutSeconds=540&watch=true: dial tcp 10.0.146.189:6443: i/o timeout
...
I0506 14:02:03.327985 2122 roundrobin.go:310] LoadBalancerRR: Setting endpoints for default/kubernetes:https to [10.0.153.168:6443]
I0506 14:02:03.328013 2122 roundrobin.go:240] Delete endpoint 10.0.153.168:6443 for service "default/kubernetes:https"
I0506 14:02:03.328026 2122 roundrobin.go:240] Delete endpoint 10.0.173.211:6443 for service "default/kubernetes:https"

However, the service only has one endpoint in iptables, and it's 10.0.173.211:6443. Looking at the logs, every other service in iptables seems to reflect the most recent endpoint values exactly. It's just the apiserver that's wrong. Why?
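For anyone checking a node in a similar state, a rough way to compare the node's iptables programming against what the API currently reports (just a diagnostic sketch; run the iptables commands on the node itself, and note the last grep assumes only apiserver endpoints listen on 6443):

# What the API thinks the apiserver endpoints are
oc get endpoints kubernetes -n default -o wide

# What this node actually has programmed for the service
iptables-save -t nat | grep 'default/kubernetes:https'
iptables-save -t nat | grep DNAT | grep ':6443'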
Unidling is clearly active in this cluster, but is it being used? Are there any idled services? What are the metrics for the iptables proxy sync time?
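Rough checks for those questions, for whoever gathers the data (idled services carry the idling annotation; the metrics port below is an assumption and may differ for the sdn's proxy):

# Any idled services?
oc get svc --all-namespaces -o json | grep -c 'idling.alpha.openshift.io/idled-at'

# iptables proxy sync-time metrics, scraped on a node
curl -s http://127.0.0.1:10249/metrics | grep sync_proxy_rules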
Steps to Reproduce:
1. Create an AWS cluster
2. Take down two masters (I deleted machines via the machine API; make sure you pick masters which don't run the machine-api controller; rough commands below)
3. Restore the remaining master's etcd using https://github.com/hexfusion/openshift-recovery/blob/master/bin/restore.sh
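Roughly, the machine-API approach in step 2 looks like this (machine names are placeholders; check which master hosts the machine-api controllers first and avoid it):

oc -n openshift-machine-api get pods -o wide     # see which master runs the machine-api controllers
oc -n openshift-machine-api get machines
oc -n openshift-machine-api delete machine <master-machine-1> <master-machine-2>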
The cluster eventually recovered, but perhaps only after I kicked it by creating a new service and some endpoints in that service.

Note that the roundrobin.go logs aren't necessarily authoritative. E.g., after this message:

I0506 21:28:17.461949 2122 roundrobin.go:310] LoadBalancerRR: Setting endpoints for default/kubernetes:https to [10.0.138.21:6443 10.0.153.168:6443 10.0.167.122:6443]
I0506 21:28:17.461986 2122 roundrobin.go:240] Delete endpoint 10.0.138.21:6443 for service "default/kubernetes:https"

looking directly at iptables showed that 10.0.138.21 was still present:

sh-4.2# iptables-save | grep KUBE-SVC-NPX46M4PTMTKRN6Y
:KUBE-SVC-NPX46M4PTMTKRN6Y - [0:0]
-A KUBE-SERVICES -d 172.30.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m statistic --mode random --probability 0.33332999982 -j KUBE-SEP-KX4Q4YWDHKSSJL4E
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-EQLLAEROJ5B7CY76
-A KUBE-SVC-NPX46M4PTMTKRN6Y -j KUBE-SEP-UD5457WVRLS2I5GE

sh-4.2# iptables-save | grep 10.0.138.21
-A KUBE-SEP-KX4Q4YWDHKSSJL4E -s 10.0.138.21/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-KX4Q4YWDHKSSJL4E -p tcp -m tcp -j DNAT --to-destination 10.0.138.21:6443

So we can't necessarily trust even the log messages that do exist (and they're for the userspace proxy anyway, *not* the iptables proxy). We really need more debug logging in the proxy, e.g. something like https://github.com/openshift/origin/pull/22785, just for a couple of days, to get more visibility into what's going on.
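Since grepping rule by rule gets tedious, here is a small helper sketch that expands the apiserver's KUBE-SVC chain into the endpoint IPs iptables actually has programmed (it assumes the chain name shown above; run it on the node):

for sep in $(iptables-save -t nat | grep -- '-A KUBE-SVC-NPX46M4PTMTKRN6Y' | grep -o 'KUBE-SEP-[A-Z0-9]*'); do
  # print the DNAT target of each endpoint chain jumped to from the service chain
  iptables-save -t nat | grep -- "-A $sep " | grep DNAT
done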
Taking a look at this now. I'm going to try reproducing with a custom image that has more logging.
Created attachment 1566234 [details]
logs

Files gathered prior to removal of master-1 and master-2:
  kubernetes-endpoints-pre-nuke
  iptables-pre-nuke

Files gathered after removal of master-1 and master-2, and after etcd recovery:
  sdn-j4c7n-p.log
  iptables-post-nuke

Files gathered after `pkill openshift-sdn`:
  iptables-post-nuke-post-pkill
  sdn-j4c7n.log

Final end state after new master-1 and master-2 have been added:
  kubernetes-endpoints-post-recovery
Something is definitely wrong - getting closer.

Kube-proxy had programmed iptables with 3 endpoints for the apiserver. Then, after a bit of informer disruption, we get:

I0509 14:19:46.573319 66896 proxy.go:252] hybrid proxy: update ep default/kubernetes in main proxy
I0509 14:19:46.573335 66896 endpoints.go:234] Setting endpoints for "default/kubernetes:https" to [10.0.143.152:6443]
I0509 14:19:46.573352 66896 endpoints.go:234] Setting endpoints for "default/kubernetes:https" to [10.0.143.152:6443]
I0509 14:19:46.573373 66896 config.go:141] Calling handler.OnEndpointsUpdate

Somehow this doesn't trigger an iptables resync!?!?

15 seconds later, we get a periodic iptables resync:

I0509 14:20:04.432419 66896 proxier.go:679] Syncing iptables rules

And the apiserver service still has 3 endpoints in iptables!

:KUBE-SVC-NPX46M4PTMTKRN6Y - [0:0]
-A KUBE-SERVICES -m comment --comment "default/kubernetes:https cluster IP" -m tcp -p tcp -d 172.30.0.1/32 --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m statistic --mode random --probability 0.33333 -j KUBE-SEP-NTEYGCVTG6WBQS55
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m statistic --mode random --probability 0.50000 -j KUBE-SEP-BBW7QUGCULWEBL7V
-A KUBE-SVC-NPX46M4PTMTKRN6Y -j KUBE-SEP-THZTG7HLKH4H2IAI
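One way to check whether an endpoints update was actually followed by a sync is to correlate the two kinds of messages in the sdn pod log (pod name is a placeholder; the message strings are taken from the snippets above):

oc -n openshift-sdn logs <sdn-pod> | grep -E 'Calling handler.OnEndpointsUpdate|Syncing iptables rules|Setting endpoints for "default/kubernetes'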
There is a known issue in kube-proxy where if it receives two identical updates before it can sync, it will actually ignore them. I wonder if, during the api disruption, this is happening. In fact, that would make a *lot* of sense.
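If that theory is right, the duplicates should show up as back-to-back identical "Setting endpoints" lines in the proxy log. A quick check (pod name is a placeholder; the sed just strips the timestamp prefix so identical messages compare equal):

oc -n openshift-sdn logs <sdn-pod> \
  | grep 'Setting endpoints for "default/kubernetes:https"' \
  | sed 's/.*] //' \
  | uniq -c

Any count greater than 1 means the proxy received the same endpoint set twice in a row, which is exactly the pattern the theory needs.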
I think I've figured it out. It's an upstream bug in the kube-proxy change tracker. I'll file a PR.
Update: Never mind; despite extensive testing, I still don't know what the cause is. There is a workaround, so this isn't critical.
How have we deferred this out of 4.2? This is a pretty serious problem with our product if etcd restore and recovery don't actually fix the cluster. I'd like a timeframe for when this gets investigated and fixed, and if necessary what work needs to be moved out in order to fix it.
The recovery scripts were updated to restart openshift-sdn, so it's not a disaster. The ultimate conclusion was that, given most disaster scenarios, we should not expect informer correctness.
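For anyone restoring with the older scripts, the manual equivalent of that update is just to bounce the SDN pods after the etcd restore (namespace and label as on a stock 4.x install; adjust if yours differ):

oc -n openshift-sdn delete pods -l app=sdn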
Okay, it turns out there was another bug in informers that someone else found. This should probably be fixed. https://github.com/openshift/sdn/pull/60
@vadim

<< Take down two masters (I deleted machines via machine API, make sure you pick masters which don't run machine api controller)

Is this the right way to bring down masters? Can you explain how you did it via the machine API?

oc debug node/<master>
chroot /host
cd /etc/kubernetes
mv manifests manifests.bak
systemctl restart kubelet
(In reply to Anurag saxena from comment #22)
> @vadim
>
> << Take down two masters (I deleted machines via machine API, make sure you
> pick masters which don't run machine api controller)
>
> Is this the right way to bring down masters? Can you explain how you did it
> via the machine API?
>
> oc debug node/<master>
> chroot /host
> cd /etc/kubernetes
> mv manifests manifests.bak
> systemctl restart kubelet

I don't think that would work - the machine has to be destroyed, e.g. powered off.
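For completeness: "destroyed" here can mean either deleting the Machine objects as in the reproduction steps above, or powering the instances off at the cloud provider, e.g. something along these lines on AWS (instance IDs are placeholders):

aws ec2 stop-instances --instance-ids <master-1-instance-id> <master-2-instance-id>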
Verifying is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1779321
(In reply to Weibin Liang from comment #24)
> Verifying is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1779321

Also wait for https://bugzilla.redhat.com/show_bug.cgi?id=1773562 to be verified first.
(In reply to Weibin Liang from comment #25)
> (In reply to Weibin Liang from comment #24)
> > Verifying is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1779321
>
> Also wait for https://bugzilla.redhat.com/show_bug.cgi?id=1773562 to be
> verified first

https://github.com/openshift/machine-config-operator/pull/1287
Thanks @geliu, who helped verify this bug on 4.3.0-0.nightly-2019-12-11-204310. etcd recreated three new endpoints after killing two masters and recovering two new masters.
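For reference, one way to confirm the end state is to check that the endpoints object and a node's iptables programming agree again (node name is a placeholder; chain name as earlier in this bug):

oc get endpoints kubernetes -n default
oc debug node/<any-master> -- chroot /host iptables-save -t nat | grep KUBE-SVC-NPX46M4PTMTKRN6Y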
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062