Bug 1873949 - Nodes go into NotReady state (VMware)
Summary: Nodes go into NotReady state (VMware)
Keywords:
Status: CLOSED DUPLICATE of bug 1873114
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.5
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Peter Hunt
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-31 07:14 UTC by Juan Manuel Parrilla Madrid
Modified: 2024-03-25 16:23 UTC
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1873114
Environment:
Last Closed: 2020-11-17 15:27:00 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Juan Manuel Parrilla Madrid 2020-08-31 07:14:52 UTC
This is a clone of Bug #1873114, opened so the Node team is aware of the issue.

Description of problem:

OCP nodes go into NotReady state a number of hours after the cluster is deployed. The environment is full IPv6 on bare metal; below are the log traces from an affected node.

Kubelet log from the affected node:


```
Aug 27 09:47:18 openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com hyperkube[2595]: E0827 09:47:18.442090    2595 kubelet_node_status.go:402] Error updating node status, will retry: error getting node "openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com": an error on the server ("") has prevented the request from succeeding (get nodes openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com)
Aug 27 09:47:18 openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com hyperkube[2595]: E0827 09:47:18.528281    2595 webhook.go:111] Failed to make webhook authenticator request: Post https://api-int.mgmt-spoke2.e2e.bos.redhat.com:6443/apis/authentication.k8s.io/v1/tokenreviews: write tcp [2620:52:0:1304::3]:50606->[2620:52:0:1304::3]:6443: use of closed network connection
Aug 27 09:47:18 openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com hyperkube[2595]: E0827 09:47:18.528356    2595 server.go:253] Unable to authenticate the request due to an error: Post https://api-int.mgmt-spoke2.e2e.bos.redhat.com:6443/apis/authentication.k8s.io/v1/tokenreviews: write tcp [2620:52:0:1304::3]:50606->[2620:52:0:1304::3]:6443: use of closed network connection
Aug 27 09:47:18 openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com hyperkube[2595]: I0827 09:47:18.528456    2595 httplog.go:90] verb="GET" URI="/metrics" latency=345.237µs resp=401 UserAgent="Prometheus/2.15.2" srcIP="[fd99::2]:42934":
Aug 27 09:47:18 openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com hyperkube[2595]: github.com/openshift/origin/vendor/github.com/emicklei/go-restful.(*Response).WriteErrorString(0xc00386dd40, 0x191, 0x439079a, 0xc, 0xc00571f7f8, 0x1)
Aug 27 09:47:18 openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com hyperkube[2595]: E0827 09:47:18.877840    2595 reflector.go:178] object-"openshift-apiserver"/"config": Failed to list *v1.ConfigMap: an error on the server ("") has prevented the request from succeeding (get configmaps)
Aug 27 09:47:19 openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com hyperkube[2595]: I0827 09:47:19.020869    2595 prober.go:166] Exec-Probe Pod: ovnkube-master-g4n6r, Container: sbdb, Command: [/bin/bash -c set -xe
Aug 27 09:47:19 openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com hyperkube[2595]: exec /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound  2>/dev/null | grep ${K8S_NODE_IP} | grep -v Address -q
Aug 27 09:47:19 openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com hyperkube[2595]: E0827 09:47:19.068332    2595 reflector.go:178] object-"openshift-insights"/"openshift-insights-serving-cert": Failed to list *v1.Secret: an error on the server ("") has prevented the request from succeeding (get secrets)
Aug 27 09:47:19 openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com hyperkube[2595]: I0827 09:47:19.110894    2595 exec.go:60] Exec probe response: "+ grep 2620:52:0:1304::7\n+ exec /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound\n+ grep -v Address -q\n"
```

Context:

- OCP Version: 4.5.z (tested on all versions and fully reproducible)
- IP Stack: Full IPv6
- Installation Method: Bare Metal IPI


How reproducible:

- Deploy 4.5.z on bare metal with the IPv6 stack (we have an environment with 4.5.5 deployed if you are interested in reproducing it)
- If any node is already NotReady, check the kubelet logs on that node and look for the error.
- If no node is NotReady yet, wait some time; it eventually happens (a couple of commands for these checks are sketched below).
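
A minimal sketch of those checks, assuming the standard `oc` client and that the kubelet runs as the `kubelet` systemd unit (the node name is just the one from the logs above):

```
# Look for nodes in NotReady state
oc get nodes

# Pull the kubelet journal from a suspect node and look for the signature error
oc adm node-logs openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com -u kubelet \
  | grep -m 5 'use of closed network connection'
```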

Actual results:

- Nodes in NotReady state make the cluster unstable

Expected results:

- All nodes in Ready state


Additional info:

- There are other errors that may be related:

```
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: E0827 11:12:33.530877  420869 server.go:253] Unable to authenticate the request due to an error: Post https://api-int.mgmt-hub.e2e.bos.redhat.com:6443/apis/authentication.k8s.io/v1/tokenreviews: write tcp [2620:52:0:1302::5]:52114->[2620:52:0:1302::3]:6443: use of closed network connection
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: I0827 11:12:33.530952  420869 httplog.go:90] verb="GET" URI="/metrics" latency=329.398µs resp=401 UserAgent="Prometheus/2.15.2" srcIP="[2620:52:0:1302::6]:55162":
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: goroutine 36777 [running]:
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).recordStatus(0xc003990620, 0x191)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]:         /builddir/build/BUILD/openshift-git-0.a5dde9c/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:225 +0xc8
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader(0xc003990620, 0x191)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]:         /builddir/build/BUILD/openshift-git-0.a5dde9c/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:204 +0x35
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: github.com/openshift/origin/vendor/github.com/emicklei/go-restful.(*Response).WriteHeader(...)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]:         /builddir/build/BUILD/openshift-git-0.a5dde9c/_output/local/go/src/github.com/openshift/origin/vendor/github.com/emicklei/go-restful/response.go:201
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: github.com/openshift/origin/vendor/github.com/emicklei/go-restful.(*Response).WriteErrorString(0xc006ac9920, 0x191, 0x439079a, 0xc, 0xc001fc97f8, 0x1)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]:         /builddir/build/BUILD/openshift-git-0.a5dde9c/_output/local/go/src/github.com/openshift/origin/vendor/github.com/emicklei/go-restful/response.go:181 +0x51
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/server.(*Server).InstallAuthFilter.func1(0xc0055827e0, 0xc006ac9920, 0xc0055826c0)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]:         /builddir/build/BUILD/openshift-git-0.a5dde9c/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/server/server.go:254 +0x122
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: github.com/openshift/origin/vendor/github.com/emicklei/go-restful.(*FilterChain).ProcessFilter(0xc0055826c0, 0xc0055827e0, 0xc006ac9920)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]:         /builddir/build/BUILD/openshift-git-0.a5dde9c/_output/local/go/src/github.com/openshift/origin/vendor/github.com/emicklei/go-restful/filter.go:19 +0x65
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: github.com/openshift/origin/vendor/github.com/emicklei/go-restful.(*Container).HandleWithFilter.func1(0x4ba22a0, 0xc003990620, 0xc001fcf900)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]:         /builddir/build/BUILD/openshift-git-0.a5dde9c/_output/local/go/src/github.com/openshift/origin/vendor/github.com/emicklei/go-restful/container.go:313 +0x268
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: net/http.HandlerFunc.ServeHTTP(0xc0008dcb40, 0x4ba22a0, 0xc003990620, 0xc001fcf900)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]:         /usr/lib/golang/src/net/http/server.go:2007 +0x44
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: net/http.(*ServeMux).ServeHTTP(0xc000f50000, 0x4ba22a0, 0xc003990620, 0xc001fcf900)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]:         /usr/lib/golang/src/net/http/server.go:2387 +0x1bd
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: github.com/openshift/origin/vendor/github.com/emicklei/go-restful.(*Container).ServeHTTP(0xc0009e2480, 0x4ba22a0, 0xc003990620, 0xc001fcf900)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]:         /builddir/build/BUILD/openshift-git-0.a5dde9c/_output/local/go/src/github.com/openshift/origin/vendor/github.com/emicklei/go-restful/container.go:292 +0x4d
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog.WithLogging.func1(0x4ba6160, 0xc00176da40, 0xc001fcf800)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]:         /builddir/build/BUILD/openshift-git-0.a5dde9c/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:89 +0x2ca
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: net/http.HandlerFunc.ServeHTTP(0xc003440340, 0x4ba6160, 0xc00176da40, 0xc001fcf800)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]:         /usr/lib/golang/src/net/http/server.go:2007 +0x44
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/server.(*Server).ServeHTTP(0xc00015a190, 0x4ba6160, 0xc00176da40, 0xc001fcf800)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]:         /builddir/build/BUILD/openshift-git-0.a5dde9c/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/server/server.go:923 +0x60d
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: net/http.serverHandler.ServeHTTP(0xc000dacb60, 0x4ba6160, 0xc00176da40, 0xc001fcf800)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]:         /usr/lib/golang/src/net/http/server.go:2802 +0xa4
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: net/http.(*conn).serve(0xc000744dc0, 0x4bd6a60, 0xc003f39b40)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]:         /usr/lib/golang/src/net/http/server.go:1890 +0x875
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: created by net/http.(*Server).Serve
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]:         /usr/lib/golang/src/net/http/server.go:2927 +0x38e
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: logging error output: "Unauthorized"
```

```
Aug 27 12:14:32 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: github.com/openshift/origin/vendor/github.com/emicklei/go-restful.(*Response).WriteErrorString(0xc002edb860, 0x191, 0x439079a, 0xc, 0xc0032777f8, 0x1)
Aug 27 12:14:32 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: E0827 12:14:32.952531  420869 reflector.go:178] object-"openshift-machine-config-operator"/"machine-config-operator-images": Failed to list *v1.ConfigMap: an error on the server ("") has prevented the request from succeeding (get configmaps)
Aug 27 12:14:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: I0827 12:14:33.166040  420869 prober.go:166] Exec-Probe Pod: ovs-node-dnj52, Container: ovs-daemons, Command: [/usr/share/openvswitch/scripts/ovs-ctl status]
Aug 27 12:14:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: E0827 12:14:33.235487  420869 reflector.go:178] object-"openshift-ovn-kubernetes"/"ovn-ca": Failed to list *v1.ConfigMap: an error on the server ("") has prevented the request from succeeding (get configmaps)
```

--- Additional comment from Juan Manuel Parrilla Madrid on 2020-08-27 12:24:02 UTC ---

I forgot to mention: if you restart the kubelet, the node comes alive again (from the OCP perspective) and starts reporting status, but it eventually goes NotReady again.
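
For reference, a rough sketch of that workaround, assuming access to the node via `oc debug` (or SSH):

```
# Open a shell on the affected node and restart the kubelet
oc debug node/openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com
chroot /host
systemctl restart kubelet

# From the cluster side, watch the node report Ready again
oc get nodes -w
```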

--- Additional comment from Seth Jennings on 2020-08-27 14:48:35 UTC ---

This looks like a golang issue
https://github.com/kubernetes/kubernetes/issues/87615
https://github.com/golang/go/issues/40201
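
A hedged sketch of confirming the failure signature from the linked issues and checking which Go release the kubelet binary was built with; the `/usr/bin/hyperkube` path is an assumption taken from the process name in the logs:

```
# Count occurrences of the signature error (run on an affected node)
journalctl -u kubelet | grep -c 'use of closed network connection'

# Extract the Go version string embedded in the kubelet binary (path assumed)
grep -aoE 'go1\.[0-9]+(\.[0-9]+)?' /usr/bin/hyperkube | head -1
```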

--- Additional comment from Russell Bryant on 2020-08-27 18:50:01 UTC ---

(In reply to Seth Jennings from comment #2)
> This looks like a golang issue
> https://github.com/kubernetes/kubernetes/issues/87615
> https://github.com/golang/go/issues/40201

Thanks.  I checked the kernel log on a host that was seeing this issue and observed a lot of this in the log, which seems to support this being the same issue as the one you linked.

[84057.832697] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[84057.861232] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[84057.988240] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[84058.010442] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[84129.485448] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[84129.526182] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[84219.127096] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[84219.156378] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[84350.926777] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[84350.963246] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[106613.643101] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[106613.677715] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[106614.409979] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[106614.430376] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[106614.783140] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[106614.824764] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[106615.445239] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[106615.473290] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[106793.931631] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[106793.968889] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
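
The timestamps above are seconds since boot; to correlate the interface flaps with the kubelet errors, wall-clock timestamps are easier. A small sketch (run on the node), assuming the kernel messages are still in journald:

```
# Kernel ADDRCONF messages with wall-clock timestamps
journalctl -k --no-pager | grep ADDRCONF

# Alternatively, ask dmesg for human-readable times
dmesg -T | grep ADDRCONF
```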

--- Additional comment from Russell Bryant on 2020-08-27 20:52:18 UTC ---

(In reply to Russell Bryant from comment #3)
> (In reply to Seth Jennings from comment #2)
> > This looks like a golang issue
> > https://github.com/kubernetes/kubernetes/issues/87615
> > https://github.com/golang/go/issues/40201
> 
> Thanks.  I checked the kernel log on a host that was seeing this issue and
> observed a lot of this in the log, which seems to support this being the
> same issue as the one you linked.
> 
> [84057.832697] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [84057.861232] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [84057.988240] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [84058.010442] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [84129.485448] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [84129.526182] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [84219.127096] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [84219.156378] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [84350.926777] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [84350.963246] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [106613.643101] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [106613.677715] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [106614.409979] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [106614.430376] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [106614.783140] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [106614.824764] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [106615.445239] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [106615.473290] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [106793.931631] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [106793.968889] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

but it turns out that is not actually the network interface in use.  The beginning of the "use of closed network connection" errors is no longer in the journal, so I don't think I can pinpoint what happened to trigger it.  Hopefully we can catch more logs around the time this starts next time, since this seems easy to reproduce in this lab for some reason.
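
One way to avoid losing the start of the errors next time is to give journald more room on the masters before reproducing. A debug-only sketch, applied directly on the node rather than through a MachineConfig:

```
# Raise the journal size cap so older kubelet messages are not rotated out
mkdir -p /etc/systemd/journald.conf.d
cat <<'EOF' > /etc/systemd/journald.conf.d/90-debug-retention.conf
[Journal]
SystemMaxUse=2G
EOF
systemctl restart systemd-journald

# Later, find the very first occurrence of the error
journalctl -u kubelet | grep -m 1 'use of closed network connection'
```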

--- Additional comment from Juan Manuel Parrilla Madrid on 2020-08-28 12:17:28 UTC ---

Adding a log file from an affected node where we restarted the kubelet and it failed again (look for "Unauthorized").

--- Additional comment from Russell Bryant on 2020-08-28 12:50:14 UTC ---

I'm setting this to high priority / severity since, when this occurs, a Node is in a very broken state until the kubelet gets restarted.

Comment 1 Juan Manuel Parrilla Madrid 2020-08-31 15:41:15 UTC
As you know, keeping this environment in its current state (with the error reproduced and so on) blocks us in some ways, so we can keep it as it is for ~8 more business days; after that date we will repurpose the environment for other work. Getting it back into a state where the issue can be reproduced would cost us ~3-5 business days, just to let you know ;)

Comment 2 Seth Jennings 2020-08-31 17:15:24 UTC
Why did this get cloned back to OCP Node from RHEL golang?  Are you asking that the kubelet work around golang?

Comment 3 Seth Jennings 2020-08-31 20:27:30 UTC
There is no need to keep the environment running for this bug.  We know the cause, and it is unlikely to be limited to just the kubelet.  I imagine any pod running with hostNetwork also has this issue (SDN, MCD, etc); a way to list those pods is sketched below.
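
A hedged way to enumerate the pods that could be affected, using a jsonpath filter (output columns are namespace and pod name):

```
# List all pods running with hostNetwork: true
oc get pods --all-namespaces \
  -o jsonpath='{range .items[?(@.spec.hostNetwork==true)]}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}'
```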

This bug is not actionable.  Deferring to 4.7 for QE purposes, until we start using a golang release that contains the fix.

Comment 4 Seth Jennings 2020-09-11 16:00:14 UTC
Reducing to medium since this is a tracking bug for OCP.  golang has to pull in the fix.

Comment 14 Red Hat Bugzilla 2023-09-14 06:08:00 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

