Description of problem:

OCP nodes go into a NotReady state some hours after the deployment of the cluster. The environment is full IPv6 on bare metal; these are the log traces from an affected node.

Kubelet log from the affected node:
```
Aug 27 09:47:18 openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com hyperkube[2595]: E0827 09:47:18.442090 2595 kubelet_node_status.go:402] Error updating node status, will retry: error getting node "openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com": an error on the server ("") has prevented the request from succeeding (get nodes openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com)
Aug 27 09:47:18 openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com hyperkube[2595]: E0827 09:47:18.528281 2595 webhook.go:111] Failed to make webhook authenticator request: Post https://api-int.mgmt-spoke2.e2e.bos.redhat.com:6443/apis/authentication.k8s.io/v1/tokenreviews: write tcp [2620:52:0:1304::3]:50606->[2620:52:0:1304::3]:6443: use of closed network connection
Aug 27 09:47:18 openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com hyperkube[2595]: E0827 09:47:18.528356 2595 server.go:253] Unable to authenticate the request due to an error: Post https://api-int.mgmt-spoke2.e2e.bos.redhat.com:6443/apis/authentication.k8s.io/v1/tokenreviews: write tcp [2620:52:0:1304::3]:50606->[2620:52:0:1304::3]:6443: use of closed network connection
Aug 27 09:47:18 openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com hyperkube[2595]: I0827 09:47:18.528456 2595 httplog.go:90] verb="GET" URI="/metrics" latency=345.237µs resp=401 UserAgent="Prometheus/2.15.2" srcIP="[fd99::2]:42934":
Aug 27 09:47:18 openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com hyperkube[2595]: github.com/openshift/origin/vendor/github.com/emicklei/go-restful.(*Response).WriteErrorString(0xc00386dd40, 0x191, 0x439079a, 0xc, 0xc00571f7f8, 0x1)
Aug 27 09:47:18 openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com hyperkube[2595]: E0827 09:47:18.877840 2595 reflector.go:178] object-"openshift-apiserver"/"config": Failed to list *v1.ConfigMap: an error on the server ("") has prevented the request from succeeding (get configmaps)
Aug 27 09:47:19 openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com hyperkube[2595]: I0827 09:47:19.020869 2595 prober.go:166] Exec-Probe Pod: ovnkube-master-g4n6r, Container: sbdb, Command: [/bin/bash -c set -xe
Aug 27 09:47:19 openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com hyperkube[2595]: exec /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound 2>/dev/null | grep ${K8S_NODE_IP} | grep -v Address -q
Aug 27 09:47:19 openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com hyperkube[2595]: E0827 09:47:19.068332 2595 reflector.go:178] object-"openshift-insights"/"openshift-insights-serving-cert": Failed to list *v1.Secret: an error on the server ("") has prevented the request from succeeding (get secrets)
Aug 27 09:47:19 openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com hyperkube[2595]: I0827 09:47:19.110894 2595 exec.go:60] Exec probe response: "+ grep 2620:52:0:1304::7\n+ exec /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound\n+ grep -v Address -q\n"
```

Context:
- OCP Version: 4.5.z (tested in all versions; fully reproducible)
- IP Stack: Full IPv6
- Installation Method: Bare Metal IPI

How reproducible:
- Deploy 4.5.z on bare metal with the full IPv6 stack (we have an environment with 4.5.5 deployed if you're interested in reproducing it).
- If there are NotReady nodes at first, check the kubelet logs on that node and look for the error above (a quick check is sketched after the logs at the end of this description).
- If no node is in NotReady yet, wait some time; it will eventually happen.

Actual results:
- Nodes in NotReady state make the cluster unstable.

Expected results:
- All nodes in Ready state.

Additional info:
- There are other errors that may be related:
```
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: E0827 11:12:33.530877 420869 server.go:253] Unable to authenticate the request due to an error: Post https://api-int.mgmt-hub.e2e.bos.redhat.com:6443/apis/authentication.k8s.io/v1/tokenreviews: write tcp [2620:52:0:1302::5]:52114->[2620:52:0:1302::3]:6443: use of closed network connection
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: I0827 11:12:33.530952 420869 httplog.go:90] verb="GET" URI="/metrics" latency=329.398µs resp=401 UserAgent="Prometheus/2.15.2" srcIP="[2620:52:0:1302::6]:55162":
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: goroutine 36777 [running]:
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).recordStatus(0xc003990620, 0x191)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: /builddir/build/BUILD/openshift-git-0.a5dde9c/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:225 +0xc8
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader(0xc003990620, 0x191)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: /builddir/build/BUILD/openshift-git-0.a5dde9c/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:204 +0x35
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: github.com/openshift/origin/vendor/github.com/emicklei/go-restful.(*Response).WriteHeader(...)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: /builddir/build/BUILD/openshift-git-0.a5dde9c/_output/local/go/src/github.com/openshift/origin/vendor/github.com/emicklei/go-restful/response.go:201
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: github.com/openshift/origin/vendor/github.com/emicklei/go-restful.(*Response).WriteErrorString(0xc006ac9920, 0x191, 0x439079a, 0xc, 0xc001fc97f8, 0x1)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: /builddir/build/BUILD/openshift-git-0.a5dde9c/_output/local/go/src/github.com/openshift/origin/vendor/github.com/emicklei/go-restful/response.go:181 +0x51
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/server.(*Server).InstallAuthFilter.func1(0xc0055827e0, 0xc006ac9920, 0xc0055826c0)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: /builddir/build/BUILD/openshift-git-0.a5dde9c/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/server/server.go:254 +0x122
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: github.com/openshift/origin/vendor/github.com/emicklei/go-restful.(*FilterChain).ProcessFilter(0xc0055826c0, 0xc0055827e0, 0xc006ac9920)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: /builddir/build/BUILD/openshift-git-0.a5dde9c/_output/local/go/src/github.com/openshift/origin/vendor/github.com/emicklei/go-restful/filter.go:19 +0x65
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: github.com/openshift/origin/vendor/github.com/emicklei/go-restful.(*Container).HandleWithFilter.func1(0x4ba22a0, 0xc003990620, 0xc001fcf900)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: /builddir/build/BUILD/openshift-git-0.a5dde9c/_output/local/go/src/github.com/openshift/origin/vendor/github.com/emicklei/go-restful/container.go:313 +0x268
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: net/http.HandlerFunc.ServeHTTP(0xc0008dcb40, 0x4ba22a0, 0xc003990620, 0xc001fcf900)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: /usr/lib/golang/src/net/http/server.go:2007 +0x44
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: net/http.(*ServeMux).ServeHTTP(0xc000f50000, 0x4ba22a0, 0xc003990620, 0xc001fcf900)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: /usr/lib/golang/src/net/http/server.go:2387 +0x1bd
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: github.com/openshift/origin/vendor/github.com/emicklei/go-restful.(*Container).ServeHTTP(0xc0009e2480, 0x4ba22a0, 0xc003990620, 0xc001fcf900)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: /builddir/build/BUILD/openshift-git-0.a5dde9c/_output/local/go/src/github.com/openshift/origin/vendor/github.com/emicklei/go-restful/container.go:292 +0x4d
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog.WithLogging.func1(0x4ba6160, 0xc00176da40, 0xc001fcf800)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: /builddir/build/BUILD/openshift-git-0.a5dde9c/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:89 +0x2ca
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: net/http.HandlerFunc.ServeHTTP(0xc003440340, 0x4ba6160, 0xc00176da40, 0xc001fcf800)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: /usr/lib/golang/src/net/http/server.go:2007 +0x44
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/server.(*Server).ServeHTTP(0xc00015a190, 0x4ba6160, 0xc00176da40, 0xc001fcf800)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: /builddir/build/BUILD/openshift-git-0.a5dde9c/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/server/server.go:923 +0x60d
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: net/http.serverHandler.ServeHTTP(0xc000dacb60, 0x4ba6160, 0xc00176da40, 0xc001fcf800)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: /usr/lib/golang/src/net/http/server.go:2802 +0xa4
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: net/http.(*conn).serve(0xc000744dc0, 0x4bd6a60, 0xc003f39b40)
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: /usr/lib/golang/src/net/http/server.go:1890 +0x875
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: created by net/http.(*Server).Serve
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: /usr/lib/golang/src/net/http/server.go:2927 +0x38e
Aug 27 11:12:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: logging error output: "Unauthorized"
```
```
Aug 27 12:14:32 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: github.com/openshift/origin/vendor/github.com/emicklei/go-restful.(*Response).WriteErrorString(0xc002edb860, 0x191, 0x439079a, 0xc, 0xc0032777f8, 0x1)
Aug 27 12:14:32 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: E0827 12:14:32.952531 420869 reflector.go:178] object-"openshift-machine-config-operator"/"machine-config-operator-images": Failed to list *v1.ConfigMap: an error on the server ("") has prevented the request from succeeding (get configmaps)
Aug 27 12:14:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: I0827 12:14:33.166040 420869 prober.go:166] Exec-Probe Pod: ovs-node-dnj52, Container: ovs-daemons, Command: [/usr/share/openvswitch/scripts/ovs-ctl status]
Aug 27 12:14:33 openshift-master-0.mgmt-hub.e2e.bos.redhat.com hyperkube[420869]: E0827 12:14:33.235487 420869 reflector.go:178] object-"openshift-ovn-kubernetes"/"ovn-ca": Failed to list *v1.ConfigMap: an error on the server ("") has prevented the request from succeeding (get configmaps)
```
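To make the check in the reproduction steps concrete, this is a hedged sketch of the commands we use to confirm the symptom; the node hostname is just the one from the logs above, and the commands are ordinary `oc`/`journalctl` usage rather than anything specific to this bug:
```
# List nodes and look for any stuck in NotReady
oc get nodes

# On a suspect node, count occurrences of the characteristic kubelet error
ssh core@openshift-master-2.mgmt-spoke2.e2e.bos.redhat.com \
  "journalctl -u kubelet --no-pager | grep -c 'use of closed network connection'"
```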
I forgot to mention: if you restart the kubelet, the node comes alive again (from the OCP perspective) and starts reporting its status, but it eventually goes NotReady again.
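For completeness, a minimal sketch of that temporary workaround (standard systemd and `oc` commands; run the restart on whichever node is NotReady):
```
# On the affected node (e.g. via SSH as the core user), restart the kubelet
sudo systemctl restart kubelet

# From a host with cluster access, watch the node return to Ready
oc get nodes -w
```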
This looks like a golang issue
https://github.com/kubernetes/kubernetes/issues/87615
https://github.com/golang/go/issues/40201
(In reply to Seth Jennings from comment #2)
> This looks like a golang issue
> https://github.com/kubernetes/kubernetes/issues/87615
> https://github.com/golang/go/issues/40201

Thanks. I checked the kernel log on a host that was seeing this issue and observed a lot of this in the log, which seems to support this being the same issue as the one you linked.

[84057.832697] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[84057.861232] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[84057.988240] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[84058.010442] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[84129.485448] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[84129.526182] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[84219.127096] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[84219.156378] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[84350.926777] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[84350.963246] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[106613.643101] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[106613.677715] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[106614.409979] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[106614.430376] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[106614.783140] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[106614.824764] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[106615.445239] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[106615.473290] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[106793.931631] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[106793.968889] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
(In reply to Russell Bryant from comment #3)
> (In reply to Seth Jennings from comment #2)
> > This looks like a golang issue
> > https://github.com/kubernetes/kubernetes/issues/87615
> > https://github.com/golang/go/issues/40201
>
> Thanks. I checked the kernel log on a host that was seeing this issue and
> observed a lot of this in the log, which seems to support this being the
> same issue as the one you linked.
>
> [84057.832697] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [84057.861232] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [84057.988240] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [84058.010442] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [84129.485448] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [84129.526182] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [84219.127096] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [84219.156378] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [84350.926777] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [84350.963246] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [106613.643101] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [106613.677715] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [106614.409979] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [106614.430376] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [106614.783140] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [106614.824764] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [106615.445239] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [106615.473290] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [106793.931631] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [106793.968889] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

It turns out that's not actually the network interface in use. The beginning of the "use of closed network connection" errors is no longer in the journal, so I can't pinpoint what happened to trigger them. Hopefully we can catch more logs around the time this starts next time, since this seems easy to reproduce in this lab for some reason.
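A hedged sketch of how we might correlate the two the next time it reproduces, assuming journal retention still covers the window in which the errors start; note that eth0 is taken from the kernel excerpt above and may not be the interface actually carrying the API traffic:
```
# First occurrence of the kubelet-side error, if still in the journal
journalctl -u kubelet --no-pager | grep -m1 'use of closed network connection'

# Kernel-side interface events, to compare timestamps against the above
journalctl -k --no-pager | grep 'ADDRCONF(NETDEV'

# Confirm which interface actually holds the IPv6 address used for the API connection
ip -6 addr show
```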
Created attachment 1712946 [details]
Hub Kubelet Master-0

Adding a log file from an affected node where we restarted the kubelet and it failed again (look for "Unauthorized").
I'm setting this to a high priority / severity since when this occurs, a Node is in a very broken state until kubelet gets restarted.
If necessary, the environment can be redeployed to reproduce the state and help fix the bug.
As you know, keeping this environment in its current state (with the error reproduced and so on) blocks us in some ways, so we can keep it as it is for ~8 more business days; after that we will repurpose the environment for other work. Getting it back into a state that reproduces the issue would cost us ~3-5 business days, just to let you know ;)
*** Bug 1872734 has been marked as a duplicate of this bug. ***
*** Bug 1875236 has been marked as a duplicate of this bug. ***
Russell Bryant works on/with the bare-metal team, where NIC resets are more likely to occur than on other platforms. He should be able to recreate the issue.
For testing whether the upstream fix addresses this issue: https://github.com/openshift/kubernetes/pull/398
*** Bug 1873949 has been marked as a duplicate of this bug. ***
*** Bug 1893511 has been marked as a duplicate of this bug. ***
*** Bug 1899109 has been marked as a duplicate of this bug. ***
Hello,

This bug is to get the fix backported to OCP 4.5. Could you please confirm the fix release date for OCP 4.5?

Thanks in advance.
The BZ for OCP 4.5 is [1], and it's already fixed in OCP 4.5.27.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1907938
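To double-check that a given cluster already carries the backport, comparing the reported cluster version against 4.5.27 should be enough; a hedged example using the standard oc client:
```
# Print the version the cluster is running (should be 4.5.27 or later for the 4.5 fix)
oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'
```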
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
*** Bug 1934422 has been marked as a duplicate of this bug. ***
*** Bug 1953678 has been marked as a duplicate of this bug. ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days