Description of problem:
On an idle GCP OVN cluster (age 4 hrs), the ovnkube metrics pod went into CrashLoopBackOff, panic logs were observed in kube-apiserver, and kube-controller-manager degraded.

[root@localhost ~]# oc get pods --all-namespaces | grep -i "crash\|error\|not"
openshift-kube-apiserver            kube-apiserver-qe-anusaxen134-84r9h-master-1.c.openshift-qe.internal            3/5   NotReady           5    4h13m
openshift-kube-apiserver            kube-apiserver-qe-anusaxen134-84r9h-master-2.c.openshift-qe.internal            4/5   NotReady           1    4h9m
openshift-kube-controller-manager   kube-controller-manager-qe-anusaxen134-84r9h-master-2.c.openshift-qe.internal   1/4   Error              11   4h17m
openshift-ovn-kubernetes            ovnkube-node-metrics-6swb6                                                      0/1   CrashLoopBackOff   22   91m

must-gather is located at http://file.bos.redhat.com/~anusaxen/must-gather.local.5684169925652152451.tar.gz

kube-controller-manager and kube-apiserver logs:
http://file.bos.redhat.com/~anusaxen/kube_controller_manager
http://file.bos.redhat.com/~anusaxen/kube-apiserver_logs

ovnkube-node-metrics-6swb6 complains:
I0908 19:04:06.600866       1 main.go:261] Reading certificate files
I0908 19:04:06.601256       1 main.go:294] Starting TCP socket on :9103
F0908 19:04:06.653168       1 main.go:297] failed to listen on secure address: listen tcp :9103: bind: address already in use

sh-4.2# netstat -tunapl | grep 9103
tcp    0   0   127.0.0.1:29103   0.0.0.0:*         LISTEN        4808/ovnkube
tcp    0   0   127.0.0.1:29103   127.0.0.1:35252   ESTABLISHED   4808/ovnkube
tcp    0   0   127.0.0.1:35252   127.0.0.1:29103   ESTABLISHED   4469/kube-rbac-prox
tcp6   0   0   :::9103           :::*              LISTEN        4469/kube-rbac-prox   <<<<<<<<<<<<
tcp6   0   0   10.0.0.3:9103     10.0.32.3:41810   ESTABLISHED   4469/kube-rbac-prox   <<<<<<<<<<<<
tcp6   0   0   10.0.0.3:9103     10.0.32.2:55748   ESTABLISHED   4469/kube-rbac-pro    <<<<<<<<<<<<

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-09-07-224533

How reproducible:
Not always

Steps to Reproduce:
1. Bring up a GCP OVN cluster on 4.6
2.
3.

Actual results:
Cluster starts degrading after a few hours.

Expected results:
Cluster should remain healthy.

Additional info:
kube-apiserver pod excerpt:

/opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/panic.go:969 +0x166
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.withPanicRecovery.func1(0x565de40, 0xc039d22a80, 0xc0594fb900)
http2: panic serving 10.0.32.4:36746: killing connection/stream because serving request timed out and response had been started
panic(0x3fddd40, 0xc0006f56f0)
	/opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/panic.go:969 +0x166
panic(0x3fddd40, 0xc0006f56f0)
	/opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/panic.go:969 +0x166
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.withPanicRecovery.func1(0x565de40, 0xc013e351e0, 0xc0934d2b00)
panic(0x3dda300, 0x553e7a0)
	/opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/panic.go:969 +0x166
panic(0x3dda300, 0x553e7a0)
	/opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/panic.go:969 +0x166
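For context, the fatal at main.go:297 is the standard Go behavior when another process (here kube-rbac-proxy, PID 4469 in the netstat output above) already holds :9103. A minimal sketch of that failure mode, assuming nothing about the actual ovnkube metrics code beyond the logged error:

package main

import (
	"log"
	"net"
)

func main() {
	// If another process (e.g. a stale kube-rbac-proxy container) already
	// listens on :9103, net.Listen returns "bind: address already in use"
	// and the process exits, which is what produces the CrashLoopBackOff
	// seen above.
	ln, err := net.Listen("tcp", ":9103")
	if err != nil {
		log.Fatalf("failed to listen on secure address: %v", err)
	}
	defer ln.Close()
	log.Printf("listening on %s", ln.Addr())
}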
Cluster info (should be available for next 36 hrs): https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/110822/artifact/workdir/install-dir/auth/kubeconfig
The issue with the port being occupied comes from cri-o creating a new container before the previous process is dead:

Sep 09 04:50:14 qe-anusaxen134-84r9h-master-2.c.openshift-qe.internal crio[1686]: time="2020-09-09 04:50:14.242723124Z" level=info msg="Created container 17563ae07e2f797c1b1576eb802669e712a3452a7a6dec9be5319217aa85c4c7: openshift-ovn-kubernetes/ovnkube-node-metrics-6swb6/kube-rbac-proxy" id=7bd9dae5-2579-44ce-a9a6-d59b6a771532 name=/runtime.v1alpha2.RuntimeService/CreateContainer
Sep 09 04:50:14 qe-anusaxen134-84r9h-master-2.c.openshift-qe.internal crio[1686]: time="2020-09-09 04:50:14.309807658Z" level=info msg="Started container 17563ae07e2f797c1b1576eb802669e712a3452a7a6dec9be5319217aa85c4c7: openshift-ovn-kubernetes/ovnkube-node-metrics-6swb6/kube-rbac-proxy" id=62a8b0cf-d7a7-48c3-99bb-83c457de4624 name=/runtime.v1alpha2.RuntimeService/StartContainer
Sep 09 04:55:21 qe-anusaxen134-84r9h-master-2.c.openshift-qe.internal crio[1686]: time="2020-09-09 04:55:21.237750187Z" level=info msg="Created container 014952efa7e3bd7217d7360ea561ff142de330bd94d977636510d506e558c8ae: openshift-ovn-kubernetes/ovnkube-node-metrics-6swb6/kube-rbac-proxy" id=ea4c4b0b-94ad-4cdf-8b2a-add35444584e name=/runtime.v1alpha2.RuntimeService/CreateContainer
Sep 09 04:55:21 qe-anusaxen134-84r9h-master-2.c.openshift-qe.internal crio[1686]: time="2020-09-09 04:55:21.348862369Z" level=info msg="Started container 014952efa7e3bd7217d7360ea561ff142de330bd94d977636510d506e558c8ae: openshift-ovn-kubernetes/ovnkube-node-metrics-6swb6/kube-rbac-proxy" id=8aa93846-23f0-407b-87a9-a575fef92da9 name=/runtime.v1alpha2.RuntimeService/StartContainer
Sep 09 04:55:22 qe-anusaxen134-84r9h-master-2.c.openshift-qe.internal crio[1686]: time="2020-09-09 04:55:22.222697026Z" level=info msg="Removed container 17563ae07e2f797c1b1576eb802669e712a3452a7a6dec9be5319217aa85c4c7: openshift-ovn-kubernetes/ovnkube-node-metrics-6swb6/kube-rbac-proxy" id=599affb0-39d5-4cda-b273-814d94601bec name=/runtime.v1alpha2.RuntimeService/RemoveContainer   <- removed after the following container is created

This isn't really our problem: cri-o is misbehaving because the CPU is saturated to an extreme degree. kube-apiserver, vswitchd and kubelet appear to be responsible for the excessive CPU usage. I'm continuing to investigate.
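For illustration only: between 04:55:21 (new kube-rbac-proxy container started) and 04:55:22 (previous container removed) the two containers overlap, and the ovnkube metrics binary exits fatally the instant the bind fails. A hypothetical retry-with-backoff sketch (not the actual ovnkube-node-metrics code, and not a proposed fix for the cri-o ordering itself) that would ride out such a transient overlap:

package main

import (
	"log"
	"net"
	"time"
)

// listenWithRetry is a hypothetical helper: it retries the bind a few times
// so a transient "address already in use" (e.g. while cri-o is still tearing
// down the previous container) does not immediately crash the pod.
func listenWithRetry(addr string, attempts int, delay time.Duration) (net.Listener, error) {
	var err error
	for i := 0; i < attempts; i++ {
		var ln net.Listener
		ln, err = net.Listen("tcp", addr)
		if err == nil {
			return ln, nil
		}
		log.Printf("bind attempt %d/%d on %s failed: %v", i+1, attempts, addr, err)
		time.Sleep(delay)
	}
	return nil, err
}

func main() {
	ln, err := listenWithRetry(":9103", 5, 2*time.Second)
	if err != nil {
		log.Fatalf("failed to listen on secure address: %v", err)
	}
	defer ln.Close()
	log.Printf("listening on %s", ln.Addr())
}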
Let's wait for the shared gateway changes to land (they're about to merge) and then see whether the issue reproduces.
Anurag, is this still happening now (since the shared gateway change merged)?
@ben It seems like it's still happening, as per https://bugzilla.redhat.com/show_bug.cgi?id=1881113
Apparently it's the same. These are symptoms, not the actual cause. *** This bug has been marked as a duplicate of bug 1881113 ***