Bug 1877100
| Summary: | [gcp] ovnkube node metrics pod crashing and kube-apiserver is panicking on longevity clusters | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Anurag saxena <anusaxen> |
| Component: | Networking | Assignee: | Juan Luis de Sousa-Valadas <jdesousa> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | aconstan, bbennett, huirwang, rbrattai, weliang, zzhao |
| Version: | 4.6 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-09-22 11:07:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Anurag saxena, 2020-09-08 20:10:33 UTC
Cluster info (should be available for the next 36 hrs): https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/110822/artifact/workdir/install-dir/auth/kubeconfig

The issue with the port being occupied comes from CRI-O creating containers without the previous process being dead:

```
Sep 09 04:50:14 qe-anusaxen134-84r9h-master-2.c.openshift-qe.internal crio[1686]: time="2020-09-09 04:50:14.242723124Z" level=info msg="Created container 17563ae07e2f797c1b1576eb802669e712a3452a7a6dec9be5319217aa85c4c7: openshift-ovn-kubernetes/ovnkube-node-metrics-6swb6/kube-rbac-proxy" id=7bd9dae5-2579-44ce-a9a6-d59b6a771532 name=/runtime.v1alpha2.RuntimeService/CreateContainer
Sep 09 04:50:14 qe-anusaxen134-84r9h-master-2.c.openshift-qe.internal crio[1686]: time="2020-09-09 04:50:14.309807658Z" level=info msg="Started container 17563ae07e2f797c1b1576eb802669e712a3452a7a6dec9be5319217aa85c4c7: openshift-ovn-kubernetes/ovnkube-node-metrics-6swb6/kube-rbac-proxy" id=62a8b0cf-d7a7-48c3-99bb-83c457de4624 name=/runtime.v1alpha2.RuntimeService/StartContainer
Sep 09 04:55:21 qe-anusaxen134-84r9h-master-2.c.openshift-qe.internal crio[1686]: time="2020-09-09 04:55:21.237750187Z" level=info msg="Created container 014952efa7e3bd7217d7360ea561ff142de330bd94d977636510d506e558c8ae: openshift-ovn-kubernetes/ovnkube-node-metrics-6swb6/kube-rbac-proxy" id=ea4c4b0b-94ad-4cdf-8b2a-add35444584e name=/runtime.v1alpha2.RuntimeService/CreateContainer
Sep 09 04:55:21 qe-anusaxen134-84r9h-master-2.c.openshift-qe.internal crio[1686]: time="2020-09-09 04:55:21.348862369Z" level=info msg="Started container 014952efa7e3bd7217d7360ea561ff142de330bd94d977636510d506e558c8ae: openshift-ovn-kubernetes/ovnkube-node-metrics-6swb6/kube-rbac-proxy" id=8aa93846-23f0-407b-87a9-a575fef92da9 name=/runtime.v1alpha2.RuntimeService/StartContainer
Sep 09 04:55:22 qe-anusaxen134-84r9h-master-2.c.openshift-qe.internal crio[1686]: time="2020-09-09 04:55:22.222697026Z" level=info msg="Removed container 17563ae07e2f797c1b1576eb802669e712a3452a7a6dec9be5319217aa85c4c7: openshift-ovn-kubernetes/ovnkube-node-metrics-6swb6/kube-rbac-proxy" id=599affb0-39d5-4cda-b273-814d94601bec name=/runtime.v1alpha2.RuntimeService/RemoveContainer   <- removed after the following container is created
```

This isn't really our problem; CRI-O is misbehaving because the CPU is saturated to an indecent extent. kube-apiserver, vswitchd, and kubelet seem to be at fault for this excessive CPU usage. My investigation is ongoing. Let's wait for the changes in shared gateway to be implemented (they're about to merge) and then see if the issue reproduces.

Anurag, is this still happening now (since the shared gateway change merged)?

@ben, seems like it's still happening, as per https://bugzilla.redhat.com/show_bug.cgi?id=1881113

Apparently it's the same. These are symptoms, not the actual cause.

*** This bug has been marked as a duplicate of bug 1881113 ***
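The failure mode in the logs above — a new container started while the old one's process still holds the listening port — can be illustrated with a minimal sketch. This is hypothetical demonstration code, not kube-rbac-proxy's actual implementation; it only shows why a second bind on an already-listened port fails:

```python
import socket

# "Old container": bind and listen on a port. Port 0 asks the OS for a free
# ephemeral port so the demo is self-contained; the real proxy uses a fixed port.
old = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
old.bind(("127.0.0.1", 0))
old.listen(1)
port = old.getsockname()[1]

# "New container": created before the old process is dead, it tries to bind
# the same address and fails with EADDRINUSE, so the pod crash-loops.
new = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    new.bind(("127.0.0.1", port))
    result = "bound"
except OSError:
    result = "address already in use"
finally:
    new.close()

old.close()
print(result)
```

Only once the old process actually exits (the `Removed container` log line) is the port released and a replacement able to bind it.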