Description of problem:

4.12 cluster becomes unresponsive after running for about 25+ hours, with kube-apiserver memory and CPU high.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-08-02-225305

How reproducible:

Always. Installed two clusters; both hit it.

Steps to Reproduce:
1. Install 4.12 IPI on a GCP env and keep it running for about 26+ hours.
2. Check CPU and memory constantly; run oc commands constantly.

Actual results:

After about 26+ hours, the cluster becomes unresponsive:

[xxia@laptop 2022-08-04 19:37:43 CST env]$ oc debug no/xxia-83-27k2f-master-1.c.openshift-qe.internal --dry-run=client -o yaml > debug.yaml; oc create -f debug.yaml
Error from server (InternalError): error when creating "odebug.yaml": Internal error occurred: resource quota evaluation timed out

[xxia@laptop 2022-08-04 22:11:54 CST env]$ oc get node
The connection to the server api...qe.gcp.devcluster.openshift.com:6443 was refused - did you specify the right host or port?

[xxia@laptop 2022-08-04 22:12:02 CST env]$ oc get node
Unable to connect to the server: dial tcp 104.154.237.128:6443: i/o timeout

Expected results:

Cluster stays responsive.
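The periodic checks in step 2 can be scripted; a minimal sketch, assuming an `oc` client already logged in to the cluster (the `check_once` helper name and the 10-minute interval are illustrative, not part of the original reproduction):

```shell
# Hypothetical helper: one round of the health checks used in this report.
check_once() {
  date
  oc get node
  oc adm top node
  oc adm top pod -n openshift-kube-apiserver -l apiserver
}

# Repeat until the API server stops responding, e.g. every 10 minutes:
# while :; do check_once; sleep 600; done
```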
Additional info:

Below are the repeated checks until the cluster became unresponsive:

[xxia@laptop 2022-08-04 10:33:30 CST env]$ oc get node
NAME                                                   STATUS   ROLES                  AGE   VERSION
xxia-83-27k2f-master-0.c.openshift-qe.internal         Ready    control-plane,master   17h   v1.24.0+a9d6306
xxia-83-27k2f-master-1.c.openshift-qe.internal         Ready    control-plane,master   17h   v1.24.0+a9d6306
xxia-83-27k2f-master-2.c.openshift-qe.internal         Ready    control-plane,master   17h   v1.24.0+a9d6306
xxia-83-27k2f-worker-a-tsjst.c.openshift-qe.internal   Ready    worker                 17h   v1.24.0+a9d6306
xxia-83-27k2f-worker-b-88rzc.c.openshift-qe.internal   Ready    worker                 17h   v1.24.0+a9d6306
xxia-83-27k2f-worker-c-7lzh4.c.openshift-qe.internal   Ready    worker                 17h   v1.24.0+a9d6306

[xxia@laptop 2022-08-04 10:33:37 CST env]$ oc adm top node
NAME                                                   CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
xxia-83-27k2f-master-0.c.openshift-qe.internal         664m         18%    11550Mi         78%
xxia-83-27k2f-master-1.c.openshift-qe.internal         622m         17%    12691Mi         86%
xxia-83-27k2f-master-2.c.openshift-qe.internal         789m         22%    10445Mi         70%
xxia-83-27k2f-worker-a-tsjst.c.openshift-qe.internal   275m         7%     3967Mi          26%
xxia-83-27k2f-worker-b-88rzc.c.openshift-qe.internal   128m         3%     1748Mi          11%
xxia-83-27k2f-worker-c-7lzh4.c.openshift-qe.internal   421m         12%    3766Mi          25%

[xxia@laptop 2022-08-04 10:34:18 CST env]$ oc adm top pod -n openshift-kube-apiserver -l apiserver
NAME                                                            CPU(cores)   MEMORY(bytes)
kube-apiserver-xxia-83-27k2f-master-0.c.openshift-qe.internal   140m         2634Mi
kube-apiserver-xxia-83-27k2f-master-1.c.openshift-qe.internal   197m         5630Mi
kube-apiserver-xxia-83-27k2f-master-2.c.openshift-qe.internal   134m         3193Mi

[xxia@laptop 2022-08-04 15:53:52 CST env]$ oc get node
NAME                                                   STATUS     ROLES                  AGE   VERSION
xxia-83-27k2f-master-0.c.openshift-qe.internal         NotReady   control-plane,master   22h   v1.24.0+a9d6306
xxia-83-27k2f-master-1.c.openshift-qe.internal         Ready      control-plane,master   22h   v1.24.0+a9d6306
xxia-83-27k2f-master-2.c.openshift-qe.internal         Ready      control-plane,master   22h   v1.24.0+a9d6306
xxia-83-27k2f-worker-a-tsjst.c.openshift-qe.internal   Ready      worker                 22h   v1.24.0+a9d6306
xxia-83-27k2f-worker-b-88rzc.c.openshift-qe.internal   Ready      worker                 22h   v1.24.0+a9d6306
xxia-83-27k2f-worker-c-7lzh4.c.openshift-qe.internal   Ready      worker                 22h   v1.24.0+a9d6306

[xxia@laptop 2022-08-04 15:55:08 CST env]$ oc get node
NAME                                                   STATUS     ROLES                  AGE   VERSION
xxia-83-27k2f-master-0.c.openshift-qe.internal         Ready      control-plane,master   23h   v1.24.0+a9d6306
xxia-83-27k2f-master-1.c.openshift-qe.internal         Ready      control-plane,master   23h   v1.24.0+a9d6306
xxia-83-27k2f-master-2.c.openshift-qe.internal         NotReady   control-plane,master   23h   v1.24.0+a9d6306
xxia-83-27k2f-worker-a-tsjst.c.openshift-qe.internal   Ready      worker                 22h   v1.24.0+a9d6306
xxia-83-27k2f-worker-b-88rzc.c.openshift-qe.internal   Ready      worker                 22h   v1.24.0+a9d6306
xxia-83-27k2f-worker-c-7lzh4.c.openshift-qe.internal   Ready      worker                 22h   v1.24.0+a9d6306

[xxia@laptop 2022-08-04 19:11:57 CST env]$ oc adm top node
NAME                                                   CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
xxia-83-27k2f-master-0.c.openshift-qe.internal         694m         19%    12427Mi         84%
xxia-83-27k2f-master-1.c.openshift-qe.internal         1189m        33%    13542Mi         91%
xxia-83-27k2f-master-2.c.openshift-qe.internal         572m         16%    12741Mi         86%
xxia-83-27k2f-worker-a-tsjst.c.openshift-qe.internal   437m         12%    4794Mi          32%
xxia-83-27k2f-worker-b-88rzc.c.openshift-qe.internal   130m         3%     1773Mi          12%
xxia-83-27k2f-worker-c-7lzh4.c.openshift-qe.internal   472m         13%    4381Mi          29%

[xxia@laptop 2022-08-04 19:13:11 CST env]$ oc adm top pod -A --sort-by memory | head -n 15
NAMESPACE                  NAME                                                            CPU(cores)   MEMORY(bytes)
openshift-kube-apiserver   kube-apiserver-xxia-83-27k2f-master-2.c.openshift-qe.internal   210m         6840Mi
openshift-kube-apiserver   kube-apiserver-xxia-83-27k2f-master-0.c.openshift-qe.internal   43m          4228Mi
openshift-kube-apiserver   kube-apiserver-xxia-83-27k2f-master-1.c.openshift-qe.internal   79m          3257Mi
openshift-etcd             etcd-xxia-83-27k2f-master-2.c.openshift-qe.internal             172m         3167Mi
openshift-etcd             etcd-xxia-83-27k2f-master-0.c.openshift-qe.internal             130m         2880Mi
openshift-monitoring       prometheus-k8s-0                                                199m         2316Mi
openshift-etcd             etcd-xxia-83-27k2f-master-1.c.openshift-qe.internal             87m          2298Mi
openshift-monitoring       prometheus-k8s-1                                                199m         2274Mi
openshift-monitoring       prometheus-operator-77f59b757c-ghsn6                            0m           1768Mi
...

[xxia@laptop 2022-08-04 19:16:43 CST env]$ oc adm top pod -n openshift-kube-apiserver -l apiserver
NAME                                                            CPU(cores)   MEMORY(bytes)
kube-apiserver-xxia-83-27k2f-master-0.c.openshift-qe.internal   187m         3796Mi
kube-apiserver-xxia-83-27k2f-master-1.c.openshift-qe.internal   78m          3257Mi
kube-apiserver-xxia-83-27k2f-master-2.c.openshift-qe.internal   366m         6892Mi

[xxia@laptop 2022-08-04 19:17:14 CST env]$ oc adm top pod -n openshift-kube-apiserver -l apiserver
NAME                                                            CPU(cores)   MEMORY(bytes)
kube-apiserver-xxia-83-27k2f-master-0.c.openshift-qe.internal   400m         3825Mi
kube-apiserver-xxia-83-27k2f-master-1.c.openshift-qe.internal   116m         3252Mi
kube-apiserver-xxia-83-27k2f-master-2.c.openshift-qe.internal   600m         7036Mi

[xxia@laptop 2022-08-04 19:17:16 CST env]$ oc adm top pod -n openshift-kube-apiserver -l apiserver
NAME                                                            CPU(cores)   MEMORY(bytes)
kube-apiserver-xxia-83-27k2f-master-0.c.openshift-qe.internal   400m         3825Mi
kube-apiserver-xxia-83-27k2f-master-1.c.openshift-qe.internal   179m         3254Mi
kube-apiserver-xxia-83-27k2f-master-2.c.openshift-qe.internal   477m         7637Mi

[xxia@laptop 2022-08-04 19:17:28 CST env]$ oc adm top node
NAME                                                   CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
xxia-83-27k2f-master-0.c.openshift-qe.internal         989m         28%    12974Mi         87%
xxia-83-27k2f-master-1.c.openshift-qe.internal         1326m        37%    13865Mi         94%
xxia-83-27k2f-master-2.c.openshift-qe.internal         1404m        40%    13464Mi         91%
xxia-83-27k2f-worker-a-tsjst.c.openshift-qe.internal   489m         13%    4755Mi          32%
xxia-83-27k2f-worker-b-88rzc.c.openshift-qe.internal   141m         4%     1788Mi          12%
xxia-83-27k2f-worker-c-7lzh4.c.openshift-qe.internal   468m         13%    4417Mi          29%

[xxia@laptop 2022-08-04 19:36:06 CST env]$ oc adm top node
NAME                                                   CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
xxia-83-27k2f-master-0.c.openshift-qe.internal         796m         22%    11660Mi         79%
xxia-83-27k2f-master-2.c.openshift-qe.internal         564m         16%    12510Mi         84%
xxia-83-27k2f-worker-a-tsjst.c.openshift-qe.internal   471m         13%    4828Mi          32%
xxia-83-27k2f-worker-b-88rzc.c.openshift-qe.internal   132m         3%     1778Mi          12%
xxia-83-27k2f-worker-c-7lzh4.c.openshift-qe.internal   476m         13%    4381Mi          29%
xxia-83-27k2f-master-1.c.openshift-qe.internal         <unknown>    <unknown>   <unknown>  <unknown>

[xxia@laptop 2022-08-04 19:36:29 CST env]$ oc get node
NAME                                                   STATUS   ROLES                  AGE   VERSION
xxia-83-27k2f-master-0.c.openshift-qe.internal         Ready    control-plane,master   26h   v1.24.0+a9d6306
xxia-83-27k2f-master-1.c.openshift-qe.internal         Ready    control-plane,master   26h   v1.24.0+a9d6306
xxia-83-27k2f-master-2.c.openshift-qe.internal         Ready    control-plane,master   26h   v1.24.0+a9d6306
xxia-83-27k2f-worker-a-tsjst.c.openshift-qe.internal   Ready    worker                 26h   v1.24.0+a9d6306
xxia-83-27k2f-worker-b-88rzc.c.openshift-qe.internal   Ready    worker                 26h   v1.24.0+a9d6306
xxia-83-27k2f-worker-c-7lzh4.c.openshift-qe.internal   Ready    worker                 26h   v1.24.0+a9d6306

[xxia@laptop 2022-08-04 19:37:32 CST env]$ oc adm top node
NAME                                                   CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
xxia-83-27k2f-master-0.c.openshift-qe.internal         2030m        57%    12268Mi         83%
xxia-83-27k2f-master-2.c.openshift-qe.internal         701m         20%    13751Mi         93%
xxia-83-27k2f-worker-a-tsjst.c.openshift-qe.internal   597m         17%    4816Mi          32%
xxia-83-27k2f-worker-b-88rzc.c.openshift-qe.internal   204m         5%     1785Mi          12%
xxia-83-27k2f-worker-c-7lzh4.c.openshift-qe.internal   590m         16%    4336Mi          29%
xxia-83-27k2f-master-1.c.openshift-qe.internal         <unknown>    <unknown>   <unknown>  <unknown>

[xxia@laptop 2022-08-04 22:11:54 CST env]$ oc get node
The connection to the server api....gcp.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
[xxia@laptop 2022-08-04 22:12:02 CST env]$ oc get node
Unable to connect to the server: dial tcp 104.154....:6443: i/o timeout

[xxia@laptop 2022-08-04 22:17:30 CST env]$
[xxia@laptop 2022-08-04 22:18:12 CST env]$ oc get node
The connection to the server api....gcp.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
How many secrets are in the openshift-monitoring namespace? If many, then this is a likely duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2115527.
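For reference, one quick way to check is to count the rows from `oc get secrets`; a minimal sketch, assuming a logged-in `oc` client (the `count_rows` helper is hypothetical, introduced only for illustration):

```shell
# Hypothetical helper: count non-empty lines from a file or stdin.
count_rows() {
  grep -c . "${1:-/dev/stdin}"
}

# Against a live cluster (needs a kubeconfig; not runnable offline):
# oc get secrets -n openshift-monitoring --no-headers | count_rows
```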
I did not check this when observing the non-responsiveness. I will try to reproduce again.
Tested the older payload again; it has a very large number of secrets in the openshift-monitoring namespace. I should have debugged this further : ) so that others could find my bug by searching before filing bug 2115527 : ). Tested 4.12.0-0.nightly-2022-08-10-034842 and can't reproduce the non-responsiveness; it was fixed on the monitoring side.

*** This bug has been marked as a duplicate of bug 2115527 ***