Description of problem:
About 10-20% of the time, the SNO cluster becomes unstable after a reboot, with the kube API service repeatedly going up and down. It eventually stabilizes, but the total recovery time after the reboot can exceed 45m-1h. The other 80-90% of the time, cluster recovery after a reboot takes ~20 minutes on the same system.

Version-Release number of selected component (if applicable):
OCP 4.8.15

How reproducible:
10-20%

Steps to Reproduce:
1. SNO cluster with DU configs: 4 mgmt CPUs (= 2 hyperthreaded cores), RT kernel enabled via PAO, the rcupdate.rcu_normal_after_boot=0 kernel arg added, etc. (An approximate sketch of such a profile is included after the log excerpts below.)
2. Launch 40+ test pods that almost fill up the rest of the CPUs.
3. Soft-reboot the system via the systemctl reboot command and wait for all pods on the system to recover.

Actual results:
Cluster recovery can sometimes take a very long time. See the numbers below: the total recovery times for the 2nd and 3rd reboots are especially long. Stage columns are in seconds; the total is in minutes. (A rough sketch of how these stages can be timed is also included after the log excerpts below.)

OCP     Iter  pingable  oc reachable  workload recover  cluster recover  total minutes
4.8.15  1     296       46            600               224              19.48333333
4.8.15  2     272       48            857               1475             44.21666667
4.8.15  3     275       37            632               ?                > 40m, close to 1h
4.8.15  4     266       47            561               250              18.73333333
4.8.15  5     322       48            601               253              20.43333333
4.8.15  6     283       46            561               183              17.9
4.8.15  7     394       43            607               376              23.7

Expected results:
The cluster recovers within a reasonable time.

Additional info:
Some timestamps from the 3rd reboot recovery, where the cluster had still not stabilized 30 minutes after kubelet was running.

# reboot triggered:
2021/10/14 15:15:14 Exec command on master-0.cluster1.savanna.lab.eng.rdu2.redhat.com: [chroot /rootfs systemctl reboot]

# kubelet is up 5 minutes after the reboot was triggered:
kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-mco-default-madv.conf, 20-container-mount-namespace.conf, 20-logging.conf, 20-nodenet.conf
   Active: active (running) since Thu 2021-10-14 15:20:21 UTC;
 Main PID: 5364 (kubelet)
    Tasks: 254 (limit: 2256231)
   Memory: 565.1M
      CPU: 8min 21.347s
   CGroup: /system.slice/kubelet.service
           └─5364 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=>

# ~24 minutes after the reboot, the router's startup probe is still failing:
Oct 14 15:39:02 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: [-]has-synced failed: reason withheld
Oct 14 15:39:02 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: [+]process-running ok
Oct 14 15:39:02 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: healthz check failed
Oct 14 15:39:02 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:39:02.166807 5364 prober.go:116] "Probe failed" probeType="Startup" pod="openshift-ingress/router-default-7b6f957698-fkp2>
Oct 14 15:39:03 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:39:03.167921 5364 patch_prober.go:29] interesting pod/router-default-7b6f957698-fkp2t container/router namespace/openshif>
Oct 14 15:39:03 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: [-]has-synced failed: reason withheld
Oct 14 15:39:03 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: [+]process-running ok
Oct 14 15:39:03 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: healthz check failed
Oct 14 15:39:03 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:39:03.168855 5364 prober.go:116] "Probe failed" probeType="Startup" pod="openshift-ingress/router-default-7b6f957698-fkp2>
Oct 14 15:39:03 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:39:03.744660 5364 request.go:665] Waited for 1.010621531s, retries: 1, retry-after: 1s - retry-reason: due to retryable e>

...

# ~27 minutes after the reboot, container starts are still failing and pod status cannot be fetched:
Oct 14 15:42:40 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:42:40.216117 5364 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"network-opera>
Oct 14 15:42:40 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:42:40.217677 5364 status_manager.go:566] "Failed to get status for pod" podUID=e132eae4-c32b-4307-a6b2-d4890294523c pod=">
Oct 14 15:42:40 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:42:40.228959 5364 logs.go:319] "Finished parsing log file" path="/var/log/pods/openshift-kube-apiserver-operator_kube-api>
Oct 14 15:42:40 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:42:40.229252 5364 kubelet.go:1982] "SyncLoop (PLEG): event for pod" pod="openshift-kube-apiserver-operator/kube-apiserver>
Oct 14 15:42:40 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:42:40.230023 5364 scope.go:111] "RemoveContainer" containerID="f0d165b386d2f7cc86b06b649edb00e361a2f1830f03472e06206f9e4e>
Oct 14 15:42:40 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:42:40.231880 5364 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-apiserve>
Oct 14 15:42:40 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:42:40.233715 5364 status_manager.go:566] "Failed to get status for pod" podUID=c4803a01-1861-4ece-b8fc-9642bcc4fa75 pod=">
Oct 14 15:42:40 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:42:40.290423 5364 scope.go:111] "RemoveContainer" containerID="e7245d09fbcd6752b79f7b1f3ae342fbf1fe62bb2c274e023cec7ba8f9>
Oct 14 15:42:40 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:42:40.291220 5364 status_manager.go:566] "Failed to get status for pod" podUID=725c20fb-5e01-4bcc-b7bb-25441b98738f pod=">
Oct 14 15:42:40 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:42:40.342978 5364 scope.go:111] "RemoveContainer" containerID="5120d0f6d664e2eedbf2dc8b22f324786d49132b36984280fe7c86b64b>

...

# ~32 minutes after the reboot, containers are still being removed and restarted:
Oct 14 15:46:56 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:46:56.791372 5364 scope.go:111] "RemoveContainer" containerID="92115a09a269b84acbfc7ba1837258607691472e616264ada5ce14ba4f>
Oct 14 15:46:56 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:46:56.792189 5364 pod_workers.go:190] "Error syncing pod, skipping" err="[failed to \"StartContainer\" for \"cluster-poli>
Oct 14 15:46:56 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:46:56.793042 5364 status_manager.go:566] "Failed to get status for pod" podUID=e6257157-94e4-43aa-a359-be7bc7540793 pod=">
Oct 14 15:46:57 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:46:57.290241 5364 scope.go:111] "RemoveContainer" containerID="e5ba197a7d5eb6b25c632d04624cbc88d3881a2cabee03aae36300a6ed>
Oct 14 15:46:57 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:46:57.290728 5364 scope.go:111] "RemoveContainer" containerID="69b06abe3f9ccad702b2ce54dd46d65ea576670cc63154363254cb2041>
Oct 14 15:46:57 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:46:57.291339 5364 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"local-storage>
Oct 14 15:46:57 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:46:57.291436 5364 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"ingress-opera>
Oct 14 15:46:57 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:46:57.315758 5364 patch_prober.go:29] interesting pod/kube-apiserver-master-0.cluster1.savanna.lab.eng.rdu2.redhat.com co>
Oct 14 15:46:57 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:46:57.315983 5364 prober.go:116] "Probe failed" probeType="Readiness" pod="openshift-kube-apiserver/kube-apiserver-master>
Oct 14 15:46:57 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:46:57.383206 5364 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://api-int.master->

...

# ~35 minutes after the reboot, the kubelet still cannot reach api-int to renew its node lease or write events:
Oct 14 15:50:08 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:50:08.310669 5364 controller.go:144] failed to ensure lease exists, will retry in 200ms, error: Get "https://api-int.mast>
Oct 14 15:50:08 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:50:08.512105 5364 controller.go:144] failed to ensure lease exists, will retry in 400ms, error: Get "https://api-int.mast>
Oct 14 15:50:08 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:50:08.914039 5364 controller.go:144] failed to ensure lease exists, will retry in 800ms, error: Get "https://api-int.mast>
Oct 14 15:50:09 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:50:09.698811 5364 event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMet>
Oct 14 15:50:09 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:50:09.721005 5364 controller.go:144] failed to ensure lease exists, will retry in 1.6s, error: Get "https://api-int.maste>
Oct 14 15:50:10 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:50:10.291129 5364 scope.go:111] "RemoveContainer" containerID="3c7c9f7a4f62ab159e8eaf6d56a9417d9e0329933b151bba2d046e2b38>
Oct 14 15:50:10 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:50:10.292656 5364 status_manager.go:566] "Failed to get status for pod" podUID=72920ab0-93f1-4f5d-89a4-8e63953ecefb pod=">
Oct 14 15:50:10 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:50:10.293497 5364 scope.go:111] "RemoveContainer" containerID="d325b90f903232b3243d990855002d034a4f293539f03a03d6c8ae0bad>
Oct 14 15:50:10 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:50:10.294017 5364 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"ptp-operator\>
Oct 14 15:50:10 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:50:10.294766 5364 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"ingress-opera>

...

# ~45 minutes after the reboot, operator containers are still failing to start:
Oct 14 15:59:53 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:59:53.291870 5364 scope.go:111] "RemoveContainer" containerID="f1d4998b7524d37235f3319f46dc960e3a57a2a6770aeb53cb289c1065>
Oct 14 15:59:53 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:59:53.293779 5364 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"ingress-opera>
Oct 14 15:59:53 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:59:53.294297 5364 scope.go:111] "RemoveContainer" containerID="8ac1b11f823cd93b4368e0863cf533f0b1ab3b8c8d6c30f8e0b251b465>
Oct 14 15:59:53 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:59:53.295880 5364 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"service-ca-op>
Oct 14 15:59:54 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:59:54.290852 5364 scope.go:111] "RemoveContainer" containerID="a02f36471ee3708e1acb4cf2cc7714648745f64adb77451608f188e919>
Oct 14 15:59:54 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:59:54.292341 5364 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"openshift-api>
Oct 14 15:59:54 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:59:54.292567 5364 scope.go:111] "RemoveContainer" containerID="7c28f9cd43511125db48f456ee7e473a2cf405d8565745b4e1af5da177>
Oct 14 15:59:54 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:59:54.293329 5364 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"openshift-con>
Oct 14 15:59:54 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:59:54.293521 5364 scope.go:111] "RemoveContainer" containerID="e1bdb0561c2464d172661437f190b7775d3c56c24ba24aea3e2bc40a08>
Oct 14 15:59:54 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:59:54.294283 5364 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"snapshot-cont>
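As referenced in step 1 of Steps to Reproduce, here is a minimal sketch of the kind of DU performance profile involved. It assumes the performance.openshift.io/v2 PerformanceProfile API from PAO; the profile name, CPU ranges, and node selector below are illustrative placeholders, not the exact values from this cluster.

# Hypothetical DU-style profile: 4 reserved (mgmt) CPUs = 2 hyperthreaded cores,
# RT kernel via PAO, and the rcupdate.rcu_normal_after_boot=0 kernel arg.
cat <<'EOF' | oc apply -f -
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: du-sno-profile                   # placeholder name
spec:
  cpu:
    reserved: "0-1,32-33"                # placeholder range: 2 cores + HT siblings for mgmt
    isolated: "2-31,34-63"               # placeholder range: everything else for workloads
  realTimeKernel:
    enabled: true                        # RT kernel enabled via PAO
  additionalKernelArgs:
    - "rcupdate.rcu_normal_after_boot=0"
  nodeSelector:
    node-role.kubernetes.io/master: ""   # SNO: the single node is both master and worker
EOF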
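And, as referenced under Actual results, a rough sketch of how the per-stage recovery times in the table can be collected, started at the moment the reboot is triggered. This is a reconstruction of the measurement approach, not the actual harness used; NODE and the test-workload namespace are placeholders.

#!/bin/bash
# Hypothetical recovery-timing harness; run right after 'systemctl reboot' is issued.
NODE=master-0.example.com     # placeholder node address
start=$(date +%s); t=$start

# pingable: seconds until the node answers ping again
until ping -c1 -W1 "$NODE" >/dev/null 2>&1; do sleep 1; done
echo "pingable: $(( $(date +%s) - t ))s"; t=$(date +%s)

# oc reachable: seconds until the kube API answers requests again
until oc get nodes >/dev/null 2>&1; do sleep 1; done
echo "oc reachable: $(( $(date +%s) - t ))s"; t=$(date +%s)

# workload recover: seconds until all test pods are Ready again
until oc wait pods --all -n test-workload --for=condition=Ready --timeout=10s >/dev/null 2>&1; do sleep 1; done
echo "workload recover: $(( $(date +%s) - t ))s"; t=$(date +%s)

# cluster recover: seconds until every clusteroperator is Available and
# neither Progressing nor Degraded (columns 3-5 of 'oc get co --no-headers')
until co=$(oc get clusteroperators --no-headers 2>/dev/null) \
      && [ -z "$(echo "$co" | awk '$3!="True" || $4!="False" || $5!="False"')" ]; do
  sleep 5
done
echo "cluster recover: $(( $(date +%s) - t ))s"
echo "total: $(( $(date +%s) - start ))s"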
No longer seen on 4.12. Suggest closing.