Bug 2014542 - SNO cluster sometimes can take long time to recover after reboot (>45minutes)
Summary: SNO cluster sometimes can take long time to recover after reboot (>45minutes)
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Telco Edge
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Bart Wensley
QA Contact: yliu1
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-10-15 14:13 UTC by yliu1
Modified: 2022-12-07 18:53 UTC (History)
4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-12-07 18:53:20 UTC
Target Upstream Version:
Embargoed:


Attachments:

Description yliu1 2021-10-15 14:13:40 UTC
Description of problem:

About 10-20% of the time, the SNO cluster becomes unstable after a reboot, with the kube-apiserver service going up and down. It eventually stabilizes, but total recovery time after the reboot can run from 45 minutes to over an hour.
 
For the other 80-90% of the time, cluster recovery after a reboot usually takes ~20 minutes on the same system.

Version-Release number of selected component (if applicable):
OCP 4.8.15

How reproducible:
10-20%


Steps to Reproduce:
1. SNO cluster with DU configs (4 management CPUs (= 2 hyperthreaded cores), RT kernel enabled via PAO, rcupdate.rcu_normal_after_boot=0 kernel arg added, etc.).
2. Launch 40+ test pods that almost fill up the rest of the CPUs.
3. Soft reboot the system via the systemctl reboot command and wait for all pods on the system to recover (see the sketch after this list).
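
For reference, a minimal sketch of how step 3 can be driven from a workstation with cluster-admin access. The node name is a placeholder, and this is not necessarily the exact tooling used for the measurements in this report:

  # Trigger a soft reboot of the SNO node (node name is a placeholder)
  oc debug node/master-0 -- chroot /host systemctl reboot

  # Wait until the API answers again, then watch for all pods to settle
  until oc get nodes >/dev/null 2>&1; do sleep 10; done
  watch 'oc get pods -A | grep -vE "Running|Completed"'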

Actual results:
Cluster recovery can sometimes take a very long time. See the numbers below; the total recovery times for the 2nd and 3rd reboots are far longer than the others. Per-phase times are in seconds, totals in minutes. (A sketch of one way to watch for the "cluster recover" phase follows the table.)

OCP     Iter  pingable (s)  oc reachable (s)  workload recover (s)  cluster recover (s)  total minutes
4.8.15  1     296           46                600                   224                  19.48
4.8.15  2     272           48                857                   1475                 44.22
4.8.15  3     275           37                632                   ? > 40m              close to 1h
4.8.15  4     266           47                561                   250                  18.73
4.8.15  5     322           48                601                   253                  20.43
4.8.15  6     283           46                561                   183                  17.9
4.8.15  7     394           43                607                   376                  23.7
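
For context on the "cluster recover" column, a sketch of one way to watch for that phase (the exact criterion used for the table above is not stated here): poll until every ClusterOperator reports Available=True, Progressing=False, Degraded=False.

  # Poll ClusterOperator status until the cluster looks settled
  # (columns 3/4/5 of `oc get clusteroperators --no-headers` are AVAILABLE/PROGRESSING/DEGRADED)
  while true; do
    not_ready=$(oc get clusteroperators --no-headers 2>/dev/null | \
      awk '$3 != "True" || $4 != "False" || $5 != "False"')
    oc get clusteroperators >/dev/null 2>&1 && [ -z "$not_ready" ] && break
    sleep 30
  done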


Expected results:
The cluster recovers within a reasonable time (~20 minutes, as in the typical case) after every reboot.


Additional info:
Some timestamps for the 3rd reboot recovery, where the cluster still had not stabilized 30 minutes after kubelet was running.

# reboot triggered:
2021/10/14 15:15:14 Exec command on master-0.cluster1.savanna.lab.eng.rdu2.redhat.com: [chroot /rootfs systemctl reboot]


# kubelet is up 5 minutes after the reboot was triggered.

kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-mco-default-madv.conf, 20-container-mount-namespace.conf, 20-logging.conf, 20-nodenet.conf
   Active: active (running) since Thu 2021-10-14 15:20:21 UTC;
 Main PID: 5364 (kubelet)
    Tasks: 254 (limit: 2256231)
   Memory: 565.1M
      CPU: 8min 21.347s
   CGroup: /system.slice/kubelet.service
           └─5364 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=>

Oct 14 15:39:02 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: [-]has-synced failed: reason withheld
Oct 14 15:39:02 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: [+]process-running ok
Oct 14 15:39:02 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: healthz check failed
Oct 14 15:39:02 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:39:02.166807    5364 prober.go:116] "Probe failed" probeType="Startup" pod="openshift-ingress/router-default-7b6f957698-fkp2>
Oct 14 15:39:03 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:39:03.167921    5364 patch_prober.go:29] interesting pod/router-default-7b6f957698-fkp2t container/router namespace/openshif>
Oct 14 15:39:03 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: [-]has-synced failed: reason withheld
Oct 14 15:39:03 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: [+]process-running ok
Oct 14 15:39:03 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: healthz check failed
Oct 14 15:39:03 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:39:03.168855    5364 prober.go:116] "Probe failed" probeType="Startup" pod="openshift-ingress/router-default-7b6f957698-fkp2>
Oct 14 15:39:03 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:39:03.744660    5364 request.go:665] Waited for 1.010621531s, retries: 1, retry-after: 1s - retry-reason: due to retryable e>

.. 
Oct 14 15:42:40 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:42:40.216117    5364 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"network-opera>
Oct 14 15:42:40 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:42:40.217677    5364 status_manager.go:566] "Failed to get status for pod" podUID=e132eae4-c32b-4307-a6b2-d4890294523c pod=">
Oct 14 15:42:40 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:42:40.228959    5364 logs.go:319] "Finished parsing log file" path="/var/log/pods/openshift-kube-apiserver-operator_kube-api>
Oct 14 15:42:40 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:42:40.229252    5364 kubelet.go:1982] "SyncLoop (PLEG): event for pod" pod="openshift-kube-apiserver-operator/kube-apiserver>
Oct 14 15:42:40 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:42:40.230023    5364 scope.go:111] "RemoveContainer" containerID="f0d165b386d2f7cc86b06b649edb00e361a2f1830f03472e06206f9e4e>
Oct 14 15:42:40 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:42:40.231880    5364 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-apiserve>
Oct 14 15:42:40 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:42:40.233715    5364 status_manager.go:566] "Failed to get status for pod" podUID=c4803a01-1861-4ece-b8fc-9642bcc4fa75 pod=">
Oct 14 15:42:40 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:42:40.290423    5364 scope.go:111] "RemoveContainer" containerID="e7245d09fbcd6752b79f7b1f3ae342fbf1fe62bb2c274e023cec7ba8f9>
Oct 14 15:42:40 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:42:40.291220    5364 status_manager.go:566] "Failed to get status for pod" podUID=725c20fb-5e01-4bcc-b7bb-25441b98738f pod=">
Oct 14 15:42:40 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:42:40.342978    5364 scope.go:111] "RemoveContainer" containerID="5120d0f6d664e2eedbf2dc8b22f324786d49132b36984280fe7c86b64b>

---------------------------------------------------------------------------
Oct 14 15:46:56 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:46:56.791372    5364 scope.go:111] "RemoveContainer" containerID="92115a09a269b84acbfc7ba1837258607691472e616264ada5ce14ba4f>
Oct 14 15:46:56 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:46:56.792189    5364 pod_workers.go:190] "Error syncing pod, skipping" err="[failed to \"StartContainer\" for \"cluster-poli>
Oct 14 15:46:56 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:46:56.793042    5364 status_manager.go:566] "Failed to get status for pod" podUID=e6257157-94e4-43aa-a359-be7bc7540793 pod=">
Oct 14 15:46:57 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:46:57.290241    5364 scope.go:111] "RemoveContainer" containerID="e5ba197a7d5eb6b25c632d04624cbc88d3881a2cabee03aae36300a6ed>
Oct 14 15:46:57 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:46:57.290728    5364 scope.go:111] "RemoveContainer" containerID="69b06abe3f9ccad702b2ce54dd46d65ea576670cc63154363254cb2041>
Oct 14 15:46:57 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:46:57.291339    5364 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"local-storage>
Oct 14 15:46:57 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:46:57.291436    5364 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"ingress-opera>
Oct 14 15:46:57 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:46:57.315758    5364 patch_prober.go:29] interesting pod/kube-apiserver-master-0.cluster1.savanna.lab.eng.rdu2.redhat.com co>
Oct 14 15:46:57 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:46:57.315983    5364 prober.go:116] "Probe failed" probeType="Readiness" pod="openshift-kube-apiserver/kube-apiserver-master>
Oct 14 15:46:57 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:46:57.383206    5364 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://api-int.master->

...
Oct 14 15:50:08 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:50:08.310669    5364 controller.go:144] failed to ensure lease exists, will retry in 200ms, error: Get "https://api-int.mast>
Oct 14 15:50:08 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:50:08.512105    5364 controller.go:144] failed to ensure lease exists, will retry in 400ms, error: Get "https://api-int.mast>
Oct 14 15:50:08 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:50:08.914039    5364 controller.go:144] failed to ensure lease exists, will retry in 800ms, error: Get "https://api-int.mast>
Oct 14 15:50:09 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:50:09.698811    5364 event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMet>
Oct 14 15:50:09 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:50:09.721005    5364 controller.go:144] failed to ensure lease exists, will retry in 1.6s, error: Get "https://api-int.maste>
Oct 14 15:50:10 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:50:10.291129    5364 scope.go:111] "RemoveContainer" containerID="3c7c9f7a4f62ab159e8eaf6d56a9417d9e0329933b151bba2d046e2b38>
Oct 14 15:50:10 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:50:10.292656    5364 status_manager.go:566] "Failed to get status for pod" podUID=72920ab0-93f1-4f5d-89a4-8e63953ecefb pod=">
Oct 14 15:50:10 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:50:10.293497    5364 scope.go:111] "RemoveContainer" containerID="d325b90f903232b3243d990855002d034a4f293539f03a03d6c8ae0bad>
Oct 14 15:50:10 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:50:10.294017    5364 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"ptp-operator\>
Oct 14 15:50:10 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:50:10.294766    5364 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"ingress-opera>
--------

Oct 14 15:59:53 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:59:53.291870    5364 scope.go:111] "RemoveContainer" containerID="f1d4998b7524d37235f3319f46dc960e3a57a2a6770aeb53cb289c1065>
Oct 14 15:59:53 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:59:53.293779    5364 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"ingress-opera>
Oct 14 15:59:53 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:59:53.294297    5364 scope.go:111] "RemoveContainer" containerID="8ac1b11f823cd93b4368e0863cf533f0b1ab3b8c8d6c30f8e0b251b465>
Oct 14 15:59:53 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:59:53.295880    5364 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"service-ca-op>
Oct 14 15:59:54 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:59:54.290852    5364 scope.go:111] "RemoveContainer" containerID="a02f36471ee3708e1acb4cf2cc7714648745f64adb77451608f188e919>
Oct 14 15:59:54 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:59:54.292341    5364 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"openshift-api>
Oct 14 15:59:54 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:59:54.292567    5364 scope.go:111] "RemoveContainer" containerID="7c28f9cd43511125db48f456ee7e473a2cf405d8565745b4e1af5da177>
Oct 14 15:59:54 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:59:54.293329    5364 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"openshift-con>
Oct 14 15:59:54 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: I1014 15:59:54.293521    5364 scope.go:111] "RemoveContainer" containerID="e1bdb0561c2464d172661437f190b7775d3c56c24ba24aea3e2bc40a08>
Oct 14 15:59:54 master-0.cluster1.savanna.lab.eng.rdu2.redhat.com bash[5364]: E1014 15:59:54.294283    5364 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"snapshot-cont>

Comment 3 yliu1 2022-12-07 18:52:38 UTC
No longer seeing this on 4.12. Suggest closing.

