Bug 2005225
| Summary: | kubelet kills pod containers and does not restart them again | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Yang Yang <yanyang> |
| Component: | Node | Assignee: | Harshal Patil <harpatil> |
| Node sub component: | Kubelet | QA Contact: | MinLi <minmli> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | aos-bugs, fkrepins, harpatil, maszulik, mfojtik, rphillips |
| Version: | 4.9 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-11 21:21:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | Must gather log bundle (attachment 1823730) | | |
Filip, check what's going on. From what I quickly checked in the logs, the situation was that cluster-policy-controller on master-1 lost leader election. It should fail and restart immediately after that, but that didn't happen.

Even though the kube-controller-manager status mentions cluster-policy-controller, it is not the root cause. All of the containers in the kcm pod were SIGTERMed in a similar way to cpc. The "leader election lost" statement is just a final deferred call on SIGTERM; the cpc on master-1 had not even become the leader yet:

> 2021-09-17T02:30:14.905357066Z I0917 02:30:14.905306 1 leaderelection.go:243] attempting to acquire leader lease openshift-kube-controller-manager/cluster-policy-controller-lock...
> 2021-09-17T02:30:15.328542644Z I0917 02:30:15.328462 1 cmd.go:97] Received SIGTERM or SIGINT signal, shutting down controller.
> 2021-09-17T02:30:15.328542644Z W0917 02:30:15.328533 1 leaderelection.go:75] leader election lost

Containers 1, 2, and 4 exit with code 0 on SIGTERM, but container 3 (kube-controller-manager-cert-syncer) exits with code 2 on SIGTERM.

The kcm operator seems to be working okay. It has created the install pod to update the revision to 9. The install pod succeeds and creates the kcm pod, but a few seconds after that all of the kcm pod containers get killed, and the kcm operator is stuck waiting for the kcm pod to come up.

> 2021-09-17T05:47:07.662030792Z I0917 05:47:07.658627 1 installer_controller.go:508] "yangyang0917-tv2fd-master-1.c.openshift-qe.internal" is in transition to 9, but has not made progress because static pod is pending

Also, the pods on master-0 and master-2 updated successfully, so there is something wrong only with the pod on master-1:

> nodeStatuses:
> - currentRevision: 9
>   nodeName: yangyang0917-tv2fd-master-0.c.openshift-qe.internal
> - currentRevision: 9
>   nodeName: yangyang0917-tv2fd-master-2.c.openshift-qe.internal
> - currentRevision: 8
>   nodeName: yangyang0917-tv2fd-master-1.c.openshift-qe.internal
>   targetRevision: 9

Nevertheless, I could not reproduce this behaviour with multiple clean 4.9.0-0.nightly clusters. Even when I forcibly kill the containers myself, they always come back up. This is probably a problem with kubelet: the pod has no running containers even though the phase is "Running" and restartPolicy is "Always", and the containers have not been restarted (restartCount 0).
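For context on that deferred call: in client-go's leader election, the `OnStoppedLeading` callback is invoked via `defer` when the election loop exits, and in recent client-go versions that happens even if the lease was never acquired, so a SIGTERM-cancelled context produces the "leader election lost" line on any shutdown. Below is a minimal sketch of that shape, assuming a Lease-based lock and illustrative timings; it is not the actual cluster-policy-controller code:

```go
package main

import (
	"context"
	"log"
	"os/signal"
	"syscall"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	// SIGTERM/SIGINT cancel this context, which aborts the acquire loop
	// inside RunOrDie below.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Namespace: "openshift-kube-controller-manager",
			Name:      "cluster-policy-controller-lock",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: "example-identity"},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 137 * time.Second,
		RenewDeadline: 107 * time.Second,
		RetryPeriod:   26 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Controllers would start here; this is never reached if
				// the context is cancelled while still acquiring the lease.
			},
			// Runs via defer when the election loop exits -- leader or
			// not -- which is why the log line shows up on SIGTERM.
			OnStoppedLeading: func() { log.Println("leader election lost") },
		},
	})
}
```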
Relevant kubelet logs:

> Sep 17 02:30:15.257810 yangyang0917-tv2fd-master-1.c.openshift-qe.internal hyperkube[1501]: I0917 02:30:15.257765 1501 kuberuntime_container.go:720] "Killing container with a grace period" pod="openshift-kube-controller-manager/kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal" podUID=12911eaa-e060-4a5f-92c9-473726e70f9b containerName="kube-controller-manager" containerID="cri-o://81a8a634c2493f43cee899f1ca3cdc659d84308dbdecda65a274d4bfb6257876" gracePeriod=30
> Sep 17 02:30:15.261481 yangyang0917-tv2fd-master-1.c.openshift-qe.internal hyperkube[1501]: I0917 02:30:15.258615 1501 kuberuntime_container.go:720] "Killing container with a grace period" pod="openshift-kube-controller-manager/kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal" podUID=12911eaa-e060-4a5f-92c9-473726e70f9b containerName="kube-controller-manager-recovery-controller" containerID="cri-o://e49e0bc55d1c593a839a2c055a44db1e6b7c446007d9b654d74b025151ea97ff" gracePeriod=30
> Sep 17 02:30:15.279031 yangyang0917-tv2fd-master-1.c.openshift-qe.internal hyperkube[1501]: I0917 02:30:15.259054 1501 kuberuntime_container.go:720] "Killing container with a grace period" pod="openshift-kube-controller-manager/kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal" podUID=12911eaa-e060-4a5f-92c9-473726e70f9b containerName="kube-controller-manager-cert-syncer" containerID="cri-o://8e7753bc7f36c9434b724b5bdd123e3197ac2ce7fd79d8a00e574c0e2fac3c8a" gracePeriod=30
> Sep 17 02:30:15.279031 yangyang0917-tv2fd-master-1.c.openshift-qe.internal hyperkube[1501]: I0917 02:30:15.259107 1501 kuberuntime_container.go:720] "Killing container with a grace period" pod="openshift-kube-controller-manager/kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal" podUID=12911eaa-e060-4a5f-92c9-473726e70f9b containerName="cluster-policy-controller" containerID="cri-o://019a65309dc11f66daf67fbcc132560c252d31eeca5f96d3732261a429f249a2" gracePeriod=30
> Sep 17 02:30:15.279284 yangyang0917-tv2fd-master-1.c.openshift-qe.internal hyperkube[1501]: I0917 02:30:15.259405 1501 kubelet.go:2114] "SyncLoop (PLEG): event for pod" pod="openshift-kube-controller-manager/kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal" event=&{ID:12911eaa-e060-4a5f-92c9-473726e70f9b Type:ContainerStarted Data:e49e0bc55d1c593a839a2c055a44db1e6b7c446007d9b654d74b025151ea97ff}
> Sep 17 02:30:15.536505 yangyang0917-tv2fd-master-1.c.openshift-qe.internal hyperkube[1501]: W0917 02:30:15.536348 1501 manager.go:701] Error getting data for container /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod12911eaa_e060_4a5f_92c9_473726e70f9b.slice/crio-conmon-019a65309dc11f66daf67fbcc132560c252d31eeca5f96d3732261a429f249a2.scope because of race condition
> Sep 17 02:30:15.723469 yangyang0917-tv2fd-master-1.c.openshift-qe.internal hyperkube[1501]: W0917 02:30:15.723410 1501 manager.go:701] Error getting data for container /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod12911eaa_e060_4a5f_92c9_473726e70f9b.slice/crio-conmon-8e7753bc7f36c9434b724b5bdd123e3197ac2ce7fd79d8a00e574c0e2fac3c8a.scope because of race condition
> Sep 17 02:30:15.768907 yangyang0917-tv2fd-master-1.c.openshift-qe.internal hyperkube[1501]: W0917 02:30:15.766714 1501 manager.go:701] Error getting data for container /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod12911eaa_e060_4a5f_92c9_473726e70f9b.slice/crio-conmon-81a8a634c2493f43cee899f1ca3cdc659d84308dbdecda65a274d4bfb6257876.scope because of race condition
> Sep 17 02:30:15.840517 yangyang0917-tv2fd-master-1.c.openshift-qe.internal hyperkube[1501]: W0917 02:30:15.838948 1501 manager.go:701] Error getting data for container /kubepods.slice/kubepods-poddc4a77da_eccd_4298_8d4b_c97ec71f0f36.slice/crio-a02908a05f35c27e2c91ed5f75c987357001f771696fa6b9139218bf601f7ed8.scope because of race condition
> Sep 17 02:30:15.859523 yangyang0917-tv2fd-master-1.c.openshift-qe.internal hyperkube[1501]: W0917 02:30:15.858590 1501 manager.go:701] Error getting data for container /kubepods.slice/kubepods-poddc4a77da_eccd_4298_8d4b_c97ec71f0f36.slice/crio-725b71634643dd2c0c63ed2fe9ae77559995783202dfa89a72b22b2ca751d245.scope because of race condition
> Sep 17 02:30:15.903340 yangyang0917-tv2fd-master-1.c.openshift-qe.internal hyperkube[1501]: W0917 02:30:15.903280 1501 manager.go:701] Error getting data for container /kubepods.slice/kubepods-poddc4a77da_eccd_4298_8d4b_c97ec71f0f36.slice because of race condition

Moving to kubelet.

@maszulik we also have an option to implement a workaround in our operator: https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/controller/installer/installer_controller.go#L693

Symptoms of https://bugzilla.redhat.com/show_bug.cgi?id=2005350 are similar and might be hitting the same issue.

(In reply to Filip Krepinsky from comment #6)
> @maszulik we also have an option to implement a workaround in our operator:
> https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/controller/installer/installer_controller.go#L693

Filip, if you're interested, sync with Stefan and go for it. I see he added that TODO there.

+1 will try to find some time for it

Hi Filip, I didn't get why this issue was moved to the node team?

Hi Harshal, because even though the Pod spec is in order, the containers are not running. The kubelet logs also show errors which suggest it is a kubelet issue (or is there some other component besides the kubelet this should be targeted at?). Please take a look at the must-gather.local.6367732445344693908 logs/yaml ( https://bugzilla.redhat.com/attachment.cgi?id=1823730 ). The status of the Pod is Running while all of its containers were terminated:

    apiVersion: v1
    kind: Pod
    metadata:
      name: kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal
      namespace: openshift-kube-controller-manager
    status:
      conditions:
      - lastProbeTime: null
        lastTransitionTime: "2021-09-17T02:19:07Z"
        status: "True"
        type: Initialized
      - lastProbeTime: null
        lastTransitionTime: "2021-09-17T02:30:16Z"
        message: 'containers with unready status: [kube-controller-manager cluster-policy-controller kube-controller-manager-cert-syncer kube-controller-manager-recovery-controller]'
        reason: ContainersNotReady
        status: "False"
        type: Ready
      - lastProbeTime: null
        lastTransitionTime: "2021-09-17T02:30:16Z"
        message: 'containers with unready status: [kube-controller-manager cluster-policy-controller kube-controller-manager-cert-syncer kube-controller-manager-recovery-controller]'
        reason: ContainersNotReady
        status: "False"
        type: ContainersReady
      - lastProbeTime: null
        lastTransitionTime: "2021-09-17T02:19:07Z"
        status: "True"
        type: PodScheduled
      hostIP: 10.0.0.5
      phase: Running
      podIP: 10.0.0.5
      podIPs:
      - ip: 10.0.0.5
      qosClass: Burstable
      startTime: "2021-09-17T02:19:07Z"
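The broken combination above (phase `Running`, every container terminated, restartCount 0) can also be detected mechanically. A hypothetical client-go helper, sketched only for illustration — `findStuckPods` is not part of any operator or of the must-gather tooling:

```go
package diagnostics

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// findStuckPods returns the pods in a namespace whose phase is still Running
// even though none of their containers are running and none were ever
// restarted -- the symptom seen on master-1 in this bug.
func findStuckPods(ctx context.Context, client kubernetes.Interface, ns string) ([]string, error) {
	pods, err := client.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var stuck []string
	for _, pod := range pods.Items {
		if pod.Status.Phase != corev1.PodRunning || len(pod.Status.ContainerStatuses) == 0 {
			continue
		}
		allDead := true
		for _, cs := range pod.Status.ContainerStatuses {
			// A running or waiting container, or any restart, means the
			// kubelet is still managing the pod as expected.
			if cs.State.Terminated == nil || cs.RestartCount > 0 {
				allDead = false
				break
			}
		}
		if allDead {
			stuck = append(stuck, pod.Name)
		}
	}
	return stuck, nil
}
```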
Created attachment 1823730 [details]
Must gather log bundle

Description of problem:

One kube-controller-manager pod is completed but not running.

    # oc get co kube-controller-manager
    kube-controller-manager   4.9.0-0.nightly-2021-09-16-215330   True   True   True   89m   StaticPodsDegraded: pod/kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal container "cluster-policy-controller" is terminated: Completed: ...

    # oc get po -n openshift-kube-controller-manager | grep kube-controller-manager
    pod/kube-controller-manager-yangyang0917-tv2fd-master-0.c.openshift-qe.internal   4/4   Running     0   24m
    pod/kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal   0/4   Completed   0   24m
    pod/kube-controller-manager-yangyang0917-tv2fd-master-2.c.openshift-qe.internal   4/4   Running     0   24m

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-09-16-215330

How reproducible:
1/1

Steps to Reproduce:
1. Install a 4.9 cluster

Actual results:
KCM is degraded because one kcm pod is not running

Expected results:
KCM is running well

Additional info:

ClusterID: a09defc5-c8b4-44e3-9c67-7a4665e1f12f
ClusterVersion: Stable at "4.9.0-0.nightly-2021-09-16-215330"
ClusterOperators: clusteroperator/kube-controller-manager is degraded because

    StaticPodsDegraded: pod/kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal container "cluster-policy-controller" is terminated: Completed:
    StaticPodsDegraded: pod/kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal container "kube-controller-manager" is terminated: Completed:
    StaticPodsDegraded: pod/kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal container "kube-controller-manager-cert-syncer" is terminated: Error: I0917 02:30:14.208811 1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-controller-manager", Name:"kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'FastControllerResync' Controller "CertSyncController" resync interval is set to 0s which might lead to client request throttling
    StaticPodsDegraded: I0917 02:30:14.211038 1 base_controller.go:67] Waiting for caches to sync for CertSyncController
    StaticPodsDegraded: I0917 02:30:14.211102 1 observer_polling.go:159] Starting file observer
    StaticPodsDegraded: I0917 02:30:14.312019 1 base_controller.go:73] Caches are synced for CertSyncController
    StaticPodsDegraded: I0917 02:30:14.312156 1 base_controller.go:110] Starting #1 worker of CertSyncController controller ...
    StaticPodsDegraded: I0917 02:30:14.312313 1 certsync_controller.go:66] Syncing configmaps: [{aggregator-client-ca false} {client-ca false} {trusted-ca-bundle true}]
    StaticPodsDegraded: I0917 02:30:14.312707 1 certsync_controller.go:170] Syncing secrets: [{kube-controller-manager-client-cert-key false} {csr-signer false}]
    StaticPodsDegraded: I0917 02:30:15.267817 1 certsync_controller.go:66] Syncing configmaps: [{aggregator-client-ca false} {client-ca false} {trusted-ca-bundle true}]
    StaticPodsDegraded: I0917 02:30:15.268250 1 certsync_controller.go:170] Syncing secrets: [{kube-controller-manager-client-cert-key false} {csr-signer false}]
    StaticPodsDegraded:
    StaticPodsDegraded: pod/kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal container "kube-controller-manager-recovery-controller" is terminated: Completed:
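For completeness, the operator-side workaround discussed in comment #6 would amount to the installer controller force-resyncing a static pod that stays wedged during a revision rollout. A rough sketch of the idea follows; every identifier and the threshold are assumptions made here, not the actual code behind the linked library-go TODO:

```go
package workaround

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// maybeKickStuckStaticPod deletes the mirror pod of a static pod that has
// not become Ready within a grace window, on the assumption that the kubelet
// will then recreate it (and its containers) from the manifest on disk.
func maybeKickStuckStaticPod(ctx context.Context, client kubernetes.Interface, ns, name string, notReadySince time.Time) error {
	const stuckAfter = 5 * time.Minute // arbitrary threshold for the sketch

	if time.Since(notReadySince) < stuckAfter {
		return nil // give the kubelet time to bring the containers up itself
	}
	pod, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
			return nil // pod recovered on its own
		}
	}
	// Deleting the mirror pod is the nudge: the kubelet owns the static pod
	// and is expected to re-sync it from the manifest.
	return client.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{})
}
```

Whether that nudge would actually un-wedge the kubelet in this failure mode is exactly what this bug is about, so the sketch only captures the shape of the TODO, not a verified fix.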