Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2005225

Summary: kubelet kills pod containers and does not restart them again
Product: OpenShift Container Platform
Reporter: Yang Yang <yanyang>
Component: Node
Assignee: Harshal Patil <harpatil>
Node sub component: Kubelet
QA Contact: MinLi <minmli>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: high
Priority: high
CC: aos-bugs, fkrepins, harpatil, maszulik, mfojtik, rphillips
Version: 4.9
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-10-11 21:21:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Must gather log bundle (no flags)

Description Yang Yang 2021-09-17 07:38:21 UTC
Created attachment 1823730 [details]
Must gather log bundle

Description of problem:

One kube-controller-manager pod is in Completed state instead of Running.

# oc get co kube-controller-manager
kube-controller-manager                    4.9.0-0.nightly-2021-09-16-215330   True        True          True       89m     StaticPodsDegraded: pod/kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal container "cluster-policy-controller" is terminated: Completed: ...


# oc get po -n openshift-kube-controller-manager | grep kube-controller-manager

pod/kube-controller-manager-yangyang0917-tv2fd-master-0.c.openshift-qe.internal   4/4     Running     0          24m
pod/kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal   0/4     Completed   0          24m
pod/kube-controller-manager-yangyang0917-tv2fd-master-2.c.openshift-qe.internal   4/4     Running     0          24m



Version-Release number of selected component (if applicable):

4.9.0-0.nightly-2021-09-16-215330

How reproducible:
1/1

Steps to Reproduce:
1. Install a 4.9 cluster
2.
3.

Actual results:
KCM is degraded because one kcm pod is not running

Expected results:
KCM is running well


Additional info:

ClusterID: a09defc5-c8b4-44e3-9c67-7a4665e1f12f
ClusterVersion: Stable at "4.9.0-0.nightly-2021-09-16-215330"
ClusterOperators:
	clusteroperator/kube-controller-manager is degraded because StaticPodsDegraded: pod/kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal container "cluster-policy-controller" is terminated: Completed: 
StaticPodsDegraded: pod/kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal container "kube-controller-manager" is terminated: Completed: 
StaticPodsDegraded: pod/kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal container "kube-controller-manager-cert-syncer" is terminated: Error: I0917 02:30:14.208811       1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-controller-manager", Name:"kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'FastControllerResync' Controller "CertSyncController" resync interval is set to 0s which might lead to client request throttling
StaticPodsDegraded: I0917 02:30:14.211038       1 base_controller.go:67] Waiting for caches to sync for CertSyncController
StaticPodsDegraded: I0917 02:30:14.211102       1 observer_polling.go:159] Starting file observer
StaticPodsDegraded: I0917 02:30:14.312019       1 base_controller.go:73] Caches are synced for CertSyncController 
StaticPodsDegraded: I0917 02:30:14.312156       1 base_controller.go:110] Starting #1 worker of CertSyncController controller ...
StaticPodsDegraded: I0917 02:30:14.312313       1 certsync_controller.go:66] Syncing configmaps: [{aggregator-client-ca false} {client-ca false} {trusted-ca-bundle true}]
StaticPodsDegraded: I0917 02:30:14.312707       1 certsync_controller.go:170] Syncing secrets: [{kube-controller-manager-client-cert-key false} {csr-signer false}]
StaticPodsDegraded: I0917 02:30:15.267817       1 certsync_controller.go:66] Syncing configmaps: [{aggregator-client-ca false} {client-ca false} {trusted-ca-bundle true}]
StaticPodsDegraded: I0917 02:30:15.268250       1 certsync_controller.go:170] Syncing secrets: [{kube-controller-manager-client-cert-key false} {csr-signer false}]
StaticPodsDegraded: 
StaticPodsDegraded: pod/kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal container "kube-controller-manager-recovery-controller" is terminated: Completed:

Comment 3 Maciej Szulik 2021-09-20 13:27:47 UTC
Filip, check what's going on. From what I quickly gathered from the logs, the cluster-policy-controller on master-1 lost leader election; it should fail and restart immediately after that, but that didn't happen.

Comment 4 Filip Krepinsky 2021-09-21 20:21:22 UTC
Even though the kube-controller-manager status mentions cluster-policy-controller, it is not the root cause.

All of the containers in the kcm pod were SIGTERMed in a similar way as the cpc. The "leader election lost" statement is just a final deferred call on SIGTERM. Actually, the cpc on master-1 had not even become the leader yet.

> 2021-09-17T02:30:14.905357066Z I0917 02:30:14.905306       1 leaderelection.go:243] attempting to acquire leader lease openshift-kube-controller-manager/cluster-policy-controller-lock...
> 2021-09-17T02:30:15.328542644Z I0917 02:30:15.328462       1 cmd.go:97] Received SIGTERM or SIGINT signal, shutting down controller.                                          
> 2021-09-17T02:30:15.328542644Z W0917 02:30:15.328533       1 leaderelection.go:75] leader election lost

Containers 1, 2, and 4 exit with code 0 on SIGTERM, but container 3 (kube-controller-manager-cert-syncer) exits with code 2 on SIGTERM.

The kcm operator seems to be working okay. It has created the install pod to update the revision to 9. The install pod succeeds and creates the kcm pod. But a few seconds after that, all of the kcm pod containers get killed, and the kcm operator is stuck waiting for the kcm pod to come up.

> 2021-09-17T05:47:07.662030792Z I0917 05:47:07.658627       1 installer_controller.go:508] "yangyang0917-tv2fd-master-1.c.openshift-qe.internal" is in transition to 9, but has not made progress because static pod is pending

Also, the pods on master-0 and master-2 updated successfully, so there is something wrong only with the pod on master-1:

>  nodeStatuses:
>  - currentRevision: 9
>    nodeName: yangyang0917-tv2fd-master-0.c.openshift-qe.internal
>  - currentRevision: 9
>    nodeName: yangyang0917-tv2fd-master-2.c.openshift-qe.internal
>  - currentRevision: 8
>    nodeName: yangyang0917-tv2fd-master-1.c.openshift-qe.internal
>    targetRevision: 9

Nevertheless, I could not reproduce this behaviour with multiple clean 4.9.0-0.nightly clusters. Even when I forcibly kill the containers myself, they always come back up.

This is probably a problem with the kubelet. The pod has no running containers even though the phase is "Running" and restartPolicy is "Always". The containers have not been restarted (restartCount 0).


Relevant kubelet logs:

> Sep 17 02:30:15.257810 yangyang0917-tv2fd-master-1.c.openshift-qe.internal hyperkube[1501]: I0917 02:30:15.257765    1501 kuberuntime_container.go:720] "Killing container with a grace period" pod="openshift-kube-controller-manager/kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal" podUID=12911eaa-e060-4a5f-92c9-473726e70f9b containerName="kube-controller-manager" containerID="cri-o://81a8a634c2493f43cee899f1ca3cdc659d84308dbdecda65a274d4bfb6257876" gracePeriod=30
> Sep 17 02:30:15.261481 yangyang0917-tv2fd-master-1.c.openshift-qe.internal hyperkube[1501]: I0917 02:30:15.258615    1501 kuberuntime_container.go:720] "Killing container with a grace period" pod="openshift-kube-controller-manager/kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal" podUID=12911eaa-e060-4a5f-92c9-473726e70f9b containerName="kube-controller-manager-recovery-controller" containerID="cri-o://e49e0bc55d1c593a839a2c055a44db1e6b7c446007d9b654d74b025151ea97ff" gracePeriod=30
> Sep 17 02:30:15.279031 yangyang0917-tv2fd-master-1.c.openshift-qe.internal hyperkube[1501]: I0917 02:30:15.259054    1501 kuberuntime_container.go:720] "Killing container with a grace period" pod="openshift-kube-controller-manager/kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal" podUID=12911eaa-e060-4a5f-92c9-473726e70f9b containerName="kube-controller-manager-cert-syncer" containerID="cri-o://8e7753bc7f36c9434b724b5bdd123e3197ac2ce7fd79d8a00e574c0e2fac3c8a" gracePeriod=30
> Sep 17 02:30:15.279031 yangyang0917-tv2fd-master-1.c.openshift-qe.internal hyperkube[1501]: I0917 02:30:15.259107    1501 kuberuntime_container.go:720] "Killing container with a grace period" pod="openshift-kube-controller-manager/kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal" podUID=12911eaa-e060-4a5f-92c9-473726e70f9b containerName="cluster-policy-controller" containerID="cri-o://019a65309dc11f66daf67fbcc132560c252d31eeca5f96d3732261a429f249a2" gracePeriod=30
> Sep 17 02:30:15.279284 yangyang0917-tv2fd-master-1.c.openshift-qe.internal hyperkube[1501]: I0917 02:30:15.259405    1501 kubelet.go:2114] "SyncLoop (PLEG): event for pod" pod="openshift-kube-controller-manager/kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal" event=&{ID:12911eaa-e060-4a5f-92c9-473726e70f9b Type:ContainerStarted Data:e49e0bc55d1c593a839a2c055a44db1e6b7c446007d9b654d74b025151ea97ff}
> Sep 17 02:30:15.536505 yangyang0917-tv2fd-master-1.c.openshift-qe.internal hyperkube[1501]: W0917 02:30:15.536348    1501 manager.go:701] Error getting data for container /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod12911eaa_e060_4a5f_92c9_473726e70f9b.slice/crio-conmon-019a65309dc11f66daf67fbcc132560c252d31eeca5f96d3732261a429f249a2.scope because of race condition
> Sep 17 02:30:15.723469 yangyang0917-tv2fd-master-1.c.openshift-qe.internal hyperkube[1501]: W0917 02:30:15.723410    1501 manager.go:701] Error getting data for container /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod12911eaa_e060_4a5f_92c9_473726e70f9b.slice/crio-conmon-8e7753bc7f36c9434b724b5bdd123e3197ac2ce7fd79d8a00e574c0e2fac3c8a.scope because of race condition
> Sep 17 02:30:15.768907 yangyang0917-tv2fd-master-1.c.openshift-qe.internal hyperkube[1501]: W0917 02:30:15.766714    1501 manager.go:701] Error getting data for container /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod12911eaa_e060_4a5f_92c9_473726e70f9b.slice/crio-conmon-81a8a634c2493f43cee899f1ca3cdc659d84308dbdecda65a274d4bfb6257876.scope because of race condition
> Sep 17 02:30:15.840517 yangyang0917-tv2fd-master-1.c.openshift-qe.internal hyperkube[1501]: W0917 02:30:15.838948    1501 manager.go:701] Error getting data for container /kubepods.slice/kubepods-poddc4a77da_eccd_4298_8d4b_c97ec71f0f36.slice/crio-a02908a05f35c27e2c91ed5f75c987357001f771696fa6b9139218bf601f7ed8.scope because of race condition
> Sep 17 02:30:15.859523 yangyang0917-tv2fd-master-1.c.openshift-qe.internal hyperkube[1501]: W0917 02:30:15.858590    1501 manager.go:701] Error getting data for container /kubepods.slice/kubepods-poddc4a77da_eccd_4298_8d4b_c97ec71f0f36.slice/crio-725b71634643dd2c0c63ed2fe9ae77559995783202dfa89a72b22b2ca751d245.scope because of race condition
> Sep 17 02:30:15.903340 yangyang0917-tv2fd-master-1.c.openshift-qe.internal hyperkube[1501]: W0917 02:30:15.903280    1501 manager.go:701] Error getting data for container /kubepods.slice/kubepods-poddc4a77da_eccd_4298_8d4b_c97ec71f0f36.slice because of race condition

Comment 5 Filip Krepinsky 2021-09-21 20:26:07 UTC
moving to kubelet

Comment 6 Filip Krepinsky 2021-09-21 20:29:03 UTC
@maszulik we also have an option to implement a workaround in our operator: https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/controller/installer/installer_controller.go#L693

Comment 7 Filip Krepinsky 2021-09-21 21:37:18 UTC
Symptoms of https://bugzilla.redhat.com/show_bug.cgi?id=2005350 are similar and might be hitting the same issue.

Comment 8 Maciej Szulik 2021-09-23 10:21:55 UTC
(In reply to Filip Krepinsky from comment #6)
> @maszulik we also have an option to implement a workaround in our operator:
> https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/controller/installer/installer_controller.go#L693

Filip, if you're interested, sync with Stefan and go for it. I see he added that TODO there.

Comment 9 Filip Krepinsky 2021-09-23 20:23:28 UTC
+1 will try to find some time for it

Comment 10 Harshal Patil 2021-10-01 09:48:33 UTC
Hi Filip,

I didn't get why this issue was moved to the node team?

Comment 11 Filip Krepinsky 2021-10-01 19:09:31 UTC
Hi Harshal,

Because even though the Pod spec is in order, the containers are not running. Also, the kubelet logs show errors which suggest it is a kubelet issue (or is there some other component for the kubelet that this should be targeted to?).

Please take a look at must-gather.local.6367732445344693908 logs/yaml ( https://bugzilla.redhat.com/attachment.cgi?id=1823730 ) 

The status of the Pod is Running, yet all containers were terminated:

apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager-yangyang0917-tv2fd-master-1.c.openshift-qe.internal
  namespace: openshift-kube-controller-manager
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-09-17T02:19:07Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-09-17T02:30:16Z"
    message: 'containers with unready status: [kube-controller-manager cluster-policy-controller
      kube-controller-manager-cert-syncer kube-controller-manager-recovery-controller]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-09-17T02:30:16Z"
    message: 'containers with unready status: [kube-controller-manager cluster-policy-controller
      kube-controller-manager-cert-syncer kube-controller-manager-recovery-controller]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-09-17T02:19:07Z"
    status: "True"
    type: PodScheduled
  hostIP: 10.0.0.5
  phase: Running
  podIP: 10.0.0.5
  podIPs:
  - ip: 10.0.0.5
  qosClass: Burstable
  startTime: "2021-09-17T02:19:07Z"