Description of problem:

One of the kube-scheduler installer pods is stuck in Terminating state, while another similar pod got spun up and runs fine on master-2. kube-scheduler still thinks the pod is not running and stays in a Degraded state. After I force-deleted the stuck pod with 'oc delete pod <pod_name> --grace-period=0 --force', kube-scheduler recovered from Degraded back to the normal state.

```
09-30 15:16:53.743 Failed
09-30 15:16:53.743 Bad clusteroperators not in health state:
09-30 15:16:53.743 kube-scheduler   4.8.9   True   False   True   26m
09-30 15:16:53.743 Using oc describe to check status of bad core clusteroperators ...
09-30 15:16:53.743 Name:    kube-scheduler
09-30 15:16:53.743 Status:
09-30 15:16:53.743   Conditions:
09-30 15:16:53.743     Last Transition Time:  2021-09-30T09:30:50Z
09-30 15:16:53.743     Message:               InstallerPodContainerWaitingDegraded: Pod "installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal" on node "pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal" container "installer" is waiting since 2021-09-30 09:23:43 +0000 UTC because ContainerCreating
09-30 15:16:53.743     Reason:                InstallerPodContainerWaiting_ContainerCreating
09-30 15:16:53.743     Status:                True
09-30 15:16:53.743     Type:                  Degraded
09-30 15:16:53.743     Last Transition Time:  2021-09-30T09:30:04Z
09-30 15:16:53.743     Message:               NodeInstallerProgressing: 3 nodes are at revision 7
09-30 15:16:53.743     Reason:                AsExpected
09-30 15:16:53.743     Status:                False
09-30 15:16:53.743     Type:                  Progressing
09-30 15:16:53.743     Last Transition Time:  2021-09-30T09:20:23Z
09-30 15:16:53.744     Message:               StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 7
09-30 15:16:53.744     Reason:                AsExpected
09-30 15:16:53.744     Status:                True
09-30 15:16:53.744     Type:                  Available
```

Version-Release number of selected component (if applicable):
4.8.9

How Reproducible:
Hit it once.

Steps to Reproduce:
1. Install a cluster with version 4.8.9.

Actual Results:
kube-scheduler goes into a Degraded state even though the right number of installer pods is running. Force-deleting the stuck installer pod brings the scheduler operator back to the right state.

Expected Results:
kube-scheduler should not go into a Degraded state; alternatively, there should be a mechanism so that, if the right number of installer pods needed for the scheduler to work is running, the remaining installer pods stuck in Terminating state are deleted.
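For reference, the force-delete workaround above corresponds to a pod deletion with a zero grace period at the API level. The following is only a hypothetical client-go sketch of that call (pod name and namespace taken from this report); it is not part of any operator code, just an illustration of roughly what 'oc delete pod ... --grace-period=0 --force' asks the API server to do.

```go
// Hypothetical illustration only: the approximate API-level equivalent of
// `oc delete pod <pod_name> -n openshift-kube-scheduler --grace-period=0 --force`.
// Pod name and namespace are taken from this bug report.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig (~/.kube/config), as `oc` would.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	grace := int64(0) // --grace-period=0: delete immediately, no graceful termination
	pod := "installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal"
	if err := client.CoreV1().Pods("openshift-kube-scheduler").Delete(
		context.TODO(), pod, metav1.DeleteOptions{GracePeriodSeconds: &grace}); err != nil {
		panic(err)
	}
	fmt.Println("force-deleted pod", pod)
}
```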
Link to the must-gather log: http://virt-openshift-05.lab.eng.nay.redhat.com/pdhamdhe/must-gather.tar.gz
Thank you for the must-gather.

From must-gather.local.5320996955247938316/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-d34d8f0bd94598f3d245fe274b79fca3926996e3b371d48b72f524c1908b034f/host_service_logs/masters/kubelet_service.log:

```
Sep 30 09:23:43.275498 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:43.275437 1845 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-scheduler/installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal]
Sep 30 09:23:43.442155 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:43.442109 1845 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kubelet-dir\" (UniqueName: \"kubernetes.io/host-path/1a3dd6f4-9504-4096-bf4a-e9d8be863f1e-kubelet-dir\") pod \"installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal\" (UID: \"1a3dd6f4-9504-4096-bf4a-e9d8be863f1e\") "
Sep 30 09:23:43.442451 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:43.442212 1845 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access\" (UniqueName: \"kubernetes.io/projected/1a3dd6f4-9504-4096-bf4a-e9d8be863f1e-kube-api-access\") pod \"installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal\" (UID: \"1a3dd6f4-9504-4096-bf4a-e9d8be863f1e\") "
Sep 30 09:23:43.543297 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:43.543119 1845 reconciler.go:269] "operationExecutor.MountVolume started for volume \"kubelet-dir\" (UniqueName: \"kubernetes.io/host-path/1a3dd6f4-9504-4096-bf4a-e9d8be863f1e-kubelet-dir\") pod \"installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal\" (UID: \"1a3dd6f4-9504-4096-bf4a-e9d8be863f1e\") "
Sep 30 09:23:43.543297 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:43.543239 1845 reconciler.go:269] "operationExecutor.MountVolume started for volume \"kube-api-access\" (UniqueName: \"kubernetes.io/projected/1a3dd6f4-9504-4096-bf4a-e9d8be863f1e-kube-api-access\") pod \"installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal\" (UID: \"1a3dd6f4-9504-4096-bf4a-e9d8be863f1e\") "
Sep 30 09:23:43.543690 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:43.543636 1845 operation_generator.go:698] MountVolume.SetUp succeeded for volume "kubelet-dir" (UniqueName: "kubernetes.io/host-path/1a3dd6f4-9504-4096-bf4a-e9d8be863f1e-kubelet-dir") pod "installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal" (UID: "1a3dd6f4-9504-4096-bf4a-e9d8be863f1e")
Sep 30 09:23:43.631979 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:43.631922 1845 kubelet.go:1951] "SyncLoop UPDATE" source="api" pods=[openshift-kube-scheduler/installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal]
Sep 30 09:23:44.240195 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:44.240153 1845 operation_generator.go:698] MountVolume.SetUp succeeded for volume "kube-api-access" (UniqueName: "kubernetes.io/projected/1a3dd6f4-9504-4096-bf4a-e9d8be863f1e-kube-api-access") pod "installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal" (UID: "1a3dd6f4-9504-4096-bf4a-e9d8be863f1e")
Sep 30 09:23:44.506328 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:44.506173 1845 kuberuntime_manager.go:460] "No sandbox for pod can be found. Need to start a new one" pod="openshift-kube-scheduler/installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal"
Sep 30 09:23:44.507180 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: E0930 09:23:44.507159 1845 kubelet_pods.go:412] "Hostname for pod was too long, truncated it" podName="installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal" hostnameMaxLen=63 truncatedHostname="installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.intern"
Sep 30 09:23:45.106450 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:45.102496 1845 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-kube-scheduler/installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal]
Sep 30 09:23:46.324453 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:46.320768 1845 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-kube-scheduler/installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal]
Sep 30 09:23:47.280550 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: E0930 09:23:47.280373 1845 kuberuntime_sandbox.go:68] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF" pod="openshift-kube-scheduler/installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal"
Sep 30 09:24:25.232409 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: E0930 09:24:25.230819 1845 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/openshift-kube-scheduler_installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal_1a3dd6f4-9504-4096-bf4a-e9d8be863f1e: no such file or directory" pod="openshift-kube-scheduler/installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal"
Sep 30 09:24:37.912549 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: E0930 09:24:37.912480 1845 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/openshift-kube-scheduler_installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal_1a3dd6f4-9504-4096-bf4a-e9d8be863f1e: no such file or directory" pod="openshift-kube-scheduler/installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal"
Sep 30 09:24:38.526327 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: E0930 09:24:38.526235 1845 kubelet_pods.go:412] "Hostname for pod was too long, truncated it" podName="installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal" hostnameMaxLen=63 truncatedHostname="installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.intern"
Sep 30 09:24:40.729439 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:24:40.729244 1845 kubelet_pods.go:1285] "Killing unwanted pod" podName="installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal"
Sep 30 09:24:41.573186 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: E0930 09:24:41.572940 1845 kubelet_pods.go:412] "Hostname for pod was too long, truncated it" podName="installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal" hostnameMaxLen=63 truncatedHostname="installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.intern"
```

From must-gather.local.5320996955247938316/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-d34d8f0bd94598f3d245fe274b79fca3926996e3b371d48b72f524c1908b034f/host_service_logs/masters/crio_service.log:

```
Sep 30 09:23:44.510415 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: time="2021-09-30 09:23:44.508106562Z" level=info msg="Running pod sandbox: openshift-kube-scheduler/installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal/POD" id=40757908-0b8b-44d9-bdaf-2227962f574b name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Sep 30 09:23:44.658591 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: time="2021-09-30 09:23:44.658509784Z" level=info msg="Got pod network &{Name:installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal Namespace:openshift-kube-scheduler ID:d384431413d02e1dfc7faad276117201805c860e4c8e36069f4c15fcf886211c UID:1a3dd6f4-9504-4096-bf4a-e9d8be863f1e NetNS:/var/run/netns/89501363-5641-4645-82d7-9c526292efea Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Sep 30 09:23:44.658591 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: time="2021-09-30 09:23:44.658595967Z" level=info msg="Adding pod openshift-kube-scheduler_installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal to CNI network \"multus-cni-network\" (type=multus)"
Sep 30 09:23:46.301252 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: 2021-09-30T09:23:46Z [verbose] Add: openshift-kube-scheduler:installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal:1a3dd6f4-9504-4096-bf4a-e9d8be863f1e:ovn-kubernetes(ovn-kubernetes):eth0 {"cniVersion":"0.4.0","interfaces":[{"name":"d384431413d02e1","mac":"1a:b2:0a:f4:3d:39"},{"name":"eth0","mac":"0a:58:0a:81:00:13","sandbox":"/var/run/netns/89501363-5641-4645-82d7-9c526292efea"}],"ips":[{"version":"4","interface":1,"address":"10.129.0.19/23","gateway":"10.129.0.1"}],"dns":{}}
Sep 30 09:23:46.301252 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: I0930 09:23:46.274036 38712 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-scheduler", Name:"installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal", UID:"1a3dd6f4-9504-4096-bf4a-e9d8be863f1e", APIVersion:"v1", ResourceVersion:"19377", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.129.0.19/23] from ovn-kubernetes
Sep 30 09:23:46.301921 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: time="2021-09-30 09:23:46.301849626Z" level=info msg="Got pod network &{Name:installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal Namespace:openshift-kube-scheduler ID:d384431413d02e1dfc7faad276117201805c860e4c8e36069f4c15fcf886211c UID:1a3dd6f4-9504-4096-bf4a-e9d8be863f1e NetNS:/var/run/netns/89501363-5641-4645-82d7-9c526292efea Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Sep 30 09:23:46.302136 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: time="2021-09-30 09:23:46.302099602Z" level=info msg="Checking pod openshift-kube-scheduler_installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal for CNI network multus-cni-network (type=multus)"
Sep 30 09:23:46.794221 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: time="2021-09-30 09:23:46.793570590Z" level=info msg="Got pod network &{Name:installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal Namespace:openshift-kube-scheduler ID:d384431413d02e1dfc7faad276117201805c860e4c8e36069f4c15fcf886211c UID:1a3dd6f4-9504-4096-bf4a-e9d8be863f1e NetNS:/var/run/netns/89501363-5641-4645-82d7-9c526292efea Networks:[{Name:multus-cni-network Ifname:eth0}] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Sep 30 09:23:46.794221 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: time="2021-09-30 09:23:46.793888445Z" level=info msg="Deleting pod openshift-kube-scheduler_installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal from CNI network \"multus-cni-network\" (type=multus)"
Sep 30 09:23:47.245188 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: 2021-09-30T09:23:46Z [verbose] Del: openshift-kube-scheduler:installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal:1a3dd6f4-9504-4096-bf4a-e9d8be863f1e:ovn-kubernetes:eth0 {"cniVersion":"0.4.0","dns":{},"ipam":{},"logFile":"/var/log/ovn-kubernetes/ovn-k8s-cni-overlay.log","logLevel":"4","logfile-maxage":5,"logfile-maxbackups":5,"logfile-maxsize":100,"name":"ovn-kubernetes","type":"ovn-k8s-cni-overlay"}
```

The installer_state_controller.go examines all installer pods (independently of the revision number). If at least one of them is in Pending state for at least 5 minutes, the InstallerPodContainerWaitingDegraded condition is set to True. Whenever a new revision installer pod is created, its predecessor is asked to be deleted.

Checking the timestamps from the kubelet and crio logs:

```
Sep 30 09:23:45.106450 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:45.102496 1845 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-kube-scheduler/installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal]
Sep 30 09:23:46.324453 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:46.320768 1845 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-kube-scheduler/installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal]
Sep 30 09:23:46.794221 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: time="2021-09-30 09:23:46.793888445Z" level=info msg="Deleting pod openshift-kube-scheduler_installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal from CNI network \"multus-cni-network\" (type=multus)"
Sep 30 09:23:47.245188 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: 2021-09-30T09:23:46Z [verbose] Del: openshift-kube-scheduler:installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal:1a3dd6f4-9504-4096-bf4a-e9d8be863f1e:ovn-kubernetes:eth0 {"cniVersion":"0.4.0","dns":{},"ipam":{},"logFile":"/var/log/ovn-kubernetes/ovn-k8s-cni-overlay.log","logLevel":"4","logfile-maxage":5,"logfile-maxbackups":5,"logfile-maxsize":100,"name":"ovn-kubernetes","type":"ovn-k8s-cni-overlay"}
Sep 30 09:24:40.729439 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:24:40.729244 1845 kubelet_pods.go:1285] "Killing unwanted pod" podName="installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal"
```

Compared to the reported data:

```
09-30 15:16:53.743 Last Transition Time: 2021-09-30T09:30:50Z
09-30 15:16:53.743 Message: InstallerPodContainerWaitingDegraded: Pod "installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal" on node "pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal" container "installer" is waiting since 2021-09-30 09:23:43 +0000 UTC because ContainerCreating
```

The installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal pod was asked to be deleted and to be killed before 09:24:40.740749. Yet, the reported data at 2021-09-30T09:30:50Z still claims the pod is there.
Also, the kubelet/crio logs do not say anything about the pod being (un)successfully deleted/removed/killed.

Two outcomes from the investigation:
- The installer state controller could be updated to take into account only the latest installer pods. On the other hand, that would make it harder to detect the reported scenario.
- The container is clearly not removed -> this could be a known issue in the container runtime that is already fixed in the latest 4.8 release.

Re-assigning to the Node team for further examination.
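To make the check described above concrete, here is a minimal Go sketch. This is NOT the actual installer_state_controller.go from openshift/library-go, only an approximation of the behaviour as described in this comment: every installer pod is inspected regardless of revision, and a pod stuck in Pending longer than 5 minutes produces an InstallerPodContainerWaitingDegraded-style message. Function and example names are made up for illustration.

```go
// Minimal sketch (not the real installer state controller) of the 5-minute
// Pending check described above.
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const maxPendingDuration = 5 * time.Minute

// waitingDegradedMessages returns one message per installer pod container that
// is waiting (e.g. ContainerCreating) while its pod has been Pending for longer
// than the threshold. Approximation: the pod's creation time is used as the
// start of the Pending period.
func waitingDegradedMessages(installerPods []corev1.Pod, now time.Time) []string {
	var msgs []string
	for _, pod := range installerPods {
		if pod.Status.Phase != corev1.PodPending {
			continue
		}
		if now.Sub(pod.CreationTimestamp.Time) < maxPendingDuration {
			continue
		}
		for _, cs := range pod.Status.ContainerStatuses {
			if w := cs.State.Waiting; w != nil {
				msgs = append(msgs, fmt.Sprintf(
					"Pod %q on node %q container %q is waiting since %v because %s",
					pod.Name, pod.Spec.NodeName, cs.Name,
					pod.CreationTimestamp.Time.UTC(), w.Reason))
			}
		}
	}
	return msgs
}

func main() {
	// Fabricated example pod resembling the one in this report, stuck in
	// ContainerCreating well past the 5-minute threshold.
	created := metav1.NewTime(time.Now().Add(-10 * time.Minute))
	stuck := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:              "installer-5-example-master-2",
			CreationTimestamp: created,
		},
		Spec: corev1.PodSpec{NodeName: "example-master-2"},
		Status: corev1.PodStatus{
			Phase: corev1.PodPending,
			ContainerStatuses: []corev1.ContainerStatus{{
				Name: "installer",
				State: corev1.ContainerState{
					Waiting: &corev1.ContainerStateWaiting{Reason: "ContainerCreating"},
				},
			}},
		},
	}
	for _, m := range waitingDegradedMessages([]corev1.Pod{stuck}, time.Now()) {
		fmt.Println(m) // the operator would set Degraded=True with a message like this
	}
}
```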
FYI

> The installer state controller could be updated to take into account only the latest installer pods. On the other hand, that would make it harder to detect the reported scenario.

Already reported in https://bugzilla.redhat.com/show_bug.cgi?id=1986385
*** This bug has been marked as a duplicate of bug 1986003 ***