Description of problem:

One of the kube-scheduler installer pods is stuck in Terminating state, while another similar pod got spun up and runs fine on master-2. kube-scheduler still thinks the pod is not running and stays in a Degraded state. After I force-deleted the stuck pod with 'oc delete pod <pod_name> --grace-period=0 --force', kube-scheduler recovered from Degraded back to the normal state.

```
09-30 15:16:53.743 Failed
09-30 15:16:53.743 Bad clusteroperators not in health state:
09-30 15:16:53.743 kube-scheduler   4.8.9   True   False   True   26m
09-30 15:16:53.743 Using oc describe to check status of bad core clusteroperators ...
09-30 15:16:53.743 Name:    kube-scheduler
09-30 15:16:53.743 Status:
09-30 15:16:53.743   Conditions:
09-30 15:16:53.743     Last Transition Time:  2021-09-30T09:30:50Z
09-30 15:16:53.743     Message:               InstallerPodContainerWaitingDegraded: Pod "installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal" on node "pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal" container "installer" is waiting since 2021-09-30 09:23:43 +0000 UTC because ContainerCreating
09-30 15:16:53.743     Reason:                InstallerPodContainerWaiting_ContainerCreating
09-30 15:16:53.743     Status:                True
09-30 15:16:53.743     Type:                  Degraded
09-30 15:16:53.743     Last Transition Time:  2021-09-30T09:30:04Z
09-30 15:16:53.743     Message:               NodeInstallerProgressing: 3 nodes are at revision 7
09-30 15:16:53.743     Reason:                AsExpected
09-30 15:16:53.743     Status:                False
09-30 15:16:53.743     Type:                  Progressing
09-30 15:16:53.743     Last Transition Time:  2021-09-30T09:20:23Z
09-30 15:16:53.744     Message:               StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 7
09-30 15:16:53.744     Reason:                AsExpected
09-30 15:16:53.744     Status:                True
09-30 15:16:53.744     Type:                  Available
```

Version-Release number of selected component (if applicable):
4.8.9

How Reproducible:
Hit it once.

Steps to Reproduce:
1. Install a cluster with version 4.8.9.

Actual Results:
kube-scheduler goes into a Degraded state even though the right number of installer pods is running. Force-deleting the stuck installer pod brings the scheduler operator back to the right state.

Expected Results:
kube-scheduler should not go into a Degraded state; alternatively, there should be a mechanism so that, if the right number of installer pods needed for the scheduler to work is running, the remaining installer pods stuck in Terminating state are deleted.
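For reference, the force-delete workaround above corresponds to a pod deletion with a zero grace period at the API level. The following is only a hypothetical client-go sketch of that call (pod name and namespace taken from this report); it is not part of any operator code, just an illustration of roughly what 'oc delete pod ... --grace-period=0 --force' asks the API server to do.

```go
// Hypothetical illustration only: the approximate API-level equivalent of
// `oc delete pod <pod_name> -n openshift-kube-scheduler --grace-period=0 --force`.
// Pod name and namespace are taken from this bug report.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig (~/.kube/config), as `oc` would.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	grace := int64(0) // --grace-period=0: delete immediately, no graceful termination
	pod := "installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal"
	if err := client.CoreV1().Pods("openshift-kube-scheduler").Delete(
		context.TODO(), pod, metav1.DeleteOptions{GracePeriodSeconds: &grace}); err != nil {
		panic(err)
	}
	fmt.Println("force-deleted pod", pod)
}
```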
Link to the must-gather log: http://virt-openshift-05.lab.eng.nay.redhat.com/pdhamdhe/must-gather.tar.gz
Thank you for the must-gather.

From must-gather.local.5320996955247938316/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-d34d8f0bd94598f3d245fe274b79fca3926996e3b371d48b72f524c1908b034f/host_service_logs/masters/kubelet_service.log:

```
Sep 30 09:23:43.275498 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:43.275437 1845 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-kube-scheduler/installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal]
Sep 30 09:23:43.442155 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:43.442109 1845 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kubelet-dir\" (UniqueName: \"kubernetes.io/host-path/1a3dd6f4-9504-4096-bf4a-e9d8be863f1e-kubelet-dir\") pod \"installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal\" (UID: \"1a3dd6f4-9504-4096-bf4a-e9d8be863f1e\") "
Sep 30 09:23:43.442451 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:43.442212 1845 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access\" (UniqueName: \"kubernetes.io/projected/1a3dd6f4-9504-4096-bf4a-e9d8be863f1e-kube-api-access\") pod \"installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal\" (UID: \"1a3dd6f4-9504-4096-bf4a-e9d8be863f1e\") "
Sep 30 09:23:43.543297 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:43.543119 1845 reconciler.go:269] "operationExecutor.MountVolume started for volume \"kubelet-dir\" (UniqueName: \"kubernetes.io/host-path/1a3dd6f4-9504-4096-bf4a-e9d8be863f1e-kubelet-dir\") pod \"installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal\" (UID: \"1a3dd6f4-9504-4096-bf4a-e9d8be863f1e\") "
Sep 30 09:23:43.543297 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:43.543239 1845 reconciler.go:269] "operationExecutor.MountVolume started for volume \"kube-api-access\" (UniqueName: \"kubernetes.io/projected/1a3dd6f4-9504-4096-bf4a-e9d8be863f1e-kube-api-access\") pod \"installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal\" (UID: \"1a3dd6f4-9504-4096-bf4a-e9d8be863f1e\") "
Sep 30 09:23:43.543690 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:43.543636 1845 operation_generator.go:698] MountVolume.SetUp succeeded for volume "kubelet-dir" (UniqueName: "kubernetes.io/host-path/1a3dd6f4-9504-4096-bf4a-e9d8be863f1e-kubelet-dir") pod "installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal" (UID: "1a3dd6f4-9504-4096-bf4a-e9d8be863f1e")
Sep 30 09:23:43.631979 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:43.631922 1845 kubelet.go:1951] "SyncLoop UPDATE" source="api" pods=[openshift-kube-scheduler/installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal]
Sep 30 09:23:44.240195 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:44.240153 1845 operation_generator.go:698] MountVolume.SetUp succeeded for volume "kube-api-access" (UniqueName: "kubernetes.io/projected/1a3dd6f4-9504-4096-bf4a-e9d8be863f1e-kube-api-access") pod "installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal" (UID: "1a3dd6f4-9504-4096-bf4a-e9d8be863f1e")
Sep 30 09:23:44.506328 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:44.506173 1845 kuberuntime_manager.go:460] "No sandbox for pod can be found. Need to start a new one" pod="openshift-kube-scheduler/installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal"
Sep 30 09:23:44.507180 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: E0930 09:23:44.507159 1845 kubelet_pods.go:412] "Hostname for pod was too long, truncated it" podName="installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal" hostnameMaxLen=63 truncatedHostname="installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.intern"
Sep 30 09:23:45.106450 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:45.102496 1845 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-kube-scheduler/installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal]
Sep 30 09:23:46.324453 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:46.320768 1845 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-kube-scheduler/installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal]
Sep 30 09:23:47.280550 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: E0930 09:23:47.280373 1845 kuberuntime_sandbox.go:68] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF" pod="openshift-kube-scheduler/installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal"
Sep 30 09:24:25.232409 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: E0930 09:24:25.230819 1845 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/openshift-kube-scheduler_installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal_1a3dd6f4-9504-4096-bf4a-e9d8be863f1e: no such file or directory" pod="openshift-kube-scheduler/installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal"
Sep 30 09:24:37.912549 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: E0930 09:24:37.912480 1845 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/openshift-kube-scheduler_installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal_1a3dd6f4-9504-4096-bf4a-e9d8be863f1e: no such file or directory" pod="openshift-kube-scheduler/installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal"
Sep 30 09:24:38.526327 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: E0930 09:24:38.526235 1845 kubelet_pods.go:412] "Hostname for pod was too long, truncated it" podName="installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal" hostnameMaxLen=63 truncatedHostname="installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.intern"
Sep 30 09:24:40.729439 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:24:40.729244 1845 kubelet_pods.go:1285] "Killing unwanted pod" podName="installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal"
Sep 30 09:24:41.573186 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: E0930 09:24:41.572940 1845 kubelet_pods.go:412] "Hostname for pod was too long, truncated it" podName="installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal" hostnameMaxLen=63 truncatedHostname="installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.intern"
```

From must-gather.local.5320996955247938316/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-d34d8f0bd94598f3d245fe274b79fca3926996e3b371d48b72f524c1908b034f/host_service_logs/masters/crio_service.log:

```
Sep 30 09:23:44.510415 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: time="2021-09-30 09:23:44.508106562Z" level=info msg="Running pod sandbox: openshift-kube-scheduler/installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal/POD" id=40757908-0b8b-44d9-bdaf-2227962f574b name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Sep 30 09:23:44.658591 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: time="2021-09-30 09:23:44.658509784Z" level=info msg="Got pod network &{Name:installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal Namespace:openshift-kube-scheduler ID:d384431413d02e1dfc7faad276117201805c860e4c8e36069f4c15fcf886211c UID:1a3dd6f4-9504-4096-bf4a-e9d8be863f1e NetNS:/var/run/netns/89501363-5641-4645-82d7-9c526292efea Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Sep 30 09:23:44.658591 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: time="2021-09-30 09:23:44.658595967Z" level=info msg="Adding pod openshift-kube-scheduler_installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal to CNI network \"multus-cni-network\" (type=multus)"
Sep 30 09:23:46.301252 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: 2021-09-30T09:23:46Z [verbose] Add: openshift-kube-scheduler:installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal:1a3dd6f4-9504-4096-bf4a-e9d8be863f1e:ovn-kubernetes(ovn-kubernetes):eth0 {"cniVersion":"0.4.0","interfaces":[{"name":"d384431413d02e1","mac":"1a:b2:0a:f4:3d:39"},{"name":"eth0","mac":"0a:58:0a:81:00:13","sandbox":"/var/run/netns/89501363-5641-4645-82d7-9c526292efea"}],"ips":[{"version":"4","interface":1,"address":"10.129.0.19/23","gateway":"10.129.0.1"}],"dns":{}}
Sep 30 09:23:46.301252 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: I0930 09:23:46.274036 38712 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-scheduler", Name:"installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal", UID:"1a3dd6f4-9504-4096-bf4a-e9d8be863f1e", APIVersion:"v1", ResourceVersion:"19377", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.129.0.19/23] from ovn-kubernetes
Sep 30 09:23:46.301921 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: time="2021-09-30 09:23:46.301849626Z" level=info msg="Got pod network &{Name:installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal Namespace:openshift-kube-scheduler ID:d384431413d02e1dfc7faad276117201805c860e4c8e36069f4c15fcf886211c UID:1a3dd6f4-9504-4096-bf4a-e9d8be863f1e NetNS:/var/run/netns/89501363-5641-4645-82d7-9c526292efea Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Sep 30 09:23:46.302136 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: time="2021-09-30 09:23:46.302099602Z" level=info msg="Checking pod openshift-kube-scheduler_installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal for CNI network multus-cni-network (type=multus)"
Sep 30 09:23:46.794221 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: time="2021-09-30 09:23:46.793570590Z" level=info msg="Got pod network &{Name:installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal Namespace:openshift-kube-scheduler ID:d384431413d02e1dfc7faad276117201805c860e4c8e36069f4c15fcf886211c UID:1a3dd6f4-9504-4096-bf4a-e9d8be863f1e NetNS:/var/run/netns/89501363-5641-4645-82d7-9c526292efea Networks:[{Name:multus-cni-network Ifname:eth0}] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Sep 30 09:23:46.794221 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: time="2021-09-30 09:23:46.793888445Z" level=info msg="Deleting pod openshift-kube-scheduler_installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal from CNI network \"multus-cni-network\" (type=multus)"
Sep 30 09:23:47.245188 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: 2021-09-30T09:23:46Z [verbose] Del: openshift-kube-scheduler:installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal:1a3dd6f4-9504-4096-bf4a-e9d8be863f1e:ovn-kubernetes:eth0 {"cniVersion":"0.4.0","dns":{},"ipam":{},"logFile":"/var/log/ovn-kubernetes/ovn-k8s-cni-overlay.log","logLevel":"4","logfile-maxage":5,"logfile-maxbackups":5,"logfile-maxsize":100,"name":"ovn-kubernetes","type":"ovn-k8s-cni-overlay"}
```

The installer_state_controller.go examines all installer pods (independently of the revision number). If at least one of them is in Pending state for at least 5 minutes, the InstallerPodContainerWaitingDegraded condition is set to True. Whenever a new revision installer pod is created, its predecessor is asked to be deleted.

Checking the timestamps from the kubelet and crio logs:

```
Sep 30 09:23:45.106450 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:45.102496 1845 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-kube-scheduler/installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal]
Sep 30 09:23:46.324453 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:23:46.320768 1845 kubelet.go:1960] "SyncLoop DELETE" source="api" pods=[openshift-kube-scheduler/installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal]
Sep 30 09:23:46.794221 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: time="2021-09-30 09:23:46.793888445Z" level=info msg="Deleting pod openshift-kube-scheduler_installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal from CNI network \"multus-cni-network\" (type=multus)"
Sep 30 09:23:47.245188 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal crio[1810]: 2021-09-30T09:23:46Z [verbose] Del: openshift-kube-scheduler:installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal:1a3dd6f4-9504-4096-bf4a-e9d8be863f1e:ovn-kubernetes:eth0 {"cniVersion":"0.4.0","dns":{},"ipam":{},"logFile":"/var/log/ovn-kubernetes/ovn-k8s-cni-overlay.log","logLevel":"4","logfile-maxage":5,"logfile-maxbackups":5,"logfile-maxsize":100,"name":"ovn-kubernetes","type":"ovn-k8s-cni-overlay"}
Sep 30 09:24:40.729439 pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal hyperkube[1845]: I0930 09:24:40.729244 1845 kubelet_pods.go:1285] "Killing unwanted pod" podName="installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal"
```

Compared to the reported data:

```
09-30 15:16:53.743 Last Transition Time: 2021-09-30T09:30:50Z
09-30 15:16:53.743 Message: InstallerPodContainerWaitingDegraded: Pod "installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal" on node "pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal" container "installer" is waiting since 2021-09-30 09:23:43 +0000 UTC because ContainerCreating
```

The installer-5-pdhamdhe-30gcp-7hmff-master-2.c.openshift-qe.internal pod was asked to be deleted and to be killed before 09:24:40.740749. Yet, the reported data at 2021-09-30T09:30:50Z still claims the pod is there.
Also, the kubelet/crio logs do not say anything about the pod being (un)successfully deleted/removed/killed.

Two outcomes from the investigation:
- The installer state controller could be updated to take into account only the latest installer pods. On the other hand, that would make it harder to detect the reported scenario.
- The container is clearly not removed -> this could be a known issue in the container runtime that is already fixed in the latest 4.8 release.

Re-assigning to the Node team for further examination.
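To make the check described above concrete, here is a minimal Go sketch. This is NOT the actual installer_state_controller.go from openshift/library-go, only an approximation of the behaviour as described in this comment: every installer pod is inspected regardless of revision, and a pod stuck in Pending longer than 5 minutes produces an InstallerPodContainerWaitingDegraded-style message. Function and example names are made up for illustration.

```go
// Minimal sketch (not the real installer state controller) of the 5-minute
// Pending check described above.
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const maxPendingDuration = 5 * time.Minute

// waitingDegradedMessages returns one message per installer pod container that
// is waiting (e.g. ContainerCreating) while its pod has been Pending for longer
// than the threshold. Approximation: the pod's creation time is used as the
// start of the Pending period.
func waitingDegradedMessages(installerPods []corev1.Pod, now time.Time) []string {
	var msgs []string
	for _, pod := range installerPods {
		if pod.Status.Phase != corev1.PodPending {
			continue
		}
		if now.Sub(pod.CreationTimestamp.Time) < maxPendingDuration {
			continue
		}
		for _, cs := range pod.Status.ContainerStatuses {
			if w := cs.State.Waiting; w != nil {
				msgs = append(msgs, fmt.Sprintf(
					"Pod %q on node %q container %q is waiting since %v because %s",
					pod.Name, pod.Spec.NodeName, cs.Name,
					pod.CreationTimestamp.Time.UTC(), w.Reason))
			}
		}
	}
	return msgs
}

func main() {
	// Fabricated example pod resembling the one in this report, stuck in
	// ContainerCreating well past the 5-minute threshold.
	created := metav1.NewTime(time.Now().Add(-10 * time.Minute))
	stuck := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:              "installer-5-example-master-2",
			CreationTimestamp: created,
		},
		Spec: corev1.PodSpec{NodeName: "example-master-2"},
		Status: corev1.PodStatus{
			Phase: corev1.PodPending,
			ContainerStatuses: []corev1.ContainerStatus{{
				Name: "installer",
				State: corev1.ContainerState{
					Waiting: &corev1.ContainerStateWaiting{Reason: "ContainerCreating"},
				},
			}},
		},
	}
	for _, m := range waitingDegradedMessages([]corev1.Pod{stuck}, time.Now()) {
		fmt.Println(m) // the operator would set Degraded=True with a message like this
	}
}
```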
FYI

> The installer state controller could be updated to take into account only the latest installer pods. On the other hand, that would make it harder to detect the reported scenario.

Already reported in https://bugzilla.redhat.com/show_bug.cgi?id=1986385
*** This bug has been marked as a duplicate of bug 1986003 ***