Bug 1986869

Summary: kube-scheduler stuck in container create
Product: OpenShift Container Platform
Component: Node
Sub-component: CRI-O
Version: 4.8
Reporter: Ben Parees <bparees>
Assignee: Sascha Grunert <sgrunert>
QA Contact: Sunil Choudhary <schoudha>
CC: aos-bugs, sgrunert
Status: CLOSED WORKSFORME
Severity: unspecified
Priority: unspecified
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2021-10-05 13:08:13 UTC

Description Ben Parees 2021-07-28 13:34:53 UTC
Description of problem:
This job run failed:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1420224114823532544

due to:
Jul 28 04:08:08.259: FAIL: Some cluster operators are not ready: kube-scheduler (Degraded=True InstallerPodContainerWaiting_ContainerCreating: InstallerPodContainerWaitingDegraded: Pod "installer-2-ip-10-0-143-174.us-west-2.compute.internal" on node "ip-10-0-143-174.us-west-2.compute.internal" container "installer" is waiting since 2021-07-28 03:42:24 +0000 UTC because ContainerCreating) 

As a side note, the install succeeded and made no mention of any unready operators, which seems odd unless this operator went ready during the install and then went unready again by the time the test checked its status.


Version-Release number of selected component (if applicable):
registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-07-24-211147 

How reproducible:
flake

Actual results:
operator not ready/container did not start

Expected results:
container should start

Additional Info:
Starting with the Node team since this is a container-failed-to-start issue.

Comment 1 Sascha Grunert 2021-08-11 08:16:43 UTC
Looking into the journal logs of the node: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1420224114823532544/artifacts/e2e-aws-canary/gather-extra/artifacts/nodes/ip-10-0-143-174.us-west-2.compute.internal/journal

Some observations:
- It looks like the kube-scheduler installer probe failed directly after its creation (L5587):

> Jul 28 03:42:33.469429 ip-10-0-143-174 hyperkube[1366]: I0728 03:42:33.469398    1366 prober.go:116] "Probe failed" probeType="Readiness" pod="openshift-etcd/etcd-quorum-guard-6c9f5857b6-v85sf" podUID=8b444fb5-c23f-4fc3-96f6-5e1dacce567e containerName="guard" probeResult=failure output=""


This results in an immediate "SyncLoop DELETE" of the installer.
CRI-O seems to remove the infra container multiple times in a row:

> Jul 28 03:42:34.902115 ip-10-0-143-174 crio[1311]: time="2021-07-28 03:42:34.902074309Z" level=info msg="RunSandbox: removing infra container 2eee3d5bb235f89a26336f5fccafdff2cb242e619ce01fee1a0fceeafc75828b" id=a4bbd088-7af3-4e46-a475-3e544adb0afd name=/runtime.v1alpha2.RuntimeService/RunPodSandbox

There seems to be a cleanup issue, because from that point on hyperkube repeatedly complains that it cannot remove the cgroup:

> Jul 28 03:42:35.225518 ip-10-0-143-174 hyperkube[1366]: time="2021-07-28T03:42:35Z" level=error msg="Failed to remove cgroup" error="rmdir /sys/fs/cgroup/blkio/kubepods.slice/kubepods-podf063da52_e315_4b0c_86a4_02b9e8163785.slice/crio-2eee3d5bb235f89a26336f5fccafdff2cb242e619ce01fee1a0fceeafc75828b.scope: device or resource busy"
> Jul 28 03:42:35.225689 ip-10-0-143-174 hyperkube[1366]: time="2021-07-28T03:42:35Z" level=error msg="Failed to remove cgroup" error="rmdir /sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-podf063da52_e315_4b0c_86a4_02b9e8163785.slice/crio-2eee3d5bb235f89a26336f5fccafdff2cb242e619ce01fee1a0fceeafc75828b.scope: device or resource busy"
> Jul 28 03:42:35.225786 ip-10-0-143-174 hyperkube[1366]: time="2021-07-28T03:42:35Z" level=error msg="Failed to remove cgroup" error="rmdir /sys/fs/cgroup/perf_event/kubepods.slice/kubepods-podf063da52_e315_4b0c_86a4_02b9e8163785.slice/crio-2eee3d5bb235f89a26336f5fccafdff2cb242e619ce01fee1a0fceeafc75828b.scope: device or resource busy"
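
If someone catches this live on a node, a quick check would show why the rmdir keeps failing: a cgroup directory can only be removed once it has no attached processes and no child cgroups left. A rough sketch, reusing the scope path from the log above:

  # run on the affected node; either output being non-empty explains the EBUSY
  CG=/sys/fs/cgroup/blkio/kubepods.slice/kubepods-podf063da52_e315_4b0c_86a4_02b9e8163785.slice/crio-2eee3d5bb235f89a26336f5fccafdff2cb242e619ce01fee1a0fceeafc75828b.scope
  cat "$CG/cgroup.procs"            # PIDs still attached to the scope
  find "$CG" -mindepth 1 -type d    # leftover child cgroups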


Ben, can we enable `log_level = "debug"` for CRI-O and re-run the test until such a flake occurs? I think we have to make sure that the resource cleanup is not an issue here.
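
For reference, on a node we could touch directly this would only need a drop-in config plus a CRI-O restart. A minimal sketch, assuming the usual /etc/crio/crio.conf.d drop-in mechanism (the file name is arbitrary):

  # raise the CRI-O log level via a drop-in, then restart the daemon
  printf '[crio.runtime]\nlog_level = "debug"\n' > /etc/crio/crio.conf.d/99-debug.conf
  systemctl restart crio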

Comment 2 Sascha Grunert 2021-08-11 09:00:51 UTC
Ah and please increase the kubelet verbosity to --v=4, too. Thank you!
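
On a node we control that would be a systemd drop-in along these lines. Sketch only: KUBELET_LOG_LEVEL is what the MCO-rendered kubelet.service reads as far as I remember, so please verify the unit on the target release first.

  # sketch only; the KUBELET_LOG_LEVEL variable is an assumption, check kubelet.service first
  mkdir -p /etc/systemd/system/kubelet.service.d
  printf '[Service]\nEnvironment="KUBELET_LOG_LEVEL=4"\n' > /etc/systemd/system/kubelet.service.d/30-verbosity.conf
  systemctl daemon-reload && systemctl restart kubelet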

Comment 3 Ben Parees 2021-08-11 15:18:20 UTC
This came from CI; it's not a cluster I can set options on. But you can definitely create a PR that makes the debug changes you want and run tests against that PR (you don't have to merge it) to get the debug output you need. If you need changes in multiple repos, you can also use cluster-bot to run tests against a payload that incorporates multiple PRs.
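
A rough example of that flow, with placeholder PR numbers (the bot syntax below is from memory, so double-check against its help output):

  # hypothetical cluster-bot commands, sent to the bot in Slack; PR numbers are placeholders
  #   build openshift/kubernetes#1234,openshift/machine-config-operator#5678
  #   launch <the release image the build reports> aws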

Comment 4 Sascha Grunert 2021-08-12 08:15:53 UTC
Another rough guess is that the recent runc changes may fix this. We should check whether the issue still occurs with 4.9 / runc 1.0.2 (not yet released).

Ref:
- https://github.com/opencontainers/runc/issues/3149
- https://github.com/opencontainers/runc/pull/3143
- https://github.com/kubernetes/kubernetes/issues/104280

Comment 5 Sascha Grunert 2021-09-06 10:38:48 UTC
The upstream patch for 1.21/4.8 was merged a few days ago: https://github.com/kubernetes/kubernetes/pull/104530

Kubernetes 1.21.5 is planned for 2021-09-15. After that we should get the patch into OpenShift as well.