Bug 1986869
| Summary: | kube-scheduler stuck in container create | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ben Parees <bparees> |
| Component: | Node | Assignee: | Sascha Grunert <sgrunert> |
| Node sub component: | CRI-O | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED WORKSFORME | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | aos-bugs, sgrunert |
| Version: | 4.8 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-05 13:08:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Ben Parees
2021-07-28 13:34:53 UTC
Looking into the journal logs of the node: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1420224114823532544/artifacts/e2e-aws-canary/gather-extra/artifacts/nodes/ip-10-0-143-174.us-west-2.compute.internal/journal

Some observations:

- It looks like the kube-scheduler installer probe failed directly after its creation (L5587):

  > Jul 28 03:42:33.469429 ip-10-0-143-174 hyperkube[1366]: I0728 03:42:33.469398 1366 prober.go:116] "Probe failed" probeType="Readiness" pod="openshift-etcd/etcd-quorum-guard-6c9f5857b6-v85sf" podUID=8b444fb5-c23f-4fc3-96f6-5e1dacce567e containerName="guard" probeResult=failure output=""

  This results in an immediate "SyncLoop DELETE" of the installer.

- CRI-O seems to remove the infra container multiple times in a row:

  > Jul 28 03:42:34.902115 ip-10-0-143-174 crio[1311]: time="2021-07-28 03:42:34.902074309Z" level=info msg="RunSandbox: removing infra container 2eee3d5bb235f89a26336f5fccafdff2cb242e619ce01fee1a0fceeafc75828b" id=a4bbd088-7af3-4e46-a475-3e544adb0afd name=/runtime.v1alpha2.RuntimeService/RunPodSandbox

- There seems to be a cleanup issue, because from that point on hyperkube complains that it is not able to remove the cgroup:

  > Jul 28 03:42:35.225518 ip-10-0-143-174 hyperkube[1366]: time="2021-07-28T03:42:35Z" level=error msg="Failed to remove cgroup" error="rmdir /sys/fs/cgroup/blkio/kubepods.slice/kubepods-podf063da52_e315_4b0c_86a4_02b9e8163785.slice/crio-2eee3d5bb235f89a26336f5fccafdff2cb242e619ce01fee1a0fceeafc75828b.scope: device or resource busy"
  > Jul 28 03:42:35.225689 ip-10-0-143-174 hyperkube[1366]: time="2021-07-28T03:42:35Z" level=error msg="Failed to remove cgroup" error="rmdir /sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-podf063da52_e315_4b0c_86a4_02b9e8163785.slice/crio-2eee3d5bb235f89a26336f5fccafdff2cb242e619ce01fee1a0fceeafc75828b.scope: device or resource busy"
  > Jul 28 03:42:35.225786 ip-10-0-143-174 hyperkube[1366]: time="2021-07-28T03:42:35Z" level=error msg="Failed to remove cgroup" error="rmdir /sys/fs/cgroup/perf_event/kubepods.slice/kubepods-podf063da52_e315_4b0c_86a4_02b9e8163785.slice/crio-2eee3d5bb235f89a26336f5fccafdff2cb242e619ce01fee1a0fceeafc75828b.scope: device or resource busy"

Ben, can we enable `log_level = "debug"` for CRI-O and re-run the test until such a flake occurs? I think we have to make sure that the resource cleanup is not an issue here. Ah, and please increase the kubelet verbosity to --v=4, too. Thank you!

This came from CI, it's not a cluster I can set options on. But you can definitely create a PR that makes the debug changes you want and run tests against that PR (you don't have to merge it) to get the debug output you need. If you need multiple PRs (different repos), you can also use cluster-bot to run tests against a payload that incorporates multiple PRs.

Another rough guess is that the recent runc changes may fix the issue. We should check whether the issue still occurs with 4.9 / runc 1.0.2 (not released yet). Ref:

- https://github.com/opencontainers/runc/issues/3149
- https://github.com/opencontainers/runc/pull/3143
- https://github.com/kubernetes/kubernetes/issues/104280

The upstream patch for 1.21/4.8 was merged a few days ago: https://github.com/kubernetes/kubernetes/pull/104530. Kubernetes 1.21.5 is planned for 2021-09-15. After that we should get the patch into OpenShift as well.
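For reference, a minimal sketch of the CRI-O debug-logging change requested above, assuming direct root access to a node (for example via `oc debug node/<node>` followed by `chroot /host`). In this CI scenario the same setting would instead have to be carried in through a PR or a MachineConfig, as discussed in the thread; the file name `99-debug.conf` is only an example.

```
# Sketch only: raise CRI-O's log level with a drop-in config file (assumes a root shell on the node).
cat > /etc/crio/crio.conf.d/99-debug.conf <<'EOF'
[crio.runtime]
log_level = "debug"
EOF

# Restart CRI-O so the drop-in takes effect.
systemctl restart crio
```

Raising the kubelet verbosity to --v=4 is a separate change to the kubelet arguments and is not covered by this sketch.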
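As a hedged diagnostic sketch for the "device or resource busy" errors quoted above: with cgroup v1, `rmdir` on a scope directory fails with EBUSY while it still contains member processes or child cgroups, so checking both on a live node would show what is blocking the cleanup. The scope path below is copied from the error message purely for illustration; on an affected node the pod UID and container ID would differ.

```
# Illustrative only; the scope path is taken from the error message above.
SCOPE=/sys/fs/cgroup/blkio/kubepods.slice/kubepods-podf063da52_e315_4b0c_86a4_02b9e8163785.slice/crio-2eee3d5bb235f89a26336f5fccafdff2cb242e619ce01fee1a0fceeafc75828b.scope

cat "$SCOPE/cgroup.procs"           # PIDs still attached; rmdir returns EBUSY while this is non-empty
find "$SCOPE" -mindepth 1 -type d   # leftover child cgroups would also keep the directory busy
```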