Bug 1698543
Summary: FailedCreatePodContainer "Delegation not available for unit type"

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Seth Jennings <sjenning> |
| Component: | Node | Assignee: | Mrunal Patel <mpatel> |
| Status: | CLOSED ERRATA | QA Contact: | Sunil Choudhary <schoudha> |
| Severity: | urgent | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.1.0 | CC: | aos-bugs, ccoleman, gblomqui, jokerman, mifiedle, mmccomas, mpatel, mrobson, rphillips, scuppett, wking, wsun, xtian |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | OS: | Unspecified |
| Whiteboard: | | Fixed In Version: | |
| Doc Type: | If docs needed, set a value | Doc Text: | |
| Story Points: | --- | Clone Of: | |
| : | 1755991 (view as bug list) | Environment: | |
| Last Closed: | 2019-10-16 06:28:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1755991 | | |
| Bug Blocks: | | | |
|
Description (Seth Jennings, 2019-04-10 14:57:14 UTC)

Setting as blocker until we can figure out what is going on, and self-assigning since the env is local to my lab network. The worker to which the pod is scheduled logs:
```
pod_workers.go:186] Error syncing pod 959cca01-5ba1-11e9-8a21-fa163e8d1880 ("prometheus-adapter-586d9bb8f-qc7x9_openshift-monitoring(959cca01-5ba1-11e9-8a21-fa163e8d1880)"), skipping: failed to ensure that the pod: 959cca01-5ba1-11e9-8a21-fa163e8d1880 cgroups exist and are correctly applied: failed to create container for [kubepods besteffort pod959cca01-5ba1-11e9-8a21-fa163e8d1880] : Delegation not available for unit type
```
```
$ systemctl cat kubepods.slice
# /run/systemd/transient/kubepods.slice
# This is a transient unit file, created programmatically via the systemd API. Do not edit.
[Unit]
Description=libcontainer container kubepods.slice
Wants=-.slice
[Slice]
MemoryAccounting=yes
CPUAccounting=yes
BlockIOAccounting=yes
[Unit]
DefaultDependencies=no
[Slice]
MemoryLimit=6245400576
CPUShares=6144
```
```
$ systemctl cat kubepods-besteffort.slice
# /run/systemd/transient/kubepods-besteffort.slice
# This is a transient unit file, created programmatically via the systemd API. Do not edit.
[Unit]
Description=libcontainer container kubepods-besteffort.slice
Wants=kubepods.slice
[Slice]
MemoryAccounting=yes
CPUAccounting=yes
BlockIOAccounting=yes
[Unit]
DefaultDependencies=no
[Slice]
CPUShares=2
```
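As context for the unit files above, the CPUShares values follow the cgroup v1 convention of 1024 shares per CPU core: kubepods.slice gets the node's allocatable cores, while the besteffort slice gets the minimum weight of 2. A minimal sketch of that conversion; the helper name is mine, not runc's API:

```go
package main

import "fmt"

// cgroup v1 convention: 1024 CPU shares per core.
// kubepods.slice above shows CPUShares=6144, i.e. 6 allocatable cores;
// kubepods-besteffort.slice gets the minimum weight of 2 instead.
func sharesForCores(cores int) int {
	return cores * 1024
}

func main() {
	fmt.Println(sharesForCores(6)) // 6144, matching kubepods.slice
}
```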
```
$ pwd
/sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice
$ ls -dl kubepods-besteffort-pod*
drwxr-xr-x. 6 root root 0 Apr 8 20:31 kubepods-besteffort-pod0cb754e8_5a3d_11e9_94af_fa163e8d1880.slice
drwxr-xr-x. 6 root root 0 Apr 8 20:24 kubepods-besteffort-pod54fb195d_5a3c_11e9_a111_fa163ed3be9e.slice
drwxr-xr-x. 6 root root 0 Apr 8 20:31 kubepods-besteffort-pod59be8341_5a3d_11e9_94af_fa163e8d1880.slice
drwxr-xr-x. 6 root root 0 Apr 8 20:42 kubepods-besteffort-podc626e673_5a3e_11e9_97d0_fa163eeb4ce2.slice
drwxr-xr-x. 6 root root 0 Apr 8 20:21 kubepods-besteffort-podd7d9101c_5a3b_11e9_a111_fa163ed3be9e.slice
```

Expected kubepods-besteffort-pod959cca01-5ba1-11e9-8a21-fa163e8d1880.slice to exist.
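Note that the existing entries in the listing use underscores in the pod UID portion: the systemd cgroup driver replaces the dashes in the pod UID when forming the slice name. A sketch of that naming rule, with an illustrative helper (not the kubelet's actual function):

```go
package main

import (
	"fmt"
	"strings"
)

// The systemd cgroup driver embeds the pod UID in the slice name with
// dashes replaced by underscores, matching the directories listed above.
// besteffortSliceName is an illustrative helper, not a kubelet API.
func besteffortSliceName(podUID string) string {
	return "kubepods-besteffort-pod" + strings.ReplaceAll(podUID, "-", "_") + ".slice"
}

func main() {
	fmt.Println(besteffortSliceName("959cca01-5ba1-11e9-8a21-fa163e8d1880"))
}
```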
I found out that this is an issue local to the worker. It prevented the worker from starting any new pods; restarting the kubelet cleared the issue. It seems that when UseSystemd() is called, it is setting hasDelegateSlice to true. The code to set it to false depends on two particular errors being returned by systemd over dbus, and it is possible that call is returning an error other than the two we expect. Removing blocker for now since this is the only time I've ever seen this, but I would like to root-cause it.

Ryan, can you work with runc upstream to invert the detection logic in UseSystemd() so that it assumes false and only sets true if the unit/slice/scope creation is successful? That seems like the more resilient thing to do.

Looking into it... Might have to do with the backwards compatibility. This discussion appears to have the answer: https://bugzilla.redhat.com/show_bug.cgi?id=1558425#c24 The vendored runc may need a bump within OpenShift.

I think our code matches upstream except for https://github.com/opencontainers/runc/pull/1978. bz1558425 tracked the period when we didn't have detection; thus, on newer versions of systemd where Delegate had been removed from slices, we were having issues.

PR: https://github.com/openshift/origin/pull/22532 I don't have a great way to test this one.

Closed that PR. The opencontainers/runc glide dependency needs the patch and bump...

Yes, but we also need a new patch upstream to invert the logic that tests hasDelegateSlice, so that it assumes false and only sets true if it can create a slice with the Delegate property. So we need to get that patch upstream, then the bump in k8s and origin.

Just recreated this in a 4.2 cluster that was crashed and then recovered (standard DR recovery test). This completely broke recovery (because the etcd-signer pod won't schedule) and would have failed the test. A number of other pods on that node (one of the 2 masters that were recreated) are in the same state. The cluster is broken until the user takes action, so setting this to urgent.

We should be able to take out these checks - see https://github.com/opencontainers/runc/pull/2117/commits/518c855833c4920f9901f47f2a520425a33cceb4.

Will keep an eye on it and look for occurrences. It happens most often after reboots, so upgrade tests would be likely to trigger it. I'm going to add some checks to upgrades to verify that everything is green afterwards.

The origin PR bumping the runc/libcontainer changes (https://github.com/openshift/origin/pull/23860) isn't merged yet. Moving back to MODIFIED.

More successful upgrade tests to 4.2.0-0.nightly-2019-10-04-015220. No events with "Delegation not available for unit type" seen. I think this is the best we can do apart from watching for it in CI. Moving to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922
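The inversion discussed in this bug (assume delegation is unavailable until a probe slice carrying Delegate=yes can actually be started, rather than assuming it is available and flipping to false only on two specific dbus errors) can be sketched as follows. This is a simplified model, not runc's code: startProbeSlice stands in for the real systemd dbus call, and the names are mine.

```go
package main

import (
	"errors"
	"fmt"
)

// startProbeSlice stands in for starting a transient systemd slice that
// carries the Delegate property over dbus. It is a swappable variable here
// so both outcomes can be demonstrated without a running systemd.
var startProbeSlice = func() error {
	return errors.New("Delegation not available for unit type")
}

// hasDelegateSlice models the inverted detection proposed in this bug:
// assume false, and only report true if the probe actually succeeds,
// so an unexpected dbus error can never leave the flag wrongly set.
func hasDelegateSlice() bool {
	return startProbeSlice() == nil
}

func main() {
	fmt.Println("delegate slices supported:", hasDelegateSlice())

	// On a systemd that accepts Delegate on slices, the probe succeeds.
	startProbeSlice = func() error { return nil }
	fmt.Println("delegate slices supported:", hasDelegateSlice())
}
```

With this shape, an error other than the two historically matched strings simply leaves detection at false, which degrades gracefully instead of producing the FailedCreatePodContainer events above.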