1567229 – [f28] kubelet.service fails to start: Failed to create "/kubepods" cgroup

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	runc
Sub Component:
Version:	28
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Jan Chaloupka
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:	RejectedFreezeException
Depends On:	1558425
Blocks:

Reported:	2018-04-13 15:33 UTC by Lokesh Mandvekar
Modified:	2018-04-27 04:02 UTC
CC List:	33 users
Fixed In Version:	runc-1.0.0-22.gitf753f30.fc28
Clone Of:	1558425
Environment:
Last Closed:	2018-04-27 04:02:42 UTC
Type:	Bug
Embargoed:
Dependent Products:

Description Lokesh Mandvekar 2018-04-13 15:33:17 UTC

+++ This bug was initially created as a clone of Bug #1558425 +++

Description of problem: In current Fedora 28, kubelet.service fails to start:

Started Kubernetes Kubelet Server.
I0320 04:25:35.639143    3613 server.go:182] Version: v1.9.3
I0320 04:25:35.639739    3613 feature_gate.go:226] feature gates: &{map[]}
W0320 04:25:35.656218    3613 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
I0320 04:25:35.663914    3613 plugins.go:101] No cloud provider specified.
I0320 04:25:35.695457    3613 server.go:428] --cgroups-per-qos enabled, but --cgroup-root was not specified.  defaulting to /
I0320 04:25:35.696133    3613 container_manager_linux.go:242] container manager verified user specified cgroup-root exists: /
I0320 04:25:35.696228    3613 container_manager_linux.go:247] Creating Container Manager object based on Node Config: {RuntimeCgroupsNa>
I0320 04:25:35.696425    3613 container_manager_linux.go:266] Creating device plugin manager: false
I0320 04:25:35.696563    3613 kubelet.go:313] Watching apiserver
W0320 04:25:35.708014    3613 kubelet_network.go:139] Hairpin mode set to "promiscuous-bridge" but kubenet is not enabled, falling back>
I0320 04:25:35.709303    3613 kubelet.go:571] Hairpin mode set to "hairpin-veth"
I0320 04:25:35.711523    3613 client.go:80] Connecting to docker on unix:///var/run/docker.sock
I0320 04:25:35.711655    3613 client.go:109] Start docker client with request timeout=2m0s
W0320 04:25:35.716276    3613 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
I0320 04:25:35.727211    3613 docker_service.go:232] Docker cri networking managed by kubernetes.io/no-op
I0320 04:25:35.739755    3613 docker_service.go:237] Docker Info: &{ID:OX6T:X64L:HMXL:4B7X:NMCA:T6M3:AXIS:FWIV:WKIS:UGF5:BA7L:QSZQ Cont>
I0320 04:25:35.740025    3613 docker_service.go:250] Setting cgroupDriver to systemd
I0320 04:25:35.785358    3613 remote_runtime.go:43] Connecting to runtime service unix:///var/run/dockershim.sock
I0320 04:25:35.810820    3613 kuberuntime_manager.go:186] Container runtime docker initialized, version: 1.13.1, apiVersion: 1.26.0
I0320 04:25:35.834825    3613 server.go:755] Started kubelet
E0320 04:25:35.837863    3613 kubelet.go:1275] Image garbage collection failed once. Stats initialization may not have completed yet: f>
I0320 04:25:35.838939    3613 kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
I0320 04:25:35.841976    3613 server.go:129] Starting to listen on 127.0.0.1:10250
I0320 04:25:35.844064    3613 server.go:299] Adding debug handlers to kubelet server.
E0320 04:25:35.887124    3613 node_container_manager.go:51] Failed to create "/kubepods" cgroup
F0320 04:25:35.887275    3613 kubelet.go:1364] Failed to start ContainerManager Delegation not available for unit type
kubelet.service: Main process exited, code=exited, status=255/n/a
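
To check whether another host is hitting the same failure, the two decisive messages from the log above can be searched for directly. This is a minimal sketch, not part of the original report: the function name has_kubepods_failure is hypothetical, and the usage line assumes the kubelet logs to the journal under kubelet.service.

```shell
#!/bin/sh
# Hypothetical helper: reads kubelet log text on stdin and succeeds only
# if both failure messages quoted in this report are present.
has_kubepods_failure() {
    log=$(cat)
    printf '%s\n' "$log" | grep -qF 'Failed to create "/kubepods" cgroup' &&
    printf '%s\n' "$log" | grep -qF 'Delegation not available for unit type'
}

# Usage (assumes journald holds the kubelet logs):
#   journalctl -u kubelet.service --no-pager | has_kubepods_failure && echo "affected"
```

Matching on fixed strings (grep -F) avoids having to escape the quotes and slashes in the messages.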



Version-Release number of selected component (if applicable):

kubernetes-node-1.9.3-1.fc28.x86_64

How reproducible: Always


Steps to Reproduce:
1. Install kubernetes on current Fedora 28: dnf install kubernetes
2. Set up Kubernetes; in the Cockpit test VMs we use this script:  https://github.com/cockpit-project/cockpit/blob/master/bots/images/scripts/lib/kubernetes.setup
3. systemctl start kubelet.service

--- Additional comment from Jeffrey C. Ollie on 2018-03-28 23:19:47 EDT ---

This seems to be related:

https://github.com/kubernetes/kubernetes/issues/61474

--- Additional comment from Jason Montleon on 2018-04-02 16:00:34 EDT ---

I commented on the upstream issue:

 It looks like ControllerManager is a slice and slices can no longer Delegate.

Here:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cgroup_manager_linux.go#L43-L50

I added .can_delegate = true at https://github.com/systemd/systemd/blob/master/src/core/slice.c#L376 and rebuilt/reinstalled the systemd 238 packages; after that I was able to run oc cluster up successfully.

It's possible this is due to this commit in systemd, although the comment on it causes me to have doubts: https://github.com/systemd/systemd/commit/1d9cc8768f173b25757c01aa0d4c7be7cd7116bc

--- Additional comment from Micah Abbott on 2018-04-11 10:41:08 EDT ---

This also appears to break the ability to do 'oc cluster up' on Fedora 28.

One can work around it by changing the 'cgroupdriver' that 'docker' uses (hat tip to Jason Brooks):

# cp /usr/lib/systemd/system/docker.service /etc/systemd/system/
# sed -i 's/cgroupdriver=systemd/cgroupdriver=cgroupfs/' /etc/systemd/system/docker.service
# systemctl daemon-reload
# systemctl restart docker
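
The copy and sed steps above do nothing useful if the packaged unit file does not contain cgroupdriver=systemd, so it may help to wrap them with a check. This is a minimal sketch under that assumption; the function name switch_to_cgroupfs is hypothetical, and it takes both paths as arguments so it can be tried on a scratch copy first.

```shell
#!/bin/sh
# Hypothetical helper wrapping the copy + sed steps of the workaround.
# $1 = packaged unit file, $2 = destination for the edited copy.
switch_to_cgroupfs() {
    src=$1
    dst=$2
    cp "$src" "$dst" || return 1
    # Same substitution as in the workaround (GNU sed -i assumed).
    sed -i 's/cgroupdriver=systemd/cgroupdriver=cgroupfs/' "$dst" || return 1
    # Fail if the option is absent, so a silent no-op gets noticed.
    grep -q 'cgroupdriver=cgroupfs' "$dst"
}

# Usage on a real host (as root), followed by the reload and restart steps:
#   switch_to_cgroupfs /usr/lib/systemd/system/docker.service /etc/systemd/system/docker.service &&
#       systemctl daemon-reload && systemctl restart docker
```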

--- Additional comment from Fedora Blocker Bugs Application on 2018-04-11 10:42:53 EDT ---

Proposed as a Freeze Exception for 28-final by Fedora user miabbott using the blocker tracking app because:

 This bug is blocking the ability for users to run Kubernetes on Fedora 28.  This affects users that are spinning up a Kubernetes cluster manually, using the 'openshift-ansible' playbook to spin up an OpenShift cluster, or using the 'oc cluster up' method for launching an OpenShift cluster.

--- Additional comment from Dusty Mabe on 2018-04-11 10:49:29 EDT ---

Should this bug be moved to the runc component?

--- Additional comment from Tomasz Torcz on 2018-04-11 10:51:07 EDT ---

Dusty, it seems so, as the required patch (https://github.com/opencontainers/runc/pull/1776) is against runc.

--- Additional comment from Dusty Mabe on 2018-04-11 15:06:42 EDT ---

It turns out that we need to fix this in *both* runc *and* docker, because docker has its own vendored version of runc. Updating runc by itself won't fix it for most people, since most people are still using docker. We'll need both updated.

I'm going to change the component to docker, but I think we need runc as well.

--- Additional comment from Filipe Brandenburger on 2018-04-11 16:44:04 EDT ---

This gets fixed in libcontainer by this PR:
https://github.com/opencontainers/runc/pull/1776

I'll import that into Kubernetes vendored libcontainer once it's merged into runc.

Cheers!
Filipe

--- Additional comment from Lorenzo Dalrio on 2018-04-12 08:31:56 EDT ---

(In reply to Micah Abbott from comment #3)
> This also appears to break the ability to do 'oc cluster up' on Fedora 28.
> 
> One can workaround it by changing the 'cgroupdriver' that 'docker' uses (hat
> tip to Jason Brooks):
> 
> # cp /usr/lib/systemd/system/docker.service /etc/systemd/system/
> # sed -i 's/cgroupdriver=systemd/cgroupdriver=cgroupfs/'
> /etc/systemd/system/docker.service
> # systemctl daemon-reload
> # systemctl restart docker


This can be done by overriding the docker unit like this:

# systemctl edit docker.service

In the editor, just paste this:

[Service]
ExecStart=
ExecStart=/usr/bin/dockerd-current \
          --add-runtime oci=/usr/libexec/docker/docker-runc-current \
          --default-runtime=oci \
          --authorization-plugin=rhel-push-plugin \
          --containerd /run/containerd.sock \
          --exec-opt native.cgroupdriver=cgroupfs \
          --userland-proxy-path=/usr/libexec/docker/docker-proxy-current \
          --init-path=/usr/libexec/docker/docker-init-current \
          --seccomp-profile=/etc/docker/seccomp.json \
          $OPTIONS \
          $DOCKER_STORAGE_OPTIONS \
          $DOCKER_NETWORK_OPTIONS \
          $ADD_REGISTRY \
          $BLOCK_REGISTRY \
          $INSECURE_REGISTRY \
          $REGISTRIES

To roll back, just delete this file:

/etc/systemd/system/docker.service.d/override.conf
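
After restarting docker with the override in place, the daemon can be asked which cgroup driver it actually uses. This is a minimal sketch, assuming a docker client that supports info --format with the CgroupDriver field; the function names and the DOCKER variable are hypothetical conveniences, not part of docker.

```shell
#!/bin/sh
# Print the cgroup driver reported by the running Docker daemon.
# DOCKER is a hypothetical override for the client binary (defaults to "docker").
docker_cgroup_driver() {
    "${DOCKER:-docker}" info --format '{{.CgroupDriver}}'
}

# Succeed only if the daemon reports the expected driver ($1).
expect_cgroup_driver() {
    [ "$(docker_cgroup_driver)" = "$1" ]
}

# Usage after "systemctl restart docker":
#   expect_cgroup_driver cgroupfs && echo "override active"
```

The same check with systemd as the argument can confirm the rollback once the override file has been removed and docker restarted.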

--- Additional comment from Antonio Murdaca on 2018-04-12 09:08:19 EDT ---

To fix this, in addition to the runc/docker patch, we also need a kube fix (see https://github.com/opencontainers/runc/pull/1776#issuecomment-380571191).

So it's not going to be fixed by the runc/docker backports alone.

--- Additional comment from Filipe Brandenburger on 2018-04-12 11:24:45 EDT ---

(In reply to Micah Abbott from comment #3)
> This also appears to break the ability to do 'oc cluster up' on Fedora 28.
> 
> One can workaround it by changing the 'cgroupdriver' that 'docker' uses (hat
> tip to Jason Brooks):
> 
> # cp /usr/lib/systemd/system/docker.service /etc/systemd/system/
> # sed -i 's/cgroupdriver=systemd/cgroupdriver=cgroupfs/'
> /etc/systemd/system/docker.service
> # systemctl daemon-reload
> # systemctl restart docker

In my testing (RHEL 7), changing the cgroup-driver of Docker broke "oci-register-machine", so it also required changing /etc/oci-register-machine.conf to set "disable : true". I'd personally avoid changing the cgroup driver, though.

(In reply to Antonio Murdaca from comment #10)
> to fix this, other than the runc/docker patch, we also need a kube fix (see
> https://github.com/opencontainers/runc/pull/1776#issuecomment-380571191)
> 
> so it's not gonna be fixed by just runc/docker back ports

Yes. opencontainers/runc#1776 has just been merged, so I'm going to push a sync of it into the Kubernetes code base. I'm planning to update master and release-1.10, but I could go back to 1.9 if you need it (and it looks like you do, so I'll do that too...).

Cheers,
Filipe

--- Additional comment from Filipe Brandenburger on 2018-04-12 18:32:46 EDT ---

https://github.com/kubernetes/kubernetes/pull/61926 has been updated to bring all the relevant PRs into Kubernetes' vendored libcontainer.

Once that one is merged, I'll prepare cherry-picks into 1.10 and 1.9 branches.

Cheers,
Filipe

--- Additional comment from Fedora Update System on 2018-04-13 03:03:53 EDT ---

docker-1.13.1-52.git89b0e65.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2018-3c62c7e959

Comment 1 Fedora Update System 2018-04-13 19:53:41 UTC

runc-1.0.0-22.gitf753f30.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2018-16dae9acf2

Comment 2 Fedora Update System 2018-04-15 02:26:25 UTC

runc-1.0.0-22.gitf753f30.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-16dae9acf2
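
To confirm that a host has the build named in "Fixed In Version", the installed runc version-release can be compared against it. This is a minimal sketch: the function name version_at_least is hypothetical, and GNU sort -V is used as an approximation of RPM's version ordering, which it does not reproduce exactly in every case.

```shell
#!/bin/sh
# Version-release taken from the "Fixed In Version" field of this bug.
FIXED_RUNC='1.0.0-22.gitf753f30.fc28'

# Succeed if version-release $1 sorts at or after $2.
# Assumption: GNU "sort -V" stands in for RPM's own comparison.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage on a Fedora 28 host:
#   dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2018-16dae9acf2
#   version_at_least "$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' runc)" "$FIXED_RUNC" && echo "fixed runc installed"
```

As the earlier comments point out, an updated runc package alone may not be sufficient where docker's vendored runc or Kubernetes' vendored libcontainer is involved.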

Comment 3 František Zatloukal 2018-04-17 10:15:21 UTC

Discussed during the 2018-04-16 blocker review meeting: [1]

The decision to punt (delay decision) was made:

"Following 1558425, we don't want to delay *too* long on this, but it's a fairly complex area and it doesn't feel like folks have all the consequences of this entirely worked out yet, so we would like to wait a few days to see if a clearer picture emerges and then perhaps vote async (in bugzilla comments) on this one"

[1] https://meetbot-raw.fedoraproject.org/fedora-blocker-review/2018-04-16/f28-blocker-review.2018-04-16-16.00.log.txt

Comment 4 Filipe Brandenburger 2018-04-17 16:09:42 UTC

I just wanted to point out that the latest bugfix for this issue is here:

https://github.com/opencontainers/runc/pull/1781

I need some code reviews for that and I think some people in this thread are well positioned for that.

Also, here is a summary of the current status (and how we got here):

https://github.com/opencontainers/runc/issues/1780

Once opencontainers/runc#1781 is in, I can update kubernetes/kubernetes#61926 to merge that into the Kubernetes codebase and backport it to 1.10 and 1.9, which should fix the issue for the Fedora package once it is brought into that codebase too.

Cheers,
Filipe

Comment 5 Patrick Uiterwijk 2018-04-17 22:21:22 UTC

Agreed with Adam, -1 FE in favor of the systemd revert.

Comment 6 Mohan Boddu 2018-04-17 22:37:05 UTC

-1 FE

https://bugzilla.redhat.com/show_bug.cgi?id=1558425#c18

Comment 7 Adam Williamson 2018-04-17 22:51:58 UTC

For the record, Patrick agrees with https://bugzilla.redhat.com/show_bug.cgi?id=1558425#c16 :) As written there, I'm -1 in favour of https://bugzilla.redhat.com/show_bug.cgi?id=1568594 instead. So that's -3, setting rejected.

Comment 8 Fedora Update System 2018-04-27 04:02:42 UTC

runc-1.0.0-22.gitf753f30.fc28 has been pushed to the Fedora 28 stable repository. If problems still persist, please make note of it in this bug report.
