+++ This bug was initially created as a clone of Bug #1558425 +++

Description of problem:

In current Fedora 28, kubelet.service fails to start:

Started Kubernetes Kubelet Server.
I0320 04:25:35.639143 3613 server.go:182] Version: v1.9.3
I0320 04:25:35.639739 3613 feature_gate.go:226] feature gates: &{map[]}
W0320 04:25:35.656218 3613 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
I0320 04:25:35.663914 3613 plugins.go:101] No cloud provider specified.
I0320 04:25:35.695457 3613 server.go:428] --cgroups-per-qos enabled, but --cgroup-root was not specified. defaulting to /
I0320 04:25:35.696133 3613 container_manager_linux.go:242] container manager verified user specified cgroup-root exists: /
I0320 04:25:35.696228 3613 container_manager_linux.go:247] Creating Container Manager object based on Node Config: {RuntimeCgroupsNa>
I0320 04:25:35.696425 3613 container_manager_linux.go:266] Creating device plugin manager: false
I0320 04:25:35.696563 3613 kubelet.go:313] Watching apiserver
W0320 04:25:35.708014 3613 kubelet_network.go:139] Hairpin mode set to "promiscuous-bridge" but kubenet is not enabled, falling back>
I0320 04:25:35.709303 3613 kubelet.go:571] Hairpin mode set to "hairpin-veth"
I0320 04:25:35.711523 3613 client.go:80] Connecting to docker on unix:///var/run/docker.sock
I0320 04:25:35.711655 3613 client.go:109] Start docker client with request timeout=2m0s
W0320 04:25:35.716276 3613 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
I0320 04:25:35.727211 3613 docker_service.go:232] Docker cri networking managed by kubernetes.io/no-op
I0320 04:25:35.739755 3613 docker_service.go:237] Docker Info: &{ID:OX6T:X64L:HMXL:4B7X:NMCA:T6M3:AXIS:FWIV:WKIS:UGF5:BA7L:QSZQ Cont>
I0320 04:25:35.740025 3613 docker_service.go:250] Setting cgroupDriver to systemd
I0320 04:25:35.785358 3613 remote_runtime.go:43] Connecting to runtime service unix:///var/run/dockershim.sock
I0320 04:25:35.810820 3613 kuberuntime_manager.go:186] Container runtime docker initialized, version: 1.13.1, apiVersion: 1.26.0
I0320 04:25:35.834825 3613 server.go:755] Started kubelet
E0320 04:25:35.837863 3613 kubelet.go:1275] Image garbage collection failed once. Stats initialization may not have completed yet: f>
I0320 04:25:35.838939 3613 kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
I0320 04:25:35.841976 3613 server.go:129] Starting to listen on 127.0.0.1:10250
I0320 04:25:35.844064 3613 server.go:299] Adding debug handlers to kubelet server.
E0320 04:25:35.887124 3613 node_container_manager.go:51] Failed to create "/kubepods" cgroup
F0320 04:25:35.887275 3613 kubelet.go:1364] Failed to start ContainerManager Delegation not available for unit type
kubelet.service: Main process exited, code=exited, status=255/n/a

Version-Release number of selected component (if applicable):

kubernetes-node-1.9.3-1.fc28.x86_64

How reproducible:

Always

Steps to Reproduce:

1. Install kubernetes on current Fedora 28: dnf install kubernetes
2. Set up Kubernetes; in the Cockpit test VMs we use this script: https://github.com/cockpit-project/cockpit/blob/master/bots/images/scripts/lib/kubernetes.setup
3. systemctl start kubelet.service

--- Additional comment from Jeffrey C. Ollie on 2018-03-28 23:19:47 EDT ---

This seems to be related:

https://github.com/kubernetes/kubernetes/issues/61474

--- Additional comment from Jason Montleon on 2018-04-02 16:00:34 EDT ---

I commented on the upstream issue:

It looks like ControllerManager is a slice and slices can no longer Delegate. Here:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cgroup_manager_linux.go#L43-L50

I added .can_delegate = true, at https://github.com/systemd/systemd/blob/master/src/core/slice.c#L376 and rebuilt/reinstalled systemd 238 packages and after that I was able run oc cluster up successfully.

It's possible this is due to this commit in systemd, although the comment on it causes me to have doubts:
https://github.com/systemd/systemd/commit/1d9cc8768f173b25757c01aa0d4c7be7cd7116bc

--- Additional comment from Micah Abbott on 2018-04-11 10:41:08 EDT ---

This also appears to break the ability to do 'oc cluster up' on Fedora 28.

One can workaround it by changing the 'cgroupdriver' that 'docker' uses (hat tip to Jason Brooks):

# cp /usr/lib/systemd/system/docker.service /etc/systemd/system/
# sed -i 's/cgroupdriver=systemd/cgroupdriver=cgroupfs/' /etc/systemd/system/docker.service
# systemctl daemon-reload
# systemctl restart docker

--- Additional comment from Fedora Blocker Bugs Application on 2018-04-11 10:42:53 EDT ---

Proposed as a Freeze Exception for 28-final by Fedora user miabbott using the blocker tracking app because:

This bug is blocking the ability for users to run Kubernetes on Fedora 28. This affects users that are spinning up a Kubernetes cluster manually, using the 'openshift-ansible' playbook to spin up an OpenShift cluster, or using the 'oc cluster up' method for launching an OpenShift cluster.

--- Additional comment from Dusty Mabe on 2018-04-11 10:49:29 EDT ---

should this bug be moved to the runc component?

--- Additional comment from Tomasz Torcz on 2018-04-11 10:51:07 EDT ---

Dusty seems so, as the required patch (https://github.com/opencontainers/runc/pull/1776) is against runc.

--- Additional comment from Dusty Mabe on 2018-04-11 15:06:42 EDT ---

so it turns out that we need to fix this in *both* runc *and* docker because docker has its own vendored version of runc as well. So updating runc by itself won't fix it for most people since most people are still using docker. We'll need them both updated.

I'm going to change the component to docker, but we need runc as well I think.
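
For reference, the cgroup-driver workaround quoted in this thread (copy the packaged docker.service, then rewrite cgroupdriver=systemd to cgroupfs) could be wrapped in a small shell function that refuses to run when the expected setting is absent. This is only a sketch under stated assumptions: the function name switch_cgroup_driver is hypothetical, and it assumes the unit file contains the literal string cgroupdriver=systemd, exactly as the sed command in the workaround does.

```shell
#!/bin/sh
# Sketch of the workaround from this thread. switch_cgroup_driver is a
# hypothetical helper name, not part of docker or systemd.
switch_cgroup_driver() {
    src=$1   # packaged unit, e.g. /usr/lib/systemd/system/docker.service
    dst=$2   # local copy, e.g. /etc/systemd/system/docker.service

    # Stop early if the packaged unit does not select the systemd driver,
    # so an unrelated unit file is never copied over by mistake.
    if ! grep -q 'cgroupdriver=systemd' "$src"; then
        echo "no cgroupdriver=systemd setting found in $src" >&2
        return 1
    fi

    # Write a modified copy; the packaged unit itself is left untouched.
    sed 's/cgroupdriver=systemd/cgroupdriver=cgroupfs/' "$src" > "$dst"
}

# Possible usage as root (same steps as the workaround above):
#   switch_cgroup_driver /usr/lib/systemd/system/docker.service \
#       /etc/systemd/system/docker.service &&
#       systemctl daemon-reload && systemctl restart docker
```

Deleting the copy under /etc/systemd/system and running systemctl daemon-reload should restore the packaged unit.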
--- Additional comment from Filipe Brandenburger on 2018-04-11 16:44:04 EDT ---

This gets fixed in libcontainer by this PR:
https://github.com/opencontainers/runc/pull/1776

I'll import that into Kubernetes vendored libcontainer once it's merged into runc.

Cheers!
Filipe

--- Additional comment from Lorenzo Dalrio on 2018-04-12 08:31:56 EDT ---

(In reply to Micah Abbott from comment #3)
> This also appears to break the ability to do 'oc cluster up' on Fedora 28.
>
> One can workaround it by changing the 'cgroupdriver' that 'docker' uses (hat
> tip to Jason Brooks):
>
> # cp /usr/lib/systemd/system/docker.service /etc/systemd/system/
> # sed -i 's/cgroupdriver=systemd/cgroupdriver=cgroupfs/'
> /etc/systemd/system/docker.service
> # systemctl daemon-reload
> # systemctl restart docker

This can be done overriding docker unit like this:

# systemctl edit docker.service

On the editor just paste this:

[Service]
ExecStart=
ExecStart=/usr/bin/dockerd-current \
          --add-runtime oci=/usr/libexec/docker/docker-runc-current \
          --default-runtime=oci \
          --authorization-plugin=rhel-push-plugin \
          --containerd /run/containerd.sock \
          --exec-opt native.cgroupdriver=cgroupfs \
          --userland-proxy-path=/usr/libexec/docker/docker-proxy-current \
          --init-path=/usr/libexec/docker/docker-init-current \
          --seccomp-profile=/etc/docker/seccomp.json \
          $OPTIONS \
          $DOCKER_STORAGE_OPTIONS \
          $DOCKER_NETWORK_OPTIONS \
          $ADD_REGISTRY \
          $BLOCK_REGISTRY \
          $INSECURE_REGISTRY \
          $REGISTRIES

To rollback just delete this file /etc/systemd/system/docker.service.d/override.conf

--- Additional comment from Antonio Murdaca on 2018-04-12 09:08:19 EDT ---

to fix this, other than the runc/docker patch, we also need a kube fix (see https://github.com/opencontainers/runc/pull/1776#issuecomment-380571191)

so it's not gonna be fixed by just runc/docker back ports

--- Additional comment from Filipe Brandenburger on 2018-04-12 11:24:45 EDT ---

(In reply to Micah Abbott from comment #3)
> This also appears to break the ability to do 'oc cluster up' on Fedora 28.
>
> One can workaround it by changing the 'cgroupdriver' that 'docker' uses (hat
> tip to Jason Brooks):
>
> # cp /usr/lib/systemd/system/docker.service /etc/systemd/system/
> # sed -i 's/cgroupdriver=systemd/cgroupdriver=cgroupfs/'
> /etc/systemd/system/docker.service
> # systemctl daemon-reload
> # systemctl restart docker

In my testing (RHEL 7), changing the cgroup-driver of Docker broke "oci-register-machine", so it also required changing /etc/oci-register-machine.conf to set "disable : true".

I'd personally avoid changing the cgroup driver, though.

(In reply to Antonio Murdaca from comment #10)
> to fix this, other than the runc/docker patch, we also need a kube fix (see
> https://github.com/opencontainers/runc/pull/1776#issuecomment-380571191)
>
> so it's not gonna be fixed by just runc/docker back ports

Yes. opencontainers/runc#1776 has just been merged... So I'm gonna push a sync to that into Kubernetes code base. Planning to update master and release-1.10, but I could go back to 1.9 if you need it (and looks like you do, so I'll do that too...)

Cheers,
Filipe

--- Additional comment from Filipe Brandenburger on 2018-04-12 18:32:46 EDT ---

https://github.com/kubernetes/kubernetes/pull/61926 updated to include all the relevant PRs into Kubernetes vendored libcontainer.

Once that one is merged, I'll prepare cherry-picks into 1.10 and 1.9 branches.

Cheers,
Filipe

--- Additional comment from Fedora Update System on 2018-04-13 03:03:53 EDT ---

docker-1.13.1-52.git89b0e65.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2018-3c62c7e959
runc-1.0.0-22.gitf753f30.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2018-16dae9acf2
runc-1.0.0-22.gitf753f30.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-16dae9acf2
Discussed during the 2018-04-16 blocker review meeting: [1]

The decision to punt (delay decision) was made:

"Following 1558425, we don't want to delay *too* long on this, but it's a fairly complex area and it doesn't feel like folks have all the consequences of this entirely worked out yet, so we would like to wait a few days to see if a clearer picture emerges and then perhaps vote async (in bugzilla comments) on this one"

[1] https://meetbot-raw.fedoraproject.org/fedora-blocker-review/2018-04-16/f28-blocker-review.2018-04-16-16.00.log.txt
I just wanted to point out that the latest bugfix for this issue is here:
https://github.com/opencontainers/runc/pull/1781

I need some code reviews for that and I think some people in this thread are well positioned for that.

Also, here is a summary of the current status (and how we got here):
https://github.com/opencontainers/runc/issues/1780

Once opencontainers/runc#1781 is in, I can update kubernetes/kubernetes#61926 to merge that into Kubernetes codebase and backport it to 1.10 and 1.9, which should fix the issue for the Fedora package when brought into that codebase too.

Cheers,
Filipe
Agreed with Adam, -1 FE in favor of the systemd revert.
-1 FE https://bugzilla.redhat.com/show_bug.cgi?id=1558425#c18
For the record, Patrick agrees with https://bugzilla.redhat.com/show_bug.cgi?id=1558425#c16 :) As written there, I'm -1 in favour of https://bugzilla.redhat.com/show_bug.cgi?id=1568594 instead. So that's -3, setting rejected.
runc-1.0.0-22.gitf753f30.fc28 has been pushed to the Fedora 28 stable repository. If problems still persist, please make note of it in this bug report.
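
To check whether the problem still persists after installing the update, one option is to scan the kubelet journal for the two failure messages from the original report. The following is a minimal sketch, not an official test procedure: the function name kubelet_cgroup_failure is hypothetical, and it assumes the failure would reappear with the same message text as in the log above.

```shell
#!/bin/sh
# Hypothetical helper: reads kubelet log text on stdin and exits 0 if either
# failure message from the original report is present, non-zero otherwise.
kubelet_cgroup_failure() {
    grep -q \
        -e 'Failed to create "/kubepods" cgroup' \
        -e 'Delegation not available for unit type'
}

# Possible usage on an affected host (current boot only):
#   if journalctl -u kubelet.service -b --no-pager | kubelet_cgroup_failure; then
#       echo "kubelet still hits the cgroup delegation failure"
#   fi
```

A clean result only shows that these messages are absent from the scanned log; confirming that kubelet.service is actually running (for example with systemctl status kubelet.service) is still needed.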