Description of problem:
After an upgrade to the free-int registry, the docker-registry is in a crash loop backoff. Details added in attachments.

Version-Release number of selected component (if applicable):
oc v3.7.0-0.191.0

How reproducible:
The condition has persisted on the cluster through an upgrade.
Created attachment 1347550 [details] Describe and yaml for pods
Additional information: https://bugzilla.redhat.com/show_bug.cgi?id=1484679#c6
So this is a wicked one. After looking in the kernel code at how a write to cpu.cfs_quota_us can even return -EINVAL, I discovered that the set path does an upward walk of the cgroup tree to make sure the quota is not greater than the quota at any higher level of the tree.

Looking at the node:

$ oc get node ip-172-31-49-31.ec2.internal -o yaml
...
  allocatable:
    alpha.kubernetes.io/nvidia-gpu: "0"
    cpu: "3"            <----
    memory: 13776096Ki
    pods: "80"
  capacity:
    alpha.kubernetes.io/nvidia-gpu: "0"
    cpu: "4"
    memory: 16266464Ki
    pods: "80"
...

Looking at the docker-registry pod:

  Limits:
    cpu:     3725m      <----
    memory:  2G
  Requests:
    cpu:     186m
    memory:  1G

Note the pod limit being above node allocatable. I think that some higher level (pod-level cgroup?) is forcing the quota to be <= node allocatable, causing the set at the container level to fail.

Justin, could you reduce the cpu limit on the docker-registry pods to 2 and see if that doesn't fix it?
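If this state is hit again, a quick way to test that theory is to compare the container's quota against every ancestor cgroup's quota, since the write fails with EINVAL whenever the new child quota would exceed an ancestor's. A minimal sketch (the scope path passed in is an example placeholder, substitute the real container's cgroup directory):

# Usage: sh check-quota.sh /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/.../docker-<id>.scope
# Prints cpu.cfs_quota_us and cpu.cfs_period_us at each level walking up the hierarchy.
d=$1
while [ -f "$d/cpu.cfs_quota_us" ]; do
    printf '%s: quota=%s period=%s\n' "$d" "$(cat "$d/cpu.cfs_quota_us")" "$(cat "$d/cpu.cfs_period_us")"
    d=$(dirname "$d")
done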
I don't think this has anything to do with node allocatable getting enforced via cfs quota at higher levels. Unfortunately the pods are no longer in this state. In an attempt to check another part of the theory, they were restarted and are running fine now. I think the cause is that the container-level cgroup is setting the quota value before the pod-level cgroup somehow. Looking at the code now. That being said, this seems to be easy to recover from when it occurs. Just deleting the pods to force sandbox recreation seems to do it.
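For the record, the recovery is just forcing sandbox recreation by deleting the affected pods, e.g. (assuming the standard deploymentconfig label on the registry pods):

$ oc delete pod -n default -l deploymentconfig=docker-registry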
I have tried to reproduce locally to no avail. There must be some snowflake about the free-int config/env that is hitting this. However, without being able to examine the cgroups of a sandbox in this state, I really don't have a path forward. Closing for now, but if this is encountered on the next free-int upgrade, please reopen so I can examine the cgroups.
Buckle up! This is a fun one! tl;dr: this is an issue in systemd that we'll likely have to hack around in runc.

$ oc get pod oso-rhel7-zagg-web-6-8wjcd -o yaml
...
    resources:
      limits:
        cpu: 2953m
        memory: 1512Mi
      requests:
        cpu: 147m
        memory: 756Mi
...
  containerStatuses:
  - containerID: docker://b686057013a37c4764b7ecac27e9434164946af5e551075ffb6b7a898f475621
    image: registry.ops.openshift.com/ops/oso-rhel7-zagg-web:stg
    imageID: docker-pullable://registry.ops.openshift.com/ops/oso-rhel7-zagg-web@sha256:f771cc8e879f2a116ed40998bb46073d93a752e6844c289670b642dc93f58045
    lastState:
      terminated:
        containerID: docker://b686057013a37c4764b7ecac27e9434164946af5e551075ffb6b7a898f475621
        exitCode: 128
        finishedAt: 2017-11-15T19:14:45Z
        message: 'invalid header field value "oci runtime error: container_linux.go:247:
          starting container process caused \"process_linux.go:327: setting cgroup
          config for procHooks process caused \\\"failed to write 295300 to cpu.cfs_quota_us:
          write /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poddb841b6d_ca29_11e7_9644_02306c0cdc4b.slice/docker-b686057013a37c4764b7ecac27e9434164946af5e551075ffb6b7a898f475621.scope/cpu.cfs_quota_us:
          invalid argument\\\"\"\n"'
        reason: ContainerCannotRun
        startedAt: 2017-11-15T19:14:45Z
...

Thanks to Derek, possessor of root on free-stg, for helping me run this down.

The key is the resources.limits.cpu set to 2953m, namely the fact that the number is >1000 and doesn't end in 0.

runc uses a systemd dbus property called CPUQuotaPerSecUSec to tell systemd how to set the underlying cpu.cfs_quota_us.

https://github.com/systemd/systemd/issues/5939#issuecomment-300841218

However, the systemd unit spec does not have this property. The unit spec uses CPUQuota, which is an _integer_ percentage of CPU. systemd internally converts CPUQuotaPerSecUSec to CPUQuota.

runc is setting the CPUQuotaPerSecUSec to 295300. systemd internally converts this to CPUQuota=295%.

$ systemctl cat kubepods-burstable-poddb841b6d_ca29_11e7_9644_02306c0cdc4b.slice
# /run/systemd/system/kubepods-burstable-poddb841b6d_ca29_11e7_9644_02306c0cdc4b.slice
...
[Slice]
CPUQuota=295%

Note the truncation of the "3". systemd proceeds to set the cpu.cfs_quota_us on the pod-level cgroup to 295000.

When docker creates the container-level cgroup, it manipulates the cgroup directly, not via systemd. It tries to write 295300 into cpu.cfs_quota_us and gets "invalid argument" because the quota is greater than that of the parent (i.e. pod-level) cgroup.

A short-term workaround is to either use whole CPU limits or set a limit in millicores that ends in 0.

The fix will probably be to hack around the systemd logic in runc by rounding 295300 up to 296000 (the nearest 1000), knowing that systemd is going to convert it to an integer percentage of CPU.
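To spell out the arithmetic (a minimal sketch: the 100000us CFS period is the kernel default and is assumed here, and the rounding mirrors the workaround described above rather than the exact runc change):

# 2953 millicores -> CFS quota against the default 100000us period
period=100000
millicores=2953
quota=$(( millicores * period / 1000 ))       # 295300, what the container-level cgroup wants

# systemd only stores an integer CPUQuota percentage, so the pod-level slice loses the ".3"
pod_percent=$(( quota * 100 / period ))       # 295
pod_quota=$(( pod_percent * period / 100 ))   # 295000, what the pod-level cgroup actually gets

# 295300 > 295000, so the container-level write returns EINVAL.
# Rounding the quota up to the nearest 1% of the period (1000us) makes both levels agree:
step=$(( period / 100 ))
rounded=$(( (quota + step - 1) / step * step ))   # 296000
echo "container=$quota pod=$pod_quota rounded=$rounded"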
runc PR: https://github.com/opencontainers/runc/pull/1651
Origin master: https://github.com/openshift/origin/pull/17348
Origin 3.7: https://github.com/openshift/origin/pull/17349
OCP 3.7: https://github.com/openshift/ose/pull/927
I tried to reproduce this as in comment 6 but can't get the ContainerCannotRun error. Could you help check what's wrong with my steps? Thanks.

[root@ip-172-18-0-155 tmp]# oc version
oc v3.7.7
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-0-155.ec2.internal:8443
openshift v3.7.7
kubernetes v1.7.6+a08f5eeb62

[root@ip-172-18-0-155 tmp]# cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    name: hello-pod
  name: hello-pod
spec:
  containers:
  - image: "docker.io/deshuai/hello-pod:latest"
    imagePullPolicy: IfNotPresent
    name: hello-pod
    ports:
    - containerPort: 8080
      protocol: TCP
    resources:
      limits:
        cpu: 2953m
        memory: 1512Mi
      requests:
        cpu: 147m
        memory: 756Mi
    securityContext:
      capabilities: {}
      privileged: false
    terminationMessagePath: /dev/termination-log
    volumeMounts:
    - mountPath: /tmp
      name: tmp
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  serviceAccount: ""
  volumes:
  - emptyDir: {}
    name: tmp

[root@ip-172-18-0-155 tmp]# oc rsh -n test hello-pod
sh-4.2# cat /sys/fs/cgroup/cpuacct,cpu/cpu.cfs_quota_us
295300

// on the node
[root@ip-172-18-2-116 ~]# cat /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod309b7de5_cf69_11e7_ac2f_0edc8525bec4.slice/docker-5dd8abc95c12c31eada976843fd3387ea073ccf603fedc3c69c856e4b857b71c.scope/cpu.cfs_quota_us
295300
[root@ip-172-18-2-116 ~]# systemctl cat kubepods-burstable-pod309b7de5_cf69_11e7_ac2f_0edc8525bec4.slice | grep CPUQuota
# /run/systemd/system/kubepods-burstable-pod309b7de5_cf69_11e7_ac2f_0edc8525bec4.slice.d/50-CPUQuota.conf
CPUQuota=295%
I am also unable to recreate locally on openshift v3.7.6. There must be a configuration or race that opens the window for this. Trying to figure out what it is.
*** Bug 1521178 has been marked as a duplicate of this bug. ***
free-int has been upgraded to v3.8.18 (online version 3.6.0.83) and the registry is running well:

# oc get pod -n default
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-36-65m8l   1/1       Running   1          2d
docker-registry-36-vmsmz   1/1       Running   1          2d

# oc get dc/docker-registry -n default
NAME              REVISION   DESIRED   CURRENT   TRIGGERED BY
docker-registry   36         2         2         config
Based on comment 4 and comment 5, this issue can't be reproduced on OCP, so moving this bug to verified. The issue will be tracked in bug 1532530 for Online only.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0113