Bug 1509467 - [free-int] docker-registry in crash loop backoff due to RunContainerError
Summary: [free-int] docker-registry in crash loop backoff due to RunContainerError
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
: 3.7.z
Assignee: Seth Jennings
QA Contact: Bing Li
: 1521178 (view as bug list)
Depends On:
Blocks: 1532530
TreeView+ depends on / blocked
Reported: 2017-11-03 20:52 UTC by Justin Pierce
Modified: 2018-08-06 12:54 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Fixes an interaction between runc and systemd where sandbox creation fails when pod specifies CPU limit in 1000s of millicores and the last digit is not 0.
Clone Of:
: 1532530 (view as bug list)
Last Closed: 2018-01-23 17:58:09 UTC
Target Upstream Version:
jupierce: needinfo-

Attachments (Terms of Use)
Describe and yaml for pods (24.27 KB, text/plain)
2017-11-03 20:53 UTC, Justin Pierce
no flags Details

System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:0113 0 normal SHIPPED_LIVE OpenShift Container Platform 3.7 and 3.6 bug fix and enhancement update 2018-01-23 22:55:59 UTC

Description Justin Pierce 2017-11-03 20:52:37 UTC
Description of problem:
After an upgrade to the free-int registry, the docker-registry is in a crash loop backoff. Details added in attachments. 

Version-Release number of selected component (if applicable):
oc v3.7.0-0.191.0

How reproducible:
The condition has persisted on the cluster through an upgrade.

Comment 1 Justin Pierce 2017-11-03 20:53:18 UTC
Created attachment 1347550 [details]
Describe and yaml for pods

Comment 2 Justin Pierce 2017-11-03 20:53:58 UTC
Additional information:

Comment 3 Seth Jennings 2017-11-03 21:57:54 UTC
So this is wicked one.

After looking in the kernel code on how a write to cpu.cfs_quota_us can even return -EINVAL, I discovered that the set path does an upward walk of the cgroup tree to make sure the the quota is not greater than the quota at any higher level of the tree.

Looking at the node:

$ oc get node ip-172-31-49-31.ec2.internal -o yaml
    alpha.kubernetes.io/nvidia-gpu: "0"
    cpu: "3"  <----
    memory: 13776096Ki
    pods: "80"
    alpha.kubernetes.io/nvidia-gpu: "0"
    cpu: "4"
    memory: 16266464Ki
    pods: "80"

Looking at the docker-registry pod:

      cpu:	3725m <----
      memory:	2G
      cpu:	186m
      memory:	1G

Note the pod limit being above node allocatable.

I think that some higher level (pod-level cgroup?) is forcing the quota to be <= node allocatable, causing the set at the container-level to fail.

Justin, could you reduce the cpu limit on the docker-registry pods to 2 and see if that doesn't fix it?

Comment 4 Seth Jennings 2017-11-06 16:56:33 UTC
I don't think this has anything to do with node allocatable getting enforced via cfs quota at higher levels.

Unfortunately the pods are no longer in this state.  In an attempt to check another part of the theory, they were restarted and are running fine now.

I think the cause is that the container-level cgroup is setting the quota value before the pod-level cgroup somehow.  Looking at the code now.

That being said, this seems to be easy to recover from when it occurs.  Just deleting the pods to force sandbox recreation seems to do it.

Comment 5 Seth Jennings 2017-11-06 20:20:10 UTC
I have tried to reproduce locally to no avail.  There must be some snowflake about the free-int config/env that is hitting this.  However, without being able to examine the cgroups of a sandbox in this state, I really don't have a path forward.

Closing for now, but if this is encountered on the next free-int upgrade, please reopen so I can examine the cgroups.

Comment 6 Seth Jennings 2017-11-15 20:07:35 UTC
Buckle up! This is a fun one!

tl;dr this is an issue in systemd, that we'll likely have to hack around in runc.

$ oc get pod oso-rhel7-zagg-web-6-8wjcd -o yaml
        cpu: 2953m
        memory: 1512Mi
        cpu: 147m
        memory: 756Mi
  - containerID: docker://b686057013a37c4764b7ecac27e9434164946af5e551075ffb6b7a898f475621
    image: registry.ops.openshift.com/ops/oso-rhel7-zagg-web:stg
    imageID: docker-pullable://registry.ops.openshift.com/ops/oso-rhel7-zagg-web@sha256:f771cc8e879f2a116ed40998bb46073d93a752e6844c289670b642dc93f58045
        containerID: docker://b686057013a37c4764b7ecac27e9434164946af5e551075ffb6b7a898f475621
        exitCode: 128
        finishedAt: 2017-11-15T19:14:45Z
        message: 'invalid header field value "oci runtime error: container_linux.go:247:
          starting container process caused \"process_linux.go:327: setting cgroup
          config for procHooks process caused \\\"failed to write 295300 to cpu.cfs_quota_us:
          write /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poddb841b6d_ca29_11e7_9644_02306c0cdc4b.slice/docker-b686057013a37c4764b7ecac27e9434164946af5e551075ffb6b7a898f475621.scope/cpu.cfs_quota_us:
          invalid argument\\\"\"\n"'
        reason: ContainerCannotRun
        startedAt: 2017-11-15T19:14:45Z

Thanks to Derek, possessor of root on free-stg for helping me run this down.

The key is the resources.limits.cpu set to 2953m.  Namely the fact that number is >1000 and doesn't end in 0.

runc uses a systemd dbus property called CPUQuotaPerSecUSec to tell systemd how to set the underlying cpu.cfs_quota_us.


However, the systemd unit spec does not have this property.  The unit spec uses CPUQuota, which is a _integer_ percentage of CPU.  systemd internal converts CPUQuotaPerSecUSec to CPUQuota.

runc is setting the CPUQuotaPerSecUSec to 295300.  systemd internally converts this to CPUQuota=295%.
$ systemctl cat kubepods-burstable-poddb841b6d_ca29_11e7_9644_02306c0cdc4b.slice
# /run/systemd/system/kubepods-burstable-poddb841b6d_ca29_11e7_9644_02306c0cdc4b.slice

Note the truncation of the "3".  systemd proceeds to set the cpu.cfs_quota_us on the pod level cgroup to 295000.

When docker creates the container level cgroup, it manipulates the cgroup directly, not via systemd.  It tries to write 295300 into cpu.cfs_quota_us and gets "invalid argument" because the quota is greater than the parent (i.e. pod level) cgroup.

A workaround for this in the short term is to either use whole CPU limits or set a limit in millicores that ends in 0.

The fix will probably be to hack around the logic in systemd using runc by rounding 295300 to 296000 (up to the nearest 1000), knowing that systemd is going to convert it to an integer percentage of CPU.

Comment 7 Seth Jennings 2017-11-15 20:36:43 UTC
runc PR:

Comment 9 DeShuai Ma 2017-11-22 10:09:19 UTC
I try to reproduce as comment 6 but can't get the can't running error, could you help check what's wrong with my step? thanks

[root@ip-172-18-0-155 tmp]# oc version
oc v3.7.7
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-0-155.ec2.internal:8443
openshift v3.7.7
kubernetes v1.7.6+a08f5eeb62

[root@ip-172-18-0-155 tmp]# cat pod.yaml 
apiVersion: v1
kind: Pod
    name: hello-pod
  name: hello-pod
    - image: "docker.io/deshuai/hello-pod:latest"
      imagePullPolicy: IfNotPresent
      name: hello-pod
        - containerPort: 8080
          protocol: TCP
          cpu: 2953m
          memory: 1512Mi
          cpu: 147m
          memory: 756Mi
        capabilities: {}
        privileged: false
      terminationMessagePath: /dev/termination-log
        - mountPath: /tmp
          name: tmp
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  serviceAccount: ""
    - emptyDir: {}
      name: tmp
[root@ip-172-18-0-155 tmp]# oc rsh -n test hello-pod
sh-4.2# cat /sys/fs/cgroup/cpuacct,cpu/cpu.cfs_quota_us

//no node
[root@ip-172-18-2-116 ~]# cat /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod309b7de5_cf69_11e7_ac2f_0edc8525bec4.slice/docker-5dd8abc95c12c31eada976843fd3387ea073ccf603fedc3c69c856e4b857b71c.scope/cpu.cfs_quota_us
[root@ip-172-18-2-116 ~]# systemctl cat kubepods-burstable-pod309b7de5_cf69_11e7_ac2f_0edc8525bec4.slice | grep CPUQuota
# /run/systemd/system/kubepods-burstable-pod309b7de5_cf69_11e7_ac2f_0edc8525bec4.slice.d/50-CPUQuota.conf

Comment 10 Seth Jennings 2017-11-27 16:40:23 UTC
I am also unable to recreate locally on openshift v3.7.6. There must be a configuration or race that opens the window for this.  Trying to figure out what it is.

Comment 11 Ben Parees 2017-12-07 15:13:29 UTC
*** Bug 1521178 has been marked as a duplicate of this bug. ***

Comment 13 Bing Li 2017-12-18 04:15:19 UTC
free-int has been upgraded to v3.8.18 (online version and registry's running well: 

# oc get pod -n default
NAME                          READY     STATUS    RESTARTS   AGE
docker-registry-36-65m8l      1/1       Running   1          2d
docker-registry-36-vmsmz      1/1       Running   1          2d

# oc get dc/docker-registry -n default
docker-registry   36         2         2         config

Comment 16 Bing Li 2018-01-10 02:14:46 UTC
Based on comment 4 and comment 5, this issue can't be reproduced on OCP. So move this bug to verified.

Will trigger this issue in bug 1532530 ONLY for Online.

Comment 19 errata-xmlrpc 2018-01-23 17:58:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.