Description of problem:
When creating a pod with an invalid sysctl value, the pod is created successfully but then stays in ContainerCreating. Describing the pod shows repeated "start container failed" errors.

Version-Release number of selected component (if applicable):
fork_ami_openshift3_clusterinfra_public_178_299

How reproducible:
Always

Steps to Reproduce:
1. Create a pod with an invalid sysctl value:

[root@dhcp-128-7 dma]# oc create -f https://raw.githubusercontent.com/mdshuai/testfile-openshift/master/sysctls/pod-sysctl-safe-invalid2.yaml
pod "hello-pod" created

2. Check the pod status:

[root@dhcp-128-7 dma]# oc get pod
NAME        READY     STATUS              RESTARTS   AGE
hello-pod   0/1       ContainerCreating   0          46s

[root@dhcp-128-7 dma]# oc describe pod/hello-pod
Name:             hello-pod
Namespace:        dma
Security Policy:  restricted
Node:             ip-172-18-9-215.ec2.internal/172.18.9.215
Start Time:       Fri, 02 Sep 2016 16:46:35 +0800
Labels:           name=hello-pod
Status:           Pending
IP:
Controllers:      <none>
Containers:
  hello-pod:
    Container ID:
    Image:          docker.io/deshuai/hello-pod:latest
    Image ID:
    Port:           8080/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Volume Mounts:
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-fmesj (ro)
    Environment Variables:  <none>
Conditions:
  Type          Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  tmp:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  default-token-fmesj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-fmesj
QoS Tier:  BestEffort
Events:
  FirstSeen  LastSeen  Count  From  SubobjectPath  Type  Reason  Message
  ---------  --------  -----  ----  -------------  ----  ------  -------
  58s  58s  1  {default-scheduler }  Normal  Scheduled  Successfully assigned hello-pod to ip-172-18-9-215.ec2.internal
  57s  57s  1  {kubelet ip-172-18-9-215.ec2.internal}  Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: Error response from daemon: Cannot start container ef35dbe22593c96a612f946e341cdc2433e40de426abcd43255b0f9bd012b709: [9] System error: could not synchronise with container process"
  40s  40s  1  {kubelet ip-172-18-9-215.ec2.internal}  Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: Error response from daemon: Cannot start container 38a8423b88d94f592128ae8c710f50661fb8022a2163901f1b823dc1a8205122: [9] System error: could not synchronise with container process"
  26s  26s  1  {kubelet ip-172-18-9-215.ec2.internal}  Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: Error response from daemon: Cannot start container 5296d48fd48a5a545bc50d46ebbbac442f7f2e312db915740382d3aca490ed2f: [9] System error: could not synchronise with container process"
  13s  13s  1  {kubelet ip-172-18-9-215.ec2.internal}  Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: Error response from daemon: Cannot start container 6c7d3494de8f8d1824cfd4581ec85a0872b7df1b88d2e652ff2e1ccf5e61d716: [9] System error: could not synchronise with container process"
  <invalid>  <invalid>  1  {kubelet ip-172-18-9-215.ec2.internal}  Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: Error response from daemon: Cannot start container 7414ea605935a61da91e3b60a7b4d4264a11c3a2c6c74b5b822f53296a238e0d: [9] System error: could not synchronise with container process"
  <invalid>  <invalid>  1  {kubelet ip-172-18-9-215.ec2.internal}  Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: Error response from daemon: Cannot start container bcf230787a723d283354bef90b1ce46e761080fa11e60b535f449012ea6ccc5a: [9] System error: could not synchronise with container process"
  <invalid>  <invalid>  1  {kubelet ip-172-18-9-215.ec2.internal}  Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: Error response from daemon: Cannot start container da23fdc91937aa24dd12f4f674c23e59907b5b4a8e2aaf1d865b3614bbf6d47c: [9] System error: could not synchronise with container process"

Actual results:
1. The pod is created successfully.
2. The pod stays in ContainerCreating.

Expected results:
1. Pod creation should validate the sysctl value and fail when it is invalid.
2. If creation in step 1 is expected to succeed, the pod status should go to Error instead of staying in ContainerCreating.

Additional info:
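For reference, a minimal sketch of a pod spec that reproduces this. The actual manifest is the one at the URL in step 1; the alpha sysctl annotation key and the invalid value kernel.shm_rmid_forced=hello are assumptions based on the error messages in the events above and the Docker reproduction below:

oc create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: hello-pod
  labels:
    name: hello-pod
  annotations:
    # a whitelisted "safe" sysctl, but with a non-numeric value the kernel rejects
    security.alpha.kubernetes.io/sysctls: kernel.shm_rmid_forced=hello
spec:
  containers:
  - name: hello-pod
    image: docker.io/deshuai/hello-pod:latest
    ports:
    - containerPort: 8080
    volumeMounts:
    - mountPath: /tmp
      name: tmp
  volumes:
  - name: tmp
    emptyDir: {}
EOF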
This behaves as expected. There is no (and cannot be) validation of sysctl values: they are only partially checked, and only by the kernel itself. Docker fails to start the container when the kernel rejects a value. The only things we can do here are to make Docker return better error messages, or to pass them through to the user. In fact, it looks like the pass-through is what we are missing:

docker run -it --sysctl=kernel.shm_rmid_forced=hello ubuntu /bin/bash -c "sysctl kernel.shm_rmid_forced"
write /proc/sys/kernel/shm_rmid_forced: invalid argument
docker: Error response from daemon: Cannot start container a1bc7c9dee79e00732ebf7d48a7eaa1e79232a700115476fba72425371532238: [9] System error: could not synchronise with container process.
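The same rejection can be seen on a plain host, which shows why this cannot be validated up front by Kubernetes: only the kernel knows which values it accepts. A sketch (the exact sysctl(8) error wording may vary by version); kernel.shm_rmid_forced takes an integer, so a string fails with EINVAL:

[root@host ~]# sysctl -w kernel.shm_rmid_forced=hello
sysctl: setting key "kernel.shm_rmid_forced": Invalid argument
[root@host ~]# sysctl -w kernel.shm_rmid_forced=0
kernel.shm_rmid_forced = 0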
To get this sorted out we need a better error message in Docker. Here is a BZ issue for that: https://bugzilla.redhat.com/show_bug.cgi?id=1386569
Will you fix this bug in the 3.3.1 release or in 3.4?
We depend on the Docker fix for https://bugzilla.redhat.com/show_bug.cgi?id=1386569 to fix this in origin/kubernetes. As the 3.3.1 deadline is at the end of this week, that looks unlikely.
After discussing on IRC, this does not sound like a blocker for 3.3.1, so I'm moving the target to 3.4 and attaching this to the 3.4 release so that the items that can be fixed are fixed there.
In 3.4 the pod is still in 'ContainerCreating'. I don't know why this was changed to ON_QA.

openshift v3.4.0.16+cc70b72

[root@ip-172-18-7-27 ~]# oc get pod hello-pod
NAME        READY     STATUS              RESTARTS   AGE
hello-pod   0/1       ContainerCreating   0          1m

[root@ip-172-18-7-27 ~]# oc describe pod hello-pod
Name:             hello-pod
Namespace:        default
Security Policy:  anyuid
Node:             ip-172-18-2-117.ec2.internal/172.18.2.117
Start Time:       Thu, 27 Oct 2016 22:01:03 -0400
Labels:           name=hello-pod
Status:           Pending
IP:
Controllers:      <none>
Containers:
  hello-pod:
    Container ID:
    Image:          docker.io/deshuai/hello-pod:latest
    Image ID:
    Port:           8080/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Volume Mounts:
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-khrla (ro)
    Environment Variables:  <none>
Conditions:
  Type          Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  tmp:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  default-token-khrla:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-khrla
QoS Class:    BestEffort
Tolerations:  <none>
Events:
  FirstSeen  LastSeen  Count  From  SubobjectPath  Type  Reason  Message
  ---------  --------  -----  ----  -------------  ----  ------  -------
  1m  1m   1  {default-scheduler }  Normal  Scheduled  Successfully assigned hello-pod to ip-172-18-2-117.ec2.internal
  1m  12s  9  {kubelet ip-172-18-2-117.ec2.internal}  Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: Error response from daemon: {\"message\":\"oci runtime error: write /proc/sys/kernel/shm_rmid_forced: invalid argument\"}"
As https://bugzilla.redhat.com/show_bug.cgi?id=1386569 seems to be fixed in Docker upstream, I will implement the corresponding error check in upstream Kubernetes. If we decide to backport the Docker patch, we can also backport the Kubernetes patch later.
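With the fixed Docker, the docker run reproduction from above returns a readable error instead of the opaque "could not synchronise with container process". A sketch; the error wording is taken from the kubelet event in the previous comment, and the exact CLI output may differ slightly:

docker run -it --sysctl=kernel.shm_rmid_forced=hello ubuntu /bin/bash -c "sysctl kernel.shm_rmid_forced"
docker: Error response from daemon: oci runtime error: write /proc/sys/kernel/shm_rmid_forced: invalid argument.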
While we have the proper error message with Docker 1.12, the kubelet currently does not allow a pod to be failed immediately when pod initialization fails. This is by design and not related to sysctls.