Description of problem:
Creating a pod with the safe sysctl 'net.ipv4.tcp_max_syn_backlog' or 'net.ipv4.tcp_syncookies' always fails. The infra container reports the error "open /proc/sys/net/ipv4/tcp_max_syn_backlog: no such file or directory" or "open /proc/sys/net/ipv4/tcp_syncookies: no such file or directory".
Version-Release number of selected component (if applicable):
fork_ami_openshift3_clusterinfra_public_178_299
How reproducible:
Always
Steps to Reproduce:
1. Create a pod with the safe sysctl 'net.ipv4.tcp_max_syn_backlog' or 'net.ipv4.tcp_syncookies':
oc create -f https://raw.githubusercontent.com/mdshuai/testfile-openshift/master/sysctls/pod-sysctl-safe.yaml
2. Check the pod status:
[root@ip-172-18-0-79 dma]# oc logs hello-pod
Error from server: container "hello-pod" in pod "hello-pod" is waiting to start: ContainerCreating
[root@ip-172-18-0-79 dma]# oc get pod
NAME        READY     STATUS              RESTARTS   AGE
hello-pod   0/1       ContainerCreating   0          17s
[root@ip-172-18-0-79 dma]# oc describe pod hello-pod
Name:             hello-pod
Namespace:        dma
Security Policy:  anyuid
Node:             ip-172-18-0-79.ec2.internal/172.18.0.79
Start Time:       Mon, 05 Sep 2016 05:19:11 -0400
Labels:           name=hello-pod
Status:           Pending
IP:
Controllers:      <none>
Containers:
  hello-pod:
    Container ID:
    Image:          docker.io/deshuai/hello-pod:latest
    Image ID:
    Port:           8080/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Volume Mounts:
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-l29uk (ro)
    Environment Variables:  <none>
Conditions:
  Type          Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  tmp:
    Type:   EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  default-token-l29uk:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-l29uk
QoS Tier:  BestEffort
Events:
  FirstSeen  LastSeen  Count  From  SubobjectPath  Type  Reason  Message
  ---------  --------  -----  ----  -------------  --------  ------  -------
  25s  25s  1  {default-scheduler }  Normal  Scheduled  Successfully assigned hello-pod to ip-172-18-0-79.ec2.internal
  24s  24s  1  {kubelet ip-172-18-0-79.ec2.internal}  Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: Error response from daemon: Cannot start container 7054eff9d1fc4d5d18cc5097263293421e19ba7c980f5b33f25df7481c30b718: [9] System error: could not synchronise with container process"
  12s  12s  1  {kubelet ip-172-18-0-79.ec2.internal}  Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: Error response from daemon: Cannot start container e4fd3f95a3a0e74425128482f895fe20669cb3467e90f8436b2469734fecd567: [9] System error: could not synchronise with container process"
3. Check the infra container:
[root@ip-172-18-0-79 dma]# docker ps -a
CONTAINER ID   IMAGE                                 COMMAND   CREATED         STATUS    PORTS   NAMES
1898e7d3864a   openshift/origin-pod:v1.3.0-alpha.3   "/pod"    9 seconds ago   Created           k8s_POD.f98c0130_hello-pod_dma_ce3d8845-7349-11e6-91fb-0e95fd557f75_236c2de8
[root@ip-172-18-0-79 dma]# docker logs 1898e7d3864a
open /proc/sys/net/ipv4/tcp_max_syn_backlog: no such file or directory
Actual results:
2. The pod never starts; it stays in ContainerCreating.
Expected results:
2. The pod is running.
Additional info:
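The pod definition from step 1 is only linked, not shown. As a minimal Python sketch, assuming the Kubernetes 1.4 alpha annotation key security.alpha.kubernetes.io/sysctls with comma-separated name=value pairs, a comparable pod manifest could be built like this. The sysctl values are illustrative and are not taken from pod-sysctl-safe.yaml.

```python
from __future__ import annotations

import json

# Assumed annotation key (Kubernetes 1.4 alpha sysctl support).
SAFE_SYSCTLS_ANNOTATION = "security.alpha.kubernetes.io/sysctls"


def sysctl_annotation(sysctls: dict[str, str]) -> str:
    """Join sysctls as comma-separated name=value pairs."""
    return ",".join(f"{name}={value}" for name, value in sysctls.items())


def hello_pod_manifest(sysctls: dict[str, str]) -> dict:
    """Build a pod similar in shape to the reproducer (name, label, image, port)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": "hello-pod",
            "labels": {"name": "hello-pod"},
            "annotations": {SAFE_SYSCTLS_ANNOTATION: sysctl_annotation(sysctls)},
        },
        "spec": {
            "containers": [
                {
                    "name": "hello-pod",
                    "image": "docker.io/deshuai/hello-pod:latest",
                    "ports": [{"containerPort": 8080}],
                }
            ]
        },
    }


if __name__ == "__main__":
    # Illustrative value; the real pod-sysctl-safe.yaml may use a different one.
    pod = hello_pod_manifest({"net.ipv4.tcp_syncookies": "1"})
    print(json.dumps(pod, indent=2))
```

The printed JSON can be saved to a file and passed to oc create -f in place of the linked YAML.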
net.ipv4.tcp_syncookies works fine here on the Fedora kernel. I will check on the RHEL kernel.
Double-checked with a recent CentOS VM: net.ipv4.tcp_syncookies is not namespaced there.
I will check what we can do about better error behavior in the Docker runtime. I guess (without a lot of kernel knowledge in the kubelet code) the best we can do is to fail hard with a good error message. Right now, the kubelet retries a number of times when a container creation fails. That's not a good user experience.
Here is the kernel patch for tcp_syncookies: https://github.com/torvalds/linux/commit/12ed8244ed8b31b023ea6d2851fd8b15f2999e9b Summary: namespaced in kernels >= 4.6.
net.ipv4.tcp_max_syn_backlog does not belong on the whitelist because it is not namespaced in any current kernel. Having it whitelisted does no harm, though, because the sysctl is simply not available under /proc/sys in the container. Here is the upstream fix: https://github.com/kubernetes/kubernetes/pull/32072
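The docker logs output in the description shows the runtime trying to open the sysctl's file under /proc/sys inside the pod's network namespace; a sysctl the kernel does not namespace has no file there. A minimal Python sketch of that name-to-path mapping and an existence check follows. It assumes the sysctl(8) convention in which dots separate path components and slashes stand for literal dots (as in VLAN interface names); run inside a container with its own network namespace on an affected kernel, it would be expected to report net.ipv4.tcp_max_syn_backlog as missing.

```python
import os

# sysctl(8) convention: '.' separates path components, '/' stands for a literal '.'
_SWAP_DOT_SLASH = str.maketrans("./", "/.")


def sysctl_path(name: str, root: str = "/proc/sys") -> str:
    """Map a dotted sysctl name to its file below root."""
    return os.path.join(root, name.translate(_SWAP_DOT_SLASH))


def sysctl_available(name: str, root: str = "/proc/sys") -> bool:
    """True if the sysctl has a file in this (network) namespace's /proc/sys."""
    return os.path.isfile(sysctl_path(name, root))


if __name__ == "__main__":
    for name in ("net.ipv4.tcp_max_syn_backlog", "net.ipv4.tcp_syncookies"):
        state = "available" if sysctl_available(name) else "missing"
        print(f"{name}: {sysctl_path(name)} is {state}")
```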
About better error behavior: in Docker 1.10 the error message about the failed sysctl is part of the container's stdout/stderr, not of the actual container status. Compare:
$ docker -l=debug run -d --sysctl=kernel.shm_rmid_forced=hello ubuntu /bin/bash -c "sysctl kernel.shm_rmid_forced"
09813c9c38cf071e0581c38acccce149331d88729bdb0a47f586ca73a92c7221
docker: Error response from daemon: Cannot start container 09813c9c38cf071e0581c38acccce149331d88729bdb0a47f586ca73a92c7221: [9] System error: could not synchronise with container process.
$ docker logs 09813c9c38cf071e0581c38acccce149331d88729bdb0a47f586ca73a92c7221
write /proc/sys/kernel/shm_rmid_forced: invalid argument
With Docker 1.12 the situation is better:
$ docker -l=debug run -d --sysctl=kernel.shm_rmid_forced=hello busybox /bi>
9483cb34b2559e26a96db01c29727c11afd7d5461588c607992119355507f541
docker: Error response from daemon: oci runtime error: write /proc/sys/kernel/shm_rmid_forced: invalid argument.
The bad news (even with Docker 1.12) is that we currently have no mechanism to react to specific container creation errors and fail a pod.
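Since neither Docker version hands the kubelet a structured reason, any reaction to sysctl failures would have to be based on the error text. The Python sketch below is a hypothetical policy, not existing kubelet code: it classifies a message by the two kernel errors seen in this bug and decides whether to fail the pod instead of retrying with backoff. It only helps with the Docker 1.12 launch error; the Docker 1.10 "could not synchronise with container process" message carries no sysctl detail and falls through to UNKNOWN.

```python
from enum import Enum


class SysctlFailure(Enum):
    NOT_NAMESPACED = "sysctl not namespaced (no file under /proc/sys)"
    INVALID_VALUE = "sysctl value rejected by the kernel"
    UNKNOWN = "no sysctl detail in the message"


def classify(message: str) -> SysctlFailure:
    """Classify a container launch error by the kernel errors seen in this bug."""
    text = message.lower()
    if "/proc/sys/" not in text:
        return SysctlFailure.UNKNOWN
    if "no such file or directory" in text:
        return SysctlFailure.NOT_NAMESPACED
    if "invalid argument" in text:
        return SysctlFailure.INVALID_VALUE
    return SysctlFailure.UNKNOWN


def should_fail_pod(message: str) -> bool:
    """Hypothetical policy: a sysctl error will not go away on retry, so fail the pod."""
    return classify(message) is not SysctlFailure.UNKNOWN
```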
net.ipv4.tcp_syncookies has the same issue; could you double-check?
Summing up all of the above in one comment:
The following applies to all failing sysctls:
- If the kernel does not namespace a sysctl ("no such file or directory") or rejects a sysctl value ("invalid argument"), Docker 1.10 shows this in the container logs only. Kubernetes/OpenShift does not see this output. Moreover, it keeps trying to launch the container (with a backoff), creating multiple events.
- With Docker 1.12 the situation is a bit better, as Docker includes the actual sysctl error in the container launch error message. Then at least the user can see that it was a sysctl error.
The actual failure reason differs slightly for each of the sysctls mentioned:
- net.ipv4.tcp_max_syn_backlog is not namespaced on any kernel and should be removed from the whitelist (https://github.com/kubernetes/kubernetes/pull/32072).
- net.ipv4.tcp_syncookies is namespaced in kernels >= 4.6, but not in RHEL's enterprise kernel. If customers request it, here is the kernel patch for the kernel team to backport: https://github.com/torvalds/linux/commit/12ed8244ed8b31b023ea6d2851fd8b15f2999e9b
- kernel.shm_rmid_forced=hello fails because the kernel validates the value "hello" and rejects it.
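A rough way to tell from a kernel release string whether net.ipv4.tcp_syncookies can be expected to be namespaced is sketched below in Python. It assumes the 4.6 cut-off of the upstream commit above and compares only major.minor, so a distribution kernel carrying a backport of that patch would be misreported as unsupported. The release strings used to exercise it are illustrative.

```python
from __future__ import annotations

import re

# Assumed cut-off: the upstream commit for tcp_syncookies landed in Linux 4.6.
TCP_SYNCOOKIES_NAMESPACED_SINCE = (4, 6)


def kernel_version(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string such as '3.10.0-327.el7.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def tcp_syncookies_namespaced(release: str) -> bool:
    """Version heuristic only: backports in distribution kernels are not detected."""
    return kernel_version(release) >= TCP_SYNCOOKIES_NAMESPACED_SINCE
```

On a node, platform.release() from the Python standard library returns the running kernel's release string, which can be passed to tcp_syncookies_namespaced().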
For net.ipv4.tcp_syncookies support, we need to document the kernel >= 4.6 requirement in openshift-docs.
Still has the issue. I don't know why this was moved to ON_QA.
Backport of the net.ipv4.tcp_max_syn_backlog removal to 3.3.x: https://github.com/openshift/ose/pull/441
https://github.com/openshift/openshift-docs/pull/3144 is the docs PR that will soon contain the release note that net.ipv4.tcp_syncookies is not namespaced in the RHEL kernel. The ose PR will fix net.ipv4.tcp_max_syn_backlog in 3.3.x. It is already fixed in 3.4.
Moving to MODIFIED as the fix for net.ipv4.tcp_max_syn_backlog is in 3.4 already. As described above, we will add a release note about net.ipv4.tcp_syncookies. DeShuai Ma - when verifying this bz, please only verify that net.ipv4.tcp_max_syn_backlog is no longer in the whitelist.
This has been merged into ose and is in OSE v3.4.0.19 or newer.
Moving to ON_QA according to comment 17.
Tested on openshift v3.4.0.21+ca4702d. Verified: "net.ipv4.tcp_max_syn_backlog" is no longer whitelisted.
[root@dhcp-128-7 dma]# oc get pod
NAME        READY     STATUS            RESTARTS   AGE
hello-pod   0/1       SysctlForbidden   0          <invalid>
[root@dhcp-128-7 dma]# oc describe pod hello-pod
Name:             hello-pod
Namespace:        dma
Security Policy:  restricted
Node:             weshi-3.centralus.cloudapp.azure.com/
Start Time:       Fri, 04 Nov 2016 15:15:37 +0800
Labels:           name=hello-pod
Status:           Failed
Reason:           SysctlForbidden
Message:          Pod forbidden sysctl: "net.ipv4.tcp_max_syn_backlog" not whitelisted
IP:
Controllers:      <none>
Containers:
  hello-pod:
    Image:  docker.io/deshuai/hello-pod:latest
    Port:   8080/TCP
    Volume Mounts:
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-jakux (ro)
    Environment Variables:  <none>
Volumes:
  tmp:
    Type:   EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  default-token-jakux:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-jakux
QoS Class:    BestEffort
Tolerations:  <none>
Events:
  FirstSeen  LastSeen  Count  From  SubobjectPath  Type  Reason  Message
  ---------  --------  -----  ----  -------------  --------  ------  -------
  <invalid>  <invalid>  1  {kubelet weshi-3.centralus.cloudapp.azure.com}  Warning  SysctlForbidden  forbidden sysctl: "net.ipv4.tcp_max_syn_backlog" not whitelisted
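The SysctlForbidden result above appears to come from a check the kubelet performs before creating any container (no container state is shown). The Python sketch below mimics such a whitelist check. The whitelist contents are an assumption (the Kubernetes 1.4 safe set after kubernetes/kubernetes#32072 dropped net.ipv4.tcp_max_syn_backlog), and the sketch ignores the separate handling of unsafe sysctls a node may explicitly allow. The message text mirrors the event shown above.

```python
from __future__ import annotations

from collections.abc import Iterable

# Assumed whitelist: Kubernetes 1.4 safe sysctls after kubernetes/kubernetes#32072.
SAFE_SYSCTLS = frozenset({
    "kernel.shm_rmid_forced",
    "net.ipv4.ip_local_port_range",
    "net.ipv4.tcp_syncookies",
})


def forbidden_sysctls(requested: Iterable[str], whitelist: frozenset[str] = SAFE_SYSCTLS) -> list[str]:
    """Return one rejection message per requested sysctl that is not whitelisted."""
    return [
        f'forbidden sysctl: "{name}" not whitelisted'
        for name in requested
        if name not in whitelist
    ]
```

Passing the whitelist does not guarantee that a sysctl works: net.ipv4.tcp_syncookies would still fail at container start on a kernel that does not namespace it, such as the RHEL kernel discussed in this bug.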
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0066