Description of problem:
It looks like https://github.com/openshift/ose/commit/612dc5117a96e262764c3b0e574ef224252413f7 introduced this bug; please see the details below.

Version-Release number of selected component (if applicable):
atomic-openshift-3.5.0.49-1.git.0.c8e072a.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Launch instances on OpenStack.
2. Set up the environment via the openshift-ansible installer.

Actual results:
The router fails to deploy.

# oc get pod
router-1-zs593   0/1   CrashLoopBackOff   35   1h

# oc describe po router-1-zs593
<--snip-->
  1h  4m  27  {kubelet openshift-133.xxx}  spec.containers{router}  Normal   Created    (events with common reason combined)
  1h  4m  27  {kubelet openshift-133.xxx}  spec.containers{router}  Normal   Started    (events with common reason combined)
  1h  4m  38  {kubelet openshift-133.xxx}  spec.containers{router}  Warning  Unhealthy  Liveness probe failed: dial tcp 10.14.6.133:1936: getsockopt: no route to host
  1h  4m  5   {kubelet openshift-133.xxx}  spec.containers{router}  Warning  Unhealthy  Readiness probe failed: Get http://localhost:1936/healthz: dial tcp [::1]:1936: getsockopt: connection refused
  1h  4m  27  {kubelet openshift-133.xxx}  spec.containers{router}  Normal   Killing    (events with common reason combined)
<--snip-->

# oc get hostsubnet
NAME                HOST                HOST IP       SUBNET
openshift-133.xxx   openshift-133.xxx   10.14.6.133   10.129.0.0/23
openshift-137.xxx   openshift-137.xxx   10.14.6.137   10.128.0.0/23

For openshift-133.xxx, the external IP is 10.14.6.133 and the internal IP is 192.168.2.108.
# iptables -L -n | more
Chain INPUT (policy ACCEPT)
num  target                   prot  opt  source     destination
1    KUBE-NODEPORT-NON-LOCAL  all   --   0.0.0.0/0  0.0.0.0/0   /* Ensure that non-local NodePort traffic can flow */
2    KUBE-FIREWALL            all   --   0.0.0.0/0  0.0.0.0/0
3    ACCEPT                   all   --   0.0.0.0/0  0.0.0.0/0   /* traffic from docker */
4    ACCEPT                   all   --   0.0.0.0/0  0.0.0.0/0   /* traffic from SDN */
5    ACCEPT                   udp   --   0.0.0.0/0  0.0.0.0/0   multiport dports 4789 /* 001 vxlan incoming */
6    ACCEPT                   all   --   0.0.0.0/0  0.0.0.0/0   ctstate RELATED,ESTABLISHED
7    ACCEPT                   all   --   0.0.0.0/0  0.0.0.0/0
8    INPUT_direct             all   --   0.0.0.0/0  0.0.0.0/0
9    INPUT_ZONES_SOURCE       all   --   0.0.0.0/0  0.0.0.0/0
10   INPUT_ZONES              all   --   0.0.0.0/0  0.0.0.0/0
11   DROP                     all   --   0.0.0.0/0  0.0.0.0/0   ctstate INVALID
12   REJECT                   all   --   0.0.0.0/0  0.0.0.0/0   reject-with icmp-host-prohibited

If rule 12 above is deleted, 10.14.6.133:1936 becomes accessible and the router deploys successfully.

# oc get dc router -o yaml
<--snip-->
    livenessProbe:
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      tcpSocket:
        port: 1936
      timeoutSeconds: 1
<--snip-->

Expected results:
The router is deployed successfully.

Additional info:
There is no such issue in 3.4, because the 3.4 haproxy router uses the following liveness probe:

<--snip-->
    livenessProbe:
      failureThreshold: 3
      httpGet:
        host: localhost
        path: /healthz
        port: 1936
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
<--snip-->

It accesses localhost:1936, which is not blocked by iptables.
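The difference between the two probes can be modeled simply: when a probe spec gives no explicit host, the kubelet dials the pod IP, and since the router runs on the host network that is the node's external IP (10.14.6.133 here), which iptables rule 12 REJECTs; a probe with host: localhost dials loopback instead, which iptables does not block. A minimal, hypothetical Python sketch of that defaulting behavior (not kubelet source; the probe dicts mirror the YAML above):

```python
# Simplified model of how the kubelet picks the address a probe connects to:
# if the probe spec does not set a host, the pod IP is used.
def probe_target(probe, pod_ip):
    """Return the (host, port) a liveness/readiness probe would dial."""
    for kind in ("tcpSocket", "httpGet"):
        if kind in probe:
            spec = probe[kind]
            return spec.get("host", pod_ip), spec["port"]
    raise ValueError("unsupported probe type")

# 3.5 router dc: tcpSocket with no host -> dials the node's external IP,
# which the REJECT rule blocks ("no route to host").
broken = {"tcpSocket": {"port": 1936}}

# 3.4-style probe: httpGet with host: localhost -> dials loopback,
# which iptables does not block.
fixed = {"httpGet": {"host": "localhost", "path": "/healthz",
                     "port": 1936, "scheme": "HTTP"}}

print(probe_target(broken, "10.14.6.133"))  # ('10.14.6.133', 1936)
print(probe_target(fixed, "10.14.6.133"))   # ('localhost', 1936)
```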
Rolled back the fix for bug 1405440: PR 13331
PR 13331 MERGED
bmeng: This is a rollback of a fix that didn't work properly. The original resolution documented increasing maxconn to 20000 to work around this problem.
Pull request: https://github.com/openshift/origin/pull/13331
This has been merged into OCP and is available in OCP v3.5.0.52 or newer.
Verified this bug on v3.5.0.52; the router pod works well. Confirmed that the Liveness and Readiness probes now use http-get against localhost:

Liveness:   http-get http://localhost:1936/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
Readiness:  http-get http://localhost:1936/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0884