Bug 1430729

Summary: [3.5] Router fails to deploy because the /healthz probe is blocked by iptables.
Product: OpenShift Container Platform
Reporter: Johnny Liu <jialiu>
Component: Networking
Assignee: Phil Cameron <pcameron>
Networking sub component: router
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: aos-bugs, atragler, bmeng, mleitner, sukulkar, tdawson, weshi, wmeng
Version: 3.5.0
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text: undefined
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-04-12 19:14:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Johnny Liu 2017-03-09 12:45:02 UTC
Description of problem:
It seems that https://github.com/openshift/ose/commit/612dc5117a96e262764c3b0e574ef224252413f7 introduced this bug; please see the following details.

Version-Release number of selected component (if applicable):
atomic-openshift-3.5.0.49-1.git.0.c8e072a.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Launch instances on OpenStack
2. Set up the environment via the openshift-ansible installer
3. 

Actual results:
The router fails to deploy.
# oc get pod
router-1-zs593             0/1       CrashLoopBackOff   35         1h

# oc describe po router-1-zs593
<--snip-->
  1h	4m	27	{kubelet openshift-133.xxx}	spec.containers{router}	Normal	Created		(events with common reason combined)
  1h	4m	27	{kubelet openshift-133.xxx}	spec.containers{router}	Normal	Started		(events with common reason combined)
  1h	4m	38	{kubelet openshift-133.xxx}	spec.containers{router}	Warning	Unhealthy	Liveness probe failed: dial tcp 10.14.6.133:1936: getsockopt: no route to host
  1h	4m	5	{kubelet openshift-133.xxx}	spec.containers{router}	Warning	Unhealthy	Readiness probe failed: Get http://localhost:1936/healthz: dial tcp [::1]:1936: getsockopt: connection refused
  1h	4m	27	{kubelet openshift-133.xxx}	spec.containers{router}	Normal	Killing		(events with common reason combined)
<--snip-->

# oc get hostsubnet
NAME                               HOST                               HOST IP       SUBNET
openshift-133.xxx   openshift-133.xxx   10.14.6.133   10.129.0.0/23
openshift-137.xxx   openshift-137.xxx   10.14.6.137   10.128.0.0/23

For openshift-133.xxx, its external IP is 10.14.6.133, its internal IP is 192.168.2.108.

# iptables -L -n|more
Chain INPUT (policy ACCEPT)
num  target     prot opt source               destination         
1    KUBE-NODEPORT-NON-LOCAL  all  --  0.0.0.0/0            0.0.0.0/0            /* Ensure that non-local NodePort traffic can flow */
2    KUBE-FIREWALL  all  --  0.0.0.0/0            0.0.0.0/0           
3    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            /* traffic from docker */
4    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            /* traffic from SDN */
5    ACCEPT     udp  --  0.0.0.0/0            0.0.0.0/0            multiport dports 4789 /* 001 vxlan incoming */
6    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
7    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           
8    INPUT_direct  all  --  0.0.0.0/0            0.0.0.0/0           
9    INPUT_ZONES_SOURCE  all  --  0.0.0.0/0            0.0.0.0/0           
10   INPUT_ZONES  all  --  0.0.0.0/0            0.0.0.0/0           
11   DROP       all  --  0.0.0.0/0            0.0.0.0/0            ctstate INVALID
12   REJECT     all  --  0.0.0.0/0            0.0.0.0/0            reject-with icmp-host-prohibited

If rule 12 (the REJECT with icmp-host-prohibited) is deleted from the INPUT chain above, 10.14.6.133:1936 becomes reachable and the router deploys successfully.
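The workaround can be reproduced with a transcript like the following (hypothetical and for diagnosis only; requires root on the affected node, and the rule number 12 matches only the listing above, so re-verify it per host; curl is assumed to get an answer because the stats listener also serves /healthz, as in the 3.4 probe):

```shell
iptables -L INPUT -n --line-numbers      # confirm the REJECT rule is still number 12
iptables -D INPUT 12                     # delete the reject-with icmp-host-prohibited rule
curl -v http://10.14.6.133:1936/healthz  # the probe target should now respond
```

Deleting the rule only confirms the diagnosis; the actual fix was reverting the probe change (see the comments below).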

# oc get dc router -o yaml
<--snip-->
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 1936
          timeoutSeconds: 1
<--snip-->


Expected results:
The router deploys successfully.

Additional info:
In 3.4 there is no such issue, because the 3.4 haproxy router uses the following healthz probe:
<--snip-->
        livenessProbe:
          failureThreshold: 3
          httpGet:
            host: localhost
            path: /healthz
            port: 1936
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
<--snip-->

It accesses localhost:1936, which is not blocked by iptables.
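The mechanics of the two probe types can be sketched in Python (a hypothetical stand-in: a /healthz server bound to an ephemeral localhost port instead of haproxy on 1936; in the cluster, the kubelet dials the node IP for the tcpSocket probe but localhost for the httpGet probe, which is why only the former crosses the firewalled interface):

```python
import http.server
import socket
import threading
import urllib.request

class Healthz(http.server.BaseHTTPRequestHandler):
    """Stand-in for the haproxy stats listener that serves /healthz."""
    def do_GET(self):
        if self.path == "/healthz":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()
    def log_message(self, *args):
        pass  # keep the demo quiet

# Bind to an ephemeral localhost port (not 1936) so the demo always runs.
server = http.server.HTTPServer(("127.0.0.1", 0), Healthz)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

def tcp_socket_probe(host, port, timeout=1.0):
    # What a tcpSocket probe checks: success == the TCP dial completes.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def http_get_probe(host, port, path="/healthz", timeout=1.0):
    # What an httpGet probe checks: success == HTTP status in 200-399.
    try:
        with urllib.request.urlopen(f"http://{host}:{port}{path}",
                                    timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:
        return False

print(tcp_socket_probe("127.0.0.1", port))  # True: dial to a listening port succeeds
print(http_get_probe("127.0.0.1", port))    # True: /healthz returns 200
```

Either probe succeeds here because both target localhost; in the bug, the tcpSocket dial went to the node IP 10.14.6.133 and was rejected by the INPUT chain before reaching the listener.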

Comment 1 Phil Cameron 2017-03-09 22:26:12 UTC
Rolled back fix for 1405440
PR 13331

Comment 2 Phil Cameron 2017-03-13 13:42:08 UTC
PR 13331 MERGED

Comment 3 Phil Cameron 2017-03-13 13:46:14 UTC
bmeng
This is a rollback of a fix that didn't work properly. The original resolution documented increasing maxconn to 20000 to work around this problem.

Comment 4 Phil Cameron 2017-03-14 13:30:49 UTC
Pull request:
https://github.com/openshift/origin/pull/13331

Comment 5 Troy Dawson 2017-03-14 14:22:44 UTC
This has been merged into OCP and is in OCP v3.5.0.52 or newer.

Comment 7 zhaozhanqi 2017-03-15 02:29:44 UTC
Verified this bug on v3.5.0.52

The router pod works well.

Checked that the Liveness and Readiness probes use http-get against localhost:

    Liveness:		http-get http://localhost:1936/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:		http-get http://localhost:1936/healthz delay=10s timeout=1s period=10s #success=1 #failure=3

Comment 9 errata-xmlrpc 2017-04-12 19:14:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884