Bug 1430729

Summary: [3.5] Router fails to deploy because the /healthz probe is blocked by iptables.
Product: OpenShift Container Platform
Reporter: Johnny Liu <jialiu>
Component: Networking
Assignee: Phil Cameron <pcameron>
Networking sub component: router
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: aos-bugs, atragler, bmeng, mleitner, sukulkar, tdawson, weshi, wmeng
Version: 3.5.0
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text: undefined
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-04-12 19:14:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Johnny Liu 2017-03-09 12:45:02 UTC
Description of problem:
It seems that https://github.com/openshift/ose/commit/612dc5117a96e262764c3b0e574ef224252413f7 introduced this bug; please see the following details.

Version-Release number of selected component (if applicable):
atomic-openshift-3.5.0.49-1.git.0.c8e072a.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Launch instances on OpenStack
2. Set up the environment via the openshift-ansible installer
3. 

Actual results:
The router fails to deploy.
# oc get pod
router-1-zs593             0/1       CrashLoopBackOff   35         1h

# oc describe po router-1-zs593
<--snip-->
  1h	4m	27	{kubelet openshift-133.xxx}	spec.containers{router}	Normal	Created		(events with common reason combined)
  1h	4m	27	{kubelet openshift-133.xxx}	spec.containers{router}	Normal	Started		(events with common reason combined)
  1h	4m	38	{kubelet openshift-133.xxx}	spec.containers{router}	Warning	Unhealthy	Liveness probe failed: dial tcp 10.14.6.133:1936: getsockopt: no route to host
  1h	4m	5	{kubelet openshift-133.xxx}	spec.containers{router}	Warning	Unhealthy	Readiness probe failed: Get http://localhost:1936/healthz: dial tcp [::1]:1936: getsockopt: connection refused
  1h	4m	27	{kubelet openshift-133.xxx}	spec.containers{router}	Normal	Killing		(events with common reason combined)
<--snip-->

# oc get hostsubnet
NAME                               HOST                               HOST IP       SUBNET
openshift-133.xxx   openshift-133.xxx   10.14.6.133   10.129.0.0/23
openshift-137.xxx   openshift-137.xxx   10.14.6.137   10.128.0.0/23

For openshift-133.xxx, its external IP is 10.14.6.133, its internal IP is 192.168.2.108.

# iptables -L -n|more
Chain INPUT (policy ACCEPT)
num  target     prot opt source               destination         
1    KUBE-NODEPORT-NON-LOCAL  all  --  0.0.0.0/0            0.0.0.0/0            /* Ensure that non-local NodePort traffic can flow */
2    KUBE-FIREWALL  all  --  0.0.0.0/0            0.0.0.0/0           
3    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            /* traffic from docker */
4    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            /* traffic from SDN */
5    ACCEPT     udp  --  0.0.0.0/0            0.0.0.0/0            multiport dports 4789 /* 001 vxlan incoming */
6    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
7    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           
8    INPUT_direct  all  --  0.0.0.0/0            0.0.0.0/0           
9    INPUT_ZONES_SOURCE  all  --  0.0.0.0/0            0.0.0.0/0           
10   INPUT_ZONES  all  --  0.0.0.0/0            0.0.0.0/0           
11   DROP       all  --  0.0.0.0/0            0.0.0.0/0            ctstate INVALID
12   REJECT     all  --  0.0.0.0/0            0.0.0.0/0            reject-with icmp-host-prohibited

If rule 12 (the REJECT with icmp-host-prohibited) is deleted from the INPUT chain above, 10.14.6.133:1936 becomes reachable and the router deploys successfully.
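The workaround can be reproduced with a transcript like the following (hypothetical and for diagnosis only; requires root on the affected node, and the rule number 12 matches only the listing above, so re-verify it per host; curl is assumed to get an answer because the stats listener also serves /healthz, as in the 3.4 probe):

```shell
iptables -L INPUT -n --line-numbers      # confirm the REJECT rule is still number 12
iptables -D INPUT 12                     # delete the reject-with icmp-host-prohibited rule
curl -v http://10.14.6.133:1936/healthz  # the probe target should now respond
```

Deleting the rule only confirms the diagnosis; the actual fix was reverting the probe change (see the comments below).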

# oc get dc router -o yaml
<--snip-->
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 1936
          timeoutSeconds: 1
<--snip-->


Expected results:
The router deploys successfully.

Additional info:
In 3.4 there is no such issue, because the 3.4 haproxy router uses the following healthz probe:
<--snip-->
        livenessProbe:
          failureThreshold: 3
          httpGet:
            host: localhost
            path: /healthz
            port: 1936
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
<--snip-->

It accesses localhost:1936, which is not blocked by iptables.
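The mechanics of the two probe types can be sketched in Python (a hypothetical stand-in: a /healthz server bound to an ephemeral localhost port instead of haproxy on 1936; in the cluster, the kubelet dials the node IP for the tcpSocket probe but localhost for the httpGet probe, which is why only the former crosses the firewalled interface):

```python
import http.server
import socket
import threading
import urllib.request

class Healthz(http.server.BaseHTTPRequestHandler):
    """Stand-in for the haproxy stats listener that serves /healthz."""
    def do_GET(self):
        if self.path == "/healthz":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()
    def log_message(self, *args):
        pass  # keep the demo quiet

# Bind to an ephemeral localhost port (not 1936) so the demo always runs.
server = http.server.HTTPServer(("127.0.0.1", 0), Healthz)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

def tcp_socket_probe(host, port, timeout=1.0):
    # What a tcpSocket probe checks: success == the TCP dial completes.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def http_get_probe(host, port, path="/healthz", timeout=1.0):
    # What an httpGet probe checks: success == HTTP status in 200-399.
    try:
        with urllib.request.urlopen(f"http://{host}:{port}{path}",
                                    timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:
        return False

print(tcp_socket_probe("127.0.0.1", port))  # True: dial to a listening port succeeds
print(http_get_probe("127.0.0.1", port))    # True: /healthz returns 200
```

Either probe succeeds here because both target localhost; in the bug, the tcpSocket dial went to the node IP 10.14.6.133 and was rejected by the INPUT chain before reaching the listener.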

Comment 1 Phil Cameron 2017-03-09 22:26:12 UTC
Rolled back fix for 1405440
PR 13331

Comment 2 Phil Cameron 2017-03-13 13:42:08 UTC
PR 13331 MERGED

Comment 3 Phil Cameron 2017-03-13 13:46:14 UTC
bmeng
This is a rollback of a fix that didn't work properly. The original resolution documented increasing maxconn to 20000 to work around this problem.

Comment 4 Phil Cameron 2017-03-14 13:30:49 UTC
Pull request:
https://github.com/openshift/origin/pull/13331

Comment 5 Troy Dawson 2017-03-14 14:22:44 UTC
This has been merged into OCP and is in OCP v3.5.0.52 or newer.

Comment 7 zhaozhanqi 2017-03-15 02:29:44 UTC
Verified this bug on v3.5.0.52

The router pod works well.

Checked that the Liveness and Readiness probes use http-get against localhost:

    Liveness:		http-get http://localhost:1936/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:		http-get http://localhost:1936/healthz delay=10s timeout=1s period=10s #success=1 #failure=3

Comment 9 errata-xmlrpc 2017-04-12 19:14:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884