Bug 1430729 - [3.5] router fails to be deployed because the /healthz probe is blocked by iptables.
Summary: [3.5] router fails to be deployed because the /healthz probe is blocked by iptables.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Phil Cameron
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-03-09 12:45 UTC by Johnny Liu
Modified: 2022-08-04 22:20 UTC (History)
CC: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-04-12 19:14:44 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0884 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.5 RPM Release Advisory 2017-04-12 22:50:07 UTC

Description Johnny Liu 2017-03-09 12:45:02 UTC
Description of problem:
It seems https://github.com/openshift/ose/commit/612dc5117a96e262764c3b0e574ef224252413f7 introduced this bug; please see the following details.

Version-Release number of selected component (if applicable):
atomic-openshift-3.5.0.49-1.git.0.c8e072a.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Launch instances on openstack
2. Set up the environment via the openshift-ansible installer
3. 

Actual results:
The router fails to be deployed.
# oc get pod
router-1-zs593             0/1       CrashLoopBackOff   35         1h

# oc describe po router-1-zs593
<--snip-->
  1h	4m	27	{kubelet openshift-133.xxx}	spec.containers{router}	Normal	Created		(events with common reason combined)
  1h	4m	27	{kubelet openshift-133.xxx}	spec.containers{router}	Normal	Started		(events with common reason combined)
  1h	4m	38	{kubelet openshift-133.xxx}	spec.containers{router}	Warning	Unhealthy	Liveness probe failed: dial tcp 10.14.6.133:1936: getsockopt: no route to host
  1h	4m	5	{kubelet openshift-133.xxx}	spec.containers{router}	Warning	Unhealthy	Readiness probe failed: Get http://localhost:1936/healthz: dial tcp [::1]:1936: getsockopt: connection refused
  1h	4m	27	{kubelet openshift-133.xxx}	spec.containers{router}	Normal	Killing		(events with common reason combined)
<--snip-->

# oc get hostsubnet
NAME                               HOST                               HOST IP       SUBNET
openshift-133.xxx   openshift-133.xxx   10.14.6.133   10.129.0.0/23
openshift-137.xxx   openshift-137.xxx   10.14.6.137   10.128.0.0/23

For openshift-133.xxx, its external IP is 10.14.6.133, its internal IP is 192.168.2.108.

# iptables -L -n|more
Chain INPUT (policy ACCEPT)
num  target     prot opt source               destination         
1    KUBE-NODEPORT-NON-LOCAL  all  --  0.0.0.0/0            0.0.0.0/0            /* Ensure that non-local NodePort traffic can flow */
2    KUBE-FIREWALL  all  --  0.0.0.0/0            0.0.0.0/0           
3    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            /* traffic from docker */
4    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            /* traffic from SDN */
5    ACCEPT     udp  --  0.0.0.0/0            0.0.0.0/0            multiport dports 4789 /* 001 vxlan incoming */
6    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
7    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           
8    INPUT_direct  all  --  0.0.0.0/0            0.0.0.0/0           
9    INPUT_ZONES_SOURCE  all  --  0.0.0.0/0            0.0.0.0/0           
10   INPUT_ZONES  all  --  0.0.0.0/0            0.0.0.0/0           
11   DROP       all  --  0.0.0.0/0            0.0.0.0/0            ctstate INVALID
12   REJECT     all  --  0.0.0.0/0            0.0.0.0/0            reject-with icmp-host-prohibited

If rule 12 (the blanket REJECT) is deleted from the INPUT chain above, 10.14.6.133:1936 becomes accessible and the router is deployed successfully.
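The workaround just described can be sketched as the following commands. This is a diagnostic sketch only, not the eventual fix: the rule number 12 assumes the exact INPUT chain listing shown above and may differ on other hosts, and the node IP is the one from this report.

```shell
# Option 1: delete the blanket REJECT rule (rule 12 in the listing above).
iptables -D INPUT 12

# Option 2 (narrower): accept the haproxy stats/healthz port ahead of the
# REJECT rule instead of removing it entirely.
iptables -I INPUT 12 -p tcp --dport 1936 -j ACCEPT

# Re-check that the probe endpoint is now reachable via the node IP.
curl -s http://10.14.6.133:1936/healthz
```

Either variant only confirms the diagnosis; the proper fix (rolled out in PR 13331, per the comments) was to change the probe itself rather than open the port.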

# oc get dc router -o yaml
<--snip-->
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 1936
          timeoutSeconds: 1
<--snip-->


Expected results:
The router is deployed successfully.

Additional info:
In 3.4, there is no such issue, because the 3.4 haproxy router uses the following healthz probe:
<--snip-->
        livenessProbe:
          failureThreshold: 3
          httpGet:
            host: localhost
            path: /healthz
            port: 1936
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
<--snip-->

That probe accesses localhost:1936, which is not blocked by iptables.
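The difference can be seen directly on the affected node: loopback traffic is accepted early in the INPUT chain and never reaches the REJECT rule, while traffic to the node IP does. A quick check, assuming the port and node IP from this report (the echoed messages are illustrative, not captured output):

```shell
# Loopback request: matched by an early ACCEPT rule, so it reaches haproxy.
curl -sf http://localhost:1936/healthz && echo "localhost: reachable"

# Node-IP request: falls through to the REJECT rule (icmp-host-prohibited),
# which is exactly what the tcpSocket liveness probe in 3.5 runs into.
curl -sf http://10.14.6.133:1936/healthz || echo "node IP: blocked"
```

This is why the 3.4-style httpGet probe with host: localhost works regardless of the firewall configuration.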

Comment 1 Phil Cameron 2017-03-09 22:26:12 UTC
Rolled back fix for 1405440
PR 13331

Comment 2 Phil Cameron 2017-03-13 13:42:08 UTC
PR 13331 MERGED

Comment 3 Phil Cameron 2017-03-13 13:46:14 UTC
bmeng
This is a rollback of a fix that didn't work properly. The original resolution documented increasing maxconn=20000 to work around this problem.

Comment 4 Phil Cameron 2017-03-14 13:30:49 UTC
Pull request:
https://github.com/openshift/origin/pull/13331

Comment 5 Troy Dawson 2017-03-14 14:22:44 UTC
This has been merged into OCP and is in OCP v3.5.0.52 or newer.

Comment 7 zhaozhanqi 2017-03-15 02:29:44 UTC
Verified this bug on v3.5.0.52

The router pod works well.

Checked that the Liveness and Readiness probes use http-get against localhost:

    Liveness:		http-get http://localhost:1936/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:		http-get http://localhost:1936/healthz delay=10s timeout=1s period=10s #success=1 #failure=3

Comment 9 errata-xmlrpc 2017-04-12 19:14:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884

