Bug 1550007

Summary: Router cannot run when 'ROUTER_BIND_PORTS_AFTER_SYNC' is enabled in a system container install environment
Product: OpenShift Container Platform
Reporter: zhaozhanqi <zzhao>
Component: Networking
Assignee: Jacob Tanenbaum <jtanenba>
Networking sub component: router
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: medium
CC: aos-bugs, bbennett, hongli, zzhao
Version: 3.9.0
Target Milestone: ---
Target Release: 3.11.0
Hardware: All
OS: All
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The liveness and readiness probes performed the same check.
Consequence: The router pod could not differentiate between a pod that was alive and one that was ready.
Fix: Create two probes, one for readiness and one for liveness.
Result: A router pod can be alive but not yet ready.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-10-11 07:19:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description zhaozhanqi 2018-02-28 09:45:49 UTC
Description of problem:
The router pod enters 'CrashLoopBackOff' when 'ROUTER_BIND_PORTS_AFTER_SYNC' is set to true on the router. This issue happens in a 'system container' install environment; an rpm install works fine.

Version-Release number of selected component (if applicable):
oc v3.9.0-0.53.0
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

How reproducible:
always

Steps to Reproduce:
1. Set up the env using a 'system container' install
2. Check that the router pod is running
3. Set ROUTER_BIND_PORTS_AFTER_SYNC to true for the router
   oc env dc router ROUTER_BIND_PORTS_AFTER_SYNC=true
4. Check the router pod

Actual results:
Step 4: the router pod crashes; see 'oc describe pod router-10-xxx':

                 router=enabled
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason                 Age              From                                      Message
  ----     ------                 ----             ----                                      -------
  Warning  FailedScheduling       6m (x5 over 7m)  default-scheduler                         0/2 nodes are available: 1 MatchNodeSelector, 1 PodFitsHostPorts.
  Normal   Scheduled              6m               default-scheduler                         Successfully assigned router-10-t7fmv to qe-zzhao-node-registry-router-1
  Normal   SuccessfulMountVolume  6m               kubelet, qe-zzhao-node-registry-router-1  MountVolume.SetUp succeeded for volume "server-certificate"
  Normal   SuccessfulMountVolume  6m               kubelet, qe-zzhao-node-registry-router-1  MountVolume.SetUp succeeded for volume "router-token-sh6jl"
  Normal   Killing                5m (x2 over 6m)  kubelet, qe-zzhao-node-registry-router-1  Killing container with id docker://router:Container failed liveness probe.. Container will be killed and recreated.
  Normal   Created                5m (x3 over 6m)  kubelet, qe-zzhao-node-registry-router-1  Created container
  Normal   Started                5m (x3 over 6m)  kubelet, qe-zzhao-node-registry-router-1  Started container
  Warning  Unhealthy              5m (x6 over 6m)  kubelet, qe-zzhao-node-registry-router-1  Liveness probe failed: HTTP probe failed with statuscode: 500
  Warning  Unhealthy              5m (x6 over 6m)  kubelet, qe-zzhao-node-registry-router-1  Readiness probe failed: HTTP probe failed with statuscode: 500
  Normal   Pulled                 1m (x7 over 6m)  kubelet, qe-zzhao-node-registry-router-1  Container image "registry.reg-aws.openshift.com:443/openshift3/ose-haproxy-router:v3.9.0-0.53.0" already present on machine


Expected results:
The router runs with ROUTER_BIND_PORTS_AFTER_SYNC=true


Additional info:
FYI: an rpm install works fine.

Comment 1 Jacob Tanenbaum 2018-03-08 20:45:36 UTC
Can you still reproduce this? I saw the behaviour, but after deleting the first attempted pod the dc was able to spawn a valid pod. Does that happen on your setup?

Comment 2 zhaozhanqi 2018-03-09 03:36:45 UTC
Yes, this issue can still be reproduced in a system container installed env.

The router cannot run when 'ROUTER_BIND_PORTS_AFTER_SYNC' is enabled, even if I delete the first attempted pod.

Comment 3 Ben Bennett 2018-05-01 14:24:30 UTC
PR https://github.com/openshift/origin/issues/19009

Comment 4 openshift-github-bot 2018-05-22 21:06:46 UTC
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/978d2bc3de43445e4809193016ee7f658ca1348a
Differentiate liveness and readiness probes for router

Add a backend to the router controller "/livez" that always returns true. This differentiates the liveness and
readiness probes so that a router can be alive and not ready.
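The fix described above can be illustrated with a deployment config probe snippet. This is an illustrative sketch only, assuming both endpoints are served on the router's stats port (1936); the exact paths, port, and timings in the shipped router template may differ:

```yaml
# Hypothetical probe spec for the router container.
# "/livez" (added by the fix) succeeds whenever the process is up,
# while "/healthz" stays unready until the initial route sync
# completes, so a pod waiting on ROUTER_BIND_PORTS_AFTER_SYNC is
# no longer killed by the liveness probe.
livenessProbe:
  httpGet:
    path: /livez      # always returns success while the process runs
    port: 1936
  initialDelaySeconds: 10
readinessProbe:
  httpGet:
    path: /healthz    # returns success only after the initial sync
    port: 1936
  initialDelaySeconds: 10
```

With identical probes, the first failing readiness check also failed liveness, so the kubelet killed the pod before it could ever finish syncing; separating the endpoints lets the pod stay alive but unready during the sync.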

Bug 1550007

Comment 6 Hongan Li 2018-08-24 10:35:51 UTC
Verified in openshift v3.11.0-0.21.0; the issue has been fixed.

Operating System: Red Hat Enterprise Linux Atomic Host release 7.5
Cluster Install Method: system container
kernel: Linux qe-master-etcd-1 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux

Comment 8 errata-xmlrpc 2018-10-11 07:19:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652