Description of problem:
The router pod goes into 'CrashLoopBackOff' when ROUTER_BIND_PORTS_AFTER_SYNC is set to true for the router. This happens only in environments installed with 'system container'; RPM installs work fine.
Version-Release number of selected component (if applicable):
features: Basic-Auth GSSAPI Kerberos SPNEGO
Steps to Reproduce:
1. Set up an environment using the 'system container' install.
2. Check that the router pod is running.
3. Set ROUTER_BIND_PORTS_AFTER_SYNC to true for the router:
   oc env dc router ROUTER_BIND_PORTS_AFTER_SYNC=true
4. Check the router pod again.
At step 4 the router pod crashes; see 'oc describe pod router-10-xxx':
Type     Reason                 Age              From                                      Message
----     ------                 ----             ----                                      -------
Warning  FailedScheduling       6m (x5 over 7m)  default-scheduler                         0/2 nodes are available: 1 MatchNodeSelector, 1 PodFitsHostPorts.
Normal   Scheduled              6m               default-scheduler                         Successfully assigned router-10-t7fmv to qe-zzhao-node-registry-router-1
Normal   SuccessfulMountVolume  6m               kubelet, qe-zzhao-node-registry-router-1  MountVolume.SetUp succeeded for volume "server-certificate"
Normal   SuccessfulMountVolume  6m               kubelet, qe-zzhao-node-registry-router-1  MountVolume.SetUp succeeded for volume "router-token-sh6jl"
Normal   Killing                5m (x2 over 6m)  kubelet, qe-zzhao-node-registry-router-1  Killing container with id docker://router:Container failed liveness probe.. Container will be killed and recreated.
Normal   Created                5m (x3 over 6m)  kubelet, qe-zzhao-node-registry-router-1  Created container
Normal   Started                5m (x3 over 6m)  kubelet, qe-zzhao-node-registry-router-1  Started container
Warning  Unhealthy              5m (x6 over 6m)  kubelet, qe-zzhao-node-registry-router-1  Liveness probe failed: HTTP probe failed with statuscode: 500
Warning  Unhealthy              5m (x6 over 6m)  kubelet, qe-zzhao-node-registry-router-1  Readiness probe failed: HTTP probe failed with statuscode: 500
Normal   Pulled                 1m (x7 over 6m)  kubelet, qe-zzhao-node-registry-router-1  Container image "registry.reg-aws.openshift.com:443/openshift3/ose-haproxy-router:v3.9.0-0.53.0" already present on machine
Expected result: the router pod runs normally with ROUTER_BIND_PORTS_AFTER_SYNC=true.
FYI: the RPM install works fine.
Can you still reproduce this? I saw the same behaviour, but after I deleted the first attempted pod, the dc was able to spawn a valid pod. Does that happen on your setup?
Yes, this issue can still be reproduced in an environment installed with system container.
The router cannot run with 'ROUTER_BIND_PORTS_AFTER_SYNC' enabled, even if I delete the first attempted pod.
Commit pushed to master at https://github.com/openshift/origin
Differentiate liveness and readiness probes for router
Add a backend to the router controller "/livez" that always returns true. This differentiates the liveness and
readiness probes so that a router can be alive and not ready.
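As a rough illustration of the split described in this commit (not the actual Go code in openshift/origin), the sketch below is a small Python probe server where /livez always succeeds while a readiness path fails until a sync flag is set. The /healthz readiness path, the 500 status for "not ready", and the RouterState/probe_status names are assumptions made for this example only.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class RouterState:
    """Hypothetical stand-in for the router's 'initial sync done' state."""

    def __init__(self):
        self.synced = threading.Event()


def probe_status(path, state):
    """Return the HTTP status a probe on `path` would receive."""
    if path == "/livez":
        return 200  # liveness: the process is up, whether or not it has synced
    if path == "/healthz":  # assumed readiness path for this sketch
        return 200 if state.synced.is_set() else 500
    return 404


def make_handler(state):
    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(probe_status(self.path, state))
            self.end_headers()

        def log_message(self, *args):
            pass  # keep the demo output quiet

    return ProbeHandler


def get_status(url):
    """Fetch `url` directly (no proxy) and return the HTTP status code."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def demo():
    """Probe both paths before and after the simulated sync completes."""
    state = RouterState()
    server = HTTPServer(("127.0.0.1", 0), make_handler(state))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    try:
        before = (get_status(base + "/livez"), get_status(base + "/healthz"))
        state.synced.set()  # simulate the router finishing its first sync
        after = (get_status(base + "/livez"), get_status(base + "/healthz"))
    finally:
        server.shutdown()
        server.server_close()
    return before, after


if __name__ == "__main__":
    print(demo())
```

With the liveness and readiness probes pointed at separate paths like this, a router that has not yet bound its ports can report "not ready" without the liveness check killing and recreating the container.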
Verified in openshift v3.11.0-0.21.0; the issue has been fixed.
Operating System: Red Hat Enterprise Linux Atomic Host release 7.5
Cluster Install Method: system container
kernel: Linux qe-master-etcd-1 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.