Description of problem:

Router pods should specify startup probes for the router container. Without a startup probe, the kubelet starts performing liveness probes after the liveness probe's initial delay of 10 seconds. If the router takes a long time to synchronize (for example, because the cluster has an extremely large number of routes or endpoints), the liveness probe can cause the kubelet to restart the container before the router has even finished its initial synchronization.

The deployment should specify a startup probe that allows a generous amount of time (for example, 2 minutes). This gives the router time to start up when initial synchronization is slow, while still letting the kubelet begin liveness and readiness probes quickly when initial synchronization is quick.

Version-Release number of selected component (if applicable):

Startup probes graduated to beta in Kubernetes 1.18 (OpenShift 4.6) and to stable in Kubernetes 1.20. See <https://github.com/kubernetes/enhancements/blob/c1cec820b3b3d0fa18dede73107a2cbb43e27e33/keps/sig-node/950-liveness-probe-holdoff/README.md#implementation-history>.

How reproducible:

100%.

Steps to Reproduce:

1. Check the default router deployment's definition:

    oc -n openshift-ingress get deployments/router-default -o yaml

Actual results:

No startup probe is defined.

Expected results:

A startup probe should be defined on the "router" container:

    startupProbe:
      failureThreshold: 120
      httpGet:
        path: /healthz/ready
        port: 1936
      periodSeconds: 1
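For reference, the two-minute allowance in the description follows from the probe's timing fields: the kubelet probes once per periodSeconds and gives up after failureThreshold consecutive failures, so the startup budget is roughly their product (plus any initial delay). The sketch below only illustrates that arithmetic. The `Probe` class and `startup_budget_seconds` helper are hypothetical names introduced here, not part of any Kubernetes or OpenShift API, and the result is an approximation that ignores per-probe timeouts.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """Hypothetical holder for a subset of probe timing fields.

    Field names mirror the YAML keys failureThreshold, periodSeconds,
    and initialDelaySeconds.
    """
    failure_threshold: int
    period_seconds: int
    initial_delay_seconds: int = 0


def startup_budget_seconds(probe: Probe) -> int:
    """Approximate time the container has to pass its startup probe.

    Assumes the kubelet probes once per period and restarts the container
    after failure_threshold consecutive failures.
    """
    return probe.initial_delay_seconds + probe.failure_threshold * probe.period_seconds


# Values proposed in this report for the "router" container.
router_startup = Probe(failure_threshold=120, period_seconds=1)
print(startup_budget_seconds(router_startup))  # 120 seconds, i.e. 2 minutes
```

Because periodSeconds is 1, a router that synchronizes quickly passes the startup probe within about a second, after which liveness and readiness probing begins; only a slow start uses the full budget.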
Verified in "4.8.0-0.nightly-2021-04-05-174735" release version. With this payload, it is observed that the router deployment now includes the "startup" probe:

-------
oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-05-174735   True        False         48m     Cluster version is 4.8.0-0.nightly-2021-04-05-174735

oc -n openshift-ingress get deployments/router-default -o yaml
        startupProbe:
          failureThreshold: 120
          httpGet:
            path: /healthz/ready
            port: 1936
            scheme: HTTP
          periodSeconds: 1
          successThreshold: 1
          timeoutSeconds: 1
-------
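The manual check above could also be scripted. The following is a minimal sketch, assuming the deployment is fetched as JSON (`-o json` instead of `-o yaml`) and parsed into a dictionary. The `find_startup_probe` helper and the trimmed `sample` document are illustrative only; the sample carries just the fields shown in the verification output.

```python
from typing import Optional


def find_startup_probe(deployment: dict, container_name: str = "router") -> Optional[dict]:
    """Return the startupProbe of the named container, or None if absent."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    for container in containers:
        if container.get("name") == container_name:
            return container.get("startupProbe")
    return None


# Illustrative, heavily trimmed deployment; a real one has many more fields.
sample = {
    "spec": {"template": {"spec": {"containers": [{
        "name": "router",
        "startupProbe": {
            "failureThreshold": 120,
            "httpGet": {"path": "/healthz/ready", "port": 1936, "scheme": "HTTP"},
            "periodSeconds": 1,
            "successThreshold": 1,
            "timeoutSeconds": 1,
        },
    }]}}}
}

probe = find_startup_probe(sample)
print("startup probe defined" if probe else "no startup probe defined")
```

Against a live cluster, the dictionary could instead come from `json.load(sys.stdin)` with the output of `oc -n openshift-ingress get deployments/router-default -o json` piped in.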
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438