Bug 1943845

Summary: Router pods should have startup probes configured
Product: OpenShift Container Platform
Reporter: Miciah Dashiel Butler Masters <mmasters>
Component: Networking
Assignee: Miciah Dashiel Butler Masters <mmasters>
Networking sub component: router
QA Contact: Arvind iyengar <aiyengar>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: medium
CC: aiyengar, aos-bugs
Version: 4.8
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-27 22:56:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1996897

Description Miciah Dashiel Butler Masters 2021-03-27 20:10:40 UTC
Description of problem:

Router pods should specify a startup probe for the router container.  Without one, the kubelet starts performing liveness probes as soon as the liveness probe's initial delay of 10 seconds has elapsed.  If the router takes a long time to synchronize (for example, because the cluster has an extremely large number of routes or endpoints), failing liveness probes can cause the kubelet to restart the container before the router has finished its initial synchronization.

The deployment should specify a startup probe that allows a generous amount of time (for example, 2 minutes) for the router to start up when initial synchronization is slow, while still letting the kubelet begin liveness and readiness probes promptly when initial synchronization is quick.


Version-Release number of selected component (if applicable):

Startup probes graduated to beta in Kubernetes 1.18 (OpenShift 4.6) and to stable in Kubernetes 1.20.  See <https://github.com/kubernetes/enhancements/blob/c1cec820b3b3d0fa18dede73107a2cbb43e27e33/keps/sig-node/950-liveness-probe-holdoff/README.md#implementation-history>.


How reproducible:

100%.


Steps to Reproduce:

1. Check the default router deployment's definition:

    oc -n openshift-ingress get deployments/router-default -o yaml


Actual results:

No startup probe is defined.


Expected results:

A startup probe should be defined on the "router" container:

          startupProbe:
            failureThreshold: 120
            httpGet:
              path: /healthz/ready
              port: 1936
            periodSeconds: 1
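
For reference, the time this probe gives the router can be estimated from its settings.  The following is only a rough sketch: the helper name startup_window_seconds is made up for illustration, and the result is an approximate upper bound (it ignores probe timeouts and kubelet scheduling jitter).  With failureThreshold 120 and periodSeconds 1 it works out to roughly the 2 minutes suggested in the description, compared with the liveness probe's 10-second initial delay.

```python
def startup_window_seconds(failure_threshold: int,
                           period_seconds: int,
                           initial_delay_seconds: int = 0) -> int:
    """Approximate upper bound, in seconds, on how long the kubelet keeps
    retrying a startup probe before it gives up and restarts the container.

    Hypothetical helper for illustration; not part of any Kubernetes API.
    """
    return initial_delay_seconds + failure_threshold * period_seconds


# Values from the expected startupProbe in this report.
window = startup_window_seconds(failure_threshold=120, period_seconds=1)

# Initial delay of the liveness probe mentioned in the description.
liveness_initial_delay = 10

print(f"startup window: about {window}s")
print(f"extra headroom over the liveness initial delay: about {window - liveness_initial_delay}s")
```

Because periodSeconds is 1, a router that synchronizes quickly passes the startup probe within about a second of becoming ready, so the longer window should only matter when synchronization is slow.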

Comment 2 Arvind iyengar 2021-04-06 05:12:39 UTC
Verified in the "4.8.0-0.nightly-2021-04-05-174735" release. With this payload, the router deployment now includes the startup probe:
-------

oc get clusterversion              
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-05-174735   True        False         48m     Cluster version is 4.8.0-0.nightly-2021-04-05-174735


oc -n openshift-ingress get deployments/router-default -o yaml
        startupProbe:
          failureThreshold: 120
          httpGet:
            path: /healthz/ready
            port: 1936
            scheme: HTTP
          periodSeconds: 1
          successThreshold: 1
          timeoutSeconds: 1
-------
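
The same check could be scripted instead of reading the YAML by eye.  This is a minimal sketch, assuming the deployment has been fetched as JSON (for example with "-o json" instead of "-o yaml") and loaded into a dict; the function name find_startup_probe and the trimmed sample data are illustrative only.

```python
from typing import Optional


def find_startup_probe(deployment: dict,
                       container_name: str = "router") -> Optional[dict]:
    """Return the startupProbe of the named container in a Deployment
    object (as parsed JSON), or None if the container or probe is absent."""
    containers = (
        deployment.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    for container in containers:
        if container.get("name") == container_name:
            return container.get("startupProbe")
    return None


# Trimmed-down sample mirroring the probe shown in this comment.
sample_deployment = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "router",
                        "startupProbe": {
                            "failureThreshold": 120,
                            "httpGet": {"path": "/healthz/ready", "port": 1936},
                            "periodSeconds": 1,
                        },
                    }
                ]
            }
        }
    }
}

probe = find_startup_probe(sample_deployment)
print("startup probe defined" if probe else "no startup probe defined")
```

To run it against a live cluster, the output of the "oc -n openshift-ingress get deployments/router-default -o json" command can be parsed with json.load and passed to the function in place of the sample dict.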

Comment 5 errata-xmlrpc 2021-07-27 22:56:00 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438