Bug 2066886

Summary: openshift-apiserver pods never going NotReady
Product: OpenShift Container Platform
Reporter: Devan Goodwin <dgoodwin>
Component: openshift-apiserver
Assignee: Abu Kashem <akashem>
Status: CLOSED ERRATA
QA Contact: Rahul Gangwar <rgangwar>
Severity: high
Docs Contact:
Priority: high
Version: 4.11
CC: akashem, mfojtik, sanchezl
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2109235
Environment:
Last Closed: 2022-08-10 10:55:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 2109235

Description Devan Goodwin 2022-03-22 16:54:54 UTC
Description of problem:

Observed in CI that openshift-apiserver pods never seem to go to a NotReady state during upgrades.

The problem was traced to the pod definition, which appears to use healthz rather than readyz for its readiness probe and takes 100s to go to ready=false (10 failed probes, one every 10s). We only wait 60s for in-flight requests to drain.

Pod definition likely needs to be updated.

Slack thread: https://coreos.slack.com/archives/C01CQA76KMX/p1647886954470019

This is difficult to demonstrate until some upcoming PRs that will graph pod states in CI runs have merged. Hopefully we can update this bug with those graphs soon.

Comment 1 Abu Kashem 2022-03-23 16:33:12 UTC
readiness probe with default values (from the Pod object)

> readinessProbe:
>       failureThreshold: 10
>       httpGet:
>         path: healthz
>         port: 8443
>         scheme: HTTPS
>       periodSeconds: 10
>       successThreshold: 1
>       timeoutSeconds: 1


- kubelet readiness check should probe '/readyz' endpoint, not '/healthz'
https://github.com/openshift/cluster-openshift-apiserver-operator/blob/master/bindata/v3.11.0/openshift-apiserver/deploy.yaml#L128

- with default 'periodSeconds=10s' and 'failureThreshold=10', it will take '100s' for kubelet to set ready=false. Once the pod is patched with 'ready=false', the endpoints controller will rotate the Pod IP out of the Service.
we can set 'failureThreshold' to '1'; kubelet will then take '10s' in the worst case to set ready=false

- do we need a startup probe?
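
The timing argument in the list above can be checked with a small calculation. This is only a sketch: it uses the same approximation as the comment (worst-case delay = periodSeconds x failureThreshold, ignoring probe timeouts), the class and function names are hypothetical, and the 60s drain window is the figure quoted in the bug description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadinessProbe:
    """Hypothetical holder for the two probe settings discussed above."""
    period_seconds: int
    failure_threshold: int

    def worst_case_unready_delay(self) -> int:
        # Approximation used in this bug: kubelet needs `failure_threshold`
        # consecutive failures, one probe every `period_seconds`.
        return self.period_seconds * self.failure_threshold


# Values taken from the Pod object quoted in this comment.
current = ReadinessProbe(period_seconds=10, failure_threshold=10)
# Proposed change: failureThreshold=1.
proposed = ReadinessProbe(period_seconds=10, failure_threshold=1)

# Drain window for in-flight requests, as stated in the bug description.
DRAIN_WINDOW_SECONDS = 60

for name, probe in (("current", current), ("proposed", proposed)):
    delay = probe.worst_case_unready_delay()
    verdict = "within" if delay <= DRAIN_WINDOW_SECONDS else "exceeds"
    print(f"{name}: {delay}s ({verdict} the {DRAIN_WINDOW_SECONDS}s drain window)")
```

With the current settings the delay is longer than the drain window, which is consistent with the pod never being observed as NotReady.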


we also have these related settings:
>  shutdown-delay-duration:
>  - 10s # give SDN some time to converge
>  shutdown-send-retry-after:
>  - "true"
https://github.com/openshift/cluster-openshift-apiserver-operator/blob/master/bindata/v3.11.0/config/defaultconfig.yaml#L17-L20

Comment 3 Rahul Gangwar 2022-03-29 06:44:34 UTC
In the openshift-apiserver namespace, we ran oc get -n openshift-apiserver pods -w -ojson, and then in another window we deleted one of the openshift-apiserver pods. The pod's container status before deletion indicated readiness=false.

oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-03-27-140854   True        False         23m     Cluster version is 4.11.0-0.nightly-2022-03-27-140854

oc get pods -n openshift-apiserver
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-7dd8d5695f-662cj   2/2     Running   0          34m
apiserver-7dd8d5695f-cshxc   2/2     Running   0          36m
apiserver-7dd8d5695f-mgb5f   2/2     Running   0          36m

In one terminal we ran the watch command below, and in another we deleted one OAS pod with "oc delete pod/apiserver-7dd8d5695f-cshxc -n openshift-apiserver". Before the deletion completed, the pod status indicated readiness=false.
  

 oc get -n openshift-apiserver pods -w -ojson

    },
    "status": {
        "conditions": [
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2022-03-29T06:27:52Z",
                "message": "0/6 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) didn't match pod anti-affinity rules.",
                "reason": "Unschedulable",
                "status": "False",
                "type": "PodScheduled"
            }
        ],
        "phase": "Pending",
        "qosClass": "Burstable"
    }

Comment 4 Rahul Gangwar 2022-03-29 06:51:13 UTC
 oc get -n openshift-apiserver pods -w -ojson

 "status": {
        "conditions": [
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2022-03-29T06:44:35Z",
                "status": "True",
                "type": "Initialized"
            },
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2022-03-29T06:44:32Z",
                "message": "containers with unready status: [openshift-apiserver]",
                "reason": "ContainersNotReady",
                "status": "False",
                "type": "Ready"
            },
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2022-03-29T06:44:32Z",
                "message": "containers with unready status: [openshift-apiserver]",
                "reason": "ContainersNotReady",
                "status": "False",
                "type": "ContainersReady"
            },
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2022-03-29T06:44:32Z",
                "status": "True",
                "type": "PodScheduled"
            }
        ],
        "containerStatuses": [
            {
                "containerID": "cri-o://2c9a6c7a9d97f8b774f62c0763eb92b9bdeda3bf71a26d322b5115706bbb64ab",
                "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9be21be1b69f360f0d79a2c1ec6cac7d3519d80e76f46f7f1c02868bfffcda21",
                "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9be21be1b69f360f0d79a2c1ec6cac7d3519d80e76f46f7f1c02868bfffcda21",
                "lastState": {},
                "name": "openshift-apiserver",
                "ready": false,
                "restartCount": 0,
                "started": true,
                "state": {
                    "running": {
                        "startedAt": "2022-03-29T06:44:35Z"
                    }
                }
            },

The termination grace period is 90s, as in the PR.

 ],
        "nodeName": "ip-10-0-205-119.us-east-2.compute.internal",
        "nodeSelector": {
            "node-role.kubernetes.io/master": ""
        },
        "preemptionPolicy": "PreemptLowerPriority",
        "priority": 2000001000,
        "priorityClassName": "system-node-critical",
        "restartPolicy": "Always",
        "schedulerName": "default-scheduler",
        "securityContext": {},
        "serviceAccount": "openshift-apiserver-sa",
        "serviceAccountName": "openshift-apiserver-sa",
        "terminationGracePeriodSeconds": 90,
        "tolerations": [
            {

After deletion and restart, the container status shows ready: true

   "status": {
        "conditions": [
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2022-03-29T06:44:35Z",
                "status": "True",
                "type": "Initialized"
            },
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2022-03-29T06:44:42Z",
                "status": "True",
                "type": "Ready"
            },
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2022-03-29T06:44:42Z",
                "status": "True",
                "type": "ContainersReady"
            },
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2022-03-29T06:44:32Z",
                "status": "True",
                "type": "PodScheduled"
            }
        ],
        "containerStatuses": [
            {
                "containerID": "cri-o://2c9a6c7a9d97f8b774f62c0763eb92b9bdeda3bf71a26d322b5115706bbb64ab",
                "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9be21be1b69f360f0d79a2c1ec6cac7d3519d80e76f46f7f1c02868bfffcda21",
                "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9be21be1b69f360f0d79a2c1ec6cac7d3519d80e76f46f7f1c02868bfffcda21",
                "lastState": {},
                "name": "openshift-apiserver",
                "ready": true,
                "restartCount": 0,
                "started": true,
                "state": {
                    "running": {
                        "startedAt": "2022-03-29T06:44:35Z"
                    }
                }
            },
            {
                "containerID": "cri-o://006f837b616086325026119c7b48e7ede4761f0d65396c97f1cdd7b67522707b",
                "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4ccd4c61eed4b6dca55f94aa7148766bc2e2ef3682a6607e71ce3ae6331997cb",
                "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4ccd4c61eed4b6dca55f94aa7148766bc2e2ef3682a6607e71ce3ae6331997cb",
                "lastState": {},
                "name": "openshift-apiserver-check-endpoints",
                "ready": true,
                "restartCount": 0,
                "started": true,
                "state": {
                    "running": {
                        "startedAt": "2022-03-29T06:44:35Z"
                    }
                }
            }
        ],
        "hostIP": "10.0.205.119",
        "initContainerStatuses": [
            {
                "containerID": "cri-o://04563c455ea5ed13ff43cad0bc345adea4c56206e193099303a723b8604cc107",
                "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9be21be1b69f360f0d79a2c1ec6cac7d3519d80e76f46f7f1c02868bfffcda21",
                "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9be21be1b69f360f0d79a2c1ec6cac7d3519d80e76f46f7f1c02868bfffcda21",
                "lastState": {},
                "name": "fix-audit-permissions",
                "ready": true,
                "restartCount": 0,
                "state": {
                    "terminated": {
                        "containerID": "cri-o://04563c455ea5ed13ff43cad0bc345adea4c56206e193099303a723b8604cc107",
                        "exitCode": 0,
                        "finishedAt": "2022-03-29T06:44:34Z",
                        "reason": "Completed",
                        "startedAt": "2022-03-29T06:44:34Z"
                    }
                }
            }
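
The manual inspection of the watch output above could be scripted. The following is a minimal sketch, not part of the verification that was actually run: the helper names are hypothetical, and the sample object is a hand-trimmed copy of the "ContainersNotReady" status shown earlier in this comment.

```python
import json
from typing import Any, Dict, List, Optional


def ready_condition(pod: Dict[str, Any]) -> Optional[str]:
    """Return the status of the pod's Ready condition ("True"/"False"), or None if absent."""
    for cond in pod.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond.get("status")
    return None


def unready_containers(pod: Dict[str, Any]) -> List[str]:
    """Return the names of containers whose containerStatuses entry has ready=false."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    return [cs["name"] for cs in statuses if not cs.get("ready", False)]


# Trimmed sample based on the status block quoted above (assumed shape).
sample = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "Initialized", "status": "True"},
      {"type": "Ready", "status": "False", "reason": "ContainersNotReady"},
      {"type": "ContainersReady", "status": "False", "reason": "ContainersNotReady"},
      {"type": "PodScheduled", "status": "True"}
    ],
    "containerStatuses": [
      {"name": "openshift-apiserver", "ready": false, "started": true}
    ]
  }
}
""")

print("Ready condition:", ready_condition(sample))
print("Unready containers:", unready_containers(sample))
```

Each JSON object emitted by the watch could be passed through these helpers to log when the Ready condition flips.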

Comment 7 errata-xmlrpc 2022-08-10 10:55:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069