Description of problem:

Observed in CI that openshift-apiserver pods never seem to go NotReady during upgrades. The problem was traced to the pod definition using the healthz endpoint rather than readyz for its readiness probe, so it takes 100s for the kubelet to mark the pod ready=false (failureThreshold=10, probed every 10s), while we only wait 60s for in-flight requests to drain. The pod definition likely needs to be updated.

Slack thread: https://coreos.slack.com/archives/C01CQA76KMX/p1647886954470019

This is difficult to demonstrate until some upcoming PRs that will graph pod states in CI runs merge; hopefully we can update this bug with that data soon.
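For reference, the probe actually rendered into the pods can be checked straight from the cluster. A quick look, assuming the operator-managed Deployment is named apiserver (as the pod names in the verification further down suggest):

oc get deployment apiserver -n openshift-apiserver \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="openshift-apiserver")].readinessProbe}'

On affected builds this should show the httpGet path pointing at healthz with failureThreshold: 10, matching the probe quoted below.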
readiness probe with default values (from the Pod object):

> readinessProbe:
>   failureThreshold: 10
>   httpGet:
>     path: healthz
>     port: 8443
>     scheme: HTTPS
>   periodSeconds: 10
>   successThreshold: 1
>   timeoutSeconds: 1

- the kubelet readiness check should probe the '/readyz' endpoint, not '/healthz':
  https://github.com/openshift/cluster-openshift-apiserver-operator/blob/master/bindata/v3.11.0/openshift-apiserver/deploy.yaml#L128
- with the default 'periodSeconds=10' and 'failureThreshold=10', it takes '100s' for the kubelet to set ready=false; only once the pod is patched with 'ready=false' does the endpoints controller rotate the Pod IP out of the Service. We can set 'failureThreshold' to '1', so the kubelet takes at most '10s' to set ready=false (see the sketch below).
- do we need a startup probe?

we also have these related settings:

> shutdown-delay-duration:
> - 10s # give SDN some time to converge
> shutdown-send-retry-after:
> - "true"

https://github.com/openshift/cluster-openshift-apiserver-operator/blob/master/bindata/v3.11.0/config/defaultconfig.yaml#L17-L20
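A minimal sketch of the patched probe these notes point at, assuming the fix is simply switching the path to readyz and lowering failureThreshold (the exact values that ship may differ):

> readinessProbe:
>   failureThreshold: 1   # worst case ~10s to mark the pod NotReady
>   httpGet:
>     path: readyz
>     port: 8443
>     scheme: HTTPS
>   periodSeconds: 10
>   successThreshold: 1
>   timeoutSeconds: 1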
In the openshift-apiserver namespace, we run "oc get -n openshift-apiserver pods -w -ojson" in one terminal and then delete one of the openshift-apiserver pods in another. The pod's container status reports readiness=false before the deletion completes.

oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-03-27-140854   True        False         23m     Cluster version is 4.11.0-0.nightly-2022-03-27-140854

oc get pods -n openshift-apiserver
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-7dd8d5695f-662cj   2/2     Running   0          34m
apiserver-7dd8d5695f-cshxc   2/2     Running   0          36m
apiserver-7dd8d5695f-mgb5f   2/2     Running   0          36m

In one terminal we ran the watch command below; in the other we deleted one OAS pod:

oc delete pod/apiserver-7dd8d5695f-cshxc -n openshift-apiserver

Before the pod deletion completed, the pod status reported readiness=false.

oc get -n openshift-apiserver pods -w -ojson
...
  "status": {
    "conditions": [
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2022-03-29T06:27:52Z",
        "message": "0/6 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) didn't match pod anti-affinity rules.",
        "reason": "Unschedulable",
        "status": "False",
        "type": "PodScheduled"
      }
    ],
    "phase": "Pending",
    "qosClass": "Burstable"
  }
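As a convenience while verifying, the Ready condition of a single pod can also be polled directly with a jsonpath query instead of scanning the full -ojson watch output (the pod name is just the one from this run; any openshift-apiserver pod works):

oc get pod apiserver-7dd8d5695f-cshxc -n openshift-apiserver \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}{"\n"}'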
The watch then shows the replacement pod with its openshift-apiserver container not yet ready:

oc get -n openshift-apiserver pods -w -ojson
...
  "status": {
    "conditions": [
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2022-03-29T06:44:35Z",
        "status": "True",
        "type": "Initialized"
      },
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2022-03-29T06:44:32Z",
        "message": "containers with unready status: [openshift-apiserver]",
        "reason": "ContainersNotReady",
        "status": "False",
        "type": "Ready"
      },
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2022-03-29T06:44:32Z",
        "message": "containers with unready status: [openshift-apiserver]",
        "reason": "ContainersNotReady",
        "status": "False",
        "type": "ContainersReady"
      },
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2022-03-29T06:44:32Z",
        "status": "True",
        "type": "PodScheduled"
      }
    ],
    "containerStatuses": [
      {
        "containerID": "cri-o://2c9a6c7a9d97f8b774f62c0763eb92b9bdeda3bf71a26d322b5115706bbb64ab",
        "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9be21be1b69f360f0d79a2c1ec6cac7d3519d80e76f46f7f1c02868bfffcda21",
        "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9be21be1b69f360f0d79a2c1ec6cac7d3519d80e76f46f7f1c02868bfffcda21",
        "lastState": {},
        "name": "openshift-apiserver",
        "ready": false,
        "restartCount": 0,
        "started": true,
        "state": {
          "running": {
            "startedAt": "2022-03-29T06:44:35Z"
          }
        }
      },
      ...

The termination grace period is 90s, as in the PR (pod spec excerpt from the same output):

  ...
  ],
  "nodeName": "ip-10-0-205-119.us-east-2.compute.internal",
  "nodeSelector": {
    "node-role.kubernetes.io/master": ""
  },
  "preemptionPolicy": "PreemptLowerPriority",
  "priority": 2000001000,
  "priorityClassName": "system-node-critical",
  "restartPolicy": "Always",
  "schedulerName": "default-scheduler",
  "securityContext": {},
  "serviceAccount": "openshift-apiserver-sa",
  "serviceAccountName": "openshift-apiserver-sa",
  "terminationGracePeriodSeconds": 90,
  "tolerations": [
    {
  ...

After deletion and restart, the container status reports readiness: true:

  "status": {
    "conditions": [
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2022-03-29T06:44:35Z",
        "status": "True",
        "type": "Initialized"
      },
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2022-03-29T06:44:42Z",
        "status": "True",
        "type": "Ready"
      },
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2022-03-29T06:44:42Z",
        "status": "True",
        "type": "ContainersReady"
      },
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2022-03-29T06:44:32Z",
        "status": "True",
        "type": "PodScheduled"
      }
    ],
    "containerStatuses": [
      {
        "containerID": "cri-o://2c9a6c7a9d97f8b774f62c0763eb92b9bdeda3bf71a26d322b5115706bbb64ab",
        "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9be21be1b69f360f0d79a2c1ec6cac7d3519d80e76f46f7f1c02868bfffcda21",
        "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9be21be1b69f360f0d79a2c1ec6cac7d3519d80e76f46f7f1c02868bfffcda21",
        "lastState": {},
        "name": "openshift-apiserver",
        "ready": true,
        "restartCount": 0,
        "started": true,
        "state": {
          "running": {
            "startedAt": "2022-03-29T06:44:35Z"
          }
        }
      },
      {
        "containerID": "cri-o://006f837b616086325026119c7b48e7ede4761f0d65396c97f1cdd7b67522707b",
        "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4ccd4c61eed4b6dca55f94aa7148766bc2e2ef3682a6607e71ce3ae6331997cb",
        "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4ccd4c61eed4b6dca55f94aa7148766bc2e2ef3682a6607e71ce3ae6331997cb",
        "lastState": {},
        "name": "openshift-apiserver-check-endpoints",
        "ready": true,
        "restartCount": 0,
        "started": true,
        "state": {
          "running": {
            "startedAt": "2022-03-29T06:44:35Z"
          }
        }
      }
    ],
    "hostIP": "10.0.205.119",
    "initContainerStatuses": [
      {
        "containerID": "cri-o://04563c455ea5ed13ff43cad0bc345adea4c56206e193099303a723b8604cc107",
        "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9be21be1b69f360f0d79a2c1ec6cac7d3519d80e76f46f7f1c02868bfffcda21",
        "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9be21be1b69f360f0d79a2c1ec6cac7d3519d80e76f46f7f1c02868bfffcda21",
        "lastState": {},
        "name": "fix-audit-permissions",
        "ready": true,
        "restartCount": 0,
        "state": {
          "terminated": {
            "containerID": "cri-o://04563c455ea5ed13ff43cad0bc345adea4c56206e193099303a723b8604cc107",
            "exitCode": 0,
            "finishedAt": "2022-03-29T06:44:34Z",
            "reason": "Completed",
            "startedAt": "2022-03-29T06:44:34Z"
          }
        }
      }
      ...
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069