Bug 2090988
| Summary: | ReplicaSet prometheus-operator-admission-webhook has timed out progressing | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Junqi Zhao <juzhao> |
| Component: | Monitoring | Assignee: | Jayapriya Pai <janantha> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | high | Docs Contact: | Brian Burt <bburt> |
| Priority: | high | | |
| Version: | 4.11 | CC: | anpicker, bburt, hongyli, jfajersk, jmarcal |
| Target Milestone: | --- | Keywords: | TestBlocker |
| Target Release: | 4.12.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-01-17 19:49:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2107493 | | |
Description
Junqi Zhao
2022-05-27 08:59:44 UTC
Found the issue again in a Disconnected UPI on Azure & OVN IPsec cluster upgraded from 4.10.0-0.nightly-2022-06-08-150219 to 4.11.0-0.nightly-2022-06-23-044003; monitoring is degraded:

```yaml
status:
  conditions:
  - lastTransitionTime: "2022-06-23T23:20:58Z"
    message: Rolling out the stack.
    reason: RollOutInProgress
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-06-23T22:35:57Z"
    message: 'Failed to rollout the stack. Error: updating prometheus operator:
      reconciling Prometheus Operator Admission Webhook Deployment failed: updating
      Deployment object failed: waiting for DeploymentRollout of
      openshift-monitoring/prometheus-operator-admission-webhook: the number of pods
      targeted by the deployment (3 pods) is different from the number of pods
      targeted by the deployment that have the desired template spec (1 pods)'
    reason: UpdatingPrometheusOperatorFailed
    status: "True"
    type: Degraded
```

deployment.yaml file status:

```yaml
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2022-06-23T21:22:50Z"
    lastUpdateTime: "2022-06-23T21:22:50Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2022-06-23T22:30:57Z"
    lastUpdateTime: "2022-06-23T22:30:57Z"
    message: ReplicaSet "prometheus-operator-admission-webhook-78fd9c895d" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
  observedGeneration: 2
  readyReplicas: 2
  replicas: 3
  unavailableReplicas: 1
  updatedReplicas: 1
```

Checking the prometheus-operator-admission-webhook pods, there are 2 normal Running pods:

```
prometheus-operator-admission-webhook-74f7bb977f-b9vxd
prometheus-operator-admission-webhook-74f7bb977f-lq5xl
```

and 1 Pending pod, whose ReplicaSet is prometheus-operator-admission-webhook-78fd9c895d, matching the error in deployment.yaml:

```yaml
# prometheus-operator-admission-webhook-78fd9c895d-bjqwj
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-06-23T22:20:56Z"
    message: '0/5 nodes are available: 2 node(s) didn''t match pod anti-affinity
      rules, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }.
      preemption: 0/5 nodes are available: 2 node(s) didn''t match pod anti-affinity
      rules, 3 Preemption is not helpful for scheduling.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable
```

*** Bug 2105174 has been marked as a duplicate of this bug. ***

Encountered the issue again.

Version-Release number of selected component (if applicable): 4.11.0-0.nightly-2022-06-30-005428

1.
```
% oc get node
NAME                       STATUS   ROLES    AGE     VERSION
ge1n1-b5fdg-master-0       Ready    master   4d22h   v1.24.0+9ddc8b1
ge1n1-b5fdg-master-1       Ready    master   4d22h   v1.24.0+9ddc8b1
ge1n1-b5fdg-master-2       Ready    master   4d22h   v1.24.0+9ddc8b1
ge1n1-b5fdg-worker-qjshm   Ready    worker   4d22h   v1.24.0+9ddc8b1
ge1n1-b5fdg-worker-vjcsp   Ready    worker   4d22h   v1.24.0+9ddc8b
```

2.
```
% oc -n openshift-monitoring describe pod prometheus-operator-admission-webhook-555d9654f8-ft2jv
---
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  10h (x292 over 18h)  default-scheduler  0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 Preemption is not helpful for scheduling.
```

3.
```
% oc describe co monitoring
  Last Transition Time:  2022-07-07T13:02:15Z
  Message:   Failed to rollout the stack. Error: updating prometheus operator: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: the number of pods targeted by the deployment (3 pods) is different from the number of pods targeted by the deployment that have the desired template spec (1 pods)
  Reason:    UpdatingPrometheusOperatorFailed
  Status:    True
  Type:      Degraded
  Extension: <nil>
```

During a rollout, the number of prometheus-operator-admission-webhook replicas ranges from `replicas - maxUnavailable` to `replicas + maxSurge`; with 2 replicas, maxUnavailable is 0 because 2 × 25% rounds down to zero. These maxUnavailable and maxSurge values do not seem reasonable here. Both maxSurge and maxUnavailable can be specified as either an integer (e.g. 2) or a percentage (e.g. 50%), and they cannot both be zero. When specified as an integer, it represents the actual number of pods; when specified as a percentage, that percentage of the desired number of pods is used (rounded down for maxUnavailable, rounded up for maxSurge).
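The percentage arithmetic above can be sketched in a few lines of Python. This is an illustration of the rounding rules, not Kubernetes' actual implementation (which lives in its apimachinery helpers):

```python
import math


def resolve_percent(percent: int, replicas: int, round_up: bool) -> int:
    """Scale a percentage against the desired replica count.
    Per the Deployment docs: maxSurge rounds up, maxUnavailable rounds down."""
    value = replicas * percent / 100
    return math.ceil(value) if round_up else math.floor(value)


def rollout_bounds(replicas: int, max_unavailable_pct: int, max_surge_pct: int):
    """Return the (min, max) total pod counts a rolling update may pass through."""
    max_unavailable = resolve_percent(max_unavailable_pct, replicas, round_up=False)
    max_surge = resolve_percent(max_surge_pct, replicas, round_up=True)
    return (replicas - max_unavailable, replicas + max_surge)


# The 2-worker cluster from this report: 2 replicas, 25%/25% defaults.
# maxUnavailable = floor(0.5) = 0, maxSurge = ceil(0.5) = 1.
print(rollout_bounds(2, 25, 25))  # (2, 3)
```

With bounds of (2, 3), the controller may add one surge pod but may never remove an old pod, which is exactly the state the degraded cluster is stuck in.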
Found the issue again in a 4.11 to 4.12 upgrade cluster which has 2 workers:

```
NAME                                STATUS   ROLES                  AGE   VERSION
wduan-0714a-az-q2ps8-master-0       Ready    control-plane,master   20h   v1.24.0+9546431
wduan-0714a-az-q2ps8-master-1       Ready    control-plane,master   20h   v1.24.0+9546431
wduan-0714a-az-q2ps8-master-2       Ready    control-plane,master   20h   v1.24.0+9546431
wduan-0714a-az-q2ps8-worker-4288p   Ready    worker                 20h   v1.24.0+9546431
wduan-0714a-az-q2ps8-worker-4cmx5   Ready    worker                 20h   v1.24.0+9546431
```

```
# oc -n openshift-monitoring get rs
NAME                                               DESIRED   CURRENT   READY   AGE
cluster-monitoring-operator-5bbfd998c6             0         0         0       15h
cluster-monitoring-operator-7df87f6db              0         0         0       14h
cluster-monitoring-operator-848799d46f             1         1         1       42m
cluster-monitoring-operator-f794d99fb              0         0         0       20h
kube-state-metrics-6d685b8687                      0         0         0       15h
kube-state-metrics-764c7cbcb                       0         0         0       20h
kube-state-metrics-f64d8dfd5                       1         1         1       14h
openshift-state-metrics-65f58d4c67                 0         0         0       15h
openshift-state-metrics-686fcb4d58                 1         1         1       14h
prometheus-adapter-58c87f9b89                      0         0         0       14h
prometheus-adapter-757f89985b                      0         0         0       93m
prometheus-adapter-77749dbccb                      0         0         0       92m
prometheus-adapter-78745d5d6f                      2         2         2       91m
prometheus-adapter-7bd4745484                      0         0         0       20h
prometheus-adapter-c49dd8496                       0         0         0       15h
prometheus-operator-5c58cf67d5                     0         0         0       20h
prometheus-operator-76f79c8d85                     0         0         0       15h
prometheus-operator-7cdc95fdb4                     1         1         1       14h
prometheus-operator-admission-webhook-5b9ff5ddbf   1         1         0       41m
prometheus-operator-admission-webhook-67f97cf577   2         2         2       14h
telemeter-client-66989bbb48                        1         1         1       14h
telemeter-client-fb5b896cf                         0         0         0       15h
thanos-querier-6cfb6bd748                          0         0         0       15h
thanos-querier-8678f9995c                          2         2         2       14h
```

```
# oc -n openshift-monitoring get pod -o wide | grep prometheus-operator-admission-webhook
prometheus-operator-admission-webhook-5b9ff5ddbf-p8kmz   0/1   Pending   0   44m   <none>        <none>                              <none>   <none>
prometheus-operator-admission-webhook-67f97cf577-46jvl   1/1   Running   0   13h   10.128.2.7    wduan-0714a-az-q2ps8-worker-4288p   <none>   <none>
prometheus-operator-admission-webhook-67f97cf577-r5mvt   1/1   Running   0   14h   10.131.0.9    wduan-0714a-az-q2ps8-worker-4cmx5   <none>   <none>
```

```
# oc -n openshift-monitoring describe pod prometheus-operator-admission-webhook-5b9ff5ddbf-p8kmz
...
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  44m   default-scheduler  0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  44m   default-scheduler  0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 Preemption is not helpful for scheduling.
```

Checked: maxSurge and maxUnavailable are 25%:

```
# oc -n openshift-monitoring get deploy prometheus-operator-admission-webhook -oyaml | grep -E "maxUnavailable|maxSurge"
      maxSurge: 25%
      maxUnavailable: 25%
```

From https://kubernetes.io/docs/concepts/workloads/controllers/deployment/:

Max Unavailable: .spec.strategy.rollingUpdate.maxUnavailable is an optional field that specifies the maximum number of Pods that can be unavailable during the update process. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The absolute number is calculated from the percentage by rounding down.

Max Surge: .spec.strategy.rollingUpdate.maxSurge is an optional field that specifies the maximum number of Pods that can be created over the desired number of Pods. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The value cannot be 0 if MaxUnavailable is 0. The absolute number is calculated from the percentage by rounding up.
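Combining those two rounding rules with the webhook's required hostname anti-affinity (at most one webhook pod per worker), the rollout deadlock on a 2-worker cluster can be modeled in a simplified Python sketch. This is a hypothetical illustration, not the real scheduler or deployment controller:

```python
import math


def rolling_update_progresses(workers: int, replicas: int,
                              max_unavailable_pct: int,
                              max_surge_pct: int) -> bool:
    """Can a rolling update replace even one pod, assuming required
    hostname anti-affinity limits the deployment to one pod per worker?"""
    max_unavailable = math.floor(replicas * max_unavailable_pct / 100)
    max_surge = math.ceil(replicas * max_surge_pct / 100)

    # Path 1: remove an old pod first, freeing a node for the new pod.
    can_scale_down_first = max_unavailable >= 1
    # Path 2: surge a new pod first; it needs a worker without a webhook pod.
    free_workers = workers - replicas
    can_surge_first = max_surge >= 1 and free_workers >= 1

    return can_scale_down_first or can_surge_first


# 2 workers, 2 replicas, 25%/25%: maxUnavailable = 0 and no free worker,
# so the surge pod stays Pending forever.
print(rolling_update_progresses(2, 2, 25, 25))  # False
# With maxUnavailable resolving to 1 (the fix), an old pod can be removed first.
print(rolling_update_progresses(2, 2, 50, 25))  # True
```

The sketch shows why the stuck state is specific to clusters where the worker count equals the replica count: with a third worker available, the surge pod would schedule and the rollout would proceed even with maxUnavailable = 0.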
For maxSurge the absolute number is calculated from the percentage by rounding up, and for maxUnavailable by rounding down, so during a rollout the number of existing prometheus-operator-admission-webhook replicas stays between `replicas - maxUnavailable` and `replicas + maxSurge`.

For the 2-worker cluster: maxUnavailable = 2 × 25%, rounded down, is 0; maxSurge = 2 × 25%, rounded up, is 1. The replica count is therefore pinned to 2 <= replicas <= 3. Because of the podAntiAffinity rule, no additional prometheus-operator-admission-webhook pod can be scheduled, and with maxUnavailable = 0 no old pod is ever removed. Changing maxUnavailable to 1 may fix the issue on 2-worker clusters.

```yaml
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: prometheus-operator-admission-webhook
            app.kubernetes.io/part-of: openshift-monitoring
        namespaces:
        - openshift-monitoring
        topologyKey: kubernetes.io/hostname
```

Setting blocker+ with JP's ack, since this can cause customer upgrades to get stuck.

ipi-on-vsphere, 2-worker cluster, upgraded from 4.10.0-0.nightly-2022-07-16-173050 to 4.11.0-0.nightly-2022-07-16-020951, then on to 4.12.0-0.nightly-2022-07-15-235709 and 4.12.0-0.nightly-2022-07-17-174647. The prometheus-operator-admission-webhook pods and the monitoring operator are normal; no issue now.

```
# oc get node
NAME                     STATUS   ROLES    AGE     VERSION
***-4g69p-master-0       Ready    master   4h50m   v1.24.0+9546431
***-4g69p-master-1       Ready    master   4h49m   v1.24.0+9546431
***-4g69p-master-2       Ready    master   4h50m   v1.24.0+9546431
***-4g69p-worker-v4l9p   Ready    worker   4h40m   v1.24.0+9546431
***-4g69p-worker-vr5md   Ready    worker   4h40m   v1.24.0+9546431
```

```
# oc get clusterversion version -oyaml
...
  desired:
    image: registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-07-17-174647
    version: 4.12.0-0.nightly-2022-07-17-174647
  history:
  - completionTime: "2022-07-18T08:38:45Z"
    image: registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-07-17-174647
    startedTime: "2022-07-18T08:11:24Z"
    state: Completed
    verified: false
    version: 4.12.0-0.nightly-2022-07-17-174647
  - completionTime: "2022-07-18T07:38:45Z"
    image: registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-07-15-235709
    startedTime: "2022-07-18T06:40:15Z"
    state: Completed
    verified: false
    version: 4.12.0-0.nightly-2022-07-15-235709
  - completionTime: "2022-07-18T04:12:49Z"
    image: registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-07-16-020951
    startedTime: "2022-07-18T03:08:48Z"
    state: Completed
    verified: false
    version: 4.11.0-0.nightly-2022-07-16-020951
  - completionTime: "2022-07-18T02:39:46Z"
    image: registry.ci.openshift.org/ocp/release@sha256:e56645bcabe38850b37f2c7a70fbf890819eedda63898c703b78c1d5005a91ae
    startedTime: "2022-07-18T02:22:23Z"
    state: Completed
    verified: false
    version: 4.10.0-0.nightly-2022-07-16-173050
```

```
# oc get co monitoring
NAME         VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
monitoring   4.12.0-0.nightly-2022-07-17-174647   True        False         False      6h21m
```

```
# oc -n openshift-monitoring get pod -o wide | grep prometheus-operator-admission-webhook
prometheus-operator-admission-webhook-6cff644fdf-6j92x   1/1   Running   0   90m   10.131.0.6    ***-4g69p-worker-v4l9p   <none>   <none>
prometheus-operator-admission-webhook-6cff644fdf-mdj26   1/1   Running   0   95m   10.128.2.10   ***-4g69p-worker-vr5md   <none>   <none>
```

```
# oc -n openshift-monitoring get deploy prometheus-operator-admission-webhook -oyaml | grep -E "maxUnavailable|maxSurge"
      maxSurge: 25%
      maxUnavailable: 1
```

```
# oc -n openshift-monitoring get rs | grep prometheus-operator-admission-webhook
prometheus-operator-admission-webhook-67f97cf577   0   0   0   5h25m
prometheus-operator-admission-webhook-6cff644fdf   2   2   2   115m
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399