Description of problem:
In an IPI on vSphere 7.0 & OVN & WindowsContainer 4.11.0-0.nightly-2022-05-11-054135 CI cluster (see the attached must-gather file), there are 3 CoreOS masters and 2 CoreOS workers; 2 Windows workers were added later:

NAME                            STATUS   ROLES    AGE   VERSION
qeci-38095-7q4s8-master-0       Ready    master   54m   v1.23.3+69213f8
qeci-38095-7q4s8-master-1       Ready    master   54m   v1.23.3+69213f8
qeci-38095-7q4s8-master-2       Ready    master   54m   v1.23.3+69213f8
qeci-38095-7q4s8-worker-7ktnh   Ready    worker   45m   v1.23.3+69213f8
qeci-38095-7q4s8-worker-jnmg2   Ready    worker   45m   v1.23.3+69213f8
winworker-lwlmv                 Ready    worker   12m   v1.23.3-2034+eccb3856381a4e
winworker-ntmn4                 Ready    worker   17m   v1.23.3-2034+eccb3856381a4e

After running some CI cases, monitoring is degraded:

- lastTransitionTime: "2022-05-14T18:39:41Z"
  message: 'Failed to rollout the stack. Error: updating prometheus operator: reconciling
    Prometheus Operator Admission Webhook Deployment failed: updating Deployment object
    failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook:
    the number of pods targeted by the deployment (3 pods) is different from the number
    of pods targeted by the deployment that have the desired template spec (2 pods)'
  reason: UpdatingPrometheusOperatorFailed
  status: "True"
  type: Degraded

The prometheus-operator-admission-webhook deployment expects 2 pods, but in the deployments.yaml file from the must-gather, status.readyReplicas=1, status.replicas=3, status.updatedReplicas=2:

- apiVersion: apps/v1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "6"
    creationTimestamp: "2022-05-13T16:39:08Z"
    generation: 6
    labels:
      app.kubernetes.io/managed-by: cluster-monitoring-operator
      app.kubernetes.io/name: prometheus-operator-admission-webhook
      app.kubernetes.io/part-of: openshift-monitoring
      app.kubernetes.io/version: 0.55.1
  ....
  spec:
    progressDeadlineSeconds: 600
    replicas: 2
    revisionHistoryLimit: 10
  ....
  status:
    availableReplicas: 1
    conditions:
    - lastTransitionTime: "2022-05-14T01:26:46Z"
      lastUpdateTime: "2022-05-14T01:26:46Z"
      message: Deployment does not have minimum availability.
      reason: MinimumReplicasUnavailable
      status: "False"
      type: Available
    - lastTransitionTime: "2022-05-14T18:44:42Z"
      lastUpdateTime: "2022-05-14T18:44:42Z"
      message: ReplicaSet "prometheus-operator-admission-webhook-75756c6c85" has timed out progressing.
      reason: ProgressDeadlineExceeded
      status: "False"
      type: Progressing
    observedGeneration: 6
    readyReplicas: 1
    replicas: 3
    unavailableReplicas: 2
    updatedReplicas: 2

There are 4 prometheus-operator-admission-webhook pods in the must-gather:

prometheus-operator-admission-webhook-57848c5cf5-45qv4
# pods under the prometheus-operator-admission-webhook-75756c6c85 ReplicaSet
prometheus-operator-admission-webhook-75756c6c85-9w6pm
prometheus-operator-admission-webhook-75756c6c85-bll98
prometheus-operator-admission-webhook-75756c6c85-c2xvn

The error in the prometheus-operator-admission-webhook-57848c5cf5-45qv4.yaml file is:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-05-14T18:29:41Z"
    message: '0/7 nodes are available: 1 node(s) didn''t match pod anti-affinity rules,
      1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn''t tolerate,
      2 node(s) had taint {os: Windows}, that the pod didn''t tolerate, 3 node(s) had
      taint {node-role.kubernetes.io/master: }, that the pod didn''t tolerate.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable

The same applies to prometheus-operator-admission-webhook-75756c6c85-9w6pm.yaml:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-05-14T18:34:41Z"
    message: '0/7 nodes are available: 1 node(s) didn''t match pod anti-affinity rules,
      1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn''t tolerate,
      2 node(s) had taint {os: Windows}, that the pod didn''t tolerate, 3 node(s) had
      taint {node-role.kubernetes.io/master: }, that the pod didn''t tolerate.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable

The only remaining candidates for these pods are the Windows nodes, but the Windows nodes carry the taint:

taints:
- effect: NoSchedule
  key: os
  value: Windows

so the pods cannot be scheduled on the Windows nodes, which is expected. The other two pods, prometheus-operator-admission-webhook-75756c6c85-bll98.yaml and prometheus-operator-admission-webhook-75756c6c85-c2xvn.yaml, are normal.

The prometheus-operator-admission-webhook affinity setting is:

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: prometheus-operator-admission-webhook
            app.kubernetes.io/part-of: openshift-monitoring
        namespaces:
        - openshift-monitoring
        topologyKey: kubernetes.io/hostname

This looks more like a kube-scheduler issue, but it is filed against Monitoring first.

Version-Release number of selected component (if applicable):
IPI on vSphere 7.0 & OVN & WindowsContainer 4.11.0-0.nightly-2022-05-11-054135

How reproducible:
Not sure

Steps to Reproduce:
1. Check the monitoring status from the must-gather
2.
3.

Actual results:
Monitoring is degraded

Expected results:
Monitoring should be normal

Additional info:
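For reference, a minimal sketch of how to pull the same information from a live cluster instead of the must-gather (assuming oc access with sufficient privileges; namespace, deployment name, and label are taken from the manifests above). The first command prints the Degraded message from the monitoring ClusterOperator, the second the Deployment spec/status, the third the webhook pods with the nodes they landed on:

# oc get co monitoring -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}{"\n"}'
# oc -n openshift-monitoring get deploy prometheus-operator-admission-webhook -o yaml
# oc -n openshift-monitoring get pod -l app.kubernetes.io/name=prometheus-operator-admission-webhook -o wide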
Found the issue again in a Disconnected UPI on Azure & OVN IPsec cluster upgraded from 4.10.0-0.nightly-2022-06-08-150219 to 4.11.0-0.nightly-2022-06-23-044003. Monitoring is degraded:

status:
  conditions:
  - lastTransitionTime: "2022-06-23T23:20:58Z"
    message: Rolling out the stack.
    reason: RollOutInProgress
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-06-23T22:35:57Z"
    message: 'Failed to rollout the stack. Error: updating prometheus operator: reconciling
      Prometheus Operator Admission Webhook Deployment failed: updating Deployment object
      failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook:
      the number of pods targeted by the deployment (3 pods) is different from the number
      of pods targeted by the deployment that have the desired template spec (1 pods)'
    reason: UpdatingPrometheusOperatorFailed
    status: "True"
    type: Degraded

deployment.yaml file:

status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2022-06-23T21:22:50Z"
    lastUpdateTime: "2022-06-23T21:22:50Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2022-06-23T22:30:57Z"
    lastUpdateTime: "2022-06-23T22:30:57Z"
    message: ReplicaSet "prometheus-operator-admission-webhook-78fd9c895d" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
  observedGeneration: 2
  readyReplicas: 2
  replicas: 3
  unavailableReplicas: 1
  updatedReplicas: 1

Checking the prometheus-operator-admission-webhook pods, there are 2 normally running pods:

prometheus-operator-admission-webhook-74f7bb977f-b9vxd
prometheus-operator-admission-webhook-74f7bb977f-lq5xl

and 1 Pending pod from ReplicaSet prometheus-operator-admission-webhook-78fd9c895d, matching the error in deployment.yaml:

prometheus-operator-admission-webhook-78fd9c895d-bjqwj

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-06-23T22:20:56Z"
    message: '0/5 nodes are available: 2 node(s) didn''t match pod anti-affinity rules,
      3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption:
      0/5 nodes are available: 2 node(s) didn''t match pod anti-affinity rules, 3 Preemption
      is not helpful for scheduling.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable
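When hunting for this state on a live cluster rather than a must-gather, a minimal sketch (assuming oc access; the ReplicaSet hashes will differ per cluster). The first command shows the new ReplicaSet stuck at READY 0 next to the old, fully ready one; the other two surface the Pending pod and its FailedScheduling events:

# oc -n openshift-monitoring get rs -l app.kubernetes.io/name=prometheus-operator-admission-webhook
# oc -n openshift-monitoring get pod --field-selector=status.phase=Pending
# oc -n openshift-monitoring get events --field-selector reason=FailedScheduling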
*** Bug 2105174 has been marked as a duplicate of this bug. ***
Encountered the issue again.

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-06-30-005428

1. % oc get node
NAME                       STATUS   ROLES    AGE     VERSION
ge1n1-b5fdg-master-0       Ready    master   4d22h   v1.24.0+9ddc8b1
ge1n1-b5fdg-master-1       Ready    master   4d22h   v1.24.0+9ddc8b1
ge1n1-b5fdg-master-2       Ready    master   4d22h   v1.24.0+9ddc8b1
ge1n1-b5fdg-worker-qjshm   Ready    worker   4d22h   v1.24.0+9ddc8b1
ge1n1-b5fdg-worker-vjcsp   Ready    worker   4d22h   v1.24.0+9ddc8b1

2. % oc -n openshift-monitoring describe pod prometheus-operator-admission-webhook-555d9654f8-ft2jv
---
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  10h (x292 over 18h)  default-scheduler  0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 Preemption is not helpful for scheduling.

3. % oc describe co monitoring
    Last Transition Time:  2022-07-07T13:02:15Z
    Message:               Failed to rollout the stack. Error: updating prometheus operator: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: the number of pods targeted by the deployment (3 pods) is different from the number of pods targeted by the deployment that have the desired template spec (1 pods)
    Reason:                UpdatingPrometheusOperatorFailed
    Status:                True
    Type:                  Degraded
  Extension:               <nil>
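A compact alternative to oc describe for pulling just the scheduler's message out of the stuck pod (a sketch; substitute the Pending pod name observed in your cluster):

# oc -n openshift-monitoring get pod prometheus-operator-admission-webhook-555d9654f8-ft2jv -o jsonpath='{.status.conditions[?(@.type=="PodScheduled")].message}{"\n"}'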
During a rollout, the number of available prometheus-operator-admission-webhook replicas must stay between 'replicas - maxUnavailable' and 'replicas + maxSurge'. With replicas=2 and maxUnavailable=25%, the absolute maxUnavailable rounds down (2 * 25% = 0), so no existing pod may be taken down during the rollout. I suspect the maxUnavailable and maxSurge values are not reasonable for this deployment. Both maxSurge and maxUnavailable can be specified as either an integer (e.g. 2) or a percentage (e.g. 50%), and they cannot both be zero. When specified as an integer, the value is the absolute number of pods; when specified as a percentage, the absolute number is computed from the desired number of pods (maxUnavailable is rounded down, maxSurge is rounded up).
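As a quick illustration of that rounding for this deployment (a minimal bash sketch, not taken from the cluster; the 2-replica / 25% figures match the Deployment quoted earlier):

replicas=2
pct=25
maxUnavailable=$(( replicas * pct / 100 ))    # integer division rounds down: 2 * 25% -> 0
maxSurge=$(( (replicas * pct + 99) / 100 ))   # add 99 before dividing to round up: 2 * 25% -> 1
echo "available replicas must stay within $(( replicas - maxUnavailable )) .. $(( replicas + maxSurge ))"
# prints: available replicas must stay within 2 .. 3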
Found the issue again in a 4.11 to 4.12 upgrade cluster which has 2 workers:

NAME                                STATUS   ROLES                  AGE   VERSION
wduan-0714a-az-q2ps8-master-0       Ready    control-plane,master   20h   v1.24.0+9546431
wduan-0714a-az-q2ps8-master-1       Ready    control-plane,master   20h   v1.24.0+9546431
wduan-0714a-az-q2ps8-master-2       Ready    control-plane,master   20h   v1.24.0+9546431
wduan-0714a-az-q2ps8-worker-4288p   Ready    worker                 20h   v1.24.0+9546431
wduan-0714a-az-q2ps8-worker-4cmx5   Ready    worker                 20h   v1.24.0+9546431

# oc -n openshift-monitoring get rs
NAME                                               DESIRED   CURRENT   READY   AGE
cluster-monitoring-operator-5bbfd998c6             0         0         0       15h
cluster-monitoring-operator-7df87f6db              0         0         0       14h
cluster-monitoring-operator-848799d46f             1         1         1       42m
cluster-monitoring-operator-f794d99fb              0         0         0       20h
kube-state-metrics-6d685b8687                      0         0         0       15h
kube-state-metrics-764c7cbcb                       0         0         0       20h
kube-state-metrics-f64d8dfd5                       1         1         1       14h
openshift-state-metrics-65f58d4c67                 0         0         0       15h
openshift-state-metrics-686fcb4d58                 1         1         1       14h
prometheus-adapter-58c87f9b89                      0         0         0       14h
prometheus-adapter-757f89985b                      0         0         0       93m
prometheus-adapter-77749dbccb                      0         0         0       92m
prometheus-adapter-78745d5d6f                      2         2         2       91m
prometheus-adapter-7bd4745484                      0         0         0       20h
prometheus-adapter-c49dd8496                       0         0         0       15h
prometheus-operator-5c58cf67d5                     0         0         0       20h
prometheus-operator-76f79c8d85                     0         0         0       15h
prometheus-operator-7cdc95fdb4                     1         1         1       14h
prometheus-operator-admission-webhook-5b9ff5ddbf   1         1         0       41m
prometheus-operator-admission-webhook-67f97cf577   2         2         2       14h
telemeter-client-66989bbb48                        1         1         1       14h
telemeter-client-fb5b896cf                         0         0         0       15h
thanos-querier-6cfb6bd748                          0         0         0       15h
thanos-querier-8678f9995c                          2         2         2       14h

# oc -n openshift-monitoring get pod -o wide | grep prometheus-operator-admission-webhook
prometheus-operator-admission-webhook-5b9ff5ddbf-p8kmz   0/1   Pending   0   44m   <none>        <none>                              <none>   <none>
prometheus-operator-admission-webhook-67f97cf577-46jvl   1/1   Running   0   13h   10.128.2.7    wduan-0714a-az-q2ps8-worker-4288p   <none>   <none>
prometheus-operator-admission-webhook-67f97cf577-r5mvt   1/1   Running   0   14h   10.131.0.9    wduan-0714a-az-q2ps8-worker-4cmx5   <none>   <none>

# oc -n openshift-monitoring describe pod prometheus-operator-admission-webhook-5b9ff5ddbf-p8kmz
...
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  44m   default-scheduler  0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  44m   default-scheduler  0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 Preemption is not helpful for scheduling.

Checked: maxSurge and maxUnavailable are both 25%.

# oc -n openshift-monitoring get deploy prometheus-operator-admission-webhook -oyaml | grep -E "maxUnavailable|maxSurge"
      maxSurge: 25%
      maxUnavailable: 25%

From https://kubernetes.io/docs/concepts/workloads/controllers/deployment/:

Max Unavailable
.spec.strategy.rollingUpdate.maxUnavailable is an optional field that specifies the maximum number of Pods that can be unavailable during the update process. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The absolute number is calculated from the percentage by rounding down.

Max Surge
.spec.strategy.rollingUpdate.maxSurge is an optional field that specifies the maximum number of Pods that can be created over the desired number of Pods. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The value cannot be 0 if MaxUnavailable is 0. The absolute number is calculated from the percentage by rounding up.

Since the absolute number for maxSurge is calculated from the percentage by rounding up, and the absolute number for maxUnavailable is calculated by rounding down, the existing prometheus-operator-admission-webhook replica count stays between [replicas - maxUnavailable] and [replicas + maxSurge].

For the 2-worker cluster:
maxUnavailable replicas = 2 * 25%, rounding down, result is 0
maxSurge replicas = 2 * 25%, rounding up, result is 1
so the existing replica count satisfies 2 - 0 <= replicas <= 3. Because of the podAntiAffinity rule below, no additional prometheus-operator-admission-webhook pod can be scheduled, and no old pod is removed, so the rollout is stuck. Changing maxUnavailable to 1 may fix the issue on a 2-worker cluster.

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: prometheus-operator-admission-webhook
            app.kubernetes.io/part-of: openshift-monitoring
        namespaces:
        - openshift-monitoring
        topologyKey: kubernetes.io/hostname
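For anyone who wants to confirm that hypothesis on a test cluster, a hedged sketch of the manual patch (this is not the actual fix; the cluster-monitoring-operator owns this Deployment and may revert a manual edit on its next reconcile):

# oc -n openshift-monitoring patch deployment prometheus-operator-admission-webhook --type=merge -p '{"spec":{"strategy":{"rollingUpdate":{"maxUnavailable":1}}}}'
# oc -n openshift-monitoring rollout status deployment/prometheus-operator-admission-webhook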
Setting blocker+ with JP's ack since this can cause customer upgrades to get stuck.
ipi-on-vsphere, 2-worker cluster, upgraded from 4.10.0-0.nightly-2022-07-16-173050 to 4.11.0-0.nightly-2022-07-16-020951, then continued upgrading to 4.12.0-0.nightly-2022-07-15-235709 and 4.12.0-0.nightly-2022-07-17-174647. The prometheus-operator-admission-webhook pods and the monitoring operator are normal; no issue now.

# oc get node
NAME                     STATUS   ROLES    AGE     VERSION
***-4g69p-master-0       Ready    master   4h50m   v1.24.0+9546431
***-4g69p-master-1       Ready    master   4h49m   v1.24.0+9546431
***-4g69p-master-2       Ready    master   4h50m   v1.24.0+9546431
***-4g69p-worker-v4l9p   Ready    worker   4h40m   v1.24.0+9546431
***-4g69p-worker-vr5md   Ready    worker   4h40m   v1.24.0+9546431

# oc get clusterversion version -oyaml
...
  desired:
    image: registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-07-17-174647
    version: 4.12.0-0.nightly-2022-07-17-174647
  history:
  - completionTime: "2022-07-18T08:38:45Z"
    image: registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-07-17-174647
    startedTime: "2022-07-18T08:11:24Z"
    state: Completed
    verified: false
    version: 4.12.0-0.nightly-2022-07-17-174647
  - completionTime: "2022-07-18T07:38:45Z"
    image: registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-07-15-235709
    startedTime: "2022-07-18T06:40:15Z"
    state: Completed
    verified: false
    version: 4.12.0-0.nightly-2022-07-15-235709
  - completionTime: "2022-07-18T04:12:49Z"
    image: registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-07-16-020951
    startedTime: "2022-07-18T03:08:48Z"
    state: Completed
    verified: false
    version: 4.11.0-0.nightly-2022-07-16-020951
  - completionTime: "2022-07-18T02:39:46Z"
    image: registry.ci.openshift.org/ocp/release@sha256:e56645bcabe38850b37f2c7a70fbf890819eedda63898c703b78c1d5005a91ae
    startedTime: "2022-07-18T02:22:23Z"
    state: Completed
    verified: false
    version: 4.10.0-0.nightly-2022-07-16-173050

# oc get co monitoring
NAME         VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
monitoring   4.12.0-0.nightly-2022-07-17-174647   True        False         False      6h21m

# oc -n openshift-monitoring get pod -o wide | grep prometheus-operator-admission-webhook
prometheus-operator-admission-webhook-6cff644fdf-6j92x   1/1   Running   0   90m   10.131.0.6    ***-4g69p-worker-v4l9p   <none>   <none>
prometheus-operator-admission-webhook-6cff644fdf-mdj26   1/1   Running   0   95m   10.128.2.10   ***-4g69p-worker-vr5md   <none>   <none>

# oc -n openshift-monitoring get deploy prometheus-operator-admission-webhook -oyaml | grep -E "maxUnavailable|maxSurge"
      maxSurge: 25%
      maxUnavailable: 1

# oc -n openshift-monitoring get rs | grep prometheus-operator-admission-webhook
prometheus-operator-admission-webhook-67f97cf577   0   0   0   5h25m
prometheus-operator-admission-webhook-6cff644fdf   2   2   2   115m
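To double-check that the new strategy actually lets a rollout complete on a 2-worker cluster, one possible sketch (assuming oc access; the restart only bumps the pod template annotation to trigger a fresh rollout, it does not change the spec):

# oc -n openshift-monitoring rollout restart deployment/prometheus-operator-admission-webhook
# oc -n openshift-monitoring rollout status deployment/prometheus-operator-admission-webhook --timeout=5m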
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399