Bug 2090988
| Summary: | ReplicaSet prometheus-operator-admission-webhook has timed out progressing | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Junqi Zhao <juzhao> |
| Component: | Monitoring | Assignee: | Jayapriya Pai <janantha> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | high | Docs Contact: | Brian Burt <bburt> |
| Priority: | high | | |
| Version: | 4.11 | CC: | anpicker, bburt, hongyli, jfajersk, jmarcal |
| Target Milestone: | --- | Keywords: | TestBlocker |
| Target Release: | 4.12.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-01-17 19:49:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2107493 | | |
Description
Junqi Zhao
2022-05-27 08:59:44 UTC
Found the issue again in a Disconnected UPI on Azure & OVN IPsec cluster upgraded from 4.10.0-0.nightly-2022-06-08-150219 to 4.11.0-0.nightly-2022-06-23-044003; monitoring is degraded:

```yaml
status:
  conditions:
  - lastTransitionTime: "2022-06-23T23:20:58Z"
    message: Rolling out the stack.
    reason: RollOutInProgress
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-06-23T22:35:57Z"
    message: 'Failed to rollout the stack. Error: updating prometheus operator:
      reconciling Prometheus Operator Admission Webhook Deployment failed: updating
      Deployment object failed: waiting for DeploymentRollout of
      openshift-monitoring/prometheus-operator-admission-webhook: the number of pods
      targeted by the deployment (3 pods) is different from the number of pods
      targeted by the deployment that have the desired template spec (1 pods)'
    reason: UpdatingPrometheusOperatorFailed
    status: "True"
    type: Degraded
```

deployment.yaml file status:

```yaml
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2022-06-23T21:22:50Z"
    lastUpdateTime: "2022-06-23T21:22:50Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2022-06-23T22:30:57Z"
    lastUpdateTime: "2022-06-23T22:30:57Z"
    message: ReplicaSet "prometheus-operator-admission-webhook-78fd9c895d" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
  observedGeneration: 2
  readyReplicas: 2
  replicas: 3
  unavailableReplicas: 1
  updatedReplicas: 1
```

Checking the prometheus-operator-admission-webhook pods, there are 2 normal Running pods:

```
prometheus-operator-admission-webhook-74f7bb977f-b9vxd
prometheus-operator-admission-webhook-74f7bb977f-lq5xl
```

and 1 Pending pod, whose ReplicaSet is prometheus-operator-admission-webhook-78fd9c895d, matching the error in deployment.yaml:

```yaml
# prometheus-operator-admission-webhook-78fd9c895d-bjqwj
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-06-23T22:20:56Z"
    message: '0/5 nodes are available: 2 node(s) didn''t match pod anti-affinity
      rules, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }.
      preemption: 0/5 nodes are available: 2 node(s) didn''t match pod anti-affinity
      rules, 3 Preemption is not helpful for scheduling.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable
```

*** Bug 2105174 has been marked as a duplicate of this bug. ***

Encountered the issue again.

Version-Release number of selected component (if applicable): 4.11.0-0.nightly-2022-06-30-005428

1.
```
% oc get node
NAME                       STATUS   ROLES    AGE     VERSION
ge1n1-b5fdg-master-0       Ready    master   4d22h   v1.24.0+9ddc8b1
ge1n1-b5fdg-master-1       Ready    master   4d22h   v1.24.0+9ddc8b1
ge1n1-b5fdg-master-2       Ready    master   4d22h   v1.24.0+9ddc8b1
ge1n1-b5fdg-worker-qjshm   Ready    worker   4d22h   v1.24.0+9ddc8b1
ge1n1-b5fdg-worker-vjcsp   Ready    worker   4d22h   v1.24.0+9ddc8b
```

2.
```
% oc -n openshift-monitoring describe pod prometheus-operator-admission-webhook-555d9654f8-ft2jv
---
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  10h (x292 over 18h)  default-scheduler  0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 Preemption is not helpful for scheduling.
```

3.
```
% oc describe co monitoring
  Last Transition Time:  2022-07-07T13:02:15Z
  Message:   Failed to rollout the stack. Error: updating prometheus operator: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: the number of pods targeted by the deployment (3 pods) is different from the number of pods targeted by the deployment that have the desired template spec (1 pods)
  Reason:    UpdatingPrometheusOperatorFailed
  Status:    True
  Type:      Degraded
  Extension: <nil>
```

During a rollout, the number of prometheus-operator-admission-webhook replicas ranges from `replicas - maxUnavailable` to `replicas + maxSurge`; with 2 replicas, maxUnavailable is 0 because 2 × 25% rounds down to zero. These maxUnavailable and maxSurge values do not seem reasonable here. Both maxSurge and maxUnavailable can be specified as either an integer (e.g. 2) or a percentage (e.g. 50%), and they cannot both be zero. When specified as an integer, it represents the actual number of pods; when specified as a percentage, that percentage of the desired number of pods is used (rounded down for maxUnavailable, rounded up for maxSurge).
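The percentage arithmetic above can be sketched in a few lines of Python. This is an illustration of the rounding rules, not Kubernetes' actual implementation (which lives in its apimachinery helpers):

```python
import math


def resolve_percent(percent: int, replicas: int, round_up: bool) -> int:
    """Scale a percentage against the desired replica count.
    Per the Deployment docs: maxSurge rounds up, maxUnavailable rounds down."""
    value = replicas * percent / 100
    return math.ceil(value) if round_up else math.floor(value)


def rollout_bounds(replicas: int, max_unavailable_pct: int, max_surge_pct: int):
    """Return the (min, max) total pod counts a rolling update may pass through."""
    max_unavailable = resolve_percent(max_unavailable_pct, replicas, round_up=False)
    max_surge = resolve_percent(max_surge_pct, replicas, round_up=True)
    return (replicas - max_unavailable, replicas + max_surge)


# The 2-worker cluster from this report: 2 replicas, 25%/25% defaults.
# maxUnavailable = floor(0.5) = 0, maxSurge = ceil(0.5) = 1.
print(rollout_bounds(2, 25, 25))  # (2, 3)
```

With bounds of (2, 3), the controller may add one surge pod but may never remove an old pod, which is exactly the state the degraded cluster is stuck in.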
Found the issue again in a 4.11 to 4.12 upgrade cluster which has 2 workers:

```
NAME                                STATUS   ROLES                  AGE   VERSION
wduan-0714a-az-q2ps8-master-0       Ready    control-plane,master   20h   v1.24.0+9546431
wduan-0714a-az-q2ps8-master-1       Ready    control-plane,master   20h   v1.24.0+9546431
wduan-0714a-az-q2ps8-master-2       Ready    control-plane,master   20h   v1.24.0+9546431
wduan-0714a-az-q2ps8-worker-4288p   Ready    worker                 20h   v1.24.0+9546431
wduan-0714a-az-q2ps8-worker-4cmx5   Ready    worker                 20h   v1.24.0+9546431
```

```
# oc -n openshift-monitoring get rs
NAME                                               DESIRED   CURRENT   READY   AGE
cluster-monitoring-operator-5bbfd998c6             0         0         0       15h
cluster-monitoring-operator-7df87f6db              0         0         0       14h
cluster-monitoring-operator-848799d46f             1         1         1       42m
cluster-monitoring-operator-f794d99fb              0         0         0       20h
kube-state-metrics-6d685b8687                      0         0         0       15h
kube-state-metrics-764c7cbcb                       0         0         0       20h
kube-state-metrics-f64d8dfd5                       1         1         1       14h
openshift-state-metrics-65f58d4c67                 0         0         0       15h
openshift-state-metrics-686fcb4d58                 1         1         1       14h
prometheus-adapter-58c87f9b89                      0         0         0       14h
prometheus-adapter-757f89985b                      0         0         0       93m
prometheus-adapter-77749dbccb                      0         0         0       92m
prometheus-adapter-78745d5d6f                      2         2         2       91m
prometheus-adapter-7bd4745484                      0         0         0       20h
prometheus-adapter-c49dd8496                       0         0         0       15h
prometheus-operator-5c58cf67d5                     0         0         0       20h
prometheus-operator-76f79c8d85                     0         0         0       15h
prometheus-operator-7cdc95fdb4                     1         1         1       14h
prometheus-operator-admission-webhook-5b9ff5ddbf   1         1         0       41m
prometheus-operator-admission-webhook-67f97cf577   2         2         2       14h
telemeter-client-66989bbb48                        1         1         1       14h
telemeter-client-fb5b896cf                         0         0         0       15h
thanos-querier-6cfb6bd748                          0         0         0       15h
thanos-querier-8678f9995c                          2         2         2       14h
```

```
# oc -n openshift-monitoring get pod -o wide | grep prometheus-operator-admission-webhook
prometheus-operator-admission-webhook-5b9ff5ddbf-p8kmz   0/1   Pending   0   44m   <none>        <none>                              <none>   <none>
prometheus-operator-admission-webhook-67f97cf577-46jvl   1/1   Running   0   13h   10.128.2.7    wduan-0714a-az-q2ps8-worker-4288p   <none>   <none>
prometheus-operator-admission-webhook-67f97cf577-r5mvt   1/1   Running   0   14h   10.131.0.9    wduan-0714a-az-q2ps8-worker-4cmx5   <none>   <none>
```

```
# oc -n openshift-monitoring describe pod prometheus-operator-admission-webhook-5b9ff5ddbf-p8kmz
...
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  44m   default-scheduler  0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  44m   default-scheduler  0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 Preemption is not helpful for scheduling.
```

Checked: maxSurge and maxUnavailable are 25%:

```
# oc -n openshift-monitoring get deploy prometheus-operator-admission-webhook -oyaml | grep -E "maxUnavailable|maxSurge"
      maxSurge: 25%
      maxUnavailable: 25%
```

From https://kubernetes.io/docs/concepts/workloads/controllers/deployment/:

Max Unavailable: .spec.strategy.rollingUpdate.maxUnavailable is an optional field that specifies the maximum number of Pods that can be unavailable during the update process. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The absolute number is calculated from the percentage by rounding down.

Max Surge: .spec.strategy.rollingUpdate.maxSurge is an optional field that specifies the maximum number of Pods that can be created over the desired number of Pods. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The value cannot be 0 if MaxUnavailable is 0. The absolute number is calculated from the percentage by rounding up.
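Combining those two rounding rules with the webhook's required hostname anti-affinity (at most one webhook pod per worker), the rollout deadlock on a 2-worker cluster can be modeled in a simplified Python sketch. This is a hypothetical illustration, not the real scheduler or deployment controller:

```python
import math


def rolling_update_progresses(workers: int, replicas: int,
                              max_unavailable_pct: int,
                              max_surge_pct: int) -> bool:
    """Can a rolling update replace even one pod, assuming required
    hostname anti-affinity limits the deployment to one pod per worker?"""
    max_unavailable = math.floor(replicas * max_unavailable_pct / 100)
    max_surge = math.ceil(replicas * max_surge_pct / 100)

    # Path 1: remove an old pod first, freeing a node for the new pod.
    can_scale_down_first = max_unavailable >= 1
    # Path 2: surge a new pod first; it needs a worker without a webhook pod.
    free_workers = workers - replicas
    can_surge_first = max_surge >= 1 and free_workers >= 1

    return can_scale_down_first or can_surge_first


# 2 workers, 2 replicas, 25%/25%: maxUnavailable = 0 and no free worker,
# so the surge pod stays Pending forever.
print(rolling_update_progresses(2, 2, 25, 25))  # False
# With maxUnavailable resolving to 1 (the fix), an old pod can be removed first.
print(rolling_update_progresses(2, 2, 50, 25))  # True
```

The sketch shows why the stuck state is specific to clusters where the worker count equals the replica count: with a third worker available, the surge pod would schedule and the rollout would proceed even with maxUnavailable = 0.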
For maxSurge the absolute number is calculated from the percentage by rounding up, and for maxUnavailable by rounding down, so during a rollout the number of existing prometheus-operator-admission-webhook replicas stays between `replicas - maxUnavailable` and `replicas + maxSurge`.

For the 2-worker cluster: maxUnavailable = 2 × 25%, rounded down, is 0; maxSurge = 2 × 25%, rounded up, is 1. The replica count is therefore pinned to 2 <= replicas <= 3. Because of the podAntiAffinity rule, no additional prometheus-operator-admission-webhook pod can be scheduled, and with maxUnavailable = 0 no old pod is ever removed. Changing maxUnavailable to 1 may fix the issue on 2-worker clusters.

```yaml
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: prometheus-operator-admission-webhook
            app.kubernetes.io/part-of: openshift-monitoring
        namespaces:
        - openshift-monitoring
        topologyKey: kubernetes.io/hostname
```

Setting blocker+ with JP's ack, since this can cause customer upgrades to get stuck.

ipi-on-vsphere, 2-worker cluster, upgraded from 4.10.0-0.nightly-2022-07-16-173050 to 4.11.0-0.nightly-2022-07-16-020951, then on to 4.12.0-0.nightly-2022-07-15-235709 and 4.12.0-0.nightly-2022-07-17-174647. The prometheus-operator-admission-webhook pods and the monitoring operator are normal; no issue now.

```
# oc get node
NAME                     STATUS   ROLES    AGE     VERSION
***-4g69p-master-0       Ready    master   4h50m   v1.24.0+9546431
***-4g69p-master-1       Ready    master   4h49m   v1.24.0+9546431
***-4g69p-master-2       Ready    master   4h50m   v1.24.0+9546431
***-4g69p-worker-v4l9p   Ready    worker   4h40m   v1.24.0+9546431
***-4g69p-worker-vr5md   Ready    worker   4h40m   v1.24.0+9546431
```

```
# oc get clusterversion version -oyaml
...
  desired:
    image: registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-07-17-174647
    version: 4.12.0-0.nightly-2022-07-17-174647
  history:
  - completionTime: "2022-07-18T08:38:45Z"
    image: registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-07-17-174647
    startedTime: "2022-07-18T08:11:24Z"
    state: Completed
    verified: false
    version: 4.12.0-0.nightly-2022-07-17-174647
  - completionTime: "2022-07-18T07:38:45Z"
    image: registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-07-15-235709
    startedTime: "2022-07-18T06:40:15Z"
    state: Completed
    verified: false
    version: 4.12.0-0.nightly-2022-07-15-235709
  - completionTime: "2022-07-18T04:12:49Z"
    image: registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-07-16-020951
    startedTime: "2022-07-18T03:08:48Z"
    state: Completed
    verified: false
    version: 4.11.0-0.nightly-2022-07-16-020951
  - completionTime: "2022-07-18T02:39:46Z"
    image: registry.ci.openshift.org/ocp/release@sha256:e56645bcabe38850b37f2c7a70fbf890819eedda63898c703b78c1d5005a91ae
    startedTime: "2022-07-18T02:22:23Z"
    state: Completed
    verified: false
    version: 4.10.0-0.nightly-2022-07-16-173050
```

```
# oc get co monitoring
NAME         VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
monitoring   4.12.0-0.nightly-2022-07-17-174647   True        False         False      6h21m
```

```
# oc -n openshift-monitoring get pod -o wide | grep prometheus-operator-admission-webhook
prometheus-operator-admission-webhook-6cff644fdf-6j92x   1/1   Running   0   90m   10.131.0.6    ***-4g69p-worker-v4l9p   <none>   <none>
prometheus-operator-admission-webhook-6cff644fdf-mdj26   1/1   Running   0   95m   10.128.2.10   ***-4g69p-worker-vr5md   <none>   <none>
```

```
# oc -n openshift-monitoring get deploy prometheus-operator-admission-webhook -oyaml | grep -E "maxUnavailable|maxSurge"
      maxSurge: 25%
      maxUnavailable: 1
```

```
# oc -n openshift-monitoring get rs | grep prometheus-operator-admission-webhook
prometheus-operator-admission-webhook-67f97cf577   0   0   0   5h25m
prometheus-operator-admission-webhook-6cff644fdf   2   2   2   115m
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399