2105174 – Reconciling Prometheus Operator Admission Webhook Deployment failed on RHV 2 work node cluster

Keywords:
Status:	CLOSED DUPLICATE of bug 2090988
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.11
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Simon Pasquier
QA Contact:	hongyan li
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-07-08 08:05 UTC by hongyan li
Modified:	2022-07-08 08:33 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-07-08 08:12:30 UTC
Target Upstream Version:
Embargoed:

Description hongyan li 2022-07-08 08:05:28 UTC

Description of problem:
Reconciling the Prometheus Operator Admission Webhook Deployment fails on an RHV cluster with 2 worker nodes.

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-06-30-005428

How reproducible:
always

Steps to Reproduce:
1. Install OCP on a cluster with 2 worker nodes
2. % oc get node
NAME                       STATUS   ROLES    AGE     VERSION
ge1n1-b5fdg-master-0       Ready    master   4d22h   v1.24.0+9ddc8b1
ge1n1-b5fdg-master-1       Ready    master   4d22h   v1.24.0+9ddc8b1
ge1n1-b5fdg-master-2       Ready    master   4d22h   v1.24.0+9ddc8b1
ge1n1-b5fdg-worker-qjshm   Ready    worker   4d22h   v1.24.0+9ddc8b1
ge1n1-b5fdg-worker-vjcsp   Ready    worker   4d22h   v1.24.0+9ddc8b
3. % oc -n openshift-monitoring describe pod prometheus-operator-admission-webhook-555d9654f8-ft2jv 
---
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  10h (x292 over 18h)  default-scheduler  0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/5 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 3 Preemption is not helpful for scheduling.
4. % oc describe co monitoring
    Last Transition Time:  2022-07-07T13:02:15Z
    Message:               Failed to rollout the stack. Error: updating prometheus operator: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: the number of pods targeted by the deployment (3 pods) is different from the number of pods targeted by the deployment that have the desired template spec (1 pods)
    Reason:                UpdatingPrometheusOperatorFailed
    Status:                True
    Type:                  Degraded
  Extension:               <nil>

% oc -n openshift-monitoring get deployment prometheus-operator-admission-webhook -oyaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "4"
  creationTimestamp: "2022-07-03T09:00:12Z"
  generation: 4
  labels:
    app.kubernetes.io/managed-by: cluster-monitoring-operator
    app.kubernetes.io/name: prometheus-operator-admission-webhook
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.57.0
  name: prometheus-operator-admission-webhook
  namespace: openshift-monitoring
  resourceVersion: "2843880"
  uid: 8174d183-e7f7-4fd0-aed6-67a0255a40f0
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: prometheus-operator-admission-webhook
      app.kubernetes.io/part-of: openshift-monitoring
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/default-container: prometheus-operator-admission-webhook
        target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
      creationTimestamp: null
      labels:
        app.kubernetes.io/managed-by: cluster-monitoring-operator
        app.kubernetes.io/name: prometheus-operator-admission-webhook
        app.kubernetes.io/part-of: openshift-monitoring
        app.kubernetes.io/version: 0.57.0
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/name: prometheus-operator-admission-webhook
                app.kubernetes.io/part-of: openshift-monitoring
            namespaces:
            - openshift-monitoring
            topologyKey: kubernetes.io/hostname
      automountServiceAccountToken: false


Actual results:


Expected results:


Additional info:

Comment 1 hongyan li 2022-07-08 08:12:30 UTC

Someone created a Prometheus Operator Admission Webhook Deployment pod abnormally; closing this as not a bug.

Comment 2 hongyan li 2022-07-08 08:15:22 UTC


*** This bug has been marked as a duplicate of bug 2090988 ***

Comment 3 hongyan li 2022-07-08 08:33:59 UTC

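During a rolling update, prometheus-operator-admission-webhook must keep at least 'replicas - maxUnavailable' pods available and may run at most 'replicas + maxSurge' pods. With replicas: 2 and both values set to 25%, 2*25% is 0.5, so maxUnavailable rounds down to zero and maxSurge rounds up to one, that is, from 2 to 3 pods. Both old pods must stay available while a third pod is created, which matches the 3 pods in the error above, and the required pod anti-affinity leaves no worker node for that third pod. I suspect these maxUnavailable and maxSurge values are not reasonable for a cluster with 2 worker nodes.

Both maxSurge and maxUnavailable can be specified as either an integer (e.g. 2) or a percentage (e.g. 50%), and they cannot both be zero. When specified as an integer, the value is the actual number of pods; when specified as a percentage, that percentage of the desired number of pods is used, rounded down for maxUnavailable and rounded up for maxSurge.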
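
The arithmetic for this Deployment can be sketched as follows. This is a minimal illustration, not the controller's code: it assumes the upstream Kubernetes rule that a maxUnavailable percentage is rounded down and a maxSurge percentage is rounded up, and the helper names resolve and rollout_bounds are hypothetical. The last check assumes that the required anti-affinity on kubernetes.io/hostname allows one webhook pod per node and that only the 2 worker nodes are schedulable.

```python
def resolve(value, replicas, round_up):
    """Turn an integer or a percentage string such as '25%' into a pod count."""
    if isinstance(value, int):
        return value
    scaled = int(value.rstrip("%")) * replicas
    # Integer arithmetic avoids float rounding: ceil(a / 100) == -(-a // 100).
    return -(-scaled // 100) if round_up else scaled // 100


def rollout_bounds(replicas, max_surge, max_unavailable):
    """Return (minimum available pods, maximum total pods) during a rollout."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return replicas - unavailable, replicas + surge


# Values from the Deployment in this report: replicas 2, both settings 25%.
min_available, max_total = rollout_bounds(2, "25%", "25%")
print(min_available, max_total)  # 2 3

# Assumption: one pod per node because of the required anti-affinity,
# and only the 2 worker nodes accept this pod.
worker_nodes = 2
print(max_total <= worker_nodes)  # False: the surge pod has nowhere to go
```

Under these assumptions the rollout needs 3 pods at once while keeping 2 available, which 2 worker nodes cannot satisfy.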