Description of problem:
IPI AWS cluster with only 2 workers: reconciling the PrometheusAdapter Deployment failed. No such issue with 3 workers on the same payload.

# oc get no | grep worker
ip-10-0-150-214.ap-south-1.compute.internal   Ready   worker   152m   v1.21.0-rc.0+2993be8
ip-10-0-163-79.ap-south-1.compute.internal    Ready   worker   152m   v1.21.0-rc.0+2993be8

# oc get co monitoring -oyaml
...
status:
  conditions:
  - lastTransitionTime: "2021-04-19T05:03:40Z"
    message: 'Failed to rollout the stack. Error: running task Updating prometheus-adapter
      failed: reconciling PrometheusAdapter Deployment failed: updating Deployment object
      failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-adapter:
      expected 3 replicas, got 1 updated replicas'
    reason: UpdatingprometheusAdapterFailed
    status: "True"
    type: Degraded

NOTE: the prometheus-adapter deployment requires only 2 prometheus-adapter pods.

# oc -n openshift-monitoring get po | grep prometheus-adapter
prometheus-adapter-6d9fc84f4c-2jq9s   0/1   ContainerCreating   0   92m
prometheus-adapter-6d9fc84f4c-pwhb4   0/1   ContainerCreating   0   92m
prometheus-adapter-7785cf7594-df8bf   0/1   Pending             0   87m

# oc -n openshift-monitoring get deploy prometheus-adapter
NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
prometheus-adapter   0/2     1            0           93m

# oc -n openshift-monitoring get rs
NAME                            DESIRED   CURRENT   READY   AGE
..
prometheus-adapter-6d9fc84f4c   2         2         0       93m
prometheus-adapter-7785cf7594   1         1         0       88m

Describe the Pending pod:
# oc -n openshift-monitoring describe pod prometheus-adapter-7785cf7594-df8bf
Name:                 prometheus-adapter-7785cf7594-df8bf
Namespace:            openshift-monitoring
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 <none>
Labels:               app.kubernetes.io/component=metrics-adapter
                      app.kubernetes.io/managed-by=cluster-monitoring-operator
                      app.kubernetes.io/name=prometheus-adapter
                      app.kubernetes.io/part-of=openshift-monitoring
                      app.kubernetes.io/version=0.8.4
                      pod-template-hash=7785cf7594
Annotations:          openshift.io/scc: restricted
                      workload.openshift.io/management: {"effect": "PreferredDuringScheduling"}
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        ReplicaSet/prometheus-adapter-7785cf7594
Containers:
  prometheus-adapter:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6400b1199456905c17cfdf73b72b606b6f081d533cbcc823d4ae050e4ef63390
    Port:       6443/TCP
    Host Port:  0/TCP
    Args:
      --prometheus-auth-config=/etc/prometheus-config/prometheus-config.yaml
      --config=/etc/adapter/config.yaml
      --logtostderr=true
      --metrics-relist-interval=1m
      --prometheus-url=https://prometheus-k8s.openshift-monitoring.svc:9091
      --secure-port=6443
      --client-ca-file=/etc/tls/private/client-ca-file
      --requestheader-client-ca-file=/etc/tls/private/requestheader-client-ca-file
      --requestheader-allowed-names=kube-apiserver-proxy,system:kube-apiserver-proxy,system:openshift-aggregator
      --requestheader-extra-headers-prefix=X-Remote-Extra-
      --requestheader-group-headers=X-Remote-Group
      --requestheader-username-headers=X-Remote-User
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
    Requests:
      cpu:     1m
      memory:  25Mi
    Environment:  <none>
    Mounts:
      /etc/adapter from config (rw)
      /etc/prometheus-config from prometheus-adapter-prometheus-config (rw)
      /etc/ssl/certs from serving-certs-ca-bundle (rw)
      /etc/tls/private from tls (ro)
      /tmp from tmpfs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-adapter-token-r8pjh (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  tmpfs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      adapter-config
    Optional:  false
  prometheus-adapter-prometheus-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-adapter-prometheus-config
    Optional:  false
  serving-certs-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      serving-certs-ca-bundle
    Optional:  false
  tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-adapter-12shoolsvvf93
    Optional:    false
  prometheus-adapter-token-r8pjh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-adapter-token-r8pjh
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  67m   default-scheduler  0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  67m   default-scheduler  0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  66m   default-scheduler  0/5 nodes are available: 2 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  65m   default-scheduler  0/5 nodes are available: 2 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  65m   default-scheduler  0/5 nodes are available: 2 node(s) didn't match pod affinity/anti-affinity rules, 2 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  65m   default-scheduler  0/5 nodes are available: 2 node(s) didn't match pod affinity/anti-affinity rules, 2 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  64m   default-scheduler  0/5 nodes are available: 2 node(s) didn't match pod affinity/anti-affinity rules, 2 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

Describe the ContainerCreating pod:
# oc -n openshift-monitoring describe pod prometheus-adapter-6d9fc84f4c-2jq9s
Name:                 prometheus-adapter-6d9fc84f4c-2jq9s
Namespace:            openshift-monitoring
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 ip-10-0-150-214.ap-south-1.compute.internal/10.0.150.214
Start Time:           Mon, 19 Apr 2021 00:56:35 -0400
Labels:               app.kubernetes.io/component=metrics-adapter
                      app.kubernetes.io/managed-by=cluster-monitoring-operator
                      app.kubernetes.io/name=prometheus-adapter
                      app.kubernetes.io/part-of=openshift-monitoring
                      app.kubernetes.io/version=0.8.4
                      pod-template-hash=6d9fc84f4c
Annotations:          openshift.io/scc: restricted
                      workload.openshift.io/management: {"effect": "PreferredDuringScheduling"}
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        ReplicaSet/prometheus-adapter-6d9fc84f4c
Containers:
  prometheus-adapter:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6400b1199456905c17cfdf73b72b606b6f081d533cbcc823d4ae050e4ef63390
    Image ID:
    Port:          6443/TCP
    Host Port:     0/TCP
    Args:
      --prometheus-auth-config=/etc/prometheus-config/prometheus-config.yaml
      --config=/etc/adapter/config.yaml
      --logtostderr=true
      --metrics-relist-interval=1m
      --prometheus-url=https://prometheus-k8s.openshift-monitoring.svc:9091
      --secure-port=6443
      --client-ca-file=/etc/tls/private/client-ca-file
      --requestheader-client-ca-file=/etc/tls/private/requestheader-client-ca-file
      --requestheader-allowed-names=kube-apiserver-proxy,system:kube-apiserver-proxy,system:openshift-aggregator
      --requestheader-extra-headers-prefix=X-Remote-Extra-
      --requestheader-group-headers=X-Remote-Group
      --requestheader-username-headers=X-Remote-User
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  25Mi
    Environment:  <none>
    Mounts:
      /etc/adapter from config (rw)
      /etc/prometheus-config from prometheus-adapter-prometheus-config (rw)
      /etc/ssl/certs from serving-certs-ca-bundle (rw)
      /etc/tls/private from tls (ro)
      /tmp from tmpfs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-adapter-token-r8pjh (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  tmpfs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      adapter-config
    Optional:  false
  prometheus-adapter-prometheus-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-adapter-prometheus-config
    Optional:  false
  serving-certs-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      serving-certs-ca-bundle
    Optional:  false
  tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-adapter-7gdh4nvu6vtep
    Optional:    false
  prometheus-adapter-token-r8pjh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-adapter-token-r8pjh
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Normal   Scheduled         66m                default-scheduler  Successfully assigned openshift-monitoring/prometheus-adapter-6d9fc84f4c-2jq9s to ip-10-0-150-214.ap-south-1.compute.internal
  Warning  FailedScheduling  74m                default-scheduler  0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  74m                default-scheduler  0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  72m                default-scheduler  0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  71m                default-scheduler  0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  68m                default-scheduler  0/5 nodes are available: 2 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  67m                default-scheduler  0/5 nodes are available: 2 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  74m                default-scheduler  0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedMount       51m (x2 over 55m)  kubelet            Unable to attach or mount volumes: unmounted volumes=[tls], unattached volumes=[serving-certs-ca-bundle tls prometheus-adapter-token-r8pjh tmpfs config prometheus-adapter-prometheus-config]: timed out waiting for the condition
  Warning  FailedMount       48m (x4 over 64m)  kubelet            Unable to attach or mount volumes: unmounted volumes=[tls], unattached volumes=[prometheus-adapter-prometheus-config serving-certs-ca-bundle tls prometheus-adapter-token-r8pjh tmpfs config]: timed out waiting for the condition
  Warning  FailedMount       30m (x5 over 53m)  kubelet            Unable to attach or mount volumes: unmounted volumes=[tls], unattached volumes=[prometheus-adapter-token-r8pjh tmpfs config prometheus-adapter-prometheus-config serving-certs-ca-bundle tls]: timed out waiting for the condition
  Warning  FailedMount       26m                kubelet            Unable to attach or mount volumes: unmounted volumes=[tls], unattached volumes=[config prometheus-adapter-prometheus-config serving-certs-ca-bundle tls prometheus-adapter-token-r8pjh tmpfs]: timed out waiting for the condition
  Warning  FailedMount       21m                kubelet            Unable to attach or mount volumes: unmounted volumes=[tls], unattached volumes=[tls prometheus-adapter-token-r8pjh tmpfs config prometheus-adapter-prometheus-config serving-certs-ca-bundle]: timed out waiting for the condition
  Warning  FailedMount       <invalid> (x5 over 60m)   kubelet     Unable to attach or mount volumes: unmounted volumes=[tls], unattached volumes=[tmpfs config prometheus-adapter-prometheus-config serving-certs-ca-bundle tls prometheus-adapter-token-r8pjh]: timed out waiting for the condition
  Warning  FailedMount       <invalid> (x50 over 66m)  kubelet     MountVolume.SetUp failed for volume "tls" : secret "prometheus-adapter-7gdh4nvu6vtep" not found

# oc -n openshift-monitoring get rs prometheus-adapter-7785cf7594 -oyaml | grep "prometheus-adapter-7gdh4nvu6vtep"
no result

# oc -n openshift-monitoring get rs prometheus-adapter-6d9fc84f4c -oyaml | grep "prometheus-adapter-7gdh4nvu6vtep" -C3
      - name: tls
        secret:
          defaultMode: 420
          secretName: prometheus-adapter-7gdh4nvu6vtep
status:
  fullyLabeledReplicas: 2
  observedGeneration: 1

The secret should be prometheus-adapter-12shoolsvvf93:
# oc -n openshift-monitoring get secret | grep prometheus-adapter
prometheus-adapter-12shoolsvvf93     Opaque                                4   93m
prometheus-adapter-dockercfg-wp4bv   kubernetes.io/dockercfg               1   97m
prometheus-adapter-tls               kubernetes.io/tls                     2   99m
prometheus-adapter-token-gqp6k       kubernetes.io/service-account-token   4   97m
prometheus-adapter-token-r8pjh       kubernetes.io/service-account-token   4   100m

The deployment also uses the prometheus-adapter-12shoolsvvf93 secret:
# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep "prometheus-adapter-12shoolsvvf93" -C3
      - name: tls
        secret:
          defaultMode: 420
          secretName: prometheus-adapter-12shoolsvvf93
status:
  conditions:
  - lastTransitionTime: "2021-04-19T04:48:39Z"

ReplicaSet prometheus-adapter-7785cf7594's desired count is 1, and it uses the prometheus-adapter-12shoolsvvf93 secret, the same one as the prometheus-adapter deployment:
# oc -n openshift-monitoring get rs prometheus-adapter-7785cf7594 -oyaml | grep "prometheus-adapter-12shoolsvvf93" -C3
      - name: tls
        secret:
          defaultMode: 420
          secretName: prometheus-adapter-12shoolsvvf93
status:
  fullyLabeledReplicas: 1
  observedGeneration: 1

ReplicaSet prometheus-adapter-6d9fc84f4c's desired count is 2, and it does not use the prometheus-adapter-12shoolsvvf93 secret; it uses the prometheus-adapter-7gdh4nvu6vtep secret, which is not found, as shown above:
# oc -n openshift-monitoring get rs prometheus-adapter-6d9fc84f4c -oyaml | grep "prometheus-adapter-12shoolsvvf93" -C3
no result

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-04-18-101412
prometheus-adapter 0.8.4
Kubernetes Version: v1.21.0-rc.0+2993be8

How reproducible:
in a cluster with 2 worker nodes

Steps to Reproduce:
1. See the description above.

Actual results:
Reconciling the PrometheusAdapter Deployment failed in a cluster with 2 worker nodes.

Expected results:
No issue.

Additional info:
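For anyone triaging a similar report, here is a quick way to compare the tls secret referenced by the Deployment against each ReplicaSet in one pass. This is a sketch using oc's jsonpath support; the label and volume names are taken from the describe output above:

```
# Secret name referenced by the deployment's current pod template:
oc -n openshift-monitoring get deploy prometheus-adapter \
  -o jsonpath='{.spec.template.spec.volumes[?(@.name=="tls")].secret.secretName}{"\n"}'

# Secret name referenced by each ReplicaSet owned by the deployment:
oc -n openshift-monitoring get rs -l app.kubernetes.io/name=prometheus-adapter \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.volumes[?(@.name=="tls")].secret.secretName}{"\n"}{end}'
```

Any ReplicaSet whose tls secret differs from the deployment's (and does not exist in the namespace) is the stale one.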
Adding more info which we have observed while doing the RCA:
================================================================
Initially the deployment tried to roll out, but the worker nodes were not ready, so the pods went into ContainerCreating state. Once the workers became ready, the deployment did not recognize this and started to roll out another replicaset, causing a race it could not resolve on its own.
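One way to watch the two competing replicasets while this race is happening, sketched with standard oc commands (the label comes from the pod describe output above; not re-verified on the affected cluster):

```
# List the competing ReplicaSets owned by the deployment:
oc -n openshift-monitoring get rs -l app.kubernetes.io/name=prometheus-adapter

# Show the deployment's rollout revisions (one per ReplicaSet):
oc -n openshift-monitoring rollout history deploy/prometheus-adapter
```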
Running the following command recovers the environment:

# oc -n openshift-monitoring delete rs prometheus-adapter-7785cf7594
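To confirm the recovery afterwards (a sketch; it assumes the deployment controller re-creates a ReplicaSet from the current pod template once the deleted one is gone):

```
# Wait for the deployment to converge to 2/2 ready replicas:
oc -n openshift-monitoring rollout status deploy/prometheus-adapter

# Both prometheus-adapter pods should now be Running:
oc -n openshift-monitoring get po -l app.kubernetes.io/name=prometheus-adapter
```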
(In reply to RamaKasturi from comment #2)
> Adding more info which we have observed while doing the RCA:
> ================================================================
> Initially the deployment tried to roll out, but the worker nodes were not
> ready, so the pods went into ContainerCreating state. Once the workers
> became ready, the deployment did not recognize this and started to roll out
> another replicaset, causing a race it could not resolve on its own.

The workers are all Ready; there are no NotReady nodes. I recreated a cluster to file this bug.
This issue is very similar to bug 1950761, which was filed for SNO. That bug was a regression from applying the HA conventions to prometheus-adapter. Part of the conventions we added are hard anti-affinity on hostname and maxUnavailable set to 25%, as defined for operands with 2 replicas.

So far, we've noticed the same status reported for the prometheus-adapter deployment in both cases:

```
status:
  conditions:
  - lastTransitionTime: "2021-04-19T04:48:39Z"
    lastUpdateTime: "2021-04-19T04:48:39Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2021-04-19T05:03:53Z"
    lastUpdateTime: "2021-04-19T05:03:53Z"
    message: ReplicaSet "prometheus-adapter-7785cf7594" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
  observedGeneration: 2
  replicas: 3
  unavailableReplicas: 3
  updatedReplicas: 1
```

What's weird is that the number of replicas reported in the status is 3 even though we set it to 2 in the deployment's spec. This seems to be caused by the creation of a second replicaset for prometheus-adapter. During the rollout, instead of moving all the replicas to the new replicaset, the first replicaset was scaled up to 2 and the second one to 1, which breaks our anti-affinity rule since there are only 2 nodes for 3 pods. It may also be worth noting that the `ProgressDeadlineExceeded` status is only observed for the second replicaset, perhaps because the hard anti-affinity rule prevents its pod from being scheduled.

Hence, my question would be: why do we end up with 3 replicas even though we asked for 2? Is this normal behavior during a rolling update, or is there an issue with our configuration?
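For illustration, this is roughly the shape the conventions described above take on the deployment spec. A sketch only; the exact manifest produced by cluster-monitoring-operator may differ:

```
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%  # per the HA convention mentioned above
  template:
    spec:
      affinity:
        podAntiAffinity:
          # "hard" anti-affinity: the pods must land on distinct hostnames
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/name: prometheus-adapter
            topologyKey: kubernetes.io/hostname
```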
Hi,

We have seen failures in multi-arch (MA) CI which may be related. I have linked the failing jobs below. We have seen consistent failures for 2 days now. Can this be marked as a blocker?

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-remote-libvirt-s390x-4.8/1384114697518714880
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-remote-libvirt-ppc64le-4.8/1383933519910146048
The prometheus-adapter deployment has 2 replicasets:

[root@localhost roottest]# oc get rs | grep prometheus-adapter
prometheus-adapter-68c598744f   2   2   0   100m
prometheus-adapter-6d847c6c8d   1   1   0   95m

The first rs's pods failed with the error:

3m25s   Warning   FailedMount   pod/prometheus-adapter-68c598744f-wfkdn   MountVolume.SetUp failed for volume "tls" : secret "prometheus-adapter-58qc3mf19bli0" not found

The second rs's pod failed with the error:

[root@localhost roottest]# oc get po | grep prometheus-adapter
prometheus-adapter-68c598744f-jm8mq   0/1   ContainerCreating   0   102m
prometheus-adapter-68c598744f-wfkdn   0/1   ContainerCreating   0   102m
prometheus-adapter-6d847c6c8d-nrbn2   0/1   Pending             0   97m

[root@localhost roottest]# oc get events | grep prometheus-adapter-6d847c6c8d-nrbn2
....
6m8s   Warning   FailedScheduling   pod/prometheus-adapter-6d847c6c8d-nrbn2   0/5 nodes are available: 2 node(s) didn't match pod affinity/anti-affinity rules, 2 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
(In reply to Damien Grisonnet from comment #5)
> [...]
> Hence, my question would be: why do we end up with 3 replicas even though
> we asked for 2? Is this normal behavior during a rolling update, or is
> there an issue with our configuration?

The "3 replicas" comes from the "Rollover" behavior of Deployments; it is expected.
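To spell out the rollover arithmetic (assuming the Kubernetes default maxSurge of 25% alongside the maxUnavailable of 25% stated in comment #5):

```
# With replicas: 2, the percentages resolve as follows
# (Kubernetes rounds maxUnavailable down and maxSurge up):
rollingUpdate:
  maxUnavailable: 25%  # floor(2 * 0.25) = 0 -> no old pod may be removed first
  maxSurge: 25%        # ceil(2 * 0.25)  = 1 -> one extra pod allowed during rollout
# Old ReplicaSet keeps 2 pods + new ReplicaSet surges 1 pod = 3 pods total,
# matching status.replicas: 3. With hard anti-affinity on
# kubernetes.io/hostname and only 2 worker nodes, the third pod can never
# schedule, so the rollout deadlocks.
```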
With the fix for bug 1950761, no such issue now.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-19-225513   True        False         43m     Cluster version is 4.8.0-0.nightly-2021-04-19-225513

# oc get no | grep worker
ip-10-0-132-81.ap-south-1.compute.internal   Ready   worker   57m   v1.21.0-rc.0+98d91ef
ip-10-0-160-93.ap-south-1.compute.internal   Ready   worker   57m   v1.21.0-rc.0+98d91ef

# oc get co monitoring
NAME         VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
monitoring   4.8.0-0.nightly-2021-04-19-225513   True        False         False      39m

# oc -n openshift-monitoring get po | grep prometheus-adapter
prometheus-adapter-69876c9996-26wg9   1/1   Running   0   42m
prometheus-adapter-69876c9996-6b4pk   1/1   Running   0   42m

# oc -n openshift-monitoring get deploy prometheus-adapter
NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
prometheus-adapter   2/2     2            2           64m

# oc -n openshift-monitoring get rs | grep prometheus-adapter
prometheus-adapter-69876c9996   2   2   2   64m

Closing this bug.
*** This bug has been marked as a duplicate of bug 1950761 ***