Bug 1989438

Summary: expected replicas is wrong
Product: OpenShift Container Platform Reporter: Junqi Zhao <juzhao>
Component: Monitoring Assignee: Prashant Balachandran <pnair>
Status: CLOSED ERRATA QA Contact: Junqi Zhao <juzhao>
Severity: low Docs Contact:
Priority: medium    
Version: 4.9 CC: amuller, anpicker, aos-bugs, erooth, pnair, spasquie
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-03-12 04:37:00 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
cluster-monitoring-operator pod logs none

Description Junqi Zhao 2021-08-03 08:33:45 UTC
Created attachment 1810385 [details]
cluster-monitoring-operator pod logs

Description of problem:
This is a negative test case: set a nodeSelector for the openshift-state-metrics, telemeter-client, and thanos-querier pods that no node satisfies. Here, no node is labeled deploy=new.
ConfigMap:
..................
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    telemeterClient:
      nodeSelector:
        deploy: new
    openshiftStateMetrics:
      nodeSelector:
        deploy: new
    thanosQuerier:
      nodeSelector:
        deploy: new
..................
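For context on why the new pods never schedule: a nodeSelector matches only nodes that carry every listed label with the same value. A minimal sketch of that rule follows; the node names and labels are hypothetical and not taken from the test cluster.

```python
def node_matches(node_labels, node_selector):
    """Return True if the node carries every key/value pair in the selector."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())

# Hypothetical nodes; none carries deploy=new, as in this negative case.
nodes = {
    "worker-a": {"kubernetes.io/os": "linux"},
    "worker-b": {"kubernetes.io/os": "linux", "deploy": "old"},
}
selector = {"deploy": "new"}

schedulable = [name for name, labels in nodes.items() if node_matches(labels, selector)]
print(schedulable)  # empty list: pods with this selector stay Pending
```
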
From the CMO logs, for example:
..................
W0803 07:55:07.333089       1 tasks.go:71] task 9 of 15: Updating openshift-state-metrics failed: reconciling openshift-state-metrics Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/openshift-state-metrics: expected 2 replicas, got 1 updated replicas
..................
We actually expect 1 replica for openshift-state-metrics, but the log reports "expected 2 replicas, got 1 updated replicas". The same applies to telemeter-client (1 replica expected, 2 reported) and thanos-querier (2 replicas expected, 3 reported).
# oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep cluster-monitoring-operator | awk '{print $1}') -c cluster-monitoring-operator | grep "updating Deployment object failed: waiting for DeploymentRollout"
W0803 07:44:55.214731       1 tasks.go:71] task 9 of 15: Updating openshift-state-metrics failed: reconciling openshift-state-metrics Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/openshift-state-metrics: expected 2 replicas, got 1 updated replicas
W0803 07:44:55.691078       1 tasks.go:71] task 11 of 15: Updating Telemeter client failed: reconciling Telemeter client Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/telemeter-client: expected 2 replicas, got 1 updated replicas
W0803 07:44:58.568087       1 tasks.go:71] task 13 of 15: Updating Thanos Querier failed: reconciling Thanos Querier Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/thanos-querier: expected 3 replicas, got 2 updated replicas
W0803 07:50:01.855932       1 tasks.go:71] task 9 of 15: Updating openshift-state-metrics failed: reconciling openshift-state-metrics Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/openshift-state-metrics: expected 2 replicas, got 1 updated replicas
W0803 07:50:02.513818       1 tasks.go:71] task 11 of 15: Updating Telemeter client failed: reconciling Telemeter client Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/telemeter-client: expected 2 replicas, got 1 updated replicas
W0803 07:50:04.632149       1 tasks.go:71] task 13 of 15: Updating Thanos Querier failed: reconciling Thanos Querier Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/thanos-querier: expected 3 replicas, got 2 updated replicas
W0803 07:55:07.333089       1 tasks.go:71] task 9 of 15: Updating openshift-state-metrics failed: reconciling openshift-state-metrics Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/openshift-state-metrics: expected 2 replicas, got 1 updated replicas
W0803 07:55:07.533440       1 tasks.go:71] task 11 of 15: Updating Telemeter client failed: reconciling Telemeter client Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/telemeter-client: expected 2 replicas, got 1 updated replicas
W0803 07:55:10.435813       1 tasks.go:71] task 13 of 15: Updating Thanos Querier failed: reconciling Thanos Querier Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/thanos-querier: expected 3 replicas, got 2 updated replicas
..................
# oc -n openshift-monitoring get rs | grep -E "openshift-state-metrics|telemeter-client|thanos-querier"
openshift-state-metrics-667855d8cb       1         1         0       38m
openshift-state-metrics-7bb78c4978       1         1         1       6h14m
telemeter-client-64457bfb68              1         1         1       6h2m
telemeter-client-9cbd9f797               1         1         0       38m
thanos-querier-5644d48fbd                2         2         0       38m
thanos-querier-7958b75d7                 0         0         0       79m
thanos-querier-86b84c6756                1         1         1       6h3m

# oc -n openshift-monitoring get deploy | grep -E "openshift-state-metrics|telemeter-client|thanos-querier"
openshift-state-metrics       1/1     1            1           6h15m
telemeter-client              1/1     1            1           6h3m
thanos-querier                1/2     2            1           6h4m

# oc -n openshift-monitoring get pod | grep -E "openshift-state-metrics|telemeter-client|thanos-querier"
openshift-state-metrics-667855d8cb-ht265       0/3     Pending   0          49m
openshift-state-metrics-7bb78c4978-vlvtx       3/3     Running   0          6h25m
telemeter-client-64457bfb68-2drsp              3/3     Running   0          6h14m
telemeter-client-9cbd9f797-gqmrg               0/3     Pending   0          49m
thanos-querier-5644d48fbd-7tl8j                0/5     Pending   0          49m
thanos-querier-5644d48fbd-m9bxj                0/5     Pending   0          49m
thanos-querier-86b84c6756-fvb8g                5/5     Running   0          51m
..................
# oc get co monitoring -oyaml
...
  - lastTransitionTime: "2021-08-03T08:15:33Z"
    message: Rolling out the stack.
    reason: RollOutInProgress
    status: "True"
    type: Progressing
  - lastTransitionTime: "2021-08-03T07:55:10Z"
    message: |-
      Failed to rollout the stack. Error: updating openshift-state-metrics: reconciling openshift-state-metrics Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/openshift-state-metrics: expected 2 replicas, got 1 updated replicas
      updating telemeter client: reconciling Telemeter client Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/telemeter-client: expected 2 replicas, got 1 updated replicas
      updating thanos querier: reconciling Thanos Querier Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/thanos-querier: expected 3 replicas, got 2 updated replicas
    reason: MultipleTasksFailed
    status: "True"
    type: Degraded
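
The 2 and 3 in these messages can be traced back to the ReplicaSet listing above. The sketch below assumes, per Kubernetes Deployment status semantics, that status.replicas is the sum of current pods over all ReplicaSets of a Deployment and that status.updatedReplicas counts only the newest ReplicaSet. The rows are copied from the `oc get rs` output (CURRENT column), and the newest ReplicaSet is picked by hand from the AGE column.

```python
# (deployment, current pods, is newest ReplicaSet) from the `oc get rs` output.
ROWS = [
    ("openshift-state-metrics", 1, True),   # 667855d8cb, 38m
    ("openshift-state-metrics", 1, False),  # 7bb78c4978, 6h14m
    ("telemeter-client", 1, False),         # 64457bfb68, 6h2m
    ("telemeter-client", 1, True),          # 9cbd9f797, 38m
    ("thanos-querier", 2, True),            # 5644d48fbd, 38m
    ("thanos-querier", 0, False),           # 7958b75d7, 79m
    ("thanos-querier", 1, False),           # 86b84c6756, 6h3m
]

def tally(rows):
    """Return {deployment: (total pods, pods from the newest ReplicaSet)}."""
    totals = {}
    for name, current, newest in rows:
        total, updated = totals.get(name, (0, 0))
        totals[name] = (total + current, updated + (current if newest else 0))
    return totals

for name, (total, updated) in tally(ROWS).items():
    # Same wording as the misleading log message.
    print(f"{name}: expected {total} replicas, got {updated} updated replicas")
```

Under these assumptions, the "expected" figure is the total pod count including old pods that are still running, not the configured replica count (1, 1, and 2).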


Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-02-145924

How reproducible:
always

Steps to Reproduce:
1. see the description

Actual results:


Expected results:


Additional info:

Comment 4 Junqi Zhao 2021-09-22 07:51:47 UTC
Tested with the PR; the error message now describes the situation correctly.
# oc -n openshift-monitoring get pod | grep -E "openshift-state-metrics|telemeter-client|thanos-querier"
openshift-state-metrics-59dc557c86-6jcfb       3/3     Running   0             69m
openshift-state-metrics-b8557c78d-dxzq4        0/3     Pending   0             15m
telemeter-client-584b7d88d8-b4h2h              0/3     Pending   0             15m
telemeter-client-b48ddbc69-w8rbp               3/3     Running   0             69m
thanos-querier-5696fff86b-5lzpp                0/5     Pending   0             15m
thanos-querier-5696fff86b-v7bkm                0/5     Pending   0             15m
thanos-querier-7589f7578d-cqdx6                5/5     Running   0             63m

# oc get co monitoring -oyaml
...
  - lastTransitionTime: "2021-09-22T07:45:04Z"
    message: |-
      Failed to rollout the stack. Error: updating openshift-state-metrics: reconciling openshift-state-metrics Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/openshift-state-metrics: the number of pods targeted by the deployment (2 pods) is different from the number of pods targeted by the deployment that have the desired template spec (1 pods)
      updating telemeter client: reconciling Telemeter client Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/telemeter-client: the number of pods targeted by the deployment (2 pods) is different from the number of pods targeted by the deployment that have the desired template spec (1 pods)
      updating thanos querier: reconciling Thanos Querier Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/thanos-querier: the number of pods targeted by the deployment (3 pods) is different from the number of pods targeted by the deployment that have the desired template spec (2 pods)
    reason: MultipleTasksFailed
    status: "True"
    type: Degraded
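
The reworded message closely follows the Kubernetes API descriptions of the Deployment status fields replicas and updatedReplicas. A minimal sketch of such a check follows, assuming it compares those two fields; the function name and the dict-based status are hypothetical and not taken from the PR.

```python
def rollout_error(status):
    """Return an error string while the two counts differ, else None."""
    replicas = status["replicas"]          # all pods targeted by the deployment
    updated = status["updatedReplicas"]    # pods with the desired template spec
    if replicas == updated:
        return None
    return (
        f"the number of pods targeted by the deployment ({replicas} pods) "
        "is different from the number of pods targeted by the deployment "
        f"that have the desired template spec ({updated} pods)"
    )

# Values for thanos-querier as reported in the condition message.
print(rollout_error({"replicas": 3, "updatedReplicas": 2}))
```
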

Comment 10 errata-xmlrpc 2022-03-12 04:37:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056