Bug 1982778

Summary: Thanos querier probes are timing out
Product: OpenShift Container Platform
Reporter: Philip Gough <pgough>
Component: Monitoring
Assignee: Philip Gough <pgough>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.8
CC: alegrand, amuller, anpicker, aos-bugs, erooth, juzhao, kakkoyun, pkrupa, stwalter
Target Milestone: ---
Target Release: 4.8.z
Hardware: All
OS: All
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1980888
Environment:
Last Closed: 2021-08-16 18:32:12 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1980888
Bug Blocks:

Comment 4 Junqi Zhao 2021-08-10 02:55:59 UTC

Searched the CI results for the last 24 hours:
https://search.ci.openshift.org/chart?search=alert+KubeDeploymentReplicasMismatch+fired&maxAge=24h&type=junit
0 (0% of all failures) alert KubeDeploymentReplicasMismatch fired

The KubeDeploymentReplicasMismatch alert did not fire for thanos-querier.


# oc -n openshift-monitoring get deploy thanos-querier -oyaml
...
        livenessProbe:
          failureThreshold: 4
          httpGet:
            path: /-/healthy
            port: 9091
            scheme: HTTPS
          initialDelaySeconds: 5
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: oauth-proxy
        ports:
        - containerPort: 9091
          name: web
          protocol: TCP
        readinessProbe:
          failureThreshold: 20
          httpGet:
            path: /-/ready
            port: 9091
            scheme: HTTPS
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
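
For reference, the probe settings above imply roughly how long the container can keep failing before the kubelet acts on it. The sketch below is only an approximation: it assumes probes start exactly every periodSeconds and that every failure takes the full timeoutSeconds. The helper name seconds_until_threshold is hypothetical and not part of any Kubernetes API.

```python
def seconds_until_threshold(failure_threshold: int,
                            period_seconds: int,
                            timeout_seconds: int) -> int:
    """Approximate time from the first failing probe until the
    failure threshold is reached.

    Assumption: probes fire exactly every period_seconds and each
    failed probe takes the full timeout_seconds to fail.
    """
    # The first failure happens at t=0; the remaining failures follow
    # one period apart, and the last one needs a full timeout to fail.
    return (failure_threshold - 1) * period_seconds + timeout_seconds


# Values copied from the thanos-querier deployment output above.
liveness = seconds_until_threshold(failure_threshold=4, period_seconds=30, timeout_seconds=1)
readiness = seconds_until_threshold(failure_threshold=20, period_seconds=5, timeout_seconds=1)

print(f"liveness: about {liveness}s before the container is restarted")
print(f"readiness: about {readiness}s before the pod is marked not ready")
```

Under these assumptions both windows are far shorter than the 15 minutes the alert below waits for.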

# oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -oyaml | grep KubeDeploymentReplicasMismatch
...
      - alert: KubeDeploymentReplicasMismatch
        annotations:
          description: Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has
            not matched the expected number of replicas for longer than 15 minutes. This
            indicates that cluster infrastructure is unable to start or restart the necessary
            components. This most often occurs when one or more nodes are down or partioned
            from the cluster, or a fault occurs on the node that prevents the workload
            from starting. In rare cases this may indicate a new version of a cluster
            component cannot start due to a bug or configuration error. Assess the pods
            for this deployment to verify they are running on healthy nodes and then contact
            support.
          summary: Deployment has not matched the expected number of replicas
        expr: |
          (
            kube_deployment_spec_replicas{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics"}
              !=
            kube_deployment_status_replicas_available{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics"}
          ) and (
            changes(kube_deployment_status_replicas_updated{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics"}[5m])
              ==
            0
          ) and cluster:control_plane:all_nodes_ready
        for: 15m
        labels:
          severity: warning
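
A simplified model of the rule above may help when reading it. This is a sketch only, not how Prometheus evaluates the rule: real PromQL matches series by label set, and the "for" clause tracks pending state per series. Here cluster:control_plane:all_nodes_ready is treated as a plain boolean, which is an assumption because its definition is not shown in this bug. The function and parameter names are hypothetical.

```python
import re

# Namespace matcher copied from the rule. Prometheus anchors regex
# matchers on both ends, so fullmatch() is used here.
NAMESPACE_RE = re.compile(r"(openshift-.*|kube-.*|default|logging)")


def replicas_mismatch_fires(namespace: str,
                            spec_replicas: int,
                            available_replicas: int,
                            updated_changes_5m: int,
                            all_nodes_ready: bool,
                            minutes_condition_held: float) -> bool:
    """Return True if this simplified model of the alert would fire."""
    in_scope = NAMESPACE_RE.fullmatch(namespace) is not None
    # Desired replica count differs from the available count.
    mismatch = spec_replicas != available_replicas
    # No change in updated replicas over the last 5 minutes,
    # i.e. no rollout appears to be making progress.
    rollout_idle = updated_changes_5m == 0
    condition = in_scope and mismatch and rollout_idle and all_nodes_ready
    # "for: 15m" means the condition must hold for 15 minutes.
    return condition and minutes_condition_held >= 15


# thanos-querier with one replica unavailable for 20 minutes.
print(replicas_mismatch_fires("openshift-monitoring", 2, 1, 0, True, 20))
```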

Comment 6 errata-xmlrpc 2021-08-16 18:32:12 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.5 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3121