Bug 1982778 - Thanos querier probes are timing out
Summary: Thanos querier probes are timing out
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: All
OS: All
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.8.z
Assignee: Philip Gough
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 1980888
Blocks:
Reported: 2021-07-15 16:48 UTC by Philip Gough
Modified: 2021-08-16 18:32 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1980888
Environment:
Last Closed: 2021-08-16 18:32:12 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-monitoring-operator pull 1289 | 0 | None | open | Bug 1982778: jsonnet: thanosquery: Use HTTP probes as opposed to exec | 2021-07-22 08:07:01 UTC
Red Hat Product Errata | RHBA-2021:3121 | 0 | None | None | None | 2021-08-16 18:32:25 UTC

Comment 4 Junqi Zhao 2021-08-10 02:55:59 UTC
Searched the CI results from the last 24 hours:
https://search.ci.openshift.org/chart?search=alert+KubeDeploymentReplicasMismatch+fired&maxAge=24h&type=junit
0 (0% of all failures) alert KubeDeploymentReplicasMismatch fired

Did not see the KubeDeploymentReplicasMismatch alert fire for thanos-querier.


# oc -n openshift-monitoring get deploy thanos-querier -oyaml
...
        livenessProbe:
          failureThreshold: 4
          httpGet:
            path: /-/healthy
            port: 9091
            scheme: HTTPS
          initialDelaySeconds: 5
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: oauth-proxy
        ports:
        - containerPort: 9091
          name: web
          protocol: TCP
        readinessProbe:
          failureThreshold: 20
          httpGet:
            path: /-/ready
            port: 9091
            scheme: HTTPS
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
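
For readers checking how tolerant these probe settings are, here is a rough sketch of the arithmetic. It assumes the kubelet acts only after failureThreshold consecutive failures, spaced periodSeconds apart, and that each failure is a full timeout. The `Probe` class and `seconds_until_action` are hypothetical names for illustration and are not part of any Kubernetes API. Real timings vary with kubelet scheduling jitter.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """Hypothetical container for the probe fields shown in the deployment."""
    failure_threshold: int
    period_seconds: int
    timeout_seconds: int

    def seconds_until_action(self) -> int:
        # Approximation: the first failing probe starts at t=0, each later
        # probe starts one period after the previous one, and the final
        # probe needs up to timeout_seconds before it counts as failed.
        return (self.failure_threshold - 1) * self.period_seconds + self.timeout_seconds


# Values copied from the oauth-proxy container spec above.
liveness = Probe(failure_threshold=4, period_seconds=30, timeout_seconds=1)
readiness = Probe(failure_threshold=20, period_seconds=5, timeout_seconds=1)

print("liveness: restart after roughly", liveness.seconds_until_action(), "s of failures")
print("readiness: unready after roughly", readiness.seconds_until_action(), "s of failures")
```

Under these assumptions the container would need to fail probes for about a minute and a half before the kubelet restarts it or marks it unready.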

# oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -oyaml | grep KubeDeploymentReplicasMismatch
...
      - alert: KubeDeploymentReplicasMismatch
        annotations:
          description: Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has
            not matched the expected number of replicas for longer than 15 minutes. This
            indicates that cluster infrastructure is unable to start or restart the necessary
            components. This most often occurs when one or more nodes are down or partioned
            from the cluster, or a fault occurs on the node that prevents the workload
            from starting. In rare cases this may indicate a new version of a cluster
            component cannot start due to a bug or configuration error. Assess the pods
            for this deployment to verify they are running on healthy nodes and then contact
            support.
          summary: Deployment has not matched the expected number of replicas
        expr: |
          (
            kube_deployment_spec_replicas{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics"}
              !=
            kube_deployment_status_replicas_available{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics"}
          ) and (
            changes(kube_deployment_status_replicas_updated{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics"}[5m])
              ==
            0
          ) and cluster:control_plane:all_nodes_ready
        for: 15m
        labels:
          severity: warning
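
As a reading aid, the alert expression above can be sketched in plain Python. This is not how Prometheus evaluates the rule. It only mirrors the three conditions for a single deployment. The function name and arguments are hypothetical, and treating cluster:control_plane:all_nodes_ready as a simple boolean gate is an assumption. The `for: 15m` clause (the condition must hold continuously for 15 minutes) is not modelled.

```python
from typing import Sequence


def changes(samples: Sequence[int]) -> int:
    """Count value changes between consecutive samples, like PromQL changes()."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def replicas_mismatch(
    spec_replicas: int,
    available_replicas: int,
    updated_replicas_5m: Sequence[int],
    all_nodes_ready: bool,
) -> bool:
    # Condition 1: desired replica count differs from available replicas.
    mismatch = spec_replicas != available_replicas
    # Condition 2: no rollout progress in the last 5 minutes of samples.
    stalled = changes(updated_replicas_5m) == 0
    # Condition 3 (assumed boolean gate): all control plane nodes are ready.
    return mismatch and stalled and all_nodes_ready


# One of two replicas unavailable, no rollout activity, nodes healthy.
print(replicas_mismatch(2, 1, [2, 2, 2, 2], True))
```

A deployment whose probes keep timing out would satisfy the first two conditions, which is why this alert was used as the signal when verifying the fix.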

Comment 6 errata-xmlrpc 2021-08-16 18:32:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.5 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3121
