Bug 1980888 - Thanos querier probes are timing out
Summary: Thanos querier probes are timing out
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.9
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Philip Gough
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1976940 1982757
Depends On:
Blocks: 1982778
Reported: 2021-07-09 17:58 UTC by Philip Gough
Modified: 2021-10-18 17:39 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1982778
Environment:
Last Closed: 2021-10-18 17:39:35 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-monitoring-operator pull 1277 | 0 | None | open | BUG 1980888: jsonnet: Favour http probes for thanos querier | 2021-07-13 08:20:25 UTC
Red Hat Knowledge Base (Solution) | 6189261 | 0 | None | None | None | 2021-07-15 19:40:35 UTC
Red Hat Product Errata | RHSA-2021:3759 | 0 | None | None | None | 2021-10-18 17:39:41 UTC

Description Philip Gough 2021-07-09 17:58:21 UTC
Description of problem:

While checking the behaviour of recently modified alert "KubeDeploymentReplicasMismatch" in CI, it appears the alert is firing because thanos-query probes are timing out:


https://search.ci.openshift.org/chart?search=alert+KubeDeploymentReplicasMismatch+fired&maxAge=24h&type=junit


An example job can be investigated here:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/1148/pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn/1413425417573896192/artifacts/e2e-azure-ovn/gather-must-gather/artifacts/event-filter.html


Version-Release number of selected component (if applicable):


How reproducible:

This alert currently fires in about 1% of CI failures. The majority of these are Thanos related, and on investigation each one turned out to be the probe issue.


Expected results:

Thanos probes do not time out, and hence neither this alert nor others fire.

Additional info:

Probes were removed for some other components, and there is some discussion here: https://github.com/prometheus-operator/prometheus-operator/pull/3502

Comment 2 Philip Gough 2021-07-15 16:45:48 UTC
*** Bug 1982757 has been marked as a duplicate of this bug. ***

Comment 3 Junqi Zhao 2021-07-19 04:00:54 UTC
Searched the CI results:
https://search.ci.openshift.org/?search=KubeDeploymentGenerationMismatch&maxAge=336h&context=1&type=all&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Did not see the KubeDeploymentReplicasMismatch alert for thanos-querier.


# oc -n openshift-monitoring get deploy thanos-querier -oyaml
...
        livenessProbe:
          failureThreshold: 4
          httpGet:
            path: /-/healthy
            port: 9091
            scheme: HTTPS
          initialDelaySeconds: 5
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: oauth-proxy
        ports:
        - containerPort: 9091
          name: web
          protocol: TCP
        readinessProbe:
          failureThreshold: 20
          httpGet:
            path: /-/ready
            port: 9091
            scheme: HTTPS
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
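
To make the probe settings above easier to reason about, here is a minimal Python sketch that estimates how long a run of consecutive probe timeouts lasts before the kubelet acts (restart for liveness, marking the pod unready for readiness). The `Probe` class and `seconds_until_action` function are hypothetical names introduced for illustration; they are not part of any Kubernetes or OpenShift API. The timing model is a simplifying assumption: attempts start exactly `periodSeconds` apart, every attempt fails by hitting `timeoutSeconds`, and scheduling jitter is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """Subset of probe fields, with values copied from the deployment output above."""
    initial_delay_seconds: int
    period_seconds: int
    failure_threshold: int
    timeout_seconds: int


def seconds_until_action(probe: Probe) -> int:
    """Estimate seconds from the start of the first failing attempt until the
    failure threshold is reached.

    Assumed model (not taken from the bug report): attempts start
    period_seconds apart, and each failing attempt takes timeout_seconds
    to be counted as a failure.
    """
    attempts_before_last = probe.failure_threshold - 1
    return attempts_before_last * probe.period_seconds + probe.timeout_seconds


# Values from the oauth-proxy container's probes shown above.
LIVENESS = Probe(initial_delay_seconds=5, period_seconds=30,
                 failure_threshold=4, timeout_seconds=1)
READINESS = Probe(initial_delay_seconds=5, period_seconds=5,
                  failure_threshold=20, timeout_seconds=1)

if __name__ == "__main__":
    print("liveness:", seconds_until_action(LIVENESS), "seconds")
    print("readiness:", seconds_until_action(READINESS), "seconds")
```

Under these assumptions, both probes tolerate roughly a minute and a half of consecutive timeouts before the kubelet reacts. Any single attempt still has to complete within the 1 second `timeoutSeconds`.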

Comment 4 Philip Gough 2021-07-20 16:51:17 UTC
*** Bug 1976940 has been marked as a duplicate of this bug. ***

Comment 13 errata-xmlrpc 2021-10-18 17:39:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

