Bug 1904985 - Prometheus and thanos sidecar targets are down
Summary: Prometheus and thanos sidecar targets are down
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1905418
Depends On:
Blocks:
Reported: 2020-12-07 10:25 UTC by Lili Cosic
Modified: 2021-02-24 15:39 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
[sig-instrumentation] Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present [Late]
Last Closed: 2021-02-24 15:38:25 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-monitoring-operator pull 1008 | 0 | None | closed | Bug 1904985: fix TLS secrets for Thanos sidecars | 2021-01-29 08:58:45 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:39:59 UTC

Description Lili Cosic 2020-12-07 10:25:29 UTC
Description of problem:
TargetDown alerts for prometheus and thanos-sidecar are firing in a 4.7 nightly cluster. Prometheus also appears to be inaccessible via the OpenShift route.

Version-Release number of selected component (if applicable):

4.7.0-0.ci-2020-12-07-045229

How reproducible:

So far only launched one cluster.

Steps to Reproduce:
1. Check alerting page and see alerts fire.
2. Try to access the Prometheus route.

Actual results:


Expected results:


Additional info:

Comment 1 Sergiusz Urbaniak 2020-12-07 10:46:35 UTC
Tentatively setting the blocker- flag, as this has been seen on just one cluster so far.

Comment 2 Simon Pasquier 2020-12-08 10:21:23 UTC
*** Bug 1905418 has been marked as a duplicate of this bug. ***

Comment 3 W. Trevor King 2020-12-08 17:55:40 UTC
Seen in another job [1]:

* e2e process fails to hit Prometheus, receiving a "Route and path matches, but all pods are down." 503.
* But all the Prom containers are ready.
* Ingress ClusterOperator is also happy.

Per [2] (private comment, sorry external folks), the signature for this bug is:

  x509: certificate is valid for prometheus-k8s-thanos-sidecar.openshift-monitoring.svc, prometheus-k8s-thanos-sidecar.openshift-monitoring.svc.cluster.local, not prometheus-k8s.openshift-monitoring.svc

in Telemeter-client logs, which we see in [1]'s assets:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/25749/pull-ci-openshift-origin-master-e2e-aws-fips/1336293963228778496/artifacts/e2e-aws-fips/pods/openshift-monitoring_telemeter-client-7567f58784-9jvzw_telemeter-client.log | grep 'x509: certificate is valid for' | tail -n1
level=error caller=forwarder.go:268 ts=2020-12-08T14:19:10.358374323Z component=forwarder/worker msg="unable to forward results" err="Get \"https://prometheus-k8s.openshift-monitoring.svc:9091/federate?...\": x509: certificate is valid for prometheus-k8s-thanos-sidecar.openshift-monitoring.svc, prometheus-k8s-thanos-sidecar.openshift-monitoring.svc.cluster.local, not prometheus-k8s.openshift-monitoring.svc"

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25749/pull-ci-openshift-origin-master-e2e-aws-fips/1336293963228778496
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1905418#c1
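
The signature above can be pulled out of a saved Telemeter-client log mechanically, which may help when checking several CI runs. A minimal sketch, assuming the error text has the same shape as the log line quoted above; `parse_x509_mismatch` is a hypothetical helper name, not part of `oc` or telemeter-client, and the log file name in the usage comment is also an assumption:

```shell
# Hypothetical helper (not part of oc or telemeter-client).
# Reads log lines on stdin and, for each Go x509 hostname-mismatch error,
# prints the names the certificate covers and the name the client requested.
parse_x509_mismatch() {
  sed -n 's/.*x509: certificate is valid for \(.*\), not \([^"\\ ]*\).*/covered=\1 requested=\2/p'
}

# Example usage against a downloaded log (file name is an assumption):
# parse_x509_mismatch < telemeter-client.log | sort | uniq -c
```

For the log line above, the covered names are the two prometheus-k8s-thanos-sidecar service names and the requested name is prometheus-k8s.openshift-monitoring.svc, which is consistent with the Prometheus endpoint presenting a certificate issued for the Thanos sidecar service.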

Comment 6 Junqi Zhao 2020-12-10 03:07:29 UTC
Tested with 4.7.0-0.nightly-2020-12-09-112139: the prometheus and thanos-sidecar targets are up, and no alerts are firing for them.
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://alertmanager-main.openshift-monitoring.svc:9094/api/v1/alerts' | jq '.data[].labels | {alertname}'
{
  "alertname": "AlertmanagerReceiversNotConfigured"
}
{
  "alertname": "PrometheusNotIngestingSamples"
}
{
  "alertname": "PrometheusNotIngestingSamples"
}
{
  "alertname": "CannotRetrieveUpdates"
}
{
  "alertname": "Watchdog"
}

Comment 10 errata-xmlrpc 2021-02-24 15:38:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

