1904985 – Prometheus and thanos sidecar targets are down

Summary: Prometheus and thanos sidecar targets are down

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Simon Pasquier
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1905418
Depends On:
Blocks:

Reported:	2020-12-07 10:25 UTC by Lili Cosic
Modified:	2021-02-24 15:39 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:	[sig-instrumentation] Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present [Late]
Last Closed:	2021-02-24 15:38:25 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-monitoring-operator pull 1008	0	None	closed	Bug 1904985: fix TLS secrets for Thanos sidecars	2021-01-29 08:58:45 UTC
Red Hat Product Errata	RHSA-2020:5633	0	None	None	None	2021-02-24 15:39:59 UTC

Description Lili Cosic 2020-12-07 10:25:29 UTC

Description of problem:
TargetDown alerts for prometheus and thanos-sidecar are firing in a 4.7 nightly cluster. Prometheus also seems to be inaccessible via the OpenShift route.

Version-Release number of selected component (if applicable):

4.7.0-0.ci-2020-12-07-045229

How reproducible:

Unknown; only one cluster has been launched so far.

Steps to Reproduce:
1. Check the alerting page and see the alerts firing.
2. Try to access the Prometheus route.

Actual results:


Expected results:


Additional info:

Comment 1 Sergiusz Urbaniak 2020-12-07 10:46:35 UTC

Tentatively setting the blocker- flag, as this was seen on just one cluster.

Comment 2 Simon Pasquier 2020-12-08 10:21:23 UTC

*** Bug 1905418 has been marked as a duplicate of this bug. ***

Comment 3 W. Trevor King 2020-12-08 17:55:40 UTC

Seen in another job [1]:

* e2e process fails to hit Prometheus, receiving a "Route and path matches, but all pods are down." 503.
* But all the Prom containers are ready.
* Ingress ClusterOperator is also happy.

Per [2] (private comment, sorry external folks), the signature for this bug is:

  x509: certificate is valid for prometheus-k8s-thanos-sidecar.openshift-monitoring.svc, prometheus-k8s-thanos-sidecar.openshift-monitoring.svc.cluster.local, not prometheus-k8s.openshift-monitoring.svc

in Telemeter-client logs, which we see in [1]'s assets:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/25749/pull-ci-openshift-origin-master-e2e-aws-fips/1336293963228778496/artifacts/e2e-aws-fips/pods/openshift-monitoring_telemeter-client-7567f58784-9jvzw_telemeter-client.log | grep 'x509: certificate is valid for' | tail -n1
level=error caller=forwarder.go:268 ts=2020-12-08T14:19:10.358374323Z component=forwarder/worker msg="unable to forward results" err="Get \"https://prometheus-k8s.openshift-monitoring.svc:9091/federate?...\": x509: certificate is valid for prometheus-k8s-thanos-sidecar.openshift-monitoring.svc, prometheus-k8s-thanos-sidecar.openshift-monitoring.svc.cluster.local, not prometheus-k8s.openshift-monitoring.svc"

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25749/pull-ci-openshift-origin-master-e2e-aws-fips/1336293963228778496
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1905418#c1
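To spell out why the error in the log above is raised: TLS hostname verification only passes when the name the client dialed appears among the certificate's subject alternative names (SANs). The following is a minimal sketch of that check, not the actual verification code used by telemeter-client. The helper name san_covers is hypothetical, and it assumes exact-match DNS names only (wildcard SANs are not handled). The host names are the ones quoted in the error message.

```shell
# Hypothetical helper (not part of any OpenShift tooling): succeed if the
# requested host name exactly matches one of the given DNS SANs.
# Assumption: wildcard SANs are deliberately not handled in this sketch.
san_covers() {
  host=$1
  shift
  for san in "$@"; do
    if [ "$san" = "$host" ]; then
      return 0
    fi
  done
  return 1
}

# The client dialed the prometheus-k8s service, but the certificate served
# only lists the Thanos sidecar service names, so the check fails.
if san_covers \
    prometheus-k8s.openshift-monitoring.svc \
    prometheus-k8s-thanos-sidecar.openshift-monitoring.svc \
    prometheus-k8s-thanos-sidecar.openshift-monitoring.svc.cluster.local; then
  echo "covered"
else
  echo "not covered"
fi
```

With the names from the log this prints "not covered", which matches the reported symptom of the sidecar certificate being served on the prometheus-k8s service port.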

Comment 6 Junqi Zhao 2020-12-10 03:07:29 UTC

Tested with 4.7.0-0.nightly-2020-12-09-112139: the prometheus and thanos-sidecar targets are up and no alerts are firing for them.
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://alertmanager-main.openshift-monitoring.svc:9094/api/v1/alerts' | jq '.data[].labels | {alertname}'
{
  "alertname": "AlertmanagerReceiversNotConfigured"
}
{
  "alertname": "PrometheusNotIngestingSamples"
}
{
  "alertname": "PrometheusNotIngestingSamples"
}
{
  "alertname": "CannotRetrieveUpdates"
}
{
  "alertname": "Watchdog"
}
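The manual check above could be turned into a pass/fail step along these lines. This is only a sketch: the helper name no_target_down is hypothetical, and it assumes the alert names arrive one per line on stdin (for example by ending the pipeline above with jq -r '.data[].labels.alertname' instead of the object filter). The sample names are the ones from the output above.

```shell
# Hypothetical helper: read one alert name per line on stdin and fail if
# an alert named exactly "TargetDown" is among them.
no_target_down() {
  if grep -qx 'TargetDown'; then
    return 1
  fi
  return 0
}

# Alert names taken from the verification output in this comment.
if printf '%s\n' \
    AlertmanagerReceiversNotConfigured \
    PrometheusNotIngestingSamples \
    PrometheusNotIngestingSamples \
    CannotRetrieveUpdates \
    Watchdog | no_target_down; then
  echo "no TargetDown alerts"
else
  echo "TargetDown is firing"
fi
```

Note that this only checks the alert name; it does not look at which job or target the alert refers to.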

Comment 10 errata-xmlrpc 2021-02-24 15:38:25 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
