Bug 1904985
| Summary: | Prometheus and thanos sidecar targets are down | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Lili Cosic <lcosic> |
| Component: | Monitoring | Assignee: | Simon Pasquier <spasquie> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.7 | CC: | alegrand, anpicker, erooth, kakkoyun, lcosic, lszaszki, pkrupa, spasquie, surbania, wking |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | [sig-instrumentation] Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present [Late] |
| Last Closed: | 2021-02-24 15:38:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Lili Cosic 2020-12-07 10:25:29 UTC
Tentatively setting the blocker- flag, as this was seen on just one cluster.

*** Bug 1905418 has been marked as a duplicate of this bug. ***

Seen in another job [1]:

* The e2e process fails to reach Prometheus, receiving a 503: "Route and path matches, but all pods are down."
* But all the Prometheus containers are ready.
* The Ingress ClusterOperator is also happy.

Per [2] (private comment, sorry external folks), the signature for this bug is:

    x509: certificate is valid for prometheus-k8s-thanos-sidecar.openshift-monitoring.svc, prometheus-k8s-thanos-sidecar.openshift-monitoring.svc.cluster.local, not prometheus-k8s.openshift-monitoring.svc

in the Telemeter-client logs, which we see in [1]'s assets:

    $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/25749/pull-ci-openshift-origin-master-e2e-aws-fips/1336293963228778496/artifacts/e2e-aws-fips/pods/openshift-monitoring_telemeter-client-7567f58784-9jvzw_telemeter-client.log | grep 'x509: certificate is valid for' | tail -n1
    level=error caller=forwarder.go:268 ts=2020-12-08T14:19:10.358374323Z component=forwarder/worker msg="unable to forward results" err="Get \"https://prometheus-k8s.openshift-monitoring.svc:9091/federate?...\": x509: certificate is valid for prometheus-k8s-thanos-sidecar.openshift-monitoring.svc, prometheus-k8s-thanos-sidecar.openshift-monitoring.svc.cluster.local, not prometheus-k8s.openshift-monitoring.svc"

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25749/pull-ci-openshift-origin-master-e2e-aws-fips/1336293963228778496
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1905418#c1

Tested with 4.7.0-0.nightly-2020-12-09-112139: the prometheus and thanos-sidecar targets are up and no alerts fire for them.

    # token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
    # oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://alertmanager-main.openshift-monitoring.svc:9094/api/v1/alerts' | jq '.data[].labels | {alertname}'
    { "alertname": "AlertmanagerReceiversNotConfigured" }
    { "alertname": "PrometheusNotIngestingSamples" }
    { "alertname": "PrometheusNotIngestingSamples" }
    { "alertname": "CannotRetrieveUpdates" }
    { "alertname": "Watchdog" }

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
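The x509 error above is a plain SAN mismatch: the serving certificate covers only the thanos-sidecar service names, so hostname verification for prometheus-k8s.openshift-monitoring.svc fails. A minimal local reproduction of that check, using a throwaway self-signed certificate carrying the SANs from the error message (illustrative only, not the cluster's real certificate; requires OpenSSL 1.1.1+ for `-addext`):

```shell
# Mint a self-signed cert with only the thanos-sidecar SANs
# reported in the Telemeter client's error message.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/key.pem -out /tmp/cert.pem \
  -subj "/CN=prometheus-k8s-thanos-sidecar.openshift-monitoring.svc" \
  -addext "subjectAltName=DNS:prometheus-k8s-thanos-sidecar.openshift-monitoring.svc,DNS:prometheus-k8s-thanos-sidecar.openshift-monitoring.svc.cluster.local"

# The hostname the Telemeter client dials is not covered,
# so this reports that the host does NOT match the certificate...
openssl x509 -in /tmp/cert.pem -noout \
  -checkhost prometheus-k8s.openshift-monitoring.svc

# ...while the sidecar service name is covered and matches.
openssl x509 -in /tmp/cert.pem -noout \
  -checkhost prometheus-k8s-thanos-sidecar.openshift-monitoring.svc
```

This mirrors what Go's TLS client does when the Telemeter client connects to the federate endpoint: it checks the dialed hostname against the certificate's SAN list and aborts the handshake on a mismatch.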