Description of problem:
TargetDown alerts for prometheus and thanos-sidecar are firing in a 4.7 nightly cluster. Prometheus also seems to be inaccessible via the OpenShift route.

Version-Release number of selected component (if applicable):
4.7.0-0.ci-2020-12-07-045229

How reproducible:
So far only launched one cluster.

Steps to Reproduce:
1. Check the alerting page and see the alerts fire.
2. Try to access the Prometheus route.

Actual results:
TargetDown alerts fire for prometheus and thanos-sidecar, and the Prometheus route is not reachable.

Expected results:
No TargetDown alerts for those targets; the Prometheus route is accessible.

Additional info:
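For reference, both symptoms can be checked from the CLI (a sketch; it assumes the default openshift-monitoring namespace and the default prometheus-k8s route/service names):

  # list firing alerts via the Alertmanager API
  $ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
  $ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -s -k -H "Authorization: Bearer $token" 'https://alertmanager-main.openshift-monitoring.svc:9094/api/v1/alerts' | jq '.data[].labels.alertname'

  # probe the Prometheus route; a 503 here matches the reported symptom
  $ host=`oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}'`
  $ curl -sk -o /dev/null -w '%{http_code}\n' "https://$host"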
Tentatively setting the blocker- flag as it was seen on just one cluster.
*** Bug 1905418 has been marked as a duplicate of this bug. ***
Seen in another job [1]:

* e2e process fails to hit Prometheus, receiving a "Route and path matches, but all pods are down." 503.
* But all the Prom containers are ready.
* Ingress ClusterOperator is also happy.

Per [2] (private comment, sorry external folks), the signature for this bug is:

  x509: certificate is valid for prometheus-k8s-thanos-sidecar.openshift-monitoring.svc, prometheus-k8s-thanos-sidecar.openshift-monitoring.svc.cluster.local, not prometheus-k8s.openshift-monitoring.svc

in Telemeter-client logs, which we see in [1]'s assets:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/25749/pull-ci-openshift-origin-master-e2e-aws-fips/1336293963228778496/artifacts/e2e-aws-fips/pods/openshift-monitoring_telemeter-client-7567f58784-9jvzw_telemeter-client.log | grep 'x509: certificate is valid for' | tail -n1
  level=error caller=forwarder.go:268 ts=2020-12-08T14:19:10.358374323Z component=forwarder/worker msg="unable to forward results" err="Get \"https://prometheus-k8s.openshift-monitoring.svc:9091/federate?...\": x509: certificate is valid for prometheus-k8s-thanos-sidecar.openshift-monitoring.svc, prometheus-k8s-thanos-sidecar.openshift-monitoring.svc.cluster.local, not prometheus-k8s.openshift-monitoring.svc"

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25749/pull-ci-openshift-origin-master-e2e-aws-fips/1336293963228778496
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1905418#c1
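To confirm the mismatch directly, the SANs on the serving certificate can be inspected with openssl (a sketch; it assumes a shell with openssl that can resolve the in-cluster service DNS, e.g. inside a pod or an 'oc debug' session):

  $ openssl s_client -connect prometheus-k8s.openshift-monitoring.svc:9091 </dev/null 2>/dev/null \
      | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'

On an affected cluster this lists only the prometheus-k8s-thanos-sidecar names, matching the error above; after the fix it should include prometheus-k8s.openshift-monitoring.svc.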
Tested with 4.7.0-0.nightly-2020-12-09-112139; the prometheus and thanos-sidecar targets are up and no alerts fire for them:

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://alertmanager-main.openshift-monitoring.svc:9094/api/v1/alerts' | jq '.data[].labels | {alertname}'
{
  "alertname": "AlertmanagerReceiversNotConfigured"
}
{
  "alertname": "PrometheusNotIngestingSamples"
}
{
  "alertname": "PrometheusNotIngestingSamples"
}
{
  "alertname": "CannotRetrieveUpdates"
}
{
  "alertname": "Watchdog"
}
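The target state can also be queried directly from the Prometheus API (a sketch using the same token as above; it assumes /api/v1/targets is reachable through the prometheus-k8s service on port 9091):

  # list any targets whose health is not "up"; an empty result means all targets, including prometheus-k8s and thanos-sidecar, are up
  $ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -s -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/targets' | jq '.data.activeTargets[] | select(.health != "up") | {job: .labels.job, health: .health}'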
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633