Bug 1904985
| Summary: | Prometheus and thanos sidecar targets are down | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Lili Cosic <lcosic> |
| Component: | Monitoring | Assignee: | Simon Pasquier <spasquie> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.7 | CC: | alegrand, anpicker, erooth, kakkoyun, lcosic, lszaszki, pkrupa, spasquie, surbania, wking |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | [sig-instrumentation] Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present [Late] |
| Last Closed: | 2021-02-24 15:38:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Lili Cosic 2020-12-07 10:25:29 UTC
Tentatively setting the blocker- flag, as this was seen on just one cluster.

*** Bug 1905418 has been marked as a duplicate of this bug. ***

Seen in another job [1]:

* The e2e process fails to reach Prometheus, receiving a 503: "Route and path matches, but all pods are down."
* But all the Prometheus containers are ready.
* The Ingress ClusterOperator is also happy.

Per [2] (private comment, sorry external folks), the signature for this bug is:

    x509: certificate is valid for prometheus-k8s-thanos-sidecar.openshift-monitoring.svc, prometheus-k8s-thanos-sidecar.openshift-monitoring.svc.cluster.local, not prometheus-k8s.openshift-monitoring.svc

in the Telemeter-client logs, which we see in [1]'s assets:

    $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/25749/pull-ci-openshift-origin-master-e2e-aws-fips/1336293963228778496/artifacts/e2e-aws-fips/pods/openshift-monitoring_telemeter-client-7567f58784-9jvzw_telemeter-client.log | grep 'x509: certificate is valid for' | tail -n1
    level=error caller=forwarder.go:268 ts=2020-12-08T14:19:10.358374323Z component=forwarder/worker msg="unable to forward results" err="Get \"https://prometheus-k8s.openshift-monitoring.svc:9091/federate?...\": x509: certificate is valid for prometheus-k8s-thanos-sidecar.openshift-monitoring.svc, prometheus-k8s-thanos-sidecar.openshift-monitoring.svc.cluster.local, not prometheus-k8s.openshift-monitoring.svc"

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25749/pull-ci-openshift-origin-master-e2e-aws-fips/1336293963228778496
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1905418#c1

Tested with 4.7.0-0.nightly-2020-12-09-112139: the prometheus and thanos-sidecar targets are up and no alerts fire for them.

    # token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
    # oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://alertmanager-main.openshift-monitoring.svc:9094/api/v1/alerts' | jq '.data[].labels | {alertname}'
    { "alertname": "AlertmanagerReceiversNotConfigured" }
    { "alertname": "PrometheusNotIngestingSamples" }
    { "alertname": "PrometheusNotIngestingSamples" }
    { "alertname": "CannotRetrieveUpdates" }
    { "alertname": "Watchdog" }

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
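The x509 error above is a plain SAN mismatch: the serving certificate covers only the thanos-sidecar service names, so hostname verification for prometheus-k8s.openshift-monitoring.svc fails. A minimal local reproduction of that check, using a throwaway self-signed certificate carrying the SANs from the error message (illustrative only, not the cluster's real certificate; requires OpenSSL 1.1.1+ for `-addext`):

```shell
# Mint a self-signed cert with only the thanos-sidecar SANs
# reported in the Telemeter client's error message.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/key.pem -out /tmp/cert.pem \
  -subj "/CN=prometheus-k8s-thanos-sidecar.openshift-monitoring.svc" \
  -addext "subjectAltName=DNS:prometheus-k8s-thanos-sidecar.openshift-monitoring.svc,DNS:prometheus-k8s-thanos-sidecar.openshift-monitoring.svc.cluster.local"

# The hostname the Telemeter client dials is not covered,
# so this reports that the host does NOT match the certificate...
openssl x509 -in /tmp/cert.pem -noout \
  -checkhost prometheus-k8s.openshift-monitoring.svc

# ...while the sidecar service name is covered and matches.
openssl x509 -in /tmp/cert.pem -noout \
  -checkhost prometheus-k8s-thanos-sidecar.openshift-monitoring.svc
```

This mirrors what Go's TLS client does when the Telemeter client connects to the federate endpoint: it checks the dialed hostname against the certificate's SAN list and aborts the handshake on a mismatch.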