Bug 1845561
Summary: alerts PrometheusNotIngestingSamples / PrometheusNotConnectedToAlertmanagers activated and scrape manager errors in prometheus

Product: OpenShift Container Platform
Reporter: German Parente <gparente>
Component: Monitoring
Assignee: Simon Pasquier <spasquie>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: high
Docs Contact:
Priority: high
Version: 4.4
CC: alegrand, anpicker, cblecker, cshereme, dapark, erooth, gshereme, kakkoyun, lcosic, mloibl, mmazur, ngirard, nmalik, pkrupa, spasquie, stwalter, surbania, wking
Target Milestone: ---
Keywords: ServiceDeliveryImpact
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: On some occasions, the configuration reload for Prometheus was triggered while the configuration on disk was not yet fully generated.
Consequence: Prometheus reloaded a configuration without scrape and alerting targets, triggering the PrometheusNotIngestingSamples and PrometheusNotConnectedToAlertmanagers alerts.
Fix: The process that reloads the configuration now ensures that the configuration on disk is valid before actually reloading Prometheus.
Result: The PrometheusNotIngestingSamples and PrometheusNotConnectedToAlertmanagers alerts no longer fire.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-27 16:06:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
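The race described in the Doc Text, a reload triggered while the generated file is still incomplete, and its fix, validating the on-disk configuration before triggering the reload, can be sketched in a few lines. This is an illustrative Python sketch only, not the actual config-reloader code: the helper names (`config_ready`, `reload_when_valid`) are hypothetical, and it assumes, for simplicity, that a complete generated config always contains `scrape_configs:` and `alerting:` sections.

```python
import time


def config_ready(path: str) -> bool:
    # A generated Prometheus config should at least define scrape and
    # alerting sections; an empty or partially written file means the
    # generator has not finished writing it yet.
    try:
        with open(path) as f:
            text = f.read()
    except FileNotFoundError:
        return False
    return "scrape_configs:" in text and "alerting:" in text


def reload_when_valid(path, trigger_reload, retries=5, delay=0.1):
    # Poll until the on-disk config looks complete, then reload once.
    # Returns False if the config never became valid within the budget.
    for _ in range(retries):
        if config_ready(path):
            trigger_reload()
            return True
        time.sleep(delay)
    return False
```

The real reloader validates more strictly (it parses the full configuration), but the shape is the same: check the file first, and only then ask Prometheus to reload.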
Description (German Parente, 2020-06-09 14:09:18 UTC)
I've stumbled upon a CI run that seems to correspond with the issue described here. In this case, the PrometheusNotIngestingSamples and PrometheusNotConnectedToAlertmanagers alerts fired for the prometheus-k8s-1 instance. The last logs from prometheus-k8s-1 indicate that a config reload happened, but we see no message from the k8s discovery component (as if the configuration were empty):

level=info ts=2020-07-08T21:11:13.898Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-07-08T21:11:13.900Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-07-08T21:11:13.901Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-07-08T21:11:13.915Z caller=kubernetes.go:192 component="discovery manager notify" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-07-08T21:11:13.964Z caller=main.go:771 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2020-07-08T21:12:15.901Z caller=main.go:743 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2020-07-08T21:12:15.912Z caller=main.go:771 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-proxy-4.3/1280948762553880576/artifacts/e2e-aws-proxy/pods/openshift-monitoring_prometheus-k8s-1_prometheus.log

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-sdn-multitenant-4.4/1282455870612967424 is another example of the same issue. Again the Prometheus logs show that prometheus-k8s-1 was reloaded, but there is no "Using pod service account via in-cluster config" message in the logs, which implies that the configuration is empty.

Created attachment 1700821 [details]
sum by (pod) (rate(prometheus_sd_kubernetes_events_total[5m]))

The Prometheus data dump from https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-sdn-multitenant-4.4/1282455870612967424 shows that the k8s service discovery for the prometheus-k8s-1 pod sees no update after the last config reload, confirming the hypothesis that the pod loaded an empty configuration.

*** Bug 1861543 has been marked as a duplicate of this bug. ***

Tested with 4.6.0-0.nightly-2020-10-03-051134: did not see errors in the prometheus pod's log, and searched the CI results; no PrometheusNotIngestingSamples / PrometheusNotConnectedToAlertmanagers alerts were found in recent CI builds.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
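The PromQL query from attachment 1700821, sum by (pod) (rate(prometheus_sd_kubernetes_events_total[5m])), can be run against the Prometheus HTTP API instant-query endpoint (GET /api/v1/query) to check service-discovery activity per pod. A minimal sketch of building that request, assuming a Prometheus endpoint reachable at localhost:9090 (hypothetical here; in-cluster access would go through the route or a port-forward):

```python
from urllib.parse import urlencode

# PromQL from attachment 1700821: per-pod rate of k8s SD events.
PROMQL = "sum by (pod) (rate(prometheus_sd_kubernetes_events_total[5m]))"


def instant_query_url(base: str, query: str) -> str:
    # Prometheus HTTP API instant query: GET /api/v1/query?query=<promql>
    return f"{base}/api/v1/query?{urlencode({'query': query})}"


url = instant_query_url("http://localhost:9090", PROMQL)
# Fetch with e.g. urllib.request.urlopen(url) against a live Prometheus;
# a pod whose rate stays at 0 after a config reload saw no SD updates,
# which is the signature of this bug.
```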