Bug 1845561 - alerts PrometheusNotIngestingSamples / PrometheusNotConnectedToAlertmanagers activated and scrape manager errors in prometheus
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1861543
Depends On:
Blocks:
Reported: 2020-06-09 14:09 UTC by German Parente
Modified: 2021-02-03 02:59 UTC (History)
CC List: 18 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: On a few occasions, the configuration reload for Prometheus was triggered before the configuration on disk had been fully generated. Consequence: Prometheus reloaded a configuration without scrape and alerting targets, which triggered the PrometheusNotIngestingSamples and PrometheusNotConnectedToAlertmanagers alerts. Fix: The process that reloads the configuration now ensures that the configuration on disk is valid before reloading Prometheus. Result: The PrometheusNotIngestingSamples and PrometheusNotConnectedToAlertmanagers alerts no longer fire in this situation.
Clone Of:
Environment:
Last Closed: 2020-10-27 16:06:02 UTC
Target Upstream Version:
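The fix described in the Doc Text field above (check the on-disk configuration before reloading) can be sketched roughly as follows. This is not the prometheus-operator implementation. The names `section_item_counts`, `looks_complete`, `reload_if_valid` and `post_reload` are hypothetical, and the completeness check is a deliberately crude text scan assumed for illustration; a real check would parse the YAML. The only Prometheus interface used is the `/-/reload` lifecycle endpoint.

```python
import urllib.request
from pathlib import Path
from typing import Callable, Dict


def section_item_counts(config_text: str) -> Dict[str, int]:
    """Count YAML list items ("- ...") under each top-level key (crude text scan)."""
    counts: Dict[str, int] = {}
    current = None
    for line in config_text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        if not line[0].isspace() and not line.startswith("-") and ":" in line:
            # An unindented "key:" line starts a new top-level section.
            current = line.split(":", 1)[0]
            counts.setdefault(current, 0)
        elif current is not None and line.lstrip().startswith("- "):
            counts[current] += 1
    return counts


def looks_complete(config_text: str) -> bool:
    """Assumption: a complete config has list items under scrape_configs and alerting."""
    counts = section_item_counts(config_text)
    return counts.get("scrape_configs", 0) > 0 and counts.get("alerting", 0) > 0


def reload_if_valid(path: Path, trigger_reload: Callable[[], None]) -> bool:
    """Call trigger_reload only if the file at path passes the check."""
    try:
        text = path.read_text()
    except OSError:
        return False  # file missing or unreadable: do not reload
    if not looks_complete(text):
        return False
    trigger_reload()
    return True


def post_reload(base_url: str) -> None:
    """POST to the Prometheus /-/reload endpoint (needs --web.enable-lifecycle)."""
    request = urllib.request.Request(base_url.rstrip("/") + "/-/reload", method="POST")
    with urllib.request.urlopen(request):
        pass
```

A caller could retry `reload_if_valid` on the next file-change event when it returns False, so that a half-written file is never pushed to Prometheus.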


Attachments
sum by (pod) (rate(prometheus_sd_kubernetes_events_total[5m])) (67.56 KB, image/png)
2020-07-13 11:21 UTC, Simon Pasquier


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-monitoring-operator pull 938 | 0 | None | closed | Bug 1845561: disable (temporarily) resource requests for config reloaders | 2021-02-10 13:05:33 UTC
Github | openshift cluster-monitoring-operator pull 943 | 0 | None | closed | Bug 1845561: enable resource requests for config reloaders | 2021-02-10 13:05:34 UTC
Github | openshift prometheus-operator pull 94 | 0 | None | closed | Bug 1845561: use a single reloader for Prometheus | 2021-02-10 13:05:34 UTC
Github | prometheus-operator prometheus-operator pull 3457 | 0 | None | closed | *: use a single reloader for Prometheus | 2021-02-10 13:05:34 UTC
Red Hat Knowledge Base (Solution) | 5514091 | 0 | None | None | None | 2020-10-23 16:35:06 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:06:19 UTC

Description German Parente 2020-06-09 14:09:18 UTC
Description of problem:

In certain situations, the following two alerts fire:

PrometheusNotIngestingSamples
PrometheusNotConnectedToAlertmanagers

When we look at the Prometheus logs, we see these errors:

level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-monitoring/kubelet/2"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-ingress/router-default/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-insights/insights-operator/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-authentication-operator/authentication-operator/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-monitoring/prometheus-operator/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-monitoring/node-exporter/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-etcd-operator/etcd-operator/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-multus/monitor-multus-admission-controller/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-apiserver/openshift-apiserver/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator/0"

The errors disappear once the Prometheus pods are deleted and restarted.
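To see which scrape configurations are affected, the error lines above can be summarized with a small script. This is a minimal sketch: the helper names `invalid_config_ids` and `by_namespace` are hypothetical, and it assumes the config id has the form namespace/name/index, as in the log excerpt.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the scrape manager error shown in the log excerpt.
INVALID_ID = re.compile(
    r'msg="error reloading target set" err="invalid config id:([^"]+)"'
)


def invalid_config_ids(log_lines: Iterable[str]) -> Counter:
    """Count how often each config id appears in 'invalid config id' errors."""
    ids: Counter = Counter()
    for line in log_lines:
        match = INVALID_ID.search(line)
        if match:
            ids[match.group(1)] += 1
    return ids


def by_namespace(ids: Counter) -> Counter:
    """Aggregate the counts by the first path segment of the id (the namespace)."""
    namespaces: Counter = Counter()
    for config_id, count in ids.items():
        namespaces[config_id.split("/", 1)[0]] += count
    return namespaces
```

Feeding it the lines of a saved Prometheus pod log (for example, `invalid_config_ids(open(path))` with a local file path) gives a per-config count that can be compared across pods.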

I am just documenting this bug behavior to have a trace.


Version-Release number of selected component (if applicable): 4.4



How reproducible:

Not easily reproduced.

Comment 9 Simon Pasquier 2020-07-09 10:05:49 UTC
I've stumbled upon a CI run that seems to correspond with the issue described here. In this case, PrometheusNotIngestingSamples and PrometheusNotConnectedToAlertmanagers alerts fired for the prometheus-k8s-1 instance. The last logs from prometheus-k8s-1 indicate that a config reload has happened but we see no message from the k8s discovery component (as if the configuration was empty).

level=info ts=2020-07-08T21:11:13.898Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-07-08T21:11:13.900Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-07-08T21:11:13.901Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-07-08T21:11:13.915Z caller=kubernetes.go:192 component="discovery manager notify" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-07-08T21:11:13.964Z caller=main.go:771 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2020-07-08T21:12:15.901Z caller=main.go:743 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2020-07-08T21:12:15.912Z caller=main.go:771 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-proxy-4.3/1280948762553880576/artifacts/e2e-aws-proxy/pods/openshift-monitoring_prometheus-k8s-1_prometheus.log
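The symptom described in this comment (a reload that completes without any "Using pod service account via in-cluster config" message) can be searched for in a saved log. The sketch below is a heuristic under stated assumptions: the function name `reloads_without_discovery` is hypothetical, the discovery messages are assumed to appear between "Loading configuration file" and "Completed loading of configuration file" as in the excerpt, and a configuration that legitimately has no Kubernetes service discovery would also be flagged.

```python
import re
from typing import Iterable, List

LOADING = 'msg="Loading configuration file"'
COMPLETED = 'msg="Completed loading of configuration file"'
DISCOVERY = 'msg="Using pod service account via in-cluster config"'
TIMESTAMP = re.compile(r"\bts=(\S+)")


def reloads_without_discovery(log_lines: Iterable[str]) -> List[str]:
    """Return the completion timestamps of reloads that logged no k8s discovery message."""
    suspicious: List[str] = []
    discovery_seen = 0
    for line in log_lines:
        if LOADING in line:
            discovery_seen = 0  # a new reload starts
        elif DISCOVERY in line:
            discovery_seen += 1
        elif COMPLETED in line:
            if discovery_seen == 0:
                match = TIMESTAMP.search(line)
                suspicious.append(match.group(1) if match else "")
            discovery_seen = 0
    return suspicious
```

Applied to the excerpt above, only the reload completing at 2020-07-08T21:12:15.912Z is reported. A log that starts in the middle of a reload can produce a false positive for its first "Completed" line.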

Comment 11 Simon Pasquier 2020-07-13 11:02:26 UTC
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-sdn-multitenant-4.4/1282455870612967424 is another example of the same issue. Again, the Prometheus logs show that prometheus-k8s-1 was reloaded, but no log message contains "Using pod service account via in-cluster config", which implies that the configuration was empty.

Comment 12 Simon Pasquier 2020-07-13 11:21:43 UTC
Created attachment 1700821
sum by (pod) (rate(prometheus_sd_kubernetes_events_total[5m]))

The Prometheus data dump from
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-sdn-multitenant-4.4/1282455870612967424 shows that the k8s service discovery for the prometheus-k8s-1 pod sees no update after the last config reload, confirming the hypothesis that the pod loaded an empty configuration.
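The same check can be run against a live Prometheus instead of a data dump. The sketch below sends the attachment's query to the Prometheus HTTP API (`/api/v1/query`) and lists pods whose service discovery event rate is zero. The names `run_query` and `idle_pods` are hypothetical, and the base URL is assumed to be reachable without authentication (for example through a local port-forward), which is not the default on OpenShift.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List

QUERY = "sum by (pod) (rate(prometheus_sd_kubernetes_events_total[5m]))"


def run_query(base_url: str, query: str = QUERY) -> Dict[str, Any]:
    """Run an instant query against the Prometheus HTTP API and return the JSON body."""
    url = (
        base_url.rstrip("/")
        + "/api/v1/query?"
        + urllib.parse.urlencode({"query": query})
    )
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def idle_pods(response: Dict[str, Any]) -> List[str]:
    """Return the pods whose value is zero in an instant-query (vector) response."""
    if response.get("status") != "success":
        raise ValueError("query failed: %r" % response.get("error"))
    idle = []
    for sample in response["data"]["result"]:
        _timestamp, value = sample["value"]  # the value is encoded as a string
        if float(value) == 0.0:
            idle.append(sample["metric"].get("pod", ""))
    return sorted(idle)
```

A pod listed by `idle_pods(run_query(base_url))` has seen no Kubernetes service discovery events in the last five minutes, which matches the pattern described in this comment but can also occur on a quiet cluster.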

Comment 24 Simon Pasquier 2020-09-11 12:43:46 UTC
*** Bug 1861543 has been marked as a duplicate of this bug. ***

Comment 28 Junqi Zhao 2020-10-05 05:13:34 UTC
Tested with 4.6.0-0.nightly-2020-10-03-051134: no errors in the Prometheus pod logs. A search of the CI results found no PrometheusNotIngestingSamples / PrometheusNotConnectedToAlertmanagers alerts in recent CI builds.

Comment 31 errata-xmlrpc 2020-10-27 16:06:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

