1845561 – alerts PrometheusNotIngestingSamples / PrometheusNotConnectedToAlertmanagers activated and scrape manager errors in prometheus

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Simon Pasquier
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1861543 (view as bug list)
Depends On:
Blocks:

Reported:	2020-06-09 14:09 UTC by German Parente
Modified:	2024-03-25 16:02 UTC (History)
CC List:	18 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: On rare occasions, the configuration reload for Prometheus was triggered before the configuration on disk had been fully generated. Consequence: Prometheus reloaded a configuration without scrape and alerting targets, which triggered the PrometheusNotIngestingSamples and PrometheusNotConnectedToAlertmanagers alerts. Fix: The process that reloads the configuration now ensures that the configuration on disk is valid before actually reloading Prometheus. Result: The PrometheusNotIngestingSamples and PrometheusNotConnectedToAlertmanagers alerts no longer fire for this reason.
Clone Of:
Environment:
Last Closed:	2020-10-27 16:06:02 UTC
Target Upstream Version:
Embargoed:

Attachments
sum by (pod) (rate(prometheus_sd_kubernetes_events_total[5m])) (67.56 KB, image/png) 2020-07-13 11:21 UTC, Simon Pasquier	no flags	Details

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-monitoring-operator pull 938	None	closed	Bug 1845561: disable (temporarily) resource requests for config reloaders	2021-02-10 13:05:33 UTC
Github	openshift cluster-monitoring-operator pull 943	None	closed	Bug 1845561: enable resource requests for config reloaders	2021-02-10 13:05:34 UTC
Github	openshift prometheus-operator pull 94	None	closed	Bug 1845561: use a single reloader for Prometheus	2021-02-10 13:05:34 UTC
Github	prometheus-operator prometheus-operator pull 3457	None	closed	*: use a single reloader for Prometheus	2021-02-10 13:05:34 UTC
Red Hat Knowledge Base (Solution)	5514091	None	None	None	2020-10-23 16:35:06 UTC
Red Hat Product Errata	RHBA-2020:4196	None	None	None	2020-10-27 16:06:19 UTC

Description German Parente 2020-06-09 14:09:18 UTC

Description of problem:

In certain situations, these two alerts fire:

PrometheusNotIngestingSamples
PrometheusNotConnectedToAlertmanagers

The Prometheus logs then show these errors:

level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-monitoring/kubelet/2"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-ingress/router-default/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-insights/insights-operator/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-authentication-operator/authentication-operator/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-monitoring/prometheus-operator/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-monitoring/node-exporter/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-etcd-operator/etcd-operator/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-multus/monitor-multus-admission-controller/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-apiserver/openshift-apiserver/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator/0"
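To list which scrape configurations are affected, the ids can be pulled out of log lines like the ones above. This is only a minimal sketch in Python: the function name invalid_config_ids is made up for illustration, and it assumes the err field always has the exact form shown in this report.

```python
import re

# Matches the quoted field err="invalid config id:<namespace>/<name>/<index>"
# as it appears in the scrape manager errors quoted in this report.
INVALID_ID = re.compile(r'err="invalid config id:([^"]+)"')


def invalid_config_ids(log_lines):
    """Return the distinct config ids reported as invalid, in first-seen order."""
    seen = []
    for line in log_lines:
        match = INVALID_ID.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen
```

Feeding it the lines of a saved prometheus container log (for example via open(path) with a path of your choosing) gives one entry per affected ServiceMonitor endpoint.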

The errors disappear once the Prometheus pods are deleted and restarted.

I am documenting this behavior here so that there is a record of it.


Version-Release number of selected component (if applicable): 4.4



How reproducible:

Not easily reproduced.

Comment 9 Simon Pasquier 2020-07-09 10:05:49 UTC

I've stumbled upon a CI run that seems to correspond to the issue described here. In this case, the PrometheusNotIngestingSamples and PrometheusNotConnectedToAlertmanagers alerts fired for the prometheus-k8s-1 instance. The last logs from prometheus-k8s-1 indicate that a config reload happened, but there is no message from the k8s discovery component for that reload (as if the configuration were empty).

level=info ts=2020-07-08T21:11:13.898Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-07-08T21:11:13.900Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-07-08T21:11:13.901Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-07-08T21:11:13.915Z caller=kubernetes.go:192 component="discovery manager notify" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-07-08T21:11:13.964Z caller=main.go:771 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2020-07-08T21:12:15.901Z caller=main.go:743 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2020-07-08T21:12:15.912Z caller=main.go:771 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-proxy-4.3/1280948762553880576/artifacts/e2e-aws-proxy/pods/openshift-monitoring_prometheus-k8s-1_prometheus.log
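The pattern in the excerpt above (a "Loading configuration file" line followed directly by "Completed loading of configuration file", with no discovery message in between) can be looked for mechanically. The following Python sketch is a heuristic only: the function name is hypothetical, and it assumes that a reload of a populated configuration always logs the in-cluster config message, which is what the healthy reloads in this log suggest but is not guaranteed.

```python
import re

TS = re.compile(r"ts=(\S+)")
DISCOVERY_MSG = "Using pod service account via in-cluster config"


def reloads_without_discovery(log_lines):
    """Return timestamps of completed reloads that logged no k8s discovery message.

    Only reloads whose "Loading configuration file" line is present in the
    input are considered, so a truncated first reload is ignored.
    """
    suspicious = []
    in_reload = False
    discovery_seen = 0
    for line in log_lines:
        if 'msg="Loading configuration file"' in line:
            in_reload, discovery_seen = True, 0
        elif DISCOVERY_MSG in line:
            discovery_seen += 1
        elif 'msg="Completed loading of configuration file"' in line:
            if in_reload and discovery_seen == 0:
                match = TS.search(line)
                suspicious.append(match.group(1) if match else None)
            in_reload = False
    return suspicious
```

Run against the seven lines quoted above, it flags only the reload that completed at 21:12:15.912Z.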

Comment 11 Simon Pasquier 2020-07-13 11:02:26 UTC

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-sdn-multitenant-4.4/1282455870612967424 is another example of the same issue. Again, the Prometheus logs show that prometheus-k8s-1 was reloaded, but no log message after the reload contains "Using pod service account via in-cluster config", which implies that the loaded configuration was empty.
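The Doc Text field of this bug describes the fix as checking that the configuration on disk is valid before reloading Prometheus. The Python sketch below illustrates that idea only; it is not the reloader's actual check. The function names are hypothetical, the configuration is assumed to be already parsed into a dict, and trigger_reload stands in for whatever actually asks Prometheus to reload. The keys scrape_configs and alerting.alertmanagers are the standard Prometheus configuration sections.

```python
def missing_sections(config):
    """Return the names of sections that are absent or empty in a parsed config."""
    config = config or {}
    problems = []
    if not config.get("scrape_configs"):
        problems.append("scrape_configs")
    alerting = config.get("alerting") or {}
    if not alerting.get("alertmanagers"):
        problems.append("alerting.alertmanagers")
    return problems


def reload_if_complete(config, trigger_reload):
    """Call trigger_reload() only when the config has scrape and alerting targets.

    Returns True if the reload was triggered, False if it was skipped.
    """
    if missing_sections(config):
        return False
    trigger_reload()
    return True
```

Skipping the reload keeps the previously loaded configuration active, which is the safer outcome when the file on disk is only partially generated.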

Comment 12 Simon Pasquier 2020-07-13 11:21:43 UTC

Created attachment 1700821 [details]
sum by (pod) (rate(prometheus_sd_kubernetes_events_total[5m]))

The Prometheus data dump from
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-sdn-multitenant-4.4/1282455870612967424 shows that the k8s service discovery for the prometheus-k8s-1 pod sees no update after the last config reload, confirming the hypothesis that the pod loaded an empty configuration.
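The same check can be made against a live Prometheus through its HTTP API, using the query from the attachment title. This is a rough Python sketch: the base URL is a placeholder you would supply, the helper names are hypothetical, and idle_pods assumes the standard instant-query response layout (data.result entries with a metric map and a [timestamp, "value"] pair).

```python
from urllib.parse import urlencode

# Query taken from the attachment title in this bug.
QUERY = "sum by (pod) (rate(prometheus_sd_kubernetes_events_total[5m]))"


def instant_query_url(base_url, query=QUERY):
    """Build an /api/v1/query URL for the given Prometheus base URL."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def idle_pods(response):
    """Return pods whose service discovery event rate is zero in an API response."""
    results = response.get("data", {}).get("result", [])
    return [
        sample["metric"].get("pod", "")
        for sample in results
        if float(sample["value"][1]) == 0.0
    ]
```

A pod that appears in idle_pods right after a reload, while its peer does not, matches the behavior seen for prometheus-k8s-1 in this CI run.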

Comment 24 Simon Pasquier 2020-09-11 12:43:46 UTC

*** Bug 1861543 has been marked as a duplicate of this bug. ***

Comment 28 Junqi Zhao 2020-10-05 05:13:34 UTC

Tested with 4.6.0-0.nightly-2020-10-03-051134: no errors in the Prometheus pod logs. A search of the CI results found no PrometheusNotIngestingSamples / PrometheusNotConnectedToAlertmanagers alerts in recent CI builds.

Comment 31 errata-xmlrpc 2020-10-27 16:06:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
