Bug 1845561

Summary: PrometheusNotIngestingSamples / PrometheusNotConnectedToAlertmanagers alerts firing and scrape manager errors in Prometheus
Product: OpenShift Container Platform
Reporter: German Parente <gparente>
Component: Monitoring
Assignee: Simon Pasquier <spasquie>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: high
Docs Contact:
Priority: high
Version: 4.4
CC: alegrand, anpicker, cblecker, cshereme, dapark, erooth, gshereme, kakkoyun, lcosic, mloibl, mmazur, ngirard, nmalik, pkrupa, spasquie, stwalter, surbania, wking
Target Milestone: ---
Keywords: ServiceDeliveryImpact
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: On a few occasions, the Prometheus configuration reload was triggered while the configuration on disk had not been fully generated. Consequence: Prometheus loaded a configuration without scrape and alerting targets, triggering the PrometheusNotIngestingSamples and PrometheusNotConnectedToAlertmanagers alerts. Fix: The process that reloads the configuration now checks that the configuration on disk is valid before actually reloading Prometheus (a sketch of this validate-then-reload pattern follows the attachment list below). Result: The PrometheusNotIngestingSamples and PrometheusNotConnectedToAlertmanagers alerts no longer fire.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-27 16:06:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description: sum by (pod) (rate(prometheus_sd_kubernetes_events_total[5m]))
Flags: none
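The Doc Text above summarizes the fix: validate the generated configuration on disk before asking Prometheus to reload it. Purely as a hypothetical illustration of that validate-then-reload pattern (this is not the actual prometheus-operator reloader code; the config path is taken from the logs below, the reload URL is an assumption, and the /-/reload endpoint only works when Prometheus runs with --web.enable-lifecycle):

package main

import (
    "fmt"
    "net/http"
    "os"

    "gopkg.in/yaml.v2"
)

// validateConfig rejects configurations that are empty or lack scrape_configs,
// which is the state that made Prometheus drop all scrape and alerting targets
// in this bug.
func validateConfig(path string) error {
    data, err := os.ReadFile(path)
    if err != nil {
        return fmt.Errorf("reading config: %w", err)
    }
    var cfg map[string]interface{}
    if err := yaml.Unmarshal(data, &cfg); err != nil {
        return fmt.Errorf("parsing config: %w", err)
    }
    if len(cfg) == 0 {
        return fmt.Errorf("configuration is empty")
    }
    if _, ok := cfg["scrape_configs"]; !ok {
        return fmt.Errorf("configuration has no scrape_configs")
    }
    return nil
}

func main() {
    configPath := "/etc/prometheus/config_out/prometheus.env.yaml" // path seen in the Prometheus logs below
    reloadURL := "http://localhost:9090/-/reload"                  // assumes --web.enable-lifecycle

    if err := validateConfig(configPath); err != nil {
        fmt.Fprintln(os.Stderr, "refusing to reload:", err)
        os.Exit(1)
    }
    resp, err := http.Post(reloadURL, "", nil)
    if err != nil {
        fmt.Fprintln(os.Stderr, "reload request failed:", err)
        os.Exit(1)
    }
    resp.Body.Close()
    fmt.Println("reload triggered, status:", resp.Status)
}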

Description German Parente 2020-06-09 14:09:18 UTC
Description of problem:

In certain situations, we can see these two alerts firing:

PrometheusNotIngestingSamples
PrometheusNotConnectedToAlertmanagers

When we take a look at the Prometheus logs, we can see these errors:

level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-monitoring/kubelet/2"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-ingress/router-default/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-insights/insights-operator/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-authentication-operator/authentication-operator/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-monitoring/prometheus-operator/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-monitoring/node-exporter/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-etcd-operator/etcd-operator/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-multus/monitor-multus-admission-controller/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-apiserver/openshift-apiserver/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator/0"

The errors disappear once the Prometheus pods are deleted and restarted.

I am just documenting this behavior to keep a trace of it.


Version-Release number of selected component (if applicable): 4.4



How reproducible:

not easily reproduced.
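To confirm the symptom without restarting the pods, one option is to query the affected Prometheus instance for the two alerts. This is only a sketch: it assumes a local port-forward to the pod (for example, oc -n openshift-monitoring port-forward prometheus-k8s-1 9090) so the Prometheus HTTP API is reachable on localhost without the OAuth proxy.

package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
)

func main() {
    // PromQL selector for the two alerts reported in this bug.
    query := `ALERTS{alertname=~"PrometheusNotIngestingSamples|PrometheusNotConnectedToAlertmanagers"}`

    // Assumes a port-forward so the Prometheus HTTP API is available locally.
    u := "http://localhost:9090/api/v1/query?query=" + url.QueryEscape(query)

    resp, err := http.Get(u)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        panic(err)
    }
    // An empty "result" array means neither alert is pending or firing.
    fmt.Println(string(body))
}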

Comment 9 Simon Pasquier 2020-07-09 10:05:49 UTC
I've stumbled upon a CI run that seems to correspond to the issue described here. In this case, the PrometheusNotIngestingSamples and PrometheusNotConnectedToAlertmanagers alerts fired for the prometheus-k8s-1 instance. The last logs from prometheus-k8s-1 indicate that a config reload happened, but there is no message from the k8s discovery component (as if the configuration were empty).

level=info ts=2020-07-08T21:11:13.898Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-07-08T21:11:13.900Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-07-08T21:11:13.901Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-07-08T21:11:13.915Z caller=kubernetes.go:192 component="discovery manager notify" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-07-08T21:11:13.964Z caller=main.go:771 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2020-07-08T21:12:15.901Z caller=main.go:743 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2020-07-08T21:12:15.912Z caller=main.go:771 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-proxy-4.3/1280948762553880576/artifacts/e2e-aws-proxy/pods/openshift-monitoring_prometheus-k8s-1_prometheus.log

Comment 11 Simon Pasquier 2020-07-13 11:02:26 UTC
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-sdn-multitenant-4.4/1282455870612967424 is another example of the same issue. Again the Prometheus logs show that prometheus-k8s-1 was reloaded, but there is no "Using pod service account via in-cluster config" message in the logs, which implies that the configuration is empty...

Comment 12 Simon Pasquier 2020-07-13 11:21:43 UTC
Created attachment 1700821 [details]
sum by (pod) (rate(prometheus_sd_kubernetes_events_total[5m]))

The Prometheus data dump from
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-sdn-multitenant-4.4/1282455870612967424 shows that the k8s service discovery for the prometheus-k8s-1 pod sees no update after the last config reload, confirming the hypothesis that the pod loaded an empty configuration.
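One way to check the empty-configuration hypothesis directly on a live instance is to ask Prometheus for its currently loaded configuration via the /api/v1/status/config endpoint. Again only a sketch, assuming a port-forward to prometheus-k8s-1 as in the earlier example:

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "strings"
)

func main() {
    // Returns the configuration Prometheus currently has loaded, which may
    // differ from the file on disk if the last reload went wrong.
    resp, err := http.Get("http://localhost:9090/api/v1/status/config")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    var payload struct {
        Data struct {
            YAML string `json:"yaml"`
        } `json:"data"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&payload); err != nil {
        panic(err)
    }

    if !strings.Contains(payload.Data.YAML, "scrape_configs") {
        fmt.Println("loaded configuration has no scrape_configs - matches the empty-config hypothesis")
    } else {
        fmt.Println("loaded configuration contains scrape_configs")
    }
}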

Comment 24 Simon Pasquier 2020-09-11 12:43:46 UTC
*** Bug 1861543 has been marked as a duplicate of this bug. ***

Comment 28 Junqi Zhao 2020-10-05 05:13:34 UTC
Tested with 4.6.0-0.nightly-2020-10-03-051134: no errors were seen in the Prometheus pod's logs, and a search of recent CI builds found no PrometheusNotIngestingSamples / PrometheusNotConnectedToAlertmanagers alerts.

Comment 31 errata-xmlrpc 2020-10-27 16:06:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196