Description of problem:

In certain situations we can see these two alerts firing:

PrometheusNotIngestingSamples
PrometheusNotConnectedToAlertmanagers

When we take a look at the Prometheus logs, we can see these errors:

level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-monitoring/kubelet/2"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-ingress/router-default/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-insights/insights-operator/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-authentication-operator/authentication-operator/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-monitoring/prometheus-operator/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-monitoring/node-exporter/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-etcd-operator/etcd-operator/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-multus/monitor-multus-admission-controller/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-apiserver/openshift-apiserver/0"
level=error ts=2020-06-02T10:48:14.341Z caller=manager.go:118 component="scrape manager" msg="error reloading target set" err="invalid config id:openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator/0"

The errors disappear once the Prometheus pods are deleted and restarted. I am just documenting this bug behavior to have a trace.

Version-Release number of selected component (if applicable):
4.4

How reproducible:
Not easily reproduced.
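For reference, a quick way to confirm the symptom and apply the restart workaround (a sketch assuming the default openshift-monitoring namespace and the default pod/container names):

  # look for the reload errors in the prometheus containers
  $ oc -n openshift-monitoring logs prometheus-k8s-0 -c prometheus | grep "error reloading target set"
  $ oc -n openshift-monitoring logs prometheus-k8s-1 -c prometheus | grep "error reloading target set"

  # workaround: delete the affected pods; the prometheus-k8s statefulset recreates them
  $ oc -n openshift-monitoring delete pod prometheus-k8s-0 prometheus-k8s-1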
I've stumbled upon a CI run that seems to correspond with the issue described here. In this case, the PrometheusNotIngestingSamples and PrometheusNotConnectedToAlertmanagers alerts fired for the prometheus-k8s-1 instance. The last log lines from prometheus-k8s-1 show that a config reload happened, but there is no message from the k8s discovery component after it (as if the configuration was empty):

level=info ts=2020-07-08T21:11:13.898Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-07-08T21:11:13.900Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-07-08T21:11:13.901Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-07-08T21:11:13.915Z caller=kubernetes.go:192 component="discovery manager notify" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-07-08T21:11:13.964Z caller=main.go:771 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2020-07-08T21:12:15.901Z caller=main.go:743 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2020-07-08T21:12:15.912Z caller=main.go:771 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-proxy-4.3/1280948762553880576/artifacts/e2e-aws-proxy/pods/openshift-monitoring_prometheus-k8s-1_prometheus.log
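For anyone checking similar CI runs, the signature can be spotted in the archived pod log by comparing reload messages with discovery messages (a rough sketch, using the filter strings from the log above):

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-proxy-4.3/1280948762553880576/artifacts/e2e-aws-proxy/pods/openshift-monitoring_prometheus-k8s-1_prometheus.log \
      | grep -E 'Loading configuration file|Using pod service account via in-cluster config' \
      | tail -n 20

A reload is normally accompanied by fresh "Using pod service account via in-cluster config" lines from the discovery managers; in the affected run the last reload has none.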
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-sdn-multitenant-4.4/1282455870612967424 is another example of the same issue. Again the Prometheus logs show that prometheus-k8s-1 reloaded its configuration, but there is no "Using pod service account via in-cluster config" message afterwards, which implies that the configuration was empty.
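On a live cluster, confirming which instance the alerts are firing for can be done with a query against one of the prometheus-k8s pods, for example (a sketch; it assumes the Prometheus web listener answers on port 9090 inside the pod, which may vary by release):

  $ oc -n openshift-monitoring port-forward pod/prometheus-k8s-0 9090:9090 &
  $ curl -sG http://localhost:9090/api/v1/query \
      --data-urlencode 'query=ALERTS{alertname=~"PrometheusNotIngestingSamples|PrometheusNotConnectedToAlertmanagers",alertstate="firing"}'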
Created attachment 1700821 [details]
sum by (pod) (rate(prometheus_sd_kubernetes_events_total[5m]))

The Prometheus data dump from https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-sdn-multitenant-4.4/1282455870612967424 shows that the k8s service discovery for the prometheus-k8s-1 pod sees no update after the last config reload, confirming the hypothesis that the pod loaded an empty configuration.
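The attached query can also be run against a live cluster to compare the two pods, for example by reusing the port-forward from the previous sketch:

  $ curl -sG http://localhost:9090/api/v1/query \
      --data-urlencode 'query=sum by (pod) (rate(prometheus_sd_kubernetes_events_total[5m]))'

A pod whose rate stays at zero after a config reload matches the symptom described above.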
*** Bug 1861543 has been marked as a duplicate of this bug. ***
Tested with 4.6.0-0.nightly-2020-10-03-051134: no such errors in the Prometheus pods' logs, and a search of recent CI builds found no PrometheusNotIngestingSamples / PrometheusNotConnectedToAlertmanagers alerts.
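For anyone re-verifying, the check boils down to confirming that the reload errors are gone from the pod logs and that the two alerts are not firing, e.g. (a sketch reusing the commands from the earlier comments, including the port-forward):

  $ oc -n openshift-monitoring logs prometheus-k8s-1 -c prometheus | grep -c "error reloading target set"    # expect 0
  $ curl -sG http://localhost:9090/api/v1/query \
      --data-urlencode 'query=ALERTS{alertname=~"PrometheusNotIngestingSamples|PrometheusNotConnectedToAlertmanagers"}'    # expect an empty result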
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196