Seen in 4.8.0-0.ci-2021-04-15-093535 -> 4.8.0-0.ci-2021-04-15-142848 update CI [1,2]:

level=error ts=2021-04-15T16:47:09.673Z caller=main.go:293 msg="Error loading config (--config.file=/etc/prometheus/config_out/prometheus.env.yaml)" err="open /etc/prometheus/config_out/prometheus.env.yaml: no such file or directory"

The pod seemed to recover after a restart. Perhaps there is a race between whatever is laying down that file and Prometheus trying to read it? The replacement container successfully loaded the file [3]:

level=info ts=2021-04-15T16:47:24.397Z caller=main.go:887 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml

It only restarted once (a broader variant of this query is sketched at the end of this comment):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1382704884662407168/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "prometheus-k8s-0").status.containerStatuses[] | select(.name == "prometheus").restartCount'
1

This failure seems very common, although not fatal:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=prometheus.env.yaml:+no+such+file+or+directory' | grep 'failures match' | sort
openshift-ibm-roks-toolkit-release-4.6-create-cluster-periodics (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
openshift-ibm-roks-toolkit-release-4.7-create-cluster-periodics (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.6-e2e-aws-serial (all) - 4 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.6-e2e-aws-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.6-e2e-azure-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-ovn-upgrade (all) - 3 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-ovn-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.7-e2e-aws-serial (all) - 5 runs, 40% failed, 200% of failures match = 80% impact
periodic-ci-openshift-release-master-ci-4.7-e2e-gcp-upgrade (all) - 7 runs, 14% failed, 600% of failures match = 86% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-ovn-upgrade (all) - 11 runs, 36% failed, 250% of failures match = 91% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade (all) - 7 runs, 43% failed, 233% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-ovn-upgrade (all) - 4 runs, 25% failed, 400% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-upgrade (all) - 4 runs, 25% failed, 400% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-gcp-ovn-upgrade (all) - 4 runs, 75% failed, 133% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-gcp-upgrade (all) - 4 runs, 25% failed, 400% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-ovirt-upgrade (all) - 2 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial (all) - 16 runs, 31% failed, 180% of failures match = 56% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 17 runs, 82% failed, 121% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 18 runs, 100% failed, 83% of failures match = 83% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 16 runs, 81% failed, 123% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 5 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 19 runs, 89% failed, 100% of failures match = 89% impact
periodic-ci-openshift-release-master-nightly-4.6-e2e-metal-ipi-upgrade (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.6-e2e-vsphere-serial (all) - 6 runs, 83% failed, 80% of failures match = 67% impact
periodic-ci-openshift-release-master-nightly-4.6-e2e-vsphere-upi-serial (all) - 5 runs, 80% failed, 100% of failures match = 80% impact
periodic-ci-openshift-release-master-nightly-4.6-upgrade-from-stable-4.5-e2e-metal-ipi-upgrade (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-aws-upgrade (all) - 6 runs, 33% failed, 100% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-azure-fips-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-gcp-fips-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-metal-ipi-upgrade (all) - 6 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-vsphere-serial (all) - 9 runs, 56% failed, 80% of failures match = 44% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-vsphere-upi-serial (all) - 9 runs, 56% failed, 20% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-metal-ipi-upgrade (all) - 6 runs, 100% failed, 83% of failures match = 83% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-serial (all) - 15 runs, 33% failed, 160% of failures match = 53% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 9 runs, 89% failed, 75% of failures match = 67% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 9 runs, 89% failed, 38% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere-serial (all) - 10 runs, 60% failed, 117% of failures match = 70% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 10 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 10 runs, 100% failed, 100% of failures match = 100% impact
pull-ci-openshift-azure-disk-csi-driver-operator-master-e2e-azure-csi (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
...
pull-ci-operator-framework-operator-marketplace-master-e2e-aws-upgrade (all) - 2 runs, 50% failed, 200% of failures match = 100% impact
rehearse-15360-pull-ci-openshift-aws-ebs-csi-driver-operator-master-e2e-aws-csi-migration (all) - 4 runs, 50% failed, 50% of failures match = 25% impact
...
rehearse-17778-pull-ci-operator-framework-operator-marketplace-release-4.9-e2e-aws-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
release-openshift-ocp-installer-e2e-azure-serial-4.6 (all) - 4 runs, 25% failed, 300% of failures match = 75% impact
release-openshift-ocp-installer-e2e-azure-serial-4.8 (all) - 9 runs, 56% failed, 120% of failures match = 67% impact
release-openshift-ocp-installer-e2e-gcp-serial-4.2 (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
release-openshift-ocp-installer-e2e-gcp-serial-4.7 (all) - 6 runs, 50% failed, 167% of failures match = 83% impact
release-openshift-ocp-installer-e2e-gcp-serial-4.8 (all) - 10 runs, 60% failed, 100% of failures match = 60% impact
release-openshift-ocp-installer-e2e-openstack-serial-4.8 (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-ppc64le-4.7-to-4.8 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-s390x-4.7-to-4.8 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 16 runs, 69% failed, 136% of failures match = 94% impact
release-openshift-origin-installer-e2e-aws-upgrade (all) - 16 runs, 50% failed, 113% of failures match = 56% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.1-stable-to-4.2-ci (all) - 6 runs, 33% failed, 250% of failures match = 83% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-nightly (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2-to-4.3 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.5-stable-to-4.6-ci (all) - 4 runs, 25% failed, 300% of failures match = 75% impact
release-openshift-origin-installer-e2e-gcp-serial-4.7 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-gcp-upgrade (all) - 4 runs, 75% failed, 33% of failures match = 25% impact
release-openshift-origin-installer-e2e-gcp-upgrade-4.2 (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
release-openshift-origin-installer-launch-aws (all) - 57 runs, 42% failed, 63% of failures match = 26% impact
release-openshift-origin-installer-launch-azure (all) - 21 runs, 52% failed, 100% of failures match = 52% impact
release-openshift-origin-installer-launch-gcp (all) - 107 runs, 55% failed, 15% of failures match = 8% impact
release-openshift-origin-installer-old-rhcos-e2e-aws-4.2 (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
release-openshift-origin-installer-old-rhcos-e2e-aws-4.7 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-old-rhcos-e2e-aws-4.8 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

It is also mentioned in some older bugs, but I didn't see it in anything recent. Feel free to close as a dup if I just missed the existing ticket.

I guess this could also be an issue with CRI-O not mounting a volume before launching the container or some such. I haven't dug into kubelet/CRI-O logs.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1382704884662407168
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1382704884662407168/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-0_prometheus_previous.log
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1382704884662407168/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-0_prometheus.log
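For what it's worth, here is the broader restart-count query mentioned above. It is purely illustrative: PODS_JSON_URL is a placeholder for the pods.json artifact link used earlier, and the sample output is made up to show the shape, not taken from that job.

$ curl -s "${PODS_JSON_URL}" | jq -r '.items[] | .metadata.name as $pod | .status.containerStatuses[]? | select(.name == "prometheus") | "\($pod) \(.restartCount)"'
prometheus-k8s-0 1
prometheus-k8s-1 0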
It seems this is the same bug as bug 1777216.
Yeah, that looks likely. I'm not thrilled with "exit 1 to get a fresh pass at loading a config file"; it seems like it would be easy enough to have the current process reload the file instead. But if that is the intended behavior, we should at least log something like "prometheus.env.yaml created; exiting so the replacement container will load the new config" so folks aren't left wondering whether this is a bug.
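To illustrate the alternative I have in mind, here is a rough sketch of waiting for the generated config rather than exiting. This is not what the shipped image does; the wrapper, the binary path, and the exact behavior are all hypothetical, only the --config.file flag and the config path come from the logs above.

#!/bin/bash
# Hypothetical entrypoint wrapper: block until the config-reloader has
# written the generated config, then start Prometheus pointing at it.
config=/etc/prometheus/config_out/prometheus.env.yaml
until [ -s "$config" ]; do
    echo "waiting for ${config} to be generated by the config-reloader"
    sleep 1
done
exec /bin/prometheus --config.file="$config" "$@"

Even a loop like that should announce what it is waiting for, which is really the point of my comment above.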
There is an upstream PR for this issue which has been reviewed and should be merged this week: https://github.com/prometheus-operator/prometheus-operator/pull/3955

This issue will therefore be fixed once we bump the prometheus-operator version in CMO.
The upstream PR has been merged and should be included in v0.49.0 (planned for end of June).
FYI: same issue for the prometheus-user-workload pods:

# oc -n openshift-user-workload-monitoring describe pod prometheus-user-workload-0
...
Containers:
  prometheus:
    Container ID:  cri-o://579bd8fa35f14ef615d32f1ab1ea01e7c026b19ae968d2cf1721190dc24712f4
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5dec081bc9e08360e810eac16b662b962e95bd98c163ff5f13059790bf9dfe10
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5dec081bc9e08360e810eac16b662b962e95bd98c163ff5f13059790bf9dfe10
    Port:          <none>
    Host Port:     <none>
    Args:
      --web.console.templates=/etc/prometheus/consoles
      --web.console.libraries=/etc/prometheus/console_libraries
      --config.file=/etc/prometheus/config_out/prometheus.env.yaml
      --storage.tsdb.path=/prometheus
      --storage.tsdb.retention.time=24h
      --web.enable-lifecycle
      --storage.tsdb.no-lockfile
      --web.route-prefix=/
      --web.listen-address=127.0.0.1:9090
    State:          Running
      Started:      Wed, 16 Jun 2021 02:35:33 -0400
    Last State:     Terminated
      Reason:       Error
      Message:      level=error ts=2021-06-16T06:35:31.993Z caller=main.go:347 msg="Error loading config (--config.file=/etc/prometheus/config_out/prometheus.env.yaml)" err="open /etc/prometheus/config_out/prometheus.env.yaml: no such file or directory"
      Exit Code:    2
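For anyone else triaging, the same termination message can be pulled straight from the pod status instead of scrolling through describe output. This is just a convenience sketch; it assumes the usual containerStatuses layout and a pod that has already restarted at least once:

$ oc -n openshift-user-workload-monitoring get pod prometheus-user-workload-0 -o jsonpath='{.status.containerStatuses[?(@.name=="prometheus")].lastState.terminated.message}'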
MODIFIED by https://github.com/openshift/prometheus-operator/pull/131
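If I'm reading the upstream change correctly, the operator now generates the initial configuration before the prometheus container starts (via an init container), so the startup race described above should no longer be possible. One way to sanity-check that on a cluster carrying the fix might be to list the pod's init containers and look for a config-reloader-style entry; the exact container name is my assumption, not something taken from the PR:

$ oc -n openshift-monitoring get pod prometheus-k8s-0 -o jsonpath='{.spec.initContainers[*].name}{"\n"}'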
The issue is fixed with 4.9.0-0.nightly-2021-07-18-155939:

$ oc -n openshift-monitoring get pod | grep prometheus-k8s
prometheus-k8s-0   7/7   Running   0   176m
prometheus-k8s-1   7/7   Running   0   176m

$ oc -n openshift-user-workload-monitoring get pod | grep prometheus-user-workload
prometheus-user-workload-0   5/5   Running   0   8m19s
prometheus-user-workload-1   5/5   Running   0   8m19s

$ oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | head -n 2
level=info ts=2021-07-18T23:45:16.567948925Z caller=main.go:295 msg="Starting Prometheus Operator" version="(version=0.49.0, branch=rhaos-4.9-rhel-8, revision=c878cd4)"
level=info ts=2021-07-18T23:45:16.56799377Z caller=main.go:296 build_context="(go=go1.16.4, user=root, date=20210709-06:10:25)"
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759