Bug 1950173 - Non-fatal: prometheus.env.yaml: no such file or directory
Summary: Non-fatal: prometheus.env.yaml: no such file or directory
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.9.0
Assignee: Filip Petkovski
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-04-15 23:31 UTC by W. Trevor King
Modified: 2021-10-18 17:30 UTC
CC List: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:30:04 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift prometheus-operator pull 131 0 None closed Bug 1977435: Bump prometheus-operator to v0.49.0 2021-07-07 12:59:13 UTC
Github prometheus-operator prometheus-operator issues 3061 0 None open level=error ts=2020-03-02T08:30:43.933Z caller=main.go:727 err="error loading config from \"/etc/prometheus/config_out/p... 2021-04-16 07:10:17 UTC
Github prometheus-operator prometheus-operator pull 3955 0 None open Added init container which applies config file once by setting `--watch-interval` equal to `0` 2021-04-16 07:10:17 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:30:35 UTC

Description W. Trevor King 2021-04-15 23:31:54 UTC
Seen in 4.8.0-0.ci-2021-04-15-093535 -> 4.8.0-0.ci-2021-04-15-142848 update CI [1,2]:

  level=error ts=2021-04-15T16:47:09.673Z caller=main.go:293 msg="Error loading config (--config.file=/etc/prometheus/config_out/prometheus.env.yaml)" err="open /etc/prometheus/config_out/prometheus.env.yaml: no such file or directory"

The pod seemed to recover after a restart.  Perhaps there is a race between whatever lays down that file and Prometheus trying to read it?  The replacement container successfully loaded the file [3]:

  level=info ts=2021-04-15T16:47:24.397Z caller=main.go:887 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml

The container only restarted once:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1382704884662407168/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "prometheus-k8s-0").status.containerStatuses[] | select(.name == "prometheus").restartCount'
  1
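
For completeness, the same pods.json also records why the previous container exited; a variation on the query above (.lastState.terminated is the standard containerStatuses field):

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1382704884662407168/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "prometheus-k8s-0").status.containerStatuses[] | select(.name == "prometheus").lastState.terminated | "\(.exitCode) \(.reason)"'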

The error seems very common in CI, although not fatal:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=prometheus.env.yaml:+no+such+file+or+directory' | grep 'failures match' | sort
openshift-ibm-roks-toolkit-release-4.6-create-cluster-periodics (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
openshift-ibm-roks-toolkit-release-4.7-create-cluster-periodics (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.6-e2e-aws-serial (all) - 4 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.6-e2e-aws-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.6-e2e-azure-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-ovn-upgrade (all) - 3 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-ovn-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.7-e2e-aws-serial (all) - 5 runs, 40% failed, 200% of failures match = 80% impact
periodic-ci-openshift-release-master-ci-4.7-e2e-gcp-upgrade (all) - 7 runs, 14% failed, 600% of failures match = 86% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-ovn-upgrade (all) - 11 runs, 36% failed, 250% of failures match = 91% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade (all) - 7 runs, 43% failed, 233% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-ovn-upgrade (all) - 4 runs, 25% failed, 400% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-upgrade (all) - 4 runs, 25% failed, 400% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-gcp-ovn-upgrade (all) - 4 runs, 75% failed, 133% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-gcp-upgrade (all) - 4 runs, 25% failed, 400% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-ovirt-upgrade (all) - 2 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial (all) - 16 runs, 31% failed, 180% of failures match = 56% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 17 runs, 82% failed, 121% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 18 runs, 100% failed, 83% of failures match = 83% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 16 runs, 81% failed, 123% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 5 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 19 runs, 89% failed, 100% of failures match = 89% impact
periodic-ci-openshift-release-master-nightly-4.6-e2e-metal-ipi-upgrade (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.6-e2e-vsphere-serial (all) - 6 runs, 83% failed, 80% of failures match = 67% impact
periodic-ci-openshift-release-master-nightly-4.6-e2e-vsphere-upi-serial (all) - 5 runs, 80% failed, 100% of failures match = 80% impact
periodic-ci-openshift-release-master-nightly-4.6-upgrade-from-stable-4.5-e2e-metal-ipi-upgrade (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-aws-upgrade (all) - 6 runs, 33% failed, 100% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-azure-fips-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-gcp-fips-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-metal-ipi-upgrade (all) - 6 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-vsphere-serial (all) - 9 runs, 56% failed, 80% of failures match = 44% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-vsphere-upi-serial (all) - 9 runs, 56% failed, 20% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-metal-ipi-upgrade (all) - 6 runs, 100% failed, 83% of failures match = 83% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-serial (all) - 15 runs, 33% failed, 160% of failures match = 53% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 9 runs, 89% failed, 75% of failures match = 67% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 9 runs, 89% failed, 38% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere-serial (all) - 10 runs, 60% failed, 117% of failures match = 70% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 10 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 10 runs, 100% failed, 100% of failures match = 100% impact
pull-ci-openshift-azure-disk-csi-driver-operator-master-e2e-azure-csi (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
...
pull-ci-operator-framework-operator-marketplace-master-e2e-aws-upgrade (all) - 2 runs, 50% failed, 200% of failures match = 100% impact
rehearse-15360-pull-ci-openshift-aws-ebs-csi-driver-operator-master-e2e-aws-csi-migration (all) - 4 runs, 50% failed, 50% of failures match = 25% impact
...
rehearse-17778-pull-ci-operator-framework-operator-marketplace-release-4.9-e2e-aws-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
release-openshift-ocp-installer-e2e-azure-serial-4.6 (all) - 4 runs, 25% failed, 300% of failures match = 75% impact
release-openshift-ocp-installer-e2e-azure-serial-4.8 (all) - 9 runs, 56% failed, 120% of failures match = 67% impact
release-openshift-ocp-installer-e2e-gcp-serial-4.2 (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
release-openshift-ocp-installer-e2e-gcp-serial-4.7 (all) - 6 runs, 50% failed, 167% of failures match = 83% impact
release-openshift-ocp-installer-e2e-gcp-serial-4.8 (all) - 10 runs, 60% failed, 100% of failures match = 60% impact
release-openshift-ocp-installer-e2e-openstack-serial-4.8 (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-ppc64le-4.7-to-4.8 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-s390x-4.7-to-4.8 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 16 runs, 69% failed, 136% of failures match = 94% impact
release-openshift-origin-installer-e2e-aws-upgrade (all) - 16 runs, 50% failed, 113% of failures match = 56% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.1-stable-to-4.2-ci (all) - 6 runs, 33% failed, 250% of failures match = 83% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-nightly (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2-to-4.3 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.5-stable-to-4.6-ci (all) - 4 runs, 25% failed, 300% of failures match = 75% impact
release-openshift-origin-installer-e2e-gcp-serial-4.7 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-gcp-upgrade (all) - 4 runs, 75% failed, 33% of failures match = 25% impact
release-openshift-origin-installer-e2e-gcp-upgrade-4.2 (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
release-openshift-origin-installer-launch-aws (all) - 57 runs, 42% failed, 63% of failures match = 26% impact
release-openshift-origin-installer-launch-azure (all) - 21 runs, 52% failed, 100% of failures match = 52% impact
release-openshift-origin-installer-launch-gcp (all) - 107 runs, 55% failed, 15% of failures match = 8% impact
release-openshift-origin-installer-old-rhcos-e2e-aws-4.2 (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
release-openshift-origin-installer-old-rhcos-e2e-aws-4.7 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-old-rhcos-e2e-aws-4.8 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact


Also mentioned in some older bugs, but I didn't see it mentioned in anything recent.  Feel free to close this as a dup if I just missed the existing ticket.  I guess this could also be an issue with CRI-O not mounting a volume before launching the container or some such.  I haven't dug into the kubelet/CRI-O logs.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1382704884662407168
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1382704884662407168/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-0_prometheus_previous.log
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1382704884662407168/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-0_prometheus.log

Comment 1 Junqi Zhao 2021-04-16 01:34:43 UTC
It seems to be the same bug as bug 1777216.

Comment 2 W. Trevor King 2021-04-16 02:52:36 UTC
Yeah, looks likely.  I'm not super-excited about "exit 1 to get a fresh pass at loading a config file"; it seems like it would be easy enough to have the current process reload the file.  But if that's the intended behavior, it seems like we should at least log a "prometheus.env.yaml created; exiting so the replacement container will load the new config" line or something so we don't have lots of folks wondering if this is a bug or not.
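
As a rough sketch of that alternative (this is not what the reloader does today; it assumes --web.enable-lifecycle is set, which it is in the Prometheus args quoted later in this bug), the sidecar could simply wait for the generated file and then ask the running Prometheus to reload it:

# hypothetical behaviour, for illustration only
until [ -f /etc/prometheus/config_out/prometheus.env.yaml ]; do sleep 1; done
curl -s -X POST http://127.0.0.1:9090/-/reload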

Comment 4 Filip Petkovski 2021-06-08 08:08:30 UTC
There is an upstream PR for this issue which has been reviewed and should be merged this week: https://github.com/prometheus-operator/prometheus-operator/pull/3955

This issue will therefore be fixed once we bump the prometheus-operator version in CMO (cluster-monitoring-operator).
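
For context, the upstream change makes the config reloader run once as an init container, so that prometheus.env.yaml already exists when the prometheus container starts. Roughly (only --watch-interval=0 comes from the PR; the other flags and paths are illustrative assumptions, not the exact generated args):

# run a single reconciliation and exit instead of watching for changes
prometheus-config-reloader \
  --watch-interval=0 \
  --config-file=/etc/prometheus/config/prometheus.yaml.gz \
  --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml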

Comment 5 Simon Pasquier 2021-06-10 07:34:01 UTC
The upstream PR has been merged and should be included in v0.49.0 (planned for end of June).

Comment 6 Junqi Zhao 2021-06-16 06:39:19 UTC
FYI: the same issue affects the prometheus-user-workload pods:
# oc -n openshift-user-workload-monitoring describe pod prometheus-user-workload-0
...
Containers:
  prometheus:
    Container ID:  cri-o://579bd8fa35f14ef615d32f1ab1ea01e7c026b19ae968d2cf1721190dc24712f4
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5dec081bc9e08360e810eac16b662b962e95bd98c163ff5f13059790bf9dfe10
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5dec081bc9e08360e810eac16b662b962e95bd98c163ff5f13059790bf9dfe10
    Port:          <none>
    Host Port:     <none>
    Args:
      --web.console.templates=/etc/prometheus/consoles
      --web.console.libraries=/etc/prometheus/console_libraries
      --config.file=/etc/prometheus/config_out/prometheus.env.yaml
      --storage.tsdb.path=/prometheus
      --storage.tsdb.retention.time=24h
      --web.enable-lifecycle
      --storage.tsdb.no-lockfile
      --web.route-prefix=/
      --web.listen-address=127.0.0.1:9090
    State:       Running
      Started:   Wed, 16 Jun 2021 02:35:33 -0400
    Last State:  Terminated
      Reason:    Error
      Message:   level=error ts=2021-06-16T06:35:31.993Z caller=main.go:347 msg="Error loading config (--config.file=/etc/prometheus/config_out/prometheus.env.yaml)" err="open /etc/prometheus/config_out/prometheus.env.yaml: no such file or directory"

      Exit Code:    2

Comment 7 Simon Pasquier 2021-07-07 12:59:14 UTC
MODIFIED by https://github.com/openshift/prometheus-operator/pull/131

Comment 9 Junqi Zhao 2021-07-19 02:52:43 UTC
The issue is fixed with 4.9.0-0.nightly-2021-07-18-155939:
$ oc -n openshift-monitoring get pod | grep prometheus-k8s
prometheus-k8s-0                             7/7     Running   0          176m
prometheus-k8s-1                             7/7     Running   0          176m

$ oc -n openshift-user-workload-monitoring get pod | grep prometheus-user-workload
prometheus-user-workload-0             5/5     Running   0          8m19s
prometheus-user-workload-1             5/5     Running   0          8m19s

$ oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | head -n 2
level=info ts=2021-07-18T23:45:16.567948925Z caller=main.go:295 msg="Starting Prometheus Operator" version="(version=0.49.0, branch=rhaos-4.9-rhel-8, revision=c878cd4)"
level=info ts=2021-07-18T23:45:16.56799377Z caller=main.go:296 build_context="(go=go1.16.4, user=root, date=20210709-06:10:25)"
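
One additional check (a sketch; the init container's presence and name in the generated statefulset are assumptions based on the upstream PR, not something captured in this bug):

$ oc -n openshift-monitoring get statefulset prometheus-k8s -o jsonpath='{.spec.template.spec.initContainers[*].name}{"\n"}'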

Comment 16 errata-xmlrpc 2021-10-18 17:30:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

