Description of problem:

Spotted in this CI job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-canary/1503017798106877952

The alert was pending for 15m, then firing for 30s, then it resolved.

Version-Release number of selected component (if applicable):
4.10

How reproducible:
Sometimes

Steps to Reproduce:
1.
2.
3.

Actual results:
The PrometheusOperatorWatchErrors alert fires.

Expected results:
The PrometheusOperatorWatchErrors alert doesn't fire.

Additional info:
The alert expression uses a 10-minute rate window, so it won't resolve "quickly" when the operator experiences a transient burst of watch errors. Given that metrics are scraped every 30 seconds, it would be better to use a 5-minute window (consistent with what Prometheus & Alertmanager are doing).
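For illustration, the fix would only need to shrink the rate window of the expression; roughly this (a sketch with label matchers other than job elided, not the exact upstream diff):

  # current rule uses a 10m rate window; the proposal is to use 5m instead
  (
    sum by (controller, namespace) (rate(prometheus_operator_watch_operations_failed_total{job="prometheus-operator"}[5m]))
  /
    sum by (controller, namespace) (rate(prometheus_operator_watch_operations_total{job="prometheus-operator"}[5m]))
  ) > 0.4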
> The alert was pending for 15m, then firing for 30s, then it resolved.

How does that reconcile with:

PrometheusOperatorWatchErrors was at or above info for at least 2m57s on platformidentification.JobType{Release:"4.10", FromRelease:"", Platform:"aws", Network:"sdn", Topology:"ha"} (maxAllowed=23.75s): pending for 9m45s, firing for 2m57s

? I understand it may have been pending for longer than this reports, but why is this reporting that it was firing for almost 3 minutes if it was only firing for 30s? Or are you saying the issue only existed for 30s, but it took almost 3 minutes for enough history to fall off that it stopped firing?
I have no clue about the "[bz-monitoring][invariant] alert/PrometheusOperatorWatchErrors should not be at or above info" failure. Reading the code [1], it seems to reconstruct the pending/firing intervals and verify whether they fall within the 95th percentile value [2].

[1] https://github.com/openshift/origin/blob/6f2ecc7c9690de3209becffd6c7e0544ebede1fa/pkg/synthetictests/allowedalerts/basic_alert.go#L153-L199
[2] https://github.com/openshift/origin/blob/6f2ecc7c9690de3209becffd6c7e0544ebede1fa/pkg/synthetictests/allowedalerts/query_results.json#L2582-L2591
> it seems to reconstruct the pending/firing intervals and verify whether they fall within the 95th percentile value

Yeah, I'm trying to understand why you say it was only firing for 30s, but the test logic seems to think it was firing for almost 3 minutes. (That 3 minutes is an aggregation of all the periods over which it was firing, so it might have fired for 30s, stopped firing, fired for another 30s, etc., totalling 3 minutes. So perhaps you only looked at one instance of it firing?)

If the test logic that computes the duration of the alert is wrong, then that's going to be a general problem for the test.
Created attachment 1865989 [details]
How long PrometheusOperatorWatchErrors has been in firing state

> Yeah, I'm trying to understand why you say it was only firing for 30s,

Correction: it was firing for 1 minute (see attached screenshot). This is what the Prometheus data shows when we query it for ALERTS{alertname="PrometheusOperatorWatchErrors"}.

"pending for 9m45s, firing for 2m57s" makes no sense to me: the alert has a "for: 15m" clause, so it should be pending for at least 15m. I assume that the test doesn't mean the same thing as Prometheus by "pending" and "firing" alerts.
> "pending for 9m45s, firing for 2m57s" makes no sense to me: the alert has a "for: 15m" clause so it should be pending for at least 15m. I assume that the test doesn't mean the same thing than Prometheus about "pending" and "firing" alerts. oh, I see what it's doing. It added up all 3 59s alerts(all PrometheusOperatorWatchErrors alerts, but not all from the same controller), to get 2m 57s total: Mar 13 15:20:26.436 - 59s W alert/PrometheusOperatorWatchErrors ns/openshift-monitoring ALERTS{alertname="PrometheusOperatorWatchErrors", alertstate="firing", controller="alertmanager", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="warning"} Mar 13 15:20:26.436 - 59s W alert/PrometheusOperatorWatchErrors ns/openshift-monitoring ALERTS{alertname="PrometheusOperatorWatchErrors", alertstate="firing", controller="prometheus", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="warning"} Mar 13 15:20:26.436 - 59s W alert/PrometheusOperatorWatchErrors ns/openshift-monitoring ALERTS{alertname="PrometheusOperatorWatchErrors", alertstate="firing", controller="thanos", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="warning"} whether that's really the behavior we want the test to perform or not, i'm not sure. And i still can't quite explain the pending summary behavior being less than 15m, but that seems less important for now. For anyone coming to this in the future, here's all the code in question... it adds up all the intervals during which the alert was firing, and all the intervals during which the alert was pending. it starts by doing a similar query to what you did to find all the firing alerts: https://github.com/openshift/origin/blob/6f2ecc7c9690de3209becffd6c7e0544ebede1fa/pkg/monitor/alerts.go#L36-L37 for any alert separated by less than 5s, it considers it to be on-going, for alerts separated by more than 5s, it treats it as a new interval, but at the end it adds up all the durations: https://github.com/openshift/origin/blob/6f2ecc7c9690de3209becffd6c7e0544ebede1fa/pkg/monitor/alerts.go#L160-L176 https://github.com/openshift/origin/blob/6f2ecc7c9690de3209becffd6c7e0544ebede1fa/pkg/synthetictests/allowedalerts/basic_alert.go#L178 https://github.com/openshift/origin/blob/6f2ecc7c9690de3209becffd6c7e0544ebede1fa/pkg/monitor/monitorapi/types.go#L152-L165
(In reply to Ben Parees from comment #5)
> And I still can't quite explain the pending summary being less than 15m, but that seems less important for now.

Dropping [1] into PromeCIeus and running:

ALERTS{alertname="PrometheusOperatorWatchErrors"}

shows pending from 15:05:26 through 15:20:25, so that's 15m. The reported 9m45s is because [1]:

INFO[2022-03-13T15:15:19Z] Running step e2e-aws-canary-openshift-e2e-test.

so a good chunk of the pending time was before the test-step's monitors were up and watching, and it's summing the three controllers over the ~three minutes it was watching them in pending:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-canary/1503017798106877952/artifacts/e2e-aws-canary/openshift-e2e-test/artifacts/junit/e2e-intervals_everything_20220313-151710.json | jq -c '.items[] | select(tostring | (contains("PrometheusOperatorWatchErrors") and contains("pending")))'
{"level":"Info","locator":"alert/PrometheusOperatorWatchErrors ns/openshift-monitoring","message":"ALERTS{alertname=\"PrometheusOperatorWatchErrors\", alertstate=\"pending\", controller=\"alertmanager\", namespace=\"openshift-monitoring\", prometheus=\"openshift-monitoring/k8s\", severity=\"warning\"}","from":"2022-03-13T15:17:10Z","to":"2022-03-13T15:20:25Z"}
{"level":"Info","locator":"alert/PrometheusOperatorWatchErrors ns/openshift-monitoring","message":"ALERTS{alertname=\"PrometheusOperatorWatchErrors\", alertstate=\"pending\", controller=\"prometheus\", namespace=\"openshift-monitoring\", prometheus=\"openshift-monitoring/k8s\", severity=\"warning\"}","from":"2022-03-13T15:17:10Z","to":"2022-03-13T15:20:25Z"}
{"level":"Info","locator":"alert/PrometheusOperatorWatchErrors ns/openshift-monitoring","message":"ALERTS{alertname=\"PrometheusOperatorWatchErrors\", alertstate=\"pending\", controller=\"thanos\", namespace=\"openshift-monitoring\", prometheus=\"openshift-monitoring/k8s\", severity=\"warning\"}","from":"2022-03-13T15:17:10Z","to":"2022-03-13T15:20:25Z"}

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-canary/1503017798106877952
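Summing those three pending windows reproduces the reported number: 3 × (15:20:25 − 15:17:10) = 3 × 3m15s = 9m45s. Piping the same intervals artifact through something like this (the jq arithmetic is mine, not from the test code) gives 585 seconds:

$ curl -s <the e2e-intervals artifact URL above> | jq '
    [.items[]
     | select(tostring | (contains("PrometheusOperatorWatchErrors") and contains("pending")))
     | (.to | fromdate) - (.from | fromdate)]
    | add'
585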
> so a good chunk of the pending time was before the test-step's monitors were up and watching

I guess the issue is that the reported time of the alerts the test gets from this query:

https://github.com/openshift/origin/blob/6f2ecc7c9690de3209becffd6c7e0544ebede1fa/pkg/monitor/alerts.go#L36-L37

is bounded by the start time that is being passed in. So even though the alert data was there, the test-monitor code doesn't see it, and generates its own conclusion about how long the alert was firing (or pending).
(In reply to Ben Parees from comment #7)
> is bounded by the start time that is being passed in.

Yup. And in some cases it makes sense to have the origin complaints be limited to the span of time when the origin suite was running. Perhaps the failure message should include "while the test suite was watching from $START_TIME through $STOP_TIME" or something else to make this distinction more clear.
Waiting for the PR to land in a payload.
Tested with payload 4.11.0-0.nightly-2022-03-29-152521; the rule now uses a 5m rate window:

% oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -oyaml | grep -A10 PrometheusOperatorWatchErrors
    - alert: PrometheusOperatorWatchErrors
      annotations:
        description: Errors while performing watch operations in controller {{$labels.controller}} in {{$labels.namespace}} namespace.
        summary: Errors while performing watch operations in controller.
      expr: |
        (sum by (controller,namespace) (rate(prometheus_operator_watch_operations_failed_total{job="prometheus-operator", namespace=~"openshift-monitoring|openshift-user-workload-monitoring"}[5m])) / sum by (controller,namespace) (rate(prometheus_operator_watch_operations_total{job="prometheus-operator", namespace=~"openshift-monitoring|openshift-user-workload-monitoring"}[5m]))) > 0.4
      for: 15m
      labels:
        severity: warning
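To spot-check the live alert state on a test cluster, something like this should also work (a sketch; it assumes curl is available in the prometheus container and the Prometheus API answers on localhost:9090):

# query the in-cluster Prometheus for the alert's current pending/firing state
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -s http://localhost:9090/api/v1/query \
    --data-urlencode 'query=ALERTS{alertname="PrometheusOperatorWatchErrors"}'

An empty result list means the alert is neither pending nor firing.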
Also searched CI test run results: no PrometheusOperatorWatchErrors alert went pending or firing in the last 2 days.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069