Bug 2063905 - PrometheusOperatorWatchErrors alert may fire briefly in case of transient errors from the API server
Summary: PrometheusOperatorWatchErrors alert may fire briefly in case of transient errors from the API server
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Joao Marcal
QA Contact: hongyan li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-14 15:33 UTC by Simon Pasquier
Modified: 2022-08-10 10:54 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 10:54:06 UTC
Target Upstream Version:
Embargoed:


Attachments
How long PrometheusOperatorWatchErrors has been in firing state (199.38 KB, image/png)
2022-03-15 09:40 UTC, Simon Pasquier


Links
Github openshift cluster-monitoring-operator pull 1610 (open): Bug 2063905: [bot] Update jsonnet dependencies (last updated 2022-03-24 08:46:46 UTC)
Red Hat Product Errata RHSA-2022:5069 (last updated 2022-08-10 10:54:21 UTC)

Description Simon Pasquier 2022-03-14 15:33:38 UTC
Description of problem:
Spotted in this CI job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-canary/1503017798106877952

The alert was pending for 15m, then firing for 30s, then it resolved.

Version-Release number of selected component (if applicable):
4.10

How reproducible:
Sometimes

Steps to Reproduce:
1.
2.
3.

Actual results:
The PrometheusOperatorWatchErrors alert fired.

Expected results:
The PrometheusOperatorWatchErrors alert doesn't fire.

Additional info:

The alert expression uses a time range of 10 minutes, so the alert will not resolve "quickly" if the operator experiences a transient burst of watch errors.
Given that metrics are scraped every 30 seconds, it would be better to use a time range of 5 minutes (consistent with what Prometheus & Alertmanager are doing).
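
For illustration, here is a minimal, self-contained Go sketch (not the rule's or Prometheus's actual evaluation logic; the scrape interval, operation rates and burst shape are made-up numbers) showing how much longer a 10m rate() window keeps the failure ratio above the 0.4 threshold than a 5m window after a short burst of watch errors:

// Rough sketch only: simulates rate(failed[w]) / rate(total[w]) over two
// window sizes after a transient burst of watch errors. All numbers below
// (scrape interval, steady-state operation rate, burst shape) are assumptions
// for illustration, not measured values.
package main

import "fmt"

func main() {
    const scrape = 30 // seconds between samples

    // Cumulative counters: ~2 watch operations per scrape in steady state,
    // plus a 3-minute burst (minutes 5-8) of 10 failing operations per scrape.
    type sample struct{ failed, total float64 }
    var failed, total float64
    var samples []sample
    for t := 0; t < 30*60; t += scrape {
        total += 2
        if t >= 5*60 && t < 8*60 {
            failed += 10
            total += 10
        }
        samples = append(samples, sample{failed, total})
    }

    // ratio approximates rate(..._failed_total[w]) / rate(..._total[w]).
    ratio := func(i, windowSec int) float64 {
        j := i - windowSec/scrape
        if j < 0 {
            j = 0
        }
        if d := samples[i].total - samples[j].total; d > 0 {
            return (samples[i].failed - samples[j].failed) / d
        }
        return 0
    }

    // Count how long each window keeps the alert expression true (> 0.4).
    for _, w := range []int{5 * 60, 10 * 60} {
        above := 0
        for i := range samples {
            if ratio(i, w) > 0.4 {
                above++
            }
        }
        fmt.Printf("[%dm] window: expression above 0.4 for ~%d minutes\n", w/60, above*scrape/60)
    }
}

With these made-up numbers the 5m window clears the threshold roughly 5 minutes earlier than the 10m one, which is the behaviour the proposed change is after.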

Comment 1 Ben Parees 2022-03-14 16:29:46 UTC
> The alert was pending for 15m, then firing for 30s, then it resolved.


how does that reconcile with:
PrometheusOperatorWatchErrors was at or above info for at least 2m57s on platformidentification.JobType{Release:"4.10", FromRelease:"", Platform:"aws", Network:"sdn", Topology:"ha"} (maxAllowed=23.75s): pending for 9m45s, firing for 2m57s:

?

I understand it may have been pending for longer than this reports, but why is this reporting it was firing for almost 3 minutes if it was only firing for 30s?  Or you're saying the issue only existed for 30s, but it took almost 3 minutes for enough history to fall off that it stopped firing?

Comment 2 Simon Pasquier 2022-03-14 16:48:52 UTC
I have no clue about the "[bz-monitoring][invariant] alert/PrometheusOperatorWatchErrors should not be at or above info" failure. Reading the code [1], it seems to reconstruct the pending/firing intervals and verify whether they fall within the 95th percentile value [2].

[1] https://github.com/openshift/origin/blob/6f2ecc7c9690de3209becffd6c7e0544ebede1fa/pkg/synthetictests/allowedalerts/basic_alert.go#L153-L199
[2] https://github.com/openshift/origin/blob/6f2ecc7c9690de3209becffd6c7e0544ebede1fa/pkg/synthetictests/allowedalerts/query_results.json#L2582-L2591

Comment 3 Ben Parees 2022-03-14 17:10:52 UTC
> it seems to reconstruct the pending/firing intervals and verify whether they fall within the 95th percentile value

yeah i'm trying to understand why you say it was only firing for 30s, but the test logic seems to think it was firing for almost 3 minutes (but that 3 minutes is an aggregation of all the periods over which it was firing, so it might have fired for 30s, then stopped firing, then fired for another 30s, etc, totalling 3 minutes.  So perhaps you only looked at one instance of it firing?)

if the test logic that computes the duration of the alert is wrong, then that's going to be a general problem for the test.

Comment 4 Simon Pasquier 2022-03-15 09:40:12 UTC
Created attachment 1865989 [details]
How long PrometheusOperatorWatchErrors has been in firing state

> yeah i'm trying to understand why you say it was only firing for 30s,

Correction: it's been firing for 1 minute (see attached screenshot). This is what the Prometheus data shows when queried for ALERTS{alertname="PrometheusOperatorWatchErrors"}.

"pending for 9m45s, firing for 2m57s" makes no sense to me: the alert has a "for: 15m" clause so it should be pending for at least 15m. I assume that the test doesn't mean the same thing than Prometheus about "pending" and "firing" alerts.

Comment 5 Ben Parees 2022-03-15 16:06:03 UTC
> "pending for 9m45s, firing for 2m57s" makes no sense to me: the alert has a "for: 15m" clause so it should be pending for at least 15m. I assume that the test doesn't mean the same thing than Prometheus about "pending" and "firing" alerts.


oh, I see what it's doing.  It added up all three 59s alerts (all PrometheusOperatorWatchErrors alerts, but not all from the same controller) to get 2m57s total:

Mar 13 15:20:26.436 - 59s   W alert/PrometheusOperatorWatchErrors ns/openshift-monitoring ALERTS{alertname="PrometheusOperatorWatchErrors", alertstate="firing", controller="alertmanager", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="warning"}
Mar 13 15:20:26.436 - 59s   W alert/PrometheusOperatorWatchErrors ns/openshift-monitoring ALERTS{alertname="PrometheusOperatorWatchErrors", alertstate="firing", controller="prometheus", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="warning"}
Mar 13 15:20:26.436 - 59s   W alert/PrometheusOperatorWatchErrors ns/openshift-monitoring ALERTS{alertname="PrometheusOperatorWatchErrors", alertstate="firing", controller="thanos", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="warning"}


whether that's really the behavior we want the test to perform or not, i'm not sure.

And i still can't quite explain the pending summary behavior being less than 15m, but that seems less important for now.


For anyone coming to this in the future, here's all the code in question... it adds up all the intervals during which the alert was firing, and all the intervals during which the alert was pending.

it starts by doing a similar query to what you did to find all the firing alerts:

https://github.com/openshift/origin/blob/6f2ecc7c9690de3209becffd6c7e0544ebede1fa/pkg/monitor/alerts.go#L36-L37

For any pair of samples separated by less than 5s it considers the alert on-going; for samples separated by more than 5s it starts a new interval; at the end it adds up all the durations (a simplified sketch follows the links below):

https://github.com/openshift/origin/blob/6f2ecc7c9690de3209becffd6c7e0544ebede1fa/pkg/monitor/alerts.go#L160-L176

https://github.com/openshift/origin/blob/6f2ecc7c9690de3209becffd6c7e0544ebede1fa/pkg/synthetictests/allowedalerts/basic_alert.go#L178

https://github.com/openshift/origin/blob/6f2ecc7c9690de3209becffd6c7e0544ebede1fa/pkg/monitor/monitorapi/types.go#L152-L165
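
Here is a condensed, illustrative sketch of that merge-and-sum behaviour (not the actual origin code, and the 1s sample spacing is an assumption; see the links above for the real implementation):

// Illustrative only: merges samples of one alert series into intervals
// (a gap of more than 5s starts a new interval) and sums the durations of
// all series, which is how three 59s firing intervals become 2m57s.
package main

import (
    "fmt"
    "time"
)

// alertSample is one timestamped ALERTS sample for a given label set.
type alertSample struct{ at time.Time }

// intervals merges consecutive samples less than 5s apart into one on-going
// interval and starts a new interval when the gap is larger.
func intervals(samples []alertSample) []time.Duration {
    var out []time.Duration
    var start, last time.Time
    for i, s := range samples {
        if i == 0 || s.at.Sub(last) > 5*time.Second {
            if i != 0 {
                out = append(out, last.Sub(start))
            }
            start = s.at
        }
        last = s.at
    }
    if len(samples) > 0 {
        out = append(out, last.Sub(start))
    }
    return out
}

func main() {
    // Three series (e.g. controller="alertmanager", "prometheus", "thanos"),
    // each firing with 1s-spaced samples spanning 59s.
    t0 := time.Date(2022, 3, 13, 15, 20, 26, 0, time.UTC)
    var total time.Duration
    for c := 0; c < 3; c++ {
        var series []alertSample
        for s := 0; s <= 59; s++ {
            series = append(series, alertSample{t0.Add(time.Duration(s) * time.Second)})
        }
        for _, d := range intervals(series) {
            total += d
        }
    }
    fmt.Println("summed firing duration:", total) // 2m57s
}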

Comment 6 W. Trevor King 2022-03-15 19:14:21 UTC
(In reply to Ben Parees from comment #5)
> And i still can't quite explain the pending summary behavior being less than
> 15m, but that seems less important for now.

Dropping [1] into PromeCIeus, and running:

  ALERTS{alertname="PrometheusOperatorWatchErrors"}

shows pending from 15:05:26 through 15:20:25, so that's 15m.  The reported 9m45s is because [1]:

  INFO[2022-03-13T15:15:19Z] Running step e2e-aws-canary-openshift-e2e-test.

so a good chunk of the pending time was before the test-step's monitors were up and watching, and it's summing the three controllers over the ~3m15s (15:17:10 through 15:20:25) it was watching them in pending, i.e. 3 × 3m15s = 9m45s:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-canary/1503017798106877952/artifacts/e2e-aws-canary/openshift-e2e-test/artifacts/junit/e2e-intervals_everything_20220313-151710.json | jq -c '.items[] | select(tostring | (contains("PrometheusOperatorWatchErrors") and contains("pending")))'
  {"level":"Info","locator":"alert/PrometheusOperatorWatchErrors ns/openshift-monitoring","message":"ALERTS{alertname=\"PrometheusOperatorWatchErrors\", alertstate=\"pending\", controller=\"alertmanager\", namespace=\"openshift-monitoring\", prometheus=\"openshift-monitoring/k8s\", severity=\"warning\"}","from":"2022-03-13T15:17:10Z","to":"2022-03-13T15:20:25Z"}
  {"level":"Info","locator":"alert/PrometheusOperatorWatchErrors ns/openshift-monitoring","message":"ALERTS{alertname=\"PrometheusOperatorWatchErrors\", alertstate=\"pending\", controller=\"prometheus\", namespace=\"openshift-monitoring\", prometheus=\"openshift-monitoring/k8s\", severity=\"warning\"}","from":"2022-03-13T15:17:10Z","to":"2022-03-13T15:20:25Z"}
  {"level":"Info","locator":"alert/PrometheusOperatorWatchErrors ns/openshift-monitoring","message":"ALERTS{alertname=\"PrometheusOperatorWatchErrors\", alertstate=\"pending\", controller=\"thanos\", namespace=\"openshift-monitoring\", prometheus=\"openshift-monitoring/k8s\", severity=\"warning\"}","from":"2022-03-13T15:17:10Z","to":"2022-03-13T15:20:25Z"}

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-canary/1503017798106877952

Comment 7 Ben Parees 2022-03-15 19:46:03 UTC
> so a good chunk of the pending time was before the test-step's monitors were up and watching

I guess the issue is that the reported time of the alerts the test gets from this query:
https://github.com/openshift/origin/blob/6f2ecc7c9690de3209becffd6c7e0544ebede1fa/pkg/monitor/alerts.go#L36-L37

is bounded by the start time that is being passed in.

so even though the alert data was there, the test-monitor code doesn't see it, and generates its own conclusion about how long the alert was firing (or pending).
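
For reference, a minimal sketch of such a time-bounded query (assuming the prometheus/client_golang v1 API; the address, query and window below are placeholders, not the actual origin code). Any ALERTS samples before Start simply never reach the caller, so the reconstructed intervals come out shorter than what Prometheus itself reports:

// Minimal sketch of a range query bounded by a start time; placeholder
// address and window, not the origin implementation.
package main

import (
    "context"
    "fmt"
    "time"

    "github.com/prometheus/client_golang/api"
    v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
    // Placeholder address; in CI this would point at the cluster's Prometheus.
    client, err := api.NewClient(api.Config{Address: "https://prometheus.example.com"})
    if err != nil {
        panic(err)
    }
    promAPI := v1.NewAPI(client)

    // The monitor only started watching at 15:17:10, so the query window
    // begins there even though the alert had been pending since 15:05:26.
    start := time.Date(2022, 3, 13, 15, 17, 10, 0, time.UTC)
    end := time.Date(2022, 3, 13, 15, 21, 0, 0, time.UTC)

    result, warnings, err := promAPI.QueryRange(context.Background(),
        `ALERTS{alertname="PrometheusOperatorWatchErrors"}`,
        v1.Range{Start: start, End: end, Step: time.Second})
    if err != nil {
        panic(err)
    }
    fmt.Println(warnings, result) // only samples within [start, end]
}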

Comment 8 W. Trevor King 2022-03-15 20:14:20 UTC
(In reply to Ben Parees from comment #7)
> is bounded by the start time that is being passed in.

Yup.  And in some cases it makes sense to have the origin complaints be limited to the span of time when the origin suite was running.  Perhaps the failure message should include "while the test suite was watching from $START_TIME through $STOP_TIME" or something else to make this distinction more clear.

Comment 10 hongyan li 2022-03-28 03:14:28 UTC
Waiting for the PR to land in a payload.

Comment 12 hongyan li 2022-03-31 05:12:47 UTC
Tested with payload 4.11.0-0.nightly-2022-03-29-152521
% oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -oyaml|grep -A10 PrometheusOperatorWatchErrors
      - alert: PrometheusOperatorWatchErrors
        annotations:
          description: Errors while performing watch operations in controller {{$labels.controller}}
            in {{$labels.namespace}} namespace.
          summary: Errors while performing watch operations in controller.
        expr: |
          (sum by (controller,namespace) (rate(prometheus_operator_watch_operations_failed_total{job="prometheus-operator", namespace=~"openshift-monitoring|openshift-user-workload-monitoring"}[5m])) / sum by (controller,namespace) (rate(prometheus_operator_watch_operations_total{job="prometheus-operator", namespace=~"openshift-monitoring|openshift-user-workload-monitoring"}[5m]))) > 0.4
        for: 15m
        labels:
          severity: warning

Comment 13 hongyan li 2022-03-31 05:28:35 UTC
Also searched CI test run results; no PrometheusOperatorWatchErrors alert went pending or firing in the last 2 days.

Comment 17 errata-xmlrpc 2022-08-10 10:54:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

