Bug 1866200 - Alert detail page intermittently displays 'No alerts found' after being open for a while
Summary: Alert detail page intermittently displays 'No alerts found' after being open for a while
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Sergiusz Urbaniak
QA Contact: hongyan li
URL:
Whiteboard:
Duplicates: 1872782
Depends On:
Blocks:
 
Reported: 2020-08-05 05:37 UTC by hongyan li
Modified: 2020-11-12 09:50 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:24:53 UTC
Target Upstream Version:
Embargoed:


Attachments
screen recording (441.87 KB, application/x-matroska)
2020-08-05 05:37 UTC, hongyan li


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-machine-approver pull 81 0 None closed Bug 1866200: manifests: fix recording rules group name 2020-12-10 10:14:54 UTC
Github openshift cluster-network-operator pull 752 0 None closed Bug 1866200: fix recording rules group name 2020-12-10 10:14:54 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:25:07 UTC

Description hongyan li 2020-08-05 05:37:57 UTC
Created attachment 1710462 [details]
screen recording

Description of problem:

After the alert detail page has been open for a while, it displays 'No alerts found', then switches back to the detail view, then to 'No alerts found' again; the display keeps alternating.


Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-08-04-193041

How reproducible:
always

Steps to Reproduce:
1. Log in to the management console as an admin user
2. Open Monitoring -> Alerts
3. Click Watchdog to open the alert detail page and wait for a while; the page displays 'No alerts found'
4. The page switches back to the alert detail and then to 'No alerts found' again


Actual results:


Expected results:
The alert detail page continues to display the alert detail information, no matter how long it stays open.

Additional info:

Comment 1 hongyan li 2020-08-05 07:22:39 UTC
The Alert Rule detail page has a similar issue.

After the alert rule detail page has been open for a while, it displays 'No alert rules found', then switches back to the detail view, then to 'No alert rules found' again; the display keeps alternating.

Comment 3 David Taylor 2020-08-10 15:15:30 UTC
I was able to reproduce this on a 4.6.0-0.nightly-2020-08-02-134243 cluster.
After waiting on the Watchdog alert details page, I eventually saw a blank page with "No Alert Found".

Comment 4 David Taylor 2020-08-10 19:36:22 UTC
It appears that subsequent calls to `/api/v1/rules` return a different `rule.id` for the 'Watchdog' alert.

Initial webpage is:
http://0.0.0.0:9000/monitoring/alerts/2525623224?alertname=Watchdog&severity=none
- where '2525623224' is the rule ID used in the URL; initially it matches the `rule.id` in the returned JSON:

{
  "rule": {
   "state": "firing",
   "name": "Watchdog",
   "alerts": [
    {
     "labels": {
      "alertname": "Watchdog",
      "severity": "none"
     },
...
...
   ],
   ...
   ...
   "id": "2525623224"     <------------------------ Initial ID
  },

However, subsequent polling calls to `/api/v1/rules` return a different `rule.id` for `Watchdog`:

{
  "rule": {
   "state": "firing",
   "name": "Watchdog",
   "alerts": [
    {
     "labels": {
      "alertname": "Watchdog",
      "severity": "none"
     },
     ...
     ...
   ],
   ...
   ...
   "id": "3511958173" <------------------------ new ID, although web page is still on 'http://0.0.0.0:9000/monitoring/alerts/2525623224?alertname=Watchdog&severity=none`
  },

The mismatch between the rule ID in the URL and the new `rule.id` returned by the API call causes the web page to show 'No Alert Found'.

- Going back up to the Alerting view, then drilling down to the Watchdog alert details again results in a new rule ID in the URL:
  http://0.0.0.0:9000/monitoring/alerts/3511958173?alertname=Watchdog&severity=none
  which shows the page until `rule.id` reverts to `2525623224`, at which point we again get 'No Alert Found'!

- Going back up to the Alerting view, then drilling down to the Watchdog alert details again results in the URL:
  http://0.0.0.0:9000/monitoring/alerts/2525623224?alertname=Watchdog&severity=none
  which shows the page until `rule.id` changes to '3720669783' (note: different from '3511958173'!)

It seems like `rule.id` changes from `2525623224` to an ID beginning with `3xxxxxxxxx`, then switches back at some point. If you stay on the page with URL http://0.0.0.0:9000/monitoring/alerts/2525623224?alertname=Watchdog&severity=none, the page initially loads, then fails with 'No Alert Found' when `rule.id` === '3xxxxxxxxx'; after some time, without a page refresh, polling of `/api/v1/rules` once again returns `rule.id` === '2525623224' and the page once again shows the correct alert details for Watchdog.
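
To make the failure mode concrete, here is a minimal sketch (my illustration, not the console's actual code) of the lookup described above: the rule ID taken from the URL is matched against the IDs recomputed on every poll of `/api/v1/rules`, so an unstable ID makes the lookup come up empty and the page falls back to 'No alerts found'. The names `Rule` and `findRuleForUrl` are hypothetical.

```
// Illustration only: a simplified lookup of the kind described above.
interface Rule {
  id: string;
  name: string;
}

// Returns the rule whose ID matches the one embedded in the page URL,
// e.g. '2525623224' from /monitoring/alerts/2525623224?alertname=Watchdog.
const findRuleForUrl = (urlRuleId: string, polledRules: Rule[]): Rule | undefined =>
  polledRules.find((rule) => rule.id === urlRuleId);

// First poll: IDs match, so the detail page renders.
console.log(findRuleForUrl('2525623224', [{ id: '2525623224', name: 'Watchdog' }]));

// A later poll recomputes the Watchdog ID as '3511958173': the lookup returns
// undefined and the page renders 'No alerts found'.
console.log(findRuleForUrl('2525623224', [{ id: '3511958173', name: 'Watchdog' }]));
```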

Comment 5 David Taylor 2020-08-10 19:45:55 UTC
I'm seeing the same for a 'CannotRetrieveUpdates' alert. The initial `rule.id` was '755421313', then switched to '1449234446', then back to '755421313', then back to '1449234446'. The display alternates between 'No Alert Found' and the alert details page with the chart.

Comment 6 David Taylor 2020-08-10 20:07:05 UTC
I'm suspicious of https://github.com/openshift/console/blob/master/frontend/public/components/monitoring/utils.ts#L43, where the console code flattens the data returned from `/api/v1/rules` and adds an ID.

Comment 7 David Taylor 2020-08-10 20:30:59 UTC
Re-assigning this to anpicker as I believe he wrote:  https://github.com/openshift/console/blob/master/frontend/public/components/monitoring/utils.ts

Hi Andy, 
I printed out the 'key' from https://github.com/openshift/console/blob/master/frontend/public/components/monitoring/utils.ts#L51 before and after the 'No Alert Found' message; in all cases the 'key' was:

/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-prometheus-k8s-rules.yaml,general.rules,Watchdog,0,vector(1),openshift-monitoring/k8s=prometheus,none=severity

So maybe `id: String(murmur3(key, 'monitoring-salt'))` is doing something to change the IDs?
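
For reference, here is a minimal sketch (assumptions noted in the comments) of how that line derives a rule ID: a comma-joined key built from fields of the rules response is hashed, per `id: String(murmur3(key, 'monitoring-salt'))`. Because the key begins with the group's `file` path, any instability in `file` changes the ID even though the rule itself is unchanged. The hash below is a stand-in for murmur3, just to show the effect.

```
// Stand-in hash for murmur3; any deterministic string hash shows the effect.
const hashString = (s: string): string => {
  let h = 0;
  for (let i = 0; i < s.length; i += 1) {
    h = (Math.imul(31, h) + s.charCodeAt(i)) | 0;
  }
  return String(h >>> 0);
};

// Hypothetical key construction mirroring the key printed above:
// file, group name, rule name, duration, query, labels.
const buildKey = (file: string): string =>
  [
    file,
    'general.rules',
    'Watchdog',
    '0',
    'vector(1)',
    'openshift-monitoring/k8s=prometheus',
    'none=severity',
  ].join(',');

// If polling attributes the 'general.rules' group to a different file, the key
// changes and so does the derived ID, while every other part of the key stays fixed.
const idA = hashString(buildKey('/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-prometheus-k8s-rules.yaml'));
const idB = hashString(buildKey('/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-sdn-networking-rules.yaml'));
console.log(idA !== idB); // true: a flipping `file` yields a flipping rule ID
```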

Comment 8 Andrew Pickering 2020-08-11 08:15:32 UTC
It looks like this is happening because the response from Prometheus' `/rules` endpoint returns different values for `data.groups[].file`. When polling, the value of `file` changes between requests, but it should be the same each time.

Comment 9 Sergiusz Urbaniak 2020-08-11 12:38:09 UTC
The reason for this behavior is simple to explain: the deduplication algorithm in Thanos Querier deduplicates groups based on their "name" field alone, since the file name is "just" the place where the group happens to be mounted.

The offending rule groups in OpenShift are:

```
$ jq '.data.groups[] | { name: .name, file: .file} | select(.name == "general.rules")' rules.json
{
  "name": "general.rules",
  "file": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-cluster-machine-approver-machineapprover-rules.yaml"
}
{
  "name": "general.rules",
  "file": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-prometheus-k8s-rules.yaml"
}
{
  "name": "general.rules",
  "file": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-sdn-networking-rules.yaml"
}
```

I will submit a PR against the offending repos to resolve the clashes.
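
For illustration (this is my sketch, not the Thanos code), deduplicating groups by "name" alone means the three "general.rules" groups above get merged, and whichever input happens to be taken first supplies the `file` reported for the merged group, so `file` can differ from one request to the next:

```
interface RuleGroup {
  name: string;
  file: string;
}

// Keep only the first group seen for each name; its `file` is the one exposed.
const dedupeByName = (groups: RuleGroup[]): RuleGroup[] => {
  const seen = new Map<string, RuleGroup>();
  for (const g of groups) {
    if (!seen.has(g.name)) {
      seen.set(g.name, g);
    }
  }
  return Array.from(seen.values());
};

// If the ordering of the clashing groups varies between polls, the surviving
// `file` varies too, which is exactly the instability observed in comment 8.
const poll1 = dedupeByName([
  { name: 'general.rules', file: 'openshift-monitoring-prometheus-k8s-rules.yaml' },
  { name: 'general.rules', file: 'openshift-sdn-networking-rules.yaml' },
]);
const poll2 = dedupeByName([
  { name: 'general.rules', file: 'openshift-sdn-networking-rules.yaml' },
  { name: 'general.rules', file: 'openshift-monitoring-prometheus-k8s-rules.yaml' },
]);
console.log(poll1[0].file === poll2[0].file); // false
```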

Comment 10 Sergiusz Urbaniak 2020-08-11 16:34:36 UTC
More good news: upstream reached consensus on fixing this centrally in the Thanos Querier Rules API: https://github.com/thanos-io/thanos/issues/3017

I am working on the upstream fix; once it is available we can also bump downstream.

Comment 11 Sergiusz Urbaniak 2020-08-17 12:49:26 UTC
The upstream fix has been merged in https://github.com/thanos-io/thanos/pull/3024.

We're waiting for upstream to publish a release candidate, which we'll pull in once it is out.

Comment 12 Sergiusz Urbaniak 2020-08-24 09:45:00 UTC
lucas: as discussed, let's simply cherry-pick the fix downstream, as it is not clear when upstream will release 0.15-RC or 0.15.

Comment 13 Andrew Pickering 2020-08-27 00:27:25 UTC
*** Bug 1872782 has been marked as a duplicate of this bug. ***

Comment 14 David Taylor 2020-08-28 19:06:34 UTC
I'm still seeing this in 4.6.0-0.nightly-2020-08-27. This might be impacting https://bugzilla.redhat.com/show_bug.cgi?id=1873612.

Comment 15 Sergiusz Urbaniak 2020-09-07 14:19:05 UTC
Setting this to MODIFIED, as the Thanos bump in https://bugzilla.redhat.com/show_bug.cgi?id=1873353 fixes this one too.

Comment 18 hongyan li 2020-09-10 01:58:17 UTC
Tested with payload 4.6.0-0.nightly-2020-09-09-173545:

1. Log in to the management console as an admin user
2. Open Monitoring -> Alerts
3. Click Watchdog to open the alert detail page and wait for a while; the page displays correctly

Comment 20 errata-xmlrpc 2020-10-27 16:24:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

