Created attachment 1710462 [details]
screen recording

Description of problem:
Leave the alert detail page open for a while and it displays 'No alerts found', then switches back to the detail view, then to 'No alerts found' again; the display keeps alternating.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-08-04-193041

How reproducible:
Always

Steps to Reproduce:
1. Log in to the management console as admin
2. Open Monitoring -> Alerts
3. Click Watchdog to open the alert detail page and wait for a while; the page will display 'No alerts found'
4. The page will switch back to the alert detail and then to 'No alerts found' again

Actual results:
The page keeps alternating between the alert detail and 'No alerts found'

Expected results:
With the alert detail page open, the page always displays the alert detail information

Additional info:
The Alert Rule detail page has a similar issue: leave the alert rule detail page open for a while and it displays 'No alert rules found', then switches back to the detail view, then to 'No alert rules found' again; the display keeps alternating.
Possibly related BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1856189: "Sergiusz Urbaniak 2020-08-07 10:05:42 UTC Raising severity to high as the observed symptoms are missing recording and alerting rules."

This might also explain our Cypress monitoring/alert flakes: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_console/6259/pull-ci-openshift-console-master-e2e-gcp-console/1291488604429750272/artifacts/e2e-gcp-console/gui_test_screenshots/cypress/screenshots/monitoring/monitoring.spec.ts/Monitoring%20Alerts%20--%20creates%20and%20expires%20a%20Silence%20%28failed%29.png
Was able to reproduce on a 4.6.0-0.nightly-2020-08-02-134243 cluster. Waited on the Watchdog alert details page and eventually saw a blank page with "No Alert Found".
It appears subsequent calls to `/api/v1/rules` return a different `rule.id` for the 'Watchdog' alert.

The initial web page is http://0.0.0.0:9000/monitoring/alerts/2525623224?alertname=Watchdog&severity=none, where '2525623224' is the rule ID we use in the URL. It is initially that value in the resultant JSON:

```
{
  "rule": {
    "state": "firing",
    "name": "Watchdog",
    "alerts": [
      {
        "labels": {
          "alertname": "Watchdog",
          "severity": "none"
        },
        ...
        ...
      ],
      ...
      ...
    "id": "2525623224"   <------------------------ initial ID
  },
```

However, subsequent polling calls to `/api/v1/rules` return a different `rule.id` for `Watchdog`:

```
{
  "rule": {
    "state": "firing",
    "name": "Watchdog",
    "alerts": [
      {
        "labels": {
          "alertname": "Watchdog",
          "severity": "none"
        },
        ...
        ...
      ],
      ...
      ...
    "id": "3511958173"   <------------------------ new ID, although the web page is still on http://0.0.0.0:9000/monitoring/alerts/2525623224?alertname=Watchdog&severity=none
  },
```

The mismatch between the URL's rule ID and the new rule.id returned from the API call causes the web page to show 'No Alert Found'.

- Going back up to the Alerting view, then drilling down to the Watchdog alert details again, results in the new rule ID in the URL: http://0.0.0.0:9000/monitoring/alerts/3511958173?alertname=Watchdog&severity=none, which shows the page until the rule.id reverts to `2525623224`, where we again get 'No Alert Found'!
- Going back up to the Alerting view, then drilling down to the Watchdog alert details again, results in the URL http://0.0.0.0:9000/monitoring/alerts/2525623224?alertname=Watchdog&severity=none, which shows the page until the rule.id changes to '3720669783' (note: different from '3511958173'!).

It seems the rule.id changes from `2525623224` to rule.ids beginning with `3xxxxxxxxx`, but then switches back at some point: if you stay on the page with URL http://0.0.0.0:9000/monitoring/alerts/2525623224?alertname=Watchdog&severity=none, the page initially loads, then fails with 'No Alert Found' while rule.id === '3xxxxxxx', but after some time, without a page refresh, polling of `/api/v1/rules` once again returns rule.id === '2525623224' and the page once again shows the correct alert details for Watchdog.
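For illustration, a minimal TypeScript sketch (hypothetical names, not the console's actual code) of how a lookup keyed on the URL's rule ID fails once polling returns a different ID for the same rule:

```
// The page keeps the rule ID captured in the URL, while each poll of
// /api/v1/rules can yield a different id for the same logical rule.
type Rule = { id: string; name: string; state: string };

const findRuleForPage = (polledRules: Rule[], urlRuleID: string): Rule | undefined =>
  polledRules.find((rule) => rule.id === urlRuleID);

// Poll 1: Watchdog reported with id '2525623224' -> rule found, details render.
// Poll 2: Watchdog reported with id '3511958173' -> find() returns undefined,
// and the page falls back to 'No Alert Found'.
```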
Seeing the same for a 'CannotRetrieveUpdates' alert. The initial rule.id was '755421313', then it switched to '1449234446', then back to '755421313', then back to '1449234446'. The display alternates between 'No Alert Found' and the alert details page with the chart.
Suspicious of https://github.com/openshift/console/blob/master/frontend/public/components/monitoring/utils.ts#L43, where the console code flattens the data returned from `/api/v1/rules` and adds an ID.
Re-assigning this to anpicker as I believe he wrote https://github.com/openshift/console/blob/master/frontend/public/components/monitoring/utils.ts

Hi Andy, I printed out the 'key' from https://github.com/openshift/console/blob/master/frontend/public/components/monitoring/utils.ts#L51 before and after the 'No Alert Found' message. In all cases the key was:

/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-prometheus-k8s-rules.yaml,general.rules,Watchdog,0,vector(1),openshift-monitoring/k8s=prometheus,none=severity

So maybe `id: String(murmur3(key, 'monitoring-salt'))` is doing something to change the IDs?
Looks like this is happening because the response from Prometheus' `/rules` endpoint returns different values for `data.groups[].file`. When polling, the value of `file` changes between requests, but it should return the same value each time.
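Putting the two observations together: the console derives the rule ID by hashing a key that embeds the group's `file` path (per the `murmur3` call quoted above), so an unstable `file` value produces an unstable ID. A rough TypeScript sketch, assuming the import comes from a package such as murmurhash-js (the console's actual dependency and key construction may differ):

```
import { murmur3 } from 'murmurhash-js'; // assumed import

// Approximation of the key/ID derivation in the console's monitoring utils.ts:
// the key concatenates the group's file path with the group and rule fields,
// then gets hashed. Because `file` is part of the key, a different file path
// for the same logical rule yields a different ID.
const ruleID = (file: string, group: string, rule: string, query: string): string => {
  const key = [file, group, rule, query].join(',');
  return String(murmur3(key, 'monitoring-salt')); // salt as quoted from utils.ts above
};

// Same Watchdog rule, but two different `file` values across polls:
ruleID('.../openshift-monitoring-prometheus-k8s-rules.yaml', 'general.rules', 'Watchdog', 'vector(1)');
ruleID('.../openshift-sdn-networking-rules.yaml', 'general.rules', 'Watchdog', 'vector(1)');
// -> two different IDs, so the ID in the URL intermittently matches nothing.
```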
The reason for this behavior is pretty simple to explain. The deduplication algorithm in the Thanos Querier deduplicates groups based on their "name" field, as the file name is "just" the place where the group has been mounted. The offending rules in OpenShift are:

```
$ jq '.data.groups[] | { name: .name, file: .file} | select(.name == "general.rules")' rules.json
{
  "name": "general.rules",
  "file": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-cluster-machine-approver-machineapprover-rules.yaml"
}
{
  "name": "general.rules",
  "file": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-prometheus-k8s-rules.yaml"
}
{
  "name": "general.rules",
  "file": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-sdn-networking-rules.yaml"
}
```

I will submit PRs against the offending repos to resolve the clashes.
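To make the clash concrete, a hypothetical TypeScript sketch of deduplication keyed on the group name alone (the real Thanos Querier is Go; names here are invented for illustration):

```
type Group = { name: string; file: string };

// Naive dedup keyed only on `name`: the three distinct "general.rules"
// groups collapse into one entry, and the surviving `file` is whichever
// group happened to be processed last.
const dedupeByName = (groups: Group[]): Group[] => {
  const byName = new Map<string, Group>();
  for (const group of groups) {
    byName.set(group.name, group); // last writer wins
  }
  return [...byName.values()];
};

// If the merged upstream responses arrive in a different order on each poll,
// dedupeByName returns a different `file` for "general.rules" each time,
// which matches the instability observed in `data.groups[].file`.
```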
More good news: upstream has reached consensus on fixing this centrally in the Thanos Querier Rules API: https://github.com/thanos-io/thanos/issues/3017

I am working on the upstream fix; once that is available we can also bump downstream.
The upstream fix has been merged in https://github.com/thanos-io/thanos/pull/3024. We're waiting for upstream to publish a release candidate, which we'll pull in once it is out.
lucas: as discussed, let's simply do a cherry-pick against downstream as it is not clear when upstream is going to release 0.15-RC or 0.15.
*** Bug 1872782 has been marked as a duplicate of this bug. ***
Still seeing this in 4.6.0-0.nightly-2020-08-27. This might be impacting https://bugzilla.redhat.com/show_bug.cgi?id=1873612
setting to MODIFIED as the Thanos bump in https://bugzilla.redhat.com/show_bug.cgi?id=1873353 fixes this one too.
Tested with payload 4.6.0-0.nightly-2020-09-09-173545:
1. Log in to the management console as admin
2. Open Monitoring -> Alerts
3. Click Watchdog to open the alert detail page and wait for a while; the page displays correctly
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196