Created attachment 1710462 [details]
screen recording

Description of problem:
Leave the alert detail page open for a while and it displays 'No alerts found', then switches back to the detail view, then to 'No alerts found' again; the display keeps alternating.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-08-04-193041

How reproducible:
Always

Steps to Reproduce:
1. Log in to the management console as admin
2. Open Monitoring -> Alerts
3. Click Watchdog to open the alert detail page and wait for a while; the page will display 'No alerts found'
4. The page will switch back to the alert detail and then to 'No alerts found' again

Actual results:
The page keeps alternating between the alert detail and 'No alerts found'

Expected results:
With the alert detail page open, the page always displays the alert detail information

Additional info:
The Alert Rule detail page has a similar issue: leave the alert rule detail page open for a while and it displays 'No alert rules found', then switches back to the detail view, then to 'No alert rules found' again; the display keeps alternating.
Possibly related BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1856189: "Sergiusz Urbaniak 2020-08-07 10:05:42 UTC Raising severity to high as the observed symptoms are missing recording and alerting rules."

This might also explain our Cypress monitoring/alert flakes: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_console/6259/pull-ci-openshift-console-master-e2e-gcp-console/1291488604429750272/artifacts/e2e-gcp-console/gui_test_screenshots/cypress/screenshots/monitoring/monitoring.spec.ts/Monitoring%20Alerts%20--%20creates%20and%20expires%20a%20Silence%20%28failed%29.png
Was able to reproduce on a 4.6.0-0.nightly-2020-08-02-134243 cluster. Waited on the Watchdog alert details page and eventually saw a blank page with "No Alert Found".
It appears subsequent calls to `/api/v1/rules` return a different `rule.id` for the 'Watchdog' alert.

The initial web page is http://0.0.0.0:9000/monitoring/alerts/2525623224?alertname=Watchdog&severity=none, where '2525623224' is the rule ID we use in the URL. It is initially that value in the resultant JSON:

```
{
  "rule": {
    "state": "firing",
    "name": "Watchdog",
    "alerts": [
      {
        "labels": {
          "alertname": "Watchdog",
          "severity": "none"
        },
        ...
        ...
      ],
      ...
      ...
    "id": "2525623224"   <------------------------ initial ID
  },
```

However, subsequent polling calls to `/api/v1/rules` return a different `rule.id` for `Watchdog`:

```
{
  "rule": {
    "state": "firing",
    "name": "Watchdog",
    "alerts": [
      {
        "labels": {
          "alertname": "Watchdog",
          "severity": "none"
        },
        ...
        ...
      ],
      ...
      ...
    "id": "3511958173"   <------------------------ new ID, although the web page is still on http://0.0.0.0:9000/monitoring/alerts/2525623224?alertname=Watchdog&severity=none
  },
```

The mismatch between the URL's rule ID and the new rule.id returned from the API call causes the web page to show 'No Alert Found'.

- Going back up to the Alerting view, then drilling down to the Watchdog alert details again, results in the new rule ID in the URL: http://0.0.0.0:9000/monitoring/alerts/3511958173?alertname=Watchdog&severity=none, which shows the page until the rule.id reverts to `2525623224`, where we again get 'No Alert Found'!
- Going back up to the Alerting view, then drilling down to the Watchdog alert details again, results in the URL http://0.0.0.0:9000/monitoring/alerts/2525623224?alertname=Watchdog&severity=none, which shows the page until the rule.id changes to '3720669783' (note: different from '3511958173'!).

It seems the rule.id changes from `2525623224` to rule.ids beginning with `3xxxxxxxxx`, but then switches back at some point: if you stay on the page with URL http://0.0.0.0:9000/monitoring/alerts/2525623224?alertname=Watchdog&severity=none, the page initially loads, then fails with 'No Alert Found' while rule.id === '3xxxxxxx', but after some time, without a page refresh, polling of `/api/v1/rules` once again returns rule.id === '2525623224' and the page once again shows the correct alert details for Watchdog.
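For illustration, a minimal TypeScript sketch (hypothetical names, not the console's actual code) of how a lookup keyed on the URL's rule ID fails once polling returns a different ID for the same rule:

```
// The page keeps the rule ID captured in the URL, while each poll of
// /api/v1/rules can yield a different id for the same logical rule.
type Rule = { id: string; name: string; state: string };

const findRuleForPage = (polledRules: Rule[], urlRuleID: string): Rule | undefined =>
  polledRules.find((rule) => rule.id === urlRuleID);

// Poll 1: Watchdog reported with id '2525623224' -> rule found, details render.
// Poll 2: Watchdog reported with id '3511958173' -> find() returns undefined,
// and the page falls back to 'No Alert Found'.
```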
Seeing the same for a 'CannotRetrieveUpdates' alert. The initial rule.id was '755421313', then it switched to '1449234446', then back to '755421313', then back to '1449234446'. The display alternates between 'No Alert Found' and the alert details page with the chart.
Suspicious of https://github.com/openshift/console/blob/master/frontend/public/components/monitoring/utils.ts#L43, where the console code flattens the data returned from `/api/v1/rules` and adds an ID.
Re-assigning this to anpicker as I believe he wrote https://github.com/openshift/console/blob/master/frontend/public/components/monitoring/utils.ts

Hi Andy, I printed out the 'key' from https://github.com/openshift/console/blob/master/frontend/public/components/monitoring/utils.ts#L51 before and after the 'No Alert Found' message. In all cases the key was:

/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-prometheus-k8s-rules.yaml,general.rules,Watchdog,0,vector(1),openshift-monitoring/k8s=prometheus,none=severity

So maybe `id: String(murmur3(key, 'monitoring-salt'))` is doing something to change the IDs?
Looks like this is happening because the response from Prometheus' `/rules` endpoint returns different values for `data.groups[].file`. When polling, the value of `file` changes between requests, but it should return the same value each time.
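Putting the two observations together: the console derives the rule ID by hashing a key that embeds the group's `file` path (per the `murmur3` call quoted above), so an unstable `file` value produces an unstable ID. A rough TypeScript sketch, assuming the import comes from a package such as murmurhash-js (the console's actual dependency and key construction may differ):

```
import { murmur3 } from 'murmurhash-js'; // assumed import

// Approximation of the key/ID derivation in the console's monitoring utils.ts:
// the key concatenates the group's file path with the group and rule fields,
// then gets hashed. Because `file` is part of the key, a different file path
// for the same logical rule yields a different ID.
const ruleID = (file: string, group: string, rule: string, query: string): string => {
  const key = [file, group, rule, query].join(',');
  return String(murmur3(key, 'monitoring-salt')); // salt as quoted from utils.ts above
};

// Same Watchdog rule, but two different `file` values across polls:
ruleID('.../openshift-monitoring-prometheus-k8s-rules.yaml', 'general.rules', 'Watchdog', 'vector(1)');
ruleID('.../openshift-sdn-networking-rules.yaml', 'general.rules', 'Watchdog', 'vector(1)');
// -> two different IDs, so the ID in the URL intermittently matches nothing.
```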
The reason for this behavior is pretty simple to explain. The deduplication algorithm in the Thanos Querier deduplicates groups based on their "name" field, as the file name is "just" the place where the group has been mounted. The offending rules in OpenShift are:

```
$ jq '.data.groups[] | { name: .name, file: .file} | select(.name == "general.rules")' rules.json
{
  "name": "general.rules",
  "file": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-cluster-machine-approver-machineapprover-rules.yaml"
}
{
  "name": "general.rules",
  "file": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-prometheus-k8s-rules.yaml"
}
{
  "name": "general.rules",
  "file": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-sdn-networking-rules.yaml"
}
```

I will submit PRs against the offending repos to resolve the clashes.
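To make the clash concrete, a hypothetical TypeScript sketch of deduplication keyed on the group name alone (the real Thanos Querier is Go; names here are invented for illustration):

```
type Group = { name: string; file: string };

// Naive dedup keyed only on `name`: the three distinct "general.rules"
// groups collapse into one entry, and the surviving `file` is whichever
// group happened to be processed last.
const dedupeByName = (groups: Group[]): Group[] => {
  const byName = new Map<string, Group>();
  for (const group of groups) {
    byName.set(group.name, group); // last writer wins
  }
  return [...byName.values()];
};

// If the merged upstream responses arrive in a different order on each poll,
// dedupeByName returns a different `file` for "general.rules" each time,
// which matches the instability observed in `data.groups[].file`.
```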
More good news: upstream has reached consensus on fixing this centrally in the Thanos Querier Rules API: https://github.com/thanos-io/thanos/issues/3017

I am working on the upstream fix; once that is available we can also bump downstream.
The upstream fix has been merged in https://github.com/thanos-io/thanos/pull/3024. We're waiting for upstream to publish a release candidate, which we'll pull in once it is out.
lucas: as discussed, let's simply do a cherry-pick against downstream as it is not clear when upstream is going to release 0.15-RC or 0.15.
*** Bug 1872782 has been marked as a duplicate of this bug. ***
Still seeing this in 4.6.0-0.nightly-2020-08-27. This might be impacting https://bugzilla.redhat.com/show_bug.cgi?id=1873612
setting to MODIFIED as the Thanos bump in https://bugzilla.redhat.com/show_bug.cgi?id=1873353 fixes this one too.
Tested with payload 4.6.0-0.nightly-2020-09-09-173545:
1. Log in to the management console as admin
2. Open Monitoring -> Alerts
3. Click Watchdog to open the alert detail page and wait for a while; the page displays correctly
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196