Bug 1870287
| Summary: | console-master-e2e-gcp-console test periodically fail due to no Alerts found | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | David Taylor <dtaylor> |
| Component: | Monitoring | Assignee: | Sergiusz Urbaniak <surbania> |
| Status: | CLOSED DUPLICATE | QA Contact: | Yadan Pei <yapei> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.6 | CC: | alegrand, anpicker, aos-bugs, dtaylor, erooth, jhadvig, jokerman, juzhao, kakkoyun, lcosic, mloibl, pkrupa, pneedle, pweil, spadgett, surbania |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Known Issue |
| Doc Text: | Rules API back-ends are sometimes missed if Store API stores are detected before Rules API stores. When this occurs, a store reference is created without a Rules API client and the Rules API endpoint from Thanos Querier does not return any rules. (See the sketch after this table.) | Story Points: | --- |
| Clone Of: | | | |
| | 1885946 (view as bug list) | Environment: | |
| Last Closed: | 2020-12-01 00:45:53 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1858991 | | |
| Bug Blocks: | 1885946 | | |
| Attachments: | | | |
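The Doc Text above describes the failure mode tracked here: if the Store API for an endpoint is detected before its Rules API, the store reference is created without a Rules API client and is never upgraded later. A minimal Go sketch of that pattern follows; the type and function names are illustrative assumptions, not Thanos source.

```go
package main

import "fmt"

// storeRef models an endpoint entry; hasRules records whether a Rules API
// client was attached when the entry was created (illustrative only).
type storeRef struct {
	addr     string
	hasRules bool
}

type endpointSet struct {
	refs map[string]*storeRef
}

// update is called on every resolution pass with the addresses that resolved
// for the Store API and for the Rules API on that pass.
func (e *endpointSet) update(storeAddrs, ruleAddrs []string) {
	rules := make(map[string]bool)
	for _, a := range ruleAddrs {
		rules[a] = true
	}
	for _, a := range storeAddrs {
		if _, ok := e.refs[a]; ok {
			// The failure mode: an existing reference is never upgraded with a
			// Rules API client, even if the Rules API resolves on a later pass.
			continue
		}
		e.refs[a] = &storeRef{addr: a, hasRules: rules[a]}
	}
}

func main() {
	es := &endpointSet{refs: make(map[string]*storeRef)}

	// Pass 1: the Store API address resolves, the Rules API lookup fails.
	es.update([]string{"prometheus-k8s-0:10901"}, nil)

	// Pass 2: both resolve, but the store reference already exists.
	es.update([]string{"prometheus-k8s-0:10901"}, []string{"prometheus-k8s-0:10901"})

	for _, r := range es.refs {
		// Prints hasRules=false: rules queries against this backend return nothing.
		fmt.Printf("%s hasRules=%v\n", r.addr, r.hasRules)
	}
}
```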
Description: David Taylor, 2020-08-19 16:39:22 UTC
Here is a video of it working: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_console/6362/pull-ci-openshift-console-master-e2e-gcp-console/1295781805340758016/artifacts/e2e-gcp-console/gui_test_screenshots/cypress/videos/monitoring/monitoring.spec.ts.mp4

Note: Watchdog is the only 'firing' alert; that is the one we use for testing.

Created attachment 1711938 [details]: Initial load of Monitoring -> Alerting with source=Platform and State=firing

Created attachment 1711940 [details]: Monitoring -> Alerting with all filters cleared, 'platform' alerts

Created attachment 1711941 [details]: no platform Alerts and HighErrors notification alert
Hi, just ran into a cluster where there were no source=platform && status=firing Alerts. When I cleared the filter, only 3 user alerts were shown and there was a HighError notification with 'prometheus' strings in the labels associated with the HighError. Please see: attachment 1711938 [details], attachment 1711940 [details], attachment 1711941 [details]. I think this is caused by https://bugzilla.redhat.com/show_bug.cgi?id=1858991

The https://bugzilla.redhat.com/show_bug.cgi?id=1858991 fix has merged, so changed this issue to MODIFIED too.

Still seeing this in CI.

Lowering the severity as we haven't been able to reproduce outside of CI, and this is only failing in about 5% of runs.

We observed one CI run where the response from the `/rules` endpoint contained no rules (the response was `{"groups":[]}`). It seems more likely that this is happening somewhere on the backend, not the frontend, so moving this to the Monitoring component. (A minimal check of the rules endpoint is sketched after this comment.)

We were able to gather more debug data: every time we observe the symptom there is a problem with resolving addresses for the rulesAPI. From the e2e test gathered here (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_console/6753/pull-ci-openshift-console-master-e2e-gcp-console/1309929118241918976/artifacts/e2e-gcp-console/pods/openshift-monitoring_thanos-querier-997dcd9cf-89gqr_thanos-query.log):

level=error ts=2020-09-26T19:34:08.397393771Z caller=query.go:384 msg="failed to resolve addresses for rulesAPIs" err="look IP addresses \"prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local.\": lookup prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local. on 172.30.0.10:53: read udp 10.129.2.8:38312->172.30.0.10:53: i/o timeout"

And from a running cluster we could also observe resolver issues:

$ kubectl -n openshift-monitoring logs thanos-querier-8899c754c-67j9v -c thanos-query
...
level=error ts=2020-09-28T03:18:39.976854363Z caller=query.go:384 msg="failed to resolve addresses for rulesAPIs" err="look IP addresses \"prometheus-k8s-1.prometheus-operated.openshift-monitoring.svc.cluster.local.\": lookup prometheus-k8s-1.prometheus-operated.openshift-monitoring.svc.cluster.local. on 172.30.0.10:53: no such host"
level=error ts=2020-09-28T03:20:44.977430129Z caller=query.go:384 msg="failed to resolve addresses for rulesAPIs" err="look IP addresses \"prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local.\": lookup prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local. on 172.30.0.10:53: no such host"

Independently of the fact that the network is flaky, it seems that Thanos Querier does not re-resolve rulesAPI backends. If this is the case, it must be solved in Thanos itself.

I also filed the issue upstream at https://github.com/thanos-io/thanos/issues/3244 so we can track it there too.

This is planned for the next sprint.
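The empty-rules symptom mentioned above can be checked directly against Thanos Querier. A minimal sketch in Go, assuming local access to the querier (for example via `oc port-forward`) and its Prometheus-compatible `/api/v1/rules` path; the URL, port, and response wrapper are assumptions, not taken from the CI artifacts.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// rulesResponse models the Prometheus-compatible rules API response shape
// (status wrapper assumed); only the group count matters here.
type rulesResponse struct {
	Status string `json:"status"`
	Data   struct {
		Groups []json.RawMessage `json:"groups"`
	} `json:"data"`
}

func main() {
	resp, err := http.Get("http://localhost:9090/api/v1/rules")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var rr rulesResponse
	if err := json.NewDecoder(resp.Body).Decode(&rr); err != nil {
		log.Fatal(err)
	}

	if len(rr.Data.Groups) == 0 {
		// Matches the symptom from the failing runs: no rule groups are
		// returned, so the console's Alerting page shows no platform alerts.
		fmt.Println("no rule groups returned")
		return
	}
	fmt.Printf("%d rule groups returned\n", len(rr.Data.Groups))
}
```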
We submitted a fix upstream: https://github.com/thanos-io/thanos/pull/3280 (the re-resolution idea is sketched at the end of this report).

No such issue within 7 days: https://search.ci.openshift.org/?search=Monitoring%3A+Alerts+creates+and+expires+a+Silence&maxAge=168h&context=3&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Hi, we are seeing this happen quite often in CI: https://search.ci.openshift.org/?search=Monitoring%3A+Alerts+creates+and+expires+a+Silence&maxAge=168h&context=3&type=bug%2Bjunit&name=e2e-gcp-console&maxMatches=5&maxBytes=20971520&groupBy=job

This is now tracked by https://bugzilla.redhat.com/show_bug.cgi?id=1897252

*** This bug has been marked as a duplicate of bug 1897252 ***
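For context on the re-resolution gap diagnosed above, a minimal sketch of periodic DNS re-resolution follows. It is an illustration of the idea only, not the code from the upstream pull request; the hostname and interval are taken from the bug's environment and a guess, respectively.

```go
package main

import (
	"context"
	"log"
	"net"
	"time"
)

// resolveLoop re-resolves host on every tick and hands fresh addresses to
// update; a transient DNS failure (i/o timeout, no such host) only skips one
// tick instead of leaving the backend missing until a restart.
func resolveLoop(ctx context.Context, host string, interval time.Duration, update func([]string)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		addrs, err := net.DefaultResolver.LookupHost(ctx, host)
		if err != nil {
			log.Printf("failed to resolve %s: %v", host, err)
		} else {
			update(addrs)
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()
	resolveLoop(ctx,
		"prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local",
		30*time.Second,
		func(addrs []string) { log.Printf("rules API backends: %v", addrs) })
}
```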