Hi, we are noticing many Alerting pull-ci-openshift-console-master-e2e-gcp-console test flakes: https://search.ci.openshift.org/?search=Monitoring%3A+Alerts+creates+and+expires+a+Silence&maxAge=48h&context=3&type=bug%2Bjunit&name=e2e-gcp-console&maxMatches=5&maxBytes=20971520&groupBy=job

Here is a screenshot: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_console/6236/pull-ci-openshift-console-master-e2e-gcp-console/1295850715876429824/artifacts/e2e-gcp-console/gui_test_screenshots/cypress/screenshots/monitoring/monitoring.spec.ts/Monitoring%20Alerts%20--%20displays%20and%20filters%20the%20Alerts%20list%20page,%20links%20to%20detail%20pages%20%28failed%29.png

A possibly related bug is https://bugzilla.redhat.com/show_bug.cgi?id=1856189, which indicated that some alerts and/or rules were not available after cluster initialization.
Here is a video of it working: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_console/6362/pull-ci-openshift-console-master-e2e-gcp-console/1295781805340758016/artifacts/e2e-gcp-console/gui_test_screenshots/cypress/videos/monitoring/monitoring.spec.ts.mp4

Note: Watchdog is the only 'firing' alert; that is the one we use for testing.
Created attachment 1711938 [details] Initial load of Monitoring -> Alerting with source=Platform and State=firing
Created attachment 1711940 [details] Monitoring -> Alerting with all filters cleared, 'platform' alerts
Created attachment 1711941 [details] no platform Alerts and HighErrors notification alert
Hi, just ran into a cluster where there were no source=platform && status=firing Alerts. When I cleared the filter, only 3 user alerts were shown, and there was a HighError notification with 'prometheus' strings in the labels associated with it. Please see: attachment 1711938 [details], attachment 1711940 [details], attachment 1711941 [details]
I think this is caused by https://bugzilla.redhat.com/show_bug.cgi?id=1858991
The fix for https://bugzilla.redhat.com/show_bug.cgi?id=1858991 has merged, so I have changed this issue to MODIFIED as well.
Still seeing this in CI
Lowering the severity, as we haven't been able to reproduce this outside of CI and it is only failing in about 5% of runs.
We observed one CI run where the response from the `/rules` endpoint contained no rules (the response was `{\"groups\":[]}`). It seems more likely that this is happening somewhere on the backend, not the frontend, so moving this to the Monitoring component.
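For anyone debugging this, below is a minimal Go sketch of the kind of check that can be run against the rules endpoint to detect the empty-groups response. The endpoint URL is a placeholder (use whatever route or port-forward reaches the querier in your cluster), and the response shape simply mirrors the `{\"groups\":[]}` payload quoted above; the upstream Prometheus-compatible API nests the groups under a "data" object, so adjust the struct if you query it directly.

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// rulesResponse mirrors only what is needed to detect the empty
// {"groups":[]} payload seen in CI. The exact envelope is an assumption;
// the upstream Prometheus-compatible API nests groups under "data".
type rulesResponse struct {
	Groups []json.RawMessage `json:"groups"`
}

func main() {
	// Placeholder URL: substitute the route or port-forward that reaches
	// the rules API in your cluster.
	const url = "http://localhost:9090/api/v1/rules"

	resp, err := http.Get(url)
	if err != nil {
		log.Fatalf("request failed: %v", err)
	}
	defer resp.Body.Close()

	var rr rulesResponse
	if err := json.NewDecoder(resp.Body).Decode(&rr); err != nil {
		log.Fatalf("decoding response: %v", err)
	}
	if len(rr.Groups) == 0 {
		fmt.Println("no rule groups returned -- matches the flake symptom")
		return
	}
	fmt.Printf("got %d rule groups\n", len(rr.Groups))
}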
We were able to gather more debug data: every time we observe the symptom, there is a problem resolving addresses for the rules API.

From the e2e test logs gathered here (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_console/6753/pull-ci-openshift-console-master-e2e-gcp-console/1309929118241918976/artifacts/e2e-gcp-console/pods/openshift-monitoring_thanos-querier-997dcd9cf-89gqr_thanos-query.log):

level=error ts=2020-09-26T19:34:08.397393771Z caller=query.go:384 msg="failed to resolve addresses for rulesAPIs" err="look IP addresses \"prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local.\": lookup prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local. on 172.30.0.10:53: read udp 10.129.2.8:38312->172.30.0.10:53: i/o timeout"

And from a running cluster we could also observe resolver issues:

$ kubectl -n openshift-monitoring logs thanos-querier-8899c754c-67j9v -c thanos-query
...
level=error ts=2020-09-28T03:18:39.976854363Z caller=query.go:384 msg="failed to resolve addresses for rulesAPIs" err="look IP addresses \"prometheus-k8s-1.prometheus-operated.openshift-monitoring.svc.cluster.local.\": lookup prometheus-k8s-1.prometheus-operated.openshift-monitoring.svc.cluster.local. on 172.30.0.10:53: no such host"
level=error ts=2020-09-28T03:20:44.977430129Z caller=query.go:384 msg="failed to resolve addresses for rulesAPIs" err="look IP addresses \"prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local.\": lookup prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local. on 172.30.0.10:53: no such host"

Independently of the fact that the network is flaky, it seems that Thanos Querier does not re-resolve the rules API backends. If that is the case, it must be fixed in Thanos itself.
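To illustrate the general technique (this is not the actual Thanos code, just a sketch of the behavior we would expect), the querier should periodically re-resolve the backend hostnames so that a backend whose lookup failed once is picked up again when DNS recovers. In the Go sketch below, the hostnames are taken from the logs above, while the resolver, interval, and timeout values are only illustrative.

package main

import (
	"context"
	"log"
	"net"
	"time"
)

// resolveRulesAPIs looks up the IPs behind each rules API hostname and
// logs (rather than permanently drops) backends whose lookup fails, so
// they are retried on the next tick.
func resolveRulesAPIs(ctx context.Context, r *net.Resolver, hosts []string) map[string][]net.IP {
	addrs := make(map[string][]net.IP)
	for _, h := range hosts {
		ips, err := r.LookupIP(ctx, "ip", h)
		if err != nil {
			// Transient failures ("no such host", i/o timeout) are expected
			// on a flaky network; keep the backend and retry later.
			log.Printf("failed to resolve addresses for rulesAPIs: %v", err)
			continue
		}
		addrs[h] = ips
	}
	return addrs
}

func main() {
	// Hostnames taken from the logs above; interval and timeout are
	// illustrative values, not what Thanos uses.
	hosts := []string{
		"prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local",
		"prometheus-k8s-1.prometheus-operated.openshift-monitoring.svc.cluster.local",
	}

	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		addrs := resolveRulesAPIs(ctx, net.DefaultResolver, hosts)
		cancel()
		log.Printf("resolved rules API backends: %v", addrs)
		<-ticker.C
	}
}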
I also filed the issue upstream https://github.com/thanos-io/thanos/issues/3244 so we can track it there too.
This is planned for the next sprint.
We submitted a fix upstream: https://github.com/thanos-io/thanos/pull/3280
No occurrences of this issue within the last 7 days: https://search.ci.openshift.org/?search=Monitoring%3A+Alerts+creates+and+expires+a+Silence&maxAge=168h&context=3&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
Hi, we are seeing this happen quite often in CI: https://search.ci.openshift.org/?search=Monitoring%3A+Alerts+creates+and+expires+a+Silence&maxAge=168h&context=3&type=bug%2Bjunit&name=e2e-gcp-console&maxMatches=5&maxBytes=20971520&groupBy=job
This is now tracked by https://bugzilla.redhat.com/show_bug.cgi?id=1897252 *** This bug has been marked as a duplicate of bug 1897252 ***