Hi, we are noticing many Alerting pull-ci-openshift-console-master-e2e-gcp-console test flakes: https://search.ci.openshift.org/?search=Monitoring%3A+Alerts+creates+and+expires+a+Silence&maxAge=48h&context=3&type=bug%2Bjunit&name=e2e-gcp-console&maxMatches=5&maxBytes=20971520&groupBy=job

Here is a screenshot: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_console/6236/pull-ci-openshift-console-master-e2e-gcp-console/1295850715876429824/artifacts/e2e-gcp-console/gui_test_screenshots/cypress/screenshots/monitoring/monitoring.spec.ts/Monitoring%20Alerts%20--%20displays%20and%20filters%20the%20Alerts%20list%20page,%20links%20to%20detail%20pages%20%28failed%29.png

A possibly related bug is https://bugzilla.redhat.com/show_bug.cgi?id=1856189, which indicated that some alerts and/or rules were not available after cluster initialization.
Here is a video of it working: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_console/6362/pull-ci-openshift-console-master-e2e-gcp-console/1295781805340758016/artifacts/e2e-gcp-console/gui_test_screenshots/cypress/videos/monitoring/monitoring.spec.ts.mp4

Note: Watchdog is the only 'firing' alert; that is the one we use for testing.
Created attachment 1711938 [details] Initial load of Monitoring -> Alerting with source=Platform and State=firing
Created attachment 1711940 [details] Monitoring -> Alerting with all filters cleared, 'platform' alerts
Created attachment 1711941 [details] no platform Alerts and HighErrors notification alert
Hi, just ran into a cluster where there were no source=platform && status=firing Alerts. When I cleared the filter, only 3 user alerts were shown, and there was a HighError notification with 'prometheus' strings in the labels associated with it. Please see: attachment 1711938 [details], attachment 1711940 [details], attachment 1711941 [details]
I think this is caused by https://bugzilla.redhat.com/show_bug.cgi?id=1858991
The fix for https://bugzilla.redhat.com/show_bug.cgi?id=1858991 has merged, so I have changed this issue to MODIFIED as well.
Still seeing this in CI
Lowering the severity, as we haven't been able to reproduce this outside of CI and it is only failing in about 5% of runs.
We observed one CI run where the response from the `/rules` endpoint contained no rules (the response was `{\"groups\":[]}`). It seems more likely that this is happening somewhere on the backend, not the frontend, so moving this to the Monitoring component.
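For anyone debugging this, below is a minimal Go sketch of the kind of check that can be run against the rules endpoint to detect the empty-groups response. The endpoint URL is a placeholder (use whatever route or port-forward reaches the querier in your cluster), and the response shape simply mirrors the `{\"groups\":[]}` payload quoted above; the upstream Prometheus-compatible API nests the groups under a "data" object, so adjust the struct if you query it directly.

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// rulesResponse mirrors only what is needed to detect the empty
// {"groups":[]} payload seen in CI. The exact envelope is an assumption;
// the upstream Prometheus-compatible API nests groups under "data".
type rulesResponse struct {
	Groups []json.RawMessage `json:"groups"`
}

func main() {
	// Placeholder URL: substitute the route or port-forward that reaches
	// the rules API in your cluster.
	const url = "http://localhost:9090/api/v1/rules"

	resp, err := http.Get(url)
	if err != nil {
		log.Fatalf("request failed: %v", err)
	}
	defer resp.Body.Close()

	var rr rulesResponse
	if err := json.NewDecoder(resp.Body).Decode(&rr); err != nil {
		log.Fatalf("decoding response: %v", err)
	}
	if len(rr.Groups) == 0 {
		fmt.Println("no rule groups returned -- matches the flake symptom")
		return
	}
	fmt.Printf("got %d rule groups\n", len(rr.Groups))
}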
We were able to gather more debug data: every time we observe the symptom, there is a problem resolving addresses for the rules API.

From the e2e test logs gathered here (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_console/6753/pull-ci-openshift-console-master-e2e-gcp-console/1309929118241918976/artifacts/e2e-gcp-console/pods/openshift-monitoring_thanos-querier-997dcd9cf-89gqr_thanos-query.log):

level=error ts=2020-09-26T19:34:08.397393771Z caller=query.go:384 msg="failed to resolve addresses for rulesAPIs" err="look IP addresses \"prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local.\": lookup prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local. on 172.30.0.10:53: read udp 10.129.2.8:38312->172.30.0.10:53: i/o timeout"

And from a running cluster we could also observe resolver issues:

$ kubectl -n openshift-monitoring logs thanos-querier-8899c754c-67j9v -c thanos-query
...
level=error ts=2020-09-28T03:18:39.976854363Z caller=query.go:384 msg="failed to resolve addresses for rulesAPIs" err="look IP addresses \"prometheus-k8s-1.prometheus-operated.openshift-monitoring.svc.cluster.local.\": lookup prometheus-k8s-1.prometheus-operated.openshift-monitoring.svc.cluster.local. on 172.30.0.10:53: no such host"
level=error ts=2020-09-28T03:20:44.977430129Z caller=query.go:384 msg="failed to resolve addresses for rulesAPIs" err="look IP addresses \"prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local.\": lookup prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local. on 172.30.0.10:53: no such host"

Independently of the fact that the network is flaky, it seems that Thanos Querier does not re-resolve the rules API backends. If that is the case, it must be fixed in Thanos itself.
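To illustrate the general technique (this is not the actual Thanos code, just a sketch of the behavior we would expect), the querier should periodically re-resolve the backend hostnames so that a backend whose lookup failed once is picked up again when DNS recovers. In the Go sketch below, the hostnames are taken from the logs above, while the resolver, interval, and timeout values are only illustrative.

package main

import (
	"context"
	"log"
	"net"
	"time"
)

// resolveRulesAPIs looks up the IPs behind each rules API hostname and
// logs (rather than permanently drops) backends whose lookup fails, so
// they are retried on the next tick.
func resolveRulesAPIs(ctx context.Context, r *net.Resolver, hosts []string) map[string][]net.IP {
	addrs := make(map[string][]net.IP)
	for _, h := range hosts {
		ips, err := r.LookupIP(ctx, "ip", h)
		if err != nil {
			// Transient failures ("no such host", i/o timeout) are expected
			// on a flaky network; keep the backend and retry later.
			log.Printf("failed to resolve addresses for rulesAPIs: %v", err)
			continue
		}
		addrs[h] = ips
	}
	return addrs
}

func main() {
	// Hostnames taken from the logs above; interval and timeout are
	// illustrative values, not what Thanos uses.
	hosts := []string{
		"prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local",
		"prometheus-k8s-1.prometheus-operated.openshift-monitoring.svc.cluster.local",
	}

	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		addrs := resolveRulesAPIs(ctx, net.DefaultResolver, hosts)
		cancel()
		log.Printf("resolved rules API backends: %v", addrs)
		<-ticker.C
	}
}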
I also filed the issue upstream https://github.com/thanos-io/thanos/issues/3244 so we can track it there too.
This is planned for the next sprint.
We submitted a fix upstream: https://github.com/thanos-io/thanos/pull/3280
No occurrences of this issue within the last 7 days: https://search.ci.openshift.org/?search=Monitoring%3A+Alerts+creates+and+expires+a+Silence&maxAge=168h&context=3&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
Hi, we are seeing this happen quite often in CI: https://search.ci.openshift.org/?search=Monitoring%3A+Alerts+creates+and+expires+a+Silence&maxAge=168h&context=3&type=bug%2Bjunit&name=e2e-gcp-console&maxMatches=5&maxBytes=20971520&groupBy=job
This is now tracked by https://bugzilla.redhat.com/show_bug.cgi?id=1897252 *** This bug has been marked as a duplicate of bug 1897252 ***