Bug 1870287 - console-master-e2e-gcp-console test periodically fails because no Alerts are found
Summary: console-master-e2e-gcp-console test periodically fails because no Alerts are found
Keywords:
Status: CLOSED DUPLICATE of bug 1897252
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Sergiusz Urbaniak
QA Contact: Yadan Pei
URL:
Whiteboard:
Depends On: 1858991
Blocks: 1885946
 
Reported: 2020-08-19 16:39 UTC by David Taylor
Modified: 2020-12-01 00:46 UTC
CC List: 16 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Rules API back-ends are sometimes missed if Store API stores are detected before Rules API stores. When this occurs, a store reference is created without a Rules API client and the Rules API endpoint from Thanos Querier does not return any rules.
Clone Of:
: 1885946 (view as bug list)
Environment:
Last Closed: 2020-12-01 00:45:53 UTC
Target Upstream Version:
Embargoed:
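
The Doc Text above describes an ordering problem: a backend registered only through its Store API ends up with no Rules API client, so rule queries against it come back empty. The following is a minimal, hypothetical Go sketch of that situation; the types, names, and address are made up for illustration and are not the real Thanos code.

```go
package main

import "fmt"

// rulesClient stands in for the per-store Rules API client; nil means the
// Rules API was never detected for this backend.
type rulesClient struct{}

// storeRef is a simplified stand-in for the querier's reference to one backend.
type storeRef struct {
    addr  string
    rules *rulesClient
}

// ruleGroups returns the rule groups reachable through this reference.
func (s *storeRef) ruleGroups() []string {
    if s.rules == nil {
        // Backend was registered from its Store API before its Rules API was
        // seen, so rule queries against it silently return nothing.
        return nil
    }
    return []string{"example-group"}
}

func main() {
    // Store API detected first: the reference is created without a rules client.
    ref := &storeRef{addr: "prometheus-k8s-0:10901"}
    fmt.Printf("rule groups from %s: %v\n", ref.addr, ref.ruleGroups())
}
```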


Attachments
Initial load of Monitoring -> Alerting with source=Platform and State=firing (48.92 KB, image/png)
2020-08-19 19:59 UTC, David Taylor
Monitoring -> Alerting with all filters cleared, 'platform' alerts (45.68 KB, image/png)
2020-08-19 20:00 UTC, David Taylor
no platform Alerts and HighErrors notification alert (105.49 KB, image/png)
2020-08-19 20:01 UTC, David Taylor


Links
Github openshift/thanos pull 37 (closed): Bug 1870287: pkg/query: eventually update rules client (last updated 2021-01-12 03:21:37 UTC)
Github thanos-io/thanos issue 3244 (closed): Thanos Querier misses Rules API backends (last updated 2021-01-12 03:21:34 UTC)

Comment 2 David Taylor 2020-08-19 19:59:58 UTC
Created attachment 1711938 [details]
Initial load of Monitoring -> Alerting with source=Platform and State=firing

Comment 3 David Taylor 2020-08-19 20:00:50 UTC
Created attachment 1711940 [details]
Monitoring -> Alerting with all filters cleared, 'platform' alerts

Comment 4 David Taylor 2020-08-19 20:01:27 UTC
Created attachment 1711941 [details]
no platform Alerts and HighErrors notification alert

Comment 5 David Taylor 2020-08-19 20:04:30 UTC
Hi, I just ran into a cluster where there were no source=platform && status=firing Alerts. When I cleared the filter, only 3 user alerts were shown, and there was a HighError notification with 'prometheus' strings in the labels associated with the HighError. Please see:
attachment 1711938 [details]
attachment 1711940 [details]
attachment 1711941 [details]

Comment 6 Andrew Pickering 2020-08-21 11:27:16 UTC
I think this is caused by https://bugzilla.redhat.com/show_bug.cgi?id=1858991

Comment 9 Andrew Pickering 2020-09-09 00:25:38 UTC
The fix for https://bugzilla.redhat.com/show_bug.cgi?id=1858991 has merged, so I have moved this issue to MODIFIED as well.

Comment 12 Andrew Pickering 2020-09-09 12:36:40 UTC
Still seeing this in CI

Comment 13 Samuel Padgett 2020-09-11 02:14:45 UTC
Lowering the severity, as we haven't been able to reproduce this outside of CI and it is only failing in about 5% of runs.

Comment 14 Andrew Pickering 2020-09-15 07:00:47 UTC
We observed one CI run where the response from the `/rules` endpoint contained no rules (response was `{"groups":[]}`).

It seems more likely that this is happening somewhere on the backend, not the frontend, so I am moving this to the Monitoring component.
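
For reference, here is a minimal Go sketch (not part of the CI suite) of the kind of check that surfaces this symptom: fetch a rules endpoint and report when the group list is empty. The URL is a placeholder; the real endpoint sits behind the thanos-querier service (or the console's proxy) and requires authentication, and depending on which endpoint you hit you may first need to unwrap the standard Prometheus-style status/data envelope rather than the bare `{"groups":[]}` quoted above.

```go
package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// rulesPayload matches the shape of the empty response quoted above: {"groups":[]}.
type rulesPayload struct {
    Groups []json.RawMessage `json:"groups"`
}

func main() {
    // Placeholder URL; substitute the actual route, port, and auth for your cluster.
    resp, err := http.Get("http://localhost:9090/api/v1/rules")
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    defer resp.Body.Close()

    var payload rulesPayload
    if err := json.NewDecoder(resp.Body).Decode(&payload); err != nil {
        fmt.Println("decode failed:", err)
        return
    }
    if len(payload.Groups) == 0 {
        fmt.Println("symptom reproduced: rules endpoint returned no groups")
        return
    }
    fmt.Printf("rules endpoint returned %d groups\n", len(payload.Groups))
}
```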

Comment 15 Sergiusz Urbaniak 2020-09-28 07:04:31 UTC
We were able to gather more debug data: every time we observe the symptom, there is a problem resolving addresses for the rulesAPI. From the e2e test artifacts gathered here (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_console/6753/pull-ci-openshift-console-master-e2e-gcp-console/1309929118241918976/artifacts/e2e-gcp-console/pods/openshift-monitoring_thanos-querier-997dcd9cf-89gqr_thanos-query.log):

level=error ts=2020-09-26T19:34:08.397393771Z caller=query.go:384 msg="failed to resolve addresses for rulesAPIs" err="look IP addresses \"prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local.\": lookup prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local. on 172.30.0.10:53: read udp 10.129.2.8:38312->172.30.0.10:53: i/o timeout"

And from a running cluster we could also observe resolver issues:

$ kubectl -n openshift-monitoring logs thanos-querier-8899c754c-67j9v -c thanos-query
...
level=error ts=2020-09-28T03:18:39.976854363Z caller=query.go:384 msg="failed to resolve addresses for rulesAPIs" err="look IP addresses \"prometheus-k8s-1.prometheus-operated.openshift-monitoring.svc.cluster.local.\": lookup prometheus-k8s-1.prometheus-operated.openshift-monitoring.svc.cluster.local. on 172.30.0.10:53: no such host"
level=error ts=2020-09-28T03:20:44.977430129Z caller=query.go:384 msg="failed to resolve addresses for rulesAPIs" err="look IP addresses \"prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local.\": lookup prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local. on 172.30.0.10:53: no such host"

Independent of the fact that the network is flaky, it seems that Thanos Querier does not re-resolve rulesAPI backends. If this is the case, it must be solved in Thanos itself.
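
For illustration only, here is a rough Go sketch (not the actual Thanos code or its eventual fix) of the difference between resolving the rule backends once at startup and retrying on an interval. The hostnames are the ones from the logs above, and the 30-second period is an arbitrary assumption; the point is that re-running the resolution means a transient DNS timeout does not permanently drop the Rules API backends.

```go
package main

import (
    "context"
    "log"
    "net"
    "time"
)

// resolveRuleBackends looks up the headless-service DNS names for the rule backends.
func resolveRuleBackends(ctx context.Context, names []string) []string {
    var addrs []string
    for _, name := range names {
        ips, err := net.DefaultResolver.LookupIPAddr(ctx, name)
        if err != nil {
            log.Printf("failed to resolve addresses for rulesAPIs: %v", err)
            continue
        }
        for _, ip := range ips {
            addrs = append(addrs, ip.String())
        }
    }
    return addrs
}

func main() {
    names := []string{
        "prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local.",
        "prometheus-k8s-1.prometheus-operated.openshift-monitoring.svc.cluster.local.",
    }
    ctx := context.Background()

    // Retry on a ticker instead of resolving once at startup, so a failed lookup
    // is corrected on the next tick rather than leaving the backend list empty.
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()
    for {
        addrs := resolveRuleBackends(ctx, names)
        log.Printf("current rules API backends: %v", addrs)
        <-ticker.C
    }
}
```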

Comment 16 Sergiusz Urbaniak 2020-09-28 10:12:03 UTC
I also filed the issue upstream (https://github.com/thanos-io/thanos/issues/3244) so we can track it there as well.

Comment 17 Sergiusz Urbaniak 2020-10-02 13:29:14 UTC
This is planned for the next sprint.

Comment 18 Sergiusz Urbaniak 2020-10-06 13:03:51 UTC
We submitted a fix upstream: https://github.com/thanos-io/thanos/pull/3280.

Comment 23 Andrew Pickering 2020-12-01 00:45:53 UTC
This is now tracked by https://bugzilla.redhat.com/show_bug.cgi?id=1897252

*** This bug has been marked as a duplicate of bug 1897252 ***

