Bug 1870287
| Summary: | console-master-e2e-gcp-console test periodically fail due to no Alerts found | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | David Taylor <dtaylor> |
| Component: | Monitoring | Assignee: | Sergiusz Urbaniak <surbania> |
| Status: | CLOSED DUPLICATE | QA Contact: | Yadan Pei <yapei> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.6 | CC: | alegrand, anpicker, aos-bugs, dtaylor, erooth, jhadvig, jokerman, juzhao, kakkoyun, lcosic, mloibl, pkrupa, pneedle, pweil, spadgett, surbania |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Known Issue |
| Doc Text: | Rules API back-ends are sometimes missed if Store API stores are detected before Rules API stores. When this occurs, a store reference is created without a Rules API client and the Rules API endpoint from Thanos Querier does not return any rules. (See the sketch after this table.) | Story Points: | --- |
| Clone Of: | | | |
| | 1885946 (view as bug list) | Environment: | |
| Last Closed: | 2020-12-01 00:45:53 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1858991 | | |
| Bug Blocks: | 1885946 | | |
| Attachments: | | | |
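The Doc Text above describes the failure mode tracked here: if the Store API for an endpoint is detected before its Rules API, the store reference is created without a Rules API client and is never upgraded later. A minimal Go sketch of that pattern follows; the type and function names are illustrative assumptions, not Thanos source.

```go
package main

import "fmt"

// storeRef models an endpoint entry; hasRules records whether a Rules API
// client was attached when the entry was created (illustrative only).
type storeRef struct {
	addr     string
	hasRules bool
}

type endpointSet struct {
	refs map[string]*storeRef
}

// update is called on every resolution pass with the addresses that resolved
// for the Store API and for the Rules API on that pass.
func (e *endpointSet) update(storeAddrs, ruleAddrs []string) {
	rules := make(map[string]bool)
	for _, a := range ruleAddrs {
		rules[a] = true
	}
	for _, a := range storeAddrs {
		if _, ok := e.refs[a]; ok {
			// The failure mode: an existing reference is never upgraded with a
			// Rules API client, even if the Rules API resolves on a later pass.
			continue
		}
		e.refs[a] = &storeRef{addr: a, hasRules: rules[a]}
	}
}

func main() {
	es := &endpointSet{refs: make(map[string]*storeRef)}

	// Pass 1: the Store API address resolves, the Rules API lookup fails.
	es.update([]string{"prometheus-k8s-0:10901"}, nil)

	// Pass 2: both resolve, but the store reference already exists.
	es.update([]string{"prometheus-k8s-0:10901"}, []string{"prometheus-k8s-0:10901"})

	for _, r := range es.refs {
		// Prints hasRules=false: rules queries against this backend return nothing.
		fmt.Printf("%s hasRules=%v\n", r.addr, r.hasRules)
	}
}
```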
Description: David Taylor, 2020-08-19 16:39:22 UTC
Here is a video of it working: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_console/6362/pull-ci-openshift-console-master-e2e-gcp-console/1295781805340758016/artifacts/e2e-gcp-console/gui_test_screenshots/cypress/videos/monitoring/monitoring.spec.ts.mp4

Note: Watchdog is the only 'firing' alert; that is the one we use for testing.

Created attachment 1711938 [details]: Initial load of Monitoring -> Alerting with source=Platform and State=firing

Created attachment 1711940 [details]: Monitoring -> Alerting with all filters cleared, 'platform' alerts

Created attachment 1711941 [details]: no platform Alerts and HighErrors notification alert
Hi, just ran into a cluster where there were no source=platform && status=firing Alerts. When I cleared the filter, only 3 user alerts were shown and there was a HighError notification with 'prometheus' strings in the labels associated with the HighError. Please see: attachment 1711938 [details], attachment 1711940 [details], attachment 1711941 [details]. I think this is caused by https://bugzilla.redhat.com/show_bug.cgi?id=1858991

The https://bugzilla.redhat.com/show_bug.cgi?id=1858991 fix has merged, so changed this issue to MODIFIED too.

Still seeing this in CI.

Lowering the severity as we haven't been able to reproduce outside of CI, and this is only failing in about 5% of runs.

We observed one CI run where the response from the `/rules` endpoint contained no rules (the response was `{"groups":[]}`). It seems more likely that this is happening somewhere on the backend, not the frontend, so moving this to the Monitoring component. (A minimal check of the rules endpoint is sketched after this comment.)

We were able to gather more debug data: every time we observe the symptom there is a problem with resolving addresses for the rulesAPI. From the e2e test gathered here (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_console/6753/pull-ci-openshift-console-master-e2e-gcp-console/1309929118241918976/artifacts/e2e-gcp-console/pods/openshift-monitoring_thanos-querier-997dcd9cf-89gqr_thanos-query.log):

level=error ts=2020-09-26T19:34:08.397393771Z caller=query.go:384 msg="failed to resolve addresses for rulesAPIs" err="look IP addresses \"prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local.\": lookup prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local. on 172.30.0.10:53: read udp 10.129.2.8:38312->172.30.0.10:53: i/o timeout"

And from a running cluster we could also observe resolver issues:

$ kubectl -n openshift-monitoring logs thanos-querier-8899c754c-67j9v -c thanos-query
...
level=error ts=2020-09-28T03:18:39.976854363Z caller=query.go:384 msg="failed to resolve addresses for rulesAPIs" err="look IP addresses \"prometheus-k8s-1.prometheus-operated.openshift-monitoring.svc.cluster.local.\": lookup prometheus-k8s-1.prometheus-operated.openshift-monitoring.svc.cluster.local. on 172.30.0.10:53: no such host"
level=error ts=2020-09-28T03:20:44.977430129Z caller=query.go:384 msg="failed to resolve addresses for rulesAPIs" err="look IP addresses \"prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local.\": lookup prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local. on 172.30.0.10:53: no such host"

Independently of the fact that the network is flaky, it seems that Thanos Querier does not re-resolve rulesAPI backends. If this is the case, it must be solved in Thanos itself.

I also filed the issue upstream at https://github.com/thanos-io/thanos/issues/3244 so we can track it there too.

This is planned for the next sprint.
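The empty-rules symptom mentioned above can be checked directly against Thanos Querier. A minimal sketch in Go, assuming local access to the querier (for example via `oc port-forward`) and its Prometheus-compatible `/api/v1/rules` path; the URL, port, and response wrapper are assumptions, not taken from the CI artifacts.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// rulesResponse models the Prometheus-compatible rules API response shape
// (status wrapper assumed); only the group count matters here.
type rulesResponse struct {
	Status string `json:"status"`
	Data   struct {
		Groups []json.RawMessage `json:"groups"`
	} `json:"data"`
}

func main() {
	resp, err := http.Get("http://localhost:9090/api/v1/rules")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var rr rulesResponse
	if err := json.NewDecoder(resp.Body).Decode(&rr); err != nil {
		log.Fatal(err)
	}

	if len(rr.Data.Groups) == 0 {
		// Matches the symptom from the failing runs: no rule groups are
		// returned, so the console's Alerting page shows no platform alerts.
		fmt.Println("no rule groups returned")
		return
	}
	fmt.Printf("%d rule groups returned\n", len(rr.Data.Groups))
}
```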
We submitted a fix upstream: https://github.com/thanos-io/thanos/pull/3280 (the re-resolution idea is sketched at the end of this report).

No such issue within 7 days: https://search.ci.openshift.org/?search=Monitoring%3A+Alerts+creates+and+expires+a+Silence&maxAge=168h&context=3&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Hi, we are seeing this happen quite often in CI: https://search.ci.openshift.org/?search=Monitoring%3A+Alerts+creates+and+expires+a+Silence&maxAge=168h&context=3&type=bug%2Bjunit&name=e2e-gcp-console&maxMatches=5&maxBytes=20971520&groupBy=job

This is now tracked by https://bugzilla.redhat.com/show_bug.cgi?id=1897252

*** This bug has been marked as a duplicate of bug 1897252 ***
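For context on the re-resolution gap diagnosed above, a minimal sketch of periodic DNS re-resolution follows. It is an illustration of the idea only, not the code from the upstream pull request; the hostname and interval are taken from the bug's environment and a guess, respectively.

```go
package main

import (
	"context"
	"log"
	"net"
	"time"
)

// resolveLoop re-resolves host on every tick and hands fresh addresses to
// update; a transient DNS failure (i/o timeout, no such host) only skips one
// tick instead of leaving the backend missing until a restart.
func resolveLoop(ctx context.Context, host string, interval time.Duration, update func([]string)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		addrs, err := net.DefaultResolver.LookupHost(ctx, host)
		if err != nil {
			log.Printf("failed to resolve %s: %v", host, err)
		} else {
			update(addrs)
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()
	resolveLoop(ctx,
		"prometheus-k8s-0.prometheus-operated.openshift-monitoring.svc.cluster.local",
		30*time.Second,
		func(addrs []string) { log.Printf("rules API backends: %v", addrs) })
}
```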