Bug 1897252 - Firing alerts are not showing up in console UI after cluster is up for some time
Summary: Firing alerts are not showing up in console UI after cluster is up for some time
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Lili Cosic
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1870287 1901618 1903423 1913961
Depends On:
Blocks: 1885946
 
Reported: 2020-11-12 15:56 UTC by Lili Cosic
Modified: 2021-04-16 15:47 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:32:41 UTC
Target Upstream Version:
Embargoed:


Attachments
no alert under "Monitoring -> Alerts" (141.04 KB, image/png)
2020-12-01 07:59 UTC, Junqi Zhao
thanos-querier logs (2.07 MB, text/plain)
2020-12-01 08:00 UTC, Junqi Zhao


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 995 0 None closed Bug 1897252: Add Thanos query log level 2021-01-18 14:14:53 UTC
Github openshift thanos pull 41 0 None closed Bug 1897252: CARRY: cmd/thanos/query.go: Timeout DNS resolution with refresh inter… 2021-01-18 14:14:53 UTC
Github openshift thanos pull 42 0 None closed Bug 1897252: CARRY: cmd/thanos: fix DNS resolution when ctx is canceled 2021-01-18 14:14:53 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:33:13 UTC

Description Lili Cosic 2020-11-12 15:56:17 UTC
Description of problem:

Firing alerts stop showing up in the console after the cluster has been up for some time.

Version-Release number of selected component (if applicable):

4.7.0-0.ci-2020-11-12-053309 on GCP.

How reproducible:

Unknown; I have not launched another cluster to check.

Steps to Reproduce:
1. Launch a 4.7 cluster and confirm that an alert is firing.
2. After some time, the alert stops showing up in the console.

Actual results:


Expected results:

All alerts that are firing in either Prometheus instance are shown in the console.


Additional info:

Comment 2 Andrew Pickering 2020-11-18 01:02:16 UTC
Seems to be similar to https://bugzilla.redhat.com/show_bug.cgi?id=1870287

One of the Thanos pods was hitting `failed to resolve addresses for rulesAPIs` errors. This caused the response from the `/rules` endpoint to sometimes be empty (when the call hit the failing pod).

Comment 4 Lili Cosic 2020-11-25 16:38:46 UTC
*** Bug 1901618 has been marked as a duplicate of this bug. ***

Comment 7 Andrew Pickering 2020-12-01 00:45:27 UTC
*** Bug 1870287 has been marked as a duplicate of this bug. ***

Comment 8 Junqi Zhao 2020-12-01 07:58:15 UTC
Tested with 4.7.0-0.nightly-2020-11-30-172451: the alerts cannot be found in the UI now (see the attached screenshot).
There are errors in the thanos-querier logs:
level=error ts=2020-12-01T07:53:22.984451181Z caller=query.go:394 msg="failed to resolve addresses for storeAPIs" err="lookup SRV records \"_grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local\": lookup _grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local on 172.30.0.10:53: dial udp 172.30.0.10:53: operation was canceled"
level=error ts=2020-12-01T07:53:22.984914317Z caller=query.go:397 msg="failed to resolve addresses for rulesAPIs" err="lookup SRV records \"_grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local\": lookup _grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local on 172.30.0.10:53: dial udp 172.30.0.10:53: operation was canceled"
level=warn ts=2020-12-01T07:53:24.466067505Z caller=storeset.go:456 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.129.2.25:10901: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 10.129.2.25:10901: connect: no route to host\"" address=10.129.2.25:10901

Comment 9 Junqi Zhao 2020-12-01 07:59:13 UTC
Created attachment 1735129 [details]
no alert under "Monitoring -> Alerts"

Comment 10 Junqi Zhao 2020-12-01 08:00:18 UTC
Created attachment 1735130 [details]
thanos-querier logs

Comment 11 Junqi Zhao 2020-12-02 01:57:25 UTC
This also triggered the ThanosQueryHighDNSFailures alert: "description": "Thanos Query thanos-querier have 100% of failing DNS queries for store endpoints."

Comment 12 Junqi Zhao 2020-12-02 03:33:17 UTC
This is likely caused by bug 1903423, which needs to be fixed first.

Comment 13 Simon Pasquier 2020-12-02 09:14:27 UTC
*** Bug 1903423 has been marked as a duplicate of this bug. ***

Comment 16 Junqi Zhao 2020-12-04 02:53:36 UTC
Tested with 4.7.0-0.nightly-2020-12-03-205004 and monitored "Monitoring -> Alerts" for some time; the alerts are shown in the console.

Comment 20 errata-xmlrpc 2021-02-24 15:32:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 21 Simon Pasquier 2021-04-16 15:47:42 UTC
*** Bug 1913961 has been marked as a duplicate of this bug. ***

