Bug 1897252

Summary:

Firing alerts are not showing up in console UI after cluster is up for some time

Product:

OpenShift Container Platform

Reporter:

Lili Cosic <lcosic>

Component:

Monitoring

Assignee:

Lili Cosic <lcosic>

Status:

CLOSED ERRATA

QA Contact:

Junqi Zhao <juzhao>

Severity:

medium

Docs Contact:

Priority:

unspecified

Version:

4.7

CC:

alegrand, anpicker, cruhm, dtaylor, erooth, juzhao, kakkoyun, lcosic, lsm5, pkrupa, surbania

Target Milestone:

---

Target Release:

4.7.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

No Doc Update

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2021-02-24 15:32:41 UTC

Type:

Bug

Bug Depends On:

Bug Blocks:

1885946

Attachments:

Description	Flags
no alert under "Monitoring -> Alerts"	none
thanos-querier logs	none

Description Lili Cosic 2020-11-12 15:56:17 UTC

Description of problem:

Alerts that are firing stop showing up in the console after the cluster has been up for some time.

Version-Release number of selected component (if applicable):

4.7.0-0.ci-2020-11-12-053309 on GCP.

How reproducible:

Unknown; I have not launched another cluster to check.

Steps to Reproduce:
1. Launch a 4.7 cluster and see an alert firing
2. The alert stops showing up in the console after some time

Actual results:


Expected results:

All alerts that fire in either Prometheus instance are shown in the console.


Additional info:

Comment 2 Andrew Pickering 2020-11-18 01:02:16 UTC

Seems to be similar to https://bugzilla.redhat.com/show_bug.cgi?id=1870287

One of the Thanos pods was hitting `failed to resolve addresses for rulesAPIs` errors. This caused the response from the `/rules` endpoint to sometimes be empty (when the call hit the failing pod).
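
The intermittent behaviour described above can be checked by calling the rules endpoint repeatedly and counting how often the response comes back empty. The following is only a minimal Python sketch, not the console's actual code. It assumes the Prometheus-compatible response shape `{"status": "success", "data": {"groups": [...]}}`, and `fetch_rules` is a hypothetical callable supplied by the caller (for example, a wrapper around an authenticated HTTP GET against the thanos-querier route).

```python
from typing import Any, Callable, Dict


def is_empty_rules_response(body: Dict[str, Any]) -> bool:
    """Return True when a rules response carries no rule groups.

    Assumes the Prometheus-compatible shape:
    {"status": "success", "data": {"groups": [...]}}
    """
    data = body.get("data") or {}
    return not data.get("groups")


def count_empty_responses(
    fetch_rules: Callable[[], Dict[str, Any]], attempts: int
) -> int:
    """Call fetch_rules `attempts` times and count the empty responses.

    fetch_rules is a hypothetical, caller-supplied function that returns
    the parsed JSON body of one request to the rules endpoint.
    """
    return sum(
        1 for _ in range(attempts) if is_empty_rules_response(fetch_rules())
    )
```

If requests are balanced across two querier pods and one of them cannot resolve its rulesAPIs, roughly half of the calls would be expected to count as empty, which would match alerts appearing and disappearing in the UI.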

Comment 4 Lili Cosic 2020-11-25 16:38:46 UTC

*** Bug 1901618 has been marked as a duplicate of this bug. ***

Comment 7 Andrew Pickering 2020-12-01 00:45:27 UTC

*** Bug 1870287 has been marked as a duplicate of this bug. ***

Comment 8 Junqi Zhao 2020-12-01 07:58:15 UTC

Tested with 4.7.0-0.nightly-2020-11-30-172451; the alerts cannot be found in the UI now (see the attached picture).
There are errors in the thanos-querier logs:
level=error ts=2020-12-01T07:53:22.984451181Z caller=query.go:394 msg="failed to resolve addresses for storeAPIs" err="lookup SRV records \"_grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local\": lookup _grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local on 172.30.0.10:53: dial udp 172.30.0.10:53: operation was canceled"
level=error ts=2020-12-01T07:53:22.984914317Z caller=query.go:397 msg="failed to resolve addresses for rulesAPIs" err="lookup SRV records \"_grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local\": lookup _grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local on 172.30.0.10:53: dial udp 172.30.0.10:53: operation was canceled"
level=warn ts=2020-12-01T07:53:24.466067505Z caller=storeset.go:456 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.129.2.25:10901: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 10.129.2.25:10901: connect: no route to host\"" address=10.129.2.25:10901
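
Log lines like the ones above are in logfmt (`key=value` pairs, with quoted values that may contain escaped quotes). As a rough sketch for filtering the attached thanos-querier logs for the DNS resolution errors, the Python below splits a line into fields with the standard-library `shlex` module. The function names `parse_logfmt` and `is_dns_resolution_error` are hypothetical helpers introduced here for illustration, and the parser is only assumed to cover lines shaped like the ones in this comment, not every logfmt edge case.

```python
import shlex
from typing import Dict


def parse_logfmt(line: str) -> Dict[str, str]:
    """Split one logfmt line into a dict of key -> value.

    shlex handles double-quoted values and backslash-escaped quotes,
    which is enough for lines shaped like the thanos-querier output.
    """
    fields: Dict[str, str] = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            fields[key] = value
    return fields


def is_dns_resolution_error(fields: Dict[str, str]) -> bool:
    """Return True for error-level 'failed to resolve addresses' lines."""
    return fields.get("level") == "error" and fields.get(
        "msg", ""
    ).startswith("failed to resolve addresses")
```

Applied to the three lines above, this would flag the two `query.go` errors (storeAPIs and rulesAPIs) and skip the `storeset.go` warning.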

Comment 9 Junqi Zhao 2020-12-01 07:59:13 UTC

Created attachment 1735129 [details]
no alert under "Monitoring -> Alerts"

Comment 10 Junqi Zhao 2020-12-01 08:00:18 UTC

Created attachment 1735130 [details]
thanos-querier logs

Comment 11 Junqi Zhao 2020-12-02 01:57:25 UTC

This also triggered the ThanosQueryHighDNSFailures alert: "description": "Thanos Query thanos-querier have 100% of failing DNS queries for store endpoints."

Comment 12 Junqi Zhao 2020-12-02 03:33:17 UTC

This is probably caused by bug 1903423; we need to fix that first.

Comment 13 Simon Pasquier 2020-12-02 09:14:27 UTC

*** Bug 1903423 has been marked as a duplicate of this bug. ***

Comment 16 Junqi Zhao 2020-12-04 02:53:36 UTC

Tested with 4.7.0-0.nightly-2020-12-03-205004 and monitored "Monitoring -> Alerts" for some time; the alerts are shown in the console.

Comment 20 errata-xmlrpc 2021-02-24 15:32:41 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0
security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 21 Simon Pasquier 2021-04-16 15:47:42 UTC

*** Bug 1913961 has been marked as a duplicate of this bug. ***