Description of problem:
Alerts that are firing stop showing up in the console after the cluster has been up for some time.

Version-Release number of selected component (if applicable):
4.7.0-0.ci-2020-11-12-053309 on GCP.

How reproducible:
Unknown; no other cluster has been launched yet.

Steps to Reproduce:
1. Launch a 4.7 cluster and see an alert firing.
2. The alert stops showing up in the console after some time.

Actual results:

Expected results:
All firing alerts are shown in the console, regardless of which Prometheus instance they fire in.

Additional info:
Seems to be similar to https://bugzilla.redhat.com/show_bug.cgi?id=1870287

One of the Thanos pods was hitting `failed to resolve addresses for rulesAPIs` errors. This caused the response from the `/rules` endpoint to sometimes be empty (when the call hit the failing pod).
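One way to confirm the intermittent behaviour is to query the rules endpoint several times and count how many responses come back empty. The sketch below only covers the counting step and assumes the endpoint returns the Prometheus-style JSON shape (`status`, `data.groups`); the helper names `count_rule_groups` and `summarize` are hypothetical, and fetching the response bodies (URL, token, route) is left out because it depends on the cluster.

```python
import json
from collections import Counter


def count_rule_groups(body: str) -> int:
    """Return the number of rule groups in one rules API response body.

    Assumes the Prometheus-compatible shape:
    {"status": "success", "data": {"groups": [...]}}
    """
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"unexpected status: {payload.get('status')!r}")
    data = payload.get("data") or {}
    return len(data.get("groups") or [])


def summarize(bodies) -> dict:
    """Count how many responses were empty versus non-empty."""
    counts = Counter()
    for body in bodies:
        key = "empty" if count_rule_groups(body) == 0 else "non_empty"
        counts[key] += 1
    return dict(counts)
```

If repeated requests give a mix of empty and non-empty responses, that is consistent with only some of the querier pods failing, as described above.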
*** Bug 1901618 has been marked as a duplicate of this bug. ***
*** Bug 1870287 has been marked as a duplicate of this bug. ***
Tested with 4.7.0-0.nightly-2020-11-30-172451; the alerts cannot be found in the UI now (see the attached picture). There are errors in the thanos-querier logs:
level=error ts=2020-12-01T07:53:22.984451181Z caller=query.go:394 msg="failed to resolve addresses for storeAPIs" err="lookup SRV records \"_grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local\": lookup _grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local on 172.30.0.10:53: dial udp 172.30.0.10:53: operation was canceled"
level=error ts=2020-12-01T07:53:22.984914317Z caller=query.go:397 msg="failed to resolve addresses for rulesAPIs" err="lookup SRV records \"_grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local\": lookup _grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local on 172.30.0.10:53: dial udp 172.30.0.10:53: operation was canceled"
level=warn ts=2020-12-01T07:53:24.466067505Z caller=storeset.go:456 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.129.2.25:10901: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 10.129.2.25:10901: connect: no route to host\"" address=10.129.2.25:10901
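For sifting through longer thanos-querier logs, a rough sketch like the following can pull the key=value fields out of each line. It assumes the lines follow the logfmt style seen above (double-quoted values with backslash-escaped inner quotes) and relies on `shlex.split` for the quoting; `parse_logfmt` and `is_resolve_failure` are hypothetical helper names, not part of Thanos.

```python
import shlex


def parse_logfmt(line: str) -> dict:
    """Split one logfmt-style log line into a dict of fields.

    shlex handles double-quoted values and backslash-escaped quotes;
    tokens without '=' are ignored.
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def is_resolve_failure(fields: dict) -> bool:
    """True for the 'failed to resolve addresses for ...' errors."""
    return (
        fields.get("level") == "error"
        and fields.get("msg", "").startswith("failed to resolve addresses")
    )
```

Filtering a saved log file with `is_resolve_failure(parse_logfmt(line))` would then show how often the storeAPIs and rulesAPIs lookups fail.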
Created attachment 1735129: no alert under "Monitoring -> Alerts"
Created attachment 1735130: thanos-querier logs
This also triggered the ThanosQueryHighDNSFailures alert: "description": "Thanos Query thanos-querier have 100% of failing DNS queries for store endpoints."
This is probably caused by bug 1903423; we need to fix that first.
*** Bug 1903423 has been marked as a duplicate of this bug. ***
Tested with 4.7.0-0.nightly-2020-12-03-205004 and monitored the alerts under "Monitoring -> Alerts" for some time; the alerts are shown in the console.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
*** Bug 1913961 has been marked as a duplicate of this bug. ***