Bug 1897252
Summary: | Firing alerts are not showing up in console UI after cluster is up for some time | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Lili Cosic <lcosic>
Component: | Monitoring | Assignee: | Lili Cosic <lcosic>
Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao>
Severity: | medium | Docs Contact: |
Priority: | unspecified | |
Version: | 4.7 | CC: | alegrand, anpicker, cruhm, dtaylor, erooth, juzhao, kakkoyun, lcosic, lsm5, pkrupa, surbania
Target Milestone: | --- | |
Target Release: | 4.7.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-02-24 15:32:41 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1885946 | |
Attachments: | | |
Description
Lili Cosic
2020-11-12 15:56:17 UTC
Seems to be similar to https://bugzilla.redhat.com/show_bug.cgi?id=1870287

One of the Thanos pods was hitting `failed to resolve addresses for rulesAPIs` errors. This caused the response from the `/rules` endpoint to sometimes be empty (when the call hit the failing pod).

*** Bug 1901618 has been marked as a duplicate of this bug. ***

*** Bug 1870287 has been marked as a duplicate of this bug. ***

Tested with 4.7.0-0.nightly-2020-11-30-172451; the alerts cannot be found in the UI now (see the attached picture). There are errors in thanos-querier:

level=error ts=2020-12-01T07:53:22.984451181Z caller=query.go:394 msg="failed to resolve addresses for storeAPIs" err="lookup SRV records \"_grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local\": lookup _grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local on 172.30.0.10:53: dial udp 172.30.0.10:53: operation was canceled"
level=error ts=2020-12-01T07:53:22.984914317Z caller=query.go:397 msg="failed to resolve addresses for rulesAPIs" err="lookup SRV records \"_grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local\": lookup _grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local on 172.30.0.10:53: dial udp 172.30.0.10:53: operation was canceled"
level=warn ts=2020-12-01T07:53:24.466067505Z caller=storeset.go:456 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.129.2.25:10901: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 10.129.2.25:10901: connect: no route to host\"" address=10.129.2.25:10901

Created attachment 1735129
no alert under "Monitoring -> Alerts"

Created attachment 1735130
thanos-querier logs
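
When triaging logs like the thanos-querier output quoted in this report, it can help to tally the `failed to resolve addresses for ...` errors per API type. The following is a minimal sketch, not part of the original report or of Thanos itself: it assumes the log lines are in logfmt form as quoted here, and the helper names `parse_logfmt` and `count_resolve_failures` are hypothetical.

```python
import re
from collections import Counter

# One logfmt pair: key=value, where the value is either a double-quoted
# string (backslash escapes allowed) or a run of non-space characters.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line):
    """Parse one logfmt line into a dict of key -> unquoted value (hypothetical helper)."""
    fields = {}
    for key, raw in _PAIR.findall(line):
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            # Drop the surrounding quotes and undo backslash escapes such as \".
            raw = re.sub(r'\\(.)', r'\1', raw[1:-1])
        fields[key] = raw
    return fields


def count_resolve_failures(lines):
    """Count error-level 'failed to resolve addresses for X' messages, keyed by X."""
    prefix = "failed to resolve addresses for "
    counts = Counter()
    for line in lines:
        fields = parse_logfmt(line)
        msg = fields.get("msg", "")
        if fields.get("level") == "error" and msg.startswith(prefix):
            counts[msg[len(prefix):]] += 1
    return counts


# The two error lines quoted in this report.
SAMPLE = [
    r'level=error ts=2020-12-01T07:53:22.984451181Z caller=query.go:394 msg="failed to resolve addresses for storeAPIs" err="lookup SRV records \"_grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local\": lookup _grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local on 172.30.0.10:53: dial udp 172.30.0.10:53: operation was canceled"',
    r'level=error ts=2020-12-01T07:53:22.984914317Z caller=query.go:397 msg="failed to resolve addresses for rulesAPIs" err="lookup SRV records \"_grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local\": lookup _grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local on 172.30.0.10:53: dial udp 172.30.0.10:53: operation was canceled"',
]

if __name__ == "__main__":
    for api, n in sorted(count_resolve_failures(SAMPLE).items()):
        print(api, n)
```

Run against the saved logs of each thanos-querier pod separately, a tally like this would show whether only one pod is failing to resolve `rulesAPIs`, which is the situation the description points to.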
This also triggered the ThanosQueryHighDNSFailures alert: "description": "Thanos Query thanos-querier have 100% of failing DNS queries for store endpoints." It is likely caused by bug 1903423; we need to fix that first.

*** Bug 1903423 has been marked as a duplicate of this bug. ***

Tested with 4.7.0-0.nightly-2020-12-03-205004 and monitored the alerts under "Monitoring -> Alerts" for some time; the alerts are visible in the console.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633

*** Bug 1913961 has been marked as a duplicate of this bug. ***
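
Because the `/rules` response was empty only when a request happened to reach the failing pod, a single check can miss the problem; the verification above watched the alerts for some time for the same reason. The following is a minimal sketch of sampling the endpoint repeatedly and reporting how often it comes back empty. It assumes the response body follows the Prometheus-style rules JSON shape (`data.groups[].rules[]`); the function names and the injectable `fetch` callable are hypothetical, and how the body is actually fetched (URL, authentication) is left to the caller.

```python
import json
from typing import Callable


def rule_count(payload: str) -> int:
    """Count rules in a Prometheus-style rules response body (assumed shape)."""
    doc = json.loads(payload)
    groups = (doc.get("data") or {}).get("groups") or []
    return sum(len(group.get("rules") or []) for group in groups)


def empty_response_ratio(fetch: Callable[[], str], attempts: int = 20) -> float:
    """Call fetch() `attempts` times and return the fraction of rule-less responses."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    empty = sum(1 for _ in range(attempts) if rule_count(fetch()) == 0)
    return empty / attempts


if __name__ == "__main__":
    # Simulated backend: requests alternate between a healthy pod and a
    # failing pod that returns no rule groups.
    healthy = '{"status":"success","data":{"groups":[{"name":"g","rules":[{"name":"Watchdog"}]}]}}'
    failing = '{"status":"success","data":{"groups":[]}}'
    bodies = iter([healthy, failing] * 5)
    print(empty_response_ratio(lambda: next(bodies), attempts=10))
```

A ratio well above zero on a cluster that has firing alerts would be consistent with one querier replica failing to resolve its rules endpoints, while a ratio of zero over many attempts matches the fixed behavior.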