Bug 2068973 - etcdHighNumberOfFailedGRPCRequests critical alert firing often for etcdserverpb.Watch gRPC service
Summary: etcdHighNumberOfFailedGRPCRequests critical alert firing often for etcdserverpb.Watch gRPC service
Keywords:
Status: CLOSED DUPLICATE of bug 2095579
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.10
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Dean West
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-28 00:02 UTC by Ravi Trivedi
Modified: 2023-09-15 01:53 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-06-30 08:07:23 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Knowledge Base (Solution) 6964544 (Last Updated: 2022-06-23 09:59:05 UTC)

Description Ravi Trivedi 2022-03-28 00:02:03 UTC
Description of problem:

As part of PR https://github.com/openshift/cluster-etcd-operator/pull/654, etcdHighNumberOfFailedGRPCRequests was reintroduced for non-metal IPI clusters in 4.10. For SRE-managed clusters, this alert has been noise: the critical alert fires and then resolves on its own within 5-10 minutes.
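
For reference, the failure ratio that presumably backs this alert can be queried directly against the cluster Prometheus. The sketch below is illustrative only: the route and bearer token are placeholders, and the PromQL expression approximates a per-method non-OK failure percentage rather than reproducing the exact shipped rule.

import requests

# Placeholder route and token; substitute values for the cluster under test.
PROM_URL = "https://prometheus-k8s-openshift-monitoring.apps.example.com"
TOKEN = "sha256~REDACTED"

# Approximation of the ratio the alert is based on: the percentage of
# non-OK gRPC responses per etcd service/method over the last 5 minutes.
QUERY = """
100 * sum by (grpc_service, grpc_method) (
  rate(grpc_server_handled_total{job="etcd", grpc_code!="OK"}[5m])
)
/
sum by (grpc_service, grpc_method) (
  rate(grpc_server_handled_total{job="etcd"}[5m])
)
"""

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": QUERY},
    headers={"Authorization": f"Bearer {TOKEN}"},
    verify=False,  # many lab clusters use a self-signed router certificate
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    pct = float(result["value"][1])
    print(f'{labels.get("grpc_service")}/{labels.get("grpc_method")}: {pct:.2f}% failed')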

Some of the labels for these alerts are the following:

Labels:
 - alertname = etcdHighNumberOfFailedGRPCRequests
 - endpoint = etcd-metrics
 - grpc_method = Watch
 - grpc_service = etcdserverpb.Watch
 - job = etcd
 - namespace = openshift-etcd
 - openshift_io_alert_source = platform
 - prometheus = openshift-monitoring/k8s
 - service = etcd
 - severity = critical

Notably, the gRPC service the alert fires for most often is 'etcdserverpb.Watch'.
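
One way to see what is actually behind the Watch failures is to break the same metric down by grpc_code, since client-initiated stream teardown can surface as non-OK codes even when etcd itself is healthy. A minimal sketch, under the same placeholder endpoint and token assumptions as above:

import requests

PROM_URL = "https://prometheus-k8s-openshift-monitoring.apps.example.com"  # placeholder
TOKEN = "sha256~REDACTED"  # placeholder

# Break the failed Watch requests down by gRPC status code to see whether
# codes associated with client-side stream teardown dominate.
QUERY = (
    'sum by (grpc_code) ('
    'rate(grpc_server_handled_total{job="etcd",'
    'grpc_service="etcdserverpb.Watch",grpc_code!="OK"}[5m]))'
)

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": QUERY},
    headers={"Authorization": f"Bearer {TOKEN}"},
    verify=False,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"].get("grpc_code"), result["value"][1])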

Following the runbook https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighNumberOfFailedGRPCRequests.md has not helped resolve the issue.


Version-Release number of selected component (if applicable):
4.10

Actual results:
- The etcdHighNumberOfFailedGRPCRequests critical alert is quite noisy and does not appear actionable by the end user.

Expected results:
- The etcdHighNumberOfFailedGRPCRequests critical alert should only fire for a valid cause that is actionable by the end user.


Additional info:
- https://bugzilla.redhat.com/show_bug.cgi?id=1701154 is a long-running Bugzilla for the same alert and is also targeted for 4.10.
- CPU and memory usage across the 3 control plane nodes remained fairly constant over the timespan in which the alert fired multiple times.
- The single cluster under consideration has fired the critical alert about 25 times (a way to cross-check this count is sketched after this list).
- The cluster is AWS IPI 4.10.3.
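
One way to cross-check that firing count from Prometheus itself: ALERTS_FOR_STATE carries each pending/firing alert's activation timestamp as its sample value, so counting how often that value changed over a window approximates the number of distinct activations. A minimal sketch, under the same placeholder endpoint and token assumptions as above:

import requests

PROM_URL = "https://prometheus-k8s-openshift-monitoring.apps.example.com"  # placeholder
TOKEN = "sha256~REDACTED"  # placeholder

# ALERTS_FOR_STATE stores the alert's activation timestamp as its value, so
# the number of value changes over a window roughly counts distinct firings.
QUERY = (
    'changes(ALERTS_FOR_STATE{'
    'alertname="etcdHighNumberOfFailedGRPCRequests"}[30d])'
)

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": QUERY},
    headers={"Authorization": f"Bearer {TOKEN}"},
    verify=False,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"], "approx. firings:", result["value"][1])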

Comment 11 Red Hat Bugzilla 2023-09-15 01:53:19 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days

