Bug 2068973 - etcdHighNumberOfFailedGRPCRequests critical alert firing often for etcdserverpb.Watch gRPC service [NEEDINFO]
Status: CLOSED DUPLICATE of bug 2095579
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.10
Hardware: x86_64
OS: Linux
Target Milestone: ---
Assignee: Dean West
QA Contact: ge liu
Depends On:
Reported: 2022-03-28 00:02 UTC by Ravi Trivedi
Modified: 2022-07-15 02:09 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2022-06-30 08:07:23 UTC
Target Upstream Version:
travi: needinfo? (dwest)
tjungblu: needinfo? (geliu)

System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 6964544 | 0 | None | None | None | 2022-06-23 09:59:05 UTC

Description Ravi Trivedi 2022-03-28 00:02:03 UTC
Description of problem:

As part of PR https://github.com/openshift/cluster-etcd-operator/pull/654, etcdHighNumberOfFailedGRPCRequests was reintroduced for non-metal IPI clusters in 4.10. For SRE-managed clusters, this alert has been noisy: the critical alert fires and then resolves by itself within 5-10 minutes.

Some of the labels for these alerts are the following:

 - alertname = etcdHighNumberOfFailedGRPCRequests
 - endpoint = etcd-metrics
 - grpc_method = Watch
 - grpc_service = etcdserverpb.Watch
 - job = etcd
 - namespace = openshift-etcd
 - openshift_io_alert_source = platform
 - prometheus = openshift-monitoring/k8s
 - service = etcd
 - severity = critical

Notably, the gRPC service that the alert fires for most often is 'etcdserverpb.Watch'.
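
For anyone trying to see how often the alert fires for this particular service, the labels above can be turned into a selector on Prometheus's built-in `ALERTS` series. The following is only a minimal sketch: the helper names (`escape_label_value`, `alerts_selector`) are hypothetical, only a subset of the reported labels is used, and `alertstate="firing"` is added on the assumption that pending alerts are not of interest.

```python
from typing import Mapping

# Subset of the labels reported in this bug. alertstate is not part of the
# report; it is the label Prometheus sets on the synthetic ALERTS series.
ALERT_LABELS = {
    "alertname": "etcdHighNumberOfFailedGRPCRequests",
    "grpc_method": "Watch",
    "grpc_service": "etcdserverpb.Watch",
    "job": "etcd",
    "namespace": "openshift-etcd",
    "severity": "critical",
    "alertstate": "firing",
}


def escape_label_value(value: str) -> str:
    """Escape backslashes, double quotes and newlines for a PromQL string."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def alerts_selector(labels: Mapping[str, str]) -> str:
    """Build an ALERTS{...} selector with equality matchers, sorted by label name."""
    matchers = ",".join(
        f'{name}="{escape_label_value(value)}"'
        for name, value in sorted(labels.items())
    )
    return f"ALERTS{{{matchers}}}"


if __name__ == "__main__":
    # Paste the printed selector into the Prometheus or console query UI.
    print(alerts_selector(ALERT_LABELS))
```

Dropping `grpc_service` and `grpc_method` from the dictionary gives a selector covering every gRPC service, which makes it easier to check whether Watch really dominates.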

Following the runbook https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighNumberOfFailedGRPCRequests.md has not helped resolve the issue.

Version-Release number of selected component (if applicable):

Actual results:
- etcdHighNumberOfFailedGRPCRequests critical alert is quite noisy and does not seem actionable by the end user.

Expected results:
- etcdHighNumberOfFailedGRPCRequests critical alert should fire for a valid cause that is actionable by the end user.

Additional info:
- https://bugzilla.redhat.com/show_bug.cgi?id=1701154 is a long-running Bugzilla for the same alert and is also targeted for 4.10.
- CPU and memory usage across the 3 control plane nodes remained fairly constant over the time span in which the alert fired multiple times.
- The single cluster under consideration has fired the critical alert about 25 times.
- Cluster is AWS IPI 4.10.3.
