Bug 2068973

Summary: etcdHighNumberOfFailedGRPCRequests critical alert firing often for etcdserverpb.Watch gRPC service
Product: OpenShift Container Platform
Reporter: Ravi Trivedi <travi>
Component: Etcd
Assignee: Dean West <dwest>
Status: CLOSED DUPLICATE
QA Contact: ge liu <geliu>
Severity: high
Priority: high
Version: 4.10
CC: dwest, geliu, sreber, tjungblu, wlewis
Keywords: ServiceDeliveryImpact
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Last Closed: 2022-06-30 08:07:23 UTC
Type: Bug

Description Ravi Trivedi 2022-03-28 00:02:03 UTC
Description of problem:

As part of PR https://github.com/openshift/cluster-etcd-operator/pull/654, etcdHighNumberOfFailedGRPCRequests was reintroduced for non-metal IPI clusters in 4.10. For SRE-managed clusters this alert has been pure noise: the critical alert fires and then resolves on its own within 5-10 minutes.
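For reference, the rule appears to evaluate a failure-ratio expression in the shape of the upstream etcd mixin. The snippet below is only a sketch that reproduces that kind of query against the in-cluster Prometheus so the numbers behind the alert can be inspected directly; PROM_URL and PROM_TOKEN are placeholders, and the exact expression shipped by cluster-etcd-operator may differ:

# Sketch only: evaluate an approximation of the alert's failure-ratio query.
# PROM_URL / PROM_TOKEN are placeholders for however Prometheus is reached
# (port-forward, route, etc.); the shipped rule may differ from this
# upstream-mixin-style expression.
import os
import requests

PROM_URL = os.environ.get("PROM_URL", "http://localhost:9090")
PROM_TOKEN = os.environ.get("PROM_TOKEN", "")

# Percentage of failed etcd gRPC requests per service/method over 5 minutes.
QUERY = """
100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*",
      grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m]))
    without (grpc_type, grpc_code)
  /
sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m]))
    without (grpc_type, grpc_code)
"""

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": QUERY},
    headers={"Authorization": f"Bearer {PROM_TOKEN}"} if PROM_TOKEN else {},
)
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    m = sample["metric"]
    print(f'{m.get("grpc_service")}/{m.get("grpc_method")} on {m.get("instance")}: '
          f'{float(sample["value"][1]):.2f}% failed')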

Some of the labels for these alerts are the following:

Labels:
 - alertname = etcdHighNumberOfFailedGRPCRequests
 - endpoint = etcd-metrics
 - grpc_method = Watch
 - grpc_service = etcdserverpb.Watch
 - job = etcd
 - namespace = openshift-etcd
 - openshift_io_alert_source = platform
 - prometheus = openshift-monitoring/k8s
 - service = etcd
 - severity = critical

To highlight: the gRPC service the alert fires for most often is 'etcdserverpb.Watch'.

Following the runbook https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighNumberOfFailedGRPCRequests.md has not helped resolve the issue.
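One extra triage step beyond the runbook is to break the handled Watch RPCs down by grpc_code, to see whether the counted failures are genuine server-side errors or expected stream terminations. A minimal sketch, with the same placeholder PROM_URL/PROM_TOKEN assumptions as above:

# Sketch only: per-code rate of handled etcdserverpb.Watch RPCs over 5 minutes.
import os
import requests

PROM_URL = os.environ.get("PROM_URL", "http://localhost:9090")
PROM_TOKEN = os.environ.get("PROM_TOKEN", "")

QUERY = (
    'sum by (grpc_code) ('
    '  rate(grpc_server_handled_total{job=~".*etcd.*",'
    '       grpc_service="etcdserverpb.Watch",grpc_method="Watch"}[5m]))'
)

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": QUERY},
    headers={"Authorization": f"Bearer {PROM_TOKEN}"} if PROM_TOKEN else {},
)
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    print(f'{sample["metric"].get("grpc_code", "?")}: '
          f'{float(sample["value"][1]):.4f} req/s')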


Version-Release number of selected component (if applicable):
4.10

Actual results:
- The etcdHighNumberOfFailedGRPCRequests critical alert is quite noisy and does not appear to be actionable by the end user.

Expected results:
- The etcdHighNumberOfFailedGRPCRequests critical alert should only fire for a valid cause that is actionable by the end user.


Additional info:
- https://bugzilla.redhat.com/show_bug.cgi?id=1701154 is a long-running Bugzilla for the same alert and is also targeted at 4.10.
- CPU and memory usage across the 3 control plane nodes remained fairly constant over the timespan in which the alert fired multiple times.
- The single cluster under consideration has fired the critical alert about 25 times.
- Cluster is AWS IPI 4.10.3.

Comment 11 Red Hat Bugzilla 2023-09-15 01:53:19 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days