Bug 2068973

Summary: etcdHighNumberOfFailedGRPCRequests critical alert firing often for etcdserverpb.Watch gRPC service
Product: OpenShift Container Platform
Reporter: Ravi Trivedi <travi>
Component: Etcd
Assignee: Dean West <dwest>
Status: CLOSED DUPLICATE
QA Contact: ge liu <geliu>
Severity: high
Priority: high
Version: 4.10
CC: dwest, geliu, sreber, tjungblu, wlewis
Keywords: ServiceDeliveryImpact
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Last Closed: 2022-06-30 08:07:23 UTC
Type: Bug

Description Ravi Trivedi 2022-03-28 00:02:03 UTC
Description of problem:

As part of PR https://github.com/openshift/cluster-etcd-operator/pull/654, etcdHighNumberOfFailedGRPCRequests was reintroduced for non-metal IPI clusters in 4.10. For SRE-managed clusters this alert has been pure noise: the critical alert fires and then resolves on its own within 5-10 minutes.
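For reference, the rule appears to evaluate a failure-ratio expression in the shape of the upstream etcd mixin. The snippet below is only a sketch that reproduces that kind of query against the in-cluster Prometheus so the numbers behind the alert can be inspected directly; PROM_URL and PROM_TOKEN are placeholders, and the exact expression shipped by cluster-etcd-operator may differ:

# Sketch only: evaluate an approximation of the alert's failure-ratio query.
# PROM_URL / PROM_TOKEN are placeholders for however Prometheus is reached
# (port-forward, route, etc.); the shipped rule may differ from this
# upstream-mixin-style expression.
import os
import requests

PROM_URL = os.environ.get("PROM_URL", "http://localhost:9090")
PROM_TOKEN = os.environ.get("PROM_TOKEN", "")

# Percentage of failed etcd gRPC requests per service/method over 5 minutes.
QUERY = """
100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*",
      grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m]))
    without (grpc_type, grpc_code)
  /
sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m]))
    without (grpc_type, grpc_code)
"""

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": QUERY},
    headers={"Authorization": f"Bearer {PROM_TOKEN}"} if PROM_TOKEN else {},
)
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    m = sample["metric"]
    print(f'{m.get("grpc_service")}/{m.get("grpc_method")} on {m.get("instance")}: '
          f'{float(sample["value"][1]):.2f}% failed')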

Some of the labels for these alerts are the following:

Labels:
 - alertname = etcdHighNumberOfFailedGRPCRequests
 - endpoint = etcd-metrics
 - grpc_method = Watch
 - grpc_service = etcdserverpb.Watch
 - job = etcd
 - namespace = openshift-etcd
 - openshift_io_alert_source = platform
 - prometheus = openshift-monitoring/k8s
 - service = etcd
 - severity = critical

To highlight: the gRPC service the alert fires for most often is 'etcdserverpb.Watch'.

Following the runbook https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighNumberOfFailedGRPCRequests.md has not helped resolve the issue.
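One extra triage step beyond the runbook is to break the handled Watch RPCs down by grpc_code, to see whether the counted failures are genuine server-side errors or expected stream terminations. A minimal sketch, with the same placeholder PROM_URL/PROM_TOKEN assumptions as above:

# Sketch only: per-code rate of handled etcdserverpb.Watch RPCs over 5 minutes.
import os
import requests

PROM_URL = os.environ.get("PROM_URL", "http://localhost:9090")
PROM_TOKEN = os.environ.get("PROM_TOKEN", "")

QUERY = (
    'sum by (grpc_code) ('
    '  rate(grpc_server_handled_total{job=~".*etcd.*",'
    '       grpc_service="etcdserverpb.Watch",grpc_method="Watch"}[5m]))'
)

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": QUERY},
    headers={"Authorization": f"Bearer {PROM_TOKEN}"} if PROM_TOKEN else {},
)
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    print(f'{sample["metric"].get("grpc_code", "?")}: '
          f'{float(sample["value"][1]):.4f} req/s')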


Version-Release number of selected component (if applicable):
4.10

Actual results:
- The etcdHighNumberOfFailedGRPCRequests critical alert is quite noisy and does not appear to be actionable by the end user.

Expected results:
- The etcdHighNumberOfFailedGRPCRequests critical alert should only fire for a valid cause that is actionable by the end user.


Additional info:
- https://bugzilla.redhat.com/show_bug.cgi?id=1701154 is a long-running Bugzilla for the same alert and is also targeted at 4.10.
- CPU and memory usage across the 3 control plane nodes remained fairly constant over the timespan in which the alert fired multiple times.
- The single cluster under consideration has fired the critical alert about 25 times.
- Cluster is AWS IPI 4.10.3.

Comment 11 Red Hat Bugzilla 2023-09-15 01:53:19 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days