2068973 – etcdHighNumberOfFailedGRPCRequests critical alert firing often for etcdserverpb.Watch gRPC service

Bug 2068973 - etcdHighNumberOfFailedGRPCRequests critical alert firing often for etcdserverpb.Watch gRPC service

Summary: etcdHighNumberOfFailedGRPCRequests critical alert firing often for etcdserver...

Keywords:
Status:	CLOSED DUPLICATE of bug 2095579
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.10
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Dean West
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2022-03-28 00:02 UTC by Ravi Trivedi
Modified:	2023-09-15 01:53 UTC (History)
CC List:	5 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-06-30 08:07:23 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	6964544	0	None	None	None	2022-06-23 09:59:05 UTC

Description Ravi Trivedi 2022-03-28 00:02:03 UTC

Description of problem:

As part of PR https://github.com/openshift/cluster-etcd-operator/pull/654, etcdHighNumberOfFailedGRPCRequests was reintroduced for non metal IPI clusters in 4.10. For SRE managed clusters, this alert has been noise where the critical alert fires and resolves in 5-10 minutes by itself.

Some of the labels for these alerts are the following:

Labels:
 - alertname = etcdHighNumberOfFailedGRPCRequests
 - endpoint = etcd-metrics
 - grpc_method = Watch
 - grpc_service = etcdserverpb.Watch
 - job = etcd
 - namespace = openshift-etcd
 - openshift_io_alert_source = platform
 - prometheus = openshift-monitoring/k8s
 - service = etcd
 - severity = critical

To highlight, the gRPC service that the alert fires the most for is 'etcdserverpb.Watch'. 

Following runbook https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighNumberOfFailedGRPCRequests.md hasn't helped resolve the issue.


Version-Release number of selected component (if applicable):
4.10

Actual results:
- etcdHighNumberOfFailedGRPCRequests critical alert is quite noisy which doesn't seem actionable by end user. 

Expected results:
- etcdHighNumberOfFailedGRPCRequests critical alert should fire with valid cause that is actionable by end user.


Additional info:
- https://bugzilla.redhat.com/show_bug.cgi?id=1701154 is long running bugzilla for same alert and is targeted for 4.10 as well.
- The CPU and memory usage across the 3 control plane nodes has remained fairly constant for considering timespan when the alert fired multiple times.
- The single cluster under consideration has fired the critical alert about 25 times.
- Cluster is AWS IPI 4.10.3.

Comment 11 Red Hat Bugzilla 2023-09-15 01:53:19 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days

Note You need to log in before you can comment on or make changes to this bug.