Description of problem:
As part of PR https://github.com/openshift/cluster-etcd-operator/pull/654, etcdHighNumberOfFailedGRPCRequests was reintroduced for non-metal IPI clusters in 4.10. On SRE-managed clusters this alert has been noise: the critical alert fires and then resolves on its own within 5-10 minutes.
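For context, the alerting expression follows the upstream etcd mixin and looks roughly like the sketch below; the failure-code regex and the 5% threshold here are taken from the upstream mixin, and the exact selector and threshold shipped by cluster-etcd-operator in 4.10 may differ.

    # Rough shape of the etcdHighNumberOfFailedGRPCRequests (critical) rule,
    # per the upstream etcd mixin; not the verbatim 4.10 rule.
    100 * sum without (grpc_type, grpc_code) (
      rate(grpc_server_handled_total{job="etcd",
        grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])
    )
    /
    sum without (grpc_type, grpc_code) (
      rate(grpc_server_handled_total{job="etcd"}[5m])
    )
    > 5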
Some of the labels for these alerts are the following:
- alertname = etcdHighNumberOfFailedGRPCRequests
- endpoint = etcd-metrics
- grpc_method = Watch
- grpc_service = etcdserverpb.Watch
- job = etcd
- namespace = openshift-etcd
- openshift_io_alert_source = platform
- prometheus = openshift-monitoring/k8s
- service = etcd
- severity = critical
Notably, the gRPC service the alert fires for most often is 'etcdserverpb.Watch'.
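To see which gRPC status codes and etcd members are driving the Watch failures, a breakdown along the following lines can be run against the in-cluster Prometheus (a sketch; the job="etcd" selector matches the labels listed above).

    # Hypothetical diagnostic query: per-code, per-member failure rate for
    # the Watch service.
    sum by (grpc_code, instance) (
      rate(grpc_server_handled_total{job="etcd",
        grpc_service="etcdserverpb.Watch", grpc_code!="OK"}[5m])
    )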
Following the runbook at https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighNumberOfFailedGRPCRequests.md has not helped resolve the issue.
Version-Release number of selected component (if applicable):
- The cluster is AWS IPI 4.10.3.

Actual results:
- The etcdHighNumberOfFailedGRPCRequests critical alert is quite noisy and does not appear to be actionable by the end user.
- The single cluster under consideration has fired the critical alert about 25 times.
- CPU and memory usage across the 3 control plane nodes remained fairly constant over the timespan under consideration, during which the alert fired multiple times (see the query sketches at the end of this report).

Expected results:
- The etcdHighNumberOfFailedGRPCRequests critical alert should fire only with a valid cause that is actionable by the end user.

Additional info:
- https://bugzilla.redhat.com/show_bug.cgi?id=1701154 is a long-running Bugzilla for the same alert and is also targeted for 4.10.
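For reference, the firing count and the control plane resource usage noted above were checked with queries along these lines. This is a sketch: ALERTS_FOR_STATE is a built-in Prometheus series, but the 7d range is an assumption, and the node_exporter queries are not filtered down to the three control plane nodes.

    # Approximate number of distinct firings over the last week.
    # ALERTS_FOR_STATE holds the timestamp at which the alert became
    # active, so each new firing changes its value.
    changes(ALERTS_FOR_STATE{alertname="etcdHighNumberOfFailedGRPCRequests"}[7d])

    # Run separately: non-idle CPU rate and available memory per node.
    # Restricting these to the control plane nodes is left out here.
    sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
    node_memory_MemAvailable_bytes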