Description of problem:
As part of PR https://github.com/openshift/cluster-etcd-operator/pull/654, etcdHighNumberOfFailedGRPCRequests was reintroduced for non-metal IPI clusters in 4.10. On SRE-managed clusters this alert has been noise: the critical alert fires and then resolves on its own within 5-10 minutes.
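For context, the alerting expression follows the upstream etcd mixin and looks roughly like the sketch below; the failure-code regex and the 5% threshold here are taken from the upstream mixin, and the exact selector and threshold shipped by cluster-etcd-operator in 4.10 may differ.

    # Rough shape of the etcdHighNumberOfFailedGRPCRequests (critical) rule,
    # per the upstream etcd mixin; not the verbatim 4.10 rule.
    100 * sum without (grpc_type, grpc_code) (
      rate(grpc_server_handled_total{job="etcd",
        grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])
    )
    /
    sum without (grpc_type, grpc_code) (
      rate(grpc_server_handled_total{job="etcd"}[5m])
    )
    > 5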
Some of the labels for these alerts are the following:
- alertname = etcdHighNumberOfFailedGRPCRequests
- endpoint = etcd-metrics
- grpc_method = Watch
- grpc_service = etcdserverpb.Watch
- job = etcd
- namespace = openshift-etcd
- openshift_io_alert_source = platform
- prometheus = openshift-monitoring/k8s
- service = etcd
- severity = critical
Notably, the gRPC service the alert fires for most often is 'etcdserverpb.Watch'.
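To see which gRPC status codes and etcd members are driving the Watch failures, a breakdown along the following lines can be run against the in-cluster Prometheus (a sketch; the job="etcd" selector matches the labels listed above).

    # Hypothetical diagnostic query: per-code, per-member failure rate for
    # the Watch service.
    sum by (grpc_code, instance) (
      rate(grpc_server_handled_total{job="etcd",
        grpc_service="etcdserverpb.Watch", grpc_code!="OK"}[5m])
    )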
Following the runbook at https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighNumberOfFailedGRPCRequests.md has not helped resolve the issue.
Version-Release number of selected component (if applicable):
- The cluster is AWS IPI 4.10.3.

Actual results:
- The etcdHighNumberOfFailedGRPCRequests critical alert is quite noisy and does not appear to be actionable by the end user.
- The single cluster under consideration has fired the critical alert about 25 times.
- CPU and memory usage across the 3 control plane nodes remained fairly constant over the timespan under consideration, during which the alert fired multiple times (see the query sketches at the end of this report).

Expected results:
- The etcdHighNumberOfFailedGRPCRequests critical alert should fire only with a valid cause that is actionable by the end user.

Additional info:
- https://bugzilla.redhat.com/show_bug.cgi?id=1701154 is a long-running Bugzilla for the same alert and is also targeted for 4.10.
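For reference, the firing count and the control plane resource usage noted above were checked with queries along these lines. This is a sketch: ALERTS_FOR_STATE is a built-in Prometheus series, but the 7d range is an assumption, and the node_exporter queries are not filtered down to the three control plane nodes.

    # Approximate number of distinct firings over the last week.
    # ALERTS_FOR_STATE holds the timestamp at which the alert became
    # active, so each new firing changes its value.
    changes(ALERTS_FOR_STATE{alertname="etcdHighNumberOfFailedGRPCRequests"}[7d])

    # Run separately: non-idle CPU rate and available memory per node.
    # Restricting these to the control plane nodes is left out here.
    sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
    node_memory_MemAvailable_bytes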