Description of problem:
https://search.ci.openshift.org/?search=etcdHighNumberOfFailedGRPCRequests&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

The etcdHighNumberOfFailedGRPCRequests alert was reverted because it started to fire in this environment. Investigation showed failed gRPC requests to etcd. We suspect a network issue may be making requests too slow, but someone needs to dig into it further. It may be worth looking through the sosreport and similar data for dropped packets, MTU issues, or anything else that stands out.

Version-Release number of selected component (if applicable):
4.9 CI

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
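For whoever picks up the network side, a minimal sketch of the node-level checks suggested above (dropped packets, MTU). The node and interface names are placeholders, not taken from this job:

# open a debug shell on one of the masters (<master-node> is a placeholder)
oc debug node/<master-node>
chroot /host

# per-interface RX/TX drop and error counters
ip -s link show

# MTU of the interface in question (<iface> is a placeholder)
ip link show <iface> | grep -o 'mtu [0-9]*'

# NIC statistics, if ethtool is available on the host
ethtool -S <iface> | grep -iE 'drop|err'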
I've looked into this some, but I don't have much in the way of answers. It doesn't seem to reproduce in my local cluster; I have no alerts for anything related to etcd. I do see some connection errors in the etcd logs from a CI job, but I'm not sure whether that's the source of the alerts. They're connection refused errors, and they appear to correspond to the time just before etcd started on the target node, so that may not be unexpected. Perhaps there was an unexpected restart of etcd? I don't see anything in the logs to indicate that, though. The sosreport is from the virt host, so it isn't going to tell us much about what's going on with the networking on the masters. Next week we might want to grab a Packet machine and see if we can reproduce this alert in a manual run where we can get on the nodes and look around.
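For anyone who wants to repeat the log check on a live cluster rather than from CI artifacts, something like the following should work. The grep patterns are my guesses based on the connection errors described above, and <master-node> is a placeholder:

# list the etcd static pods (names include the master node name)
oc -n openshift-etcd get pods -o wide | grep '^etcd-'

# grep one member's log for the connection errors mentioned above
oc -n openshift-etcd logs etcd-<master-node> -c etcd | grep -iE 'connection refused|rejected connection'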
I have also looked at the logs and could not find anything relevant to the problem. Could you also look at the logs and, if you find something related to the problem, assign it to our team for fixing? Thanks.
Checked both the etcd and etcd-operator logs on both a regular and a metal cluster. No errors related to the "etcdHighNumberOfFailedGRPCRequests" alert were found.

Steps followed:

[skundu@skundu ~]$ for i in $(oc get ns | grep etcd | awk '{print $1}'); do oc -n $i get po; done
NAME                                                     READY   STATUS      RESTARTS   AGE
etcd-ip-10-0-134-0.us-east-2.compute.internal            4/4     Running     0          4h11m
etcd-ip-10-0-172-160.us-east-2.compute.internal          4/4     Running     0          4h14m
etcd-ip-10-0-215-182.us-east-2.compute.internal          4/4     Running     0          4h12m
etcd-quorum-guard-6f5966d9b-2wt7p                        1/1     Running     0          4h22m
etcd-quorum-guard-6f5966d9b-g4547                        1/1     Running     0          4h22m
etcd-quorum-guard-6f5966d9b-tsdq6                        1/1     Running     0          4h22m
installer-2-ip-10-0-134-0.us-east-2.compute.internal     0/1     Completed   0          4h20m
installer-2-ip-10-0-172-160.us-east-2.compute.internal   0/1     Completed   0          4h21m
installer-2-ip-10-0-215-182.us-east-2.compute.internal   0/1     Completed   0          4h20m
installer-3-ip-10-0-134-0.us-east-2.compute.internal     0/1     Completed   0          4h11m
installer-3-ip-10-0-172-160.us-east-2.compute.internal   0/1     Completed   0          4h14m
installer-3-ip-10-0-215-182.us-east-2.compute.internal   0/1     Completed   0          4h12m
NAME                             READY   STATUS    RESTARTS        AGE
etcd-operator-7c57d5b65c-5d82s   1/1     Running   1 (4h21m ago)   4h25m

Checked the container logs of etcd-operator:

[skundu@skundu ~]$ oc -n openshift-etcd-operator logs etcd-operator-7c57d5b65c-5d82s -c etcd-operator | grep -i etcdHighNumberOfFailedGRPCRequests
[skundu@skundu ~]$

Checked all 4 container logs of all the etcd pods:

[skundu@skundu ~]$ oc -n openshift-etcd logs etcd-ip-10-0-134-0.us-east-2.compute.internal -c etcdctl | grep -i etcdHighNumberOfFailedGRPCRequests
[skundu@skundu ~]$
[skundu@skundu ~]$ oc -n openshift-etcd logs etcd-ip-10-0-134-0.us-east-2.compute.internal -c etcd | grep -i etcdHighNumberOfFailedGRPCRequests
[skundu@skundu ~]$
[skundu@skundu ~]$ oc -n openshift-etcd logs etcd-ip-10-0-134-0.us-east-2.compute.internal -c etcd-metrics | grep -i etcdHighNumberOfFailedGRPCRequests
[skundu@skundu ~]$
[skundu@skundu ~]$ oc -n openshift-etcd logs etcd-ip-10-0-134-0.us-east-2.compute.internal -c etcd-health-monitor | grep -i etcdHighNumberOfFailedGRPCRequests
[skundu@skundu ~]$

No occurrence of etcdHighNumberOfFailedGRPCRequests was found on either the regular or the metal cluster. Moving it to VERIFIED.
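One complementary check that may be worth adding, since the alert name would normally show up in the monitoring stack rather than in the etcd container logs: query Alertmanager for active alerts. This is only a sketch assuming the default openshift-monitoring alertmanager-main route and the Alertmanager v2 API; the route name and API path may differ on a given cluster.

# get the Alertmanager route host and a token for the current user
HOST=$(oc -n openshift-monitoring get route alertmanager-main -o jsonpath='{.spec.host}')
TOKEN=$(oc whoami -t)

# list active alerts and look for the one in question
curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v2/alerts" | grep -o etcdHighNumberOfFailedGRPCRequests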
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056