Description of problem:
https://search.ci.openshift.org/?search=etcdHighNumberOfFailedGRPCRequests&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

The etcdHighNumberOfFailedGRPCRequests alert was reverted because it started to fire in this environment. Investigation showed failed gRPC requests to etcd. We suspect a network issue may be making requests too slow, but someone needs to dig into it further. It may be worth looking through the sosreport and similar data for dropped packets, MTU issues, or anything else that stands out.

Version-Release number of selected component (if applicable):
4.9 CI

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
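For whoever picks up the network side, a minimal sketch of the node-level checks suggested above (dropped packets, MTU). The node and interface names are placeholders, not taken from this job:

# open a debug shell on one of the masters (<master-node> is a placeholder)
oc debug node/<master-node>
chroot /host

# per-interface RX/TX drop and error counters
ip -s link show

# MTU of the interface in question (<iface> is a placeholder)
ip link show <iface> | grep -o 'mtu [0-9]*'

# NIC statistics, if ethtool is available on the host
ethtool -S <iface> | grep -iE 'drop|err'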
I've looked into this some, but I don't have much in the way of answers. It doesn't seem to reproduce in my local cluster; I have no alerts for anything related to etcd. I do see some connection errors in the etcd logs from a CI job, but I'm not sure whether that's the source of the alerts. They're connection refused errors, and they appear to correspond to the time just before etcd started on the target node, so that may not be unexpected. Perhaps there was an unexpected restart of etcd? I don't see anything in the logs to indicate that, though. The sosreport is from the virt host, so it isn't going to tell us much about what's going on with the networking on the masters. Next week we might want to grab a Packet machine and see if we can reproduce this alert in a manual run where we can get on the nodes and look around.
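For anyone who wants to repeat the log check on a live cluster rather than from CI artifacts, something like the following should work. The grep patterns are my guesses based on the connection errors described above, and <master-node> is a placeholder:

# list the etcd static pods (names include the master node name)
oc -n openshift-etcd get pods -o wide | grep '^etcd-'

# grep one member's log for the connection errors mentioned above
oc -n openshift-etcd logs etcd-<master-node> -c etcd | grep -iE 'connection refused|rejected connection'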
I have also looked at the logs and could not find anything relevant to the problem. Could you also look at the logs and, if you find something related to the problem, assign it to our team for fixing? Thanks.
Checked both the etcd and etcd-operator logs on both a regular and a metal cluster. No errors related to the "etcdHighNumberOfFailedGRPCRequests" alert were found.

Steps followed:

[skundu@skundu ~]$ for i in $(oc get ns | grep etcd | awk '{print $1}'); do oc -n $i get po; done
NAME                                                     READY   STATUS      RESTARTS   AGE
etcd-ip-10-0-134-0.us-east-2.compute.internal            4/4     Running     0          4h11m
etcd-ip-10-0-172-160.us-east-2.compute.internal          4/4     Running     0          4h14m
etcd-ip-10-0-215-182.us-east-2.compute.internal          4/4     Running     0          4h12m
etcd-quorum-guard-6f5966d9b-2wt7p                        1/1     Running     0          4h22m
etcd-quorum-guard-6f5966d9b-g4547                        1/1     Running     0          4h22m
etcd-quorum-guard-6f5966d9b-tsdq6                        1/1     Running     0          4h22m
installer-2-ip-10-0-134-0.us-east-2.compute.internal     0/1     Completed   0          4h20m
installer-2-ip-10-0-172-160.us-east-2.compute.internal   0/1     Completed   0          4h21m
installer-2-ip-10-0-215-182.us-east-2.compute.internal   0/1     Completed   0          4h20m
installer-3-ip-10-0-134-0.us-east-2.compute.internal     0/1     Completed   0          4h11m
installer-3-ip-10-0-172-160.us-east-2.compute.internal   0/1     Completed   0          4h14m
installer-3-ip-10-0-215-182.us-east-2.compute.internal   0/1     Completed   0          4h12m
NAME                             READY   STATUS    RESTARTS        AGE
etcd-operator-7c57d5b65c-5d82s   1/1     Running   1 (4h21m ago)   4h25m

Checked the container logs of etcd-operator:

[skundu@skundu ~]$ oc -n openshift-etcd-operator logs etcd-operator-7c57d5b65c-5d82s -c etcd-operator | grep -i etcdHighNumberOfFailedGRPCRequests
[skundu@skundu ~]$

Checked all 4 container logs of all the etcd pods:

[skundu@skundu ~]$ oc -n openshift-etcd logs etcd-ip-10-0-134-0.us-east-2.compute.internal -c etcdctl | grep -i etcdHighNumberOfFailedGRPCRequests
[skundu@skundu ~]$
[skundu@skundu ~]$ oc -n openshift-etcd logs etcd-ip-10-0-134-0.us-east-2.compute.internal -c etcd | grep -i etcdHighNumberOfFailedGRPCRequests
[skundu@skundu ~]$
[skundu@skundu ~]$ oc -n openshift-etcd logs etcd-ip-10-0-134-0.us-east-2.compute.internal -c etcd-metrics | grep -i etcdHighNumberOfFailedGRPCRequests
[skundu@skundu ~]$
[skundu@skundu ~]$ oc -n openshift-etcd logs etcd-ip-10-0-134-0.us-east-2.compute.internal -c etcd-health-monitor | grep -i etcdHighNumberOfFailedGRPCRequests
[skundu@skundu ~]$

No occurrence of etcdHighNumberOfFailedGRPCRequests was found on either the regular or the metal cluster. Moving it to VERIFIED.
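One complementary check that may be worth adding, since the alert name would normally show up in the monitoring stack rather than in the etcd container logs: query Alertmanager for active alerts. This is only a sketch assuming the default openshift-monitoring alertmanager-main route and the Alertmanager v2 API; the route name and API path may differ on a given cluster.

# get the Alertmanager route host and a token for the current user
HOST=$(oc -n openshift-monitoring get route alertmanager-main -o jsonpath='{.spec.host}')
TOKEN=$(oc whoami -t)

# list active alerts and look for the one in question
curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v2/alerts" | grep -o etcdHighNumberOfFailedGRPCRequests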
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056