1701154 – etcdHighNumberOfFailedGRPCRequests constantly firing

Summary: etcdHighNumberOfFailedGRPCRequests constantly firing

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.10.z
Assignee:	melbeher
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:	tag-ci
Duplicates (3):	1701805 1772446 1989487
Depends On:	1677689
Blocks:

Reported:	2019-04-18 08:43 UTC by Frederic Branczyk
Modified:	2022-05-09 11:08 UTC
CC List:	17 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-05-09 11:08:52 UTC
Target Upstream Version:
Embargoed:
Flags:	melbeher: needinfo+

Links
System	ID	Priority	Status	Summary	Last Updated
Github	etcd-io etcd issues 10289	None	closed	gRPC code Unavailable instead Canceled	2021-10-29 20:46:17 UTC
Github	openshift cluster-etcd-operator pull 626	None	Merged	Bug 1701154: Enable etcdHighNumberOfFailedGRPCRequests alerts	2021-10-29 20:46:16 UTC
Github	openshift cluster-etcd-operator pull 637	None	Merged	Revert "Bug 1701154: Enable etcdHighNumberOfFailedGRPCRequests alerts"	2021-10-29 20:46:15 UTC
Github	openshift cluster-etcd-operator pull 654	None	Merged	Bug 1990489: Reintroduce etcdHighNumberOfFailedGRPCRequests alert for non metal ipi clusters	2021-10-29 20:46:14 UTC

Description Frederic Branczyk 2019-04-18 08:43:33 UTC

Description of problem:

Out of the box the "etcdHighNumberOfFailedGRPCRequests" alert is firing. This is a known upstream issue: https://github.com/etcd-io/etcd/issues/10289. Requests serving correctly is one of the best metrics to tell whether a service is behaving as expected. We should fix this.

If we deem this too complicated to fix for 4.1, we should remove the alert. I'd prefer to see a fix, but letting you make the call on that. :)
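
To illustrate why a miscoded status makes this kind of alert fire, here is a minimal sketch of a failed-request ratio computed from per-status-code gRPC counts. It assumes the alert compares the share of failed requests against a threshold, and that the linked upstream issue (titled "gRPC code Unavailable instead Canceled") means client-side cancellations get counted under "Unavailable". The FAILURE_CODES set, the failed_ratio helper, and the sample counts are all hypothetical and are not taken from the real alert rule.

```python
from typing import Mapping

# Hypothetical set of status codes counted as server-side failures.
# The real alert rule may use a different selection (assumption).
FAILURE_CODES = {"Unknown", "Internal", "Unavailable", "DataLoss", "DeadlineExceeded"}


def failed_ratio(handled: Mapping[str, float]) -> float:
    """Return the fraction of handled requests whose code counts as a failure."""
    total = sum(handled.values())
    if total == 0:
        return 0.0
    failed = sum(count for code, count in handled.items() if code in FAILURE_CODES)
    return failed / total


# Made-up counts for one method over one window.
# Case 1: client cancellations are reported as "Canceled".
reported_as_canceled = {"OK": 900, "Canceled": 100}
# Case 2: the same cancellations are reported as "Unavailable".
reported_as_unavailable = {"OK": 900, "Unavailable": 100}

print(failed_ratio(reported_as_canceled))
print(failed_ratio(reported_as_unavailable))
```

With identical traffic, the second case shows a nonzero failure ratio even though no request truly failed, which matches the "firing out of the box" symptom described above.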

Version-Release number of selected component (if applicable):

4.1

How reproducible:

Always. Go to "Alerts" page in the Prometheus UI on any cluster with recent payload (as etcd integration was recently fixed/finished).

Steps to Reproduce:
1.
2.
3.

Actual results:

"etcdHighNumberOfFailedGRPCRequests" firing

Expected results:

"etcdHighNumberOfFailedGRPCRequests" not firing, or only firing when requests are truly failing.

Additional info:

Comment 1 Seth Jennings 2019-04-18 14:31:14 UTC

Interestingly, I only recently saw this become an issue.  Did something change in the past 72hrs wrt this alert/metric/etcd?

Comment 2 Frederic Branczyk 2019-04-18 15:21:55 UTC

Etcd metrics were fixed and became available. Before that, no etcd metrics were being collected successfully, hence there was no data to alert on. (It's a little more involved than that, but those were the symptoms of the circumstances and bugs around this topic.)

Comment 3 Frederic Branczyk 2019-04-23 07:56:05 UTC

I don't think 4.1.z is an acceptable choice here (sorry if I didn't make the priority clear enough before). Any false-positively firing alert out of the box is unacceptable, and this one fires 100% of the time. We need to make a call to either remove the alert or fix the underlying issue.

Comment 4 Sam Batschelet 2019-04-23 12:43:02 UTC

*** Bug 1701805 has been marked as a duplicate of this bug. ***

Comment 5 Seth Jennings 2019-04-23 13:05:02 UTC

I agree this should be handled for 4.1.0

Comment 6 Seth Jennings 2019-04-25 17:43:32 UTC

I'm not seeing this any more as of 4.1.0-0.ci-2019-04-25-121304.  Did something fix it?

Comment 7 Frederic Branczyk 2019-04-26 08:11:23 UTC

The etcd integration was recently broken; https://github.com/openshift/cluster-monitoring-operator/pull/336 fixes this. We're also adding an additional e2e-aws check to prevent this in the future: https://github.com/openshift/origin/pull/22661

Comment 11 ge liu 2019-04-30 01:55:36 UTC

Tried to verify it, but it is blocked by Bug 1703727.

Comment 12 Junqi Zhao 2019-04-30 05:52:19 UTC

(In reply to ge liu from comment #11)
> Tried to verify it, but it is blocked by Bug 1703727.

Bug 1703727 is fixed; verification is now blocked by bug 1704573.

Comment 13 Frederic Branczyk 2019-04-30 07:57:00 UTC

I don't think this is fixed even with the blocking bug being fixed. The alert is still firing and we need to get to some solution so it's not.

Comment 14 Junqi Zhao 2019-04-30 09:47:07 UTC

Yes, etcdHighNumberOfFailedGRPCRequests is still firing, although bug 1704573 still exists.

Comment 16 Greg Blomquist 2019-05-01 13:02:17 UTC

In 4.1.0, the monitoring team is going to disable this check.  The etcd team will revisit the underlying issue here for 4.1.z.

Comment 17 Frederic Branczyk 2019-05-02 09:12:24 UTC

The removal has been merged: https://github.com/openshift/cluster-monitoring-operator/pull/340, but let's keep this BZ to make sure we re-enable it once the underlying issue has been fixed.

Comment 18 Sam Batschelet 2019-05-02 12:24:55 UTC

I agree with Frederic, thanks for the short-term fix.

Comment 22 Frederic Branczyk 2019-11-07 11:54:16 UTC

This is a critically important metric and we're unable to alert on it in a meaningful way, so it is very important that this be addressed. I am OK with it being addressed in a future release, but it must be addressed.

Comment 28 ge liu 2020-03-09 08:34:09 UTC

Verified with 4.4.0-0.nightly-2020-03-08-001205

Comment 29 Frederic Branczyk 2020-04-17 12:38:49 UTC

As noted in https://bugzilla.redhat.com/show_bug.cgi?id=1701154#c22, this is not a valid fix: we are unable to alert on etcd correctly without this metric being reported correctly. Removing the alert is not a fix for the metric; it just removed the noise. The bug is still present.

Comment 32 Michal Fojtik 2020-05-20 10:49:17 UTC

I’m adding UpcomingSprint, because I lack the information to properly root cause the bug. I will revisit this bug when the information is available. If you have further information on the current state of the bug, please update it and remove the "LifecycleStale" keyword, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 33 Michal Fojtik 2020-05-27 00:01:46 UTC

This bug hasn't had any activity 7 days after it was marked as LifecycleStale, so we are closing this bug as WONTFIX. If you consider this bug still valuable, please reopen it or create a new bug.

Comment 34 Frederic Branczyk 2020-05-27 07:14:44 UTC

Can we pin this issue somehow, to prevent it from being closed? This is extremely important because, as long as it stands, we are unable to properly monitor etcd. The urgency and importance of this will not change; in fact, they will only grow with every day that we're unable to monitor etcd in an appropriate manner.

Comment 35 Dan Mace 2020-06-09 15:18:31 UTC

https://github.com/etcd-io/etcd/pull/11375 is an attempt to fix this upstream in etcd.

Comment 36 Dan Mace 2020-06-09 15:26:26 UTC

*** Bug 1772446 has been marked as a duplicate of this bug. ***

Comment 37 Dan Mace 2020-06-09 15:30:06 UTC

https://bugzilla.redhat.com/show_bug.cgi?id=1677689 already exists to track the upstream connection event fix. Can we keep this bug to track re-enabling the alert once https://bugzilla.redhat.com/show_bug.cgi?id=1677689 is resolved?

I made this bug depend on https://bugzilla.redhat.com/show_bug.cgi?id=1677689 for that purpose.

Comment 40 Dan Mace 2020-08-18 13:57:08 UTC

This is close, but I don't think we can get it done in 4.6 at this point. Moving to 4.7.

Comment 41 Sam Batschelet 2020-09-11 21:25:03 UTC

This bug is actively being worked on.

Comment 46 Lili Cosic 2021-06-11 13:04:03 UTC

Waiting for 4.9 to open so we can bring in the latest etcd 3.4, which fixes this.

Comment 49 W. Trevor King 2021-08-03 15:45:27 UTC

This broke 4.9's metal-ipi blocking job, and I've opened a PR to revert.

Comment 50 W. Trevor King 2021-08-03 15:47:50 UTC

*** Bug 1989487 has been marked as a duplicate of this bug. ***

Comment 52 W. Trevor King 2021-08-04 12:51:17 UTC

etcd#637's revert moves this back to ASSIGNED.

Comment 54 Stephen Benjamin 2021-10-20 11:30:57 UTC

It looks like on some platforms (ovirt, metal upi, compact on many platforms), etcdHighNumberOfFailedGRPCRequests fires a high percentage of the time. On others, it nearly never fails (e.g. aws, gcp).

Some example failures:
	ovirt: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt/1450250565182296064
	metal upi: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.10/1450661146720735232
	metal upi compact: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.10/1448735290150621184
	
	More from search.ci: https://search.ci.openshift.org/?search=etcdHighNumberOfFailedGRPCRequests+fired&maxAge=168h&context=1&type=bug%2Bjunit&name=&excludeName=%28aws%7Cgcp%29&maxMatches=5&maxBytes=20971520&groupBy=job

Interesting that ovirt jobs use the ipi-conf-etcd-on-ramfs step -- is that supposed to improve performance?

@Lili, could you have a look? Thank you!

Comment 59 melbeher 2022-04-28 16:08:14 UTC

Is this alert still firing?