Bug 1701154 - etcdHighNumberOfFailedGRPCRequests constantly firing
Summary: etcdHighNumberOfFailedGRPCRequests constantly firing
Keywords:
Status: POST
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Lili Cosic
QA Contact: ge liu
URL:
Whiteboard: tag-ci
Duplicates: 1701805 1772446 1989487
Depends On: 1677689
Blocks:
 
Reported: 2019-04-18 08:43 UTC by Frederic Branczyk
Modified: 2021-10-20 13:38 UTC (History)
CC: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-27 00:01:46 UTC
Target Upstream Version:
stbenjam: needinfo? (lcosic)




Links
System ID Private Priority Status Summary Last Updated
Github etcd-io etcd issues 10289 0 None open gRPC code Unavailable instead Canceled 2021-02-05 19:13:09 UTC
Github openshift cluster-etcd-operator pull 626 0 None None None 2021-08-02 13:40:07 UTC
Github openshift cluster-etcd-operator pull 637 0 None None None 2021-08-03 15:45:55 UTC
Github openshift cluster-etcd-operator pull 654 0 None None None 2021-09-03 12:41:06 UTC

Description Frederic Branczyk 2019-04-18 08:43:33 UTC
Description of problem:

Out of the box, the "etcdHighNumberOfFailedGRPCRequests" alert is firing. This is a known upstream issue: https://github.com/etcd-io/etcd/issues/10289. The rate of successfully served requests is one of the best metrics for telling whether a service is behaving as expected. We should fix this.

If we deem this too complicated to fix for 4.1, we should remove the alert. I'd prefer to see a fix, but I'm letting you make the call on that. :)
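For context, the false positives stem from etcd reporting client-cancelled watch streams with gRPC code "Unavailable" instead of "Canceled" (etcd-io/etcd#10289), so any rule that simply alerts on grpc_code!="OK" fires constantly on healthy clusters. A minimal sketch of the kind of rule that avoids this, assuming etcd reports "Canceled" correctly; this is illustrative only, and the threshold, duration, and severity here are placeholders, not the exact rule shipped in OCP:

```yaml
# Illustrative alerting rule sketch: count only genuinely failing gRPC
# codes in the numerator instead of everything != "OK", so correctly
# reported client cancellations no longer trip the alert.
# Threshold (> 5%) and "for" duration are placeholders.
- alert: etcdHighNumberOfFailedGRPCRequests
  expr: |
    100 * sum without (grpc_type, grpc_code) (
      rate(grpc_server_handled_total{job=~".*etcd.*",
        grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])
    )
    /
    sum without (grpc_type, grpc_code) (
      rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])
    )
    > 5
  for: 10m
  labels:
    severity: warning
```

Note that this only stops the false positives once etcd itself reports cancellations as "Canceled"; with older etcd they still surface as "Unavailable" and land in the numerator.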

Version-Release number of selected component (if applicable):

4.1

How reproducible:

Always. Go to the "Alerts" page in the Prometheus UI on any cluster with a recent payload (the etcd integration was recently fixed/finished).

Steps to Reproduce:
1.
2.
3.

Actual results:

"etcdHighNumberOfFailedGRPCRequests" firing

Expected results:

"etcdHighNumberOfFailedGRPCRequests" not firing, or only firing when requests are truly failing.

Additional info:

Comment 1 Seth Jennings 2019-04-18 14:31:14 UTC
Interestingly, I only recently saw this become an issue.  Did something change in the past 72hrs wrt this alert/metric/etcd?

Comment 2 Frederic Branczyk 2019-04-18 15:21:55 UTC
Etcd metrics were fixed and became available; before that, no etcd metrics were being collected successfully, hence there was no data to alert on. (It's a little more involved than that, but those were the symptoms of the circumstances and bugs around this topic.)

Comment 3 Frederic Branczyk 2019-04-23 07:56:05 UTC
I don't think 4.1.z is an acceptable choice here (sorry if I didn't make the priority clear enough before). Any alert that fires as a false positive out of the box is unacceptable, and this one fires 100% of the time. We need to make a call to either remove the alert or fix the underlying issue.

Comment 4 Sam Batschelet 2019-04-23 12:43:02 UTC
*** Bug 1701805 has been marked as a duplicate of this bug. ***

Comment 5 Seth Jennings 2019-04-23 13:05:02 UTC
I agree this should be handled for 4.1.0

Comment 6 Seth Jennings 2019-04-25 17:43:32 UTC
I'm not seeing this any more as of 4.1.0-0.ci-2019-04-25-121304.  Did something fix it?

Comment 7 Frederic Branczyk 2019-04-26 08:11:23 UTC
The etcd integration was recently broken; https://github.com/openshift/cluster-monitoring-operator/pull/336 fixes this. We're also adding an additional e2e-aws check to prevent this in the future: https://github.com/openshift/origin/pull/22661

Comment 11 ge liu 2019-04-30 01:55:36 UTC
Tried to verify it, but it is blocked by Bug 1703727.

Comment 12 Junqi Zhao 2019-04-30 05:52:19 UTC
(In reply to ge liu from comment #11)
> Tried to verify it, but it is blocked by Bug 1703727.

Bug 1703727 is fixed; now blocked by bug 1704573.

Comment 13 Frederic Branczyk 2019-04-30 07:57:00 UTC
I don't think this is fixed even with the blocking bug being fixed. The alert is still firing, and we need to reach a solution so that it isn't.

Comment 14 Junqi Zhao 2019-04-30 09:47:07 UTC
Yes, etcdHighNumberOfFailedGRPCRequests is still firing while bug 1704573 exists.

Comment 16 Greg Blomquist 2019-05-01 13:02:17 UTC
In 4.1.0, the monitoring team is going to disable this check.  The etcd team will revisit the underlying issue here for 4.1.z.

Comment 17 Frederic Branczyk 2019-05-02 09:12:24 UTC
The removal has been merged: https://github.com/openshift/cluster-monitoring-operator/pull/340, but let's keep this BZ to make sure we re-enable it once the underlying issue has been fixed.

Comment 18 Sam Batschelet 2019-05-02 12:24:55 UTC
I agree with Frederic; thanks for the short-term fix.

Comment 22 Frederic Branczyk 2019-11-07 11:54:16 UTC
This is a critically important metric, and we're unable to alert on it in a meaningful way; it is very important that this be addressed. I am OK with it being addressed in a future release, but it must be addressed.

Comment 28 ge liu 2020-03-09 08:34:09 UTC
Verified with 4.4.0-0.nightly-2020-03-08-001205

Comment 29 Frederic Branczyk 2020-04-17 12:38:49 UTC
As noted in https://bugzilla.redhat.com/show_bug.cgi?id=1701154#c22, this is not a valid fix; we are unable to alert on etcd correctly without this metric being reported correctly. Removing the alert is not a fix for the metric; it just removed the noise. The bug is still present.

Comment 32 Michal Fojtik 2020-05-20 10:49:17 UTC
Iā€™m adding UpcomingSprint, because I lack the information to properly root cause the bug. I will revisit this bug when the information is available. If you have further information on the current state of the bug, please update it and remove the "LifecycleStale" keyword, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 33 Michal Fojtik 2020-05-27 00:01:46 UTC
This bug hasn't had any activity 7 days after it was marked as LifecycleStale, so we are closing this bug as WONTFIX. If you consider this bug still valuable, please reopen it or create a new bug.

Comment 34 Frederic Branczyk 2020-05-27 07:14:44 UTC
Can we pin this issue somehow, to prevent it from being closed? This is extremely important: because of it, we are unable to properly monitor etcd. The urgency and importance of this will not change; in fact, it will only grow with every day that we're unable to monitor etcd in an appropriate manner.

Comment 35 Dan Mace 2020-06-09 15:18:31 UTC
https://github.com/etcd-io/etcd/pull/11375 is an attempt to fix this upstream in etcd.

Comment 36 Dan Mace 2020-06-09 15:26:26 UTC
*** Bug 1772446 has been marked as a duplicate of this bug. ***

Comment 37 Dan Mace 2020-06-09 15:30:06 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1677689 already exists to track the upstream connection event fix. Can we keep this bug to track re-enabling the alert once https://bugzilla.redhat.com/show_bug.cgi?id=1677689 is resolved?

I made this bug depend on https://bugzilla.redhat.com/show_bug.cgi?id=1677689 for that purpose.

Comment 40 Dan Mace 2020-08-18 13:57:08 UTC
This is close, but I don't think we can get it done in 4.6 at this point. Moving to 4.7.

Comment 41 Sam Batschelet 2020-09-11 21:25:03 UTC
This bug is actively being worked on.

Comment 46 Lili Cosic 2021-06-11 13:04:03 UTC
Waiting for 4.9 to open so we can bring in the latest etcd 3.4, which fixes this.

Comment 49 W. Trevor King 2021-08-03 15:45:27 UTC
This broke 4.9's metal-ipi blocking job, and I've opened a PR to revert.

Comment 50 W. Trevor King 2021-08-03 15:47:50 UTC
*** Bug 1989487 has been marked as a duplicate of this bug. ***

Comment 52 W. Trevor King 2021-08-04 12:51:17 UTC
etcd#637's revert moves this back to ASSIGNED.

Comment 54 Stephen Benjamin 2021-10-20 11:30:57 UTC
It looks like on some platforms (ovirt, metal upi, compact on many platforms), etcdHighNumberOfFailedGRPCRequests fires a high percentage of the time. On others (e.g. aws, gcp), it almost never fires.

Some example failures:
	ovirt: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt/1450250565182296064
	metal upi: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.10/1450661146720735232
	metal upi compact: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.10/1448735290150621184
	
	More from search.ci: https://search.ci.openshift.org/?search=etcdHighNumberOfFailedGRPCRequests+fired&maxAge=168h&context=1&type=bug%2Bjunit&name=&excludeName=%28aws%7Cgcp%29&maxMatches=5&maxBytes=20971520&groupBy=job
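To triage which RPCs and codes drive the alert on the affected platforms, a console query along these lines can help. This is illustrative; it assumes etcd's standard grpc_server_handled_total metric and a job label matching ".*etcd.*":

```promql
# Break the non-OK gRPC completion rate down by method and code, to see
# whether watch-stream cancellations (mis-coded as "Unavailable")
# dominate on the ovirt/metal jobs.
sum by (grpc_method, grpc_code) (
  rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK"}[30m])
)
```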
	

Interesting that ovirt jobs use the ipi-conf-etcd-on-ramfs step -- is that supposed to improve performance? 


@Lili, could you have a look? Thank you!

