Description of problem:
Out of the box the "etcdHighNumberOfFailedGRPCRequests" alert is firing. This is a known upstream issue: https://github.com/etcd-io/etcd/issues/10289. Requests serving correctly is one of the best metrics to tell whether a service is behaving as expected. We should fix this.
If we deem this too complicated to fix for 4.1, we should remove the alert. I'd prefer to see a fix, but letting you make the call on that. :)
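For readers unfamiliar with this kind of alert, here is a minimal sketch of a failed-request ratio. The thread quotes neither the alert rule nor its threshold, so everything here is an assumption: that the alert is driven by per-code gRPC counters, the function name `failure_percent`, the sample counts, and the idea that some non-OK codes could be treated as benign. It only illustrates how a ratio alert can fire constantly if the codes counted as failures include ones that occur during normal operation.

```python
from typing import AbstractSet, Mapping


def failure_percent(handled: Mapping[str, float],
                    benign: AbstractSet[str] = frozenset({"OK"})) -> float:
    """Return the percentage of handled requests whose gRPC code is not benign.

    `handled` maps a gRPC status code name to the number of requests that
    finished with that code during some window (hypothetical input shape).
    """
    total = sum(handled.values())
    if total == 0:
        # No traffic in the window: report 0% rather than dividing by zero.
        return 0.0
    failed = sum(count for code, count in handled.items() if code not in benign)
    return 100.0 * failed / total


# Made-up counts for one window, for illustration only.
sample = {"OK": 900, "Canceled": 90, "Unavailable": 10}

# Every non-OK code counted as a failure.
naive = failure_percent(sample)

# Hypothetical variant that treats client-side cancellations as benign.
adjusted = failure_percent(sample, benign=frozenset({"OK", "Canceled"}))

print(f"naive={naive:.1f}% adjusted={adjusted:.1f}%")
```

With these made-up numbers the two variants differ by a factor of ten, which is why the choice of codes counted as failures matters as much as the threshold.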
Version-Release number of selected component (if applicable):
How reproducible:
Always.
Steps to Reproduce:
Go to the "Alerts" page in the Prometheus UI on any cluster with a recent payload (as the etcd integration was recently fixed/finished).
Expected results:
"etcdHighNumberOfFailedGRPCRequests" not firing, or only firing when requests are truly failing.
Interestingly, I only recently saw this become an issue. Did something change in the past 72hrs wrt this alert/metric/etcd?
Etcd metrics were fixed and became available. Before that, no etcd metrics were being collected successfully, hence there was no data to alert on. (It's a little more involved than that, but those were the symptoms of the circumstances and bugs around this topic.)
I don't think 4.1.z is an acceptable choice here (sorry if I didn't make the priority clear enough before). Any false-positively firing alert out of the box is unacceptable, and this one fires 100% of the time. We need to make a call to either remove the alert or fix the underlying issue.
*** Bug 1701805 has been marked as a duplicate of this bug. ***
I agree this should be handled for 4.1.0
I'm not seeing this any more as of 4.1.0-0.ci-2019-04-25-121304. Did something fix it?
The etcd integration was recently broken; https://github.com/openshift/cluster-monitoring-operator/pull/336 fixes this. We're also adding an additional e2e-aws check to prevent this in the future: https://github.com/openshift/origin/pull/22661
Tried to verify it, but it is blocked by Bug 1703727.
(In reply to ge liu from comment #11)
> Tried to verify it, but it is blocked by Bug 1703727.
Bug 1703727 is fixed; verification is now blocked by bug 1704573.
I don't think this is fixed even with the blocking bug being fixed. The alert is still firing and we need to get to some solution so it's not.
Yes, etcdHighNumberOfFailedGRPCRequests is still firing, although bug 1704573 exists.
In 4.1.0, the monitoring team is going to disable this check. The etcd team will revisit the underlying issue here for 4.1.z.
The removal has been merged: https://github.com/openshift/cluster-monitoring-operator/pull/340, but let's keep this BZ to make sure we re-enable it once the underlying issue has been fixed.
I agree with Frederic, thanks for the short-term fix.
This is a critically important metric and we're unable to alert on it in a meaningful way, so it is very important that this be addressed. I am OK with it being addressed in a future release, but it must be addressed.
Verified with 4.4.0-0.nightly-2020-03-08-001205
As noted in https://bugzilla.redhat.com/show_bug.cgi?id=1701154#c22, this is not a valid fix: we are unable to alert on etcd correctly without this metric being reported correctly. Removing the alert is not a fix for the metric; it just removed the noise. The bug is still present.
I’m adding UpcomingSprint, because I lack the information to properly root cause the bug. I will revisit this bug when the information is available. If you have further information on the current state of the bug, please update it and remove the "LifecycleStale" keyword, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.
This bug hasn't had any activity in the 7 days since it was marked as LifecycleStale, so we are closing this bug as WONTFIX. If you consider this bug still valuable, please reopen it or create a new bug.
Can we pin this issue somehow, to prevent it from being closed? This is extremely important because, as long as it is unresolved, we are unable to properly monitor etcd. The urgency and importance of this will not diminish; in fact, it will only grow with every day that we're unable to monitor etcd in an appropriate manner.
https://github.com/etcd-io/etcd/pull/11375 is an attempt to fix this upstream in etcd.
*** Bug 1772446 has been marked as a duplicate of this bug. ***
https://bugzilla.redhat.com/show_bug.cgi?id=1677689 already exists to track the upstream connection event fix. Can we keep this bug to track re-enabling the alert once https://bugzilla.redhat.com/show_bug.cgi?id=1677689 is resolved?
I made this bug depend on https://bugzilla.redhat.com/show_bug.cgi?id=1677689 for that purpose.
This is close, but I don't think we can get it done in 4.6 at this point. Moving to 4.7.
This bug is actively being worked on.
Waiting for 4.9 to open so we can bring in the latest etcd 3.4, which fixes this.
This broke 4.9's metal-ipi blocking job, and I've opened a PR to revert.
*** Bug 1989487 has been marked as a duplicate of this bug. ***
etcd#637's revert moves this back to ASSIGNED.
It looks like on some platforms (ovirt, metal upi, compact on many platforms), etcdHighNumberOfFailedGRPCRequests fires a high percentage of the time. On others (e.g. aws, gcp), it almost never does.
Some example failures:
metal upi: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.10/1450661146720735232
metal upi compact: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.10/1448735290150621184
More from search.ci: https://search.ci.openshift.org/?search=etcdHighNumberOfFailedGRPCRequests+fired&maxAge=168h&context=1&type=bug%2Bjunit&name=&excludeName=%28aws%7Cgcp%29&maxMatches=5&maxBytes=20971520&groupBy=job
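To make the search.ci link above easier to read, the following sketch decodes its query string using only the Python standard library. The helper name `decode_query` is made up for illustration, and the comments on what each parameter means are a reading of the decoded values, not documented search.ci behavior.

```python
from urllib.parse import parse_qs, urlsplit

SEARCH_URL = (
    "https://search.ci.openshift.org/"
    "?search=etcdHighNumberOfFailedGRPCRequests+fired&maxAge=168h&context=1"
    "&type=bug%2Bjunit&name=&excludeName=%28aws%7Cgcp%29&maxMatches=5"
    "&maxBytes=20971520&groupBy=job"
)


def decode_query(url: str) -> dict:
    """Return the URL's query parameters as {name: first decoded value}."""
    # keep_blank_values=True preserves empty parameters such as `name=`.
    parsed = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return {key: values[0] for key, values in parsed.items()}


params = decode_query(SEARCH_URL)
for key, value in params.items():
    print(f"{key} = {value!r}")

# Decoded values of interest:
#   search      -> the alert name followed by "fired"
#   maxAge      -> 168h, i.e. the last 7 days
#   excludeName -> (aws|gcp), presumably a pattern excluding aws and gcp jobs
```

The decoded `excludeName` value is consistent with the observation that the alert almost never fires on aws and gcp, so those jobs are left out of the search.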
Interesting that ovirt jobs use the ipi-conf-etcd-on-ramfs step -- is that supposed to improve performance?
@Lili, could you have a look? Thank you!
Is this alert still firing?