Bug 1677689 - Etcd cluster "etcd": 100% of requests for Watch failed on etcd instance <ETCD>:2379. grpc_service="etcdserverpb.Watch"
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 3.11.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Dan Mace
QA Contact: ge liu
URL:
Whiteboard: LifecycleFrozen
Depends On:
Blocks: 1701154 1772446
 
Reported: 2019-02-15 15:20 UTC by Josh Foots
Modified: 2024-01-06 04:26 UTC
CC List: 31 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1772446 (view as bug list)
Environment:
Last Closed: 2021-09-15 18:33:38 UTC
Target Upstream Version:
Embargoed:


Attachments
EtcdHighNumberOfFailedGRPCRequests alerts are shown after enabling etcd monitoring (46.87 KB, image/png)
2019-02-20 07:40 UTC, Junqi Zhao
silence.png (25.83 KB, image/png)
2019-12-25 13:08 UTC, Anand Paladugu
silence-example (48.94 KB, image/png)
2020-01-02 21:56 UTC, Sam Batschelet


Links
System ID Status Summary Last Updated
Github etcd-io/etcd issue 10289 open gRPC code Unavailable instead Canceled 2021-02-15 09:28:30 UTC
Github etcd-io/etcd pull 11375 open etcdserver: fix watch metrics 2021-02-15 09:28:30 UTC
Github etcd-io/etcd pull 12196 open etcdserver: fix incorrect metrics generated when clients cancel watches 2021-02-15 09:28:31 UTC
Github openshift/origin issue 20311 open etcd - All grpc_code for grpc_method "Watch" is "Unavailable" 2021-02-15 09:28:39 UTC
Red Hat Knowledge Base (Solution) 3358691 OpenShift etcd: failed to receive watch request from gRPC stream 2019-03-28 12:57:11 UTC

Description Josh Foots 2019-02-15 15:20:33 UTC
Description of problem:

The following alert is seen in the logs:
Etcd cluster "etcd": 100% of requests for Watch failed on etcd instance <ETCD>:2379.
grpc_service="etcdserverpb.Watch"


How reproducible:

Very

Steps to Reproduce:
1. Follow https://docs.openshift.com/container-platform/3.11/install_config/prometheus_cluster_monitoring.html#configuring-etcd-monitoring

Actual results:

The alert fires even though the etcd cluster is healthy and there are no issues.

Expected results:

The alert should fire only when there is an actual problem with the etcd cluster.

Additional info:

This issue was opened upstream at https://github.com/openshift/origin/issues/20311
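
For triage, the failure breakdown behind this alert can be inspected directly in Prometheus. Below is a minimal Go sketch, assuming a reachable Prometheus endpoint (the URL is a placeholder and authentication is omitted): it queries the per-grpc_code request rate for the Watch service. Per the linked upstream issue, the failing samples all carry grpc_code="Unavailable" even though the watches were merely cancelled by clients.

~~~~~~~~~~~~~~~~~~~~
package main

// Hedged triage sketch (not from this bug): query Prometheus for the
// per-grpc_code Watch request rate. The Prometheus URL is a placeholder
// and authentication is omitted.

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	q := `sum by (grpc_code) (rate(grpc_server_handled_total{grpc_service="etcdserverpb.Watch"}[5m]))`
	resp, err := http.Get("https://prometheus.example.com/api/v1/query?query=" + url.QueryEscape(q))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out struct {
		Data struct {
			Result []struct {
				Metric map[string]string `json:"metric"`
				Value  []interface{}     `json:"value"` // [unix_ts, "rate"]
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	for _, r := range out.Data.Result {
		fmt.Printf("grpc_code=%q rate=%v\n", r.Metric["grpc_code"], r.Value[1])
	}
}
~~~~~~~~~~~~~~~~~~~~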

Comment 1 Junqi Zhao 2019-02-20 07:40:29 UTC
Created attachment 1536598 [details]
EtcdHighNumberOfFailedGRPCRequests alerts are shown after enabling etcd monitoring

Comment 2 Frederic Branczyk 2019-02-20 09:05:01 UTC
This doesn't appear to be a problem with the monitoring stack, but rather with etcd incorrectly incrementing the metric in this particular case. Following the origin issue, there is the linked etcd issue https://github.com/etcd-io/etcd/issues/10289, according to which the etcd team is already looking into this.

On the monitoring side, I would prefer to keep this alert so that all the other endpoints stay monitored. To avoid being notified about these false positives, I recommend silencing this particular alert with the label selectors grpc_method="Watch" and grpc_service="etcdserverpb.Watch" (a sketch of such a silence follows this comment). The silence can then be lifted once the etcd side is fixed.

Moving this to the masters component, which etcd is part of.
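
For reference, the silence suggested above can be created programmatically through the Alertmanager v2 API. A hedged Go sketch (the Alertmanager URL, duration, and creator are placeholders; OpenShift's route normally also requires a bearer token, omitted here for brevity):

~~~~~~~~~~~~~~~~~~~~
package main

// Hedged sketch: create a silence via the Alertmanager v2 API.
// URL, duration, and creator below are placeholders, not from this bug.

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type matcher struct {
	Name    string `json:"name"`
	Value   string `json:"value"`
	IsRegex bool   `json:"isRegex"`
}

type silence struct {
	Matchers  []matcher `json:"matchers"`
	StartsAt  time.Time `json:"startsAt"`
	EndsAt    time.Time `json:"endsAt"`
	CreatedBy string    `json:"createdBy"`
	Comment   string    `json:"comment"`
}

func main() {
	s := silence{
		// Matchers must equal the firing alert's labels exactly.
		Matchers: []matcher{
			{Name: "grpc_method", Value: "Watch"},
			{Name: "grpc_service", Value: "etcdserverpb.Watch"},
		},
		StartsAt:  time.Now(),
		EndsAt:    time.Now().Add(30 * 24 * time.Hour),
		CreatedBy: "cluster-admin",
		Comment:   "False positive until etcd-io/etcd#10289 is fixed (BZ 1677689)",
	}
	body, err := json.Marshal(s)
	if err != nil {
		panic(err)
	}
	// Placeholder route; substitute your cluster's Alertmanager URL.
	resp, err := http.Post("https://alertmanager.example.com/api/v2/silences",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("silence created:", resp.Status)
}
~~~~~~~~~~~~~~~~~~~~

Note that a silence only suppresses notifications; the underlying metric is unchanged, so the alert still appears (as silenced) in the Alertmanager UI.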

Comment 3 Sam Batschelet 2019-03-05 13:00:33 UTC
Yes, this has been in limbo for a while; I will take over the etcd issue and its resolution.

Comment 12 Ivan Bazulic 2019-08-30 11:17:42 UTC
Hi Sam,

Have there been any updates on this particular issue?

Thanks!

Comment 13 kedar 2019-09-17 10:36:00 UTC
Hello,

Any updates so far on this issue?


Thanks.

Comment 15 kedar 2019-10-22 06:24:09 UTC
Hello,

The following new alerts are being received on the api-server.

The api-server on master1 prints the following error logs (a note on the compaction errors follows this comment):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I1022 03:12:01.034785       1 trace.go:76] Trace[5187212]: "Patch /api/v1/nodes/infra1/status" (started: 2019-10-22 03:11:59.4951043 +0000 UTC m=+8874845.051249201) (total time: 1.5396641s):
Trace[5187212]: [1.5394755s] [1.5386814s] Object stored in database
E1022 03:12:03.733831       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
E1022 03:12:08.403631       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
I1022 03:12:17.040558       1 trace.go:76] Trace[67662901]: "List /api/v1/nodes" (started: 2019-10-22 03:12:15.1894578 +0000 UTC m=+8874860.745602601) (total time: 1.8510764s):
Trace[67662901]: [1.8488305s] [1.8488275s] Listing from storage done
I1022 03:12:17.070722       1 trace.go:76] Trace[232382444]: "List /api/v1/nodes" (started: 2019-10-22 03:12:15.2934429 +0000 UTC m=+8874860.849587701) (total time: 1.7772594s):
Trace[232382444]: [1.7763236s] [1.7763211s] Listing from storage done
E1022 03:12:38.282960       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
E1022 03:12:47.607365       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W1022 03:12:47.607644       1 reflector.go:272] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:129: watch of *core.Secret ended with: The resourceVersion for the provided watch is too old.
I1022 03:12:58.367110       1 trace.go:76] Trace[1242358499]: "GuaranteedUpdate etcd3: *core.Node" (started: 2019-10-22 03:12:51.6703164 +0000 UTC m=+8874897.226461301) (total time: 6.6967659s):
Trace[1242358499]: [6.6967091s] [6.6956035s] Transaction committed
I1022 03:12:58.367313       1 trace.go:76] Trace[370739828]: "Patch /api/v1/nodes/infra1/status" (started: 2019-10-22 03:12:51.6701923 +0000 UTC m=+8874897.226337101) (total time: 6.6971042s):
Trace[370739828]: [6.696939s] [6.6960374s] Object stored in database
E1022 03:12:58.440852       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
I1022 03:13:00.057143       1 trace.go:76] Trace[748664276]: "GuaranteedUpdate etcd3: *core.Node" (started: 2019-10-22 03:12:57.5711421 +0000 UTC m=+8874903.127286901) (total time: 2.4859358s):
Trace[748664276]: [2.4857825s] [2.4846646s] Transaction committed
I1022 03:13:00.057486       1 trace.go:76] Trace[1507531935]: "Patch /api/v1/nodes/node4/status" (started: 2019-10-22 03:12:57.571003 +0000 UTC m=+8874903.127147801) (total time: 2.4864376s):
Trace[1507531935]: [2.4861956s] [2.4852747s] Object stored in database
I1022 03:13:00.569616       1 trace.go:76] Trace[1525894554]: "List /api/v1/nodes" (started: 2019-10-22 03:12:57.1007057 +0000 UTC m=+8874902.656850501) (total time: 3.4688708s):
Trace[1525894554]: [3.4679513s] [3.4679476s] Listing from storage done
E1022 03:13:00.949689       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W1022 03:13:00.949991       1 reflector.go:272] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:129: watch of *scheduling.PriorityClass ended with: The resourceVersion for the provided watch is too old.
E1022 03:13:17.647389       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W1022 03:13:17.647721       1 reflector.go:272] k8s.io/client-go/informers/factory.go:130: watch of *v1.Role ended with: The resourceVersion for the provided watch is too old.
E1022 03:13:18.416142       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
E1022 03:13:19.499862       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
I1022 03:13:20.627860       1 trace.go:76] Trace[1003567705]: "GuaranteedUpdate etcd3: *core.Node" (started: 2019-10-22 03:13:20.1032859 +0000 UTC m=+8874925.659430801) (total time: 524.5352ms):
Trace[1003567705]: [524.4168ms] [523.532ms] Transaction committed
I1022 03:13:20.628279       1 trace.go:76] Trace[662350473]: "Patch /api/v1/nodes/node4/status" (started: 2019-10-22 03:13:20.1031699 +0000 UTC m=+8874925.659314701) (total time: 525.0734ms):
Trace[662350473]: [524.7447ms] [524.0183ms] Object stored in database
E1022 03:13:24.318010       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W1022 03:13:24.318304       1 reflector.go:272] k8s.io/client-go/informers/factory.go:130: watch of *v1.Namespace ended with: The resourceVersion for the provided watch is too old.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Thanks,
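
The "watch chan error: etcdserver: mvcc: required revision has been compacted" lines above are what a watcher sees when it asks to resume from a revision etcd has already compacted; clients recover by restarting the watch from the compaction revision, which is exactly what the reflector warnings ("The resourceVersion for the provided watch is too old") show the apiserver doing. A minimal recovery sketch against etcd's clientv3 (the endpoint and the key "foo" are assumptions):

~~~~~~~~~~~~~~~~~~~~
package main

// Hedged sketch of compaction recovery with etcd's clientv3. The endpoint
// and key are assumptions; the start revision is deliberately ancient to
// provoke the "required revision has been compacted" path.

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()
	rev := int64(1) // almost certainly compacted on a busy cluster

	for ctx.Err() == nil {
		wch := cli.Watch(ctx, "foo", clientv3.WithRev(rev))
		for wr := range wch {
			if wr.CompactRevision != 0 {
				// The requested start revision was compacted away:
				// restart the watch from the oldest surviving revision.
				rev = wr.CompactRevision
				break
			}
			for _, ev := range wr.Events {
				rev = ev.Kv.ModRevision + 1 // resume point on reconnect
				fmt.Printf("%s %q\n", ev.Type, ev.Kv.Key)
			}
		}
	}
}
~~~~~~~~~~~~~~~~~~~~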

Comment 16 kedar 2019-11-06 05:19:55 UTC
Hello,

Any updates so far on this issue?


Thanks.

Comment 17 Sam Batschelet 2019-11-12 15:45:13 UTC
I am digging into this again now.

Comment 19 Anand Paladugu 2019-12-02 19:41:01 UTC
Another customer (case # https://gss--c.na94.visual.force.com/apex/Case_View?id=5002K00000iPAkBQAW&sfdc.override=1) has implemented the silence with the specified labels, but they are still seeing the alert.

Any thoughts on why the workaround is not helping?

Thanks

Anand

Comment 21 Anand Paladugu 2019-12-25 13:07:22 UTC

Any thoughts on why the workaround is not helping? I have also updated the silence that the customer created.

Thanks

Anand

Comment 22 Anand Paladugu 2019-12-25 13:08:23 UTC
Created attachment 1647632 [details]
silence.png

Comment 24 Sam Batschelet 2020-01-02 21:56:55 UTC
Created attachment 1649319 [details]
silence-example

> Any thought on why the workaround is not helping?  I have also updated the silence that the customer has created.

I believe the issue is with your silence format. Could you try the matcher below, which is outlined in the attachment?

`alertname=EtcdHighNumberOfFailedGRPCRequests`

Also, as of 3.11.141 [1][2], this alert has been removed until we resolve the issue upstream, so customers also have the option of upgrading.

[1]: https://docs.openshift.com/container-platform/3.11/release_notes/ocp_3_11_release_notes.html#ocp-3-11-141
[2]: https://access.redhat.com/errata/RHBA-2019:2580
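
When a silence does not take effect, the usual cause is matchers that don't exactly equal the firing alert's labels (for example, a missing alertname matcher, as above). A hedged Go sketch that dumps the alert's live labels from the Alertmanager v2 API so the silence matchers can be checked against them (the URL is a placeholder; authentication is omitted):

~~~~~~~~~~~~~~~~~~~~
package main

// Hedged sketch: list the live labels of the firing alert so silence
// matchers can be verified against them. URL is a placeholder.

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	resp, err := http.Get("https://alertmanager.example.com/api/v2/alerts")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var alerts []struct {
		Labels map[string]string `json:"labels"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&alerts); err != nil {
		panic(err)
	}
	for _, a := range alerts {
		// A silence applies only when every matcher equals one of these labels.
		if a.Labels["alertname"] == "EtcdHighNumberOfFailedGRPCRequests" {
			fmt.Println(a.Labels)
		}
	}
}
~~~~~~~~~~~~~~~~~~~~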

Comment 25 Michal Fojtik 2020-05-19 13:17:50 UTC
This bug hasn't had any engineering activity in the last ~30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

As such, we're marking this bug as "LifecycleStale".

If you have further information on the current state of the bug, please update it and remove the "LifecycleStale" keyword, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 26 Michal Fojtik 2020-05-20 10:46:05 UTC
This bug is actively worked on.

Comment 27 Dan Mace 2020-06-09 15:28:41 UTC
https://github.com/etcd-io/etcd/pull/11375 is the upstream fix; once it merges, we need to bump the openshift etcd fork to pick it up.
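
For context, the class of fix in the linked PRs is to stop counting client-initiated watch cancellations as server failures in the gRPC metrics. The sketch below is NOT etcd's actual patch, only an illustration of the idea: a streaming handler that normalizes a client cancel into codes.Canceled, so a metrics interceptor such as go-grpc-prometheus records grpc_code="Canceled" rather than "Unavailable".

~~~~~~~~~~~~~~~~~~~~
package main

// Illustrative sketch only, not etcd's actual patch.

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// serveStream stands in for the body of a streaming handler such as Watch:
// it blocks until the client's context ends, then reports the context error.
func serveStream(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// watchHandler normalizes a client-initiated cancel into codes.Canceled so
// a metrics interceptor does not treat the exit as a server-side failure.
func watchHandler(ctx context.Context) error {
	err := serveStream(ctx)
	if ctx.Err() == context.Canceled {
		return status.Error(codes.Canceled, "watch stream cancelled by client")
	}
	return err
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	go func() { time.Sleep(10 * time.Millisecond); cancel() }()
	err := watchHandler(ctx)
	fmt.Println("grpc code:", status.Code(err)) // prints "Canceled"
}
~~~~~~~~~~~~~~~~~~~~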

Comment 28 Dan Mace 2020-06-09 15:31:29 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1677689 tracks re-enabling the alert, which is only possible once this bug is resolved. So this bug now blocks 1677689.

Comment 31 Dan Mace 2020-08-18 13:56:59 UTC
This is close, but I don't think we can get it done in 4.6 at this point. Moving to 4.7.

Comment 32 Dan Mace 2020-10-01 17:11:33 UTC
Here's the latest attempt at a fix upstream: https://github.com/etcd-io/etcd/pull/12196

Comment 48 Sam Batschelet 2021-09-15 18:33:38 UTC
etcd 3.2, shipped as part of RHEL 7, is in CVE-only maintenance mode. This bug is now fixed in OCP 4.9+. Since a workaround for the bug exists (disabling the alert) and all attached customer cases are now closed, we are closing this bug as WONTFIX.

Comment 49 Red Hat Bugzilla 2024-01-06 04:26:03 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

