Description of problem:
The following alert is seen in the logs even though the etcd cluster is healthy:

  Etcd cluster "etcd": 100% of requests for Watch failed on etcd instance <ETCD>:2379. grpc_service="etcdserverpb.Watch"

How reproducible:
Very

Steps to Reproduce:
1. Follow https://docs.openshift.com/container-platform/3.11/install_config/prometheus_cluster_monitoring.html#configuring-etcd-monitoring

Actual results:
The alert fires even though the etcd cluster is healthy and there are no issues.

Expected results:
The alert should only fire when there is an actual problem.

Additional info:
This issue was opened upstream at https://github.com/openshift/origin/issues/20311
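For reference, the alert comes from the etcd alerting rules shipped with cluster monitoring; the rule looks roughly like the sketch below (paraphrased from the upstream etcd rules — the exact expression, threshold, and duration vary by release):

```yaml
# Sketch of the EtcdHighNumberOfFailedGRPCRequests rule (paraphrased, not
# the literal shipped rule). Any gRPC method whose non-OK handled-request
# rate exceeds the threshold fires the alert per (grpc_service, grpc_method).
- alert: EtcdHighNumberOfFailedGRPCRequests
  expr: |
    100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK"}[5m]))
            BY (grpc_service, grpc_method)
      /   sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m]))
            BY (grpc_service, grpc_method)
      > 5
  for: 10m
  labels:
    severity: critical
```

Because the expression only looks at `grpc_code`, any stream that etcd records with a non-OK code counts as "failed", which is what makes Watch show up at 100% here.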
Created attachment 1536598 [details] EtcdHighNumberOfFailedGRPCRequests alerts are shown after enabling etcd monitoring
This doesn't appear to be a problem with the monitoring stack itself; rather, etcd is incorrectly incrementing the metric in this particular case. From the origin issue, there is the linked issue https://github.com/etcd-io/etcd/issues/10289, according to which the etcd team is already looking into this. On the monitoring side, I would prefer to keep this alert in place for all the other endpoints. To avoid being notified about these false positives, I recommend silencing this particular alert with the label selectors grpc_method="Watch" and grpc_service="etcdserverpb.Watch". The silence can then be lifted once the etcd side is fixed. Moving this to the Master component, which etcd is part of.
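For anyone who wants to script the workaround rather than click through the Alertmanager UI, here is a minimal sketch of building such a silence payload for the Alertmanager API (the duration, `createdBy`, and Alertmanager URL are assumptions — adjust for your environment):

```python
import json
from datetime import datetime, timedelta, timezone

def build_silence(created_by, comment, hours=24):
    """Build an Alertmanager silence payload matching only the
    spurious etcd Watch failure alerts (sketch, not official tooling)."""
    now = datetime.now(timezone.utc)
    return {
        "matchers": [
            # Exact-match (non-regex) matchers for the false-positive alert's labels.
            {"name": "alertname", "value": "EtcdHighNumberOfFailedGRPCRequests", "isRegex": False},
            {"name": "grpc_method", "value": "Watch", "isRegex": False},
            {"name": "grpc_service", "value": "etcdserverpb.Watch", "isRegex": False},
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }

payload = build_silence("admin", "Silence spurious etcd Watch failures (etcd-io/etcd#10289)")
print(json.dumps(payload, indent=2))
# POST this JSON to https://<alertmanager>/api/v2/silences to create the silence.
```

The silence matches only alerts carrying all three labels, so the alert stays active for every other etcd gRPC method.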
Yes, this has been in limbo for a while. I will take over the etcd issue and its resolution.
Hi Sam, have there been any updates on this particular issue? Thanks!
Hello, any updates so far on this issue? Thanks.
Hello,

The following new alerts are being received for the api-server. The api-server on master1 prints these error logs:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I1022 03:12:01.034785 1 trace.go:76] Trace[5187212]: "Patch /api/v1/nodes/infra1/status" (started: 2019-10-22 03:11:59.4951043 +0000 UTC m=+8874845.051249201) (total time: 1.5396641s):
Trace[5187212]: [1.5394755s] [1.5386814s] Object stored in database
E1022 03:12:03.733831 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
E1022 03:12:08.403631 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
I1022 03:12:17.040558 1 trace.go:76] Trace[67662901]: "List /api/v1/nodes" (started: 2019-10-22 03:12:15.1894578 +0000 UTC m=+8874860.745602601) (total time: 1.8510764s):
Trace[67662901]: [1.8488305s] [1.8488275s] Listing from storage done
I1022 03:12:17.070722 1 trace.go:76] Trace[232382444]: "List /api/v1/nodes" (started: 2019-10-22 03:12:15.2934429 +0000 UTC m=+8874860.849587701) (total time: 1.7772594s):
Trace[232382444]: [1.7763236s] [1.7763211s] Listing from storage done
E1022 03:12:38.282960 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
E1022 03:12:47.607365 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W1022 03:12:47.607644 1 reflector.go:272] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:129: watch of *core.Secret ended with: The resourceVersion for the provided watch is too old.
I1022 03:12:58.367110 1 trace.go:76] Trace[1242358499]: "GuaranteedUpdate etcd3: *core.Node" (started: 2019-10-22 03:12:51.6703164 +0000 UTC m=+8874897.226461301) (total time: 6.6967659s):
Trace[1242358499]: [6.6967091s] [6.6956035s] Transaction committed
I1022 03:12:58.367313 1 trace.go:76] Trace[370739828]: "Patch /api/v1/nodes/infra1/status" (started: 2019-10-22 03:12:51.6701923 +0000 UTC m=+8874897.226337101) (total time: 6.6971042s):
Trace[370739828]: [6.696939s] [6.6960374s] Object stored in database
E1022 03:12:58.440852 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
I1022 03:13:00.057143 1 trace.go:76] Trace[748664276]: "GuaranteedUpdate etcd3: *core.Node" (started: 2019-10-22 03:12:57.5711421 +0000 UTC m=+8874903.127286901) (total time: 2.4859358s):
Trace[748664276]: [2.4857825s] [2.4846646s] Transaction committed
I1022 03:13:00.057486 1 trace.go:76] Trace[1507531935]: "Patch /api/v1/nodes/node4/status" (started: 2019-10-22 03:12:57.571003 +0000 UTC m=+8874903.127147801) (total time: 2.4864376s):
Trace[1507531935]: [2.4861956s] [2.4852747s] Object stored in database
I1022 03:13:00.569616 1 trace.go:76] Trace[1525894554]: "List /api/v1/nodes" (started: 2019-10-22 03:12:57.1007057 +0000 UTC m=+8874902.656850501) (total time: 3.4688708s):
Trace[1525894554]: [3.4679513s] [3.4679476s] Listing from storage done
E1022 03:13:00.949689 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W1022 03:13:00.949991 1 reflector.go:272] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:129: watch of *scheduling.PriorityClass ended with: The resourceVersion for the provided watch is too old.
E1022 03:13:17.647389 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W1022 03:13:17.647721 1 reflector.go:272] k8s.io/client-go/informers/factory.go:130: watch of *v1.Role ended with: The resourceVersion for the provided watch is too old.
E1022 03:13:18.416142 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
E1022 03:13:19.499862 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
I1022 03:13:20.627860 1 trace.go:76] Trace[1003567705]: "GuaranteedUpdate etcd3: *core.Node" (started: 2019-10-22 03:13:20.1032859 +0000 UTC m=+8874925.659430801) (total time: 524.5352ms):
Trace[1003567705]: [524.4168ms] [523.532ms] Transaction committed
I1022 03:13:20.628279 1 trace.go:76] Trace[662350473]: "Patch /api/v1/nodes/node4/status" (started: 2019-10-22 03:13:20.1031699 +0000 UTC m=+8874925.659314701) (total time: 525.0734ms):
Trace[662350473]: [524.7447ms] [524.0183ms] Object stored in database
E1022 03:13:24.318010 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W1022 03:13:24.318304 1 reflector.go:272] k8s.io/client-go/informers/factory.go:130: watch of *v1.Namespace ended with: The resourceVersion for the provided watch is too old.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Thanks,
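For context, the "required revision has been compacted" errors above are a normal part of the Kubernetes list/watch pattern: when a watcher asks for a revision that etcd has already compacted, the client is expected to re-list to obtain the current revision and restart the watch from there. A rough, hypothetical sketch of that recovery loop (not the actual client-go code — `FakeStore` is a stand-in for illustration):

```python
class CompactedError(Exception):
    """Raised when a watch starts from a revision etcd has already compacted."""

def watch_with_recovery(store, start_rev, max_restarts=3):
    """Watch from start_rev; on a compaction error, re-list to learn the
    latest revision and restart the watch from it (the list/watch pattern)."""
    rev = start_rev
    restarts = 0
    while restarts <= max_restarts:
        try:
            return store.watch(rev)
        except CompactedError:
            # The requested revision is gone; re-list and resume from
            # the store's current revision instead.
            rev = store.list_latest_revision()
            restarts += 1
    raise RuntimeError("watch kept hitting compacted revisions")

class FakeStore:
    """Toy stand-in for etcd: revisions below `compacted_below` are gone."""
    def __init__(self, compacted_below):
        self.compacted_below = compacted_below
    def watch(self, rev):
        if rev < self.compacted_below:
            raise CompactedError(rev)
        return f"watching from revision {rev}"
    def list_latest_revision(self):
        return self.compacted_below

store = FakeStore(compacted_below=100)
print(watch_with_recovery(store, start_rev=5))
# → watching from revision 100
```

In other words, these log lines by themselves indicate watchers resynchronizing after compaction, not data loss.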
I am digging into this again now.
Another customer (case # https://gss--c.na94.visual.force.com/apex/Case_View?id=5002K00000iPAkBQAW&sfdc.override=1) has implemented the silence with the specified labels, but they are still seeing the alert. Any thoughts on why the workaround is not helping? Thanks, Anand
Any thoughts on why the workaround is not helping? I have also updated the silence that the customer created. Thanks, Anand
Created attachment 1647632 [details] silence.png
Created attachment 1649319 [details] silence-example

> Any thoughts on why the workaround is not helping? I have also updated the silence that the customer has created.

I believe the issue is with your silence format. Could you try the matcher below, which is outlined in the attachment?

`alertname=EtcdHighNumberOfFailedGRPCRequests`

Also, as of 3.11.141[1],[2] this alert has been removed until we resolve the issue upstream, so customers also have the option of upgrading.

[1]: https://docs.openshift.com/container-platform/3.11/release_notes/ocp_3_11_release_notes.html#ocp-3-11-141
[2]: https://access.redhat.com/errata/RHBA-2019:2580
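To illustrate why the silence format matters: Alertmanager suppresses an alert only if every matcher in the silence matches the corresponding label on the alert, so a mistyped label name (or a matcher on a label the alert does not carry) silences nothing. A simplified sketch of that matching logic (real Alertmanager also handles regex matchers and silence time windows; the label values are taken from this bug):

```python
import re

def silence_matches(matchers, alert_labels):
    """Return True if every matcher matches the alert's labels
    (simplified version of Alertmanager's silence semantics)."""
    for m in matchers:
        value = alert_labels.get(m["name"], "")
        if m.get("isRegex"):
            if not re.fullmatch(m["value"], value):
                return False
        elif value != m["value"]:
            return False
    return True

alert = {
    "alertname": "EtcdHighNumberOfFailedGRPCRequests",
    "grpc_method": "Watch",
    "grpc_service": "etcdserverpb.Watch",
    "instance": "etcd-0:2379",  # hypothetical instance label
}

# A silence keyed on the exact alertname matches the alert:
ok = silence_matches(
    [{"name": "alertname", "value": "EtcdHighNumberOfFailedGRPCRequests"}], alert)
# A silence with a mistyped label name matches nothing:
bad = silence_matches(
    [{"name": "alert_name", "value": "EtcdHighNumberOfFailedGRPCRequests"}], alert)
print(ok, bad)
# → True False
```

Checking the silence's matcher names character by character against the firing alert's labels in the Alertmanager UI is usually the quickest way to debug a silence that is not taking effect.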
This bug hasn't had any engineering activity in the last ~30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale". If you have further information on the current state of the bug, please update it and remove the "LifecycleStale" keyword, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.
This bug is actively worked on.
https://github.com/etcd-io/etcd/pull/11375 is the upstream fix. Then we need a bump to the openshift etcd fork.
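As I understand the upstream change (this is a sketch of the intent, not a translation of the Go patch): client-initiated Watch stream terminations were being recorded with a failure code in `grpc_server_handled_total`, so long-lived watches that clients eventually hang up on showed up as 100% failed. The fix records such terminations as normal completions so they stop inflating the failure ratio the alert computes:

```python
from collections import Counter

def classify_watch_termination(err, cancelled_by_client):
    """Map a Watch stream termination to the gRPC code recorded in the
    handled-RPC counter (hypothetical sketch of the fix's intent)."""
    if err is None:
        return "OK"
    if cancelled_by_client:
        # Old accounting: counted as a failure. Fixed accounting: a client
        # hanging up on its own watch is a normal termination, not an error.
        return "OK"
    return "Unavailable"

def failure_ratio(codes):
    """Percentage of handled RPCs whose code is not OK
    (roughly what the alert expression measures)."""
    counts = Counter(codes)
    total = sum(counts.values())
    return 100.0 * (total - counts["OK"]) / total

# Ten watches, all ended by clients hanging up:
before = failure_ratio(["Unavailable"] * 10)  # old accounting
after = failure_ratio(
    [classify_watch_termination(RuntimeError("client went away"), True)] * 10)
print(before, after)
# → 100.0 0.0
```

This mirrors the symptom in this bug exactly: a healthy cluster whose watch clients disconnect normally reports "100% of requests for Watch failed" under the old accounting and 0% under the fixed one.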
https://bugzilla.redhat.com/show_bug.cgi?id=1677689 tracks re-enabling the alert, which is only possible once this bug is resolved. So this bug now blocks 1677689.
This is close, but I don't think we can get it done in 4.6 at this point. Moving to 4.7.
Here's the latest attempt at a fix upstream: https://github.com/etcd-io/etcd/pull/12196
etcd 3.2, shipped as part of RHEL 7, is in CVE maintenance mode. This bug is fixed in OCP 4.9+. Since a workaround exists (disable the alert) and all attached customer cases are now closed, we are closing this bug as WONTFIX.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days