Bug 1677689

Summary: Etcd cluster "etcd": 100% of requests for Watch failed on etcd instance <ETCD>:2379. grpc_service="etcdserverpb.Watch"
Product: OpenShift Container Platform Reporter: Josh Foots <jfoots>
Component: Etcd    Assignee: Dan Mace <dmace>
Status: CLOSED WONTFIX QA Contact: ge liu <geliu>
Severity: high Docs Contact:
Priority: high    
Version: 3.11.0    CC: aos-bugs, apaladug, bhershbe, chris.liles, dahernan, dkulkarn, dmace, dyocum, erich, fkrohn, hgomes, ibazulic, jokerman, krohn, ksalunkh, lcosic, lukasz.czarnota, mfojtik, mirollin, mjahangi, mmccomas, rvokal, sauchter, sbatsche, simon, sreber, ssadhale, steven.barre, vjaypurk, wking, wlewis
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard: LifecycleFrozen
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1772446 (view as bug list) Environment:
Last Closed: 2021-09-15 18:33:38 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1701154, 1772446    
Attachments:
  EtcdHighNumberOfFailedGRPCRequests alerts are shown after enabling etcd monitoring (flags: none)
  silence.png (flags: none)
  silence-example (flags: none)

Description Josh Foots 2019-02-15 15:20:33 UTC
Description of problem:

The following alert is seen in the logs:
Etcd cluster "etcd": 100% of requests for Watch failed on etcd instance <ETCD>:2379.
grpc_service="etcdserverpb.Watch"


How reproducible:

Very

Steps to Reproduce:
1. Follow https://docs.openshift.com/container-platform/3.11/install_config/prometheus_cluster_monitoring.html#configuring-etcd-monitoring

Actual results:

The message is logged even though the etcd cluster is healthy and there are no issues.

Expected results:

The alert should only fire when there is an actual problem.

Additional info:

This issue was opened upstream at https://github.com/openshift/origin/issues/20311
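
For reference, this alert is driven by etcd's grpc_server_handled_total metric. One way to inspect the underlying failure ratio is to query Prometheus directly; the sketch below is illustrative only, and the Prometheus URL, token lookup, job label, and exact expression are assumptions that may differ from the rule shipped with cluster monitoring:

~~~~~~~~~~~~
# Sketch: approximate the Watch failure ratio the alert is based on.
# PROMETHEUS_URL, the token lookup, and the label selectors are assumptions;
# adjust them to your environment and to the actual alerting rule expression.
PROMETHEUS_URL="https://prometheus-k8s.openshift-monitoring.svc:9091"
TOKEN="$(oc sa get-token prometheus-k8s -n openshift-monitoring)"

curl -sk -G "${PROMETHEUS_URL}/api/v1/query" \
  -H "Authorization: Bearer ${TOKEN}" \
  --data-urlencode 'query=
    sum(rate(grpc_server_handled_total{job="etcd",grpc_service="etcdserverpb.Watch",grpc_code!="OK"}[5m]))
    /
    sum(rate(grpc_server_handled_total{job="etcd",grpc_service="etcdserverpb.Watch"}[5m]))'
~~~~~~~~~~~~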

Comment 1 Junqi Zhao 2019-02-20 07:40:29 UTC
Created attachment 1536598 [details]
EtcdHighNumberOfFailedGRPCRequests alerts are shown after enabling etcd monitoring

Comment 2 Frederic Branczyk 2019-02-20 09:05:01 UTC
This doesn't appear to be a problem with the monitoring components, but rather with etcd incorrectly incrementing the metric in this particular case. Following the origin issue, there is the linked upstream issue https://github.com/etcd-io/etcd/issues/10289, according to which the etcd team is already looking into this.

On the monitoring side, I would prefer to keep this alert so that all the other endpoints remain monitored. To avoid being notified about these false positives, I recommend silencing this particular alert with the label selectors grpc_method="Watch" and grpc_service="etcdserverpb.Watch", for example as sketched below. The silence can then be lifted once the etcd side is fixed.
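
A minimal sketch of such a silence, assuming amtool is available and pointed at the cluster's Alertmanager (the URL, author, and duration are placeholders):

~~~~~~~~~~~~
# Sketch: silence only the Watch-related false positives while keeping the
# alert active for every other gRPC method/service. URL, author, and duration
# are placeholders; adjust them to your environment.
amtool silence add \
  --alertmanager.url="https://alertmanager-main-openshift-monitoring.example.com" \
  --author="admin" \
  --comment="False positive, see https://github.com/etcd-io/etcd/issues/10289" \
  --duration="720h" \
  grpc_method="Watch" grpc_service="etcdserverpb.Watch"
~~~~~~~~~~~~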

Moving this to the masters component, which etcd is part of.

Comment 3 Sam Batschelet 2019-03-05 13:00:33 UTC
Yes, this has been in limbo for a while. I will take over the etcd issue and its resolution.

Comment 12 Ivan Bazulic 2019-08-30 11:17:42 UTC
Hi Sam,

Have there been any updates on this particular issue?

Thanks!

Comment 13 kedar 2019-09-17 10:36:00 UTC
Hello,

Any updates so far on this issue?


Thanks.

Comment 15 kedar 2019-10-22 06:24:09 UTC
Hello,

The following are the new alerts being received on the api-server.

The api-server on master1 prints the following error logs:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I1022 03:12:01.034785       1 trace.go:76] Trace[5187212]: "Patch /api/v1/nodes/infra1/status" (started: 2019-10-22 03:11:59.4951043 +0000 UTC m=+8874845.051249201) (total time: 1.5396641s):
Trace[5187212]: [1.5394755s] [1.5386814s] Object stored in database
E1022 03:12:03.733831       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
E1022 03:12:08.403631       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
I1022 03:12:17.040558       1 trace.go:76] Trace[67662901]: "List /api/v1/nodes" (started: 2019-10-22 03:12:15.1894578 +0000 UTC m=+8874860.745602601) (total time: 1.8510764s):
Trace[67662901]: [1.8488305s] [1.8488275s] Listing from storage done
I1022 03:12:17.070722       1 trace.go:76] Trace[232382444]: "List /api/v1/nodes" (started: 2019-10-22 03:12:15.2934429 +0000 UTC m=+8874860.849587701) (total time: 1.7772594s):
Trace[232382444]: [1.7763236s] [1.7763211s] Listing from storage done
E1022 03:12:38.282960       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
E1022 03:12:47.607365       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W1022 03:12:47.607644       1 reflector.go:272] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:129: watch of *core.Secret ended with: The resourceVersion for the provided watch is too old.
I1022 03:12:58.367110       1 trace.go:76] Trace[1242358499]: "GuaranteedUpdate etcd3: *core.Node" (started: 2019-10-22 03:12:51.6703164 +0000 UTC m=+8874897.226461301) (total time: 6.6967659s):
Trace[1242358499]: [6.6967091s] [6.6956035s] Transaction committed
I1022 03:12:58.367313       1 trace.go:76] Trace[370739828]: "Patch /api/v1/nodes/infra1/status" (started: 2019-10-22 03:12:51.6701923 +0000 UTC m=+8874897.226337101) (total time: 6.6971042s):
Trace[370739828]: [6.696939s] [6.6960374s] Object stored in database
E1022 03:12:58.440852       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
I1022 03:13:00.057143       1 trace.go:76] Trace[748664276]: "GuaranteedUpdate etcd3: *core.Node" (started: 2019-10-22 03:12:57.5711421 +0000 UTC m=+8874903.127286901) (total time: 2.4859358s):
Trace[748664276]: [2.4857825s] [2.4846646s] Transaction committed
I1022 03:13:00.057486       1 trace.go:76] Trace[1507531935]: "Patch /api/v1/nodes/node4/status" (started: 2019-10-22 03:12:57.571003 +0000 UTC m=+8874903.127147801) (total time: 2.4864376s):
Trace[1507531935]: [2.4861956s] [2.4852747s] Object stored in database
I1022 03:13:00.569616       1 trace.go:76] Trace[1525894554]: "List /api/v1/nodes" (started: 2019-10-22 03:12:57.1007057 +0000 UTC m=+8874902.656850501) (total time: 3.4688708s):
Trace[1525894554]: [3.4679513s] [3.4679476s] Listing from storage done
E1022 03:13:00.949689       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W1022 03:13:00.949991       1 reflector.go:272] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:129: watch of *scheduling.PriorityClass ended with: The resourceVersion for the provided watch is too old.
E1022 03:13:17.647389       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W1022 03:13:17.647721       1 reflector.go:272] k8s.io/client-go/informers/factory.go:130: watch of *v1.Role ended with: The resourceVersion for the provided watch is too old.
E1022 03:13:18.416142       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
E1022 03:13:19.499862       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
I1022 03:13:20.627860       1 trace.go:76] Trace[1003567705]: "GuaranteedUpdate etcd3: *core.Node" (started: 2019-10-22 03:13:20.1032859 +0000 UTC m=+8874925.659430801) (total time: 524.5352ms):
Trace[1003567705]: [524.4168ms] [523.532ms] Transaction committed
I1022 03:13:20.628279       1 trace.go:76] Trace[662350473]: "Patch /api/v1/nodes/node4/status" (started: 2019-10-22 03:13:20.1031699 +0000 UTC m=+8874925.659314701) (total time: 525.0734ms):
Trace[662350473]: [524.7447ms] [524.0183ms] Object stored in database
E1022 03:13:24.318010       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W1022 03:13:24.318304       1 reflector.go:272] k8s.io/client-go/informers/factory.go:130: watch of *v1.Namespace ended with: The resourceVersion for the provided watch is too old.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Thanks,

Comment 16 kedar 2019-11-06 05:19:55 UTC
Hello,

Any updates so far on this issue?


Thanks.

Comment 17 Sam Batschelet 2019-11-12 15:45:13 UTC
I am digging into this again now.

Comment 19 Anand Paladugu 2019-12-02 19:41:01 UTC
Another customer (case # https://gss--c.na94.visual.force.com/apex/Case_View?id=5002K00000iPAkBQAW&sfdc.override=1) has implemented the silence with the specified labels, but he is still seeing the alert.

Any thought on why the workaround is not helping?

Thanks

Anand

Comment 21 Anand Paladugu 2019-12-25 13:07:22 UTC

Any thought on why the workaround is not helping?  I have also updated the silence that the customer has created.

Thanks

Anand

Comment 22 Anand Paladugu 2019-12-25 13:08:23 UTC
Created attachment 1647632 [details]
silence.png

Comment 24 Sam Batschelet 2020-01-02 21:56:55 UTC
Created attachment 1649319 [details]
silence-example

> Any thought on why the workaround is not helping?  I have also updated the silence that the customer has created.

I believe the issue is with your silence format. Could you try the matcher below, which is outlined in the attachment?

`alertname=EtcdHighNumberOfFailedGRPCRequests`

Also, as of 3.11.141 [1][2], this alert has been removed until we resolve the issue upstream, so customers have the option of upgrading as well.

[1]: https://docs.openshift.com/container-platform/3.11/release_notes/ocp_3_11_release_notes.html#ocp-3-11-141
[2]: https://access.redhat.com/errata/RHBA-2019:2580
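
For completeness, a rough sketch of creating that silence directly against the Alertmanager API; the route hostname, API version, and timestamps are assumptions (older Alertmanager releases expose /api/v1/silences rather than /api/v2/silences):

~~~~~~~~~~~~
# Sketch: create the silence via the Alertmanager API using only the alertname
# matcher. Hostname, API version, and timestamps are placeholders.
curl -sk -X POST "https://alertmanager-main-openshift-monitoring.example.com/api/v2/silences" \
  -H "Content-Type: application/json" \
  -d '{
        "matchers": [
          {"name": "alertname", "value": "EtcdHighNumberOfFailedGRPCRequests", "isRegex": false}
        ],
        "startsAt": "2020-01-02T00:00:00Z",
        "endsAt": "2020-02-01T00:00:00Z",
        "createdBy": "admin",
        "comment": "Silence EtcdHighNumberOfFailedGRPCRequests until the upstream etcd fix lands"
      }'
~~~~~~~~~~~~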

Comment 25 Michal Fojtik 2020-05-19 13:17:50 UTC
This bug hasn't had any engineering activity in the last ~30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

As such, we're marking this bug as "LifecycleStale".

If you have further information on the current state of the bug, please update it and remove the "LifecycleStale" keyword, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 26 Michal Fojtik 2020-05-20 10:46:05 UTC
This bug is actively worked on.

Comment 27 Dan Mace 2020-06-09 15:28:41 UTC
https://github.com/etcd-io/etcd/pull/11375 is the upstream fix. We then need to bump the openshift etcd fork to pick it up.

Comment 28 Dan Mace 2020-06-09 15:31:29 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1677689 tracks re-enabling the alert, which is only possible once this bug is resolved. So this bug now blocks 1677689.

Comment 31 Dan Mace 2020-08-18 13:56:59 UTC
This is close, but I don't think we can get it done in 4.6 at this point. Moving to 4.7.

Comment 32 Dan Mace 2020-10-01 17:11:33 UTC
Here's the latest attempt at a fix upstream: https://github.com/etcd-io/etcd/pull/12196

Comment 48 Sam Batschelet 2021-09-15 18:33:38 UTC
etcd 3.2, as shipped with RHEL 7, is in CVE maintenance mode. This bug is now fixed in OCP 4.9+. Since a workaround exists (disable the alert) and all attached customer cases are now closed, we are closing this bug as WONTFIX.

Comment 49 Red Hat Bugzilla 2024-01-06 04:26:03 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days