Description of problem:
The following alert is seen in the logs even though the etcd cluster is healthy:

  Etcd cluster "etcd": 100% of requests for Watch failed on etcd instance <ETCD>:2379. grpc_service="etcdserverpb.Watch"

How reproducible:
Very

Steps to Reproduce:
1. Follow https://docs.openshift.com/container-platform/3.11/install_config/prometheus_cluster_monitoring.html#configuring-etcd-monitoring

Actual results:
The alert fires even though the etcd cluster is healthy and there are no issues.

Expected results:
The alert should only fire when there is an actual problem.

Additional info:
This issue was opened upstream at https://github.com/openshift/origin/issues/20311
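For reference, the alert comes from the etcd alerting rules shipped with cluster monitoring; the rule looks roughly like the sketch below (paraphrased from the upstream etcd rules — the exact expression, threshold, and duration vary by release):

```yaml
# Sketch of the EtcdHighNumberOfFailedGRPCRequests rule (paraphrased, not
# the literal shipped rule). Any gRPC method whose non-OK handled-request
# rate exceeds the threshold fires the alert per (grpc_service, grpc_method).
- alert: EtcdHighNumberOfFailedGRPCRequests
  expr: |
    100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK"}[5m]))
            BY (grpc_service, grpc_method)
      /   sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m]))
            BY (grpc_service, grpc_method)
      > 5
  for: 10m
  labels:
    severity: critical
```

Because the expression only looks at `grpc_code`, any stream that etcd records with a non-OK code counts as "failed", which is what makes Watch show up at 100% here.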
Created attachment 1536598 [details] EtcdHighNumberOfFailedGRPCRequests alerts are shown after enabling etcd monitoring
This doesn't appear to be a problem with the monitoring stack itself; rather, etcd is incorrectly incrementing the metric in this particular case. From the origin issue, there is the linked issue https://github.com/etcd-io/etcd/issues/10289, according to which the etcd team is already looking into this. On the monitoring side, I would prefer to keep this alert in place for all the other endpoints. To avoid being notified about these false positives, I recommend silencing this particular alert with the label selectors grpc_method="Watch" and grpc_service="etcdserverpb.Watch". The silence can then be lifted once the etcd side is fixed. Moving this to the Master component, which etcd is part of.
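For anyone who wants to script the workaround rather than click through the Alertmanager UI, here is a minimal sketch of building such a silence payload for the Alertmanager API (the duration, `createdBy`, and Alertmanager URL are assumptions — adjust for your environment):

```python
import json
from datetime import datetime, timedelta, timezone

def build_silence(created_by, comment, hours=24):
    """Build an Alertmanager silence payload matching only the
    spurious etcd Watch failure alerts (sketch, not official tooling)."""
    now = datetime.now(timezone.utc)
    return {
        "matchers": [
            # Exact-match (non-regex) matchers for the false-positive alert's labels.
            {"name": "alertname", "value": "EtcdHighNumberOfFailedGRPCRequests", "isRegex": False},
            {"name": "grpc_method", "value": "Watch", "isRegex": False},
            {"name": "grpc_service", "value": "etcdserverpb.Watch", "isRegex": False},
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }

payload = build_silence("admin", "Silence spurious etcd Watch failures (etcd-io/etcd#10289)")
print(json.dumps(payload, indent=2))
# POST this JSON to https://<alertmanager>/api/v2/silences to create the silence.
```

The silence matches only alerts carrying all three labels, so the alert stays active for every other etcd gRPC method.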
Yes, this has been in limbo for a while. I will take over the etcd issue and its resolution.
Hi Sam, have there been any updates on this particular issue? Thanks!
Hello, any updates so far on this issue? Thanks.
Hello,

The following new alerts are being received for the api-server. The api-server on master1 prints these error logs:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I1022 03:12:01.034785 1 trace.go:76] Trace[5187212]: "Patch /api/v1/nodes/infra1/status" (started: 2019-10-22 03:11:59.4951043 +0000 UTC m=+8874845.051249201) (total time: 1.5396641s):
Trace[5187212]: [1.5394755s] [1.5386814s] Object stored in database
E1022 03:12:03.733831 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
E1022 03:12:08.403631 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
I1022 03:12:17.040558 1 trace.go:76] Trace[67662901]: "List /api/v1/nodes" (started: 2019-10-22 03:12:15.1894578 +0000 UTC m=+8874860.745602601) (total time: 1.8510764s):
Trace[67662901]: [1.8488305s] [1.8488275s] Listing from storage done
I1022 03:12:17.070722 1 trace.go:76] Trace[232382444]: "List /api/v1/nodes" (started: 2019-10-22 03:12:15.2934429 +0000 UTC m=+8874860.849587701) (total time: 1.7772594s):
Trace[232382444]: [1.7763236s] [1.7763211s] Listing from storage done
E1022 03:12:38.282960 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
E1022 03:12:47.607365 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W1022 03:12:47.607644 1 reflector.go:272] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:129: watch of *core.Secret ended with: The resourceVersion for the provided watch is too old.
I1022 03:12:58.367110 1 trace.go:76] Trace[1242358499]: "GuaranteedUpdate etcd3: *core.Node" (started: 2019-10-22 03:12:51.6703164 +0000 UTC m=+8874897.226461301) (total time: 6.6967659s):
Trace[1242358499]: [6.6967091s] [6.6956035s] Transaction committed
I1022 03:12:58.367313 1 trace.go:76] Trace[370739828]: "Patch /api/v1/nodes/infra1/status" (started: 2019-10-22 03:12:51.6701923 +0000 UTC m=+8874897.226337101) (total time: 6.6971042s):
Trace[370739828]: [6.696939s] [6.6960374s] Object stored in database
E1022 03:12:58.440852 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
I1022 03:13:00.057143 1 trace.go:76] Trace[748664276]: "GuaranteedUpdate etcd3: *core.Node" (started: 2019-10-22 03:12:57.5711421 +0000 UTC m=+8874903.127286901) (total time: 2.4859358s):
Trace[748664276]: [2.4857825s] [2.4846646s] Transaction committed
I1022 03:13:00.057486 1 trace.go:76] Trace[1507531935]: "Patch /api/v1/nodes/node4/status" (started: 2019-10-22 03:12:57.571003 +0000 UTC m=+8874903.127147801) (total time: 2.4864376s):
Trace[1507531935]: [2.4861956s] [2.4852747s] Object stored in database
I1022 03:13:00.569616 1 trace.go:76] Trace[1525894554]: "List /api/v1/nodes" (started: 2019-10-22 03:12:57.1007057 +0000 UTC m=+8874902.656850501) (total time: 3.4688708s):
Trace[1525894554]: [3.4679513s] [3.4679476s] Listing from storage done
E1022 03:13:00.949689 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W1022 03:13:00.949991 1 reflector.go:272] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:129: watch of *scheduling.PriorityClass ended with: The resourceVersion for the provided watch is too old.
E1022 03:13:17.647389 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W1022 03:13:17.647721 1 reflector.go:272] k8s.io/client-go/informers/factory.go:130: watch of *v1.Role ended with: The resourceVersion for the provided watch is too old.
E1022 03:13:18.416142 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
E1022 03:13:19.499862 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
I1022 03:13:20.627860 1 trace.go:76] Trace[1003567705]: "GuaranteedUpdate etcd3: *core.Node" (started: 2019-10-22 03:13:20.1032859 +0000 UTC m=+8874925.659430801) (total time: 524.5352ms):
Trace[1003567705]: [524.4168ms] [523.532ms] Transaction committed
I1022 03:13:20.628279 1 trace.go:76] Trace[662350473]: "Patch /api/v1/nodes/node4/status" (started: 2019-10-22 03:13:20.1031699 +0000 UTC m=+8874925.659314701) (total time: 525.0734ms):
Trace[662350473]: [524.7447ms] [524.0183ms] Object stored in database
E1022 03:13:24.318010 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W1022 03:13:24.318304 1 reflector.go:272] k8s.io/client-go/informers/factory.go:130: watch of *v1.Namespace ended with: The resourceVersion for the provided watch is too old.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Thanks,
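For context, the "required revision has been compacted" errors above are a normal part of the Kubernetes list/watch pattern: when a watcher asks for a revision that etcd has already compacted, the client is expected to re-list to obtain the current revision and restart the watch from there. A rough, hypothetical sketch of that recovery loop (not the actual client-go code — `FakeStore` is a stand-in for illustration):

```python
class CompactedError(Exception):
    """Raised when a watch starts from a revision etcd has already compacted."""

def watch_with_recovery(store, start_rev, max_restarts=3):
    """Watch from start_rev; on a compaction error, re-list to learn the
    latest revision and restart the watch from it (the list/watch pattern)."""
    rev = start_rev
    restarts = 0
    while restarts <= max_restarts:
        try:
            return store.watch(rev)
        except CompactedError:
            # The requested revision is gone; re-list and resume from
            # the store's current revision instead.
            rev = store.list_latest_revision()
            restarts += 1
    raise RuntimeError("watch kept hitting compacted revisions")

class FakeStore:
    """Toy stand-in for etcd: revisions below `compacted_below` are gone."""
    def __init__(self, compacted_below):
        self.compacted_below = compacted_below
    def watch(self, rev):
        if rev < self.compacted_below:
            raise CompactedError(rev)
        return f"watching from revision {rev}"
    def list_latest_revision(self):
        return self.compacted_below

store = FakeStore(compacted_below=100)
print(watch_with_recovery(store, start_rev=5))
# → watching from revision 100
```

In other words, these log lines by themselves indicate watchers resynchronizing after compaction, not data loss.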
I am digging into this again now.
Another customer (case # https://gss--c.na94.visual.force.com/apex/Case_View?id=5002K00000iPAkBQAW&sfdc.override=1) has implemented the silence with the specified labels, but they are still seeing the alert. Any thoughts on why the workaround is not helping? Thanks, Anand
Any thoughts on why the workaround is not helping? I have also updated the silence that the customer created. Thanks, Anand
Created attachment 1647632 [details] silence.png
Created attachment 1649319 [details] silence-example

> Any thoughts on why the workaround is not helping? I have also updated the silence that the customer has created.

I believe the issue is with your silence format. Could you try the matcher below, which is outlined in the attachment?

`alertname=EtcdHighNumberOfFailedGRPCRequests`

Also, as of 3.11.141[1],[2] this alert has been removed until we resolve the issue upstream, so customers also have the option of upgrading.

[1]: https://docs.openshift.com/container-platform/3.11/release_notes/ocp_3_11_release_notes.html#ocp-3-11-141
[2]: https://access.redhat.com/errata/RHBA-2019:2580
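To illustrate why the silence format matters: Alertmanager suppresses an alert only if every matcher in the silence matches the corresponding label on the alert, so a mistyped label name (or a matcher on a label the alert does not carry) silences nothing. A simplified sketch of that matching logic (real Alertmanager also handles regex matchers and silence time windows; the label values are taken from this bug):

```python
import re

def silence_matches(matchers, alert_labels):
    """Return True if every matcher matches the alert's labels
    (simplified version of Alertmanager's silence semantics)."""
    for m in matchers:
        value = alert_labels.get(m["name"], "")
        if m.get("isRegex"):
            if not re.fullmatch(m["value"], value):
                return False
        elif value != m["value"]:
            return False
    return True

alert = {
    "alertname": "EtcdHighNumberOfFailedGRPCRequests",
    "grpc_method": "Watch",
    "grpc_service": "etcdserverpb.Watch",
    "instance": "etcd-0:2379",  # hypothetical instance label
}

# A silence keyed on the exact alertname matches the alert:
ok = silence_matches(
    [{"name": "alertname", "value": "EtcdHighNumberOfFailedGRPCRequests"}], alert)
# A silence with a mistyped label name matches nothing:
bad = silence_matches(
    [{"name": "alert_name", "value": "EtcdHighNumberOfFailedGRPCRequests"}], alert)
print(ok, bad)
# → True False
```

Checking the silence's matcher names character by character against the firing alert's labels in the Alertmanager UI is usually the quickest way to debug a silence that is not taking effect.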
This bug hasn't had any engineering activity in the last ~30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale". If you have further information on the current state of the bug, please update it and remove the "LifecycleStale" keyword, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.
This bug is actively worked on.
https://github.com/etcd-io/etcd/pull/11375 is the upstream fix. Then we need a bump to the openshift etcd fork.
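As I understand the upstream change (this is a sketch of the intent, not a translation of the Go patch): client-initiated Watch stream terminations were being recorded with a failure code in `grpc_server_handled_total`, so long-lived watches that clients eventually hang up on showed up as 100% failed. The fix records such terminations as normal completions so they stop inflating the failure ratio the alert computes:

```python
from collections import Counter

def classify_watch_termination(err, cancelled_by_client):
    """Map a Watch stream termination to the gRPC code recorded in the
    handled-RPC counter (hypothetical sketch of the fix's intent)."""
    if err is None:
        return "OK"
    if cancelled_by_client:
        # Old accounting: counted as a failure. Fixed accounting: a client
        # hanging up on its own watch is a normal termination, not an error.
        return "OK"
    return "Unavailable"

def failure_ratio(codes):
    """Percentage of handled RPCs whose code is not OK
    (roughly what the alert expression measures)."""
    counts = Counter(codes)
    total = sum(counts.values())
    return 100.0 * (total - counts["OK"]) / total

# Ten watches, all ended by clients hanging up:
before = failure_ratio(["Unavailable"] * 10)  # old accounting
after = failure_ratio(
    [classify_watch_termination(RuntimeError("client went away"), True)] * 10)
print(before, after)
# → 100.0 0.0
```

This mirrors the symptom in this bug exactly: a healthy cluster whose watch clients disconnect normally reports "100% of requests for Watch failed" under the old accounting and 0% under the fixed one.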
https://bugzilla.redhat.com/show_bug.cgi?id=1677689 tracks re-enabling the alert, which is only possible once this bug is resolved. So this bug now blocks 1677689.
This is close, but I don't think we can get it done in 4.6 at this point. Moving to 4.7.
Here's the latest attempt at a fix upstream: https://github.com/etcd-io/etcd/pull/12196
etcd 3.2, shipped as part of RHEL 7, is in CVE maintenance mode. This bug is fixed in OCP 4.9+. Since a workaround exists (disable the alert) and all attached customer cases are now closed, we are closing this bug as WONTFIX.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days