Description of problem:
When etcd goes through a leader election, by design all clients should fail fast and retry. The reason is that during an election technically no leader exists, so etcd cannot serve linearized transactions; instead of timing out, clients should fail fast and retry. We need a full understanding of the effects of a leader election on the apiserver and other clients.

Questions:
1.) Does the leader election itself result in additional I/O latency? Do we replay the WAL file, for example?
2.) Since we just told all clients to retry, does that wave of retries disrupt etcd's ability to respond in a timely manner?
3.) Do all clients actually fail and then retry? If a client chooses to just time out, it should be explored why, and whether that can be changed.
4.) How much time does a leader election take, on average, before we can serve requests again?

For example, in the log below, why do we time out a 10s context? Why didn't the client just fail and retry when it saw the "leader changed" error? (A sketch of the expected client-side handling is included under Additional info below.)

```
raft2020/08/17 13:33:53 INFO: 468accc2d74de166 is starting a new election at term 6
raft2020/08/17 13:33:53 INFO: 468accc2d74de166 became candidate at term 7
raft2020/08/17 13:33:53 INFO: 468accc2d74de166 received MsgVoteResp from 468accc2d74de166 at term 7
raft2020/08/17 13:33:53 INFO: 468accc2d74de166 [logterm: 6, index: 180670] sent MsgVote request to 2dc33b3f4b2746bd at term 7
raft2020/08/17 13:33:53 INFO: 468accc2d74de166 [logterm: 6, index: 180670] sent MsgVote request to 88cdfdfb2b77134c at term 7
raft2020/08/17 13:33:53 INFO: raft.node: 468accc2d74de166 lost leader 88cdfdfb2b77134c at term 7
raft2020/08/17 13:33:53 INFO: 468accc2d74de166 received MsgVoteResp from 2dc33b3f4b2746bd at term 7
raft2020/08/17 13:33:53 INFO: 468accc2d74de166 has received 2 MsgVoteResp votes and 0 vote rejections
raft2020/08/17 13:33:53 INFO: 468accc2d74de166 became leader at term 7
raft2020/08/17 13:33:53 INFO: raft.node: 468accc2d74de166 elected leader 468accc2d74de166 at term 7
2020-08-17 13:33:53.899755 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-kube-controller-manager-operator/csr-signer-ca\" " with result "error:etcdserver: leader changed" took too long (2.370206078s) to execute
2020-08-17 13:33:56.996289 W | etcdserver: read-only range request "key:\"/openshift.io/oauth/accesstokens/missing\" " with result "error:context canceled" took too long (5.00077352s) to execute
WARNING: 2020/08/17 13:33:56 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-17 13:33:57.135293 I | etcdserver/api/etcdhttp: /health OK (status code 200)
2020-08-17 13:33:57.302295 W | etcdserver: read-only range request "key:\"/kubernetes.io/crd-publish-openapi-test-common-group.example.com/e2e-test-crd-publish-openapi-6547-crds/\" range_end:\"/kubernetes.io/crd-publish-openapi-test-common-group.example.com/e2e-test-crd-publish-openapi-6547-crds0\" " with result "error:context canceled" took too long (1.058356927s) to execute
WARNING: 2020/08/17 13:33:57 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-17 13:34:01.992695 W | etcdserver: read-only range request "key:\"/kubernetes.io/namespaces/kube-system\" " with result "error:context canceled" took too long (4.996294912s) to execute
WARNING: 2020/08/17 13:34:01 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-17 13:34:01.994451 W | etcdserver: read-only range request "key:\"/openshift.io/imagestreams/openshift-apiserver/missing\" " with result "error:context canceled" took too long (4.998089735s) to execute
WARNING: 2020/08/17 13:34:01 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-17 13:34:02.136726 I | etcdserver/api/etcdhttp: /health OK (status code 200)
2020-08-17 13:34:02.624425 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-kube-scheduler/kube-scheduler\" " with result "error:context canceled" took too long (9.999407116s) to execute
WARNING: 2020/08/17 13:34:02 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-17 13:34:02.996870 W | etcdserver: read-only range request "key:\"/openshift.io/oauth/accesstokens/missing\" " with result "error:context canceled" took too long (4.998579146s) to execute
WARNING: 2020/08/17 13:34:02 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-17 13:34:03.069760 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/kube-system/kube-controller-manager\" " with result "error:context canceled" took too long (9.986944846s) to execute
WARNING: 2020/08/17 13:34:03 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-17 13:34:03.220114 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-kube-scheduler/kube-scheduler\" " with result "error:context canceled" took too long (9.997289849s) to execute
WARNING: 2020/08/17 13:34:03 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-17 13:34:03.347028 W | etcdserver: read-only range request "key:\"/openshift.io/health\" " with result "error:context deadline exceeded" took too long (9.999999885s) to execute
2020-08-17 13:34:03.568258 W | etcdserver: read-only range request "key:\"/openshift.io/health\" " with result "error:context canceled" took too long (9.99991721s) to execute
WARNING: 2020/08/17 13:34:03 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2020/08/17 13:34:03 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-17 13:34:03.899867 W | etcdserver: timed out waiting for read index response (local node might have slow network)
```

ref: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1295334717188476928

Version-Release number of selected component (if applicable):

How reproducible:
Most leader elections.

Steps to Reproduce:
1. Induce a leader election.
2. Observe the reaction of clients.
3.

Actual results:
It appears that some clients are timing out instead of failing and retrying. Is that true, and why?

Expected results:
When a leader election happens, clients should retry and not time out. etcd should minimize the disruption of a leader election if at all possible.

Additional info:
Hi Sam, I changed the expected release to 4.6; please correct me if you have any concerns. Thanks.
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
Aniket - I read through the case attached to this bug and I have concerns about what is generating the high load across multiple nodes and ultimately resulting in the leader election. Do you have a must-gather from the time when the issue occurs? A prom dump would be helpful as well. Our current plan is to close this BZ in favor of a Jira ticket for an upcoming OCP release to improve the system behavior during a leader election. However, I want to make sure we get this customer issue on the right path before doing so. Thanks, Wally
The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.
@Wally We do have a must-gather, but it contains GBs of data. Can you specify which logs you want so I can request them from the customer or provide them from the must-gather? Thank you.
Created https://issues.redhat.com/browse/ETCD-193 to track the investigative/engineering work associated with the problem described at the top of this BZ.
@kedar Can you provide the etcd logs from all masters, as well as the etcd-operator logs? In addition, logs from the kube-apiserver (and its operator) would be useful to see how the client is responding.
In the SOS report, etcd isn't showing 100% CPU at all. At 19.7% it looks normal. Here is the `ps` output sorted by CPU usage:
```
USER       PID    %CPU %MEM VSZ     RSS     TTY   STAT START TIME    COMMAND
1001       7324   3.2  7.5  4569364 2477300 ?     Ssl  Apr20 101:05  /bin/olm --namespace openshift-operator-lifecycle-manager --writeStatusName operator-lifecycle-manager --writePackageServerStatusName operator-lifecycle-manager-packageserver --tls-cert /var/run/secrets/serving-cert/tls.crt --tls-key /var/run/secrets/serving-cert/tls.key
root       21725  6.9  7.3  3243080 2428492 ?     Ssl  Apr20 218:55  kube-controller-manager --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig --authentication-kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig --authorization-kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig --client-ca-file=/etc/kubernetes/static-pod-certs/configmaps/client-ca/ca-bundle.crt --requestheader-client-ca-file=/etc/kubernetes/static-pod-certs/configmaps/aggregator-client-ca/ca-bundle.crt -v=2 --tls-cert-file=/etc/kubernetes/static-pod-resources/secrets/serving-cert/tls.crt --tls-private-key-file=/etc/kubernetes/static-pod-resources/secrets/serving-cert/tls.key
root       9361   11.0 8.5  5206088 2800432 ?     Ssl  Apr20 346:28  openshift-apiserver start --config=/var/run/configmaps/config/config.yaml -v=2
root       1801   13.8 0.8  4579800 267340  ?     Ssl  Apr20 433:59  kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/master,node.openshift.io/os_id=rhcos --minimum-container-ttl-duration=6m0s --cloud-provider=vsphere --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --cloud-config=/etc/kubernetes/cloud.conf --register-with-taints=node-role.kubernetes.io/master=:NoSchedule --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9aa3725668126a3b00f9c7a9e635c45d86445ed63c1a524a243ab1e9e08f2708 --v=4
root       4958   19.7 9.6  4341112 3167128 ?     Ssl  Apr20 619:01  etcd --initial-advertise-peer-urls=https://10.70.49.8:2380 --cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving/etcd-serving-lxmaster01.crt --key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving/etcd-serving-lxmaster01.key --trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt --client-cert-auth=true --peer-cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-lxmaster01.crt --peer-key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-lxmaster01.key --peer-trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-peer-client-ca/ca-bundle.crt --peer-client-cert-auth=true --advertise-client-urls=https://10.70.49.8:2379 --listen-client-urls=https://0.0.0.0:2379 --listen-peer-urls=https://0.0.0.0:2380 --listen-metrics-urls=https://0.0.0.0:9978
root       4603   37.3 18.3 7613928 6032032 ?     Sl   Apr20 1170:22 kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --advertise-address=10.70.49.8 -v=2
root       12454  40.8 1.0  2081600 347456  ?     Ssl  Apr20 1279:43 /usr/bin/ruby /usr/local/bin/fluentd --suppress-config-dump --no-supervisor -r /usr/local/share/gems/gems/fluent-plugin-elasticsearch-4.1.1/lib/fluent/plugin/elasticsearch_simple_sniffer.rb
root       3197547 56.6 0.6 1126164 218736  pts/0 Sl+  18:20 1:09    /usr/libexec/platform-python -s /usr/sbin/sosreport
root       3208281 61.3 0.2 1604040 70504   pts/0 Sl   18:22 0:03    podman images
root       3205579 69.7 0.6 612392  204124  pts/0 R+   18:21 0:39    journalctl --no-pager --catalog --boot
```
Since the original description was a request for enhancement to minimize disruptions due to etcd leader elections, I am closing this BZ and opening an RFE to plan the enhancement for future releases. The attached customer cases are not directly related to the problem described here, but I looked into them and tried to point out the performance bottlenecks in those clusters; those performance issues are not directly related to leader elections. If those cases are still pending, please open a new BZ to address each of them.
The issue is now tracked via: https://issues.redhat.com/browse/ETCD-196

This bug is closed.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.