Bug 1870274 - minimize disruption from etcd leader elections
Summary: minimize disruption from etcd leader elections
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Suresh Kolichala
QA Contact: ge liu
URL:
Whiteboard: LifecycleReset
Depends On:
Blocks:
 
Reported: 2020-08-19 15:56 UTC by Sam Batschelet
Modified: 2024-10-01 16:47 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-15 12:34:35 UTC
Target Upstream Version:
Embargoed:



Description Sam Batschelet 2020-08-19 15:56:40 UTC
Description of problem: When etcd goes through a leader election, by design all clients should fail fast and retry. The reason is that during an election there is technically no leader, so we cannot serve linearized transactions. Instead of timing out, clients should fail fast and retry. We need a full understanding of the effects of a leader election on the apiserver and other clients.

Questions:

1.) Does the leader election itself result in additional I/O latencies? Do we replay the WAL file, for example?

2.) Since we just told all clients to retry, does all of this retrying disrupt etcd's ability to respond in a timely manner?

3.) Do all clients actually fail fast and then retry? If a client chooses to simply time out instead, we should explore why and whether that can be changed.

4.) On average, how much time does a leader election take before we can serve requests again?

For example, in the log below, why do we time out a 10s context? Why didn't the client just fail and retry when it saw the "leader changed" error?

```
raft2020/08/17 13:33:53 INFO: 468accc2d74de166 is starting a new election at term 6
raft2020/08/17 13:33:53 INFO: 468accc2d74de166 became candidate at term 7
raft2020/08/17 13:33:53 INFO: 468accc2d74de166 received MsgVoteResp from 468accc2d74de166 at term 7
raft2020/08/17 13:33:53 INFO: 468accc2d74de166 [logterm: 6, index: 180670] sent MsgVote request to 2dc33b3f4b2746bd at term 7
raft2020/08/17 13:33:53 INFO: 468accc2d74de166 [logterm: 6, index: 180670] sent MsgVote request to 88cdfdfb2b77134c at term 7
raft2020/08/17 13:33:53 INFO: raft.node: 468accc2d74de166 lost leader 88cdfdfb2b77134c at term 7
raft2020/08/17 13:33:53 INFO: 468accc2d74de166 received MsgVoteResp from 2dc33b3f4b2746bd at term 7
raft2020/08/17 13:33:53 INFO: 468accc2d74de166 has received 2 MsgVoteResp votes and 0 vote rejections
raft2020/08/17 13:33:53 INFO: 468accc2d74de166 became leader at term 7
raft2020/08/17 13:33:53 INFO: raft.node: 468accc2d74de166 elected leader 468accc2d74de166 at term 7
2020-08-17 13:33:53.899755 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-kube-controller-manager-operator/csr-signer-ca\" " with result "error:etcdserver: leader changed" took too long (2.370206078s) to execute
2020-08-17 13:33:56.996289 W | etcdserver: read-only range request "key:\"/openshift.io/oauth/accesstokens/missing\" " with result "error:context canceled" took too long (5.00077352s) to execute
WARNING: 2020/08/17 13:33:56 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-17 13:33:57.135293 I | etcdserver/api/etcdhttp: /health OK (status code 200)
2020-08-17 13:33:57.302295 W | etcdserver: read-only range request "key:\"/kubernetes.io/crd-publish-openapi-test-common-group.example.com/e2e-test-crd-publish-openapi-6547-crds/\" range_end:\"/kubernetes.io/crd-publish-openapi-test-common-group.example.com/e2e-test-crd-publish-openapi-6547-crds0\" " with result "error:context canceled" took too long (1.058356927s) to execute
WARNING: 2020/08/17 13:33:57 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-17 13:34:01.992695 W | etcdserver: read-only range request "key:\"/kubernetes.io/namespaces/kube-system\" " with result "error:context canceled" took too long (4.996294912s) to execute
WARNING: 2020/08/17 13:34:01 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-17 13:34:01.994451 W | etcdserver: read-only range request "key:\"/openshift.io/imagestreams/openshift-apiserver/missing\" " with result "error:context canceled" took too long (4.998089735s) to execute
WARNING: 2020/08/17 13:34:01 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-17 13:34:02.136726 I | etcdserver/api/etcdhttp: /health OK (status code 200)
2020-08-17 13:34:02.624425 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-kube-scheduler/kube-scheduler\" " with result "error:context canceled" took too long (9.999407116s) to execute
WARNING: 2020/08/17 13:34:02 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-17 13:34:02.996870 W | etcdserver: read-only range request "key:\"/openshift.io/oauth/accesstokens/missing\" " with result "error:context canceled" took too long (4.998579146s) to execute
WARNING: 2020/08/17 13:34:02 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-17 13:34:03.069760 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/kube-system/kube-controller-manager\" " with result "error:context canceled" took too long (9.986944846s) to execute
WARNING: 2020/08/17 13:34:03 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-17 13:34:03.220114 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-kube-scheduler/kube-scheduler\" " with result "error:context canceled" took too long (9.997289849s) to execute
WARNING: 2020/08/17 13:34:03 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-17 13:34:03.347028 W | etcdserver: read-only range request "key:\"/openshift.io/health\" " with result "error:context deadline exceeded" took too long (9.999999885s) to execute
2020-08-17 13:34:03.568258 W | etcdserver: read-only range request "key:\"/openshift.io/health\" " with result "error:context canceled" took too long (9.99991721s) to execute
WARNING: 2020/08/17 13:34:03 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2020/08/17 13:34:03 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-08-17 13:34:03.899867 W | etcdserver: timed out waiting for read index response (local node might have slow network)
```

ref: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1295334717188476928
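
To make the desired client behavior concrete, here is a minimal sketch of a fail-fast-and-retry wrapper, assuming the Go clientv3 API (etcd 3.5 import path). This is not the apiserver's actual storage code; the helper name getWithRetry, the endpoint, the key, and the retry/backoff values are placeholders, and TLS configuration is omitted for brevity.

```
// Illustrative sketch only: a client that fails fast on "leader changed"
// and retries, rather than waiting out one long context deadline.
package main

import (
	"context"
	"log"
	"strings"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3" // assumed etcd 3.5 client import path
)

// getWithRetry (hypothetical helper) issues a linearized Get with a short
// per-attempt deadline and retries when the error indicates a leader change.
func getWithRetry(cli *clientv3.Client, key string, maxRetries int) (*clientv3.GetResponse, error) {
	var lastErr error
	for i := 0; i < maxRetries; i++ {
		// Short per-attempt deadline instead of a single 10s context.
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		resp, err := cli.Get(ctx, key)
		cancel()
		if err == nil {
			return resp, nil
		}
		lastErr = err
		// Matching on the error string keeps the sketch simple; real code
		// would compare against the typed client error instead.
		if strings.Contains(err.Error(), "leader changed") {
			time.Sleep(100 * time.Millisecond) // brief backoff while a new leader is elected
			continue
		}
		return nil, err
	}
	return nil, lastErr
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://10.0.0.1:2379"}, // placeholder endpoint; TLS config omitted
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	if _, err := getWithRetry(cli, "/kubernetes.io/namespaces/kube-system", 5); err != nil {
		log.Fatal(err)
	}
}
```

The point of the sketch is the short per-attempt deadline: instead of a single 10s context riding out the election, the client gives up quickly on "leader changed" and retries once a new leader is available.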




Version-Release number of selected component (if applicable):


How reproducible: most leader elections


Steps to Reproduce:
1. Induce a leader election (for example, via a controlled leadership transfer; see the sketch after this list).
2. Observe how clients react.
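
For step 1, one way to induce a controlled leader election is a leadership transfer from a test client. Below is a minimal sketch, again assuming the Go clientv3 API (etcd 3.5 import path); the endpoint is a placeholder, TLS configuration is omitted, and the MoveLeader request has to reach the current leader.

```
// Illustrative sketch only: induces a controlled leader election by asking
// the current leader to hand off leadership to another member.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3" // assumed etcd 3.5 client import path
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://10.0.0.1:2379"}, // placeholder leader endpoint; TLS config omitted
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Look up the current leader ID from the endpoint status.
	status, err := cli.Status(ctx, cli.Endpoints()[0])
	if err != nil {
		log.Fatal(err)
	}

	// Transfer leadership to any member that is not the current leader.
	members, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range members.Members {
		if m.ID != status.Leader {
			if _, err := cli.MoveLeader(ctx, m.ID); err != nil {
				log.Fatal(err)
			}
			log.Printf("transferred leadership to member %x", m.ID)
			break
		}
	}
}
```

While the transfer is in progress, watch the etcd and kube-apiserver logs (as in the raft excerpt above) to see whether clients fail fast and retry or ride out their context deadlines.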

Actual results: It appears that some clients are timing out instead of failing fast and retrying. Is that true, and if so, why?


Expected results: When a leader election happens, clients should retry rather than time out. etcd should minimize the disruption caused by leader elections wherever possible.


Additional info:

Comment 6 ge liu 2020-09-11 04:39:53 UTC
Hi Sam, I changed the expected release to 4.6; please correct me if you have any concerns. Thanks.

Comment 11 Sam Batschelet 2020-10-26 02:13:06 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 15 Michal Fojtik 2021-03-22 17:20:23 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 16 Wally 2021-05-05 13:26:35 UTC
Aniket - 

I read through the case attached to this bug, and I have concerns about what is generating the high load across multiple nodes and ultimately resulting in the leader election. Do you have a must-gather from the time when the issue occurs? A Prometheus dump would be helpful as well.

Our current plan is to close this BZ in favor of a Jira ticket for an upcoming OCP release to improve the system behavior during a leader election.  However, I want to make sure we get this customer issue on the right path before doing so.

Thanks,

Wally

Comment 17 Michal Fojtik 2021-05-05 14:14:36 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 18 kedar 2021-05-11 20:34:39 UTC
@Wally

We do have a must-gather, but it contains GBs of data. Can you specify which logs you want, so I can either request them from the customer or extract them from the must-gather?

Thank you.

Comment 19 Wally 2021-05-20 15:05:37 UTC
Created https://issues.redhat.com/browse/ETCD-193 to track the investigative/engineering work associated with the problem described at the top of this BZ.

Comment 20 Suresh Kolichala 2021-05-20 15:16:05 UTC
@kedar can you provide etcd logs from all masters, along with the etcd-operator logs? In addition, logs from the kube-apiserver (and its operator) would be useful to see how the client is responding.

Comment 22 Suresh Kolichala 2021-05-24 11:37:40 UTC
In the SOS report, etcd isn't showing 100% CPU at all; at 19.7% it looks normal. Here is the `ps` output sorted by CPU usage:

```
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
1001        7324  3.2  7.5 4569364 2477300 ?     Ssl  Apr20 101:05 /bin/olm --namespace openshift-operator-lifecycle-manager --writeStatusName operator-lifecycle-manager --writePackageServerStatusName operator-lifecycle-manager-packageserver --tls-cert /var/run/secrets/serving-cert/tls.crt --tls-key /var/run/secrets/serving-cert/tls.key
root       21725  6.9  7.3 3243080 2428492 ?     Ssl  Apr20 218:55 kube-controller-manager --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig --authentication-kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig --authorization-kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig --client-ca-file=/etc/kubernetes/static-pod-certs/configmaps/client-ca/ca-bundle.crt --requestheader-client-ca-file=/etc/kubernetes/static-pod-certs/configmaps/aggregator-client-ca/ca-bundle.crt -v=2 --tls-cert-file=/etc/kubernetes/static-pod-resources/secrets/serving-cert/tls.crt --tls-private-key-file=/etc/kubernetes/static-pod-resources/secrets/serving-cert/tls.key
root        9361 11.0  8.5 5206088 2800432 ?     Ssl  Apr20 346:28 openshift-apiserver start --config=/var/run/configmaps/config/config.yaml -v=2
root        1801 13.8  0.8 4579800 267340 ?      Ssl  Apr20 433:59 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/master,node.openshift.io/os_id=rhcos --minimum-container-ttl-duration=6m0s --cloud-provider=vsphere --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --cloud-config=/etc/kubernetes/cloud.conf --register-with-taints=node-role.kubernetes.io/master=:NoSchedule --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9aa3725668126a3b00f9c7a9e635c45d86445ed63c1a524a243ab1e9e08f2708 --v=4
root        4958 19.7  9.6 4341112 3167128 ?     Ssl  Apr20 619:01 etcd --initial-advertise-peer-urls=https://10.70.49.8:2380 --cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving/etcd-serving-lxmaster01.crt --key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving/etcd-serving-lxmaster01.key --trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt --client-cert-auth=true --peer-cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-lxmaster01.crt --peer-key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-lxmaster01.key --peer-trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-peer-client-ca/ca-bundle.crt --peer-client-cert-auth=true --advertise-client-urls=https://10.70.49.8:2379 --listen-client-urls=https://0.0.0.0:2379 --listen-peer-urls=https://0.0.0.0:2380 --listen-metrics-urls=https://0.0.0.0:9978
root        4603 37.3 18.3 7613928 6032032 ?     Sl   Apr20 1170:22 kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --advertise-address=10.70.49.8 -v=2
root       12454 40.8  1.0 2081600 347456 ?      Ssl  Apr20 1279:43 /usr/bin/ruby /usr/local/bin/fluentd --suppress-config-dump --no-supervisor -r /usr/local/share/gems/gems/fluent-plugin-elasticsearch-4.1.1/lib/fluent/plugin/elasticsearch_simple_sniffer.rb
root     3197547 56.6  0.6 1126164 218736 pts/0  Sl+  18:20   1:09 /usr/libexec/platform-python -s /usr/sbin/sosreport
root     3208281 61.3  0.2 1604040 70504 pts/0   Sl   18:22   0:03 podman images
root     3205579 69.7  0.6 612392 204124 pts/0   R+   18:21   0:39 journalctl --no-pager --catalog --boot
```

Comment 23 Suresh Kolichala 2021-06-15 12:34:35 UTC
Since the original description was a request for enhancement to minimize disruptions due to etcd leader elections, I am closing this BZ, and opening an RFE for planning the enhancement in future releases.

The attached customer cases are not directly related to the problem described here, but I looked into them and attempted to identify the performance bottlenecks in those clusters. However, those performance issues are not directly related to leader elections. If those cases are still pending, please open a new BZ for each of them.

Comment 24 Suresh Kolichala 2021-06-15 13:07:06 UTC
The issue is now tracked via: https://issues.redhat.com/browse/ETCD-196

This bug is closed.

Comment 25 Red Hat Bugzilla 2023-09-15 00:46:36 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

