Bug 1870274
| Summary: | minimize disruption from etcd leader elections | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Sam Batschelet <sbatsche> |
| Component: | Etcd | Assignee: | Suresh Kolichala <skolicha> |
| Status: | CLOSED NOTABUG | QA Contact: | ge liu <geliu> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.6 | CC: | aghadge, ccoleman, cstark, cvats, erich, ksalunkh, skolicha, travi, wking, wlewis |
| Target Milestone: | --- | Keywords: | Performance |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | LifecycleReset | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-06-15 12:34:35 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Sam Batschelet
2020-08-19 15:56:40 UTC
Hi Sam, I changed the expected release to 4.6; please correct me if you have any concerns. Thanks.

I’m adding UpcomingSprint because I was occupied with fixing bugs of higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen to Keywords if you think this bug should never be marked as stale. Please consult with the bug assignee before you do that.

Aniket - I read through the case attached to this bug, and I have concerns about what is generating the high load across multiple nodes and ultimately resulting in the leader election. Do you have a must-gather from the time when the issue occurs? A prom dump would be helpful as well. Our current plan is to close this BZ in favor of a Jira ticket for an upcoming OCP release to improve the system behavior during a leader election. However, I want to make sure we get this customer issue on the right path before doing so. Thanks, Wally

The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.

@Wally We do have a must-gather, but it contains gigabytes of data. Can you specify which logs you want, so I can request them from the customer or extract them from the must-gather? Thank you.
Created https://issues.redhat.com/browse/ETCD-193 to track investigative/eng work associated with the problem described at the top of this BZ.

@kedar can you provide logs for etcd from all masters, and for etcd-operator? In addition, logs from kube-apiserver (and its operator) would be useful to see how the client is responding.

In the SOS report, etcd isn't showing 100% CPU at all. At 19.7% it looks normal. Here is the `ps` sorted by CPU usage:

```
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
1001 7324 3.2 7.5 4569364 2477300 ? Ssl Apr20 101:05 /bin/olm --namespace openshift-operator-lifecycle-manager --writeStatusName operator-lifecycle-manager --writePackageServerStatusName operator-lifecycle-manager-packageserver --tls-cert /var/run/secrets/serving-cert/tls.crt --tls-key /var/run/secrets/serving-cert/tls.key
root 21725 6.9 7.3 3243080 2428492 ? Ssl Apr20 218:55 kube-controller-manager --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig --authentication-kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig --authorization-kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig --client-ca-file=/etc/kubernetes/static-pod-certs/configmaps/client-ca/ca-bundle.crt --requestheader-client-ca-file=/etc/kubernetes/static-pod-certs/configmaps/aggregator-client-ca/ca-bundle.crt -v=2 --tls-cert-file=/etc/kubernetes/static-pod-resources/secrets/serving-cert/tls.crt --tls-private-key-file=/etc/kubernetes/static-pod-resources/secrets/serving-cert/tls.key
root 9361 11.0 8.5 5206088 2800432 ? Ssl Apr20 346:28 openshift-apiserver start --config=/var/run/configmaps/config/config.yaml -v=2
root 1801 13.8 0.8 4579800 267340 ? Ssl Apr20 433:59 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/master,node.openshift.io/os_id=rhcos --minimum-container-ttl-duration=6m0s --cloud-provider=vsphere --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --cloud-config=/etc/kubernetes/cloud.conf --register-with-taints=node-role.kubernetes.io/master=:NoSchedule --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9aa3725668126a3b00f9c7a9e635c45d86445ed63c1a524a243ab1e9e08f2708 --v=4
root 4958 19.7 9.6 4341112 3167128 ? Ssl Apr20 619:01 etcd --initial-advertise-peer-urls=https://10.70.49.8:2380 --cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving/etcd-serving-lxmaster01.crt --key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving/etcd-serving-lxmaster01.key --trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt --client-cert-auth=true --peer-cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-lxmaster01.crt --peer-key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-lxmaster01.key --peer-trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-peer-client-ca/ca-bundle.crt --peer-client-cert-auth=true --advertise-client-urls=https://10.70.49.8:2379 --listen-client-urls=https://0.0.0.0:2379 --listen-peer-urls=https://0.0.0.0:2380 --listen-metrics-urls=https://0.0.0.0:9978
root 4603 37.3 18.3 7613928 6032032 ? Sl Apr20 1170:22 kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --advertise-address=10.70.49.8 -v=2
root 12454 40.8 1.0 2081600 347456 ? Ssl Apr20 1279:43 /usr/bin/ruby /usr/local/bin/fluentd --suppress-config-dump --no-supervisor -r /usr/local/share/gems/gems/fluent-plugin-elasticsearch-4.1.1/lib/fluent/plugin/elasticsearch_simple_sniffer.rb
root 3197547 56.6 0.6 1126164 218736 pts/0 Sl+ 18:20 1:09 /usr/libexec/platform-python -s /usr/sbin/sosreport
root 3208281 61.3 0.2 1604040 70504 pts/0 Sl 18:22 0:03 podman images
root 3205579 69.7 0.6 612392 204124 pts/0 R+ 18:21 0:39 journalctl --no-pager --catalog --boot
```

Since the original description was a request for enhancement to minimize disruptions due to etcd leader elections, I am closing this BZ and opening an RFE for planning the enhancement in future releases. The attached customer cases are not directly related to the problem described, but I looked into them and attempted to suggest likely performance bottlenecks in the cluster. However, those performance issues are not directly related to the leader elections. If those cases are still pending, please open a new BZ to address each of them. The issue is now tracked via: https://issues.redhat.com/browse/ETCD-196

This bug is closed. The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.