Bug 1706103
| Summary: | etcd overloaded when there are a large number of objects, cluster goes down when one etcd member is down | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Naga Ravi Chaitanya Elluri <nelluri> |
| Component: | Etcd | Assignee: | Sam Batschelet <sbatsche> |
| Status: | CLOSED ERRATA | QA Contact: | Mike Fiedler <mifiedle> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.1.0 | CC: | gblomqui, mfojtik, mifiedle, nelluri |
| Target Milestone: | --- | | |
| Target Release: | 4.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | aos-scalability-41 | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | Cause: The etcd client balancer had a bug that did not properly fail over to another peer when a client connection timed out. Fix: Bumping etcd to 3.3.17 resolved the client balancer bug; the bump includes a redesigned balancer which now handles peer failover properly. (A minimal client configuration sketch follows this table.) | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-01-23 11:03:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
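The "client balancer" in the Doc Text refers to how an etcd client picks among member endpoints. As a minimal Go clientv3 sketch, with hypothetical endpoint URLs and the TLS setup omitted (this is not the actual OpenShift operator code), a client configured with all three members relies on that balancer to serve requests from a healthy peer when the endpoint it is pinned to times out:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	// Hypothetical member endpoints; a real cluster would use its own URLs,
	// and https endpoints also need a TLS config (omitted here for brevity).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints: []string{
			"https://etcd-0.example.com:2379",
			"https://etcd-1.example.com:2379",
			"https://etcd-2.example.com:2379",
		},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// With the redesigned balancer (etcd >= 3.3.14), a request like this is
	// expected to be served by another healthy endpoint when the pinned
	// member becomes unreachable, instead of hanging on a dead connection.
	resp, err := cli.Get(ctx, "healthcheck")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("cluster reachable; keys returned:", len(resp.Kvs))
}
```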
Hello, could you check the role of the downed etcd: is it an etcd leader or an etcd member? The situation seems similar to https://bugzilla.redhat.com/show_bug.cgi?id=1698456, and the payload reported in this bug is xxx2019-04-22-xxx, so I speculate that fix had not landed yet.

It's an etcd member (see the role-check sketch after this exchange). We installed this cluster using the beta4 build two weeks ago to run large scale tests, so there is definitely a possibility this bug might have been fixed in the latest build.

(In reply to ge liu from comment #2)
> Hello, could you check the role of the downed etcd: is it an etcd leader or
> an etcd member? The situation seems similar to
> https://bugzilla.redhat.com/show_bug.cgi?id=1698456, and the payload reported
> in this bug is xxx2019-04-22-xxx, so I speculate that fix had not landed yet.

Although the symptoms are similar to the referenced bug, the actual bug here is that etcd-0 went down under heavy load (load that is sustainable in OCP 3.11), taking the entire cluster with it. The entire cluster going down is just a symptom of the wildcard cert issue, which I believe was already fixed, just after this build.

> cluster goes down when one etcd member is down
Upstream this should be resolved in mid-August when 3.3.14 is released. Once this happens we will backport downstream and cut a new release for openshift.
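Regarding the leader-vs-member triage question above: one programmatic way to check the role is to compare the leader ID each endpoint reports with that endpoint's own member ID. This is only a hedged sketch, reusing the hypothetical `cli` and endpoint list from the clientv3 example after the metadata table, not how the role was actually determined in this bug:

```go
// checkRoles prints, for each endpoint, whether the member serving it is the
// current raft leader; the downed member shows up here as unreachable.
func checkRoles(cli *clientv3.Client, endpoints []string) {
	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		status, err := cli.Status(ctx, ep)
		cancel()
		if err != nil {
			fmt.Printf("%s: unreachable: %v\n", ep, err)
			continue
		}
		role := "member (follower)"
		// An endpoint belongs to the leader when the leader ID it reports
		// is its own member ID.
		if status.Header.MemberId == status.Leader {
			role = "leader"
		}
		fmt.Printf("%s: %s (etcd %s)\n", ep, role, status.Version)
	}
}
```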
kube 1.16.0 landed, moving this to MODIFIED

Verified on 4.3.0-0.nightly-2019-11-13-233341 running etcd 3.3.17. Cluster is fully operational with 1 etcd member down and remaining 2 healthy (see the quorum sketch after these closing notes).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
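The quorum arithmetic behind that verification: with 2 of 3 members healthy, the cluster still has a majority, so consensus writes and linearizable reads keep working. A sketch of such a check, again reusing the hypothetical clientv3 client from the example after the metadata table (this is not the actual QE verification procedure):

```go
// verifyWithMemberDown confirms that quorum operations still succeed while
// one of the three members is down (the remaining 2 form a majority).
func verifyWithMemberDown(cli *clientv3.Client) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Writes go through raft consensus, so they only succeed with quorum.
	if _, err := cli.Put(ctx, "bz1706103/verify", "one-member-down"); err != nil {
		return fmt.Errorf("write failed, quorum may be lost: %w", err)
	}
	// A default Get is a linearizable read, which also requires quorum.
	if _, err := cli.Get(ctx, "bz1706103/verify"); err != nil {
		return fmt.Errorf("quorum read failed: %w", err)
	}
	return nil
}
```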
Created attachment 1562508 [details]
etcd member log

Description of problem:
We ran a scale test which tried to load a 250 node cluster with a bunch of objects (10k projects, 50k pods), and etcd got overloaded around 8500 projects. Looking at the etcd logs, there were a bunch of tls: bad certificate errors:

rejected connection from "10.0.151.95:39534" (error "remote error: tls: bad certificate", ServerName "etcd-0.scalability-cluster.perf-testing.devcluster.openshift.com")

The etcd server was unable to send out a heartbeat on time, requests were taking too long to execute, and one etcd member in particular (etcd-0) was using lots of memory and CPU (58G memory and around 4 cores). This brought the cluster down; we had to delete the projects to stabilize the cluster.

Version-Release number of selected component (if applicable):
Etcd Version: 3.3.10
OCP: 4.1 beta4/4.1.0-0.nightly-2019-04-22-005054
Installer: v4.1.0-201904211700-dirty

How reproducible:

Steps to Reproduce:
1. Install a cluster using the beta4 build (4.1.0-0.nightly-2019-04-22-005054).
2. Load the cluster with a bunch of objects. In this case, there were about 8500 projects with a bunch of pods running (a rough scripted sketch of this load follows at the end of this comment).
3. Check whether the cluster components are stable/healthy.

Actual results:
- Cluster went down; the oc client was unable to talk to the apiserver.
- etcd-0 was using around 58G of memory and 4 cores.

Expected results:
- Cluster and etcd are stable/healthy.

Additional info:
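The report does not include the load-generation tooling itself, so the following is only a hypothetical sketch of the object load in step 2 of the reproduction steps, written with client-go (a v0.18+ API is assumed). It approximates projects with bare namespaces, skips the per-project pods, and the kubeconfig path, name prefix, and count of 8500 are illustrative assumptions:

```go
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes the default kubeconfig location; adjust for the test cluster.
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Create many namespaces to approximate the project count at which etcd
	// became overloaded in this report (~8500 projects).
	for i := 0; i < 8500; i++ {
		ns := &corev1.Namespace{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("scale-test-%04d", i)},
		}
		if _, err := client.CoreV1().Namespaces().Create(
			context.TODO(), ns, metav1.CreateOptions{}); err != nil {
			fmt.Printf("namespace %d: create failed: %v\n", i, err)
		}
	}
}
```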