Bug 1706103 - etcd overloaded when there is a large number of objects; cluster goes down when one etcd member is down [NEEDINFO]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.1.0
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Sam Batschelet
QA Contact: Mike Fiedler
URL:
Whiteboard: aos-scalability-41
Depends On:
Blocks:
 
Reported: 2019-05-03 14:38 UTC by Naga Ravi Chaitanya Elluri
Modified: 2020-01-23 11:04 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Cause: The etcd client balancer had a bug which did not properly facilitate peer failover if a client connection timed out. Fix: Bumping etcd to 3.3.17 resolved the client balancer bug; the bump includes a redesigned balancer which now properly handles peer failover.
Clone Of:
Environment:
Last Closed: 2020-01-23 11:03:45 UTC
Target Upstream Version:
Flags: sbatsche: needinfo? (nelluri)


Attachments
etcd member log (4.91 MB, text/plain)
2019-05-03 14:38 UTC, Naga Ravi Chaitanya Elluri


Links:
Red Hat Product Errata RHBA-2020:0062 (last updated 2020-01-23 11:04:13 UTC)

Description Naga Ravi Chaitanya Elluri 2019-05-03 14:38:24 UTC
Created attachment 1562508 [details]
etcd member log

Description of problem:
We ran a scale test that tried to load a 250-node cluster with a large number of objects (10k projects, 50k pods), and etcd became overloaded at around 8500 projects. Looking at the etcd logs, there were many tls: bad certificate errors (rejected connection from "10.0.151.95:39534" (error "remote error: tls: bad certificate", ServerName "etcd-0.scalability-cluster.perf-testing.devcluster.openshift.com")). The etcd server was unable to send out heartbeats on time, requests were taking too long to execute, and one etcd member in particular (etcd-0) was using a lot of memory and CPU (58G of memory and around 4 cores). This brought the cluster down; we had to delete the projects to stabilize it.
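For illustration, a minimal diagnostic sketch of the kind of per-member check that can surface an overloaded member like etcd-0. It assumes hypothetical member endpoints and omits the client TLS certificates a real OCP cluster requires; it uses the etcd clientv3 Go API of the 3.3.x era.

package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	// Hypothetical member endpoints; a real cluster also needs client TLS certs.
	endpoints := []string{
		"https://etcd-0.example.com:2379",
		"https://etcd-1.example.com:2379",
		"https://etcd-2.example.com:2379",
	}

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Query each member individually; a member that answers slowly or not at all
	// stands out against its peers.
	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		start := time.Now()
		status, err := cli.Status(ctx, ep)
		cancel()
		if err != nil {
			fmt.Printf("%s: unreachable: %v\n", ep, err)
			continue
		}
		fmt.Printf("%s: leader=%x dbSize=%dB took=%s\n",
			ep, status.Leader, status.DbSize, time.Since(start))
	}
}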

Version-Release number of selected component (if applicable):
Etcd Version: 3.3.10
OCP: 4.1 beta4/4.1.0-0.nightly-2019-04-22-005054
Installer: v4.1.0-201904211700-dirty

How reproducible:


Steps to Reproduce:
1. Install a cluster using the beta4 build (4.1.0-0.nightly-2019-04-22-005054).
2. Load the cluster with a large number of objects; in this case there were about 8500 projects with a number of pods running (a minimal load-generation sketch follows this list).
3. Check whether the cluster components are stable/healthy.
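A minimal load-generation sketch under stated assumptions: it uses plain Kubernetes namespaces as a stand-in for OpenShift projects, a hypothetical kubeconfig path, and client-go signatures of the kube 1.16 era referenced later in this bug; the actual scale test may have used different tooling.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical kubeconfig path; point this at the test cluster.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Create many namespaces to drive up the object count stored in etcd.
	// (Plain namespaces stand in for OpenShift projects here.)
	for i := 0; i < 8500; i++ {
		ns := &corev1.Namespace{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("scale-test-%d", i)},
		}
		// client-go of the kube 1.16 era: Create takes only the object.
		if _, err := client.CoreV1().Namespaces().Create(ns); err != nil {
			fmt.Printf("namespace %d: %v\n", i, err)
		}
	}
}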

Actual results:
- The cluster went down; the oc client was unable to talk to the apiserver.
- etcd-0 was using around 58G of memory and 4 cores.

Expected results:
- Cluster and etcd are stable/healthy.

Additional info:

Comment 2 ge liu 2019-05-05 02:30:58 UTC
Hello, could you check the role of the downed etcd instance: is it the etcd leader or a follower member? The situation seems similar to https://bugzilla.redhat.com/show_bug.cgi?id=1698456, and the payload reported in this bug is xxx2019-04-22-xxx, so I speculate that fix had not landed yet.

Comment 3 Naga Ravi Chaitanya Elluri 2019-05-06 12:38:48 UTC
It's an etcd member (not the leader). We installed this cluster using the beta4 build two weeks ago to run large-scale tests, so there is definitely a possibility this bug might have been fixed in the latest build.

Comment 4 Alex Krzos 2019-05-07 12:48:07 UTC
(In reply to ge liu from comment #2)
> Hello, could you check the role of the downed etcd instance: is it the etcd
> leader or a follower member? The situation seems similar to
> https://bugzilla.redhat.com/show_bug.cgi?id=1698456, and the payload reported
> in this bug is xxx2019-04-22-xxx, so I speculate that fix had not landed yet.

Although the symptoms are similar to the referenced bug, the actual bug here is that under heavy load on etcd-0 (a load that is sustainable in OCP 3.11), etcd-0 went down and took the entire cluster with it. The entire cluster going down is just a symptom of the wildcard cert issue, which I believe was already fixed shortly after this build.

Comment 6 Sam Batschelet 2019-08-07 13:33:54 UTC
> cluster goes down when one etcd member is down

Upstream this should be resolved in mid-August when 3.3.14 is released. Once that happens we will backport it downstream and cut a new release for OpenShift.
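For illustration, a minimal sketch of what the redesigned client balancer is expected to handle: a clientv3 client configured with all three member endpoints (hypothetical hostnames, TLS setup omitted) should have its requests served by the remaining healthy members when one endpoint is unreachable.

package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	// All three member endpoints (hypothetical hostnames, TLS setup omitted).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints: []string{
			"https://etcd-0.example.com:2379", // assume this member is down
			"https://etcd-1.example.com:2379",
			"https://etcd-2.example.com:2379",
		},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// With the redesigned balancer (etcd >= 3.3.14), this request should be
	// routed to a healthy member rather than hanging on the dead endpoint.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	resp, err := cli.Get(ctx, "health-check-key")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("request served; store revision:", resp.Header.Revision)
}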

Comment 8 Michal Fojtik 2019-11-07 09:43:52 UTC
kube 1.16.0 landed, moving this to MODIFIED

Comment 11 Mike Fiedler 2019-11-15 16:01:29 UTC
Verified on 4.3.0-0.nightly-2019-11-13-233341 running etcd 3.3.17. The cluster is fully operational with one etcd member down and the remaining two healthy.

Comment 13 errata-xmlrpc 2020-01-23 11:03:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

