1706103 – etcd overloaded when there are large number of objects, cluster goes down when one etcd member is down

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.3.0
Assignee:	Sam Batschelet
QA Contact:	Mike Fiedler
Docs Contact:
URL:
Whiteboard:	aos-scalability-41
Depends On:
Blocks:

Reported:	2019-05-03 14:38 UTC by Naga Ravi Chaitanya Elluri
Modified:	2023-09-14 05:28 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:	Cause: The etcd client balancer had a bug that prevented proper peer failover when a client connection timed out. Fix: Bumping etcd to 3.3.17 resolved the client balancer bug. The bump includes a redesigned balancer that handles peer failover properly.
Clone Of:
Environment:
Last Closed:	2020-01-23 11:03:45 UTC
Target Upstream Version:
Embargoed:

Attachments
etcd member log (4.91 MB, text/plain) 2019-05-03 14:38 UTC, Naga Ravi Chaitanya Elluri

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2020:0062	0	None	None	None	2020-01-23 11:04:13 UTC

Description Naga Ravi Chaitanya Elluri 2019-05-03 14:38:24 UTC

Created attachment 1562508
etcd member log

Description of problem:
We ran a scale test that tried to load a 250-node cluster with a large number of objects (10k projects, 50k pods), and etcd became overloaded at around 8500 projects. The etcd logs showed many "tls: bad certificate" errors, for example: rejected connection from "10.0.151.95:39534" (error "remote error: tls: bad certificate", ServerName "etcd-0.scalability-cluster.perf-testing.devcluster.openshift.com"). The etcd server was unable to send heartbeats on time, requests were taking too long to execute, and one etcd member in particular (etcd-0) was using a lot of memory and CPU (58G of memory and around 4 cores). This brought the cluster down. We had to delete the projects to stabilize the cluster.
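To gauge how widespread the rejected connections are, the attached member log can be summarized per remote address. The following is a minimal sketch, assuming the log lines contain the same "rejected connection from ..." fragment quoted above; the regular expression is modeled on that single excerpt, and the function name `count_rejections` and the `sample` data are illustrative only.

```python
import re
from collections import Counter

# Pattern modeled on the one excerpt quoted in this report. Real etcd log
# lines carry a timestamp/level prefix, which search() simply skips over.
REJECTED = re.compile(
    r'rejected connection from "(?P<ip>[^":]+):(?P<port>\d+)" '
    r'\(error "(?P<error>[^"]*)", ServerName "(?P<server>[^"]*)"\)'
)


def count_rejections(lines):
    """Count rejected connections per (remote IP, error message) pair."""
    counts = Counter()
    for line in lines:
        match = REJECTED.search(line)
        if match:
            counts[(match["ip"], match["error"])] += 1
    return counts


# Illustrative input: the first line is the excerpt from the description,
# the second is a made-up non-matching line.
sample = [
    'rejected connection from "10.0.151.95:39534" (error "remote error: '
    'tls: bad certificate", ServerName "etcd-0.scalability-cluster.'
    'perf-testing.devcluster.openshift.com")',
    "some unrelated log line",
]

for (ip, error), n in count_rejections(sample).items():
    print(f"{ip}\t{n}\t{error}")
```

Against the real attachment, the same function could be fed an open file object instead of `sample`, since it only needs an iterable of lines.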

Version-Release number of selected component (if applicable):
Etcd Version: 3.3.10
OCP: 4.1 beta4/4.1.0-0.nightly-2019-04-22-005054
Installer: v4.1.0-201904211700-dirty

How reproducible:


Steps to Reproduce:
1. Install a cluster using the beta4 build (4.1.0-0.nightly-2019-04-22-005054).
2. Load the cluster with a large number of objects. In this case, there were about 8500 projects with many pods running.
3. Check if the cluster components are stable/healthy.

Actual results:
- Cluster went down; the oc client was unable to talk to the apiserver.
- etcd-0 was using around 58G of memory and 4 cores.

Expected results:
- Cluster and etcd are stable/healthy.

Additional info:

Comment 2 ge liu 2019-05-05 02:30:58 UTC

Hello, could you check the role of the downed etcd: is it the etcd leader or an etcd member? The situation seems similar to https://bugzilla.redhat.com/show_bug.cgi?id=1698456, and the payload this bug was reported against is xxx2019-04-22-xxx, so I speculate that it does not yet include the fix.

Comment 3 Naga Ravi Chaitanya Elluri 2019-05-06 12:38:48 UTC

It's an etcd member. We installed this cluster using the beta4 build two weeks ago to run large-scale tests, so it is possible that this bug has been fixed in the latest build.

Comment 4 Alex Krzos 2019-05-07 12:48:07 UTC

(In reply to ge liu from comment #2)
> Hello, could you check the role of the downed etcd: is it the etcd leader
> or an etcd member? The situation seems similar to
> https://bugzilla.redhat.com/show_bug.cgi?id=1698456, and the payload this
> bug was reported against is xxx2019-04-22-xxx, so I speculate that it does
> not yet include the fix.

Although the symptoms are similar to the referenced bug, the actual bug here is that etcd-0 went down under heavy load (a load that is sustainable in OCP 3.11), taking the entire cluster down with it. The entire cluster going down is just a symptom of the wildcard cert issue, which I believe was fixed just after this build.

Comment 6 Sam Batschelet 2019-08-07 13:33:54 UTC

> cluster goes down when one etcd member is down

Upstream, this should be resolved in mid-August when etcd 3.3.14 is released. Once that happens, we will backport it downstream and cut a new release for OpenShift.
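For readers unfamiliar with the failure mode, the behavior expected from a client (fail over to another peer when a connection times out) can be illustrated with a toy example. This is only a sketch of the general idea, not the etcd client balancer or its API; `call_with_failover`, `fake_request`, and the endpoint names other than etcd-0 are hypothetical.

```python
def call_with_failover(endpoints, request, timeout_s=1.0):
    """Try each endpoint in order and move on when one times out.

    `request` is any callable taking (endpoint, timeout_s); it is a
    stand-in for a real client call and is an assumption of this sketch.
    """
    last_error = None
    for endpoint in endpoints:
        try:
            return request(endpoint, timeout_s)
        except TimeoutError as exc:
            # This peer did not answer in time; try the next one
            # instead of staying pinned to it.
            last_error = exc
    raise ConnectionError("all endpoints timed out") from last_error


def fake_request(endpoint, timeout_s):
    """Simulated call in which etcd-0 is down and never answers."""
    if endpoint == "etcd-0":
        raise TimeoutError(f"{endpoint} did not respond within {timeout_s}s")
    return f"served by {endpoint}"


print(call_with_failover(["etcd-0", "etcd-1", "etcd-2"], fake_request))
```

A client that stays pinned to the unresponsive peer keeps failing even though healthy members remain, which matches the "cluster goes down when one etcd member is down" symptom quoted above.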

Comment 8 Michal Fojtik 2019-11-07 09:43:52 UTC

kube 1.16.0 landed, moving this to MODIFIED

Comment 11 Mike Fiedler 2019-11-15 16:01:29 UTC

Verified on 4.3.0-0.nightly-2019-11-13-233341 running etcd 3.3.17. The cluster is fully operational with 1 etcd member down and the remaining 2 healthy.
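The verified result is what majority-quorum arithmetic predicts for a 3-member cluster. As a small worked example (the helper names `quorum` and `has_quorum` are illustrative):

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def has_quorum(members: int, healthy: int) -> bool:
    """True if the healthy members still form a majority."""
    return healthy >= quorum(members)


# The verified scenario: 3 members, 1 down, 2 healthy.
print(quorum(3))         # majority needed out of 3
print(has_quorum(3, 2))  # one member down
print(has_quorum(3, 1))  # two members down
```

With 3 members the majority is 2, so losing a single member should not cost availability; losing a second one would.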

Comment 13 errata-xmlrpc 2020-01-23 11:03:45 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

Comment 14 Red Hat Bugzilla 2023-09-14 05:28:00 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
