Bug 1832261

Summary: Frequent etcd leader changes
Product: OpenShift Container Platform
Component: Etcd
Version: 4.5
Status: CLOSED NOTABUG
Severity: medium
Priority: medium
Keywords: Upgrades
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Reporter: Vadim Rutkovsky <vrutkovs>
Assignee: Sam Batschelet <sbatsche>
QA Contact: ge liu <geliu>
CC: anachand, lmohanty, sdodson, wking, wlewis
Type: Bug
Last Closed: 2020-08-18 10:29:08 UTC

Description Vadim Rutkovsky 2020-05-06 12:08:56 UTC
Description of problem:

https://telemeter-lts-dashboards.datahub.redhat.com/d/kSomLV7Wk/pre-production-cluster-by-id?orgId=1&var-_id=953ca121-2e53-40b4-8375-2c9e2ae90aa2&var-datasource=recent

Version-Release number of selected component (if applicable):
4.4.3

How reproducible:
~30% of upgraded clusters

Comment 1 Vadim Rutkovsky 2020-05-06 12:11:18 UTC
must-gather - https://drive.google.com/file/d/1iC6vG6c0P9op1BxWeNoCwBsnEd_FmI0T/view

Comment 2 Sam Batschelet 2020-05-07 01:23:20 UTC
I dug through these logs extensively. I saw a few quorum-guard timeouts from before the upgrade, which rules out the general theory that this cluster was working fine before the upgrade. My understanding is that the cluster was upgraded on May 05.

```
May 04 15:43:55.471700 okd4-khbpx-master-1.novalocal (failure)
May 04 20:06:17.906200 okd4-khbpx-master-1.novalocal (failure)
May 04 23:12:47.544943 okd4-khbpx-master-1.novalocal (failure)
May 04 23:13:08.338182 okd4-khbpx-master-1.novalocal (failure)
May 04 23:13:08.969935 okd4-khbpx-master-1.novalocal (failure)
May 04 23:26:07.456137 okd4-khbpx-master-1.novalocal (failure)
May 04 23:26:17.337510 okd4-khbpx-master-1.novalocal (failure)
May 04 23:26:27.355816 okd4-khbpx-master-1.novalocal (failure)
May 04 23:26:37.224832 okd4-khbpx-master-1.novalocal (failure)
May 04 23:26:47.397717 okd4-khbpx-master-1.novalocal (failure)
```

Also, the etcd logs are full of `etcdserver: server is likely overloaded` messages, pointing to a potential resource issue.

```
$ grep -rn 'etcdserver: server is likely overloaded' | wc -l
374
```
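The single total from `grep | wc -l` does not show which etcd member or log file produced the warnings. As a minimal sketch, assuming the must-gather has been unpacked into a local directory, the following counts matching lines per file. The function name `count_overloaded` is hypothetical and not part of any tool mentioned in this bug.

```python
from collections import Counter
from pathlib import Path

# The warning string quoted in this comment.
NEEDLE = "etcdserver: server is likely overloaded"


def count_overloaded(root: str) -> Counter:
    """Count lines containing NEEDLE in every regular file under root.

    Roughly equivalent to `grep -rc`, except that files with zero hits
    are left out of the result.
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        # errors="replace" keeps binary or mis-encoded files from aborting the scan.
        with path.open(errors="replace") as fh:
            hits = sum(1 for line in fh if NEEDLE in line)
        if hits:
            counts[str(path.relative_to(root))] = hits
    return counts
```

Calling `count_overloaded("path/to/must-gather").most_common()` lists the files with the most warnings first, and `sum(counts.values())` should match the `wc -l` total for the same directory.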

Given the above, I am lowering this to medium. We can try to get more performance metrics on etcd, but without a better understanding of the storage there is not much we can do here. We do need a way to roll out custom runtimes for etcd to help customers with underperforming storage avoid leader elections. Moving to 4.6 for consideration of this feature; 4.5 is closed.

Comment 5 Sam Batschelet 2020-06-20 12:52:50 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 8 Sam Batschelet 2020-08-18 10:29:08 UTC
Based on https://bugzilla.redhat.com/show_bug.cgi?id=1832261#c2 I am going to close this bug, as it seems the cluster had performance issues before the upgrade. In general, we should try to improve clusters with performance issues by adjusting the etcd raft tunables so that etcd tolerates its environment better. I created an RFE for an autotune feature; let's track that moving forward.

https://issues.redhat.com/browse/ETCD-121
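
For illustration only, here is a rough sketch of what adjusting the raft tunables could look like. It is not the design tracked in the RFE. It assumes the tunables in question are etcd's heartbeat interval and election timeout (the `--heartbeat-interval` and `--election-timeout` flags, also settable as `ETCD_HEARTBEAT_INTERVAL` and `ETCD_ELECTION_TIMEOUT`), and it follows the general upstream etcd tuning guidance: heartbeat near the peer round-trip time and election timeout about ten times larger. The helper names, the floor at etcd's defaults (100 ms and 1000 ms), and the measured RTT input are all assumptions.

```python
import math
from dataclasses import dataclass

DEFAULT_HEARTBEAT_MS = 100     # etcd default heartbeat interval
DEFAULT_ELECTION_MS = 1000     # etcd default election timeout
MAX_ELECTION_MS = 50000        # upper limit etcd accepts for the election timeout


@dataclass(frozen=True)
class RaftTunables:
    heartbeat_interval_ms: int
    election_timeout_ms: int

    def as_env(self) -> dict:
        """Render the values as etcd environment variables."""
        return {
            "ETCD_HEARTBEAT_INTERVAL": str(self.heartbeat_interval_ms),
            "ETCD_ELECTION_TIMEOUT": str(self.election_timeout_ms),
        }


def suggest_tunables(peer_rtt_ms: float, election_multiple: int = 10) -> RaftTunables:
    """Hypothetical helper: derive tunables from a measured peer round-trip time.

    Never goes below etcd's defaults (an assumption of this sketch), and
    refuses inputs that would push the election timeout past etcd's limit.
    """
    if peer_rtt_ms <= 0:
        raise ValueError("peer_rtt_ms must be positive")
    heartbeat = max(DEFAULT_HEARTBEAT_MS, math.ceil(peer_rtt_ms))
    election = max(DEFAULT_ELECTION_MS, heartbeat * election_multiple)
    if election > MAX_ELECTION_MS:
        raise ValueError("RTT too high: election timeout would exceed etcd's limit")
    return RaftTunables(heartbeat, election)
```

For example, `suggest_tunables(250).as_env()` yields a 250 ms heartbeat and a 2500 ms election timeout. Longer timeouts reduce spurious leader elections on slow storage or networks, at the cost of slower detection of a genuinely failed leader. How such values would be rolled out on an OpenShift cluster is not covered by this bug.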

Comment 9 Sam Batschelet 2020-08-18 10:39:08 UTC
I also wanted to link this RFE on mitigating leader elections during upgrades as possibly related.

https://issues.redhat.com/browse/ETCD-98