Bug 1832261
| Summary: | Frequent etcd leader changes | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Vadim Rutkovsky <vrutkovs> |
| Component: | Etcd | Assignee: | Sam Batschelet <sbatsche> |
| Status: | CLOSED NOTABUG | QA Contact: | ge liu <geliu> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 4.5 | CC: | anachand, lmohanty, sdodson, wking, wlewis |
| Target Milestone: | --- | Keywords: | Upgrades |
| Target Release: | 4.6.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-08-18 10:29:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Vadim Rutkovsky
2020-05-06 12:08:56 UTC
I dug through these logs extensively and saw a few quorum-guard timeouts before the upgrade, which rules out the general theory that this cluster was working fine before the upgrade. My understanding is that the cluster was upgraded on May 05.

```
May 04 15:43:55.471700 okd4-khbpx-master-1.novalocal (failure)
May 04 20:06:17.906200 okd4-khbpx-master-1.novalocal (failure)
May 04 23:12:47.544943 okd4-khbpx-master-1.novalocal (failure)
May 04 23:13:08.338182 okd4-khbpx-master-1.novalocal (failure)
May 04 23:13:08.969935 okd4-khbpx-master-1.novalocal (failure)
May 04 23:26:07.456137 okd4-khbpx-master-1.novalocal (failure)
May 04 23:26:17.337510 okd4-khbpx-master-1.novalocal (failure)
May 04 23:26:27.355816 okd4-khbpx-master-1.novalocal (failure)
May 04 23:26:37.224832 okd4-khbpx-master-1.novalocal (failure)
May 04 23:26:47.397717 okd4-khbpx-master-1.novalocal (failure)
```

The etcd logs are also full of `etcdserver: server is likely overloaded` messages, pointing to a potential resource issue.

```
$ grep -rn 'etcdserver: server is likely overloaded' | wc -l
374
```

Given the above, I am lowering this to medium. We can try to get more performance metrics on etcd, but without a better understanding of the storage there is not much we can do here.

We do need a way to roll out custom runtimes for etcd to help customers with underperforming storage avoid leader elections. Moving to 4.6 as consideration for this feature; 4.5 is closed.

I'm adding UpcomingSprint because I was occupied with fixing bugs of higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Based on https://bugzilla.redhat.com/show_bug.cgi?id=1832261#c2, I am going to close this bug, as it seems the cluster had performance issues before the upgrade. In general, we should try to improve clusters with performance issues by adjusting etcd raft tunables so that etcd tolerates its environment better. I created an RFE around an autotune feature; let's track that moving forward: https://issues.redhat.com/browse/ETCD-121

I also wanted to link this RFE on mitigating leader elections during upgrades as possibly related: https://issues.redhat.com/browse/ETCD-98
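
For anyone repeating this kind of triage, the quorum-guard failure timestamps quoted above can be bucketed by day and hour to check whether failures predate an upgrade. The following is only a minimal sketch: it assumes log lines shaped exactly like the excerpt in this bug (`Mon DD HH:MM:SS.ffffff host (failure)`), and the names `LINE_RE` and `failures_per_hour` are hypothetical helpers, not part of any OpenShift or etcd tooling.

```python
import re
from collections import Counter

# Assumed line shape, copied from the excerpt in this bug:
#   "May 04 15:43:55.471700 okd4-khbpx-master-1.novalocal (failure)"
LINE_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2}) (?P<day>\d{2}) "
    r"(?P<hour>\d{2}):\d{2}:\d{2}\.\d+ (?P<host>\S+) \(failure\)$"
)


def failures_per_hour(lines):
    """Count failure lines per (day, hour); lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        day = f"{match['month']} {match['day']}"
        counts[(day, int(match["hour"]))] += 1
    return counts


# The ten failure lines quoted in this bug report.
SAMPLE = """\
May 04 15:43:55.471700 okd4-khbpx-master-1.novalocal (failure)
May 04 20:06:17.906200 okd4-khbpx-master-1.novalocal (failure)
May 04 23:12:47.544943 okd4-khbpx-master-1.novalocal (failure)
May 04 23:13:08.338182 okd4-khbpx-master-1.novalocal (failure)
May 04 23:13:08.969935 okd4-khbpx-master-1.novalocal (failure)
May 04 23:26:07.456137 okd4-khbpx-master-1.novalocal (failure)
May 04 23:26:17.337510 okd4-khbpx-master-1.novalocal (failure)
May 04 23:26:27.355816 okd4-khbpx-master-1.novalocal (failure)
May 04 23:26:37.224832 okd4-khbpx-master-1.novalocal (failure)
May 04 23:26:47.397717 okd4-khbpx-master-1.novalocal (failure)
"""

if __name__ == "__main__":
    for (day, hour), count in sorted(failures_per_hour(SAMPLE.splitlines()).items()):
        print(f"{day} {hour:02d}:00  {count} failure(s)")
```

On the excerpt above, every failure falls on May 04, the day before the reported upgrade, with most of them clustered in the 23:00 hour.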