Bug 1806700

Summary:	Large number of etcd leader elections on Azure
Product:	OpenShift Container Platform	Reporter:	Jim Minter <jminter>
Component:	Etcd	Assignee:	Sam Batschelet <sbatsche>
Status:	CLOSED ERRATA	QA Contact:	ge liu <geliu>
Severity:	urgent	Docs Contact:
Priority:	unspecified
Version:	4.4	CC:	asimonel, bleanhar, dcain, dkinkead, jtaleric, mfojtik, mharri, mifiedle, mjudeiki, nelluri, sbatsche, serena.cortopassi, vlaad, wking
Target Milestone:	---
Target Release:	4.4.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1807278 1807279		Environment:
Last Closed:	2020-05-21 17:59:44 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1807278, 1807279

Description Jim Minter 2020-02-24 19:10:23 UTC

Even when running an idle OCP 4 cluster on Azure there are a lot of etcd leadership elections.  Example:

2020-02-21 01:26:19.140279 I | raft: cf452c7e4ed8ffa9 is starting a new election at term 105
2020-02-21 01:26:20.440293 I | raft: cf452c7e4ed8ffa9 is starting a new election at term 106
2020-02-21 01:26:22.340240 I | raft: cf452c7e4ed8ffa9 is starting a new election at term 107
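
To quantify how often this happens, election lines like the ones above can be counted per member and the gaps between them measured. The following is only a minimal sketch: the regular expression is modelled on the three log lines quoted here, and the function names (parse_elections, elections_per_member, gaps_seconds) are hypothetical helpers, not part of etcd or OCP tooling.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern modelled on the log lines quoted in this report (an assumption:
# other etcd versions may format these lines differently).
ELECTION_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) I \| raft: "
    r"(?P<member>[0-9a-f]+) is starting a new election at term (?P<term>\d+)"
)

SAMPLE = """\
2020-02-21 01:26:19.140279 I | raft: cf452c7e4ed8ffa9 is starting a new election at term 105
2020-02-21 01:26:20.440293 I | raft: cf452c7e4ed8ffa9 is starting a new election at term 106
2020-02-21 01:26:22.340240 I | raft: cf452c7e4ed8ffa9 is starting a new election at term 107
"""


def parse_elections(lines):
    """Return (timestamp, member_id, term) for every election line found."""
    events = []
    for line in lines:
        match = ELECTION_RE.match(line.strip())
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            events.append((ts, match.group("member"), int(match.group("term"))))
    return events


def elections_per_member(events):
    """Count how many elections each member started."""
    return Counter(member for _, member, _ in events)


def gaps_seconds(events):
    """Seconds elapsed between consecutive election events, in input order."""
    return [(b[0] - a[0]).total_seconds() for a, b in zip(events, events[1:])]


if __name__ == "__main__":
    events = parse_elections(SAMPLE.splitlines())
    print(elections_per_member(events))
    print(gaps_seconds(events))
```

Run against a full etcd member log, short gaps between consecutive elections would point at repeated failed elections rather than isolated ones.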

It seems that the fdatasync time on the Azure storage stack is regularly longer than the etcd heartbeat timeout configured by OCP.
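
One rough way to check this on a node is to time repeated fdatasync calls on the etcd data disk and compare them with the heartbeat interval. The sketch below is illustrative only: the helper names, the 2048-byte write size, and the sample count are assumptions, and the 0.1 s threshold is the upstream etcd default heartbeat interval (100 ms), not necessarily the value OCP configures. It falls back to os.fsync on platforms that lack os.fdatasync.

```python
import os
import tempfile
import time


def measure_sync_latencies(directory, samples=50, block_size=2048):
    """Append small blocks to a temp file in `directory`, timing each sync."""
    # os.fdatasync is not available everywhere; fall back to os.fsync.
    sync = getattr(os, "fdatasync", os.fsync)
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        payload = b"\0" * block_size
        for _ in range(samples):
            os.write(fd, payload)
            start = time.perf_counter()
            sync(fd)
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


def percentile(values, fraction):
    """Simple nearest-rank style percentile, with fraction in [0, 1]."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(fraction * len(ordered)))
    return ordered[index]


def fraction_over(latencies, threshold_seconds):
    """Share of samples whose sync time exceeded the threshold."""
    return sum(1 for v in latencies if v > threshold_seconds) / len(latencies)


if __name__ == "__main__":
    # Assumption: point this at the filesystem that backs the etcd data dir.
    lat = measure_sync_latencies(tempfile.gettempdir())
    heartbeat_seconds = 0.1  # upstream etcd default; OCP's value may differ
    print("p99 sync latency (s):", percentile(lat, 0.99))
    print("share over heartbeat:", fraction_over(lat, heartbeat_seconds))
```

A synthetic probe like this only approximates etcd's own write pattern; etcd's WAL fsync duration metrics would give a more direct measurement.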

Please ensure that etcd is tuned appropriately for the characteristics of the underlying storage stack on Azure OCP clusters in order to reduce leadership elections.

I don't know whether more than the heartbeat timeout needs to be tuned. I also don't know what a suitable heartbeat timeout value for Azure is, or what the tradeoff is between hard-coding an alternative value and making it tunable.

I also don't know whether any monitoring/alerting configuration changes are needed if the heartbeat timeout is changed.

Please ensure this work goes into 4.3.z.

Comment 8 Sam Batschelet 2020-04-02 21:16:33 UTC

*** Bug 1798785 has been marked as a duplicate of this bug. ***

Comment 11 Red Hat Bugzilla 2023-09-14 05:53:16 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.