Even when running an idle OCP 4 cluster on Azure, there are a lot of etcd leadership elections. Example:
2020-02-21 01:26:19.140279 I | raft: cf452c7e4ed8ffa9 is starting a new election at term 105
2020-02-21 01:26:20.440293 I | raft: cf452c7e4ed8ffa9 is starting a new election at term 106
2020-02-21 01:26:22.340240 I | raft: cf452c7e4ed8ffa9 is starting a new election at term 107
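To put a number on how often this happens, here is a minimal sketch that pulls the election lines out of an etcd log and reports the gap between consecutive elections. The regular expression only assumes the line format shown in the excerpt above; the function names are my own and not part of any etcd tooling.

```python
import re
from datetime import datetime

# Matches the raft election lines in the format shown in the excerpt above.
ELECTION_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) I \| raft: "
    r"(?P<member>[0-9a-f]+) is starting a new election at term (?P<term>\d+)$"
)


def parse_elections(text):
    """Return a list of (timestamp, member_id, term) for each election line."""
    events = []
    for line in text.splitlines():
        match = ELECTION_RE.match(line.strip())
        if match:
            ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f")
            events.append((ts, match["member"], int(match["term"])))
    return events


def gaps_seconds(events):
    """Return the time in seconds between consecutive election events."""
    return [(b[0] - a[0]).total_seconds() for a, b in zip(events, events[1:])]


if __name__ == "__main__":
    sample = """\
2020-02-21 01:26:19.140279 I | raft: cf452c7e4ed8ffa9 is starting a new election at term 105
2020-02-21 01:26:20.440293 I | raft: cf452c7e4ed8ffa9 is starting a new election at term 106
2020-02-21 01:26:22.340240 I | raft: cf452c7e4ed8ffa9 is starting a new election at term 107
"""
    events = parse_elections(sample)
    for gap in gaps_seconds(events):
        print(f"{gap:.3f} s between elections")
```

In the excerpt above the same member starts three elections within about 3.2 seconds.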
It seems that the fdatasync time on the Azure storage stack is regularly longer than the etcd heartbeat timeout configured by OCP.
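One rough way to check this on a node is to time fdatasync calls on the disk that backs etcd. The following is only a sketch under stated assumptions: it runs on Linux (os.fdatasync is not available everywhere), the directory passed in is a placeholder for wherever the etcd data actually lives, and the 8 KiB write size and sample count are arbitrary choices rather than a model of the real etcd WAL write pattern.

```python
import os
import tempfile
import time


def measure_fdatasync(directory, samples=200, write_size=8192):
    """Append small blocks to a temp file and time each fdatasync call (seconds)."""
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        block = b"\0" * write_size
        for _ in range(samples):
            os.write(fd, block)
            start = time.perf_counter()
            os.fdatasync(fd)  # flush file data to stable storage
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


def percentile(values, pct):
    """Simple rank-based percentile; good enough for a rough latency check."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[index]


if __name__ == "__main__":
    # Placeholder path: point this at a directory on the same disk as the etcd data.
    results = measure_fdatasync(tempfile.gettempdir())
    print(f"p50 fdatasync: {percentile(results, 50) * 1000:.2f} ms")
    print(f"p99 fdatasync: {percentile(results, 99) * 1000:.2f} ms")
    print(f"max fdatasync: {max(results) * 1000:.2f} ms")
```

If the p99 or max values come close to the configured heartbeat value, that would be consistent with the elections seen above. For comparison, upstream etcd defaults to a 100 ms heartbeat interval and a 1000 ms election timeout; I have not confirmed what OCP sets.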
Please ensure that etcd is tuned appropriately for the characteristics of the underlying storage stack on Azure OCP clusters in order to reduce leadership elections.
I don't know if more needs to be tuned than the heartbeat timeout; I also don't know what a suitable heartbeat timeout value for Azure is, or what the tradeoff is between hard-coding an alternative value and making it tunable.
I also don't know if any monitoring/alerting configuration changes are needed if the heartbeat timeout is changed.
Please ensure this work goes into 4.3.z.
*** Bug 1798785 has been marked as a duplicate of this bug. ***