Bug 2053596 - [IBM Cloud] Storage IOPS limitations and lack of IPI ETCD deployment options trigger leader election during cluster initialization
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.11.0
Assignee: Allen Ray
QA Contact: ge liu
Whiteboard: EmergencyRequest
Depends On:
Blocks: 2055833
Reported: 2022-02-11 15:14 UTC by Jeff Nowicki
Modified: 2022-08-10 10:50 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Last Closed: 2022-08-10 10:49:30 UTC
Target Upstream Version:

System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-etcd-operator pull 746 | 0 | None | Merged | Bug 2053596: Increasing election timeout for IBMCloud VPC | 2022-03-30 19:20:33 UTC
Github | openshift cluster-etcd-operator pull 759 | 0 | None | Merged | Bug 2053596: Increase IBMCloud VPC heartbeat timeout to 500ms and leader election timeout to 2500ms | 2022-04-08 18:43:42 UTC
Red Hat Product Errata | RHSA-2022:5069 | 0 | None | None | None | 2022-08-10 10:50:06 UTC

Description Jeff Nowicki 2022-02-11 15:14:33 UTC
Description of problem:
During conformance testing of OpenShift on IBM Cloud VPC, tests failed. Log analysis showed high etcd latency, which triggers leader elections.

Version-Release number of selected component (if applicable): 4.10

How reproducible: Run conformance tests on an IPI deployed OpenShift cluster on IBM Cloud VPC.

Steps to Reproduce:
1. Run conformance tests on an IPI deployed OpenShift cluster on IBM Cloud VPC.

Actual results:
Test failures (indicating etcd latency and etcd leader election).

Expected results:
Test success.

Additional info:

As a tactical action, we recommend bumping the leader election timeout (as is done for Azure). A bump to 2000 ms should be sufficient.

Once the OpenShift IPI installer exposes options to influence etcd deployment AND IBM Cloud VPC offers better IOPS performance and more IOPS configuration options, the bump can be removed and the timeout returned to its default value.

Code ref: https://github.com/openshift/cluster-etcd-operator/blob/161c61762ddbc8c5ca6723f56e2dba0e46c91da2/pkg/cmd/render/env.go#L64-L93
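
For illustration, the tactical change requested above amounts to a per-platform override of two etcd timing settings. The sketch below is written in Python rather than the operator's Go and is not the operator's actual code: `timeouts_for_platform` and the `"IBMCloud"` platform key are hypothetical names. The 500 ms and 2500 ms values come from the title of pull 759 linked on this bug (the 2000 ms suggested above was not the value finally merged), and 100 ms and 1000 ms are etcd's documented defaults. `ETCD_HEARTBEAT_INTERVAL` and `ETCD_ELECTION_TIMEOUT` are the environment-variable forms of etcd's `--heartbeat-interval` and `--election-timeout` flags. The 5x sanity check reflects my understanding of etcd's configuration validation and should be treated as an assumption.

```python
# Illustrative only: per-platform lookup of etcd timing settings (milliseconds).
# etcd's documented defaults.
DEFAULTS = {"ETCD_HEARTBEAT_INTERVAL": 100, "ETCD_ELECTION_TIMEOUT": 1000}

# Hypothetical platform key; values taken from the title of
# cluster-etcd-operator pull 759 linked on this bug.
OVERRIDES = {
    "IBMCloud": {"ETCD_HEARTBEAT_INTERVAL": 500, "ETCD_ELECTION_TIMEOUT": 2500},
}


def timeouts_for_platform(platform):
    """Return the etcd timing environment variables for a platform as strings."""
    values = {**DEFAULTS, **OVERRIDES.get(platform, {})}
    heartbeat = values["ETCD_HEARTBEAT_INTERVAL"]
    election = values["ETCD_ELECTION_TIMEOUT"]
    # Assumption: etcd rejects an election timeout below 5x the heartbeat interval.
    if election < 5 * heartbeat:
        raise ValueError("election timeout must be at least 5x the heartbeat interval")
    return {name: str(ms) for name, ms in values.items()}


if __name__ == "__main__":
    for platform in ("AWS", "IBMCloud"):
        print(platform, timeouts_for_platform(platform))
```

Keeping the defaults and the overrides in separate tables makes the later cleanup described above a one-line removal once the platform no longer needs the bump.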

Comment 1 Michal Fojtik 2022-02-11 15:39:41 UTC

This BZ has been set to urgent severity and priority. When a BZ is marked urgent priority Engineers are asked to stop whatever they are doing, putting everything else on hold.
Please be prepared to have reasonable justification ready to discuss, and ensure your own and engineering management are aware and agree this BZ is urgent. Keep in mind, urgent bugs are very expensive and have maximal management visibility.

NOTE: This bug was automatically assigned to an engineering manager with the severity reset to *unspecified* until the emergency is vetted and confirmed. Please do not manually override the severity.


Please answer these questions before escalation to engineering:

1. Has a link to must-gather output been provided in this BZ? We cannot work without it. If must-gather fails to run, attach all relevant logs and provide the error message of must-gather.
2. Give the output of "oc get clusteroperators -o yaml".
3. In case of degraded/unavailable operators, have all their logs and the logs of the operands been analyzed [yes/no]
4. List the top 5 relevant errors from the logs of the operators and operands in (3).
5. Order the list of degraded/unavailable operators according to which is likely the cause of the failure of the other, root-cause at the top.
6. Explain why (5) is likely the right order and list the information used for that assessment.
7. Explain why Engineering is necessary to make progress.

Comment 2 Allen Ray 2022-02-11 16:39:27 UTC
Can you please capture and attach the must-gather?

Comment 3 Jeff Nowicki 2022-02-11 17:04:53 UTC
In conformance testing runs against IPI deployments on IBM Cloud VPC (using 4x16), we consistently see failures. Here is a sample snippet:

{"level":"warn","ts":"2022-02-01T13:00:03.930Z","caller":"etcdserver/v3_server.go:815","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":10803993338213686342,"retry-timeout":"500ms"}
{"level":"warn","ts":"2022-02-01T13:00:04.228Z","caller":"etcdserver/raft.go:369","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"a38761c097c14cd7","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"638.378354ms"}
{"level":"warn","ts":"2022-02-01T13:00:04.228Z","caller":"etcdserver/raft.go:369","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"18358366a889f881","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"638.433744ms"}
{"level":"warn","ts":"2022-02-01T13:00:04.228Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"858.984197ms","expected-duration":"200ms","prefix":"read-only range ","request":"key:\"/kubernetes.io/secrets/openshift-operator-lifecycle-manager/pprof-cert\" ","response":"range_response_count:1 size:5995"}
{"level":"warn","ts":"2022-02-01T13:00:04.229Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"839.173911ms","expected-duration":"200ms","prefix":"read-only range ","request":"key:\"/kubernetes.io/namespaces/default\" serializable:true keys_only:true ","response":"range_response_count:1 size:53"}
{"level":"warn","ts":"2022-02-01T13:00:04.229Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"839.37833ms","expected-duration":"200ms","prefix":"read-only range ","request":"key:\"/kubernetes.io/clusterroles/vpc-block-provisioner-role\" ","response":"range_response_count:1 size:933"}
{"level":"warn","ts":"2022-02-01T13:00:04.231Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"801.893666ms","expected-duration":"200ms","prefix":"read-only range ","request":"key:\"/kubernetes.io/deployments/openshift-cluster-csi-drivers/ibm-vpc-block-csi-controller\" ","response":"range_response_count:1 size:8317"}
{"level":"warn","ts":"2022-02-01T13:00:04.234Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"780.450358ms","expected-duration":"200ms","prefix":"read-only range ","request":"key:\"/kubernetes.io/health\" ","response":"range_response_count:0 size:6"}
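
A snippet like the one above can be summarized mechanically when triaging a full etcd log. The following is a minimal sketch, assuming one JSON object per line as shown; `duration_ms` and `slow_events` are hypothetical helper names, and only single-unit Go duration strings (such as `638.378354ms`) are handled. It reports the `took` or `exceeded-duration` field exactly as logged, next to `expected-duration`, without interpreting what etcd means by either.

```python
import json
import re

# Multipliers to convert a single-unit Go duration string to milliseconds.
_UNITS_MS = {"ns": 1e-6, "us": 1e-3, "µs": 1e-3, "ms": 1.0, "s": 1000.0}
_DURATION = re.compile(r"^([0-9]*\.?[0-9]+)(ns|us|µs|ms|s)$")


def duration_ms(text):
    """Convert a duration such as '638.378354ms' or '1.5s' to milliseconds."""
    match = _DURATION.match(text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return float(match.group(1)) * _UNITS_MS[match.group(2)]


def slow_events(lines):
    """Yield (ts, msg, reported_ms, expected_ms) for entries carrying both fields."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON log entry
        entry = json.loads(line)
        reported = entry.get("took") or entry.get("exceeded-duration")
        expected = entry.get("expected-duration")
        if reported is None or expected is None:
            continue
        yield entry["ts"], entry["msg"], duration_ms(reported), duration_ms(expected)


# Three lines copied from the snippet above; the first has no duration pair.
SAMPLE = [
    r'{"level":"warn","ts":"2022-02-01T13:00:03.930Z","caller":"etcdserver/v3_server.go:815","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":10803993338213686342,"retry-timeout":"500ms"}',
    r'{"level":"warn","ts":"2022-02-01T13:00:04.228Z","caller":"etcdserver/raft.go:369","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"a38761c097c14cd7","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"638.378354ms"}',
    r'{"level":"warn","ts":"2022-02-01T13:00:04.234Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"780.450358ms","expected-duration":"200ms","prefix":"read-only range ","request":"key:\"/kubernetes.io/health\" ","response":"range_response_count:0 size:6"}',
]

if __name__ == "__main__":
    for ts, msg, reported, expected in slow_events(SAMPLE):
        print(f"{ts}  reported={reported:.1f}ms  expected={expected:.0f}ms  {msg[:40]}")
```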

Comment 4 Jeff Nowicki 2022-02-11 17:08:07 UTC
Because boot volume IOPS on IBM Cloud VPC are restricted (set to 3000) and the IPI installer has no option to influence etcd deployment, we are left with a bumped leader election timeout as the tactical fix.

IBM Cloud VPC does have boot volume IOPS options on its roadmap (timeline TBD), and RH has plans to offer IPI installer options to influence etcd deployment. Once we are able to leverage those features, the timeout can be returned to its default.
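
To check whether a given volume is likely to cause the "slow disk" warnings seen earlier, one rough approach is to time small synchronous writes on it. The sketch below is an approximation under stated assumptions, not a substitute for a proper benchmark: `sync_latencies_ms` and `percentile` are hypothetical helpers, the 2 KiB payload and 50 writes are arbitrary choices, and `os.fsync` is used where `os.fdatasync` is unavailable. etcd's documentation suggests keeping the 99th percentile of WAL sync latency under roughly 10 ms, which gives a number to compare against.

```python
import math
import os
import tempfile
import time


def sync_latencies_ms(directory, writes=50, payload=b"x" * 2048):
    """Append small records to a scratch file, syncing after each write.

    Returns the per-write sync latency in milliseconds.
    """
    # fdatasync is not available on every platform; fall back to fsync.
    sync = getattr(os, "fdatasync", os.fsync)
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            os.write(fd, payload)
            start = time.perf_counter()
            sync(fd)
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.unlink(path)  # always remove the scratch file
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    samples = sync_latencies_ms(".")
    print(f"p99 sync latency: {percentile(samples, 99):.2f} ms over {len(samples)} writes")
```

Run it from a directory on the volume under test; results on a shared or busy node will vary from run to run.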

Comment 5 Allen Ray 2022-02-11 19:38:30 UTC
Moving this to 4.11 since the fix will be going there initially.

Comment 8 ge liu 2022-02-28 06:54:02 UTC
@jnowicki.com, could you please try it again to confirm that this issue is fixed? Thanks.

Comment 13 ge liu 2022-04-01 02:20:22 UTC
@alray, you may reuse this bug for a new PR for 4.11. If you want to backport to 4.10, you may clone a new bug for 4.10 only.

Comment 18 errata-xmlrpc 2022-08-10 10:49:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

