Bug 2055801 - [IBM Cloud] Storage IOPS limitations and lack of IPI ETCD deployment options trigger leader election during cluster initialization
Summary: [IBM Cloud] Storage IOPS limitations and lack of IPI ETCD deployment options ...
Keywords:
Status: CLOSED DUPLICATE of bug 2055833
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Allen Ray
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-02-17 16:32 UTC by Jeff Nowicki
Modified: 2022-02-17 18:45 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-02-17 18:45:12 UTC
Target Upstream Version:
Embargoed:


Description Jeff Nowicki 2022-02-17 16:32:53 UTC
Description of problem:
This issue has been addressed in 4.11, and we need the fix cherry-picked into 4.10. It is scoped to IBM Cloud. Without this fix, customers will see degraded stability and reliability when deploying OpenShift to IBM Cloud VPC. Storage IOPS improvements are on the IBM Cloud VPC roadmap, and a future Red Hat OpenShift release is planned to expose installer options that influence the etcd deployment. Until then, shipping 4.10 without this fix opens us up to customer complaints and tickets that we cannot mitigate, because this fix is the only option at this point.

Version-Release number of selected component (if applicable):
4.10

How reproducible:
Consistently/high-rate

Steps to Reproduce:
1. Run conformance tests on an IPI-deployed OpenShift cluster on IBM Cloud VPC.

Actual results:
In conformance test runs against IPI deployments on IBM Cloud VPC (using 4x16), we consistently see failures. Here is a sample snippet from the etcd log:

{"level":"warn","ts":"2022-02-01T13:00:03.930Z","caller":"etcdserver/v3_server.go:815","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":10803993338213686342,"retry-timeout":"500ms"}
{"level":"warn","ts":"2022-02-01T13:00:04.228Z","caller":"etcdserver/raft.go:369","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"a38761c097c14cd7","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"638.378354ms"}
{"level":"warn","ts":"2022-02-01T13:00:04.228Z","caller":"etcdserver/raft.go:369","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"18358366a889f881","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"638.433744ms"}
{"level":"warn","ts":"2022-02-01T13:00:04.228Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"858.984197ms","expected-duration":"200ms","prefix":"read-only range ","request":"key:\"/kubernetes.io/secrets/openshift-operator-lifecycle-manager/pprof-cert\" ","response":"range_response_count:1 size:5995"}
{"level":"warn","ts":"2022-02-01T13:00:04.229Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"839.173911ms","expected-duration":"200ms","prefix":"read-only range ","request":"key:\"/kubernetes.io/namespaces/default\" serializable:true keys_only:true ","response":"range_response_count:1 size:53"}
{"level":"warn","ts":"2022-02-01T13:00:04.229Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"839.37833ms","expected-duration":"200ms","prefix":"read-only range ","request":"key:\"/kubernetes.io/clusterroles/vpc-block-provisioner-role\" ","response":"range_response_count:1 size:933"}
{"level":"warn","ts":"2022-02-01T13:00:04.231Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"801.893666ms","expected-duration":"200ms","prefix":"read-only range ","request":"key:\"/kubernetes.io/deployments/openshift-cluster-csi-drivers/ibm-vpc-block-csi-controller\" ","response":"range_response_count:1 size:8317"}
{"level":"warn","ts":"2022-02-01T13:00:04.234Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"780.450358ms","expected-duration":"200ms","prefix":"read-only range ","request":"key:\"/kubernetes.io/health\" ","response":"range_response_count:0 size:6"}

Expected results:
Successful test results

Additional info:
Related BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2053596 (its PR has been merged into 4.11)

Comment 2 Allen Ray 2022-02-17 18:45:12 UTC

*** This bug has been marked as a duplicate of bug 2055833 ***
