Bug 2063183

Summary:	DefragDialTimeout is set to low for large scale OpenShift Container Platform - Cluster
Product:	OpenShift Container Platform	Reporter:	Simon Reber <sreber>
Component:	Etcd	Assignee:	Allen Ray <alray>
Status:	CLOSED ERRATA	QA Contact:	ge liu <geliu>
Severity:	high	Docs Contact:
Priority:	high
Version:	4.9	CC:	alray, cpassare
Target Milestone:	---	Flags:	alray: needinfo- alray: needinfo-
Target Release:	4.11.0
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-08-10 10:53:18 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2101912

Description Simon Reber 2022-03-11 13:11:48 UTC

Description of problem:

According to https://github.com/openshift/cluster-etcd-operator/blob/release-4.9/pkg/operator/defragcontroller/defragcontroller.go#L158 `DefragDialTimeout` will be used as timeout when running `etcd` defrag. When checking, this appears to be set to 45 seconds (see https://github.com/openshift/cluster-etcd-operator/blob/release-4.9/pkg/etcdcli/etcdcli.go#L41).

When running `etcd` defrag activity on large OpenShift Container Platform 4 - Cluster this does not appear to be enough as it occasionally will fail because it defrag does not complete within the expected 45 seconds.

# etcdctl endpoint status -w table --cluster
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.X.XXX.249:2379 |  aff5216e0d01c0a |   3.5.0 |  4.0 GB |     false |      false |       477 |    5084964 |            5084964 |        |
|   https://10.X.XXX.8:2379 | 6528b69686174191 |   3.5.0 |  4.3 GB |     false |      false |       477 |    5084964 |            5084964 |        |
| https://10.X.XXX.198:2379 | fff981bffaa31b53 |   3.5.0 |  4.5 GB |      true |      false |       477 |    5084964 |            5084964 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

# unset ETCDCTL_ENDPOINTS
# etcdctl --command-timeout=45s --endpoints=https://localhost:2379 defrag
{"level":"warn","ts":"2022-03-11T12:39:33.300Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00017a000/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to defragment etcd member[https://localhost:2379] (context deadline exceeded)

Version-Release number of selected component (if applicable):

 - OpenShift Container Platform 4.9.23

How reproducible:

 - Random but expected to fail more often when `etcd` is beyond 5 GB in space

Steps to Reproduce:
1. Setup OpenShift Container Platform on AWS with Master and Worker of type m5.4xlarge
2. Install Elasticsearch, Logging, Jaeger, NFD, GitOps, Pipelines, Kiali and Service Mesh Operator
3. Run `for i in {5000..7125}; do oc new-project project-$i; oc create configmap project-$i --from-file=/etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt; done`
   Mind `/etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt` is about 264K in size and therefore something in that area should be used

Actual results:

Once in a while `etcd` defrag will timeout with this size as timeout is set to 45 seconds. Considering that Clusters may have +5 GB in `etcd` size it's expected that defrag will fail most of the time and therefore never happen. Hence increasing the timeout is required or another approach needs to be found.

Expected results:

If `etcd` defrag is suppose to happen is should be able to complete no matter of the size of the `etcd` database.

Additional info:

Comment 13 errata-xmlrpc 2022-08-10 10:53:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069