Bug 2063183

Summary: DefragDialTimeout is set too low for large-scale OpenShift Container Platform clusters
Product: OpenShift Container Platform Reporter: Simon Reber <sreber>
Component: Etcd Assignee: Allen Ray <alray>
Status: CLOSED ERRATA QA Contact: ge liu <geliu>
Severity: high Docs Contact:
Priority: high    
Version: 4.9 CC: alray, cpassare
Target Milestone: --- Flags: alray: needinfo-
alray: needinfo-
Target Release: 4.11.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-10 10:53:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2101912    

Description Simon Reber 2022-03-11 13:11:48 UTC
Description of problem:

According to https://github.com/openshift/cluster-etcd-operator/blob/release-4.9/pkg/operator/defragcontroller/defragcontroller.go#L158, `DefragDialTimeout` is used as the timeout when running `etcd` defrag. This appears to be set to 45 seconds (see https://github.com/openshift/cluster-etcd-operator/blob/release-4.9/pkg/etcdcli/etcdcli.go#L41).

When running `etcd` defrag on a large OpenShift Container Platform 4 cluster, this does not appear to be enough: the defrag occasionally fails because it does not complete within the expected 45 seconds.

# etcdctl endpoint status -w table --cluster
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.X.XXX.249:2379 |  aff5216e0d01c0a |   3.5.0 |  4.0 GB |     false |      false |       477 |    5084964 |            5084964 |        |
|   https://10.X.XXX.8:2379 | 6528b69686174191 |   3.5.0 |  4.3 GB |     false |      false |       477 |    5084964 |            5084964 |        |
| https://10.X.XXX.198:2379 | fff981bffaa31b53 |   3.5.0 |  4.5 GB |      true |      false |       477 |    5084964 |            5084964 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

# unset ETCDCTL_ENDPOINTS
# etcdctl --command-timeout=45s --endpoints=https://localhost:2379 defrag
{"level":"warn","ts":"2022-03-11T12:39:33.300Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00017a000/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to defragment etcd member[https://localhost:2379] (context deadline exceeded)

Version-Release number of selected component (if applicable):

 - OpenShift Container Platform 4.9.23

How reproducible:

 - Random, but expected to fail more often when the `etcd` database is larger than 5 GB

Steps to Reproduce:
1. Set up OpenShift Container Platform on AWS with master and worker nodes of type m5.4xlarge
2. Install Elasticsearch, Logging, Jaeger, NFD, GitOps, Pipelines, Kiali and Service Mesh Operator
3. Run `for i in {5000..7125}; do oc new-project project-$i; oc create configmap project-$i --from-file=/etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt; done`
   Note that `/etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt` is about 264K in size, so a file of similar size should be used

Actual results:

Once in a while `etcd` defrag times out at this database size because the timeout is set to 45 seconds. Since clusters may have more than 5 GB of `etcd` data, defrag is expected to fail most of the time on such clusters and therefore never complete. Hence either the timeout needs to be increased or another approach needs to be found.

Expected results:

If `etcd` defrag is supposed to happen, it should be able to complete regardless of the size of the `etcd` database.

Additional info:

Comment 13 errata-xmlrpc 2022-08-10 10:53:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069