Description of problem:

According to https://github.com/openshift/cluster-etcd-operator/blob/release-4.9/pkg/operator/defragcontroller/defragcontroller.go#L158, `DefragDialTimeout` is used as the timeout when running `etcd` defrag. This appears to be set to 45 seconds (see https://github.com/openshift/cluster-etcd-operator/blob/release-4.9/pkg/etcdcli/etcdcli.go#L41).

On large OpenShift Container Platform 4 clusters this does not appear to be enough: `etcd` defrag occasionally fails because it does not complete within the expected 45 seconds.

# etcdctl endpoint status -w table --cluster
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.X.XXX.249:2379 |  aff5216e0d01c0a |   3.5.0 |  4.0 GB |     false |      false |       477 |    5084964 |            5084964 |        |
|   https://10.X.XXX.8:2379 | 6528b69686174191 |   3.5.0 |  4.3 GB |     false |      false |       477 |    5084964 |            5084964 |        |
| https://10.X.XXX.198:2379 | fff981bffaa31b53 |   3.5.0 |  4.5 GB |      true |      false |       477 |    5084964 |            5084964 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

# unset ETCDCTL_ENDPOINTS
# etcdctl --command-timeout=45s --endpoints=https://localhost:2379 defrag
{"level":"warn","ts":"2022-03-11T12:39:33.300Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00017a000/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to defragment etcd member[https://localhost:2379] (context deadline exceeded)

Version-Release number of selected component (if applicable):

- OpenShift Container Platform 4.9.23

How reproducible:

- Random, but expected to fail more often when the `etcd` database exceeds 5 GB

Steps to Reproduce:

1. Set up OpenShift Container Platform on AWS with Master and Worker nodes of type m5.4xlarge
2. Install the Elasticsearch, Logging, Jaeger, NFD, GitOps, Pipelines, Kiali and Service Mesh Operators
3. Run `for i in {5000..7125}; do oc new-project project-$i; oc create configmap project-$i --from-file=/etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt; done`

Note that `/etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt` is about 264K in size, so any file of similar size should work.

Actual results:

`etcd` defrag occasionally times out at this database size because the timeout is set to 45 seconds. Since clusters may have more than 5 GB in `etcd`, defrag is expected to fail most of the time on such clusters and therefore effectively never happen. Either the timeout needs to be increased or another approach needs to be found.

Expected results:

If `etcd` defrag is supposed to happen, it should be able to complete regardless of the size of the `etcd` database.

Additional info:
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069