Bug 2063183 - DefragDialTimeout is set too low for large scale OpenShift Container Platform - Cluster
Summary: DefragDialTimeout is set too low for large scale OpenShift Container Platform ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.9
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Allen Ray
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks: 2101912
Reported: 2022-03-11 13:11 UTC by Simon Reber
Modified: 2022-08-10 10:54 UTC (History)
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 10:53:18 UTC
Target Upstream Version:
Embargoed:
alray: needinfo-



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-etcd-operator pull 762 | 0 | None | open | WIP: Bug 2063183: Upping defrag timeout to 1 minute | 2022-03-15 13:43:18 UTC
Red Hat Issue Tracker | INSIGHTOCP-767 | 0 | None | None | None | 2022-06-14 08:55:48 UTC
Red Hat Knowledge Base (Solution) | 6840041 | 0 | None | None | None | 2022-03-24 11:19:29 UTC
Red Hat Product Errata | RHSA-2022:5069 | 0 | None | None | None | 2022-08-10 10:54:10 UTC

Description Simon Reber 2022-03-11 13:11:48 UTC
Description of problem:

According to https://github.com/openshift/cluster-etcd-operator/blob/release-4.9/pkg/operator/defragcontroller/defragcontroller.go#L158, `DefragDialTimeout` is used as the timeout when running an `etcd` defrag. It appears to be set to 45 seconds (see https://github.com/openshift/cluster-etcd-operator/blob/release-4.9/pkg/etcdcli/etcdcli.go#L41).

When running `etcd` defrag on a large OpenShift Container Platform 4 cluster, this does not appear to be enough: the defrag occasionally fails because it does not complete within the 45 seconds allowed.

# etcdctl endpoint status -w table --cluster
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.X.XXX.249:2379 |  aff5216e0d01c0a |   3.5.0 |  4.0 GB |     false |      false |       477 |    5084964 |            5084964 |        |
|   https://10.X.XXX.8:2379 | 6528b69686174191 |   3.5.0 |  4.3 GB |     false |      false |       477 |    5084964 |            5084964 |        |
| https://10.X.XXX.198:2379 | fff981bffaa31b53 |   3.5.0 |  4.5 GB |      true |      false |       477 |    5084964 |            5084964 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

# unset ETCDCTL_ENDPOINTS
# etcdctl --command-timeout=45s --endpoints=https://localhost:2379 defrag
{"level":"warn","ts":"2022-03-11T12:39:33.300Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00017a000/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to defragment etcd member[https://localhost:2379] (context deadline exceeded)
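
For comparison, the same manual defrag can be retried with a larger `--command-timeout`. The following is only a minimal sketch: the helper name `defrag_with_timeout`, the `ETCDCTL_BIN` override, and the 120s default are assumptions made for illustration and are not taken from the operator or from this report.

```shell
# Hypothetical helper: defrag a single endpoint with a caller-chosen timeout.
# ETCDCTL_BIN is an assumed override so the binary can be swapped (e.g. for a dry run).
defrag_with_timeout() {
  local endpoint="$1" timeout="${2:-120s}"   # 120s is an illustrative default
  (
    # As in the transcript above, drop ETCDCTL_ENDPOINTS before passing --endpoints.
    unset ETCDCTL_ENDPOINTS
    "${ETCDCTL_BIN:-etcdctl}" --command-timeout="$timeout" --endpoints="$endpoint" defrag
  )
}
```

Usage would look like `defrag_with_timeout https://localhost:2379 90s`. Whether any particular value is sufficient depends on the database size and the underlying storage.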

Version-Release number of selected component (if applicable):

 - OpenShift Container Platform 4.9.23

How reproducible:

 - Random, but expected to fail more often when the `etcd` database is larger than 5 GB

Steps to Reproduce:
1. Set up OpenShift Container Platform on AWS with master and worker nodes of type m5.4xlarge
2. Install Elasticsearch, Logging, Jaeger, NFD, GitOps, Pipelines, Kiali and Service Mesh Operator
3. Run `for i in {5000..7125}; do oc new-project project-$i; oc create configmap project-$i --from-file=/etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt; done`
   Note that `/etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt` is about 264K in size, so a file of similar size should be used
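
As a rough cross-check of step 3, the raw payload written by the loop can be estimated from the number of iterations and the file size. This is only a back-of-the-envelope sketch: the function name `repro_payload_mib` is hypothetical, the 264K figure is treated as 264 KiB per ConfigMap, and etcd overhead, revisions, and data from the installed operators are ignored.

```shell
# Hypothetical helper: estimate the raw ConfigMap payload (in MiB) written by the loop.
# Assumes one ConfigMap of kib_per_object KiB per project; ignores all etcd overhead.
repro_payload_mib() {
  local first="$1" last="$2" kib_per_object="${3:-264}"
  echo $(( (last - first + 1) * kib_per_object / 1024 ))
}

repro_payload_mib 5000 7125   # 2126 ConfigMaps * 264 KiB, prints 548
```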

Actual results:

Once in a while `etcd` defrag times out at this size because the timeout is set to 45 seconds. Considering that clusters may have more than 5 GB of `etcd` data, defrag is expected to fail most of the time on such clusters and therefore never complete. Hence the timeout needs to be increased or another approach needs to be found.
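
One conceivable "other approach" is to derive the timeout from the database size rather than using a fixed value. The sketch below only illustrates that idea: the function name `defrag_timeout_seconds`, the 45-second base, and the 30 seconds per GiB are hypothetical numbers chosen for the example, not measured values and not what the operator implements.

```shell
# Hypothetical: scale the defrag timeout with the DB size instead of a fixed 45 seconds.
# base_seconds and seconds_per_gib are illustrative assumptions, not measured values.
defrag_timeout_seconds() {
  local db_bytes="$1" base_seconds="${2:-45}" seconds_per_gib="${3:-30}"
  # Round the size up to whole GiB so that small databases still get some headroom.
  local gib=$(( (db_bytes + 1073741823) / 1073741824 ))
  echo $(( base_seconds + gib * seconds_per_gib ))
}
```

With these example numbers, a member reporting 4.5 GB (4500000000 bytes) would get 45 + 5 * 30 = 195 seconds. Suitable constants would have to be determined by measurement on real clusters.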

Expected results:

If `etcd` defrag is supposed to happen, it should be able to complete regardless of the size of the `etcd` database.

Additional info:

Comment 13 errata-xmlrpc 2022-08-10 10:53:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

