2063183 – DefragDialTimeout is set too low for large scale OpenShift Container Platform - Cluster

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.9
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.11.0
Assignee:	Allen Ray
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2101912

Reported:	2022-03-11 13:11 UTC by Simon Reber
Modified:	2022-08-10 10:54 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-08-10 10:53:18 UTC
Target Upstream Version:
Embargoed:
Flags:	alray: needinfo- alray: needinfo-

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-etcd-operator pull 762	None	open	WIP: Bug 2063183: Upping defrag timeout to 1 minute	2022-03-15 13:43:18 UTC
Red Hat Issue Tracker	INSIGHTOCP-767	None	None	None	2022-06-14 08:55:48 UTC
Red Hat Knowledge Base (Solution)	6840041	None	None	None	2022-03-24 11:19:29 UTC
Red Hat Product Errata	RHSA-2022:5069	None	None	None	2022-08-10 10:54:10 UTC

Description Simon Reber 2022-03-11 13:11:48 UTC

Description of problem:

According to https://github.com/openshift/cluster-etcd-operator/blob/release-4.9/pkg/operator/defragcontroller/defragcontroller.go#L158, `DefragDialTimeout` is used as the timeout when running `etcd` defrag. It appears to be set to 45 seconds (see https://github.com/openshift/cluster-etcd-operator/blob/release-4.9/pkg/etcdcli/etcdcli.go#L41).

When running `etcd` defrag on a large OpenShift Container Platform 4 cluster, this does not appear to be enough: defrag occasionally fails because it does not complete within the 45-second timeout.

# etcdctl endpoint status -w table --cluster
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.X.XXX.249:2379 |  aff5216e0d01c0a |   3.5.0 |  4.0 GB |     false |      false |       477 |    5084964 |            5084964 |        |
|   https://10.X.XXX.8:2379 | 6528b69686174191 |   3.5.0 |  4.3 GB |     false |      false |       477 |    5084964 |            5084964 |        |
| https://10.X.XXX.198:2379 | fff981bffaa31b53 |   3.5.0 |  4.5 GB |      true |      false |       477 |    5084964 |            5084964 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

# unset ETCDCTL_ENDPOINTS
# etcdctl --command-timeout=45s --endpoints=https://localhost:2379 defrag
{"level":"warn","ts":"2022-03-11T12:39:33.300Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00017a000/#initially=[https://localhost:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to defragment etcd member[https://localhost:2379] (context deadline exceeded)

Version-Release number of selected component (if applicable):

 - OpenShift Container Platform 4.9.23

How reproducible:

 - Intermittent, but expected to fail more often when the `etcd` database exceeds 5 GB

Steps to Reproduce:
1. Set up OpenShift Container Platform on AWS with master and worker nodes of type m5.4xlarge
2. Install Elasticsearch, Logging, Jaeger, NFD, GitOps, Pipelines, Kiali and Service Mesh Operator
3. Run `for i in {5000..7125}; do oc new-project project-$i; oc create configmap project-$i --from-file=/etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt; done`
   Note that `/etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt` is about 264K in size, so a file of similar size should be used
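
For a rough sense of scale, the loop in step 3 can be worked through as arithmetic. The sketch below assumes "264K" means 264 KiB per config map and counts only the raw file payload; it ignores `etcd` overhead, revision history, and the other objects created with each project, so it is not a prediction of the resulting database size. `rawPayloadKiB` is a hypothetical helper name.

```go
package main

import "fmt"

// rawPayloadKiB returns the total size, in KiB, of one config map of
// perObjectKiB created for every project index in [first, last].
func rawPayloadKiB(first, last, perObjectKiB int) int {
	return (last - first + 1) * perObjectKiB
}

func main() {
	// Range and per-object size are taken from step 3 of the reproducer;
	// treating "264K" as 264 KiB is an assumption.
	total := rawPayloadKiB(5000, 7125, 264)
	fmt.Printf("%d config maps, about %.0f MiB of raw payload\n",
		7125-5000+1, float64(total)/1024)
}
```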

Actual results:

Once in a while, `etcd` defrag times out at this database size because the timeout is set to 45 seconds. Since clusters may have an `etcd` database larger than 5 GB, defrag is expected to fail most of the time on such clusters and therefore effectively never happen. Either the timeout needs to be increased or another approach needs to be found.

Expected results:

If `etcd` defrag is supposed to happen, it should be able to complete regardless of the size of the `etcd` database.

Additional info:

Comment 13 errata-xmlrpc 2022-08-10 10:53:18 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
