2070277 – cluster-autoscaler-default will fail when automated etcd defrag is running on large scale OpenShift Container Platform 4 - Cluster

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.9
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.9.z
Assignee:	Michael McCune
QA Contact:	sunzhaohua
Docs Contact:
URL:
Whiteboard:
Depends On:	2069095
Blocks:

Reported:	2022-03-30 18:45 UTC by Michael McCune
Modified:	2022-04-20 14:50 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	2069095
Environment:
Last Closed:	2022-04-20 14:49:49 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-autoscaler-operator pull 244	0	None	open	Bug 2070277: add leader election flags to autoscaler deployment	2022-04-08 01:01:12 UTC
Red Hat Product Errata	RHSA-2022:1363	0	None	None	None	2022-04-20 14:50:03 UTC

Comment 6 Eric Matysek 2022-04-13 18:51:51 UTC

@zhsun Can this bug be verified for the 4.9 release?

Comment 7 sunzhaohua 2022-04-15 07:43:00 UTC

Verified
clusterversion: 4.9.0-0.nightly-2022-04-14-125839

1. Setup OpenShift Container Platform on AWS 
2. Enable autoscaling with a ClusterAutoscaler resource
3. Install Elasticsearch, Logging, Jaeger, NFD, GitOps, Pipelines, Kiali and Service Mesh Operator
4. Run `for i in {5000..7125}; do oc new-project project-$i; oc create configmap project-$i --from-file=/etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt; done`
5. About 2 hours later, follow https://docs.openshift.com/container-platform/4.8/post_installation_configuration/cluster-tasks.html#etcd-defrag_post-install-cluster-tasks to trigger `etcd defrag`. Run it for each `etcd` member, with the leader last.

$ oc rsh -n openshift-etcd etcd-zhsunazure-2kw7c-master-0
Defaulted container "etcdctl" out of: etcdctl, etcd, etcd-metrics, etcd-health-monitor, setup (init), etcd-ensure-env-vars (init), etcd-resources-copy (init)
sh-4.4# unset ETCDCTL_ENDPOINTS
sh-4.4# etcdctl endpoint status -w table --cluster
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|       ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.0.6:2379 |  50928b3f1313232 |   3.5.0 |  1.7 GB |     false |      false |         6 |     337966 |             337966 |        |
| https://10.0.0.8:2379 | 1ac574d30e667edc |   3.5.0 |  1.7 GB |      true |      false |         6 |     337966 |             337966 |        |
| https://10.0.0.7:2379 | ec66c6beb10bb6b3 |   3.5.0 |  1.7 GB |     false |      false |         6 |     337966 |             337966 |        |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

sh-4.4# etcdctl --command-timeout=30s --endpoints=https://10.0.0.6:2379 defrag
Finished defragmenting etcd member[https://10.0.0.6:2379]
sh-4.4# etcdctl --command-timeout=30s --endpoints=https://10.0.0.7:2379 defrag
Finished defragmenting etcd member[https://10.0.0.7:2379]
sh-4.4# etcdctl --command-timeout=30s --endpoints=https://10.0.0.8:2379 defrag
Finished defragmenting etcd member[https://10.0.0.8:2379]
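
The order used above (non-leaders first, leader last) can also be derived from the `endpoint status` table rather than read off by eye. The following is a minimal sketch and was not part of the original verification: the function names `order_endpoints` and `defrag_leader_last` are hypothetical, and the parsing assumes the exact table layout shown above (endpoint in the first column, IS LEADER in the fifth).

```shell
#!/bin/sh
# Hypothetical helper: reads `etcdctl endpoint status -w table --cluster`
# output on stdin and prints one endpoint per line, with the leader last.
# Assumes the column layout shown in the transcript above.
order_endpoints() {
  awk -F'|' '
    /^\|/ && $2 ~ /https?:\/\// {
      ep = $2
      is_leader = $6
      gsub(/[ \t]/, "", ep)
      gsub(/[ \t]/, "", is_leader)
      if (is_leader == "true") { leader = ep } else { print ep }
    }
    END { if (leader != "") print leader }
  '
}

# Hypothetical wrapper: defragments each member in that order and stops
# at the first failure. It must run where etcdctl is available, and, as in
# the transcript, ETCDCTL_ENDPOINTS is assumed to have been unset first.
defrag_leader_last() {
  etcdctl endpoint status -w table --cluster | order_endpoints |
    while read -r ep; do
      etcdctl --command-timeout=30s --endpoints="$ep" defrag || break
    done
}
```

Only the functions are defined here; nothing runs until `defrag_leader_last` is called from a shell inside the `etcdctl` container.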


The cluster-autoscaler-default pod did not restart at any point:
$ oc get po
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-default-6dfb9f8bcc-jhgpp    1/1     Running   0          3h51m
cluster-autoscaler-operator-654fb45c66-4mdz6   2/2     Running   0          4h33m
cluster-baremetal-operator-569b4ff4dd-5q9wj    2/2     Running   0          4h33m
machine-api-controllers-785cc9fdf-lb969        7/7     Running   0          4h26m
machine-api-operator-ff7dd5bd7-cf4xv           2/2     Running   0          4h33m

$ oc edit po cluster-autoscaler-default-6dfb9f8bcc-jhgpp
spec:
  containers:
  - args:
    - --logtostderr
    - --v=1
    - --cloud-provider=clusterapi
    - --namespace=openshift-machine-api
    - --leader-elect-lease-duration=137s
    - --leader-elect-renew-deadline=107s
    - --leader-elect-retry-period=26s
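
The three leader-election values above have to be mutually consistent. The sketch below checks them against the ordering that the client-go leader-election package is assumed here to enforce: the lease duration must exceed the renew deadline, and the renew deadline must exceed 1.2 times the retry period. The function name `check_leader_elect` is hypothetical, and the 1.2 factor is an assumption about client-go's jitter factor rather than something stated in this bug.

```shell
#!/bin/sh
# Hypothetical helper: takes lease duration, renew deadline and retry period
# in whole seconds and checks the assumed client-go ordering:
#   lease duration > renew deadline > 1.2 * retry period
check_leader_elect() {
  lease=$1
  renew=$2
  retry=$3
  # Compare renew > 1.2 * retry using integers by scaling both sides by 10.
  if [ "$lease" -gt "$renew" ] && [ $((renew * 10)) -gt $((retry * 12)) ]; then
    echo ok
  else
    echo invalid
    return 1
  fi
}
```

With the values from the pod spec, `check_leader_elect 137 107 26` prints `ok`.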

Comment 9 errata-xmlrpc 2022-04-20 14:49:49 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.29 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1363
