Bug 2069095 - cluster-autoscaler-default will fail when automated etcd defrag is running on large scale OpenShift Container Platform 4 - Cluster
Summary: cluster-autoscaler-default will fail when automated etcd defrag is running on large scale OpenShift Container Platform 4 - Cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.9
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.z
Assignee: Michael McCune
QA Contact: Milind Yadav
URL:
Whiteboard:
Depends On: 2063194
Blocks: 2070277
 
Reported: 2022-03-28 09:22 UTC by OpenShift BugZilla Robot
Modified: 2022-10-12 09:15 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2070277 (view as bug list)
Environment:
Last Closed: 2022-04-21 13:16:01 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
GitHub: openshift/cluster-autoscaler-operator pull 243 (open) - [release-4.10] Bug 2069095: add leader election flags to autoscaler deployment - 2022-03-28 09:23:05 UTC
Red Hat Product Errata: RHSA-2022:1356 - 2022-04-21 13:16:19 UTC

Comment 5 Milind Yadav 2022-04-06 10:14:34 UTC
Hi Michael, Simon, thanks for the detailed steps in comment #1 (Description).

I followed all the steps and installed the mentioned operators. Here are the logs (after 2 hrs or so):
[miyadav@miyadav ~]$ oc logs -f cluster-autoscaler-operator-6589f54589-wx9jl -c cluster-autoscaler-operator
I0406 07:36:40.830761       1 main.go:13] Go Version: go1.17.5
I0406 07:36:40.831034       1 main.go:14] Go OS/Arch: linux/amd64
I0406 07:36:40.831150       1 main.go:15] Version: cluster-autoscaler-operator v4.10.0-202203311829.p0.g8bcdccc.assembly.stream-dirty
W0406 07:36:40.837248       1 leaderelection.go:51] unable to get cluster infrastructure status, using HA cluster values for leader election: infrastructures.config.openshift.io "cluster" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-autoscaler-operator" cannot get resource "infrastructures" in API group "config.openshift.io" at the cluster scope
I0406 07:36:41.887922       1 request.go:665] Waited for 1.040054041s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/operators.coreos.com/v1alpha2?timeout=32s
W0406 07:36:46.949503       1 machineautoscaler_controller.go:150] Removing support for unregistered target type: cluster.k8s.io/v1beta1, Kind=MachineDeployment
W0406 07:36:50.740796       1 machineautoscaler_controller.go:150] Removing support for unregistered target type: cluster.k8s.io/v1beta1, Kind=MachineSet
I0406 07:36:51.888023       1 request.go:665] Waited for 1.048171568s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/events.k8s.io/v1beta1?timeout=32s
W0406 07:36:54.542488       1 machineautoscaler_controller.go:150] Removing support for unregistered target type: machine.openshift.io/v1beta1, Kind=MachineDeployment
I0406 07:36:54.542800       1 main.go:36] Starting cluster-autoscaler-operator
I0406 07:36:54.542988       1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-api/cluster-autoscaler-operator-leader...
I0406 07:39:14.631980       1 leaderelection.go:258] successfully acquired lease openshift-machine-api/cluster-autoscaler-operator-leader
I0406 07:39:14.637921       1 status.go:386] No ClusterAutoscaler. Reporting available.
I0406 07:39:14.637938       1 status.go:234] Operator status available: at version 4.10.8
I0406 07:39:14.739388       1 webhookconfig.go:72] Webhook configuration status: updated
E0406 07:42:24.712791       1 leaderelection.go:330] error retrieving resource lock openshift-machine-api/cluster-autoscaler-operator-leader: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps cluster-autoscaler-operator-leader)
.
.

Projects that got created :
.
.
.
project-5999                                                      Active
project-6000                                                      Active
project-6001                                                      Active
project-6002                                                      Active
project-6003                                                      Active
project-6004                                                      Active
project-6005                                                      Active
project-6006                                                      Active
project-6007                                                      Active
project-6008                                                      Active
project-6009                                                      Active
project-6010                                                      Active
project-6011                                                      Active
project-6012                                                      Active
project-6013                                                      Active
project-6014                                                      Active
project-6015                                                      Active
project-6016                                                      Active
project-6017                                                      Active
project-6018                                                      Active
project-6019                                                      Active
project-6020                                                      Active
project-6021                                                      Active
project-6022                                                      Active
.
.
.
.
[miyadav@miyadav ~]$ oc debug node/ip-10-0-164-204.us-east-2.compute.internal
Starting pod/ip-10-0-164-204us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.164.204
If you don't see a command prompt, try pressing enter.
chroot /host
sh-4.4# 
sh-4.4# vi /tmp/kubeconfig
sh-4.4# oc get nodes --kubeconfig /tmp/kubeconfig 
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-128-231.us-east-2.compute.internal   Ready    worker   54m   v1.23.5+1f952b3
ip-10-0-147-246.us-east-2.compute.internal   Ready    master   58m   v1.23.5+1f952b3
ip-10-0-164-204.us-east-2.compute.internal   Ready    master   57m   v1.23.5+1f952b3
ip-10-0-172-19.us-east-2.compute.internal    Ready    worker   52m   v1.23.5+1f952b3
ip-10-0-198-56.us-east-2.compute.internal    Ready    master   59m   v1.23.5+1f952b3
ip-10-0-205-71.us-east-2.compute.internal    Ready    worker   54m   v1.23.5+1f952b3
sh-4.4# for i in {5000..7125}; do oc new-project project-$i --kubeconfig /tmp/kubeconfig; oc create configmap project-$i --from-file=/etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt --kubeconfig /tmp/kubeconfig; done


.
.
.

Additional info:

The script above was run on a master node.

[miyadav@miyadav ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.8    True        False         120m    Cluster version is 4.10.8


Can you confirm whether these results look good, or do we need to check any other info as well?

Comment 6 Milind Yadav 2022-04-06 10:16:19 UTC
In addition to comment #5 above, I did not see "lost master" in the logs.

Comment 7 Michael McCune 2022-04-06 14:44:22 UTC
> Needed to confirm if these results looks good , or we need to check any other info as well ?

I think Simon is better suited to answer that for the cluster. From the perspective of the cluster-autoscaler, I think as long as it doesn't crash and hands over leadership properly, it's successful for me.
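
For anyone reproducing this check, a minimal sketch of what "doesn't crash and hands over leadership properly" translates to in practice (the `openshift-machine-api` namespace and `cluster-autoscaler-default` deployment name are taken from the logs in this bug; the flag inspection is illustrative and the exact leader-election flag names may differ in the shipped fix):

# Watch the autoscaler and operator pods for unexpected restarts during the defrag window.
oc -n openshift-machine-api get pods -w | grep cluster-autoscaler

# Inspect the deployed arguments to see whether leader-election tuning from the
# linked PR landed on the deployment.
oc -n openshift-machine-api get deployment cluster-autoscaler-default \
  -o jsonpath='{.spec.template.spec.containers[0].args}'

# Follow the autoscaler logs for lease acquisition/handover messages.
oc -n openshift-machine-api logs -f deployment/cluster-autoscaler-default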

Comment 8 Michael McCune 2022-04-06 14:45:51 UTC
sorry, didn't mean to clear the needinfo

@sreber did you have any details to add to Milind's question?

Comment 9 Simon Reber 2022-04-06 15:03:46 UTC
(In reply to Milind Yadav from comment #6)
> in addition to above comment#5  I did not see "lost master" in logs ..
I think this looks good. What you want to make sure of is that the `cluster-autoscaler-default-*` pod is not restarting at any time. More specifically, once you have the setup from https://bugzilla.redhat.com/show_bug.cgi?id=2069095#c5, you can follow https://docs.openshift.com/container-platform/4.8/post_installation_configuration/cluster-tasks.html#etcd-defrag_post-install-cluster-tasks to trigger `etcd defrag` manually (the automated controller only triggers it when 45% of etcd is considered fragmented). Run it for each `etcd` member (the leader last) and verify whether `cluster-autoscaler-default-*` restarts or not. If it remains stable and working (does not restart), we can consider the issue tracked in this Red Hat Support Case addressed to the extent possible/feasible.

The key is to verify the above, but also to make sure that the rest of the ClusterAutoscaler functionality continues to work as expected.
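
For reference, a condensed sketch of that manual defrag flow (the same commands appear in the linked docs and in the transcript in comment #10; replace <etcd-pod> with each member's pod name, defragmenting the leader last):

# List the etcd member pods and see which one is the current leader.
oc get pods -n openshift-etcd -o wide | grep -v quorum-guard | grep etcd
oc rsh -n openshift-etcd <etcd-pod> etcdctl endpoint status --cluster -w table

# Inside each member pod in turn (leader last): limit etcdctl to the local
# endpoint, then defragment it.
oc rsh -n openshift-etcd <etcd-pod>
unset ETCDCTL_ENDPOINTS
etcdctl --command-timeout=30s --endpoints=https://localhost:2379 defrag

# After each defrag, check that cluster-autoscaler-default has not restarted.
oc -n openshift-machine-api get pods | grep cluster-autoscaler-default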

Comment 10 Milind Yadav 2022-04-07 04:52:34 UTC
Thanks, Simon, for the references.

In parallel with running the defragmentation, I ran a workload that caused the cluster to scale, and it looks good, with no unexpected crashes.

Cluster-autoscaler-operator logs:
.
.
.
.
I0407 04:10:42.991317       1 validator.go:161] Validation webhook called for ClusterAutoscaler: default
I0407 04:10:42.995473       1 clusterautoscaler_controller.go:149] Reconciling ClusterAutoscaler default
I0407 04:10:43.048466       1 clusterautoscaler_controller.go:211] Ensured ClusterAutoscaler monitoring
I0407 04:10:43.048489       1 clusterautoscaler_controller.go:270] Creating ClusterAutoscaler deployment: openshift-machine-api/cluster-autoscaler-default
I0407 04:10:43.054665       1 clusterautoscaler_controller.go:224] Created ClusterAutoscaler deployment: openshift-machine-api/cluster-autoscaler-default
I0407 04:10:43.054772       1 clusterautoscaler_controller.go:149] Reconciling ClusterAutoscaler default
I0407 04:10:43.061560       1 clusterautoscaler_controller.go:211] Ensured ClusterAutoscaler monitoring
I0407 04:10:43.067814       1 clusterautoscaler_controller.go:239] Updated ClusterAutoscaler deployment: openshift-machine-api/cluster-autoscaler-default
I0407 04:10:43.093271       1 clusterautoscaler_controller.go:149] Reconciling ClusterAutoscaler default
I0407 04:10:43.097483       1 clusterautoscaler_controller.go:211] Ensured ClusterAutoscaler monitoring
I0407 04:10:43.102242       1 clusterautoscaler_controller.go:239] Updated ClusterAutoscaler deployment: openshift-machine-api/cluster-autoscaler-default
I0407 04:10:43.107606       1 clusterautoscaler_controller.go:149] Reconciling ClusterAutoscaler default
I0407 04:10:43.113883       1 clusterautoscaler_controller.go:211] Ensured ClusterAutoscaler monitoring
I0407 04:10:43.121381       1 clusterautoscaler_controller.go:239] Updated ClusterAutoscaler deployment: openshift-machine-api/cluster-autoscaler-default
I0407 04:10:43.161828       1 clusterautoscaler_controller.go:149] Reconciling ClusterAutoscaler default
I0407 04:10:43.166709       1 clusterautoscaler_controller.go:211] Ensured ClusterAutoscaler monitoring
I0407 04:10:43.171402       1 clusterautoscaler_controller.go:239] Updated ClusterAutoscaler deployment: openshift-machine-api/cluster-autoscaler-default
I0407 04:10:48.229330       1 clusterautoscaler_controller.go:149] Reconciling ClusterAutoscaler default
I0407 04:10:48.235000       1 clusterautoscaler_controller.go:211] Ensured ClusterAutoscaler monitoring
I0407 04:10:48.242293       1 clusterautoscaler_controller.go:239] Updated ClusterAutoscaler deployment: openshift-machine-api/cluster-autoscaler-default
I0407 04:10:59.467780       1 validator.go:58] Validation webhook called for MachineAutoscaler: mas1
I0407 04:10:59.471528       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:10:59.486305       1 validator.go:58] Validation webhook called for MachineAutoscaler: mas1
I0407 04:10:59.507874       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:10:59.514366       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:22:13.219911       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:22:13.365730       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:22:13.379363       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:24:59.596571       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:40:17.355885       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
I0407 04:40:17.381273       1 machineautoscaler_controller.go:179] Reconciling MachineAutoscaler openshift-machine-api/mas1
.
.
.

Defragmentation :
[miyadav@miyadav ~]$  oc get pods -n openshift-etcd -o wide | grep -v quorum-guard | grep etcd
etcd-ip-10-0-138-10.us-east-2.compute.internal                 4/4     Running     0          87m    10.0.138.10    ip-10-0-138-10.us-east-2.compute.internal    <none>           <none>
etcd-ip-10-0-163-254.us-east-2.compute.internal                4/4     Running     0          90m    10.0.163.254   ip-10-0-163-254.us-east-2.compute.internal   <none>           <none>
etcd-ip-10-0-221-0.us-east-2.compute.internal                  4/4     Running     0          89m    10.0.221.0     ip-10-0-221-0.us-east-2.compute.internal     <none>           <none>
[miyadav@miyadav ~]$  oc rsh -n openshift-etcd etcd-ip-10-0-163-254.us-east-2.compute.internal etcdctl endpoint status --cluster -w table
Defaulted container "etcdctl" out of: etcdctl, etcd, etcd-metrics, etcd-health-monitor, setup (init), etcd-ensure-env-vars (init), etcd-resources-copy (init)
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|   https://10.0.221.0:2379 | 7f49a444454d5fa9 |   3.5.0 |  2.8 GB |     false |      false |        10 | ... (remaining output truncated)
[miyadav@miyadav ~]$  oc rsh -n openshift-etcd etcd-ip-10-0-163-254.us-east-2.compute.internal
Defaulted container "etcdctl" out of: etcdctl, etcd, etcd-metrics, etcd-health-monitor, setup (init), etcd-ensure-env-vars (init), etcd-resources-copy (init)
sh-4.4# unset ETCDCTL_ENDPOINTS
sh-4.4# etcdctl --command-timeout=30s --endpoints=https://localhost:2379 defrag
Finished defragmenting etcd member[https://localhost:2379]
sh-4.4#  etcdctl endpoint status -w table --cluster
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|   https://10.0.221.0:2379 | 7f49a444454d5fa9 |   3.5.0 |  3.0 GB |     false |      false |        10 |     577845 |             577845 |        |
| https://10.0.163.254:2379 | 96592f389ac22ff2 |   3.5.0 |  2.7 GB |      true |      false |        10 |     577846 |             577846 |        |
|  https://10.0.138.10:2379 | f033163df9ebfad9 |   3.5.0 |  3.0 GB |     false |      false |        10 |     577849 |             577849 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

The cluster scaled up, adding a new machine, and then scaled down after the jobs completed successfully (a sketch of the kind of workload used follows the output below).

[miyadav@miyadav ~]$ oc get jobs
NAME               COMPLETIONS   DURATION   AGE
work-queue-4scfk   100/100       8m1s       14m
[miyadav@miyadav ~]$ oc get machines
NAME                                         PHASE     TYPE         REGION      ZONE         AGE
miyadav-0704-czr6m-master-0                  Running   m5.4xlarge   us-east-2   us-east-2a   99m
miyadav-0704-czr6m-master-1                  Running   m5.4xlarge   us-east-2   us-east-2b   99m
miyadav-0704-czr6m-master-2                  Running   m5.4xlarge   us-east-2   us-east-2c   99m
miyadav-0704-czr6m-worker-us-east-2a-lb7rj   Running   m5.4xlarge   us-east-2   us-east-2a   14m
miyadav-0704-czr6m-worker-us-east-2a-ww8lz   Running   m5.4xlarge   us-east-2   us-east-2a   97m
miyadav-0704-czr6m-worker-us-east-2b-vcbth   Running   m5.4xlarge   us-east-2   us-east-2b   97m
miyadav-0704-czr6m-worker-us-east-2c-n786l   Running   m5.4xlarge   us-east-2   us-east-2c   97m
[miyadav@miyadav ~]$ oc get machines
NAME                                         PHASE     TYPE         REGION      ZONE         AGE
miyadav-0704-czr6m-master-0                  Running   m5.4xlarge   us-east-2   us-east-2a   102m
miyadav-0704-czr6m-master-1                  Running   m5.4xlarge   us-east-2   us-east-2b   102m
miyadav-0704-czr6m-master-2                  Running   m5.4xlarge   us-east-2   us-east-2c   102m
miyadav-0704-czr6m-worker-us-east-2a-lb7rj   Running   m5.4xlarge   us-east-2   us-east-2a   17m
miyadav-0704-czr6m-worker-us-east-2a-ww8lz   Running   m5.4xlarge   us-east-2   us-east-2a   100m
miyadav-0704-czr6m-worker-us-east-2b-vcbth   Running   m5.4xlarge   us-east-2   us-east-2b   100m
miyadav-0704-czr6m-worker-us-east-2c-n786l   Running   m5.4xlarge   us-east-2   us-east-2c   100m
[miyadav@miyadav ~]$ oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-138-10.us-east-2.compute.internal    Ready    master   102m   v1.23.5+1f952b3
ip-10-0-138-112.us-east-2.compute.internal   Ready    worker   96m    v1.23.5+1f952b3
ip-10-0-144-206.us-east-2.compute.internal   Ready    worker   14m    v1.23.5+1f952b3
ip-10-0-163-254.us-east-2.compute.internal   Ready    master   101m   v1.23.5+1f952b3
ip-10-0-173-223.us-east-2.compute.internal   Ready    worker   97m    v1.23.5+1f952b3
ip-10-0-193-68.us-east-2.compute.internal    Ready    worker   96m    v1.23.5+1f952b3
ip-10-0-221-0.us-east-2.compute.internal     Ready    master   101m   v1.23.5+1f952b3
[miyadav@miyadav ~]$ oc get machines
NAME                                         PHASE     TYPE         REGION      ZONE         AGE
miyadav-0704-czr6m-master-0                  Running   m5.4xlarge   us-east-2   us-east-2a   108m
miyadav-0704-czr6m-master-1                  Running   m5.4xlarge   us-east-2   us-east-2b   108m
miyadav-0704-czr6m-master-2                  Running   m5.4xlarge   us-east-2   us-east-2c   108m
miyadav-0704-czr6m-worker-us-east-2a-ww8lz   Running   m5.4xlarge   us-east-2   us-east-2a   106m
miyadav-0704-czr6m-worker-us-east-2b-vcbth   Running   m5.4xlarge   us-east-2   us-east-2b   106m
miyadav-0704-czr6m-worker-us-east-2c-n786l   Running   m5.4xlarge   us-east-2   us-east-2c   106m
.
.
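
For reference, a hypothetical example of the kind of work-queue style batch Job used to drive this scale-up (image, completion count, and resource requests below are illustrative only; the actual test workload may differ):

oc create -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  generateName: work-queue-
  namespace: default
spec:
  completions: 100
  parallelism: 100
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: work
        # Each pod holds its CPU request for a few minutes so pending pods pile
        # up and the ClusterAutoscaler has to add a machine.
        image: registry.access.redhat.com/ubi8/ubi-minimal
        command: ["sleep", "300"]
        resources:
          requests:
            cpu: "2"
            memory: 500Mi
EOF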
Additional info:
Based on the results above, moving to VERIFIED.

Comment 16 errata-xmlrpc 2022-04-21 13:16:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.10 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1356

