Bug 2063194 - cluster-autoscaler-default will fail when automated etcd defrag is running on a large-scale OpenShift Container Platform 4 cluster
Summary: cluster-autoscaler-default will fail when automated etcd defrag is running on...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.9
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Michael McCune
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks: 2069095
 
Reported: 2022-03-11 13:55 UTC by Simon Reber
Modified: 2022-10-12 09:15 UTC
CC: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The Cluster Autoscaler Operator did not set leader election flags when deploying the Cluster Autoscaler. Consequence: The Cluster Autoscaler could unexpectedly panic and exit during cluster restart events, such as an etcd defragmentation. Fix: The Cluster Autoscaler Operator now deploys the Cluster Autoscaler with well-defined leader election flags. Result: The Cluster Autoscaler no longer panics and exits prematurely when cluster nodes reboot.
Clone Of:
Environment:
Last Closed: 2022-08-10 10:53:18 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
cluster-autoscaler-default log when the pod is crashing (191.94 KB, text/plain)
2022-03-11 13:55 UTC, Simon Reber


Links
GitHub openshift/cluster-autoscaler-operator pull 241: Bug 2063194: add leader election flags to autoscaler deployment (open, last updated 2022-03-11 17:04:35 UTC)
Red Hat Knowledge Base (Solution) 6840041 (last updated 2022-03-24 11:19:15 UTC)
Red Hat Product Errata RHSA-2022:5069 (last updated 2022-08-10 10:54:10 UTC)

Description Simon Reber 2022-03-11 13:55:31 UTC
Created attachment 1865468 [details]
cluster-autoscaler-default log when the pod is crashing

Description of problem:

Starting with OpenShift Container Platform 4.9, automated etcd defragmentation was introduced (https://docs.openshift.com/container-platform/4.9/release_notes/ocp-4-9-release-notes.html#ocp-4-9-notable-technical-changes). On a large-scale OpenShift Container Platform 4 cluster, the defragmentation takes some time, which causes the cluster-autoscaler-default pod to fail and restart.

E0311 13:42:41.540166       1 leaderelection.go:330] error retrieving resource lock openshift-machine-api/cluster-autoscaler: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-autoscaler": context deadline exceeded
I0311 13:42:41.540217       1 leaderelection.go:283] failed to renew lease openshift-machine-api/cluster-autoscaler: timed out waiting for the condition
E0311 13:42:41.540259       1 leaderelection.go:306] Failed to release lock: resource name may not be empty
F0311 13:42:41.540267       1 main.go:457] lost master


Version-Release number of selected component (if applicable):

 - OpenShift Container Platform 4.9.23

How reproducible:

 - Always

Steps to Reproduce:
1. Set up OpenShift Container Platform on AWS with master and worker nodes of type m5.4xlarge
2. Install the Elasticsearch, Logging, Jaeger, NFD, GitOps, Pipelines, Kiali, and Service Mesh Operators
3. Run `for i in {5000..7125}; do oc new-project project-$i; oc create configmap project-$i --from-file=/etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt; done`
   Note that `/etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt` is about 264 KB in size, so a file of roughly that size should be used; see the sketch below for one way to generate a stand-in.
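
If that CA bundle is not available or differs a lot in size, a rough sketch for generating a stand-in payload of similar size (the file path and ConfigMap key below are arbitrary):

```
# generate a ~270 KB text payload (base64 keeps it ConfigMap-friendly)
head -c 200K /dev/urandom | base64 > /tmp/payload.txt

for i in {5000..7125}; do
  oc new-project "project-$i"
  oc create configmap "project-$i" --from-file=ca-bundle.trust.crt=/tmp/payload.txt
done
```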

Actual results:

The `cluster-autoscaler-default` pod will eventually fail when the `etcd` defragmentation takes more time because of the large `etcd` database.

Expected results:

As etcd defragmentation is automated by the platform, this should not happen; the cluster-autoscaler-default pod should be able to cope with defragmentation activity.

Additional info:

RHBZ https://bugzilla.redhat.com/show_bug.cgi?id=2063183 may be of interest
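
To correlate the restarts with the defragmentation, something along these lines can be used (the pod name prefix, the etcd label/container names and the "defrag" log pattern are assumptions about the default naming):

```
# watch the autoscaler pod for restarts
oc -n openshift-machine-api get pods -w | grep cluster-autoscaler-default

# pull the log of the previous (crashed) container to see the "lost master" fatal
POD=$(oc -n openshift-machine-api get pods -o name | grep cluster-autoscaler-default)
oc -n openshift-machine-api logs "$POD" --previous | tail -n 40

# check whether etcd was defragmenting around the same time
oc -n openshift-etcd logs -l app=etcd -c etcd --tail=-1 | grep -i defrag
```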

Comment 3 Michael McCune 2022-03-11 15:11:23 UTC
i took a quick look at this and it seems like there is a bug in the autoscaler related to the leader election and etcd not being reachable. i see a bunch of nasty looking panics in the autoscaler logs.

for example

```
2022-03-11T12:22:00.768444647Z E0311 12:22:00.768408       1 leaderelection.go:367] Failed to update lock: etcdserver: request timed out
2022-03-11T12:22:03.756139437Z E0311 12:22:03.756086       1 leaderelection.go:330] error retrieving resource lock openshift-machine-api/cluster-autoscaler: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-autoscaler": context deadline exceeded
2022-03-11T12:22:03.756183099Z I0311 12:22:03.756150       1 leaderelection.go:283] failed to renew lease openshift-machine-api/cluster-autoscaler: timed out waiting for the condition
2022-03-11T12:22:03.756212041Z E0311 12:22:03.756198       1 leaderelection.go:306] Failed to release lock: resource name may not be empty
2022-03-11T12:22:03.756212041Z F0311 12:22:03.756207       1 main.go:457] lost master
2022-03-11T12:22:03.759130650Z goroutine 1 [running]:
2022-03-11T12:22:03.759130650Z k8s.io/klog/v2.stacks(0xc000010001, 0xc00087c690, 0x37, 0xe3)
2022-03-11T12:22:03.759130650Z 	/go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1026 +0xb9
2022-03-11T12:22:03.759130650Z k8s.io/klog/v2.(*loggingT).output(0x616da00, 0xc000000003, 0x0, 0x0, 0xc002944310, 0x0, 0x510b995, 0x7, 0x1c9, 0x0)
2022-03-11T12:22:03.759130650Z 	/go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:975 +0x1e5
2022-03-11T12:22:03.759130650Z k8s.io/klog/v2.(*loggingT).printf(0x616da00, 0x3, 0x0, 0x0, 0x0, 0x0, 0x4105596, 0xb, 0x0, 0x0, ...)
2022-03-11T12:22:03.759130650Z 	/go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:753 +0x19a
2022-03-11T12:22:03.759130650Z k8s.io/klog/v2.Fatalf(...)
2022-03-11T12:22:03.759130650Z 	/go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1514
2022-03-11T12:22:03.759130650Z main.main.func3()
2022-03-11T12:22:03.759130650Z 	/go/src/k8s.io/autoscaler/cluster-autoscaler/main.go:457 +0x8f
2022-03-11T12:22:03.759130650Z k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1(0xc000c46a20)
2022-03-11T12:22:03.759130650Z 	/go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:203 +0x29
2022-03-11T12:22:03.759130650Z k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc000c46a20, 0x46c94e8, 0xc000c3c080)
2022-03-11T12:22:03.759130650Z 	/go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:213 +0x167
2022-03-11T12:22:03.759130650Z k8s.io/client-go/tools/leaderelection.RunOrDie(0x46c94e8, 0xc000056088, 0x46fb300, 0xc0003a4780, 0x37e11d600, 0x2540be400, 0x77359400, 0xc000c45d00, 0x42504e0, 0x0, ...)
2022-03-11T12:22:03.759130650Z 	/go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:226 +0x9f
2022-03-11T12:22:03.759130650Z main.main()
2022-03-11T12:22:03.759130650Z 	/go/src/k8s.io/autoscaler/cluster-autoscaler/main.go:444 +0x878
2022-03-11T12:22:03.759130650Z 
2022-03-11T12:22:03.759130650Z goroutine 6 [chan receive]:
2022-03-11T12:22:03.759130650Z k8s.io/klog/v2.(*loggingT).flushDaemon(0x616da00)
2022-03-11T12:22:03.759130650Z 	/go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1169 +0x8b
2022-03-11T12:22:03.759130650Z created by k8s.io/klog/v2.init.0
2022-03-11T12:22:03.759130650Z 	/go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:420 +0xdf
```

Comment 4 Michael McCune 2022-03-11 15:13:31 UTC
it looks like we are not setting the leader elect flags properly on the autoscaler. so, more a bug on our part when deploying the autoscaler. i'll make a patch for the CAO.
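
for reference, the autoscaler binary takes the standard client-go leader election flags, so the fix is about passing sensible values through the deployment. one way to check what the operator currently sets (the deployment name assumes a ClusterAutoscaler resource named "default"):

```
oc -n openshift-machine-api get deployment cluster-autoscaler-default \
  -o jsonpath='{.spec.template.spec.containers[0].args}' | tr ',' '\n'

# flags of interest -- the values below are illustrative only, not necessarily
# what the patch will end up using:
#   --leader-elect=true
#   --leader-elect-lease-duration=137s
#   --leader-elect-renew-deadline=107s
#   --leader-elect-retry-period=26s
```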

Comment 5 Michael McCune 2022-03-11 17:09:01 UTC
@sreber i have created a patch to help address this, but due to the complexity of the setup and my unfamiliarity with some of the deployments (pretty much step 2), i most likely won't be able to manually test it until next week. i did find that we had the defaults set for the leader election, but those values are well below what we expect for services in openshift. i have updated the values to be more tolerant to the types of disruptions we expect in the cluster.

if you have a cluster setup, or can set one up, that can test this patch out, i would be happy to coordinate with you.

Comment 6 Simon Reber 2022-03-11 17:50:34 UTC
(In reply to Michael McCune from comment #5)
> @sreber i have created a patch to help address this, but due to
> the complexity of the setup and my unfamiliarity with some of the
> deployments (pretty much step 2), i most likely won't be able to manually
> test it until next week. i did find that we had the defaults set for the
> leader election, but those values are well below what we expect for services
> in openshift. i have updated the values to be more tolerant to the types of
> disruptions we expect in the cluster.
> 
> if you have a cluster setup, or can set one up, that can test this patch
> out, i would be happy to coordinate with you.
The OpenShift Container Platform 4.9.23 cluster is still running and will hopefully continue to run for a while. So if you have an image available somewhere that I could try, I'd be happy to do so and report back whether it works. Of course I can't verify it the way QE can, but I should be able to see whether the behaviour during etcd defragmentation improves.

Comment 7 Michael McCune 2022-03-11 21:49:55 UTC
ok, maybe we can connect next week. i'd prefer not to spin a custom image, but maybe i can configure a cluster to match the expected deployment, or would it be possible for me to access the 4.9.23 cluster?

Comment 8 Michael McCune 2022-03-16 20:34:41 UTC
Simon and i tested out the patch here and it alleviates the restarting issue for the cluster autoscaler.

Comment 11 sunzhaohua 2022-03-18 08:24:34 UTC
Based on Comment 8, I think we can move this to verified.

Comment 14 Michael McCune 2022-03-28 13:17:37 UTC
this change is being backported to 4.10, and i have started the process of backporting to 4.9[0] as well.

[0] https://github.com/openshift/cluster-autoscaler-operator/pull/243#issuecomment-1080637216

Comment 16 errata-xmlrpc 2022-08-10 10:53:18 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

