Created attachment 1865468 [details]
cluster-autoscaler-default log when the pod is crashing

Description of problem:

Starting with OpenShift Container Platform 4.9, automated etcd defragmentation was introduced (https://docs.openshift.com/container-platform/4.9/release_notes/ocp-4-9-release-notes.html#ocp-4-9-notable-technical-changes). On a large OpenShift Container Platform 4 cluster, the defragmentation takes some time, which causes the cluster-autoscaler-default pod to fail and restart.

E0311 13:42:41.540166 1 leaderelection.go:330] error retrieving resource lock openshift-machine-api/cluster-autoscaler: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-autoscaler": context deadline exceeded
I0311 13:42:41.540217 1 leaderelection.go:283] failed to renew lease openshift-machine-api/cluster-autoscaler: timed out waiting for the condition
E0311 13:42:41.540259 1 leaderelection.go:306] Failed to release lock: resource name may not be empty
F0311 13:42:41.540267 1 main.go:457] lost master

Version-Release number of selected component (if applicable):

- OpenShift Container Platform 4.9.23

How reproducible:

- Always

Steps to Reproduce:
1. Set up OpenShift Container Platform on AWS with masters and workers of type m5.4xlarge
2. Install the Elasticsearch, Logging, Jaeger, NFD, GitOps, Pipelines, Kiali and Service Mesh Operators
3. Run `for i in {5000..7125}; do oc new-project project-$i; oc create configmap project-$i --from-file=/etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt; done`

Note that `/etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt` is about 264K in size; a file of roughly that size should be used so that the `etcd` database grows large enough.

Actual results:

The `cluster-autoscaler-default` pod eventually fails when the `etcd` defragmentation takes longer than usual due to the large `etcd` database.

Expected results:

Since `etcd` defragmentation is automated by the platform, this should not happen; the `cluster-autoscaler-default` pod should be able to cope with a defragmentation run.

Additional info:

RHBZ https://bugzilla.redhat.com/show_bug.cgi?id=2063183 may be of interest
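To correlate the autoscaler restarts with the defragmentation on a live cluster, something along these lines can be used (a rough sketch only; `etcd-<master-node>` is a placeholder for an actual etcd pod name, and the container names assume the usual openshift-etcd layout):

```
# watch the lease the autoscaler must keep renewing; its renewTime stalling
# during a defragmentation is the symptom described above
oc -n openshift-machine-api get lease cluster-autoscaler -o yaml

# check the etcd database size on each member and look for defrag activity
# (replace etcd-<master-node> with a real pod name from `oc -n openshift-etcd get pods`)
oc -n openshift-etcd rsh -c etcdctl etcd-<master-node> etcdctl endpoint status --cluster -w table
oc -n openshift-etcd logs etcd-<master-node> -c etcd | grep -i defrag
```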
i took a quick look at this and it seems like there is a bug in the autoscaler related to leader election when etcd is not reachable. i see a bunch of nasty looking panics in the autoscaler logs, for example:

```
2022-03-11T12:22:00.768444647Z E0311 12:22:00.768408 1 leaderelection.go:367] Failed to update lock: etcdserver: request timed out
2022-03-11T12:22:03.756139437Z E0311 12:22:03.756086 1 leaderelection.go:330] error retrieving resource lock openshift-machine-api/cluster-autoscaler: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-autoscaler": context deadline exceeded
2022-03-11T12:22:03.756183099Z I0311 12:22:03.756150 1 leaderelection.go:283] failed to renew lease openshift-machine-api/cluster-autoscaler: timed out waiting for the condition
2022-03-11T12:22:03.756212041Z E0311 12:22:03.756198 1 leaderelection.go:306] Failed to release lock: resource name may not be empty
2022-03-11T12:22:03.756212041Z F0311 12:22:03.756207 1 main.go:457] lost master
2022-03-11T12:22:03.759130650Z goroutine 1 [running]:
2022-03-11T12:22:03.759130650Z k8s.io/klog/v2.stacks(0xc000010001, 0xc00087c690, 0x37, 0xe3)
2022-03-11T12:22:03.759130650Z   /go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1026 +0xb9
2022-03-11T12:22:03.759130650Z k8s.io/klog/v2.(*loggingT).output(0x616da00, 0xc000000003, 0x0, 0x0, 0xc002944310, 0x0, 0x510b995, 0x7, 0x1c9, 0x0)
2022-03-11T12:22:03.759130650Z   /go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:975 +0x1e5
2022-03-11T12:22:03.759130650Z k8s.io/klog/v2.(*loggingT).printf(0x616da00, 0x3, 0x0, 0x0, 0x0, 0x0, 0x4105596, 0xb, 0x0, 0x0, ...)
2022-03-11T12:22:03.759130650Z   /go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:753 +0x19a
2022-03-11T12:22:03.759130650Z k8s.io/klog/v2.Fatalf(...)
2022-03-11T12:22:03.759130650Z   /go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1514
2022-03-11T12:22:03.759130650Z main.main.func3()
2022-03-11T12:22:03.759130650Z   /go/src/k8s.io/autoscaler/cluster-autoscaler/main.go:457 +0x8f
2022-03-11T12:22:03.759130650Z k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1(0xc000c46a20)
2022-03-11T12:22:03.759130650Z   /go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:203 +0x29
2022-03-11T12:22:03.759130650Z k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc000c46a20, 0x46c94e8, 0xc000c3c080)
2022-03-11T12:22:03.759130650Z   /go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:213 +0x167
2022-03-11T12:22:03.759130650Z k8s.io/client-go/tools/leaderelection.RunOrDie(0x46c94e8, 0xc000056088, 0x46fb300, 0xc0003a4780, 0x37e11d600, 0x2540be400, 0x77359400, 0xc000c45d00, 0x42504e0, 0x0, ...)
2022-03-11T12:22:03.759130650Z   /go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:226 +0x9f
2022-03-11T12:22:03.759130650Z main.main()
2022-03-11T12:22:03.759130650Z   /go/src/k8s.io/autoscaler/cluster-autoscaler/main.go:444 +0x878
2022-03-11T12:22:03.759130650Z
2022-03-11T12:22:03.759130650Z goroutine 6 [chan receive]:
2022-03-11T12:22:03.759130650Z k8s.io/klog/v2.(*loggingT).flushDaemon(0x616da00)
2022-03-11T12:22:03.759130650Z   /go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1169 +0x8b
2022-03-11T12:22:03.759130650Z created by k8s.io/klog/v2.init.0
2022-03-11T12:22:03.759130650Z   /go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:420 +0xdf
```
it looks like we are not setting the leader-election flags properly on the autoscaler, so this is more a bug on our part when deploying the autoscaler. i'll make a patch for the cluster-autoscaler-operator (CAO).
@sreber i have created a patch to help address this, but due to the complexity of the setup and my unfamiliarity with some of the deployments (pretty much step 2), i most likely won't be able to manually test it until next week. i did find that we had the defaults set for the leader election, but those values are well below what we expect for services in openshift, so i have updated them to be more tolerant of the kinds of disruptions we expect in the cluster.

if you have a cluster set up, or can set one up, where this patch could be tested, i would be happy to coordinate with you.
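for context, the knobs involved are the standard client-go leader-election flags that the cluster-autoscaler binary exposes. as a rough sketch (the durations below are illustrative placeholders in line with what other openshift control-plane components commonly use, not necessarily the exact values in the patch), a more tolerant configuration on the autoscaler command line looks like:

```
# illustrative values only -- a longer lease duration and renew deadline let the
# autoscaler ride out a slow etcd defragmentation without losing its lease
# (all other autoscaler arguments unchanged)
cluster-autoscaler \
  --leader-elect=true \
  --leader-elect-lease-duration=137s \
  --leader-elect-renew-deadline=107s \
  --leader-elect-retry-period=26s
```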
(In reply to Michael McCune from comment #5)
> @sreber i have created a patch to help address this, but due to the
> complexity of the setup and my unfamiliarity with some of the deployments
> (pretty much step 2), i most likely won't be able to manually test it until
> next week. i did find that we had the defaults set for the leader election,
> but those values are well below what we expect for services in openshift,
> so i have updated them to be more tolerant of the kinds of disruptions we
> expect in the cluster.
> 
> if you have a cluster set up, or can set one up, where this patch could be
> tested, i would be happy to coordinate with you.

The OpenShift Container Platform 4.9.23 cluster is still running and will hopefully continue to run for a while. So if you have an image available somewhere that I could try, I'd be happy to do so and report back whether it works. Of course I can't verify it the way QE can, but I should be able to see whether things improve while an etcd defrag is happening.
ok, maybe we can connect next week. i'd prefer not to spin up a custom image, but maybe i can configure a cluster to match the expected deployment. or, would it be possible for me to access the 4.9.23 cluster?
Simon and i tested out the patch here and it alleviates the restarting issue for the cluster autoscaler.
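for anyone re-checking this on another cluster, a simple way to confirm (assuming the default `cluster-autoscaler-default` naming) is to watch the pod's restart count and the lease renewal across a defragmentation window:

```
# RESTARTS should stay at 0 even while etcd is being defragmented
oc -n openshift-machine-api get pods | grep cluster-autoscaler-default

# renewTime should keep advancing rather than timing out
oc -n openshift-machine-api get lease cluster-autoscaler -o jsonpath='{.spec.renewTime}{"\n"}'
```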
Based on Comment 8, I think we can move this to verified.
this change is being backported to 4.10, and i have started the process of backporting to 4.9[0] as well.

[0] https://github.com/openshift/cluster-autoscaler-operator/pull/243#issuecomment-1080637216
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069