Description of problem:
If a launch configuration is created with an unsupported instance type, like "m5d.2xlarge" or "m5d.large", the cluster autoscaler goes down.

Version-Release number of selected component (if applicable):
[ec2-user@ip-172-18-9-241 ~]$ oc version
oc v3.11.11
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

How reproducible:
Always

Steps to Reproduce:
1. Install the cluster autoscaler using a launch configuration whose instance type is "m5d.2xlarge"
2. Create pods to scale up
3. $ oc get pod -n cluster-autoscaler

Actual results:
[ec2-user@ip-172-18-9-241 ~]$ oc get pod
NAME                                  READY   STATUS             RESTARTS   AGE
cluster-autoscaler-74c574c6b8-7h686   0/1     CrashLoopBackOff   6          10m
scale-up-686dd75594-25flf             1/1     Running            0          10m
scale-up-686dd75594-99wnt             0/1     Pending            0          10m
scale-up-686dd75594-jjfct             1/1     Running            0          10m

[ec2-user@ip-172-18-9-241 ~]$ oc logs -f cluster-autoscaler-74c574c6b8-7h686
...
I0921 07:10:15.463743 1 scale_up.go:59] Pod cluster-autoscaler/scale-up-686dd75594-n7rdj is unschedulable
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1407de1]

goroutine 76 [running]:
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/aws.(*AwsManager).buildNodeFromTemplate(0xc420b2b220, 0xc420d8e000, 0xc4215036c0, 0xc4215036c0, 0x0, 0x0)
    /builddir/build/BUILD/atomic-openshift-cluster-autoscaler-git-0.7c05662/_output/local/go/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/aws/aws_manager.go:407 +0x411
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/aws.(*Asg).TemplateNodeInfo(0xc420d8e000, 0xc421439a40, 0x7ffe23cb5c09, 0x10)
    /builddir/build/BUILD/atomic-openshift-cluster-autoscaler-git-0.7c05662/_output/local/go/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go:294 +0x78
k8s.io/autoscaler/cluster-autoscaler/core.GetNodeInfosForGroups(0xc4206e6a60, 0x3, 0x4, 0x5a7db80, 0xc42052a6e0, 0x5a8a6e0, 0xc4200f01e0, 0xc421502240, 0x7, 0x8, ...)
    /builddir/build/BUILD/atomic-openshift-cluster-autoscaler-git-0.7c05662/_output/local/go/src/k8s.io/autoscaler/cluster-autoscaler/core/utils.go:228 +0x2f7
k8s.io/autoscaler/cluster-autoscaler/core.ScaleUp(0xc42109fdc0, 0xc4217fbd00, 0x3, 0x4, 0xc4206e6a60, 0x3, 0x4, 0xc421502240, 0x7, 0x8, ...)
    /builddir/build/BUILD/atomic-openshift-cluster-autoscaler-git-0.7c05662/_output/local/go/src/k8s.io/autoscaler/cluster-autoscaler/core/scale_up.go:62 +0x389
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).RunOnce(0xc420464500, 0xbee14435b18d44a4, 0x28bbcbf60, 0x5c72d00, 0x0, 0x0)
    /builddir/build/BUILD/atomic-openshift-cluster-autoscaler-git-0.7c05662/_output/local/go/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:297 +0x2794
main.run(0xc420b2bb80)
    /builddir/build/BUILD/atomic-openshift-cluster-autoscaler-git-0.7c05662/_output/local/go/src/k8s.io/autoscaler/cluster-autoscaler/main.go:269 +0x494
main.main.func2(0xc4204f0a20)
    /builddir/build/BUILD/atomic-openshift-cluster-autoscaler-git-0.7c05662/_output/local/go/src/k8s.io/autoscaler/cluster-autoscaler/main.go:356 +0x2a
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
    /builddir/build/BUILD/atomic-openshift-cluster-autoscaler-git-0.7c05662/_output/local/go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:145 +0x92

Expected results:
The cluster autoscaler works normally.

Additional info:
Currently supported instance types: https://github.com/openshift/kubernetes-autoscaler/blob/release-3.11/cluster-autoscaler/cloudprovider/aws/ec2_instance_types.go#L29
I just hit the same bug, and we're using a supported instance type, r4.4xlarge.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x14075eb]

goroutine 70 [running]:
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/aws.(*AwsManager).getAsgTemplate(0xc420ae9770, 0x7ffc6fa74b6e, 0x46, 0x1, 0x1, 0x0)
    /builddir/build/BUILD/atomic-openshift-cluster-autoscaler-git-0.8c8305e/_output/local/go/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/aws/aws_manager.go:367 +0x9b
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/aws.(*Asg).TemplateNodeInfo(0xc42158de00, 0xc4226aff80, 0x7ffc6fa74b6e, 0x46)
    /builddir/build/BUILD/atomic-openshift-cluster-autoscaler-git-0.8c8305e/_output/local/go/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go:289 +0x44
k8s.io/autoscaler/cluster-autoscaler/core.GetNodeInfosForGroups(0xc420273e00, 0x10, 0x10, 0x5a7db80, 0xc42033a290, 0x5a8a6e0, 0xc4200f41e0, 0xc4201fee00, 0x9, 0x10, ...)
    /builddir/build/BUILD/atomic-openshift-cluster-autoscaler-git-0.8c8305e/_output/local/go/src/k8s.io/autoscaler/cluster-autoscaler/core/utils.go:228 +0x2f7
k8s.io/autoscaler/cluster-autoscaler/core.ScaleUp(0xc421d61500, 0xc4228ea200, 0x1d, 0x20, 0xc420273e00, 0x10, 0x10, 0xc4201fee00, 0x9, 0x10, ...)
    /builddir/build/BUILD/atomic-openshift-cluster-autoscaler-git-0.8c8305e/_output/local/go/src/k8s.io/autoscaler/cluster-autoscaler/core/scale_up.go:62 +0x389
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).RunOnce(0xc4202fcb00, 0xbefa3806a2c91dad, 0xcb1f20c2ed, 0x5c72d00, 0x0, 0x0)
    /builddir/build/BUILD/atomic-openshift-cluster-autoscaler-git-0.8c8305e/_output/local/go/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:297 +0x2794
main.run(0xc420ae9e50)
    /builddir/build/BUILD/atomic-openshift-cluster-autoscaler-git-0.8c8305e/_output/local/go/src/k8s.io/autoscaler/cluster-autoscaler/main.go:269 +0x494
main.main.func2(0xc42036aa20)
    /builddir/build/BUILD/atomic-openshift-cluster-autoscaler-git-0.8c8305e/_output/local/go/src/k8s.io/autoscaler/cluster-autoscaler/main.go:356 +0x2a
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
    /builddir/build/BUILD/atomic-openshift-cluster-autoscaler-git-0.8c8305e/_output/local/go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:145 +0x92
CJ Oster, does it happen every time? It might be a temporary glitch in the AWS endpoint itself. Though it certainly should not panic.
> If a launch configuration is created with an unsupported instance type, like "m5d.2xlarge" or "m5d.large", the cluster autoscaler goes down.

> goroutine 76 [running]:
> k8s.io/autoscaler/cluster-autoscaler/cloudprovider/aws.(*AwsManager).buildNodeFromTemplate(0xc420b2b220, 0xc420d8e000, 0xc4215036c0, 0xc4215036c0, 0x0, 0x0)
>     /builddir/build/BUILD/atomic-openshift-cluster-autoscaler-git-0.7c05662/_output/local/go/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/aws/aws_manager.go:407 +0x411
> k8s.io/autoscaler/cluster-autoscaler/cloudprovider/aws.(*Asg).TemplateNodeInfo(0xc420d8e000, 0xc421439a40, 0x7ffe23cb5c09, 0x10)
>     /builddir/build/BUILD/atomic-openshift-cluster-autoscaler-git-0.7c05662/_output/local/go/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go:294 +0x78

Since neither the "m5d.2xlarge" nor the "m5d.large" instance type is available in the latest 3.11 cluster autoscaler release, the autoscaler returns a nil pointer of type instanceType [1]. Accessing the `VCPU` field of that nil pointer then panics [2]. The cluster autoscaler should instead report an error saying the instance type is not available.

The issue is fixed upstream by https://github.com/kubernetes/autoscaler/pull/1425

[1] https://github.com/openshift/kubernetes-autoscaler/blob/e9e93d3e72b101e1aeae419cc5361b0c9c5ae134/cluster-autoscaler/cloudprovider/aws/ec2_instance_types.go#L21
[2] https://github.com/openshift/kubernetes-autoscaler/blob/e9e93d3e72b101e1aeae419cc5361b0c9c5ae134/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L407
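For reference, a minimal standalone Go sketch of the kind of guard the upstream fix adds: look the instance type up in the generated map and return an error instead of a nil pointer that later gets dereferenced. The trimmed `InstanceTypes` map and the `lookupInstanceType` helper here are illustrative stand-ins, not the actual patched functions; the error text mirrors what the verified build logs below report.

package main

import "fmt"

// instanceType mirrors the shape of the entries generated in
// cloudprovider/aws/ec2_instance_types.go (fields trimmed for illustration).
type instanceType struct {
    InstanceType string
    VCPU         int64
    MemoryMb     int64
}

// InstanceTypes stands in for the generated map; "m5d.2xlarge" is
// deliberately absent, as in the 3.11 release.
var InstanceTypes = map[string]*instanceType{
    "m4.xlarge": {InstanceType: "m4.xlarge", VCPU: 4, MemoryMb: 16384},
}

// lookupInstanceType returns an error for unknown types instead of a nil
// *instanceType whose VCPU field would later be read and panic.
func lookupInstanceType(asgName, name string) (*instanceType, error) {
    t, ok := InstanceTypes[name]
    if !ok {
        return nil, fmt.Errorf("ASG %q uses the unknown EC2 instance type %q", asgName, name)
    }
    return t, nil
}

func main() {
    if _, err := lookupInstanceType("zhsun-ASG1", "m5d.2xlarge"); err != nil {
        fmt.Println(err) // surfaced as a log line instead of a SIGSEGV
    }
}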
Upstream backport https://github.com/openshift/kubernetes-autoscaler/pull/19
Wrt. the other issue (not related to the first one):

> k8s.io/autoscaler/cluster-autoscaler/cloudprovider/aws.(*AwsManager).getAsgTemplate(0xc420ae9770, 0x7ffc6fa74b6e, 0x46, 0x1, 0x1, 0x0)
>     /builddir/build/BUILD/atomic-openshift-cluster-autoscaler-git-0.8c8305e/_output/local/go/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/aws/aws_manager.go:367 +0x9b

Backtracking through the code, there is only one variable that can be nil at [1], and it is asg.LaunchConfigurationName. Adding an additional check [2] will avoid the panic.

[1] https://github.com/openshift/kubernetes-autoscaler/blob/e9e93d3e72b101e1aeae419cc5361b0c9c5ae134/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L367
[2] https://github.com/openshift/kubernetes-autoscaler/pull/20

Though, the second issue is really not related to the first one. CJ Oster, can you report a new bug? The first issue is easy to reproduce; the second one can be random.
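A standalone sketch of that extra nil check, assuming the *string shape of LaunchConfigurationName on the AWS SDK's autoscaling.Group (the `group` type and `launchConfigName` helper are hypothetical stand-ins for illustration, not the code in the PR above):

package main

import "fmt"

// group mimics the relevant part of the AWS SDK's autoscaling.Group, where
// LaunchConfigurationName is a *string and is nil when the ASG is backed by
// a LaunchTemplate rather than a LaunchConfiguration.
type group struct {
    AutoScalingGroupName    *string
    LaunchConfigurationName *string
}

// launchConfigName checks the pointer before dereferencing it, so a
// LaunchTemplate-backed ASG produces an error instead of a SIGSEGV in
// getAsgTemplate.
func launchConfigName(asg *group) (string, error) {
    if asg.LaunchConfigurationName == nil {
        return "", fmt.Errorf("ASG %q has no launch configuration name", *asg.AutoScalingGroupName)
    }
    return *asg.LaunchConfigurationName, nil
}

func main() {
    name := "lt-backed-asg"
    asg := &group{AutoScalingGroupName: &name} // LaunchConfigurationName left nil
    if _, err := launchConfigName(asg); err != nil {
        fmt.Println(err)
    }
}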
https://github.com/openshift/kubernetes-autoscaler/pull/19 merged
> Though, the second issue is really not related to the first one. CJ Oster,
> can you report a new bug? The first issue is easy to reproduce; the second
> one can be random.

Yeah, ours turned out to be the fact that we were using LaunchTemplates instead of LaunchConfigurations, which we solved by changing back to LCs. Would this be something that can be back-ported?
With enough escalation, maybe. Can you elaborate more on when/why LaunchTemplates were removed, and what the expected behavior is?
Verified. Created a launch configuration using the unsupported instance type "m5d.2xlarge", then created pods to scale the cluster.

# oc logs -f cluster-autoscaler-7875fbccf9-ccsfv
I0117 09:44:55.711203 1 static_autoscaler.go:271] No schedulable pods
I0117 09:44:55.711242 1 scale_up.go:59] Pod cluster-autoscaler/scale-up-79684ff956-ln2fq is unschedulable
I0117 09:44:55.711258 1 scale_up.go:59] Pod cluster-autoscaler/scale-up-79684ff956-fxtwr is unschedulable
I0117 09:44:55.711263 1 scale_up.go:59] Pod cluster-autoscaler/scale-up-79684ff956-46rct is unschedulable
I0117 09:44:55.711275 1 scale_up.go:59] Pod cluster-autoscaler/scale-up-79684ff956-29g42 is unschedulable
I0117 09:44:55.711281 1 scale_up.go:59] Pod cluster-autoscaler/scale-up-79684ff956-dwt7p is unschedulable
I0117 09:44:55.711291 1 scale_up.go:59] Pod cluster-autoscaler/scale-up-79684ff956-qs799 is unschedulable
I0117 09:44:55.711308 1 scale_up.go:59] Pod cluster-autoscaler/scale-up-79684ff956-wcdmh is unschedulable
I0117 09:44:55.711322 1 scale_up.go:59] Pod cluster-autoscaler/scale-up-79684ff956-rmh8h is unschedulable
I0117 09:44:55.711328 1 scale_up.go:59] Pod cluster-autoscaler/scale-up-79684ff956-6t585 is unschedulable
I0117 09:44:55.711338 1 scale_up.go:59] Pod cluster-autoscaler/scale-up-79684ff956-4zhgz is unschedulable
I0117 09:44:55.711343 1 scale_up.go:59] Pod cluster-autoscaler/scale-up-79684ff956-psphw is unschedulable
I0117 09:44:55.711352 1 scale_up.go:59] Pod cluster-autoscaler/scale-up-79684ff956-5xrz8 is unschedulable
I0117 09:44:55.711358 1 scale_up.go:59] Pod cluster-autoscaler/scale-up-79684ff956-bj55c is unschedulable
E0117 09:44:55.778598 1 utils.go:233] Unable to build proper template node for zhsun-ASG1: ASG "zhsun-ASG1" uses the unknown EC2 instance type "m5d.large"
E0117 09:44:55.778639 1 static_autoscaler.go:302] Failed to scale up: failed to build node infos for node groups: ASG "zhsun-ASG1" uses the unknown EC2 instance type "m5d.large"
I0117 09:44:57.138431 1 leaderelection.go:199] successfully renewed lease cluster-autoscaler/cluster-autoscaler
I0117 09:44:59.142861 1 leaderelection.go:199] successfully renewed lease cluster-autoscaler/cluster-autoscaler

# oc get pod
NAME                                  READY   STATUS    RESTARTS   AGE
cluster-autoscaler-7875fbccf9-ccsfv   1/1     Running   0          2m
scale-up-79684ff956-29g42             0/1     Pending   0          2m
scale-up-79684ff956-46rct             0/1     Pending   0          2m
scale-up-79684ff956-4zhgz             0/1     Pending   0          2m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0096
I have no further information.