Bug 1854907

Summary: Config logic for skip-nodes-with-local-storage is flawed
Product: OpenShift Container Platform Reporter: Marcel Härri <mharri>
Component: Cloud Compute Assignee: Michael McCune <mimccune>
Cloud Compute sub component: Other Providers QA Contact: sunzhaohua <zhsun>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: unspecified CC: mimccune
Version: 4.4 Keywords: UpcomingSprint
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Setting any of the ClusterAutoscaler resource values "balanceSimilarNodeGroups", "ignoreDaemonsetsUtilization", or "skipNodesWithLocalStorage" to "false".
Consequence: The false setting is not respected when the cluster autoscaler is deployed.
Fix: The cluster-autoscaler-operator has been patched to ensure these values are read properly when deploying the cluster-autoscaler.
Result: The cluster-autoscaler now properly reads the "false" value.
Story Points: ---
Clone Of:
: 1879162 Environment:
Last Closed: 2020-10-27 16:12:56 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1879162    

Description Marcel Härri 2020-07-08 12:33:49 UTC
The repository includes an example that sets the following option:

cluster-autoscaler-operator/examples/clusterautoscaler.yaml

Line 9 in 9c4a47c

 skipNodesWithLocalStorage: true 

However, setting this option to false has no effect: the deployment is not updated.

This is because the configuration logic is flawed:

cluster-autoscaler-operator/pkg/controller/clusterautoscaler/clusterautoscaler.go

Lines 95 to 97 in 9c4a47c

 if ca.Spec.SkipNodesWithLocalStorage != nil && *ca.Spec.SkipNodesWithLocalStorage { 
 	args = append(args, SkipNodesWithLocalStorage.String()) 
 } 

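But to scale down nodes running pods that use emptyDir, the autoscaler must run with --skip-nodes-with-local-storage=false. The condition above only appends the flag when the value is true, so an explicit false is handled exactly like an unset value: the flag is omitted and the autoscaler falls back to its default, which is true upstream.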
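
To illustrate the difference, here is a minimal Go sketch, not the code from the pull request. The type ClusterAutoscalerSpec, the helper boolFlag, and the functions legacyArgs and explicitArgs are hypothetical stand-ins invented for this example. The literal flag strings are assumptions based on the flag names seen in the deployment; the report does not show what SkipNodesWithLocalStorage.String() renders.

```go
package main

import "fmt"

// ClusterAutoscalerSpec is a hypothetical, trimmed-down stand-in for the
// operator's spec type; only the three optional booleans are modelled.
type ClusterAutoscalerSpec struct {
	BalanceSimilarNodeGroups    *bool
	IgnoreDaemonsetsUtilization *bool
	SkipNodesWithLocalStorage   *bool
}

// legacyArgs mirrors the condition quoted in the report: the flag is only
// appended when the pointer is set AND true, so false is silently dropped.
// The emitted string is an assumption for illustration.
func legacyArgs(spec ClusterAutoscalerSpec) []string {
	var args []string
	if spec.SkipNodesWithLocalStorage != nil && *spec.SkipNodesWithLocalStorage {
		args = append(args, "--skip-nodes-with-local-storage")
	}
	return args
}

// boolFlag is a hypothetical helper that renders an optional boolean as an
// explicit --name=value argument. It returns nil when the value is unset,
// leaving the autoscaler's own default in effect.
func boolFlag(name string, value *bool) []string {
	if value == nil {
		return nil
	}
	return []string{fmt.Sprintf("--%s=%t", name, *value)}
}

// explicitArgs passes every set value through, whether true or false.
func explicitArgs(spec ClusterAutoscalerSpec) []string {
	var args []string
	args = append(args, boolFlag("balance-similar-node-groups", spec.BalanceSimilarNodeGroups)...)
	args = append(args, boolFlag("ignore-daemonsets-utilization", spec.IgnoreDaemonsetsUtilization)...)
	args = append(args, boolFlag("skip-nodes-with-local-storage", spec.SkipNodesWithLocalStorage)...)
	return args
}

func main() {
	off := false
	spec := ClusterAutoscalerSpec{SkipNodesWithLocalStorage: &off}
	fmt.Println(legacyArgs(spec))   // no flag is emitted for false
	fmt.Println(explicitArgs(spec)) // the false value is passed explicitly
}
```

The pointer type is what makes three states possible (unset, true, false); the sketch keeps "unset" as "emit nothing" and only changes how an explicit false is handled.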


There is already a fix available: https://github.com/openshift/cluster-autoscaler-operator/pull/156

It would be nice to have it backported at least as far back as 4.4.

Comment 1 Michael McCune 2020-07-08 13:16:24 UTC
Thanks for posting this, Marcel. I am taking a look at the issue and pull request.

Comment 2 Michael McCune 2020-07-28 21:02:20 UTC
We need to get another review on this from our team, but we should be able to merge it soon.

Comment 5 sunzhaohua 2020-08-05 07:37:06 UTC
Verified
clusterversion: 4.6.0-0.nightly-2020-08-05-013608
spec:
  balanceSimilarNodeGroups: false
  skipNodesWithLocalStorage: false
  ignoreDaemonsetsUtilization: false
$ oc edit deploy cluster-autoscaler-default
        - --balance-similar-node-groups=false
        - --ignore-daemonsets-utilization=false
        - --skip-nodes-with-local-storage=false
spec:
  balanceSimilarNodeGroups: true
  skipNodesWithLocalStorage: true
  ignoreDaemonsetsUtilization: true
        - --balance-similar-node-groups=true
        - --ignore-daemonsets-utilization=true
        - --skip-nodes-with-local-storage=true

Comment 6 Marcel Härri 2020-08-05 14:33:54 UTC
Can we get this backported to 4.4 / 4.5?

Comment 7 Michael McCune 2020-08-05 16:06:57 UTC
I think this is a good candidate for backport; it should be possible to do this sprint.

Comment 8 Michael McCune 2020-08-17 19:15:16 UTC
Planning to get this backported during the upcoming sprint.

Comment 10 errata-xmlrpc 2020-10-27 16:12:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196