Description of problem:
On OCP 4.6 we started supporting autoscaling by adding a providerID to our nodes and machines in OCP. We saw that in certain situations, when we define the autoscaler with:

  delayAfterAdd: 5s
  delayAfterDelete: 5s

we end up in a situation where working nodes are removed in favor of new nodes that have just started. When looking into the autoscaler logs we noticed many "unregistered" warnings, such as:

  static_autoscaler.go:320] 2 unregistered nodes present
  static_autoscaler.go:592] Removing unregistered node 7e4772d1-c272-4cea-b5d7-041aa0667d23
  static_autoscaler.go:608] Failed to remove node 7e4772d1-c272-4cea-b5d7-041aa0667d23: node group min size reached, skipping unregistered node removal

We then understood that once the cluster autoscaler decides it can scale down, it first evicts the old worker nodes; when we define low timeouts and there is pressure on the cluster, we can sometimes end up in a situation where most of the nodes are in a deleting/provisioning state. When we looked at how a node is marked as unregistered by the autoscaler [1], we saw that we have a problem with the machine provider ID value.

[1] https://github.com/kubernetes/autoscaler/blob/fde90dee450cb4626d4d683a83e623af1753c075/cluster-autoscaler/clusterstate/clusterstate.go#L968-L985

How reproducible:
100%

Steps to Reproduce:
1. Deploy a cluster with an autoscaler: https://docs.openshift.com/container-platform/4.5/machine_management/applying-autoscaling.html
2. Look at the autoscaler logs
*** Bug 1883979 has been marked as a duplicate of this bug. ***
*** Bug 1881051 has been marked as a duplicate of this bug. ***
*** Bug 1880136 has been marked as a duplicate of this bug. ***
Verified on: openshift-4.6.0-0.nightly-2020-10-02-065738

Steps:
1. Have OCP with 3 masters and 3 workers
2. # cat cluster_autoscaler.yaml
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  podPriorityThreshold: -10
  resourceLimits:
    maxNodesTotal: 9
    cores:
      min: 24
      max: 40
    memory:
      min: 96
      max: 256
  scaleDown:
    enabled: true
    delayAfterAdd: 10s
    delayAfterDelete: 10s
    delayAfterFailure: 30s
    unneededTime: 30s
3. # oc create -f cluster_autoscaler.yaml
4. # cat machine_autoscaler.yaml
apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "primary-jnzvt-worker-0"
  namespace: "openshift-machine-api"
spec:
  minReplicas: 3
  maxReplicas: 6
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: primary-jnzvt-worker-0
5. # oc create -f machine_autoscaler.yaml
6. # oc apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: autoscaler-demo
EOF
7. # cat scale-up.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-up
  labels:
    app: scale-up
spec:
  replicas: 18
  selector:
    matchLabels:
      app: scale-up
  template:
    metadata:
      labels:
        app: scale-up
    spec:
      containers:
      - name: origin-base
        image: openshift/origin-base
        resources:
          requests:
            memory: 2Gi
        command:
        - /bin/sh
        - "-c"
        - "echo 'this should be in the logs' && sleep 86400"
      terminationGracePeriodSeconds: 0
8. # oc apply -n autoscaler-demo -f scale-up.yaml
9.
Waited around 1 hr and checked the results:
# oc get pods -n autoscaler-demo && oc get machine -n openshift-machine-api
NAME                        READY   STATUS    RESTARTS   AGE
scale-up-5fd5c67f64-24wk5   1/1     Running   0          78m
scale-up-5fd5c67f64-2qd9c   1/1     Running   0          79m
scale-up-5fd5c67f64-58tjg   1/1     Running   0          79m
scale-up-5fd5c67f64-7fh9q   1/1     Running   0          78m
scale-up-5fd5c67f64-7q5bc   1/1     Running   0          79m
scale-up-5fd5c67f64-cdhv5   1/1     Running   0          79m
scale-up-5fd5c67f64-cv4rl   1/1     Running   0          79m
scale-up-5fd5c67f64-fkq9t   1/1     Running   0          79m
scale-up-5fd5c67f64-grzl2   1/1     Running   0          79m
scale-up-5fd5c67f64-jb57z   1/1     Running   0          79m
scale-up-5fd5c67f64-jhnp9   1/1     Running   0          79m
scale-up-5fd5c67f64-rtq82   1/1     Running   0          79m
scale-up-5fd5c67f64-v5pnd   1/1     Running   0          79m
scale-up-5fd5c67f64-vfq6r   1/1     Running   0          79m
scale-up-5fd5c67f64-wv9ld   1/1     Running   0          79m
scale-up-5fd5c67f64-xmgq6   1/1     Running   0          78m
scale-up-5fd5c67f64-z4wtj   1/1     Running   0          79m
scale-up-5fd5c67f64-zm4j9   1/1     Running   0          79m
NAME                           PHASE     TYPE   REGION   ZONE   AGE
primary-8hhkw-master-0         Running                          3h51m
primary-8hhkw-master-1         Running                          3h51m
primary-8hhkw-master-2         Running                          3h51m
primary-8hhkw-worker-0-4xvnf   Running                          129m
primary-8hhkw-worker-0-pb5kv   Running                          3h40m
primary-8hhkw-worker-0-pcntp   Running                          129m
primary-8hhkw-worker-0-sxkl2   Running                          78m
10.
Changed scale-up.yaml to scale to 24 containers, waited around 20 min, and checked the results:
# oc get pods -n autoscaler-demo && oc get machine -n openshift-machine-api
NAME                        READY   STATUS    RESTARTS   AGE
scale-up-5fd5c67f64-24wk5   1/1     Running   0          97m
scale-up-5fd5c67f64-2qd9c   1/1     Running   0          99m
scale-up-5fd5c67f64-58tjg   1/1     Running   0          99m
scale-up-5fd5c67f64-7fh9q   1/1     Running   0          97m
scale-up-5fd5c67f64-7q5bc   1/1     Running   0          99m
scale-up-5fd5c67f64-8tx9c   1/1     Running   0          28m
scale-up-5fd5c67f64-b9zk9   1/1     Running   0          28m
scale-up-5fd5c67f64-cdhv5   1/1     Running   0          99m
scale-up-5fd5c67f64-cv4rl   1/1     Running   0          99m
scale-up-5fd5c67f64-fkq9t   1/1     Running   0          99m
scale-up-5fd5c67f64-fkxhf   1/1     Running   0          28m
scale-up-5fd5c67f64-grzl2   1/1     Running   0          99m
scale-up-5fd5c67f64-jb57z   1/1     Running   0          99m
scale-up-5fd5c67f64-jhnp9   1/1     Running   0          99m
scale-up-5fd5c67f64-qxnfk   1/1     Running   0          27m
scale-up-5fd5c67f64-rtq82   1/1     Running   0          99m
scale-up-5fd5c67f64-v5pnd   1/1     Running   0          99m
scale-up-5fd5c67f64-vfq6r   1/1     Running   0          99m
scale-up-5fd5c67f64-wv9ld   1/1     Running   0          99m
scale-up-5fd5c67f64-xm72p   1/1     Running   0          27m
scale-up-5fd5c67f64-xmgq6   1/1     Running   0          97m
scale-up-5fd5c67f64-xn6cb   1/1     Running   0          28m
scale-up-5fd5c67f64-z4wtj   1/1     Running   0          99m
scale-up-5fd5c67f64-zm4j9   1/1     Running   0          99m
NAME                           PHASE     TYPE   REGION   ZONE   AGE
primary-8hhkw-master-0         Running                          4h10m
primary-8hhkw-master-1         Running                          4h10m
primary-8hhkw-master-2         Running                          4h10m
primary-8hhkw-worker-0-4xvnf   Running                          148m
primary-8hhkw-worker-0-pb5kv   Running                          4h
primary-8hhkw-worker-0-pcntp   Running                          148m
primary-8hhkw-worker-0-sxkl2   Running                          97m
primary-8hhkw-worker-0-zjr6g   Running                          28m
11.
Checked the provider ID in the new machines:
# oc describe {node,machine}/{primary-8hhkw-worker-0-sxkl2,primary-8hhkw-worker-0-zjr6g} -n openshift-machine-api | egrep "ProviderID|Provider ID"
ProviderID:    ovirt://c17bec45-2237-433c-9661-ccdce2aa3dbc
ProviderID:    ovirt://39bfbf06-d803-4bf6-ab1e-bb3a4324013a
Provider ID:   ovirt://c17bec45-2237-433c-9661-ccdce2aa3dbc
Provider ID:   ovirt://39bfbf06-d803-4bf6-ab1e-bb3a4324013a

Results: Extra machines were created successfully, the provider IDs match, and the new machines and new containers remained stable over time.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196