Created attachment 1572536 [details]
controller-manager.log

Description of problem:
A MachineSet has been scaled to 0 replicas, but the Machine and its associated node are not being removed from the cluster. The MachineSet otherwise seems healthy: I can scale it back up and down, but that one Machine/node won't go away.

Version-Release number of selected component (if applicable):
4.1.0-rc.4

How reproducible:

Steps to Reproduce:
1. Deploy a MachineSet
2. Scale it up
3. Scale it down

Actual results:
MachineSet scaled down, but the node and machine remain

Expected results:
MachineSet scaled down, node and machine deleted

Additional info:
Created attachment 1572537 [details] machine-controller.log
Created attachment 1572538 [details] nodelink-controller
Problematic machine:

oc get machine infranode-eu-central-1c-4bs2v -o yaml

apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/exclude-node-draining: "true"
  creationTimestamp: "2019-05-22T09:36:48Z"
  finalizers:
  - machine.machine.openshift.io
  generateName: infranode-eu-central-1c-
  generation: 1
  labels:
    machine.openshift.io/cluster-api-cluster: cluster-cd18-skf92
    machine.openshift.io/cluster-api-machine-role: infra
    machine.openshift.io/cluster-api-machine-type: infra
    machine.openshift.io/cluster-api-machineset: infranode-eu-central-1c
  name: infranode-eu-central-1c-4bs2v
  namespace: openshift-machine-api
  ownerReferences:
  - apiVersion: machine.openshift.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: MachineSet
    name: infranode-eu-central-1c
    uid: 1f34dd42-7c75-11e9-9acd-02bc9c3bbef2
  resourceVersion: "763557"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/infranode-eu-central-1c-4bs2v
  uid: 1f3633f6-7c75-11e9-bb3f-0a4014b17db4
spec:
  metadata:
    creationTimestamp: null
    labels:
      infra: infra
      node-role.kubernetes.io/infra: ""
  providerSpec:
    value:
      ami:
        id: ami-0060b31e9ef744e3d
      apiVersion: awsproviderconfig.openshift.io/v1beta1
      blockDevices:
      - ebs:
          iops: 0
          volumeSize: 120
          volumeType: gp2
      credentialsSecret:
        name: aws-cloud-credentials
      deviceIndex: 0
      iamInstanceProfile:
        id: cluster-cd18-skf92-worker-profile
      instanceType: m4.large
      kind: AWSMachineProviderConfig
      metadata:
        creationTimestamp: null
      placement:
        availabilityZone: eu-central-1c
        region: eu-central-1
      publicIp: null
      securityGroups:
      - filters:
        - name: tag:Name
          values:
          - cluster-cd18-skf92-worker-sg
      subnet:
        filters:
        - name: tag:Name
          values:
          - cluster-cd18-skf92-private-eu-central-1c
      tags:
      - name: kubernetes.io/cluster/cluster-cd18-skf92
        value: owned
      userDataSecret:
        name: worker-user-data
status:
  addresses:
  - address: 10.0.163.154
    type: InternalIP
  - address: ""
    type: ExternalDNS
  - address: ip-10-0-163-154.eu-central-1.compute.internal
    type: InternalDNS
  lastUpdated: "2019-05-22T09:41:10Z"
  nodeRef:
    kind: Node
    name: ip-10-0-163-154.eu-central-1.compute.internal
    uid: 9ce867bd-7c75-11e9-9acd-02bc9c3bbef2
  providerStatus:
    apiVersion: awsproviderconfig.openshift.io/v1beta1
    conditions:
    - lastProbeTime: "2019-05-22T09:36:50Z"
      lastTransitionTime: "2019-05-22T09:36:50Z"
      message: machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreation
    instanceId: i-0ffea263ffdfe70d4
    instanceState: running
    kind: AWSMachineProviderStatus
Problematic MachineSet:

oc get machinesets -n openshift-machine-api -o yaml infranode-eu-central-1c

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  creationTimestamp: "2019-05-22T09:36:48Z"
  generation: 4
  labels:
    machine.openshift.io/cluster-api-cluster: cluster-cd18-skf92
  name: infranode-eu-central-1c
  namespace: openshift-machine-api
  resourceVersion: "770836"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/infranode-eu-central-1c
  uid: 1f34dd42-7c75-11e9-9acd-02bc9c3bbef2
spec:
  replicas: 0
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: cluster-cd18-skf92
      machine.openshift.io/cluster-api-machine-role: infra
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: infranode-eu-central-1c
  template:
    metadata:
      creationTimestamp: null
      labels:
        machine.openshift.io/cluster-api-cluster: cluster-cd18-skf92
        machine.openshift.io/cluster-api-machine-role: infra
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: infranode-eu-central-1c
    spec:
      metadata:
        creationTimestamp: null
        labels:
          infra: infra
          node-role.kubernetes.io/infra: ""
      providerSpec:
        value:
          ami:
            id: ami-0060b31e9ef744e3d
          apiVersion: awsproviderconfig.openshift.io/v1beta1
          blockDevices:
          - ebs:
              iops: 0
              volumeSize: 120
              volumeType: gp2
          credentialsSecret:
            name: aws-cloud-credentials
          deviceIndex: 0
          iamInstanceProfile:
            id: cluster-cd18-skf92-worker-profile
          instanceType: m4.large
          kind: AWSMachineProviderConfig
          metadata:
            creationTimestamp: null
          placement:
            availabilityZone: eu-central-1c
            region: eu-central-1
          publicIp: null
          securityGroups:
          - filters:
            - name: tag:Name
              values:
              - cluster-cd18-skf92-worker-sg
          subnet:
            filters:
            - name: tag:Name
              values:
              - cluster-cd18-skf92-private-eu-central-1c
          tags:
          - name: kubernetes.io/cluster/cluster-cd18-skf92
            value: owned
          userDataSecret:
            name: worker-user-data
status:
  observedGeneration: 4
  replicas: 0
This was determined to be caused by someone or something tweaking labels on the machine object. That caused the MachineSet's selector to no longer match the machine, so it was orphaned. I'm leaving this case open: we need some debug logging in the machineset controller to assist with this kind of problem in the future (e.g. editing the deployment to increase log verbosity). Otherwise it will be difficult to find out why a machine was skipped. Something along the lines of: "Found machines x, y, z; ignoring machine x because its labels don't match the selector."
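A minimal sketch of the behavior and the logging being requested (the real controller is written in Go; this Python is purely illustrative, and the label values mirror this bug's objects):

```python
def selector_matches(match_labels, machine_labels):
    """A machine matches only if every selector key/value pair is present on it."""
    return all(machine_labels.get(k) == v for k, v in match_labels.items())

def filter_machines(match_labels, machines, log):
    """Return machines owned by the selector; log why any machine is skipped."""
    owned = []
    for name, labels in machines.items():
        if selector_matches(match_labels, labels):
            owned.append(name)
        else:
            mismatched = {k: labels.get(k) for k, v in match_labels.items()
                          if labels.get(k) != v}
            log(f"ignoring machine {name}: labels don't match selector: {mismatched}")
    return owned

selector = {
    "machine.openshift.io/cluster-api-machineset": "infranode-eu-central-1c",
    "machine.openshift.io/cluster-api-machine-type": "worker",
}
machines = {
    # This machine's machine-type label was edited to "infra", so the
    # MachineSet no longer matches it and it is orphaned on scale-down.
    "infranode-eu-central-1c-4bs2v": {
        "machine.openshift.io/cluster-api-machineset": "infranode-eu-central-1c",
        "machine.openshift.io/cluster-api-machine-type": "infra",
    },
}
print(filter_machines(selector, machines, log=print))
```

With logging like this, the orphaned machine above would have been immediately visible in the controller output instead of silently ignored.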
Given that it was the `machine.openshift.io/cluster-api-machine-type` label whose value changed from `worker` to `infra` (not `machine.openshift.io/cluster-api-machineset` nor `machine.openshift.io/cluster-api-cluster: cluster-cd18-skf92`), I would suggest abandoning both `machine.openshift.io/cluster-api-machine-role` and `machine.openshift.io/cluster-api-machine-type` when creating machineset objects. The clusterID and the machineset name are sufficient information. Even the machineset name alone would do, given that the machine API plane runs inside the cluster, machineset objects are expected to live in a single namespace, and the clusterID is constant.
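A quick sketch of why the reduced selector is more robust (illustrative Python, not controller code; label values taken from this bug): with the role/type labels dropped from matchLabels, editing machine-type on a machine no longer breaks ownership.

```python
def selector_matches(match_labels, machine_labels):
    """A machine matches only if every selector key/value pair is present on it."""
    return all(machine_labels.get(k) == v for k, v in match_labels.items())

machine_labels = {
    "machine.openshift.io/cluster-api-cluster": "cluster-cd18-skf92",
    "machine.openshift.io/cluster-api-machineset": "infranode-eu-central-1c",
    # Edited by hand from "worker" to "infra", as happened in this bug:
    "machine.openshift.io/cluster-api-machine-type": "infra",
}

# Current selector: includes machine-type, so the edited machine no longer matches.
old_selector = {
    "machine.openshift.io/cluster-api-cluster": "cluster-cd18-skf92",
    "machine.openshift.io/cluster-api-machine-type": "worker",
    "machine.openshift.io/cluster-api-machineset": "infranode-eu-central-1c",
}
# Proposed selector: clusterID + machineset name only.
new_selector = {
    "machine.openshift.io/cluster-api-cluster": "cluster-cd18-skf92",
    "machine.openshift.io/cluster-api-machineset": "infranode-eu-central-1c",
}

print(selector_matches(old_selector, machine_labels))  # False -> machine orphaned
print(selector_matches(new_selector, machine_labels))  # True  -> machine still owned
```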
(In reply to Jan Chaloupka from comment #6) > Given it was `machine.openshift.io/cluster-api-machine-type` label that > changed its value from `worker` to `infra`, not > `machine.openshift.io/cluster-api-machineset` nor > `machine.openshift.io/cluster-api-cluster: cluster-cd18-skf92`, I would > suggest here to abandon both `machine.openshift.io/cluster-api-machine-role` > and `machine.openshift.io/cluster-api-machine-type` when creating machineset > objects. clusterID and machineset name are sufficient information. Even > machineset name would do given machine api plane is running inside the > cluster, machineset objects are expected to live under single namespace and > clusterID is constant. I agree, we're using too many match labels, and it's a little confusing.
PR: https://github.com/openshift/installer/pull/2153
Verified. When the MachineSet is scaled down, the node and machine are deleted, even after creating machinesets with the labels `machine.openshift.io/cluster-api-machine-role` and `machine.openshift.io/cluster-api-machine-type` and changing the `machine.openshift.io/cluster-api-machine-role` label value from `worker` to `infra`.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-08-18-222019   True        False         71m     Cluster version is 4.2.0-0.nightly-2019-08-18-222019
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922