Bug 1713374

Summary: Machineset is scaled to 0, but machine and node aren't removed
Product: OpenShift Container Platform
Reporter: nate stephany <nstephan>
Component: Cloud Compute
Assignee: Jan Chaloupka <jchaloup>
Status: CLOSED ERRATA
QA Contact: Jianwei Hou <jhou>
Severity: medium
Priority: unspecified
Version: 4.1.0
CC: agarcial, mgugino, nstephan, zhsun
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2019-10-16 06:29:21 UTC
Attachments:
- controller-manager.log
- machine-controller.log
- nodelink-controller

Description nate stephany 2019-05-23 13:46:29 UTC
Created attachment 1572536 [details]
controller-manager.log

Description of problem:
MachineSet has been scaled to 0 replicas, but the Machine and associated node are not being removed from the cluster. The MachineSet itself seems healthy since I can scale it back up and down, but that one Machine/node won't go away.


Version-Release number of selected component (if applicable):
4.1.0-rc.4

How reproducible:


Steps to Reproduce:
1. Deploy a MachineSet
2. Scale it up
3. Scale it down (example commands below)
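
For reference, the scale operations in steps 2 and 3 map to standard oc commands; a minimal sketch using the MachineSet name from the comments below (the replica counts are illustrative):

$ oc scale machineset infranode-eu-central-1c -n openshift-machine-api --replicas=3
$ oc scale machineset infranode-eu-central-1c -n openshift-machine-api --replicas=0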

Actual results:
MachineSet scaled down, but node & machine remain

Expected results:
MachineSet scaled down, node and machine deleted

Additional info:

Comment 1 nate stephany 2019-05-23 13:54:10 UTC
Created attachment 1572537 [details]
machine-controller.log

Comment 2 nate stephany 2019-05-23 13:54:52 UTC
Created attachment 1572538 [details]
nodelink-controller

Comment 3 Michael Gugino 2019-05-23 14:00:57 UTC
Problematic machine:
oc get machine infranode-eu-central-1c-4bs2v -oyaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/exclude-node-draining: "true"
  creationTimestamp: "2019-05-22T09:36:48Z"
  finalizers:
  - machine.machine.openshift.io
  generateName: infranode-eu-central-1c-
  generation: 1
  labels:
    machine.openshift.io/cluster-api-cluster: cluster-cd18-skf92
    machine.openshift.io/cluster-api-machine-role: infra
    machine.openshift.io/cluster-api-machine-type: infra
    machine.openshift.io/cluster-api-machineset: infranode-eu-central-1c
  name: infranode-eu-central-1c-4bs2v
  namespace: openshift-machine-api
  ownerReferences:
  - apiVersion: machine.openshift.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: MachineSet
    name: infranode-eu-central-1c
    uid: 1f34dd42-7c75-11e9-9acd-02bc9c3bbef2
  resourceVersion: "763557"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/infranode-eu-central-1c-4bs2v
  uid: 1f3633f6-7c75-11e9-bb3f-0a4014b17db4
spec:
  metadata:
    creationTimestamp: null
    labels:
      infra: infra
      node-role.kubernetes.io/infra: ""
  providerSpec:
    value:
      ami:
        id: ami-0060b31e9ef744e3d
      apiVersion: awsproviderconfig.openshift.io/v1beta1
      blockDevices:
      - ebs:
          iops: 0
          volumeSize: 120
          volumeType: gp2
      credentialsSecret:
        name: aws-cloud-credentials
      deviceIndex: 0
      iamInstanceProfile:
        id: cluster-cd18-skf92-worker-profile
      instanceType: m4.large
      kind: AWSMachineProviderConfig
      metadata:
        creationTimestamp: null
      placement:
        availabilityZone: eu-central-1c
        region: eu-central-1
      publicIp: null
      securityGroups:
      - filters:
        - name: tag:Name
          values:
          - cluster-cd18-skf92-worker-sg
      subnet:
        filters:
        - name: tag:Name
          values:
          - cluster-cd18-skf92-private-eu-central-1c
      tags:
      - name: kubernetes.io/cluster/cluster-cd18-skf92
        value: owned
      userDataSecret:
        name: worker-user-data
status:
  addresses:
  - address: 10.0.163.154
    type: InternalIP
  - address: ""
    type: ExternalDNS
  - address: ip-10-0-163-154.eu-central-1.compute.internal
    type: InternalDNS
  lastUpdated: "2019-05-22T09:41:10Z"
  nodeRef:
    kind: Node
    name: ip-10-0-163-154.eu-central-1.compute.internal
    uid: 9ce867bd-7c75-11e9-9acd-02bc9c3bbef2
  providerStatus:
    apiVersion: awsproviderconfig.openshift.io/v1beta1
    conditions:
    - lastProbeTime: "2019-05-22T09:36:50Z"
      lastTransitionTime: "2019-05-22T09:36:50Z"
      message: machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreation
    instanceId: i-0ffea263ffdfe70d4
    instanceState: running
    kind: AWSMachineProviderStatus

Comment 4 Michael Gugino 2019-05-23 14:37:09 UTC
Problematic MachineSet:

oc get machinesets -n openshift-machine-api -o yaml infranode-eu-central-1c
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  creationTimestamp: "2019-05-22T09:36:48Z"
  generation: 4
  labels:
    machine.openshift.io/cluster-api-cluster: cluster-cd18-skf92
  name: infranode-eu-central-1c
  namespace: openshift-machine-api
  resourceVersion: "770836"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/infranode-eu-central-1c
  uid: 1f34dd42-7c75-11e9-9acd-02bc9c3bbef2
spec:
  replicas: 0
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: cluster-cd18-skf92
      machine.openshift.io/cluster-api-machine-role: infra
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: infranode-eu-central-1c
  template:
    metadata:
      creationTimestamp: null
      labels:
        machine.openshift.io/cluster-api-cluster: cluster-cd18-skf92
        machine.openshift.io/cluster-api-machine-role: infra
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: infranode-eu-central-1c
    spec:
      metadata:
        creationTimestamp: null
        labels:
          infra: infra
          node-role.kubernetes.io/infra: ""
      providerSpec:
        value:
          ami:
            id: ami-0060b31e9ef744e3d
          apiVersion: awsproviderconfig.openshift.io/v1beta1
          blockDevices:
          - ebs:
              iops: 0
              volumeSize: 120
              volumeType: gp2
          credentialsSecret:
            name: aws-cloud-credentials
          deviceIndex: 0
          iamInstanceProfile:
            id: cluster-cd18-skf92-worker-profile
          instanceType: m4.large
          kind: AWSMachineProviderConfig
          metadata:
            creationTimestamp: null
          placement:
            availabilityZone: eu-central-1c
            region: eu-central-1
          publicIp: null
          securityGroups:
          - filters:
            - name: tag:Name
              values:
              - cluster-cd18-skf92-worker-sg
          subnet:
            filters:
            - name: tag:Name
              values:
              - cluster-cd18-skf92-private-eu-central-1c
          tags:
          - name: kubernetes.io/cluster/cluster-cd18-skf92
            value: owned
          userDataSecret:
            name: worker-user-data
status:
  observedGeneration: 4
  replicas: 0

Comment 5 Michael Gugino 2019-05-23 14:52:40 UTC
This was determined to be caused by someone or something modifying the labels on the Machine object: the machine's `machine.openshift.io/cluster-api-machine-type` label (`infra`) no longer matched the MachineSet's selector (`worker`), so the MachineSet ignored that machine and it was orphaned.

I'm leaving this case open.  We need to get some debug logging into the machineset controller to assist with this type of thing in the future (e.g., by editing the deployment and increasing log verbosity); otherwise it will be difficult to find out why a machine was skipped.  Something along the lines of: "Found machines x, y, z; ignoring machine x because its labels don't match."
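
One way to confirm this kind of orphaning (a minimal sketch, using the object names from comments 3 and 4) is to compare the MachineSet's selector with the machine's labels:

$ oc get machineset infranode-eu-central-1c -n openshift-machine-api -o jsonpath='{.spec.selector.matchLabels}'
$ oc get machine infranode-eu-central-1c-4bs2v -n openshift-machine-api --show-labels

Any selector key whose value does not match the machine's labels (here `machine.openshift.io/cluster-api-machine-type: worker` vs. `infra`) means the MachineSet no longer counts that machine toward its replicas.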

Comment 6 Jan Chaloupka 2019-07-29 12:19:57 UTC
Given that it was the `machine.openshift.io/cluster-api-machine-type` label that changed its value from `worker` to `infra`, not `machine.openshift.io/cluster-api-machineset` nor `machine.openshift.io/cluster-api-cluster: cluster-cd18-skf92`, I would suggest dropping both `machine.openshift.io/cluster-api-machine-role` and `machine.openshift.io/cluster-api-machine-type` when creating MachineSet objects. The clusterID and MachineSet name are sufficient information. Even the MachineSet name alone would do, given that the machine API plane runs inside the cluster, MachineSet objects are expected to live in a single namespace, and the clusterID is constant.

Comment 7 Michael Gugino 2019-07-29 13:11:46 UTC
(In reply to Jan Chaloupka from comment #6)
> Given that it was the `machine.openshift.io/cluster-api-machine-type`
> label that changed its value from `worker` to `infra`, not
> `machine.openshift.io/cluster-api-machineset` nor
> `machine.openshift.io/cluster-api-cluster: cluster-cd18-skf92`, I would
> suggest dropping both `machine.openshift.io/cluster-api-machine-role` and
> `machine.openshift.io/cluster-api-machine-type` when creating MachineSet
> objects. The clusterID and MachineSet name are sufficient information.
> Even the MachineSet name alone would do, given that the machine API plane
> runs inside the cluster, MachineSet objects are expected to live in a
> single namespace, and the clusterID is constant.

I agree, we're using too many match-labels and it's a little confusing.
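
For illustration, the reduced selector being proposed would look roughly like this in the MachineSet spec (a sketch, keeping only the cluster and machineset labels from comment 4):

  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: cluster-cd18-skf92
      machine.openshift.io/cluster-api-machineset: infranode-eu-central-1c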

Comment 8 Jan Chaloupka 2019-08-05 08:20:11 UTC
PR: https://github.com/openshift/installer/pull/2153

Comment 10 sunzhaohua 2019-08-19 08:47:54 UTC
Verified.

When the MachineSet is scaled down, the node and machine are deleted, even when we create MachineSets with the `machine.openshift.io/cluster-api-machine-role` and `machine.openshift.io/cluster-api-machine-type` labels and change the `machine.openshift.io/cluster-api-machine-role` value from `worker` to `infra`.
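
The label change used for this verification can be reproduced with something like the following (a sketch; the machine and MachineSet names are placeholders):

$ oc label machine <machine-name> -n openshift-machine-api machine.openshift.io/cluster-api-machine-role=infra --overwrite
$ oc scale machineset <machineset-name> -n openshift-machine-api --replicas=0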

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-08-18-222019   True        False         71m     Cluster version is 4.2.0-0.nightly-2019-08-18-222019

Comment 12 errata-xmlrpc 2019-10-16 06:29:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922