Bug 1713105

Summary: Missing node prevents machine from being deleted
Product: OpenShift Container Platform
Reporter: Michael Gugino <mgugino>
Component: Cloud Compute
Assignee: Michael Gugino <mgugino>
Status: CLOSED ERRATA
QA Contact: Jianwei Hou <jhou>
Severity: medium
Docs Contact:
Priority: high
Version: 4.2.0
CC: agarcial, ejacobs, jhou, mgugino, zhsun
Target Milestone: ---
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1713061
Environment:
Last Closed: 2019-10-16 06:29:21 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1713061

Description Michael Gugino 2019-05-22 21:40:49 UTC
+++ This bug was initially created as a clone of Bug #1713061 +++

Description of problem:
A machineset was scaled up and then scaled down. The nodes disappeared, but the machine objects remain.
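
A minimal reproduction sketch (the machineset name and replica counts are illustrative; assumes the default openshift-machine-api namespace):

$ oc scale machineset <machineset-name> --replicas=3 -n openshift-machine-api
$ # wait for the new nodes to join, then scale back down
$ oc scale machineset <machineset-name> --replicas=1 -n openshift-machine-api
$ # the removed machines should be cleaned up; here they remain with a deletionTimestamp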

Version-Release number of selected component (if applicable):
4.1.0-rc.4

Additional info:
NAME                          INSTANCE              STATE     TYPE        REGION      ZONE         AGE
cluster-4e40-c7df5-master-0   i-087186746072193f0   running   m4.xlarge   us-east-2   us-east-2a   24h
cluster-4e40-c7df5-master-1   i-0eafe7e9e69f6aaec   running   m4.xlarge   us-east-2   us-east-2b   24h
cluster-4e40-c7df5-master-2   i-03c13bba692694646   running   m4.xlarge   us-east-2   us-east-2c   24h
infranode-us-east-2a-t7xwt    i-0c6ce0f9d57708d22   running   m4.large    us-east-2   us-east-2a   173m
infranode-us-east-2a-z9nfh    i-0c3f83d4c9003f5d0   running   m4.large    us-east-2   us-east-2a   3h39m
nossd-1a-dczcf                i-00a207dab2c9e970d   running   m4.large    us-east-2   us-east-2a   3h57m
ssd-1a-5l9fh                  i-090acc4f9598a37f3   running   m4.large    us-east-2   us-east-2a   121m
ssd-1a-7cvrr                  i-0ccca476b234fc1da   running   m4.large    us-east-2   us-east-2a   69m
ssd-1a-q52pv                  i-0e9e6d01af5ca727a   running   m4.large    us-east-2   us-east-2a   121m
ssd-1a-q6hr9                  i-08f4a48151276ce90   running   m4.large    us-east-2   us-east-2a   121m
ssd-1a-sfhdm                  i-03eec775cb1ce8f3c   running   m4.large    us-east-2   us-east-2a   121m
ssd-1b-rtxxg                  i-08d06740a65e88be6   running   m4.large    us-east-2   us-east-2b   3h57m


The machines that are 121m old in the `ssd-1a` set are the "orphans" without corresponding nodes. Each of them has a deletionTimestamp set.
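
A quick way to list the machines stuck in this state (sketch; assumes the openshift-machine-api namespace and that jq is available):

$ oc get machines -n openshift-machine-api -o json \
    | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.name'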

--- Additional comment from Michael Gugino on 2019-05-22 20:37:20 UTC ---

I have investigated this.  We're failing to retrieve the node from the nodeRef specified on the machine-object.  This is either because the machine-controller deleted the node already and failed to update that annotation for some reason, or an admin removed the node manually before attempting to scale.  Either way, this is definitely a bug and is not easily correctable by the end-user.  I will get a patch out for master and pick to 4.1.
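
One way to confirm a machine is in this state (sketch; the machine name is illustrative) is to check that its nodeRef still points at a node the API server no longer has:

$ oc get machine <machine-name> -n openshift-machine-api -o jsonpath='{.status.nodeRef.name}'
$ oc get node <node-name-from-previous-output>   # returns NotFound for an orphaned machine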

--- Additional comment from Michael Gugino on 2019-05-22 21:05:59 UTC ---

Added a reference to 4.1 known-issue tracker: https://github.com/openshift/openshift-docs/issues/12487

Comment 1 Michael Gugino 2019-05-22 21:46:10 UTC
Patch created against openshift/cluster-api 4.2 branch: https://github.com/openshift/cluster-api/pull/43

Comment 3 sunzhaohua 2019-06-25 06:47:22 UTC
Verified.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-06-24-160709   True        False         16m     Cluster version is 4.2.0-0.nightly-2019-06-24-160709

$ oc delete node ip-10-0-129-227.us-west-2.compute.internal
node "ip-10-0-129-227.us-west-2.compute.internal" deleted

$ oc get machine zhsun2-2rhzr-worker-us-west-2a-4czd4 -o yaml
status:
  addresses:
  - address: 10.0.129.227
    type: InternalIP
  - address: ""
    type: ExternalDNS
  - address: ip-10-0-129-227.us-west-2.compute.internal
    type: InternalDNS
  lastUpdated: "2019-06-25T06:00:15Z"
  nodeRef:
    kind: Node
    name: ip-10-0-129-227.us-west-2.compute.internal
    uid: 61d697a9-970e-11e9-9bdd-06fb8941e6f0
  providerStatus:
    apiVersion: awsproviderconfig.openshift.io/v1beta1
    conditions:
    - lastProbeTime: "2019-06-25T05:54:34Z"
      lastTransitionTime: "2019-06-25T05:54:34Z"
      message: machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreation
    instanceId: i-02479ec98a9e04896
    instanceState: running
    kind: AWSMachineProviderStatus

$ oc delete machine zhsun2-2rhzr-worker-us-west-2a-4czd4
machine.machine.openshift.io "zhsun2-2rhzr-worker-us-west-2a-4czd4" deleted

Comment 5 errata-xmlrpc 2019-10-16 06:29:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922