Bug 1834966 - VMware UPI - "Machine is in phase" & "machine does not have valid node reference" after fresh install
Summary: VMware UPI - "Machine is in phase" & "machine does not have valid node reference" after fresh install
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.4
Hardware: x86_64
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Joel Speed
QA Contact: Milind Yadav
URL:
Whiteboard:
Duplicates: 1822345 1834965
Depends On:
Blocks: 1837478 1837483
 
Reported: 2020-05-12 18:47 UTC by Morgan Peterman
Modified: 2023-12-15 17:53 UTC (History)
CC: 42 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Starting in 4.4, the installer began creating Machine and MachineSet manifests for vSphere, but the UPI instructions were not updated to remove those manifests. Consequence: Machine objects were created but failed, because the machine-api is not active for vSphere in 4.4. Fix: Update the UPI documentation to remove the newly created manifests. Result: Machine manifests and objects are no longer created, and the errors are not present.
Clone Of:
Clones: 1837478
Environment:
Last Closed: 2021-04-09 13:54:22 UTC
Target Upstream Version:
Embargoed:
jspeed: needinfo-




Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 3619 0 None closed Bug 1834966: update vSphere UPI docs to remove machinesets 2021-02-16 03:18:25 UTC
Red Hat Knowledge Base (Solution) 5086271 0 None None None 2020-05-20 06:09:19 UTC
Red Hat Knowledge Base (Solution) 5298231 0 None None None 2020-12-22 13:06:12 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:38:20 UTC

Description Morgan Peterman 2020-05-12 18:47:46 UTC
Description of problem:

After deploying OCP 4.4.3 on VMware the following alerts are present upon login:

May 12, 2:29 pm machine openshift-dj2bm-master-0 is in phase
May 12, 2:29 pm machine openshift-dj2bm-master-1 is in phase
May 12, 2:29 pm machine openshift-dj2bm-master-2 is in phase
May 12, 2:29 pm machine openshift-dj2bm-master-0 does not have valid node reference
May 12, 2:29 pm machine openshift-dj2bm-master-1 does not have valid node reference
May 12, 2:29 pm machine openshift-dj2bm-master-2 does not have valid node reference

The following Machines are created:

$ oc get machines -o wide
NAME                       PHASE   TYPE   REGION   ZONE   AGE   NODE   PROVIDERID   STATE
openshift-dj2bm-master-0                                  33m                       
openshift-dj2bm-master-1                                  33m                       
openshift-dj2bm-master-2                                  33m 

$ oc describe machine openshift-dj2bm-master-0
Name:         openshift-dj2bm-master-0
Namespace:    openshift-machine-api
Labels:       machine.openshift.io/cluster-api-cluster=openshift-dj2bm
              machine.openshift.io/cluster-api-machine-role=master
              machine.openshift.io/cluster-api-machine-type=master
Annotations:  <none>
API Version:  machine.openshift.io/v1beta1
Kind:         Machine
Metadata:
  Creation Timestamp:  2020-05-12T18:08:10Z
  Generation:          1
  Resource Version:    1695
  Self Link:           /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/openshift-dj2bm-master-0
  UID:                 f9d58c3a-003e-40d7-8f2e-1634ce9bf117
Spec:
  Metadata:
    Creation Timestamp:  <nil>
  Provider Spec:
    Value:
      API Version:  vsphereprovider.openshift.io/v1beta1
      Credentials Secret:
        Name:       vsphere-cloud-credentials
      Disk Gi B:    120
      Kind:         VSphereMachineProviderSpec
      Memory Mi B:  16384
      Metadata:
        Creation Timestamp:  <nil>
      Network:
        Devices:
          Network Name:      
      Num CP Us:             4
      Num Cores Per Socket:  1
      Template:              
      User Data Secret:
        Name:  master-user-data
      Workspace:
        Datacenter:  LAB
        Datastore:   raid0
        Folder:      openshift-dj2bm
        Server:      vcenter.lab.int
Status:
Events:  <none>

install-config.yaml

apiVersion: v1
baseDomain: lab.int
compute:
- hyperthreading: Enabled
  name: worker
  replicas: 0
controlPlane:
  hyperthreading: Enabled
  name: master
  replicas: 3
metadata:
  name: openshift
networking:
  clusterNetworks:
  - cidr: 10.254.0.0/16
    hostPrefix: 24
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  vsphere:
    vcenter: vcenter.lab.int
    username: Administrator
    password: ....
    datacenter: LAB
    defaultDatastore: raid0
pullSecret: '{
  "auths": {...}
}'
sshKey: 'ssh-rsa ...'



Version-Release number of selected component (if applicable):
4.4.3

How reproducible:

Every time.

Steps to Reproduce:
1. Provision OpenShift Container Platform 4.4.3 on VMWare using UPI

Actual results:

Machines are created that are not associated with the actual master nodes, which results in the error messages.

Expected results:

Either no Machines are created, or the Machines are created and then associated with the master nodes.

Comment 1 Morgan Peterman 2020-05-12 20:27:24 UTC
*** Bug 1834965 has been marked as a duplicate of this bug. ***

Comment 2 Michael McNeill 2020-05-14 14:37:47 UTC
I am able to reproduce this consistently in my vSphere 6.7U3 environment with a fresh install of OCP 4.4.3.

Comment 5 Alberto 2020-05-15 12:40:18 UTC
Hey Morgan Peterman, can you help me understand why you have Machine objects in your cluster? Can you point me to the step where they were created?

Comment 6 Morgan Peterman 2020-05-15 13:16:45 UTC
(In reply to Alberto from comment #5)
> Hey Morgan Peterman, can you help me understand why you have Machine
> objects in your cluster? Can you point me to the step where they were created?

Alberto,

I performed a normal UPI install for VMware. Michael McNeill is experiencing the same issue. I can deploy another cluster and capture whatever logs you require.

This is a copy of my install-config.yaml with pullSecret and sshKey redacted:

install-config.yaml

apiVersion: v1
baseDomain: lab.int
compute:
- hyperthreading: Enabled
  name: worker
  replicas: 0
controlPlane:
  hyperthreading: Enabled
  name: master
  replicas: 3
metadata:
  name: openshift
networking:
  clusterNetworks:
  - cidr: 10.254.0.0/16
    hostPrefix: 24
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  vsphere:
    vcenter: vcenter.lab.int
    username: Administrator
    password: ....
    datacenter: LAB
    defaultDatastore: raid0
pullSecret: '{
  "auths": {...}
}'
sshKey: 'ssh-rsa ...'

Comment 10 Alberto 2020-05-18 14:38:31 UTC
>Aditya, yes, remove them. There are instructions for removing the machinesets in other UPI docs (e.g. AWS), which should be the same. I will resolve this bug by adding the same instructions to the vSphere UPI docs.

Please note this bug is reporting master machines. The installer shouldn't have instantiated the master Machine objects in the cluster in the first place, or at minimum it should document that those alerts can be silenced.

In 4.4 there is no machine controller running for vSphere, so if you delete the Machine objects there are no consequences. As soon as you upgrade your cluster to 4.5, there will be a machine controller running and reacting to Machine object events, so making users think it is safe to delete Machine objects is ill-advised.

Comment 12 Patrick Dillon 2020-05-18 15:58:32 UTC
The issue is that work on vSphere IPI began in 4.4, and that work included having the installer create machinesets. We failed to update the documentation at that time to include the standard step of removing the machinesets. This step is common across all IPI platforms, and the steps should be largely the same for vSphere. Here is a link to the steps for Azure: https://github.com/openshift/installer/blob/master/docs/user/azure/install_upi.md#remove-control-plane-machines-and-machinesets

This will have to be resolved in 4.5 with an update to docs, similar to the above.
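
For reference, a minimal sketch of that documented step, assuming the default install directory layout (the manifest file names below are the ones quoted later in comment 48), so treat this as an illustration rather than the official procedure:
```
# Generate the manifests in the install directory
openshift-install create manifests --dir=.

# Remove the control-plane Machine and compute MachineSet manifests so the
# installer never persists these objects into the UPI cluster
rm -f openshift/99_openshift-cluster-api_master-machines-*.yaml
rm -f openshift/99_openshift-cluster-api_worker-machineset-*.yaml

# Continue the UPI flow as usual
openshift-install create ignition-configs --dir=.
```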

Comment 14 Alberto 2020-05-18 16:22:27 UTC
> In addition to updating the docs should we also be adding a KCS so when people go searching for this error they're directed that it is safe to remove the machines and machinesets?

>That has been done by our colleagues, so I am removing needinfo.

To make sure we are on the same page:

Documentation needs to be updated to reflect that you can remove the machine/machineSet manifests (if you don't want automated machine management) PRIOR to running the install (just like any other provider's UPI install).

AFTER those manifests are persisted to the cluster, we should not recommend/communicate that deleting them from the cluster is safe without understanding all the consequences.
Any vSphere cluster getting to >= 4.5 will have a machine controller running. Deleting a master machine object in >= 4.5 will delete the backing instance.

Comment 15 Abhinav Dahiya 2020-05-18 16:37:26 UTC
*** Bug 1822345 has been marked as a duplicate of this bug. ***

Comment 18 Michael McNeill 2020-05-20 22:45:32 UTC
I believe we need to remove the part of the KCS that references removing the machines, per Alberto [comment #14]. We should NOT be encouraging customers to do something that could have consequences in future versions (like 4.5.x, when the machine controller will take action on those removed master instances). I believe this is an urgent fix. Adding needinfo to Morgan because I believe he is the one who created the KCS.

> > In addition to updating the docs should we also be adding a KCS so when people go searching for this error they're directed that it is safe to remove the machines and machinesets?
> 
> >That has been done by our colleagues  so I am removing needinfo.
> 
> To make sure we are on the same page:
> 
> Documentation needs to be updated to reflect that you can remove the
> machine/machineSet manifests (if you don't want automated machine
> management) PRIOR to running the install (just like any other provider's
> UPI install).
> 
> AFTER those manifests are persisted to the cluster, we should not
> recommend/communicate that deleting them from the cluster is safe without
> understanding all the consequences.
> Any vSphere cluster getting to >= 4.5 will have a machine controller
> running. Deleting a master machine object in >= 4.5 will delete the backing
> instance.

Comment 23 jima 2020-06-01 08:32:39 UTC
The issue has been verified on 4.5.0-0.nightly-2020-05-31-230932.

Following the updated upstream vSphere UPI docs, I removed the machine and machineset manifests and then continued the installation.
After the installation completed, these resources can no longer be found in the openshift-machine-api namespace.
# oc get machines -n openshift-machine-api
No resources found.
# oc get machinesets -n openshift-machine-api
No resources found.

Comment 24 Robert DeMay 2020-06-26 12:30:53 UTC
I could not find the updated documentation. I am running 4.4.8. If I understand this correctly, it is safe to remove those machines. I have the same issue; however, there is no machineset for the masters.

[root@bastion vmocp]# oc get machines -n openshift-machine-api
NAME                   PHASE   TYPE   REGION   ZONE   AGE
vmocp-t2lp4-master-0                                  45h
vmocp-t2lp4-master-1                                  45h
vmocp-t2lp4-master-2                                  45h


[root@bastion vmocp]# oc get machinesets -n openshift-machine-api
NAME                 DESIRED   CURRENT   READY   AVAILABLE   AGE
vmocp-t2lp4-worker   3         0                             45h

Comment 25 Morgan Peterman 2020-06-26 14:11:12 UTC
Robert,

There would be no machineset for the masters, only for the workers.

Please see KCS - https://access.redhat.com/solutions/5086271

Comment 26 Robert DeMay 2020-06-26 14:15:22 UTC
Thank you

Comment 27 errata-xmlrpc 2020-07-13 17:37:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Comment 30 Alberto 2020-07-27 07:32:28 UTC
>I have a customer hitting this issue on 4.5.1 who wants to know if this issue can be resolved without re-installing the cluster.

Is this a brand new cluster or an upgraded one which already had these alerts? If it is a new one, please create a new BZ and share must-gather logs. Otherwise, please share them here.

Comment 35 Abhinav Dahiya 2020-08-07 18:28:14 UTC
The docs were fixed in the installer, and as for workarounds for removing the objects in the cluster after an upgrade, I think the machine-api team can help there. There is no further fix that the installer team can provide here.

Comment 44 Simon Belmas-Gauderic 2020-10-13 09:42:03 UTC
Hello,

One of my customers is facing this issue.

It seems that on a fresh UPI install on VMware, it still occurs.
```
Status:
  Last Updated:  2020-09-29T13:23:48Z
  Phase:         Provisioning
  Provider Status:
    Conditions:
      Last Probe Time:       2020-09-29T13:23:48Z
      Last Transition Time:  2020-09-29T13:23:48Z
      Message:               vm 'o45nia00-w9xck-rhcos' not found
      Reason:                MachineCreationFailed
      Status:                False
      Type:                  MachineCreation
```
Can we document the "workaround" or avoid this behaviour?

Thanks,
Simon Belmas-Gauderic
OCP Technical account manager

Comment 47 Stephen Cuppett 2020-10-14 11:46:14 UTC
Setting target release to the active development branch (4.7.0). For any fixes, where required and requested, cloned BZs will be created for those release maintenance streams where appropriate once they are identified.

Comment 48 Mennatallah Shaaban 2020-10-15 14:34:32 UTC
Hello,

We have a fresh UPI installation of a 4.5.7 cluster, but we did not remove openshift/99_openshift-cluster-api_master-machines-*.yaml and openshift/99_openshift-cluster-api_worker-machineset-*.yaml after creating the manifests, and we are facing the same issue.

oc get machines -n openshift-machine-api
NAME                     PHASE          TYPE   REGION   ZONE   AGE
ocpprod-zswp9-master-0   Provisioning                          30d
ocpprod-zswp9-master-1   Provisioning                          30d
ocpprod-zswp9-master-2   Provisioning                          30d

When I describe the machines, I get the following status:

Status:
  Last Updated:  2020-09-15T06:16:31Z
  Phase:         Provisioning
  Provider Status:
    Conditions:
      Last Probe Time:       2020-09-15T06:16:31Z
      Last Transition Time:  2020-09-15T06:16:31Z
      Message:               vm 'ocpprod-zswp9-rhcos' not found
      Reason:                MachineCreationFailed
      Status:                False
      Type:                  MachineCreation
Events:
  Type     Reason        Age                     From               Message
  ----     ------        ----                    ----               -------
  Warning  FailedCreate  4m48s (x2588 over 28d)  vspherecontroller  vm 'ocpprod-zswp9-rhcos' not found

As mentioned above, deleting the machines may cause deletion of the backing instances, so what about applying the following:
1. Remove the correct vSphere password from the cloud provider configmap, which will prevent the deletion of the VMs.
2. Remove the machines and machinesets by deleting the finalizers.
3. Add the correct vSphere password back to the cloud provider configmap.

Please let me know if this scenario will work or why it would fail.
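
For what it's worth, a minimal sketch of step 2 above using `oc` (the machine names are taken from the output earlier in this comment; this assumes step 1 has already stopped the controller from reaching vCenter, and that the Machines were never linked to real VMs):
```
# Delete the orphaned Machine objects (the finalizer will block completion)
oc delete machine ocpprod-zswp9-master-0 ocpprod-zswp9-master-1 ocpprod-zswp9-master-2 \
  -n openshift-machine-api --wait=false

# Remove the finalizer from each Machine so the deletion can complete
for m in ocpprod-zswp9-master-0 ocpprod-zswp9-master-1 ocpprod-zswp9-master-2; do
  oc patch machine "$m" -n openshift-machine-api --type merge \
    -p '{"metadata":{"finalizers":null}}'
done
```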

Comment 50 Joel Speed 2020-11-13 12:16:10 UTC
@mshaaban Your suggested workaround should work as far as I can tell. An alternative would be to add an exception to the ClusterVersion resource so that it stops managing the machine-api-operator, scale this to zero, then scale the machine-api-controllers to zero, then force the deletion of the resources. This would ensure that our controllers are not running at all when you force the deletion so it should prevent any potential mishaps.
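
A minimal sketch of that alternative sequence, assuming the standard machine-api deployment names and the ClusterVersion overrides mechanism (`<machine-name>` is a placeholder; verify against your cluster before running anything, and revert the override and scaling afterwards):
```
# 1. Stop the CVO from managing the machine-api-operator deployment
oc patch clusterversion version --type merge -p \
  '{"spec":{"overrides":[{"kind":"Deployment","group":"apps","name":"machine-api-operator","namespace":"openshift-machine-api","unmanaged":true}]}}'

# 2. Scale the operator, then the controllers, down to zero
oc scale deployment machine-api-operator -n openshift-machine-api --replicas=0
oc scale deployment machine-api-controllers -n openshift-machine-api --replicas=0

# 3. Force-delete the orphaned Machine (remove the finalizer if it hangs)
oc delete machine <machine-name> -n openshift-machine-api --wait=false
oc patch machine <machine-name> -n openshift-machine-api --type merge \
  -p '{"metadata":{"finalizers":null}}'
```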

@pbertera If your Machine objects are already deleting and the controller is unable to connect to the vCenter, then it should be safe to delete them. I would suggest using the method I described above, which relates to https://bugzilla.redhat.com/show_bug.cgi?id=1834966#c48.
If the objects are stuck in the deleting phase, you can remove their finalizers to allow them to be deleted. You should only do so if you are sure that this is safe, though in your case it sounds like it is, as the Machines never corresponded to real VMs.

Apart from supporting these misinstalled clusters, I don't think there's anything to be done for this BZ.
I would rather we didn't have this process publicly documented, since it should never be needed in a normal situation, and in an IPI cluster it could have catastrophic consequences.

If there are no objections, I will close this issue again on Friday 20th November

Comment 51 Alejandro G 2020-11-13 15:41:24 UTC
I was reading through all the comments; support also mentioned a plan to fix this in version 4.7. I'm wondering if we can match the existing nodes to those machine objects.

https://access.redhat.com/solutions/5298231 mentions:

"confirming that the machines are not mapped to the nodes by checking the logs from the Machine API controllers"

What strings should we look for to determine whether they are mapped or not?

The machine-api-controllers pod has 4 containers:

machineset-controller

I1113 02:23:11.494965       1 leaderelection.go:242] attempting to acquire leader lease  openshift-machine-api/cluster-api-provider-machineset-leader...
I1113 02:25:46.816628       1 leaderelection.go:252] successfully acquired lease openshift-machine-api/cluster-api-provider-machineset-leader
I1113 02:25:46.817204       1 reflector.go:175] Starting reflector *v1beta1.MachineSet (10m19.747206386s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I1113 02:25:46.817217       1 reflector.go:211] Listing and watching *v1beta1.MachineSet from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I1113 02:25:46.917539       1 reflector.go:175] Starting reflector *v1beta1.Machine (9m50.956499648s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I1113 02:25:46.917562       1 reflector.go:211] Listing and watching *v1beta1.Machine from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I1113 03:39:46.631619       1 reflector.go:211] Listing and watching *v1beta1.MachineSet from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I1113 03:39:46.766780       1 reflector.go:211] Listing and watching *v1beta1.Machine from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I1113 03:43:02.553948       1 reflector.go:211] Listing and watching *v1beta1.MachineSet from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I1113 03:43:02.960680       1 reflector.go:211] Listing and watching *v1beta1.Machine from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224


machine-controller

E1113 02:38:10.925244       1 controller.go:272] kubedev-stsnx-master-0: failed to check if machine exists: kubedev-stsnx-master-0: failed to create scope for machine: failed to create vSphere session: error setting up new vSphere SOAP client: Post https://plesxvc01.cscinfo.com/sdk: dial tcp 10.96.4.30:443: i/o timeout
I1113 02:38:11.928356       1 controller.go:169] kubedev-stsnx-master-1: reconciling Machine
I1113 02:38:11.928392       1 actuator.go:80] kubedev-stsnx-master-1: actuator checking if machine exists
E1113 02:38:41.933670       1 controller.go:272] kubedev-stsnx-master-1: failed to check if machine exists: kubedev-stsnx-master-1: failed to create scope for machine: failed to create vSphere session: error setting up new vSphere SOAP client: Post https://plesxvc01.cscinfo.com/sdk: dial tcp 10.96.4.30:443: i/o timeout
I1113 02:38:42.933902       1 controller.go:169] kubedev-stsnx-master-2: reconciling Machine
I1113 02:38:42.933927       1 actuator.go:80] kubedev-stsnx-master-2: actuator checking if machine exists
E1113 02:39:12.940185       1 controller.go:272] kubedev-stsnx-master-2: failed to check if machine exists: kubedev-stsnx-master-2: failed to create scope for machine: failed to create vSphere session: error setting up new vSphere SOAP client: Post https://plesxvc01.cscinfo.com/sdk: dial tcp 10.96.4.30:443: i/o timeout
I1113 02:39:13.940555       1 controller.go:169] kubedev-stsnx-master-0: reconciling Machine
I1113 02:39:13.940658       1 actuator.go:80] kubedev-stsnx-master-0: actuator checking if machine exists



nodelink-controller

I1113 13:53:16.717251       1 nodelink_controller.go:409] Finding machine from node "ocp-control-02.kubedev.cscglobal.com"
I1113 13:53:16.717262       1 nodelink_controller.go:426] Finding machine from node "ocp-control-02.kubedev.cscglobal.com" by ProviderID
I1113 13:53:16.717279       1 nodelink_controller.go:449] Finding machine from node "ocp-control-02.kubedev.cscglobal.com" by IP
I1113 13:53:16.717289       1 nodelink_controller.go:454] Found internal IP for node "ocp-control-02.kubedev.cscglobal.com": "10.96.162.51"
I1113 13:53:16.717298       1 nodelink_controller.go:478] Matching machine not found for node "ocp-control-02.kubedev.cscglobal.com" with internal IP "10.96.162.51"
W1113 13:53:16.717307       1 nodelink_controller.go:212] Machine for node "ocp-control-02.kubedev.cscglobal.com" not found
I1113 13:53:24.961704       1 nodelink_controller.go:58] Adding providerID "vsphere://423ca0d9-c373-545d-ec33-0a8916e32217" for node "ocp-compute-02.kubedev.cscglobal.com" to indexer
I1113 13:53:24.961785       1 nodelink_controller.go:92] Adding internal IP "10.96.162.77" for node "ocp-compute-02.kubedev.cscglobal.com" to indexer
I1113 13:53:24.961809       1 nodelink_controller.go:58] Adding providerID "vsphere://423ca0d9-c373-545d-ec33-0a8916e32217" for node "ocp-compute-02.kubedev.cscglobal.com" to indexer
I1113 13:53:24.961830       1 nodelink_controller.go:92] Adding internal IP "10.96.162.77" for node "ocp-compute-02.kubedev.cscglobal.com" to indexer
I1113 13:53:24.961868       1 nodelink_controller.go:188] Reconciling Node /ocp-compute-02.kubedev.cscglobal.com
I1113 13:53:24.961914       1 nodelink_controller.go:409] Finding machine from node "ocp-compute-02.kubedev.cscglobal.com"




machine-healthcheck-controller

E1113 13:14:01.696552       1 machinehealthcheck_controller.go:387] No-op: Unable to retrieve machine from node "/ocp-compute-20.kubedev.cscglobal.com": expecting one machine for node ocp-compute-20.kubedev.cscglobal.com, got: []
E1113 13:14:03.181030       1 machinehealthcheck_controller.go:387] No-op: Unable to retrieve machine from node "/ocp-control-03.kubedev.cscglobal.com": expecting one machine for node ocp-control-03.kubedev.cscglobal.com, got: []
E1113 13:14:03.181076       1 machinehealthcheck_controller.go:387] No-op: Unable to retrieve machine from node "/ocp-control-03.kubedev.cscglobal.com": expecting one machine for node ocp-control-03.kubedev.cscglobal.com, got: []
E1113 13:14:04.796142       1 machinehealthcheck_controller.go:387] No-op: Unable to retrieve machine from node "/ocp-compute-16.kubedev.cscglobal.com": expecting one machine for node ocp-compute-16.kubedev.cscglobal.com, got: []
E1113 13:14:04.796204       1 machinehealthcheck_controller.go:387] No-op: Unable to retrieve machine from node "/ocp-compute-16.kubedev.cscglobal.com": expecting one machine for node ocp-compute-16.kubedev.cscglobal.com, got: []
E1113 13:14:23.928296       1 machinehealthcheck_controller.go:387] No-op: Unable to retrieve machine from node "/ocp-infra-03.kubedev.cscglobal.com": expecting one machine for node ocp-infra-03.kubedev.cscglobal.com, got: []
E1113 13:14:23.928337       1 machinehealthcheck_controller.go:387] No-op: Unable to retrieve machine from node "/ocp-infra-03.kubedev.cscglobal.com": expecting one machine for node ocp-infra-03.kubedev.cscglobal.com, got: []
E1113 13:14:33.549873       1 machinehealthcheck_controller.go:387] No-op: Unable to retrieve machine from node "/ocp-infra-04.kubedev.cscglobal.com": expecting one machine for node ocp-infra-04.kubedev.cscglobal.com, got: []
E1

Comment 52 Joel Speed 2020-11-13 16:10:27 UTC
You would want to look at the machine-controller logs

You can see in the machine-controller logs in this case that it can't talk to the vSphere endpoint, so I doubt it has ever been configured properly, and the Machines are probably all sitting in the `Provisioning` state, right?

You can also look at the Machine: if there is a providerID on the Machine, it means that it has been linked, and deleting the Machine would delete that provider instance.
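
A quick way to check both, assuming the standard field locations (spec.providerID and status.phase; the machine name below is taken from comment 51):
```
# Empty providerID output means the Machine was never linked to a vSphere VM
oc get machine kubedev-stsnx-master-0 -n openshift-machine-api \
  -o jsonpath='{.spec.providerID}{"\n"}{.status.phase}{"\n"}'
```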

Comment 53 Alejandro G 2020-11-13 17:00:42 UTC
(In reply to Joel Speed from comment #52)
> You would want to look at the machine-controller logs
> 
> You can see in the machine-controller logs in this case that it can't talk
> to the vSphere endpoint, so I doubt it has ever been configured properly, so
> the Machines are probably all sat in the `Provisioning` state right?
> 
> You can also look at the Machine, if there's a providerID on the Machine,
> this means that it has been linked and would delete that provider instance
> if deleted.

Thanks Joel

I forgot to mention this is a UPI installation.

The machines do not show any phase at all.

Considering all the information, I believe deleting the machines will not impact the cluster. I'll test this in a lab I'm setting up and post my notes.



[aguadarr@dlosbastion01 machines]$ oc get machines
NAME                     PHASE   TYPE   REGION   ZONE   AGE
kubedev-stsnx-master-0                                  15h
kubedev-stsnx-master-1                                  15h
kubedev-stsnx-master-2                                  15h


[aguadarr@dlosbastion01 machines]$ oc describe machine kubedev-stsnx-master-0
Name:         kubedev-stsnx-master-0
Namespace:    openshift-machine-api
Labels:       machine.openshift.io/cluster-api-cluster=kubedev-stsnx
              machine.openshift.io/cluster-api-machine-role=master
              machine.openshift.io/cluster-api-machine-type=master
Annotations:  <none>
API Version:  machine.openshift.io/v1beta1
Kind:         Machine
Metadata:
  Creation Timestamp:  2020-11-13T00:12:37Z
  Finalizers:
    machine.machine.openshift.io
  Generation:  1
  Managed Fields:
    API Version:  machine.openshift.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:machine.openshift.io/cluster-api-cluster:
          f:machine.openshift.io/cluster-api-machine-role:
          f:machine.openshift.io/cluster-api-machine-type:
      f:spec:
        .:
        f:metadata:
        f:providerSpec:
          .:
          f:value:
            .:
            f:apiVersion:
            f:credentialsSecret:
            f:diskGiB:
            f:kind:
            f:memoryMiB:
            f:metadata:
            f:network:
            f:numCPUs:
            f:numCoresPerSocket:
            f:snapshot:
            f:template:
            f:userDataSecret:
            f:workspace:
      f:status:
    Manager:      cluster-bootstrap
    Operation:    Update
    Time:         2020-11-13T00:12:37Z
    API Version:  machine.openshift.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"machine.machine.openshift.io":
    Manager:         machine-controller-manager
    Operation:       Update
    Time:            2020-11-13T00:21:30Z
  Resource Version:  12690
  Self Link:         /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/kubedev-stsnx-master-0
  UID:               caddd6d9-62d4-4f49-ba58-a23507e1bee0
Spec:
  Metadata:
  Provider Spec:
    Value:
      API Version:  vsphereprovider.openshift.io/v1beta1
      Credentials Secret:
        Name:       vsphere-cloud-credentials
      Disk Gi B:    120
      Kind:         VSphereMachineProviderSpec
      Memory Mi B:  16384
      Metadata:
        Creation Timestamp:  <nil>
      Network:
        Devices:
          Network Name:      
      Num CP Us:             4
      Num Cores Per Socket:  1
      Snapshot:              
      Template:              kubedev-stsnx-rhcos
      User Data Secret:
        Name:  master-user-data
      Workspace:
        Datacenter:     US1-Ashburn
        Datastore:      pvntx56-vms-k8s
        Folder:         /US1-Ashburn/vm/kubedev-stsnx
        Resource Pool:  /US1-Ashburn/host//Resources
        Server:         plesxvc01.cscinfo.com
Status:
Events:  <none>

Comment 78 Joel Speed 2021-04-09 13:54:22 UTC
We have come to a conclusion about the actions that need to be taken to prevent this issue in the future for customers.

Since this will be a larger piece of work, we are going to track this in Jira going forward.

For those interested, please see https://issues.redhat.com/browse/OCPCLOUD-1135 for further details.

