Description of problem:

After deploying OCP 4.4.3 on VMware, the following alerts are present upon login:

```
May 12, 2:29 pm  machine openshift-dj2bm-master-0 is in phase
May 12, 2:29 pm  machine openshift-dj2bm-master-1 is in phase
May 12, 2:29 pm  machine openshift-dj2bm-master-2 is in phase
May 12, 2:29 pm  machine openshift-dj2bm-master-0 does not have valid node reference
May 12, 2:29 pm  machine openshift-dj2bm-master-1 does not have valid node reference
May 12, 2:29 pm  machine openshift-dj2bm-master-2 does not have valid node reference
```

The following Machines are created:

```
$ oc get machines -o wide
NAME                       PHASE   TYPE   REGION   ZONE   AGE   NODE   PROVIDERID   STATE
openshift-dj2bm-master-0                                  33m
openshift-dj2bm-master-1                                  33m
openshift-dj2bm-master-2                                  33m

$ oc describe machine openshift-dj2bm-master-0
Name:         openshift-dj2bm-master-0
Namespace:    openshift-machine-api
Labels:       machine.openshift.io/cluster-api-cluster=openshift-dj2bm
              machine.openshift.io/cluster-api-machine-role=master
              machine.openshift.io/cluster-api-machine-type=master
Annotations:  <none>
API Version:  machine.openshift.io/v1beta1
Kind:         Machine
Metadata:
  Creation Timestamp:  2020-05-12T18:08:10Z
  Generation:          1
  Resource Version:    1695
  Self Link:           /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/openshift-dj2bm-master-0
  UID:                 f9d58c3a-003e-40d7-8f2e-1634ce9bf117
Spec:
  Metadata:
    Creation Timestamp:  <nil>
  Provider Spec:
    Value:
      API Version:  vsphereprovider.openshift.io/v1beta1
      Credentials Secret:
        Name:              vsphere-cloud-credentials
      Disk Gi B:           120
      Kind:                VSphereMachineProviderSpec
      Memory Mi B:         16384
      Metadata:
        Creation Timestamp:  <nil>
      Network:
        Devices:
          Network Name:
      Num CP Us:             4
      Num Cores Per Socket:  1
      Template:
      User Data Secret:
        Name:  master-user-data
      Workspace:
        Datacenter:  LAB
        Datastore:   raid0
        Folder:      openshift-dj2bm
        Server:      vcenter.lab.int
Status:
Events:  <none>
```

install-config.yaml:

```yaml
apiVersion: v1
baseDomain: lab.int
compute:
- hyperthreading: Enabled
  name: worker
  replicas: 0
controlPlane:
  hyperthreading: Enabled
  name: master
  replicas: 3
metadata:
  name: openshift
networking:
  clusterNetworks:
  - cidr: 10.254.0.0/16
    hostPrefix: 24
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  vsphere:
    vcenter: vcenter.lab.int
    username: Administrator
    password: ....
    datacenter: LAB
    defaultDatastore: raid0
pullSecret: '{ "auths": {...} }'
sshKey: 'ssh-rsa ...'
```

Version-Release number of selected component (if applicable): 4.4.3

How reproducible: Every time.

Steps to Reproduce:
1. Provision OpenShift Container Platform 4.4.3 on VMware using UPI

Actual results: Machines are created that are not associated with the actual master nodes, which results in the error messages.

Expected results: Either no Machines are created, or the Machines are created and then associated with the master nodes.
*** Bug 1834965 has been marked as a duplicate of this bug. ***
I am able to reproduce this consistently in my vSphere 6.7U3 environment with a fresh install of OCP 4.4.3.
Hey Morgan Peterman, can you help me understand why you have machine objects in your cluster? Can you point me to the step where they were created?
(In reply to Alberto from comment #5)
> Hey Morgan Peterman, Can you help me understand why do you have machine
> objects in your cluster? Can you point me to the step they were created?

Alberto, I performed a normal UPI install for VMware. Michael McNeill is experiencing the same issue. I can deploy another cluster and capture whatever logs you require. This is a copy of my install-config.yaml with pullSecret and sshKey redacted:

```yaml
apiVersion: v1
baseDomain: lab.int
compute:
- hyperthreading: Enabled
  name: worker
  replicas: 0
controlPlane:
  hyperthreading: Enabled
  name: master
  replicas: 3
metadata:
  name: openshift
networking:
  clusterNetworks:
  - cidr: 10.254.0.0/16
    hostPrefix: 24
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  vsphere:
    vcenter: vcenter.lab.int
    username: Administrator
    password: ....
    datacenter: LAB
    defaultDatastore: raid0
pullSecret: '{ "auths": {...} }'
sshKey: 'ssh-rsa ...'
```
> Aditya, yes remove them. There are instructions for removing the machinesets in other UPI instructions (e.g. AWS) which should be the same. I will resolve this bug by adding the same instructions to the vSphere UPI docs.

Please note this bug is reporting master machines. The installer shouldn't have instantiated the master machine objects in the cluster in the first place, or at minimum should document that those alerts can be silenced.

In 4.4 there is no machine controller running for vSphere, so if you delete the machine objects there are no consequences. As soon as you upgrade your cluster to 4.5, there will be a machine controller running and therefore reacting to machine object events, so making users think it is safe to delete machine objects is ill-advised.
The issue is that work for vSphere IPI began in 4.4, and that work included the installer creation of machinesets. We failed to update the documentation at that time to include the standard step of removing the machinesets. This step is common across all IPI platforms, and the steps should be largely the same for vSphere.

Here is a link to the steps for Azure: https://github.com/openshift/installer/blob/master/docs/user/azure/install_upi.md#remove-control-plane-machines-and-machinesets

This will have to be resolved in 4.5 with an update to docs, similar to the above.
> In addition to updating the docs should we also be adding a KCS so when people go searching for this error they're directed that it is safe to remove the machines and machinesets?

> That has been done by our colleagues so I am removing needinfo.

To make sure we are on the same page:

Documentation needs to be updated to reflect that you can remove the machine/machineSet manifests (if you don't want automated machine management) PRIOR to running the install (just like any other provider UPI install).

AFTER those manifests are persisted to the cluster, we should not recommend/communicate that deleting them from the cluster is safe without understanding all the consequences. Any vSphere cluster getting to >= 4.5 will have a machine controller running. Deleting a master machine object in >= 4.5 will delete the backing instance.
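The pre-install manifest removal described above can be sketched as follows. This is a sketch, not the official docs: the `install-dir` directory name is an arbitrary example, and the manifest file names are taken from comment reports later in this bug.

```shell
# Sketch of the UPI flow: after generating manifests and before creating
# ignition configs, remove the control-plane Machine and compute MachineSet
# manifests so they are never persisted to the cluster.
#
# 1. Generate manifests (run separately; shown as a comment here):
#      openshift-install create manifests --dir=install-dir
#
# 2. Remove the Machine/MachineSet manifests:
rm -f install-dir/openshift/99_openshift-cluster-api_master-machines-*.yaml
rm -f install-dir/openshift/99_openshift-cluster-api_worker-machineset-*.yaml
#
# 3. Continue the install (run separately):
#      openshift-install create ignition-configs --dir=install-dir
```

Because the manifests never reach the cluster, no Machine objects exist for the machine controller to act on after an upgrade to >= 4.5.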
*** Bug 1822345 has been marked as a duplicate of this bug. ***
I believe we need to remove the part of the KCS that references removing the machines, per Alberto [comment #14]. We should NOT be encouraging customers to do something that could have consequences in future versions (like 4.5.x when the machine controller will take action on those removed master instances). I believe this is an urgent fix. Adding needinfo to Morgan because I believe he is the one that created the KCS. > > In addition to updating the docs should we also be adding a KCS so when people go searching for this error they're directed that it is safe to remove the machines and machinesets? > > >That has been done by our colleagues so I am removing needinfo. > > To make sure we are on the same page: > > Documentation needs to be updated to reflect that you can remove the > machine/machineSets manifests (if you don't want automated machine > management) PRIOR to run the install (just like any other provider UPI > install). > > AFTER those manifests are persisted to the cluster we should not > recommend/communicate that delete them from the cluster is safe without > understanding all consequences. > Any vSphere cluster getting to >= 4.5 will have a machine controller > running. Deleting a master machine object in >= 4.5 will delete the backing > instance.
The issue has been verified on 4.5.0-0.nightly-2020-05-31-230932, following the updated upstream vSphere UPI docs to remove the machines and machinesets, then continuing the installation. After installation is complete, these resources can no longer be found in the openshift-machine-api namespace:

```
# oc get machines -n openshift-machine-api
No resources found.
# oc get machinesets -n openshift-machine-api
No resources found.
```
I could not find the updated documentation. I am running 4.4.8. If I understand this correctly, it is safe to remove those machines. I have the same issue; however, there is no machineset for the masters.

```
[root@bastion vmocp]# oc get machines -n openshift-machine-api
NAME                   PHASE   TYPE   REGION   ZONE   AGE
vmocp-t2lp4-master-0                                  45h
vmocp-t2lp4-master-1                                  45h
vmocp-t2lp4-master-2                                  45h
[root@bastion vmocp]# oc get machinesets -n openshift-machine-api
NAME                 DESIRED   CURRENT   READY   AVAILABLE   AGE
vmocp-t2lp4-worker   3         0                             45h
```
Robert, there would be no machineset for the masters, only for the workers. Please see this KCS: https://access.redhat.com/solutions/5086271
Thank you
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409
> I have a customer hitting this issue on 4.5.1 and wants to know if this issue can be resolved without re-installing the cluster.

Is this a brand new cluster or an upgraded one which already had these alerts? If it is a new one, please create a new BZ and share must-gather logs. Otherwise please share them here.
The docs were fixed in the installer, and as for workarounds for removing them in the cluster after upgrade, I think the machine-api team can help there. There is no further fix the installer team can provide here.
Hello,

One of my customers is facing this issue. It seems that it still occurs on a fresh UPI install on VMware.

```
Status:
  Last Updated:  2020-09-29T13:23:48Z
  Phase:         Provisioning
  Provider Status:
    Conditions:
      Last Probe Time:       2020-09-29T13:23:48Z
      Last Transition Time:  2020-09-29T13:23:48Z
      Message:               vm 'o45nia00-w9xck-rhcos' not found
      Reason:                MachineCreationFailed
      Status:                False
      Type:                  MachineCreation
```

Can we document the "workaround" or avoid this behaviour?

Thanks,
Simon Belmas-Gauderic
OCP Technical Account Manager
Setting target release to the active development branch (4.7.0). For any fixes, where required and requested, cloned BZs will be created for those release maintenance streams where appropriate once they are identified.
Hello,

We have a fresh UPI installation of a 4.5.7 cluster, but we did not remove

openshift/99_openshift-cluster-api_master-machines-*.yaml
openshift/99_openshift-cluster-api_worker-machineset-*.yaml

after creation of the manifests, and we are facing the same issue.

```
oc get machines -n openshift-machine-api
NAME                     PHASE          TYPE   REGION   ZONE   AGE
ocpprod-zswp9-master-0   Provisioning                          30d
ocpprod-zswp9-master-1   Provisioning                          30d
ocpprod-zswp9-master-2   Provisioning                          30d
```

When I describe the machines, I get the following in the status:

```
Status:
  Last Updated:  2020-09-15T06:16:31Z
  Phase:         Provisioning
  Provider Status:
    Conditions:
      Last Probe Time:       2020-09-15T06:16:31Z
      Last Transition Time:  2020-09-15T06:16:31Z
      Message:               vm 'ocpprod-zswp9-rhcos' not found
      Reason:                MachineCreationFailed
      Status:                False
      Type:                  MachineCreation
Events:
  Type     Reason        Age                     From               Message
  ----     ------        ----                    ----               -------
  Warning  FailedCreate  4m48s (x2588 over 28d)  vspherecontroller  vm 'ocpprod-zswp9-rhcos' not found
```

As mentioned above, deleting the machines may cause deletion of the backing instance, so what about applying the following:

1. Remove the correct vSphere password from the cloud provider configmap, which will prevent deletion of the VMs
2. Remove the machines and machinesets by deleting the finalizers
3. Add the correct vSphere password back to the cloud provider configmap

Please let me know if this scenario will work, or why it will fail.
@mshaaban Your suggested workaround should work as far as I can tell. An alternative would be to add an exception to the ClusterVersion resource so that it stops managing the machine-api-operator, scale this to zero, then scale the machine-api-controllers to zero, then force the deletion of the resources. This would ensure that our controllers are not running at all when you force the deletion, so it should prevent any potential mishaps.

@pbertera If your Machine objects are already deleting and the controller is unable to connect to the vCenter, then it should be safe to delete them. I would suggest using the method I described above, which relates to https://bugzilla.redhat.com/show_bug.cgi?id=1834966#c48. If the objects are stuck in the deleting phase, you can remove their finalizers to allow them to be deleted. You should only do so if you are sure that this is safe, though in your case it sounds like it should be, as the Machines never related to real VMs.

Apart from supporting these misinstalled clusters, I don't think there's anything to be done for this BZ. I would rather we didn't have this process publicly documented, since it should never be needed in a normal situation and in an IPI cluster could have catastrophic consequences. If there are no objections, I will close this issue again on Friday 20th November.
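A sketch of the ClusterVersion exception described above, using the standard `spec.overrides` mechanism. This is a sketch under assumptions, not a supported procedure: the deployment name and namespace here are what the machine-api-operator normally uses, but verify them on your cluster first.

```yaml
# Hypothetical ClusterVersion patch: mark the machine-api-operator deployment
# unmanaged so the cluster-version operator stops reconciling it, which then
# allows it (and in turn machine-api-controllers) to be scaled to zero before
# forcing deletion of the Machine objects.
spec:
  overrides:
  - kind: Deployment
    group: apps
    name: machine-api-operator
    namespace: openshift-machine-api
    unmanaged: true
```

After applying an override like this, the deployments can be scaled down (e.g. `oc scale deployment/machine-api-operator -n openshift-machine-api --replicas=0`, then the same for machine-api-controllers) so no controller is running when the resources are force-deleted. Remember to revert the override afterwards.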
I was reading all the comments; support also mentioned a plan to fix this in version 4.7. I'm wondering if we can match the existing nodes to those machine objects.

https://access.redhat.com/solutions/5298231 mentions "confirming that the machines are not mapped to the nodes by checking the logs from the Machine API controllers". What strings should we look for to determine if they are mapped or not?

machine-api-controllers has 4 containers:

machineset-controller

```
I1113 02:23:11.494965  1 leaderelection.go:242] attempting to acquire leader lease openshift-machine-api/cluster-api-provider-machineset-leader...
I1113 02:25:46.816628  1 leaderelection.go:252] successfully acquired lease openshift-machine-api/cluster-api-provider-machineset-leader
I1113 02:25:46.817204  1 reflector.go:175] Starting reflector *v1beta1.MachineSet (10m19.747206386s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I1113 02:25:46.817217  1 reflector.go:211] Listing and watching *v1beta1.MachineSet from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I1113 02:25:46.917539  1 reflector.go:175] Starting reflector *v1beta1.Machine (9m50.956499648s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I1113 02:25:46.917562  1 reflector.go:211] Listing and watching *v1beta1.Machine from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I1113 03:39:46.631619  1 reflector.go:211] Listing and watching *v1beta1.MachineSet from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I1113 03:39:46.766780  1 reflector.go:211] Listing and watching *v1beta1.Machine from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I1113 03:43:02.553948  1 reflector.go:211] Listing and watching *v1beta1.MachineSet from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
I1113 03:43:02.960680  1 reflector.go:211] Listing and watching *v1beta1.Machine from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224
```

machine-controller

```
E1113 02:38:10.925244  1 controller.go:272] kubedev-stsnx-master-0: failed to check if machine exists: kubedev-stsnx-master-0: failed to create scope for machine: failed to create vSphere session: error setting up new vSphere SOAP client: Post https://plesxvc01.cscinfo.com/sdk: dial tcp 10.96.4.30:443: i/o timeout
I1113 02:38:11.928356  1 controller.go:169] kubedev-stsnx-master-1: reconciling Machine
I1113 02:38:11.928392  1 actuator.go:80] kubedev-stsnx-master-1: actuator checking if machine exists
E1113 02:38:41.933670  1 controller.go:272] kubedev-stsnx-master-1: failed to check if machine exists: kubedev-stsnx-master-1: failed to create scope for machine: failed to create vSphere session: error setting up new vSphere SOAP client: Post https://plesxvc01.cscinfo.com/sdk: dial tcp 10.96.4.30:443: i/o timeout
I1113 02:38:42.933902  1 controller.go:169] kubedev-stsnx-master-2: reconciling Machine
I1113 02:38:42.933927  1 actuator.go:80] kubedev-stsnx-master-2: actuator checking if machine exists
E1113 02:39:12.940185  1 controller.go:272] kubedev-stsnx-master-2: failed to check if machine exists: kubedev-stsnx-master-2: failed to create scope for machine: failed to create vSphere session: error setting up new vSphere SOAP client: Post https://plesxvc01.cscinfo.com/sdk: dial tcp 10.96.4.30:443: i/o timeout
I1113 02:39:13.940555  1 controller.go:169] kubedev-stsnx-master-0: reconciling Machine
I1113 02:39:13.940658  1 actuator.go:80] kubedev-stsnx-master-0: actuator checking if machine exists
```

nodelink-controller

```
I1113 13:53:16.717251  1 nodelink_controller.go:409] Finding machine from node "ocp-control-02.kubedev.cscglobal.com"
I1113 13:53:16.717262  1 nodelink_controller.go:426] Finding machine from node "ocp-control-02.kubedev.cscglobal.com" by ProviderID
I1113 13:53:16.717279  1 nodelink_controller.go:449] Finding machine from node "ocp-control-02.kubedev.cscglobal.com" by IP
I1113 13:53:16.717289  1 nodelink_controller.go:454] Found internal IP for node "ocp-control-02.kubedev.cscglobal.com": "10.96.162.51"
I1113 13:53:16.717298  1 nodelink_controller.go:478] Matching machine not found for node "ocp-control-02.kubedev.cscglobal.com" with internal IP "10.96.162.51"
W1113 13:53:16.717307  1 nodelink_controller.go:212] Machine for node "ocp-control-02.kubedev.cscglobal.com" not found
I1113 13:53:24.961704  1 nodelink_controller.go:58] Adding providerID "vsphere://423ca0d9-c373-545d-ec33-0a8916e32217" for node "ocp-compute-02.kubedev.cscglobal.com" to indexer
I1113 13:53:24.961785  1 nodelink_controller.go:92] Adding internal IP "10.96.162.77" for node "ocp-compute-02.kubedev.cscglobal.com" to indexer
I1113 13:53:24.961809  1 nodelink_controller.go:58] Adding providerID "vsphere://423ca0d9-c373-545d-ec33-0a8916e32217" for node "ocp-compute-02.kubedev.cscglobal.com" to indexer
I1113 13:53:24.961830  1 nodelink_controller.go:92] Adding internal IP "10.96.162.77" for node "ocp-compute-02.kubedev.cscglobal.com" to indexer
I1113 13:53:24.961868  1 nodelink_controller.go:188] Reconciling Node /ocp-compute-02.kubedev.cscglobal.com
I1113 13:53:24.961914  1 nodelink_controller.go:409] Finding machine from node "ocp-compute-02.kubedev.cscglobal.com"
```

machine-healthcheck-controller

```
E1113 13:14:01.696552  1 machinehealthcheck_controller.go:387] No-op: Unable to retrieve machine from node "/ocp-compute-20.kubedev.cscglobal.com": expecting one machine for node ocp-compute-20.kubedev.cscglobal.com, got: []
E1113 13:14:03.181030  1 machinehealthcheck_controller.go:387] No-op: Unable to retrieve machine from node "/ocp-control-03.kubedev.cscglobal.com": expecting one machine for node ocp-control-03.kubedev.cscglobal.com, got: []
E1113 13:14:03.181076  1 machinehealthcheck_controller.go:387] No-op: Unable to retrieve machine from node "/ocp-control-03.kubedev.cscglobal.com": expecting one machine for node ocp-control-03.kubedev.cscglobal.com, got: []
E1113 13:14:04.796142  1 machinehealthcheck_controller.go:387] No-op: Unable to retrieve machine from node "/ocp-compute-16.kubedev.cscglobal.com": expecting one machine for node ocp-compute-16.kubedev.cscglobal.com, got: []
E1113 13:14:04.796204  1 machinehealthcheck_controller.go:387] No-op: Unable to retrieve machine from node "/ocp-compute-16.kubedev.cscglobal.com": expecting one machine for node ocp-compute-16.kubedev.cscglobal.com, got: []
E1113 13:14:23.928296  1 machinehealthcheck_controller.go:387] No-op: Unable to retrieve machine from node "/ocp-infra-03.kubedev.cscglobal.com": expecting one machine for node ocp-infra-03.kubedev.cscglobal.com, got: []
E1113 13:14:23.928337  1 machinehealthcheck_controller.go:387] No-op: Unable to retrieve machine from node "/ocp-infra-03.kubedev.cscglobal.com": expecting one machine for node ocp-infra-03.kubedev.cscglobal.com, got: []
E1113 13:14:33.549873  1 machinehealthcheck_controller.go:387] No-op: Unable to retrieve machine from node "/ocp-infra-04.kubedev.cscglobal.com": expecting one machine for node ocp-infra-04.kubedev.cscglobal.com, got: []
```
You would want to look at the machine-controller logs.

You can see in the machine-controller logs in this case that it can't talk to the vSphere endpoint, so I doubt it has ever been configured properly, so the Machines are probably all sat in the `Provisioning` state, right?

You can also look at the Machine: if there's a providerID on the Machine, this means that it has been linked, and deleting the Machine would delete that provider instance.
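On a live cluster the providerID check described above would be something like `oc get machine <name> -n openshift-machine-api -o jsonpath='{.spec.providerID}'`. A minimal local sketch of the same logic, using a hypothetical, stripped-down Machine JSON (not a real object from this cluster):

```shell
# A Machine that was never linked to a VM has no .spec.providerID field.
# Hypothetical minimal Machine JSON standing in for `oc get machine -o json`:
machine_json='{"spec":{"providerSpec":{"value":{}}},"status":{}}'

if printf '%s' "$machine_json" | grep -q '"providerID"'; then
  echo "linked: deleting this Machine would delete the backing VM"
else
  echo "not linked: no providerID on the Machine"
fi
```

For the Machines in this bug, the jsonpath query would come back empty, matching the "never related to real VMs" case.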
(In reply to Joel Speed from comment #52)
> You would want to look at the machine-controller logs
>
> You can see in the machine-controller logs in this case that it can't talk
> to the vSphere endpoint, so I doubt it has ever been configured properly, so
> the Machines are probably all sat in the `Provisioning` state right?
>
> You can also look at the Machine, if there's a providerID on the Machine,
> this means that it has been linked and would delete that provider instance
> if deleted.

Thanks Joel. I forgot to mention this is a UPI installation, and the machines do not show any phase at all. Given all this information, I consider that deleting the machines will not impact the cluster. I'll test in a lab I'm creating and post my notes.

```
[aguadarr@dlosbastion01 machines]$ oc get machines
NAME                     PHASE   TYPE   REGION   ZONE   AGE
kubedev-stsnx-master-0                                  15h
kubedev-stsnx-master-1                                  15h
kubedev-stsnx-master-2                                  15h

[aguadarr@dlosbastion01 machines]$ oc describe machine kubedev-stsnx-master-0
Name:         kubedev-stsnx-master-0
Namespace:    openshift-machine-api
Labels:       machine.openshift.io/cluster-api-cluster=kubedev-stsnx
              machine.openshift.io/cluster-api-machine-role=master
              machine.openshift.io/cluster-api-machine-type=master
Annotations:  <none>
API Version:  machine.openshift.io/v1beta1
Kind:         Machine
Metadata:
  Creation Timestamp:  2020-11-13T00:12:37Z
  Finalizers:
    machine.machine.openshift.io
  Generation:  1
  Managed Fields:
    API Version:  machine.openshift.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:machine.openshift.io/cluster-api-cluster:
          f:machine.openshift.io/cluster-api-machine-role:
          f:machine.openshift.io/cluster-api-machine-type:
      f:spec:
        .:
        f:metadata:
        f:providerSpec:
          .:
          f:value:
            .:
            f:apiVersion:
            f:credentialsSecret:
            f:diskGiB:
            f:kind:
            f:memoryMiB:
            f:metadata:
            f:network:
            f:numCPUs:
            f:numCoresPerSocket:
            f:snapshot:
            f:template:
            f:userDataSecret:
            f:workspace:
      f:status:
    Manager:      cluster-bootstrap
    Operation:    Update
    Time:         2020-11-13T00:12:37Z
    API Version:  machine.openshift.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"machine.machine.openshift.io":
    Manager:         machine-controller-manager
    Operation:       Update
    Time:            2020-11-13T00:21:30Z
  Resource Version:  12690
  Self Link:         /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/kubedev-stsnx-master-0
  UID:               caddd6d9-62d4-4f49-ba58-a23507e1bee0
Spec:
  Metadata:
  Provider Spec:
    Value:
      API Version:  vsphereprovider.openshift.io/v1beta1
      Credentials Secret:
        Name:              vsphere-cloud-credentials
      Disk Gi B:           120
      Kind:                VSphereMachineProviderSpec
      Memory Mi B:         16384
      Metadata:
        Creation Timestamp:  <nil>
      Network:
        Devices:
          Network Name:
      Num CP Us:             4
      Num Cores Per Socket:  1
      Snapshot:
      Template:              kubedev-stsnx-rhcos
      User Data Secret:
        Name:  master-user-data
      Workspace:
        Datacenter:     US1-Ashburn
        Datastore:      pvntx56-vms-k8s
        Folder:         /US1-Ashburn/vm/kubedev-stsnx
        Resource Pool:  /US1-Ashburn/host//Resources
        Server:         plesxvc01.cscinfo.com
Status:
Events:  <none>
```
We have come to a conclusion about the actions that need to be taken to prevent this issue in the future for customers. Since this will be a larger piece of work, we are going to track this in Jira going forward. For those interested, please see https://issues.redhat.com/browse/OCPCLOUD-1135 for further details.