Description of problem:
-----------------------
A Machine annotated for deletion was reused to provision another BMH.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
4.4.0-0.nightly-2020-04-18-095545

Steps to Reproduce:
-------------------
1. Deploy baremetal IPI - 3 masters + 2 workers

2. Add new BMH/Secret pairs for another 2 workers (openshift-worker-0-2 shown here):

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  creationTimestamp: null
  name: openshift-worker-0-2
  namespace: openshift-machine-api
spec:
  bmc:
    address: redfish://192.168.123.1:8000/redfish/v1/Systems/822e8bc7-6172-4a1d-8b75-b87e64d2775b
    credentialsName: openshift-worker-0-2-bmc-secret
    disableCertificateVerification: true
  bootMACAddress: "52:54:00:54:d8:ff"
  online: false
  userData:
    name: worker-user-data
    namespace: openshift-machine-api
---
apiVersion: v1
data:
  password: cGFzc3dvcmQ=
  username: YWRtaW4=
kind: Secret
metadata:
  creationTimestamp: null
  name: openshift-worker-0-2-bmc-secret
  namespace: openshift-machine-api
type: Opaque

3. Make sure the new hosts are successfully introspected:

oc get bmh -n openshift-machine-api
NAME                   STATUS   PROVISIONING STATUS   CONSUMER   BMC                                                                                      HARDWARE PROFILE   ONLINE   ERROR
...
openshift-worker-0-2   OK       ready                            redfish://192.168.123.1:8000/redfish/v1/Systems/822e8bc7-6172-4a1d-8b75-b87e64d2775b   unknown            false
openshift-worker-0-3   OK       ready                            redfish://192.168.123.1:8000/redfish/v1/Systems/a24fbc47-4982-4e2f-825b-159f5fd896ea   unknown            false

4. Scale up the existing MachineSet:

oc scale --replicas=3 machineset/ocp-edge-cluster-0-worker-0 -n openshift-machine-api
machineset.machine.openshift.io/ocp-edge-cluster-0-worker-0 scaled

5. Wait for the new node to be Ready:

oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   2d2h   v1.17.1
master-0-1   Ready    master   2d2h   v1.17.1
master-0-2   Ready    master   2d2h   v1.17.1
worker-0-0   Ready    worker   2d1h   v1.17.1
worker-0-1   Ready    worker   2d1h   v1.17.1
worker-0-3   Ready    worker   10m    v1.17.1

6. Add the annotation `machine.openshift.io/cluster-api-delete-machine=yes` to the Machine for the new worker (worker-0-3 in this example):

oc annotate machine ocp-edge-cluster-0-worker-0-hkw8q machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
machine.machine.openshift.io/ocp-edge-cluster-0-worker-0-hkw8q annotated

7. Delete the BareMetalHost object for the corresponding Machine:

oc delete bmh openshift-worker-0-3 -n openshift-machine-api
baremetalhost.metal3.io "openshift-worker-0-3" deleted

8. Because there is a spare BMH and the MachineSet still requires 3 worker nodes, the available BMH (openshift-worker-0-2) is immediately provisioned.
9. Check that no new Machine was created, but the one with the annotation is reused (after some time the InternalIP and Hostname details are updated with the new Host's data):

Name:         ocp-edge-cluster-0-worker-0-hkw8q
Namespace:    openshift-machine-api
Labels:       machine.openshift.io/cluster-api-cluster=ocp-edge-cluster-0
              machine.openshift.io/cluster-api-machine-role=worker
              machine.openshift.io/cluster-api-machine-type=worker
              machine.openshift.io/cluster-api-machineset=ocp-edge-cluster-0-worker-0
Annotations:  machine.openshift.io/cluster-api-delete-machine: yes
              metal3.io/BareMetalHost: openshift-machine-api/openshift-worker-0-2
API Version:  machine.openshift.io/v1beta1
Kind:         Machine
Metadata:
  Creation Timestamp:  2020-04-22T12:28:13Z
  Finalizers:
    machine.machine.openshift.io
  Generate Name:  ocp-edge-cluster-0-worker-0-
  Generation:     1
  Owner References:
    API Version:           machine.openshift.io/v1beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  MachineSet
    Name:                  ocp-edge-cluster-0-worker-0
    UID:                   6d6c65cd-f3e0-4ab7-b01c-a65e63d45a98
  Resource Version:  972278
  Self Link:         /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/ocp-edge-cluster-0-worker-0-hkw8q
  UID:               66d5f3d6-a3c3-4702-9c12-43e4eaaadacd
Spec:
  Metadata:
    Creation Timestamp:  <nil>
  Provider Spec:
    Value:
      Host Selector:
      Image:
        Checksum:  http://[fd00:1101::3]:6180/images/rhcos-44.81.202003110027-0-openstack.x86_64.qcow2/rhcos-44.81.202003110027-0-compressed.x86_64.qcow2.md5sum
        URL:       http://[fd00:1101::3]:6180/images/rhcos-44.81.202003110027-0-openstack.x86_64.qcow2/rhcos-44.81.202003110027-0-compressed.x86_64.qcow2
      Metadata:
        Creation Timestamp:  <nil>
      User Data:
        Name:  worker-user-data
Status:
  Addresses:
    Address:     192.168.123.130
    Type:        InternalIP
    Address:     fd00:1101::4f
    Type:        InternalIP
    Address:     worker-0-2
    Type:        Hostname
    Address:     worker-0-2
    Type:        InternalDNS
  Last Updated:  2020-04-22T12:49:00Z
  Node Ref:
    Kind:  Node
    Name:  worker-0-3
    UID:   fc5a1b49-93be-41ce-891a-c29b7ab8e44c
Events:  <none>

10. The node worker-0-3 moves into NotReady state and still has the IP address of the previous BMH.

Actual results:
---------------
The Machine with the annotation `machine.openshift.io/cluster-api-delete-machine=yes` is reused.

Expected results:
-----------------
The Machine with the annotation `machine.openshift.io/cluster-api-delete-machine=yes` is not reused.

Additional info:
----------------
Virtual deployment, provisioning network IPv6, baremetal network IPv4, disconnected environment.
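For reference, one quick way to see which Host a Machine is currently bound to is to read the metal3.io/BareMetalHost annotation shown in the describe output above (illustrative command only, using the object names from this report):

oc get machine ocp-edge-cluster-0-worker-0-hkw8q -n openshift-machine-api \
  -o jsonpath='{.metadata.annotations.metal3\.io/BareMetalHost}'
# In the reused state described in step 9, this prints openshift-machine-api/openshift-worker-0-2.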
Setting target release to current development version (4.5) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.
Seems like this is a generic MAO issue and not specific to baremetal. Could someone from MAO please take a look?
This is describing a scenario resulting from bypassing machine deletion. Annotating a Machine with "machine.openshift.io/cluster-api-delete-machine: yes" just marks it as the highest-priority candidate to be deleted during a scale-down operation. Moving back to the baremetal team. Please see my comment here: https://bugzilla.redhat.com/show_bug.cgi?id=1812588#c19.
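For comparison, the intended use of that annotation is to mark a deletion candidate and then scale the MachineSet down (commands adapted from the reproduction steps above; shown for illustration only):

oc annotate machine ocp-edge-cluster-0-worker-0-hkw8q machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
# Scaling down is what actually removes a Machine; the annotated one is preferred as the deletion candidate.
oc scale --replicas=2 machineset/ocp-edge-cluster-0-worker-0 -n openshift-machine-api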
Can we please make the comments public so everyone coming here has as much context as possible?
There are two issues here.

1) machine.openshift.io/cluster-api-delete-machine doesn't delete the machine or cause the machine to be deleted. It just informs the MachineSet that the next time there is a scale-down operation, this machine should be prioritized over all others when selecting deletion candidates. It has no other effect and should not be used in the manner described here.

2) Re-using an existing Machine object with a different backing VM is absolutely unsupported. If you delete an instance in the cloud (in this case, via the bare metal API), that is considered unrecoverable from the machine-api point of view. That Machine object should be deleted, not reused. If the bare metal provider is reusing existing Machines, it needs to be changed to not do this.

As far as the bare metal provider assigning a new Machine object to an existing metal instance goes, that is fine, provided the instance has not previously joined the cluster as a node, e.g. if it's just a warm VM awaiting configuration or something.

1 machine == 1 VM/host == 1 node. These are immutable.
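Following that reasoning, once the backing Host is deleted the Machine that consumed it should be removed as well, never rebound to another Host. A minimal illustration using the object names from this report (not something the reporter ran):

oc delete bmh openshift-worker-0-3 -n openshift-machine-api
# The Machine that consumed this Host must be deleted too; it must never be re-pointed at a different Host.
oc delete machine ocp-edge-cluster-0-worker-0-hkw8q -n openshift-machine-api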
The cluster-api-delete-machine annotation appears irrelevant to the issue in question here. The Machine should be deleted, not because of the annotation (that would only have an effect if we scaled down), but because the Host it is using is deleted. That appears to have been implemented correctly even at the time of this bug report:

https://github.com/openshift/cluster-api-provider-baremetal/blob/9f1ec40287c4a08c1a9777a9bccdf94473883efd/pkg/cloud/baremetal/actuators/machine/actuator.go#L224-L259

However, the Exists() method of the actuator returns false if the Host named in the annotation cannot be found (or the annotation does not exist). When Exists() returns false, the machine controller calls the Create() method of the actuator instead of Update(), and in the Create() phase it is assumed that if the Host does not exist we should select one.

So we have only a brief window in which to detect that the Host is going away: after it enters the Deleting state, but before it is actually deleted. The Deleting state is generally very short, since all it has to do is delete the node from Ironic and remove the finalizer. The Machine has its own finalizer on the Host, but it removes that finalizer and requeues before deleting the Machine object itself, so there is ample opportunity for races. What likely happened is that we removed our finalizer and the Host disappeared from the API before we requeued and called Exists() again, so the Machine controller bounced back to calling Create().

My PR https://github.com/openshift/cluster-api-provider-baremetal/pull/57 would have solved this, but only by making it more likely that we'd win the race (since we would delete the Machine as soon as the Host enters the Deprovisioning state, rather than wait for the last moment).

A true fix would be to make Exists() return true if the Machine has ever been created. One way we could potentially do that is to look for the existence of the annotation. Another might be to look for a link between the Machine and a Node - in theory this is something the Machine controller could, and should, check before asking the actuator.
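A minimal, self-contained sketch of the kind of check proposed in the last paragraph. This is not the CAPBM actuator code; the type and function names are invented for illustration, and only the annotation key is real:

package main

import "fmt"

// hostAnnotation is the real annotation key linking a Machine to its BareMetalHost.
const hostAnnotation = "metal3.io/BareMetalHost"

// machine is a stand-in for the Machine object; only the fields needed for the
// illustration are modelled.
type machine struct {
	annotations map[string]string
	nodeRef     string // non-empty once the Machine has an associated Node
}

// existedBefore approximates the proposed "true fix" for Exists(): report the
// Machine as existing if it was ever linked to a Host or a Node, even when the
// Host can no longer be found, so the controller never falls back to Create()
// and claims a spare Host.
func existedBefore(m machine, hostFound bool) bool {
	if hostFound {
		return true
	}
	_, hadHost := m.annotations[hostAnnotation]
	return hadHost || m.nodeRef != ""
}

func main() {
	m := machine{
		annotations: map[string]string{hostAnnotation: "openshift-machine-api/openshift-worker-0-3"},
		nodeRef:     "worker-0-3",
	}
	// The Host has been deleted (hostFound == false), but the Machine was
	// provisioned in the past, so it should not be handed a new Host.
	fmt.Println(existedBefore(m, false)) // prints: true
}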
(In reply to Zane Bitter from comment #8)
> A true fix would be to make Exists() return true if the Machine has ever
> been created. One way we could potentially do that is to look for the
> existence of the annotation. Another might be to look for a link between
> the Machine and a Node - in theory this is something the Machine controller
> could, and should, check before asking the actuator.

I think we should set up a meeting; there are some problems with the design if I'm understanding this correctly.
Michael and I had a call and we figured out the problem. The issue is happening basically as I described above: Exists() returns false, then Create() is called instead of Update(), and the Machine grabs a new Host and starts provisioning it.

However, the tricky part is that returning false from Exists() is actually correct. The solution I suggested - having the Machine controller check the state of the Machine before relying on the actuator's Exists() method to decide between Create() and Update() - was in fact implemented in the OpenShift fork of the Cluster API 10 months ago already: https://github.com/openshift/cluster-api/pull/75

So the real problem is that the release-4.4 branch of CAPBM, at the time this bug was opened, had an ancient version of the Cluster API vendored (already fixed in the release-4.5 branch by https://github.com/openshift/cluster-api-provider-baremetal/pull/63), at a time when the Machine controllers in OpenShift had already been migrated to the MAO repository 9 months ago (https://github.com/openshift/machine-api-operator/pull/441; this was fixed later in the release-4.5 branch of CAPBM by https://github.com/openshift/cluster-api-provider-baremetal/pull/69).

So this issue is fixed in 4.5. The only question is whether to fix it in 4.4. If we choose to do so, the obvious way would be to sync the vendoring with the actual Machine controller released in 4.4.
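To illustrate the controller-side guard described above (the one that landed in the OpenShift Cluster API fork and was missing from the old vendored copy in release-4.4), here is a rough, self-contained approximation. It is not the actual controller code; the phase names mirror the Machine lifecycle, but the function and its exact behaviour are assumptions for illustration only:

package main

import "fmt"

// phase loosely mirrors the Machine lifecycle phases used by the machine controller.
type phase string

const (
	phaseProvisioning phase = "Provisioning"
	phaseProvisioned  phase = "Provisioned"
	phaseRunning      phase = "Running"
)

// nextAction approximates the reconcile decision: only a Machine that was never
// provisioned may be (re)created when the actuator's Exists() returns false; a
// Machine that already had an instance is treated as failed instead of being
// given a new one.
func nextAction(p phase, exists bool) string {
	switch {
	case exists:
		return "Update"
	case p == phaseProvisioned || p == phaseRunning:
		return "MarkFailed"
	default:
		return "Create"
	}
}

func main() {
	// With the old vendored controller the second reconcile went to Create(),
	// which is how the spare Host got claimed; with the guard it does not.
	fmt.Println(nextAction(phaseRunning, false)) // prints: MarkFailed
}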
Michael also mentioned that we should *not* be deleting the Machine object from the actuator, even in the case that the underlying Host is deleted. I opened bug 1868104 to track that issue.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days