Created attachment 1689912 [details]
metal3 containers logs

Description of problem:
After scaling down a machineset, we try to remove the bmh that becomes ready and offline with the "oc delete bmh" command.
The command never finishes (it is stuck) and can only be interrupted by ctrl-C.
The bmh is not removed from the bmh list and stays in the deleting state.

Version-Release number of selected component (if applicable): 4.5

How reproducible: Constantly

Steps to Reproduce:
1. Annotate a worker to be deleted from the machineset, e.g.
$ oc annotate machine CONSUMERNAME machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
2. Scale down the machineset
$ oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-worker-0 --replicas=2
3. Wait till the node is deleted from the cluster and becomes ready and offline in Bare Metal
$ oc get bmh -n openshift-machine-api
4. Delete the bmh in state ready
$ oc delete bmh openshift-worker-0-X -n openshift-machine-api

Actual results:
- The command is stuck until interrupted by ctrl-C
- The bmh is listed in the "oc get bmh" output as deleting forever

Expected results:
- The command finishes in about 1 minute
- The bmh is deleted and no longer listed in the "oc get bmh" output

Additional info:
There is no problem with the same scenario in OCP 4.4.
metal3 containers logs are in the attached zip.
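While the underlying problem remains, the delete can at least be made non-blocking so the terminal isn't tied up; a minimal sketch, where openshift-worker-0-X is just the placeholder name from the steps above:

# Return immediately instead of waiting for the object to actually go away;
# the host will still be shown as "deleting" until the root cause is fixed.
$ oc delete bmh openshift-worker-0-X -n openshift-machine-api --wait=false

# Watch the host list while the deletion is (not) progressing.
$ oc get bmh -n openshift-machine-api -w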
It seems that the node has been successfully removed from ironic:

2020-05-19 13:38:57.188 28 INFO eventlet.wsgi.server [req-0731957d-f9c6-4c36-aa03-07b597b0eaa4 - - - - -] fd00:1101::3 "DELETE /v1/nodes/74065fa2-e19a-4572-8c42-914d869d7808 HTTP/1.1" status: 204 len: 295 time: 0.1069863

The last BMO logs that touch this node seem to confirm it:

{"level":"info","ts":1589895536.8779397,"logger":"baremetalhost","msg":"Reconciling BareMetalHost","Request.Namespace":"openshift-machine-api","Request.Name":"openshift-worker-0-2"}
{"level":"info","ts":1589895536.8780553,"logger":"baremetalhost","msg":"marked to be deleted","Request.Namespace":"openshift-machine-api","Request.Name":"openshift-worker-0-2","provisioningState":"deleting","timestamp":"2020-05-19 13:38:56 +0000 UTC"}
{"level":"info","ts":1589895536.9121392,"logger":"baremetalhost_ironic","msg":"found existing node by ID","host":"openshift-worker-0-2"}
{"level":"info","ts":1589895536.9121685,"logger":"baremetalhost_ironic","msg":"deleting host","host":"openshift-worker-0-2","ID":"74065fa2-e19a-4572-8c42-914d869d7808","lastError":"","current":"manageable","target":"","deploy step":{}}
{"level":"info","ts":1589895536.9121957,"logger":"baremetalhost_ironic","msg":"setting host maintenance flag to force image delete","host":"openshift-worker-0-2"}
{"level":"info","ts":1589895537.031573,"logger":"baremetalhost","msg":"saving host status","Request.Namespace":"openshift-machine-api","Request.Name":"openshift-worker-0-2","provisioningState":"deleting","operational status":"OK","provisioning state":"deleting"}
{"level":"info","ts":1589895537.0399234,"logger":"baremetalhost","msg":"done","Request.Namespace":"openshift-machine-api","Request.Name":"openshift-worker-0-2","provisioningState":"deleting","requeue":true,"after":0}
{"level":"info","ts":1589895537.0399852,"logger":"baremetalhost","msg":"Reconciling BareMetalHost","Request.Namespace":"openshift-machine-api","Request.Name":"openshift-worker-0-2"}
{"level":"info","ts":1589895537.0401134,"logger":"baremetalhost","msg":"marked to be deleted","Request.Namespace":"openshift-machine-api","Request.Name":"openshift-worker-0-2","provisioningState":"deleting","timestamp":"2020-05-19 13:38:56 +0000 UTC"}
{"level":"info","ts":1589895537.080819,"logger":"baremetalhost_ironic","msg":"found existing node by ID","host":"openshift-worker-0-2"}
{"level":"info","ts":1589895537.0808783,"logger":"baremetalhost_ironic","msg":"deleting host","host":"openshift-worker-0-2","ID":"74065fa2-e19a-4572-8c42-914d869d7808","lastError":"","current":"manageable","target":"","deploy step":{}}
{"level":"info","ts":1589895537.0809054,"logger":"baremetalhost_ironic","msg":"host ready to be removed","host":"openshift-worker-0-2"}
{"level":"info","ts":1589895537.1889172,"logger":"baremetalhost_ironic","msg":"removed","host":"openshift-worker-0-2"}
{"level":"info","ts":1589895537.1975627,"logger":"baremetalhost","msg":"saving host status","Request.Namespace":"openshift-machine-api","Request.Name":"openshift-worker-0-2","provisioningState":"deleting","operational status":"OK","provisioning state":"deleting"}
{"level":"info","ts":1589895537.2052593,"logger":"baremetalhost","msg":"done","Request.Namespace":"openshift-machine-api","Request.Name":"openshift-worker-0-2","provisioningState":"deleting","requeue":true,"after":0}
{"level":"info","ts":1589895537.205345,"logger":"baremetalhost","msg":"Reconciling BareMetalHost","Request.Namespace":"openshift-machine-api","Request.Name":"openshift-worker-0-2"}
{"level":"info","ts":1589895537.2054608,"logger":"baremetalhost","msg":"marked to be deleted","Request.Namespace":"openshift-machine-api","Request.Name":"openshift-worker-0-2","provisioningState":"deleting","timestamp":"2020-05-19 13:38:56 +0000 UTC"}
{"level":"info","ts":1589895537.2376485,"logger":"baremetalhost_ironic","msg":"looking for existing node by name","host":"openshift-worker-0-2","name":"openshift-worker-0-2"}
{"level":"info","ts":1589895537.273997,"logger":"baremetalhost_ironic","msg":"no node found, already deleted","host":"openshift-worker-0-2"}
{"level":"info","ts":1589895537.2740564,"logger":"baremetalhost","msg":"cleanup is complete, removed finalizer","Request.Namespace":"openshift-machine-api","Request.Name":"openshift-worker-0-2","provisioningState":"deleting","remaining":["machine.machine.openshift.io"]}
{"level":"info","ts":1589895537.2845814,"logger":"baremetalhost","msg":"done","Request.Namespace":"openshift-machine-api","Request.Name":"openshift-worker-0-2","provisioningState":"deleting","requeue":false,"after":0}
{"level":"info","ts":1589895537.284646,"logger":"baremetalhost","msg":"Reconciling BareMetalHost","Request.Namespace":"openshift-machine-api","Request.Name":"openshift-worker-0-2"}
{"level":"info","ts":1589895537.2847314,"logger":"baremetalhost","msg":"marked to be deleted","Request.Namespace":"openshift-machine-api","Request.Name":"openshift-worker-0-2","provisioningState":"deleting","timestamp":"2020-05-19 13:38:56 +0000 UTC"}
{"level":"info","ts":1589895537.284746,"logger":"baremetalhost","msg":"ready to be deleted","Request.Namespace":"openshift-machine-api","Request.Name":"openshift-worker-0-2","provisioningState":"deleting"}
{"level":"info","ts":1589895537.28475,"logger":"baremetalhost","msg":"done","Request.Namespace":"openshift-machine-api","Request.Name":"openshift-worker-0-2","provisioningState":"deleting","requeue":false,"after":0}
Please retry after the linked BZ is solved

*** This bug has been marked as a duplicate of bug 1828003 ***
These are 2 different scenarios. Both begin by starting an httpd pod; the machine to delete is the worker that runs the pod.

The scenario here is:
1. Annotate the machine to be deleted from the machineset
2. Scale down the machineset
3. After the machine is deleted from the machineset, delete the corresponding baremetal host - here the command gets stuck and the baremetal host is reported as deleting

In bug 1828003 the scenario is:
1. Annotate the machine to be deleted from the machineset
2. Delete the corresponding baremetal host
3. Scale down the machineset
If we don't drain the node before step 2, the httpd pod is stuck in the Terminating state on the deleted machine and the machine is not deleted (a drain sketch is shown below).

Both scenarios are described in https://github.com/metal3-io/metal3-docs/blob/master/design/remove-host.md

Hope the root cause for both problems is the same. Will try to reproduce the problem after the linked BZ is solved.
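For completeness, draining the worker before removing it (as the remove-host design describes) would look roughly like this; the node name is a placeholder and the flags shown are the 4.5-era ones:

# Evict the httpd pod (and everything else) from the worker before
# the machine and its baremetal host are removed.
$ oc adm cordon worker-X.example.com
$ oc adm drain worker-X.example.com --ignore-daemonsets --delete-local-data --force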
Reopening the bug. Bug 1828003 is verified, but the problem reported in this bug still happens - as I mentioned in comment 3, it is a different scenario.
Additional info, hope it might help: deleting a master bmh doesn't cause the command to get stuck. It happens only for workers.
Caught a few times that deleting the bmh for a master node got stuck as well.
(In reply to Lubov from comment #6)
> Caught few times when deleting bmh for master node was stuck as well

Re-checked: externally provisioned master nodes are always deleted, though it sometimes takes about 5 minutes. Sorry for the confusion.
Can you paste the output of `oc describe bmh -n openshift-machine-api openshift-worker-0-X`? It's likely that there is a finalizer remaining on the Host, presumably from the Machine.

It's possible that this is fixed by https://github.com/openshift/cluster-api-provider-baremetal/pull/87
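For checking just whether a finalizer is still holding the Host, something like the following should also do (host name is the same placeholder as above):

# Print the finalizers left on the BareMetalHost; once the metal3 finalizer is
# gone, a remaining entry such as "machine.machine.openshift.io" would point at
# the Machine controller as what is blocking the delete.
$ oc get bmh openshift-worker-0-X -n openshift-machine-api \
    -o jsonpath='{.metadata.finalizers}{"\n"}'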
Fix in https://github.com/openshift/cluster-api-provider-baremetal/pull/90
Verified on
Client Version: 4.6.0-0.nightly-2020-08-16-072105
Server Version: 4.6.0-0.nightly-2020-08-16-072105
Kubernetes Version: v1.19.0-rc.2+99cb93a-dirty
Facing a similar issue in 4.5.7 - the node is stuck in the deleting state.
----
# oc annotate machine ocp4-worker-0-p9r7s machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
# oc scale machineset ocp4-worker-0 -n openshift-machine-api --replicas 2
$ oc get machineset -n openshift-machine-api
$ oc get bmh -A

Delete BMH
# oc delete bmh ocp4-worker0 -n openshift-machine-api

[kni@provision ~]$ oc get bmh -A
NAMESPACE               NAME           STATUS   PROVISIONING STATUS      CONSUMER              BMC                       HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   ocp4-master0   OK       externally provisioned   ocp4-master-0         ipmi://10.46.26.44:6231                      true
openshift-machine-api   ocp4-master1   OK       externally provisioned   ocp4-master-1         ipmi://10.46.26.44:6232                      true
openshift-machine-api   ocp4-master2   OK       externally provisioned   ocp4-master-2         ipmi://10.46.26.44:6233                      true
openshift-machine-api   ocp4-worker0   OK       deleting                                       ipmi://10.46.26.44:6234   libvirt            false
openshift-machine-api   ocp4-worker1   OK       provisioned              ocp4-worker-0-d5bqb   ipmi://10.46.26.44:6235   libvirt            true
openshift-machine-api   ocp4-worker2   OK       provisioned              ocp4-worker-0-nwffc   ipmi://10.46.26.44:6236   libvirt            true
~~
Deleting the BMH should delete the node/machine references, but it doesn't, which makes the CO degraded as it's not aware the node is deleted and continues to wait for the node to be online.

It still shows the node as NotReady - it should not be there:
[kni@provision ~]$ oc get nodes
NAME                       STATUS                        ROLES    AGE   VERSION
master0.ocp4.example.com   Ready                         master   32d   v1.18.3+2cf11e2
master1.ocp4.example.com   Ready                         master   32d   v1.18.3+2cf11e2
master2.ocp4.example.com   Ready                         master   32d   v1.18.3+2cf11e2
worker0.ocp4.example.com   NotReady,SchedulingDisabled   worker   32d   v1.18.3+2cf11e2
worker1.ocp4.example.com   Ready                         worker   32d   v1.18.3+2cf11e2
worker2.ocp4.example.com   Ready                         worker   28d   v1.18.3+2cf11e2

[kni@provision ~]$ oc delete bmh ocp4-worker0 -n openshift-machine-api
baremetalhost.metal3.io "ocp4-worker0" deleted
***stuck***

[kni@provision ~]$ oc get bmh -A
openshift-machine-api   ocp4-worker0   OK   deleting   ipmi://10.46.26.44:6234   libvirt   false

Pods & CO are still referring to the deleted node:
[kni@provision ~]$ oc get all -o wide -n openshift-ovn-kubernetes | grep -i worker0
pod/ovnkube-node-82zhp   2/2   Running   0   32d   192.168.7.31   worker0.ocp4.example.com   <none>   <none>
pod/ovs-node-jh9rx       1/1   Running   0   32d   192.168.7.31   worker0.ocp4.example.com   <none>   <none>

Status:
  Conditions:
    Last Transition Time:  2020-10-12T03:12:41Z
    Message:               DaemonSet "openshift-multus/multus" rollout is not making progress - last change 2020-10-12T03:01:07Z
                           DaemonSet "openshift-ovn-kubernetes/ovs-node" rollout is not making progress - last change 2020-10-12T03:01:06Z
                           DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2020-10-12T03:01:06Z
    Reason:                RolloutHung
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-09-09T06:40:09Z
    Status:                True
    Type:                  Upgradeable
    Last Transition Time:  2020-10-12T03:01:07Z
    Message:               DaemonSet "openshift-multus/multus" is not available (awaiting 1 nodes)
                           DaemonSet "openshift-ovn-kubernetes/ovs-node" is not available (awaiting 1 nodes)
                           DaemonSet "openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)

[kni@provision ~]$ oc get co
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
dns              4.5.7     True        True          False      32d
machine-config   4.5.7     False       False         True       16m
marketplace      4.5.7     True        False         False      24d
monitoring       4.5.7     False       True          True       20m
network          4.5.7     True        True          True       32d

[kni@provision ~]$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-14d75c4f41c5c1151418032daa3a0e81   True      False      False      3              3                   3                     0                      32d
worker   rendered-worker-8d92311edfdfe139596a9fd446f6dfb2   False     True       False      3              2                   3                     0                      32d

[kni@provision ~]$ oc describe co machine-config
Status:
  Conditions:
    Last Transition Time:  2020-09-09T06:47:13Z
    Message:               Cluster version is 4.5.7
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-10-12T03:11:08Z
    Message:               Failed to resync 4.5.7 because: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)

[kni@provision ~]$ oc get bmh -A
NAMESPACE               NAME           STATUS   PROVISIONING STATUS      CONSUMER              BMC                       HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   ocp4-master0   OK       externally provisioned   ocp4-master-0         ipmi://10.46.26.44:6231                      true
openshift-machine-api   ocp4-master1   OK       externally provisioned   ocp4-master-1         ipmi://10.46.26.44:6232                      true
openshift-machine-api   ocp4-master2   OK       externally provisioned   ocp4-master-2         ipmi://10.46.26.44:6233                      true
openshift-machine-api   ocp4-worker0   OK       deleting                                       ipmi://10.46.26.44:6234   libvirt            false
openshift-machine-api   ocp4-worker1   OK       provisioned              ocp4-worker-0-d5bqb   ipmi://10.46.26.44:6235   libvirt            true
openshift-machine-api   ocp4-worker2   OK       provisioned              ocp4-worker-0-nwffc   ipmi://10.46.26.44:6236   libvirt            true
~~~~~
After manually deleting the node, the CO becomes normal - so deleting the BMH or scaling down should auto-delete the node as well, which is missing.

[kni@provision ~]$ oc delete node worker0.ocp4.example.com
node "worker0.ocp4.example.com" deleted

[kni@provision ~]$ oc get co | egrep -i 'dns|machine-config|network|marketplace|monitoring'
dns              4.5.7   True   False   False   32d
machine-config   4.5.7   True   False   False   112s
marketplace      4.5.7   True   False   False   24d
monitoring       4.5.7   True   False   False   89s
network          4.5.7   True   False   False   32d

Final status: still stuck in the deleting state.

[kni@provision ~]$ oc get bmh -A
NAMESPACE               NAME           STATUS   PROVISIONING STATUS      CONSUMER              BMC                       HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   ocp4-master0   OK       externally provisioned   ocp4-master-0         ipmi://10.46.26.44:6231                      true
openshift-machine-api   ocp4-master1   OK       externally provisioned   ocp4-master-1         ipmi://10.46.26.44:6232                      true
openshift-machine-api   ocp4-master2   OK       externally provisioned   ocp4-master-2         ipmi://10.46.26.44:6233                      true
openshift-machine-api   ocp4-worker0   OK       deleting                                       ipmi://10.46.26.44:6234   libvirt            false
openshift-machine-api   ocp4-worker1   OK       provisioned              ocp4-worker-0-d5bqb   ipmi://10.46.26.44:6235   libvirt            true
openshift-machine-api   ocp4-worker2   OK       provisioned              ocp4-worker-0-nwffc   ipmi://10.46.26.44:6236   libvirt            true

[kni@provision ~]$ oc describe bmh -n openshift-machine-api ocp4-worker0
Name:         ocp4-worker0
Namespace:    openshift-machine-api
Labels:       <none>
Annotations:  baremetalhost.metal3.io/status: {"operationalStatus":"OK","lastUpdated":"2020-10-12T03:14:33Z","hardwareProfile":"libvirt","hardware":{"systemVendor":{"manufacturer":"Red...
API Version:  metal3.io/v1alpha1
Kind:         BareMetalHost
;;;;
Events:
  Type    Reason                  Age   From                         Message
  ----    ------                  ----  ----                         -------
  Normal  DeprovisioningStarted   93m   metal3-baremetal-controller  Image deprovisioning started
  Normal  DeprovisioningComplete  93m   metal3-baremetal-controller  Image deprovisioning completed
[kni@provision ~]$

Baremetal output shows the node is deleted; the ironic DB shows the node is not there:
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name         | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| b91f0d36-2822-4126-8311-eda0926e3380 | ocp4-master0 | 16cb6219-79b9-4c20-a3ea-d8e877866f38 | power on    | active             | False       |
| 51b74c0d-e8f6-415d-af6b-116f4d4980c4 | ocp4-master2 | bc581c4f-c52c-4432-86d8-582555439ebc | power on    | active             | False       |
| c89589df-0508-42e9-9038-f0707815e118 | ocp4-master1 | 5c729a5f-6e91-4d58-ba03-f2f6c5658f42 | power on    | active             | False       |
| f75a0604-0af1-473d-9135-f8d0fdee8ec9 | ocp4-worker1 | 14bec3f9-e664-4c1b-928e-4f5ce3a68486 | power on    | active             | False       |
| a0a1644a-d239-4373-872b-3bff1a869c68 | ocp4-worker2 | 8a5091f3-e28d-421f-8778-8cb0d19617c8 | power on    | active             | False       |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+

[kni@provision ~]$ oc get machines -o wide -n openshift-machine-api
NAME                  PHASE      TYPE   REGION   ZONE   AGE   NODE                       PROVIDERID   STATE
ocp4-master-0         Running                           32d   master0.ocp4.example.com
ocp4-master-1         Running                           32d   master1.ocp4.example.com
ocp4-master-2         Running                           32d   master2.ocp4.example.com
ocp4-worker-0-d5bqb   Running                           32d   worker1.ocp4.example.com
ocp4-worker-0-nwffc   Running                           28d   worker2.ocp4.example.com
ocp4-worker-0-p9r7s   Deleting                          32d   worker0.ocp4.example.com
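Since the machine ocp4-worker-0-p9r7s is also stuck in the Deleting phase above, it may be worth checking what is left on it too; a small sketch, assuming the standard machine.openshift.io fields:

# Show the machine's remaining finalizers and the node it still references;
# a lingering finalizer or nodeRef would explain why the node object survives.
$ oc get machine ocp4-worker-0-p9r7s -n openshift-machine-api \
    -o jsonpath='finalizers: {.metadata.finalizers}{"\n"}nodeRef: {.status.nodeRef.name}{"\n"}'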
The fix in 4.5 was issued later than 4.5.7. Release 4.5.7 was created from registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-08-15-052753.

On 4.5 it was verified on Server Version: 4.5.0-0.nightly-2020-08-20-011847 (see https://bugzilla.redhat.com/show_bug.cgi?id=1863010).
We are running version 4.5.4 and had the same issue. After reading the PR that fixes this issue, I guessed that we could use this workaround in the meantime:
1. Manually edit the nodes stuck in the "deleting" state
2. Delete the finalizer entries in the metadata
3. Save and exit

After that we noticed that all nodes we fixed were automatically deleted.

For reference I am referring to this PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/90
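An equivalent non-interactive form of that workaround, assuming the objects being edited are the BareMetalHosts stuck in deleting (the same patch works against a Machine if that is what still carries the finalizer); stripping finalizers skips the controller's own cleanup, so treat it as a last resort:

# Clear all finalizers from the stuck host so the pending delete can complete.
# <host-name> is a placeholder.
$ oc patch bmh <host-name> -n openshift-machine-api \
    --type=merge -p '{"metadata":{"finalizers":null}}'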
Verified in 4.5.13, it looks fine now.
~~
[kni@provision ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.13    True        False         19h     Cluster version is 4.5.13

[kni@provision ~]$ oc get machineset -A
NAMESPACE               NAME            DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   ocp4-worker-0   3         3         3       3           37h

** Remove ocp4-worker2
[kni@provision ~]$ oc annotate machine ocp4-worker-0-cbgpl machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
machine.machine.openshift.io/ocp4-worker-0-cbgpl annotated

[kni@provision ~]$ oc scale machineset ocp4-worker-0 -n openshift-machine-api --replicas 2
machineset.machine.openshift.io/ocp4-worker-0 scaled

[kni@provision ~]$ oc get machines -o wide -n openshift-machine-api
NAME                  PHASE      TYPE   REGION   ZONE   AGE   NODE                       PROVIDERID   STATE
ocp4-master-0         Running                           37h   master0.ocp4.example.com
ocp4-master-1         Running                           37h   master1.ocp4.example.com
ocp4-master-2         Running                           37h   master2.ocp4.example.com
ocp4-worker-0-7s2k2   Running                           19h   worker0.ocp4.example.com
ocp4-worker-0-b4prq   Running                           17h   worker1.ocp4.example.com
ocp4-worker-0-cbgpl   Deleting                          19h   worker2.ocp4.example.com

[kni@provision ~]$ oc get nodes
NAME                       STATUS                     ROLES    AGE   VERSION
master0.ocp4.example.com   Ready                      master   36h   v1.18.3+47c0e71
master1.ocp4.example.com   Ready                      master   36h   v1.18.3+47c0e71
master2.ocp4.example.com   Ready                      master   36h   v1.18.3+47c0e71
worker0.ocp4.example.com   Ready                      worker   18h   v1.18.3+47c0e71
worker1.ocp4.example.com   Ready                      worker   17h   v1.18.3+47c0e71
worker2.ocp4.example.com   Ready,SchedulingDisabled   worker   19h   v1.18.3+47c0e71

[kni@provision ~]$ oc get bmh -A
NAMESPACE               NAME           STATUS   PROVISIONING STATUS      CONSUMER              BMC                       HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   ocp4-master0   OK       externally provisioned   ocp4-master-0         ipmi://10.46.26.44:6231                      true
openshift-machine-api   ocp4-master1   OK       externally provisioned   ocp4-master-1         ipmi://10.46.26.44:6232                      true
openshift-machine-api   ocp4-master2   OK       externally provisioned   ocp4-master-2         ipmi://10.46.26.44:6233                      true
openshift-machine-api   ocp4-worker0   OK       provisioned              ocp4-worker-0-7s2k2   ipmi://10.46.26.44:6234   libvirt            true
openshift-machine-api   ocp4-worker1   OK       provisioned              ocp4-worker-0-b4prq   ipmi://10.46.26.44:6235   libvirt            true
openshift-machine-api   ocp4-worker2   OK       ready                                          ipmi://10.46.26.44:6236   libvirt            false

Delete BMH
# oc delete bmh ocp4-worker2 -n openshift-machine-api

[kni@provision ~]$ oc get bmh -A
NAMESPACE               NAME           STATUS   PROVISIONING STATUS      CONSUMER              BMC                       HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   ocp4-master0   OK       externally provisioned   ocp4-master-0         ipmi://10.46.26.44:6231                      true
openshift-machine-api   ocp4-master1   OK       externally provisioned   ocp4-master-1         ipmi://10.46.26.44:6232                      true
openshift-machine-api   ocp4-master2   OK       externally provisioned   ocp4-master-2         ipmi://10.46.26.44:6233                      true
openshift-machine-api   ocp4-worker0   OK       provisioned              ocp4-worker-0-7s2k2   ipmi://10.46.26.44:6234   libvirt            true
openshift-machine-api   ocp4-worker1   OK       provisioned              ocp4-worker-0-b4prq   ipmi://10.46.26.44:6235   libvirt            true
[kni@provision ~]$
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196