Bug 1891301 - Deleting bmh by "oc delete bmh" gets stuck
Summary: Deleting bmh by "oc delete bmh" gets stuck
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Andrea Fasano
QA Contact: Adina Wolff
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-25 12:42 UTC by Nataf Sharabi
Modified: 2021-07-27 22:34 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:33:58 UTC
Target Upstream Version:
Embargoed:
nsharabi: needinfo+


Attachments
metal3 log (4.73 MB, text/plain)
2020-10-25 12:42 UTC, Nataf Sharabi


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift baremetal-operator pull 142 | 0 | None | open | Merge upstream 2021-04-06 | 2021-04-07 14:49:19 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 22:34:13 UTC

Description Nataf Sharabi 2020-10-25 12:42:36 UTC
Created attachment 1724026
metal3 log

Description of problem:

While trying to delete a node with:

oc delete bmh openshift-worker-0-2 -n openshift-machine-api

The session hangs.


In order to see the logs:

  oc get pods -A |grep metal
  oc logs metal3*** -n openshift-machine-api -c metal3-ironic-conductor


The logs show:
2020-10-25 10:09:44.630 1 INFO ironic.conductor.manager [req-42f09b97-3b7e-4def-89d4-56e04392f7b0 ironic-user - - - -] Successfully deleted node b33c547f-adae-4b4e-9d10-7c5c54b84863.





Version-Release number of selected component (if applicable):
Client Version: 4.6.0-0.nightly-2020-10-03-051134
Server Version: 4.6.0-rc.4
Kubernetes Version: v1.19.0+d59ce34

How reproducible:
I noticed this problem while adding a new worker using procedure [1].

By mistake, I left the "<" and ">" characters around the BMC address [2].

To correct the mistake, I tried to delete the new worker [3].

The session hung.

[1] https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-29807

[2] address: <redfish://192.168.123.1:8000/redfish/v1/Systems/e2e8a52d-1012-4eec-a22b-dfd57f0df50b>

[3] oc delete bmh openshift-worker-0-2 -n openshift-machine-api
baremetalhost.metal3.io "openshift-worker-0-2" deleted


Steps to Reproduce:
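1. Create a BareMetalHost whose BMC address is wrapped in "<" and ">", as in [2].
2. Delete the host, as in [3].
3. Check the host list: oc get bmh -n openshift-machine-api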

Actual results:

The session hangs and the node is not deleted.
(Checked with: oc get bmh -n openshift-machine-api)

Expected results:
The node should be deleted and the session should not hang.

Additional info:

Comment 3 Nataf Sharabi 2020-10-29 11:38:23 UTC
link to must-gather:
https://drive.google.com/drive/folders/1pdoRJ92mcz7_TWBRKH_P-SkTPFQV9Frb?usp=sharing

Comment 4 Zane Bitter 2020-11-03 18:45:56 UTC
Logs show it failing with this error:

2020-10-29T10:39:43.030 Reconciling BareMetalHost {Request.Namespace: 'openshift-machine-api', Request.Name: 'openshift-worker-0-2'}
2020-10-29T10:39:43.030 Fetching Status from Annotation {Request.Namespace: 'openshift-machine-api', Request.Name: 'openshift-worker-0-2'}
2020-10-29T10:39:43.030 No status cache found {Request.Namespace: 'openshift-machine-api', Request.Name: 'openshift-worker-0-2'}
2020-10-29T10:39:43.030 adding finalizer {Request.Namespace: 'openshift-machine-api', Request.Name: 'openshift-worker-0-2', existingFinalizers: [], newValue: 'baremetalhost.metal3.io'}
2020-10-29T10:39:43.042 Reconciling BareMetalHost {Request.Namespace: 'openshift-machine-api', Request.Name: 'openshift-worker-0-2'}
2020-10-29T10:39:43.042 Fetching Status from Annotation {Request.Namespace: 'openshift-machine-api', Request.Name: 'openshift-worker-0-2'}
2020-10-29T10:39:43.042 No status cache found {Request.Namespace: 'openshift-machine-api', Request.Name: 'openshift-worker-0-2'}
2020-10-29T10:39:43.042 Reconciler error {controller: 'metal3-baremetalhost-controller', request: 'openshift-machine-api/openshift-worker-0-2', error: 'failed to create provisioner: failed to parse BMC address information: failed to parse BMC address information: parse "<redfish://192.168.123.1:8000/redfish/v1/Systems/e2e8a52d-1012-4eec-a22b-dfd57f0df50b>": first path segment in URL cannot contain colon'}

Basically, because we can't figure out any valid driver for this URL, we fail and then never get to run the code that would remove the finalizer. It just keeps hitting this error.
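
The parse failure in the log can be reproduced with Go's standard net/url package alone. The sketch below is illustrative only: parseBMCAddress is a hypothetical helper, not the operator's real function, and it simply wraps url.Parse to show that the stray "<" and ">" are what make the address unparseable.

```go
package main

import (
	"fmt"
	"net/url"
)

// parseBMCAddress is a hypothetical helper for illustration; it is not the
// baremetal-operator's real parsing code. It only wraps url.Parse.
func parseBMCAddress(address string) (*url.URL, error) {
	u, err := url.Parse(address)
	if err != nil {
		return nil, fmt.Errorf("failed to parse BMC address information: %w", err)
	}
	return u, nil
}

func main() {
	const path = "192.168.123.1:8000/redfish/v1/Systems/e2e8a52d-1012-4eec-a22b-dfd57f0df50b"

	// Address as entered by mistake, with the angle brackets left in.
	bad := "<redfish://" + path + ">"
	// Address as intended.
	good := "redfish://" + path

	if _, err := parseBMCAddress(bad); err != nil {
		fmt.Println("bad address:", err)
	}
	if u, err := parseBMCAddress(good); err == nil {
		fmt.Println("good address scheme:", u.Scheme)
	}
}
```

Because "<" is not a valid first character for a URL scheme, the parser treats the whole string as a relative path, and a relative path whose first segment contains ":" is rejected.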

In the medium term, we should institute a webhook that prevents invalid stuff like this being set. But in the short term we should probably do something like just go ahead and remove the finalizer if we can't create a BMC and the DeletionTimestamp is set.

Workarounds would be to update the Host to have a correct address (possible since we haven't yet implemented a webhook to prevent changing the address either) or manually remove the finalizer.

Comment 5 Andrea Fasano 2021-04-07 14:42:11 UTC
Issue will be fixed by the upstream PR https://github.com/metal3-io/baremetal-operator/pull/838 in conjunction with @Zane's commit: https://github.com/metal3-io/baremetal-operator/commit/beea4d0ead807a8f19b38d538db3502ee3504b97.

Changes will be ported downstream by PR https://github.com/openshift/baremetal-operator/pull/142

Comment 9 Adina Wolff 2021-07-12 07:57:10 UTC
Verified on:
Client Version: 4.8.0-0.nightly-2021-06-13-101614
Server Version: 4.8.0-0.nightly-2021-07-09-181248
Kubernetes Version: v1.21.1+f36aa36


The bmh is created but shows a registration error:

openshift-machine-api   openshift-worker-0-2   registering                                                        true     registration error


The bmh is deleted without a problem.

Comment 12 errata-xmlrpc 2021-07-27 22:33:58 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

