Bug 1917484 - [BM][IPI] Failed to scale down machineset
Summary: [BM][IPI] Failed to scale down machineset
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Zane Bitter
QA Contact: Lubov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-18 15:03 UTC by Gurenko Alex
Modified: 2021-02-24 15:54 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:54:15 UTC
Target Upstream Version:
Embargoed:


Attachments
baremetal-operator logs (5.64 MB, application/gzip)
2021-01-18 16:36 UTC, Gurenko Alex


Links
System ID Private Priority Status Summary Last Updated
Github metal3-io baremetal-operator pull 772 0 None closed Ironic: Don't adopt after clean failure during deprovisioning 2021-02-08 12:28:14 UTC
Github openshift baremetal-operator pull 122 0 None closed Bug 1917484: Don't adopt after clean failure during deprovisioning 2021-02-08 12:28:15 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:54:41 UTC

Description Gurenko Alex 2021-01-18 15:03:06 UTC
Description of problem: While scaling down the worker pool, a node failed to transition from the deprovisioning state to available.


Version-Release number of selected component (if applicable):

OCP: 4.7.0-0.nightly-2021-01-12-150634


How reproducible:


Steps to Reproduce:
1. oc annotate machine ocp-edge2-spctf-worker-0-5wt9r cluster.k8s.io/delete-machine=true
2. oc scale --replicas=1 machineset ocp-edge2-spctf-worker-0 -n openshift-machine-api
3. Wait for the scale-down to complete (the process took ~25-30 minutes)

Actual results:

$ oc get bmh
NAME                 STATUS   PROVISIONING STATUS      CONSUMER                            BMC                                                          HARDWARE PROFILE   ONLINE   ERROR
openshift-master-0   OK       externally provisioned   ocp-edge2-spctf-master-0            redfish://10.46.61.34/redfish/v1/Systems/System.Embedded.1                      true
openshift-master-1   OK       externally provisioned   ocp-edge2-spctf-master-1            redfish://10.46.61.35/redfish/v1/Systems/System.Embedded.1                      true
openshift-master-2   OK       externally provisioned   ocp-edge2-spctf-master-2            redfish://10.46.61.36/redfish/v1/Systems/System.Embedded.1                      true
openshift-worker-0   OK       provisioned              ocp-edge2-spctf-worker-0-5wt9r      redfish://10.46.61.37/redfish/v1/Systems/System.Embedded.1   unknown            true
openshift-worker-1   error    deprovisioning           ocp-edge2-spctf-worker-0-5kr4p      redfish://10.46.61.38/redfish/v1/Systems/System.Embedded.1   unknown            false    Host adoption failed: Error while attempting to adopt node 48eb0788-1f8e-4aa5-85a2-8b7033d2884c: Cannot validate image information for node 48eb0788-1f8e-4aa5-85a2-8b7033d2884c because one or more parameters are missing from its instance_info and insufficent information is present to boot from a remote volume. Missing are: ['image_source', 'kernel', 'ramdisk'].
openshift-worker-2   OK       provisioned              ocp-edge2-spctf-worker-lb-0-9vrqr   redfish://10.46.61.70/redfish/v1/Systems/System.Embedded.1   unknown            true
openshift-worker-3   OK       provisioned              ocp-edge2-spctf-worker-lb-0-48pqs   redfish://10.46.61.71/redfish/v1/Systems/System.Embedded.1   unknown            true


Expected results:

BMH goes to Available state


Additional info:

{"level":"info","ts":1610982036.0369716,"logger":"controllers.BareMetalHost","msg":"saving host status","baremetalhost":"openshift-machine-api/openshift-worker-1","provisioningState":"deprovisioning","operational status":"error","provisioning state":"deprovisioning"}
{"level":"info","ts":1610982036.0580297,"logger":"controllers.BareMetalHost","msg":"publishing event","baremetalhost":"openshift-machine-api/openshift-worker-1","reason":"RegistrationError","message":"Host adoption failed: Error while attempting to adopt node 48eb0788-1f8e-4aa5-85a2-8b7033d2884c: Cannot validate image information for node 48eb0788-1f8e-4aa5-85a2-8b7033d2884c because one or more parameters are missing from its instance_info and insufficent information is present to boot from a remote volume. Missing are: ['image_source', 'kernel', 'ramdisk']."}

Comment 1 Doug Hellmann 2021-01-18 15:07:13 UTC
Upstream is seeing this as well. https://kubernetes.slack.com/archives/CHD49TLE7/p1610970360010800

Comment 2 Gurenko Alex 2021-01-18 15:14:31 UTC
(In reply to Doug Hellmann from comment #1)
> Upstream is seeing this as well.
> https://kubernetes.slack.com/archives/CHD49TLE7/p1610970360010800

I'm afraid I don't have access to that link. Is there an open issue to track this upstream?

Comment 3 Zane Bitter 2021-01-18 15:50:38 UTC
Could you attach the log from the baremetal-operator?

Comment 4 Gurenko Alex 2021-01-18 16:36:40 UTC
Created attachment 1748486 [details]
baremetal-operator logs

(In reply to Zane Bitter from comment #3)
> Could you attach the log from the baremetal-operator?

Please let me know if that's okay.

Comment 6 Zane Bitter 2021-01-19 20:48:26 UTC
The issue reported upstream was not related, as that was tracked down to a patch that is not yet downstream. (The fix is https://github.com/metal3-io/baremetal-operator/pull/768)

The initial issue here is that the deprovisioning of the worker is timing out in ironic after 30 minutes, apparently without having received any contact from IPA:

2021-01-18T14:44:22.478 current provision state {host: 'openshift-worker-1', lastError: 'Timeout reached while cleaning the node. Please check if the ramdisk responsible for the cleaning is running on the node. Failed on step {}.', current: 'clean failed', target: 'available'}

Instead of returning that error message, the BMO just sets the state back to manageable in Ironic immediately. Prior to running the deprovision code, we always call Adopt to put the node in Ironic into the active state if necessary. The node being in the manageable state (instead of the previous clean wait state) means that we will now attempt to adopt in Ironic. This is largely unnecessary in this case, because the part of deprovisioning that is failing is cleaning, and we will end up cleaning the node again anyway as part of attempting to move it to the available state on the next provision.
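
To make the decision concrete, here is a minimal, self-contained Go sketch of the kind of guard being described; the ProvisionState type and the needsAdoption helper are hypothetical illustrations of the logic, not the actual baremetal-operator or gophercloud code.

package main

import "fmt"

// ProvisionState is a simplified stand-in for the Ironic provision states
// discussed in this comment.
type ProvisionState string

const (
	StateCleanFailed ProvisionState = "clean failed"
	StateCleanWait   ProvisionState = "clean wait"
	StateManageable  ProvisionState = "manageable"
)

// needsAdoption sketches the guard: adoption is only useful after a fresh
// registration (e.g. following an Ironic failover) leaves a node that should
// be active sitting in the manageable state. If the node hit "clean failed"
// while we were deprovisioning, adopting is pointless, because retrying
// deprovisioning will clean the node again anyway and its image information
// has already been cleared.
func needsAdoption(current ProvisionState, deprovisioning bool) bool {
	if deprovisioning && current == StateCleanFailed {
		// Skip adoption; go straight back to deprovisioning/cleaning.
		return false
	}
	return current == StateManageable
}

func main() {
	fmt.Println(needsAdoption(StateCleanFailed, true)) // false: don't adopt after a clean failure
	fmt.Println(needsAdoption(StateManageable, false)) // true: freshly re-registered node
}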

Currently the Adopt() code assumes that it will only have something to do immediately after an ironic failover that results in a fresh registration. It is during registration that we populate ironic with the current image (if we know of one) that we later use to adopt. However, in this case we have reached this state without re-registration, and ironic itself has cleared the image information in the course of 'deleting' (i.e. deprovisioning) the node. Hence the error message:

2021-01-18T14:44:32.852 current provision state {host: 'openshift-worker-1', lastError: "Error while attempting to adopt node 48eb0788-1f8e-4aa5-85a2-8b7033d2884c: Cannot validate image information for node 48eb0788-1f8e-4aa5-85a2-8b7033d2884c because one or more parameters are missing from its instance_info and insufficent information is present to boot from a remote volume. Missing are: ['image_source', 'kernel', 'ramdisk'].", current: 'adopt failed', target: 'active'}
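
The check that fails here is easy to reproduce in isolation. The following is a small illustrative Go sketch, assuming instance_info is modelled as a flat string map; the hasAdoptableImageInfo helper is hypothetical and is not Ironic's actual validation code.

package main

import "fmt"

// hasAdoptableImageInfo checks for the instance_info fields named in the
// error above. Ironic clears these fields while 'deleting' (deprovisioning)
// the node, so without a re-registration there is nothing left for adoption
// to validate against.
func hasAdoptableImageInfo(instanceInfo map[string]string) (bool, []string) {
	required := []string{"image_source", "kernel", "ramdisk"}
	var missing []string
	for _, key := range required {
		if instanceInfo[key] == "" {
			missing = append(missing, key)
		}
	}
	return len(missing) == 0, missing
}

func main() {
	// After deprovisioning, instance_info is empty.
	ok, missing := hasAdoptableImageInfo(map[string]string{})
	fmt.Println(ok, missing) // false [image_source kernel ramdisk]
}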

Comment 7 Zane Bitter 2021-01-20 21:03:59 UTC
Proposed https://github.com/metal3-io/baremetal-operator/pull/772 upstream to resolve the issue of getting stuck in deprovisioning.

It's not obvious from the ironic logs why the cleaning is failing. I'm not sure how we can debug that.

Comment 8 Derek Higgins 2021-01-21 11:33:12 UTC
(In reply to Zane Bitter from comment #6)
> 2021-01-18T14:44:22.478 current provision state {host: 'openshift-worker-1',
> lastError: 'Timeout reached while cleaning the node. Please check if the
> ramdisk responsible for the cleaning is running on the node. Failed on step
> {}.', current: 'clean failed', target: 'available'}

We've seen a few cases where the provisioning NIC loses its link-local address and dnsmasq stops working.
This timeout seems like it could be the same thing. Can you verify whether the fix for
https://bugzilla.redhat.com/show_bug.cgi?id=1911664 fixes your problem?

Comment 9 Gurenko Alex 2021-01-21 11:34:26 UTC
(In reply to Zane Bitter from comment #7)
> Proposed https://github.com/metal3-io/baremetal-operator/pull/772 upstream
> to resolve the issue of getting stuck in deprovisioning.
> 
> It's not obvious from the ironic logs why the cleaning is failing. I'm not
> sure how we can debug that.

Thanks for the update. As for the cleaning failure, we've been investigating and traced the problem back to this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1901040, which is caused by this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1908302 and affects both scaling up and down. Since PXE booting was not working without the link-local address, the cleaning image was not loaded. I used a workaround and managed to scale down the machine set properly.

I suggest we mark this BZ as blocked by 1908302 and keep it open so we can keep track of the progress. Since a workaround exists, though, we can lower the severity.

Comment 11 Lubov 2021-02-02 08:25:36 UTC
Ran scale-down a few times on 4.7.0-0.nightly-2021-02-01-095047; the problem did not reproduce.
Will re-open if it happens again.

Comment 14 errata-xmlrpc 2021-02-24 15:54:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

