Bug 1963154 - Current BMAC reconcile flow skips Ironic's deprovision step
Summary: Current BMAC reconcile flow skips Ironic's deprovision step
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Flavio Percoco
QA Contact: Yuri Obshansky
URL:
Whiteboard: AI-Team-Platform KNI-EDGE-4.8
Depends On:
Blocks:
 
Reported: 2021-05-21 15:14 UTC by Flavio Percoco
Modified: 2021-07-27 23:09 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:09:41 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift assisted-service pull 1808 0 None open OCPBUGSM-29476 Trigger BMH's deprovision when the ISODownloadURL changes 2021-05-24 15:56:54 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 23:09:56 UTC

Description Flavio Percoco 2021-05-21 15:14:03 UTC
Summary:

BMAC's flow does not allow for deprovisioning the node and clearing Ironic's cache. In order to achieve this we should:

1. Attach the BMH
2. Clear the BMH image data
3. Reconcile the BMH image data *ONLY* when the BMH's state is Ready (we currently set this value if it doesn't exist, regardless of the BMH's state)
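A minimal sketch of the ordering proposed above, assuming a simplified in-memory model of the BMH. The `BMH` class, its field names, and `reconcile_image` are hypothetical and only illustrate the state check in step 3; they are not the assisted-service or baremetal-operator API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BMH:
    """Hypothetical, simplified stand-in for a BareMetalHost resource."""
    state: str                       # e.g. "Ready", "Provisioned"
    detached: bool = False
    image_url: Optional[str] = None  # stands in for Spec.Image.URL


def reconcile_image(bmh: BMH, desired_url: str) -> str:
    """Run one reconcile pass and return a label for the action taken."""
    if bmh.image_url == desired_url:
        return "noop"
    if bmh.image_url is not None:
        # Steps 1 and 2: attach first so Ironic still manages the node,
        # then clear the image data so deprovisioning is triggered.
        bmh.detached = False
        bmh.image_url = None
        return "deprovision-requested"
    if bmh.state != "Ready":
        # Step 3: do not set the image until deprovisioning has finished.
        return "waiting"
    bmh.image_url = desired_url
    return "image-set"
```

With this model, a detached and provisioned host goes through "deprovision-requested", then "waiting" until its state is Ready, and only then "image-set".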

Longer:

Our current workflow is to set the BMH to `detached` once the agent has started the installation, to prevent Ironic from managing the node. When a BMH is detached, the node is force-deleted from Ironic, which means Ironic will also skip any cleanup of any resources created by/for this node, including the cached images.

In order to invalidate the image cache, we ought to make sure that the node is attached before the `Spec.Image.URL` field is cleared, to guarantee that the image is also deleted from Ironic's cache. Here are some notes from the testing done:


- Removing the `Spec.Image` does trigger the node deprovisioning (only if the BMH is attached)
- The BMH state will be set to Provisioned once the image has been set and the node has booted from that image
- The BMH state will be set to Ready once the image is removed and the deprovisioning steps are completed. Note that BMAC disables cleaning and inspection, so the only deprovisioning actions are detaching the ISO, removing it from the cache, and rebooting the node.
- Right now, if the InfraEnv is present, reconcile happens too fast and the deprovision is not triggered when the `Spec.Image` data is removed. For my tests, I removed the InfraEnv. This should not be needed if we also check the BMH state before setting the URL.


Things that didn't work:

1. Delete the Image URL from both the InfraEnv and BMH: The InfraEnv will get a new image fast enough, set it in the BMH and then the deprovisioning won't happen.
2. Delete the BMH's `Spec.Image.URL` field: The reconcile happens and the deprovisioning is never triggered
3. Leave the BMH as detached: The image is never deleted from the cache


Open Questions:

With the above, the deprovisioning workflow could be triggered by the user (if the BMH `Spec.Image` data is deleted, although we would have to add some logic to detect this) or by Assisted Installer, if a new image is generated.

- Is there a way to know, from the InfraEnv resource, if the image URL is "new"? (We could compare the InfraEnv URL with the one set in the BMH. If they differ, we trigger the proposed flow)

- Should the workflow of "deleting the BMH.Spec.Image" data be supported?

- Any workflow/requirements/cases that may be missing?

Comment 1 Antoni Segura Puimedon 2021-05-21 16:11:19 UTC
On the open questions:
- My understanding is that the URL does not change, is that right @ncarboni ?
- I don't think we should support that. Only BMAC should be doing that.

Comment 2 Flavio Percoco 2021-05-24 06:04:20 UTC
(In reply to Antoni Segura Puimedon from comment #1)
> On the open questions:
> - My understanding is that the URL does not change, is that right
> @ncarboni ?


Yeah, the URL never changes. That statement was meant to provoke thought rather than to be an actual proposal. I have found a workflow that may work for our case:

Trigger the deprovision (and subsequently re-set the image) when the ImageCreated condition date in the `InfraEnv` resource is newer than the `Provision` status in the BMH.
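A minimal sketch of that timestamp comparison, assuming both times are available as RFC 3339 strings in UTC (the format Kubernetes uses for timestamps). The function names are hypothetical, and how the two values are read from the `InfraEnv` condition and the BMH status is left out.

```python
from datetime import datetime, timezone


def parse_k8s_time(value: str) -> datetime:
    """Parse a UTC timestamp such as '2021-05-24T06:04:20Z'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def needs_reprovision(image_created_at: str, provisioned_at: str) -> bool:
    """True when the image was (re)created after the host was provisioned."""
    return parse_k8s_time(image_created_at) > parse_k8s_time(provisioned_at)
```

Equal timestamps are treated as "no change" here; whether that is the right tie-break is a design choice this sketch does not settle.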

I need to study and test this workflow a bit better as I fear there may be a corner case where the user may trigger the above workflow accidentally.

> - I don't think we should support that. Only BMAC should be doing that.

I agree with you :)

Comment 4 Flavio Percoco 2021-05-24 15:56:39 UTC
>  Yeah, the URL never changes.

So, the base URL never changes but the signature does since AI generates a new one on every image change. We can use that to check when an image has changed, but this workflow will require Auth to be enabled.
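A minimal sketch of detecting an image change from the signed URL, assuming the signature is carried in an `api_key` query parameter as in the URLs AI generates. The helper names are hypothetical; when no `api_key` is present (Auth disabled) the sketch can only compare base URLs, which is the limitation mentioned above.

```python
from urllib.parse import parse_qs, urlsplit


def split_download_url(url: str):
    """Return (base URL without query, api_key or None)."""
    parts = urlsplit(url)
    base = parts._replace(query="", fragment="").geturl()
    api_key = parse_qs(parts.query).get("api_key", [None])[0]
    return base, api_key


def image_changed(infraenv_url: str, bmh_url: str) -> bool:
    """Compare the URL held by the InfraEnv with the one set in the BMH."""
    new_base, new_key = split_download_url(infraenv_url)
    old_base, old_key = split_download_url(bmh_url)
    if new_key is None or old_key is None:
        # No signature to compare: a rebuilt image is invisible here
        # unless the base URL itself differs.
        return new_base != old_base
    return (new_base, new_key) != (old_base, old_key)
```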

Comment 5 Nick Carboni 2021-05-25 12:33:28 UTC
(In reply to Antoni Segura Puimedon from comment #1)
> On the open questions:
> - My understanding is that the URL does not change, is that right
> @ncarboni ?

Sounds like this is sorted out now. Removing needinfo

Comment 7 Chad Crum 2021-06-18 22:14:54 UTC
Validated this works on OCP 4.8.0-rc.0 hub + ACM downstream 2.3.0-DOWNSTREAM-2021-06-17-01-26-58

Tested as follows:

- Created CRs for SNO cluster as usual
- Confirmed that the infraenv and BMH both had the discovery ISO URL with the following API key (after the BMH completed provisioning):

isoDownloadURL: https://assisted-service-rhacm.apps.ocp-edge-cluster-assisted-0.qe.lab.redhat.com/api/assisted-install/v1/clusters/bba138ad-b983-46d1-ab69-71a168d56665/downloads/image?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiYmJhMTM4YWQtYjk4My00NmQxLWFiNjktNzFhMTY4ZDU2NjY1In0.ypFbmGR7Nv792qjCfLRaVwQm-u83ZScukTHa0Ioc9SUnmqGkc7kaLTSBI7zsVLdiW1SdVATHsAJ0y7fEElHJmQ


- Moments after provisioning completed and VM was started via ironic, I re-applied the infraenv with an ignition config override which caused the discovery iso to be rebuilt
- The BMH was deprovisioned and then began to reprovision. After reprovisioning, I checked both the infraenv and the BMH, and there was a new API key on the discovery URL:

isoDownloadURL: https://assisted-service-rhacm.apps.ocp-edge-cluster-assisted-0.qe.lab.redhat.com/api/assisted-install/v1/clusters/bba138ad-b983-46d1-ab69-71a168d56665/downloads/image?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiYmJhMTM4YWQtYjk4My00NmQxLWFiNjktNzFhMTY4ZDU2NjY1In0.QVMpNKMeKiFjFtIUxgyVHH8YSHStpBch5Fo--W2_WmRCjAo0YuU4dFzYw_s3KVkeUk3y-GvurNGBaYN2nM-HyA
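The two API keys above have the three-part shape of a JWT (header, payload, signature). As a small sketch, assuming they are standard base64url-encoded JWTs, the following decodes the first two parts and shows that only the signature differs between the two URLs. The variable and function names are illustrative.

```python
import base64
import json


def decode_segment(segment: str) -> dict:
    """Decode one base64url JWT segment (header or payload) into a dict."""
    padded = segment + "=" * (-len(segment) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


# api_key values copied from the two isoDownloadURL lines above.
OLD_KEY = (
    "eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9."
    "eyJjbHVzdGVyX2lkIjoiYmJhMTM4YWQtYjk4My00NmQxLWFiNjktNzFhMTY4ZDU2NjY1In0."
    "ypFbmGR7Nv792qjCfLRaVwQm-u83ZScukTHa0Ioc9SUnmqGkc7kaLTSBI7zsVLdiW1SdVATHsAJ0y7fEElHJmQ"
)
NEW_KEY = (
    "eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9."
    "eyJjbHVzdGVyX2lkIjoiYmJhMTM4YWQtYjk4My00NmQxLWFiNjktNzFhMTY4ZDU2NjY1In0."
    "QVMpNKMeKiFjFtIUxgyVHH8YSHStpBch5Fo--W2_WmRCjAo0YuU4dFzYw_s3KVkeUk3y-GvurNGBaYN2nM-HyA"
)

old_header, old_payload, old_sig = OLD_KEY.split(".")
new_header, new_payload, new_sig = NEW_KEY.split(".")

same_claims = (old_header, old_payload) == (new_header, new_payload)
signature_changed = old_sig != new_sig
payload = decode_segment(old_payload)
```

Here `payload` holds the cluster ID that also appears in the URL path, `same_claims` is true, and `signature_changed` is true, which matches the observation that the base URL stays the same while the signature changes.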

Comment 9 errata-xmlrpc 2021-07-27 23:09:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

