Bug 1975805 - [4.8.0] Install retry per recreating ACI, BMH error status is not cleared
Summary: [4.8.0] Install retry per recreating ACI, BMH error status is not cleared
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Nir Magnezi
QA Contact: Yuri Obshansky
URL:
Whiteboard: AI-Team-Hive KNI-EDGE-4.8
Depends On: 1972598
Blocks:
Reported: 2021-06-24 13:12 UTC by Nir Magnezi
Modified: 2021-10-18 17:36 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1972598
Environment:
Last Closed: 2021-10-18 17:36:33 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift assisted-service pull 2106 | 0 | None | open | Bug 1975805: KubeAPI: Adds install retry doc and reorder existing docs | 2021-06-28 09:04:12 UTC
Red Hat Product Errata | RHSA-2021:3759 | 0 | None | None | None | 2021-10-18 17:36:48 UTC

Description Nir Magnezi 2021-06-24 13:12:39 UTC
+++ This bug was initially created as a clone of Bug #1972598 +++

Description of problem:
=======================
Tested the scenario mentioned in bug 1972525.
To work around the issue, I updated the InfraEnv annotations to force a reconcile.

The installation was not retriggered because BMAC did not clear the error status from the BMH: https://gist.github.com/nmagnezi/1249fab7dcd313bd85107cf7c9f904f7#file-bmh-yaml-L27-L32

Version-Release number of selected component (if applicable):
=============================================================
current git head 403985c8d95b5bef173a11326b3de7aec3fdef18


How reproducible:
=================
1/1

Steps to Reproduce:
===================
1. Follow the steps in bug 1972525.
2. Remove the detached annotation from the BMH.
3. Add an annotation to the InfraEnv (workaround until bug 1972525 is resolved).
4. Inspect the BMH.
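
Steps 2 and 3 can be sketched as JSON merge patches. This is a minimal sketch, not the project's own tooling: it assumes the detached annotation key is `baremetalhost.metal3.io/detached`, and it uses a hypothetical `example.com/force-reconcile` key for the InfraEnv, since the report only says that an InfraEnv annotation was updated. The function names are also hypothetical.

```python
import json
import time

# Assumption: the annotation key used to detach a BMH; not stated in this report.
DETACHED_ANNOTATION = "baremetalhost.metal3.io/detached"
# Hypothetical key; the report only says "some InfraEnv annotation" was added.
RECONCILE_ANNOTATION = "example.com/force-reconcile"


def remove_detached_patch():
    """JSON merge patch (RFC 7386): a null value deletes the key."""
    return {"metadata": {"annotations": {DETACHED_ANNOTATION: None}}}


def force_reconcile_patch(stamp=None):
    """JSON merge patch that sets the annotation to a fresh value."""
    value = str(stamp if stamp is not None else int(time.time()))
    return {"metadata": {"annotations": {RECONCILE_ANNOTATION: value}}}


if __name__ == "__main__":
    print(json.dumps(remove_detached_patch()))
    print(json.dumps(force_reconcile_patch()))
```

Each printed document could then be applied with `oc patch <resource> <name> --type merge -p '<json>'`.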

Actual results:
===============
The image download URL on the BMH is not up to date with the InfraEnv.
The BMH status is still in an error state, as mentioned above.

Expected results:
=================
BMH should recover and retrigger the install
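
The inspection in step 4 can be sketched as a comparison of the two resources after they have been fetched as JSON. This is a minimal sketch under stated assumptions: it assumes the BMH exposes `spec.image.url`, `status.errorType`, and `status.errorMessage`, and that the InfraEnv exposes `status.isoDownloadURL`. These field paths are assumptions about the CRD schemas and are not taken from this report; the function names are hypothetical.

```python
def _dig(obj, *keys):
    """Walk nested dicts, returning None when any key is missing."""
    for key in keys:
        if not isinstance(obj, dict):
            return None
        obj = obj.get(key)
    return obj


def bmh_retry_problems(bmh, infraenv):
    """Return the reasons the BMH does not look ready to retry the install."""
    problems = []
    # Assumed fields: a non-empty errorType or errorMessage is a stale error.
    if _dig(bmh, "status", "errorType") or _dig(bmh, "status", "errorMessage"):
        problems.append("BMH status still reports an error")
    # Assumed fields: the BMH image URL should match the InfraEnv ISO URL.
    if _dig(bmh, "spec", "image", "url") != _dig(infraenv, "status", "isoDownloadURL"):
        problems.append("BMH image URL differs from the InfraEnv ISO URL")
    return problems
```

An empty list corresponds to the expected result; both entries present corresponds to the actual result reported here.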

--- Additional comment from Nir Magnezi on 2021-06-16 10:21:37 UTC ---

Also discussed here: https://coreos.slack.com/archives/C01FT9E4Q10/p1623834535216100

--- Additional comment from Flavio Percoco on 2021-06-17 08:16:46 UTC ---

I looked into this, here's the summary:

1. The environment was using an older version of Assisted Service, which was missing a couple of PRs
2. After updating the assisted-service container (manually modified the Operator Subscription), I was able to retry a deployment (I will attach a screenshot of the BMH's events showing the deprovision/provision of the image when the deployment was retried).

@nmagnezi do you want to give this another go before closing the issue?

--- Additional comment from Flavio Percoco on 2021-06-17 08:18:19 UTC ---

BMH's events showing re-provision of an InfraEnv URL

--- Additional comment from Nir Magnezi on 2021-06-20 14:19:05 UTC ---

(In reply to Flavio Percoco from comment #2)
> I looked into this, here's the summary:
> 
> 1. The environment was using an older version of Assisted Service, which was
> missing a couple of PRs
> 2. After updating the assisted-service container (manually modified the
> Operator Subscription), I was able to retry a deployment (I will attach a
> screenshot of the BMH's events showing the deprovision/provision of the
> image when the deployment was retried).
> 
> @nmagnezi do you want to give this another go before closing the
> issue?

I tried this again.
What I see now is that the BMH clears the error status but does not get the new image URL, so the install is not retried.
log: https://gist.github.com/nmagnezi/a49d34d6cf2a8cc0fc110621fde43642

Before we close this bug, let's follow up to check whether something I did differently or incorrectly caused this.

--- Additional comment from Nir Magnezi on 2021-06-22 11:32:29 UTC ---

I attempted this again because I had simply forgotten to remove the 'detached' annotation.
However, it now fails with a 404: https://gist.github.com/nmagnezi/2395564774afa4f5a812ac5cf4e3c0db#file-bmh-yaml-L317
SVC log: https://gist.github.com/nmagnezi/d2f2040ce3d391a823b1e6b3f6bfc888#file-retry-go-L83

--- Additional comment from Nir Magnezi on 2021-06-24 13:11:05 UTC ---

For QE:

The solution here is to document how to retry an installation: the user needs to recreate both the BMH(s) and the ACI.
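
The retry described above can be sketched as an ordered list of `oc` commands. This is a minimal sketch, not the documented procedure: it assumes the resources can be addressed as `baremetalhost` and `agentclusterinstall`, that they live in the same namespace, and that the original manifests are still available as files. The function name, file paths, and the delete-then-apply ordering are assumptions; the comment above only states that both must be recreated.

```python
def retry_commands(namespace, aci_name, bmh_names, aci_manifest, bmh_manifests):
    """Build (without running) oc commands that recreate the BMH(s) and ACI."""
    cmds = []
    # Delete the existing resources first.
    for name in bmh_names:
        cmds.append(["oc", "delete", "baremetalhost", name, "-n", namespace])
    cmds.append(["oc", "delete", "agentclusterinstall", aci_name, "-n", namespace])
    # Recreate them from the original manifests (hypothetical file paths).
    cmds.append(["oc", "apply", "-f", aci_manifest])
    for path in bmh_manifests:
        cmds.append(["oc", "apply", "-f", path])
    return cmds


if __name__ == "__main__":
    for cmd in retry_commands("demo", "aci", ["bmh-0"], "aci.yaml", ["bmh-0.yaml"]):
        print(" ".join(cmd))
```

Building the commands as lists keeps the sketch side-effect free; each entry could be handed to `subprocess.run(cmd, check=True)` once reviewed.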

Comment 2 Trey West 2021-07-07 16:56:53 UTC
Verified

Comment 5 errata-xmlrpc 2021-10-18 17:36:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

