Bug 1972598 - [master] Install retry per recreating ACI, BMH error status is not cleared
Summary: [master] Install retry per recreating ACI, BMH error status is not cleared
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Nir Magnezi
QA Contact: Yuri Obshansky
URL:
Whiteboard: AI-Team-Hive KNI-EDGE-4.8
Depends On:
Blocks: 1975805
Reported: 2021-06-16 10:03 UTC by Nir Magnezi
Modified: 2021-10-18 17:34 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1975805
Environment:
Last Closed: 2021-10-18 17:34:35 UTC
Target Upstream Version:
Embargoed:


Attachments
bmh events (195.94 KB, image/png)
2021-06-17 08:18 UTC, Flavio Percoco


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift assisted-service pull 2076 | 0 | None | open | Bug 1972598: KubeAPI: Adds install retry doc and reorder existing docs | 2021-06-24 11:50:01 UTC
Red Hat Product Errata | RHSA-2021:3759 | 0 | None | None | None | 2021-10-18 17:34:59 UTC

Description Nir Magnezi 2021-06-16 10:03:59 UTC
Description of problem:
=======================
I tested the scenario described in bug 1972525.
To work around that issue, I updated the InfraEnv annotations to force a reconcile.

The installation was not retriggered because BMAC did not clear the error status from the BMH: https://gist.github.com/nmagnezi/1249fab7dcd313bd85107cf7c9f904f7#file-bmh-yaml-L27-L32

Version-Release number of selected component (if applicable):
=============================================================
current git head 403985c8d95b5bef173a11326b3de7aec3fdef18


How reproducible:
=================
1/1

Steps to Reproduce:
===================
1. Follow the steps in bug 1972525.
2. Remove the detached annotation from the BMH.
3. Add an annotation to the InfraEnv (workaround until bug 1972525 is resolved).
4. Inspect the BMH.
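
Steps 2 and 3 can be expressed as JSON merge patches. The following is a minimal sketch, not the procedure actually used in this report. It assumes the detached annotation is Metal3's `baremetalhost.metal3.io/detached` key, which the report does not spell out. The `example.com/force-reconcile` key and both function names are hypothetical.

```python
import json

# Assumption: the "detached" annotation is the Metal3 key below.
DETACHED_ANNOTATION = "baremetalhost.metal3.io/detached"


def remove_annotation_patch(key):
    """Build a JSON merge patch that deletes one annotation (null removes the key)."""
    return {"metadata": {"annotations": {key: None}}}


def set_annotation_patch(key, value):
    """Build a JSON merge patch that adds or overwrites one annotation."""
    return {"metadata": {"annotations": {key: value}}}


if __name__ == "__main__":
    # Step 2: drop the detached annotation from the BMH.
    print(json.dumps(remove_annotation_patch(DETACHED_ANNOTATION)))
    # Step 3: touch the InfraEnv with a hypothetical annotation to force a reconcile.
    print(json.dumps(set_annotation_patch("example.com/force-reconcile", "1")))
```

Each printed document could be passed to a merge-type patch, for example `kubectl patch <resource> <name> --type merge -p '<patch>'`.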

Actual results:
===============
The image download URL on the BMH is not up to date with the InfraEnv.
The BMH status is still in the error state mentioned above.

Expected results:
=================
The BMH should recover and retrigger the install.
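
The two symptoms under "Actual results" can be checked mechanically. The sketch below works on the BMH and InfraEnv objects loaded as plain dictionaries, for example from `-o json` output. It assumes the relevant fields are `spec.image.url`, `status.errorMessage` and `status.operationalStatus` on the BMH, and `status.isoDownloadURL` on the InfraEnv. The function name is hypothetical.

```python
def bmh_retry_blockers(bmh, infraenv):
    """Return the reasons a BMH has not picked up a retry (empty list if none)."""
    problems = []

    # Symptom 1: the BMH still carries an error from the previous attempt.
    status = bmh.get("status", {})
    if status.get("errorMessage") or status.get("operationalStatus") == "error":
        problems.append("BMH still reports an error status")

    # Symptom 2: the BMH image URL does not match the InfraEnv ISO download URL.
    image_url = (bmh.get("spec", {}).get("image") or {}).get("url")
    iso_url = infraenv.get("status", {}).get("isoDownloadURL")
    if iso_url is None or image_url != iso_url:
        problems.append("BMH image URL differs from InfraEnv ISO download URL")

    return problems
```

An empty list means neither symptom is present. It does not by itself prove that the install was retriggered.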

Comment 2 Flavio Percoco 2021-06-17 08:16:46 UTC
I looked into this, here's the summary:

1. The environment was using an older version of Assisted Service, which was missing a couple of PRs
2. After updating the assisted-service container (manually modified the Operator Subscription), I was able to retry a deployment (I will attach a screenshot of the BMH's events showing the deprovision/provision of the image when the deployment was retried).

@nmagnezi do you want to give this another go before closing the issue?

Comment 3 Flavio Percoco 2021-06-17 08:18:19 UTC
Created attachment 1791711 [details]
bmh events

BMH's events showing re-provision of an InfraEnv URL

Comment 4 Nir Magnezi 2021-06-20 14:19:05 UTC
(In reply to Flavio Percoco from comment #2)
> I looked into this, here's the summary:
> 
> 1. The environment was using an older version of Assisted Service, which was
> missing a couple of PRs
> 2. After updating the assisted-service container (manually modified the
> Operator Subscription), I was able to retry a deployment (I will attach a
> screenshot of the BMH's events showing the deprovision/provision of the
> image when the deployment was retried).
> 
> @nmagnezi do you want to give this another go before closing the
> issue?

I tried this again.
What I see now is that the BMH clears the error status but does not get the new image URL, so no re-install happens.
log: https://gist.github.com/nmagnezi/a49d34d6cf2a8cc0fc110621fde43642

Before we close this bug, let's follow up to see whether I did something different or wrong that caused this.

Comment 5 Nir Magnezi 2021-06-22 11:32:29 UTC
I attempted this again because I had simply forgotten to remove the 'detached' annotation.
However, it now fails with a 404: https://gist.github.com/nmagnezi/2395564774afa4f5a812ac5cf4e3c0db#file-bmh-yaml-L317
SVC log: https://gist.github.com/nmagnezi/d2f2040ce3d391a823b1e6b3f6bfc888#file-retry-go-L83

Comment 6 Nir Magnezi 2021-06-24 13:11:05 UTC
For QE:

The solution here is to document how to retry an installation: the user needs to recreate both the BMH(s) and the ACI.
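
Recreating an object usually means re-submitting a saved manifest without the fields the API server populated. The sketch below shows one way to prepare an exported BMH or ACI manifest for that. It is an illustration only, not the documented retry procedure from the linked pull request. The function name and the list of server-populated fields are assumptions.

```python
import copy

# Assumption: these metadata fields are set by the API server and should not be re-submitted.
SERVER_SET_METADATA = (
    "resourceVersion",
    "uid",
    "creationTimestamp",
    "generation",
    "managedFields",
    "selfLink",
)


def manifest_for_recreate(obj):
    """Return a copy of an exported object with status and server-set metadata removed."""
    clean = copy.deepcopy(obj)  # leave the caller's object untouched
    clean.pop("status", None)   # status (including old errors) is owned by the controllers
    metadata = clean.get("metadata", {})
    for field in SERVER_SET_METADATA:
        metadata.pop(field, None)
    return clean
```

The cleaned manifest can then be created again after the original object has been deleted.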

Comment 10 errata-xmlrpc 2021-10-18 17:34:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

