Bug 1905577

Summary: Control plane machines not adopted when provisioning network is disabled
Product: OpenShift Container Platform Reporter: Stephen Benjamin <stbenjam>
Component: Bare Metal Hardware Provisioning Assignee: Stephen Benjamin <stbenjam>
Bare Metal Hardware Provisioning sub component: baremetal-operator QA Contact: Lubov <lshilin>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: medium CC: augol, beth.white, elgerman, hpokorny, lshilin
Version: 4.7 Keywords: AutomationBlocker, Triaged
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Adoption of externally provisioned hosts was not retried upon failure. In some cases a race could occur where we try to adopt before the image cache is populated, resulting in permanent adoption failure.
Consequence: Control plane bare metal hosts report "adoption failed."
Fix: We now retry on adoption failure.
Result: Control plane hosts are correctly adopted.
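The retry behaviour described in the Doc Text can be illustrated with a minimal sketch. This is not the baremetal-operator code (the operator is written in Go and the actual change is in the upstream pull request linked in the comments); the names `AdoptionError` and `adopt_with_retry`, the attempt limit, and the exponential backoff schedule are all assumptions made for illustration. The idea is simply that a transient failure, such as the image cache not yet accepting connections, should lead to another attempt rather than a permanent error state.

```python
import time
from typing import Callable


class AdoptionError(Exception):
    """Hypothetical error raised when a host cannot be adopted."""


def adopt_with_retry(
    adopt: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call adopt() until it succeeds; return the number of attempts used.

    `adopt` stands in for whatever performs the adoption (hypothetical).
    The last AdoptionError is re-raised once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            adopt()
            return attempt
        except AdoptionError:
            if attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    # Only reached when the loop body never ran.
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real delays; a caller that fails twice because the image cache is not ready and then succeeds would return after the third attempt.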
Story Points: ---
Clone Of:
: 1932452 Environment:
Last Closed: 2021-07-27 22:34:40 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1932452    

Description Stephen Benjamin 2020-12-08 15:22:59 UTC
openshift-machine-api   ostest-master-2   error    externally provisioned   ostest-7rtdq-master-2         redfish-virtualmedia+http://192.168.111.1:8000/redfish/v1/Systems/4f149ee8-7d13-483c-978a-81038234c5b5                      true     Host adoption failed: Error while attempting to adopt node 7de2e2ff-6984-4c4b-a127-bb5cf38037df: Validation of image href http://192.168.111.5:6181/images/rhcos-47.83.202012030221-0-openstack.x86_64.qcow2/rhcos-47.83.202012030221-0-compressed.x86_64.qcow2 failed, reason: HTTPConnectionPool(host='192.168.111.5', port=6181): Max retries exceeded with url: /images/rhcos-47.83.202012030221-0-openstack.x86_64.qcow2/rhcos-47.83.202012030221-0-compressed.x86_64.qcow2 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0b9f799048>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',)).

Comment 3 Stephen Benjamin 2021-02-01 14:00:36 UTC
This is already fixed upstream, https://github.com/metal3-io/baremetal-operator/pull/762.  I'm working on putting together an OpenShift PR but it's a bit challenging since BMO upstream has moved quite a bit ahead of OpenShift's 4.7 version.

Comment 4 Stephen Benjamin 2021-02-01 17:30:38 UTC
Tentatively, the plan is to get the fix into the first 4.7 z-stream, so we'll have more time to let the changes soak in CI.

Comment 7 Lubov 2021-02-25 18:02:47 UTC
Verified on 4.8.0-0.nightly-2021-02-25-112922, on a setup where the problem was 100% reproducible.

Comment 10 errata-xmlrpc 2021-07-27 22:34:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438