Bug 1902653 - [BM][IPI] Master deployment failed: No valid host was found. Reason: No conductor service registered which supports driver redfish for conductor group
Summary: [BM][IPI] Master deployment failed: No valid host was found. Reason: No condu...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Derek Higgins
QA Contact: Lubov
URL:
Whiteboard:
Depends On:
Blocks: 1917481
Reported: 2020-11-30 09:48 UTC by Yurii Prokulevych
Modified: 2021-03-29 02:30 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously on some systems, the installer would communicate with ironic before it was ready and fail. This is now prevented.
Clone Of:
Clones: 1917481
Environment:
Last Closed: 2021-02-24 15:36:33 UTC
Target Upstream Version:
Embargoed:
derekh: needinfo-


Links
System ID Private Priority Status Summary Last Updated
Github openshift ironic-image pull 134 0 None closed Bug 1902653: Wait for expected number of drivers starting API 2021-02-19 09:42:57 UTC
OpenStack gerrit 764911 0 None MERGED Register all hardware_interfaces together 2021-02-19 09:42:57 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:36:56 UTC

Description Yurii Prokulevych 2020-11-30 09:48:19 UTC
Description of problem:
-----------------------
Deploying 4.7 failed during master provisioning/deployment

time=\"2020-11-30T07:39:24Z\" level=error",
time=\"2020-11-30T07:39:24Z\" level=error msg=\"  on ../../tmp/openshift-install-267106486/masters/main.tf line 1, in resource \\\"ironic_node_v1\\\" \\\"openshift-master-host\\\":\"",
time=\"2020-11-30T07:39:24Z\" level=error msg=\"   1: resource \\\"ironic_node_v1\\\" \\\"openshift-master-host\\\" {\"",
time=\"2020-11-30T07:39:24Z\" level=error",
time=\"2020-11-30T07:39:24Z\" level=error",
time=\"2020-11-30T07:39:24Z\" level=error",
time=\"2020-11-30T07:39:24Z\" level=error msg=\"Error: Bad request with: [POST http://fd00:1101::2:6385/v1/nodes], error message: {\\\"error_message\\\": \\\"{\\\\\\\"faultcode\\\\\\\": \\\\\\\"Client\\\\\\\", \\\\\\\"faultstring\\\\\\\": \\\\\\\"No valid host was found. Reason: No conductor service registered which supports driver redfish for conductor group \\\\\\\\\\\\\\\"\\\\\\\\\\\\\\\".\\\\\\\", \\\\\\\"debuginfo\\\\\\\": null}\\\"}\"",
time=\"2020-11-30T07:39:24Z\" level=error",
time=\"2020-11-30T07:39:24Z\" level=error msg=\"  on ../../tmp/openshift-install-267106486/masters/main.tf line 1, in resource \\\"ironic_node_v1\\\" \\\"openshift-master-host\\\":\"",
time=\"2020-11-30T07:39:24Z\" level=error msg=\"   1: resource \\\"ironic_node_v1\\\" \\\"openshift-master-host\\\" {\"",
time=\"2020-11-30T07:39:24Z\" level=error",
time=\"2020-11-30T07:39:24Z\" level=error",
time=\"2020-11-30T07:39:24Z\" level=error",
time=\"2020-11-30T07:39:24Z\" level=error msg=\"Error: Bad request with: [POST http://fd00:1101::2:6385/v1/nodes], error message: {\\\"error_message\\\": \\\"{\\\\\\\"faultcode\\\\\\\": \\\\\\\"Client\\\\\\\", \\\\\\\"faultstring\\\\\\\": \\\\\\\"No valid host was found. Reason: No conductor service registered which supports driver redfish for conductor group \\\\\\\\\\\\\\\"\\\\\\\\\\\\\\\".\\\\\\\", \\\\\\\"debuginfo\\\\\\\": null}\\\"}\"",
time=\"2020-11-30T07:39:24Z\" level=error",
time=\"2020-11-30T07:39:24Z\" level=error msg=\"  on ../../tmp/openshift-install-267106486/masters/main.tf line 1, in resource \\\"ironic_node_v1\\\" \\\"openshift-master-host\\\":\"",
time=\"2020-11-30T07:39:24Z\" level=error msg=\"   1: resource \\\"ironic_node_v1\\\" \\\"openshift-master-host\\\" {\"",
time=\"2020-11-30T07:39:24Z\" level=error",
time=\"2020-11-30T07:39:24Z\" level=error",
time=\"2020-11-30T07:39:24Z\" level=fatal msg=\"failed to fetch Cluster: failed to generate asset \\\"Cluster\\\": failed to create cluster: failed to apply Terraform: failed to complete the change\""


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
registry.svc.ci.openshift.org/ocp/release:4.7.0-0.nightly-2020-11-28-014852


Steps to Reproduce:
-------------------
1. Run installation procedure for OCP-4.7


Actual results:
---------------
Deployment failed


Expected results:
-----------------
Deployment succeeds


Additional info:
----------------
Virtual deployment: 3 masters + 2 workers; provisioning net IPv6; baremetal net IPv6

The issue does not reproduce consistently.

Comment 2 Steven Hardy 2020-11-30 10:30:50 UTC
Looking at the logs I suspect we may need to make the driver check in the terraform provider (and potentially BMO) more robust:

https://github.com/openshift-metal3/terraform-provider-ironic/blob/master/ironic/provider.go#L368

2020-11-30 07:39:23.468 38 DEBUG ironic.common.hash_ring [req-5004b5b9-ce6b-4016-92d7-fbc52ff23766 bootstrap-user - - - -] Finished rebuilding hash rings, available drivers are :fake-hardware, :idrac, :ipmi ring /usr/lib/python3.6/site-packages/ironic/common/hash_ring.py:61
2020-11-30 07:39:23.468 38 DEBUG ironic.api.expose [req-5004b5b9-ce6b-4016-92d7-fbc52ff23766 bootstrap-user - - - -] Client-side error: No valid host was found. Reason: No conductor service registered which supports driver redfish for conductor group "". format_exception /usr/lib/python3.6/site-packages/ironic/api/expose.py:184

Then later we see irmc show up, but not yet redfish:

2020-11-30 07:39:23.485 39 DEBUG ironic.common.hash_ring [req-39c8210c-df22-4ce4-8087-72626a72ec50 bootstrap-user - - - -] Finished rebuilding hash rings, available drivers are :fake-hardware, :idrac, :ipmi, :irmc ring /usr/lib/python3.6/site-packages/ironic/common/hash_ring.py:61
2020-11-30 07:39:23.504 36 DEBUG ironic.common.hash_ring [req-1611e54a-6576-4b4e-8806-80df0da09e57 bootstrap-user - - - -] Finished rebuilding hash rings, available drivers are :fake-hardware, :idrac, :ipmi, :irmc ring /usr/lib/python3.6/site-packages/ironic/common/hash_ring.py:61
2020-11-30 07:39:23.512 36 DEBUG ironic.common.hash_ring [req-1611e54a-6576-4b4e-8806-80df0da09e57 bootstrap-user - - - -] Finished rebuilding hash rings, available drivers are :fake-hardware, :idrac, :ipmi, :irmc ring /usr/lib/python3.6/site-packages/ironic/common/hash_ring.py:61


In the conductor logs we see:

$ grep Loaded ironic-conductor.logs
2020-11-30 07:39:22.211 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following hardware types: ['fake-hardware', 'idrac', 'ipmi', 'irmc', 'redfish']
2020-11-30 07:39:22.213 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following network interfaces: ['flat', 'noop']
2020-11-30 07:39:22.213 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following storage interfaces: ['cinder', 'noop']
2020-11-30 07:39:22.280 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following vendor interfaces: ['fake', 'idrac', 'ipmitool', 'no-vendor']
2020-11-30 07:39:22.287 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following management interfaces: ['fake', 'idrac', 'idrac-redfish', 'ipmitool', 'irmc', 'redfish']
2020-11-30 07:39:22.292 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following raid interfaces: ['agent', 'fake', 'idrac-wsman', 'irmc', 'no-raid']
2020-11-30 07:39:22.296 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following bios interfaces: ['idrac-redfish', 'idrac-wsman', 'irmc', 'no-bios', 'redfish']
2020-11-30 07:39:22.296 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following rescue interfaces: ['no-rescue']
2020-11-30 07:39:22.298 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following deploy interfaces: ['direct', 'fake']
2020-11-30 07:39:22.299 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following power interfaces: ['fake', 'idrac', 'idrac-redfish', 'ipmitool', 'irmc', 'redfish']
2020-11-30 07:39:22.304 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following boot interfaces: ['fake', 'idrac-redfish-virtual-media', 'ipxe', 'pxe', 'redfish-virtual-media']
2020-11-30 07:39:22.304 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following console interfaces: ['no-console']
2020-11-30 07:39:22.306 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following inspect interfaces: ['fake', 'idrac', 'inspector', 'irmc', 'redfish']
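
To make the mismatch between the two logs explicit, here is a minimal sketch that compares the hardware types the conductor reports as loaded with the drivers the API's hash ring had seen at the time of the failed request. The sample lines are abbreviated from the logs above (the request IDs are dropped), and the helper names are hypothetical, not part of Ironic.

```python
import ast
import re

# Abbreviated from the API log above (request ID dropped).
API_LINE = (
    "2020-11-30 07:39:23.468 38 DEBUG ironic.common.hash_ring "
    "Finished rebuilding hash rings, available drivers are "
    ":fake-hardware, :idrac, :ipmi ring "
    "/usr/lib/python3.6/site-packages/ironic/common/hash_ring.py:61"
)

# Abbreviated from the conductor log above (request ID dropped).
CONDUCTOR_LINE = (
    "2020-11-30 07:39:22.211 1 INFO ironic.common.driver_factory "
    "Loaded the following hardware types: "
    "['fake-hardware', 'idrac', 'ipmi', 'irmc', 'redfish']"
)


def drivers_in_hash_ring(line):
    """Return the driver names listed in a 'Finished rebuilding hash rings' line."""
    match = re.search(r"available drivers are (.*?) ring ", line)
    if not match:
        return set()
    # Entries look like ':idrac'; the leading colon separates an empty
    # conductor group from the driver name.
    return {d.strip().lstrip(":") for d in match.group(1).split(",") if d.strip()}


def loaded_hardware_types(line):
    """Return the hardware types listed in a 'Loaded the following hardware types' line."""
    match = re.search(r"Loaded the following hardware types: (\[.*\])", line)
    if not match:
        return set()
    return set(ast.literal_eval(match.group(1)))


missing = loaded_hardware_types(CONDUCTOR_LINE) - drivers_in_hash_ring(API_LINE)
print(sorted(missing))  # ['irmc', 'redfish']
```

The conductor had already loaded redfish at 07:39:22, but the API still did not list it over a second later, which matches the error returned to the installer.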


So I think this is racy behavior: we need to pass the full expected list into the terraform provider, or find some way of having ironic tell us what the expected/configured interfaces are.
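
A rough sketch of what "pass the full expected list" could look like, written in Python for brevity rather than in the provider's Go. This is not the provider's code: `wait_for_drivers` and its parameters are hypothetical, and `list_drivers` stands in for whatever call returns the driver names currently reported by the Ironic API.

```python
import time


def wait_for_drivers(list_drivers, expected, timeout=60.0, interval=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll list_drivers() until every expected driver is reported.

    list_drivers is a hypothetical callable returning the driver names the
    API currently exposes. Raises TimeoutError if any expected driver is
    still missing once the timeout has elapsed.
    """
    expected = set(expected)
    deadline = clock() + timeout
    while True:
        available = set(list_drivers())
        missing = expected - available
        if not missing:
            return available
        if clock() >= deadline:
            raise TimeoutError(f"drivers still missing: {sorted(missing)}")
        sleep(interval)


# The first two snapshots are the lists seen in the API log above; the third
# is an assumed final state taken from the conductor's loaded hardware types.
snapshots = iter([
    ["fake-hardware", "idrac", "ipmi"],
    ["fake-hardware", "idrac", "ipmi", "irmc"],
    ["fake-hardware", "idrac", "ipmi", "irmc", "redfish"],
])
ready = wait_for_drivers(lambda: next(snapshots), {"redfish"},
                         interval=0, sleep=lambda seconds: None)
print(sorted(ready))
```

With a check like this the node creation would only start on the third poll, once redfish is listed, instead of on the first non-empty response.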

Comment 3 Steven Hardy 2020-11-30 13:18:59 UTC
OK, this was discussed; also note the previous discussion at https://github.com/openshift/installer/issues/2880#issuecomment-572547395

There seem to be two issues:

1. The Ironic code loading the drivers isn't atomic, so we can get different results on each API call before we get the final list of available drivers.

2. The terraform-provider-ironic (and BMO) code just expects any driver to be present, so we get a potential false positive when the API returns a partially populated driver list.

It sounds like there may be fixes possible on the Ironic side which would resolve (1), in which case we may no longer need to fix (2), so I'm moving this back to the Ironic component for discussion around that.
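
To illustrate the false positive in (2), here is a small sketch that runs two readiness checks over the partial driver lists seen in the API logs. Both functions are hypothetical simplifications for illustration, not the actual terraform-provider-ironic or BMO code, and the expected set is an assumption based on the driver the failing request asked for.

```python
def any_driver_present(available):
    """Simplified form of the check described in (2): ready once any driver is listed."""
    return len(available) > 0


def all_expected_present(available, expected):
    """Stricter check: ready only when every expected driver is listed."""
    return set(expected) <= set(available)


# Partial lists reported by the API while drivers were still registering,
# taken from the hash ring log lines quoted earlier in this bug.
SNAPSHOTS = [
    ["fake-hardware", "idrac", "ipmi"],
    ["fake-hardware", "idrac", "ipmi", "irmc"],
]

# Assumption: the deployment needs redfish, as in the failing POST /v1/nodes.
EXPECTED = {"redfish"}

for snapshot in SNAPSHOTS:
    print(snapshot,
          "any:", any_driver_present(snapshot),
          "all expected:", all_expected_present(snapshot, EXPECTED))
```

Both snapshots pass the "any driver" check even though redfish is absent, which is the window in which the installer's request fails.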

Comment 5 yigal dalal 2021-01-05 13:26:37 UTC
Please note that the same failure also occurs on 4.4 and 4.5, so the fix will be required on all versions.

Comment 7 Lubov 2021-01-07 08:44:38 UTC
Verified on 4.7.0-0.nightly-2021-01-07-034013

Ran the deployment a few times; the issue did not reproduce.
Will re-open if it happens again.

Comment 8 Derek Higgins 2021-01-18 15:00:40 UTC
(In reply to yigal dalal from comment #5)
> Please consider that same failure occurs on 4.4, 4.5 as well so the fix will
> be required on all versions

I've cloned this to 4.6 here
https://bugzilla.redhat.com/show_bug.cgi?id=1917481

Does it happen frequently in 4.5/4.4? iirc it used to just be a problem on systems where the bootstrap VM is hosted in a VM.

Comment 9 yigal dalal 2021-01-21 21:37:46 UTC
@derekh
Yes, it occurred on upgrade from 4.4 with this build: registry.ci.openshift.org/ocp/release:4.4.32

Comment 10 Derek Higgins 2021-02-01 09:44:08 UTC
(In reply to yigal dalal from comment #9)
> @derekh
> Yes it occurred on upgrade from 4.4 with this build: 
> registry.ci.openshift.org/ocp/release:4.4.32

OK, it has yet to merge into 4.6; if it does, I can backport it further.

Comment 13 errata-xmlrpc 2021-02-24 15:36:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

