Description of problem:
-----------------------
Deploying 4.7 failed during master provisioning/deployment. The following error block appears three times in the installer output (once per master host):

time="2020-11-30T07:39:24Z" level=error
time="2020-11-30T07:39:24Z" level=error msg="Error: Bad request with: [POST http://fd00:1101::2:6385/v1/nodes], error message: {\"error_message\": \"{\\\"faultcode\\\": \\\"Client\\\", \\\"faultstring\\\": \\\"No valid host was found. Reason: No conductor service registered which supports driver redfish for conductor group \\\"\\\".\\\", \\\"debuginfo\\\": null}\"}"
time="2020-11-30T07:39:24Z" level=error msg="  on ../../tmp/openshift-install-267106486/masters/main.tf line 1, in resource \"ironic_node_v1\" \"openshift-master-host\":"
time="2020-11-30T07:39:24Z" level=error msg="   1: resource \"ironic_node_v1\" \"openshift-master-host\" {"
time="2020-11-30T07:39:24Z" level=error
time="2020-11-30T07:39:24Z" level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change"

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
registry.svc.ci.openshift.org/ocp/release:4.7.0-0.nightly-2020-11-28-014852

Steps to Reproduce:
-------------------
1. Run the installation procedure for OCP 4.7

Actual results:
---------------
Deployment failed

Expected results:
-----------------
Deployment succeeds

Additional info:
----------------
Virtual deployment: 3 masters + 2 workers; provisioning net IPv6; baremetal net IPv6.
The issue is not consistently reproduced.
Looking at the logs I suspect we may need to make the driver check in the terraform provider (and potentially BMO) more robust:
https://github.com/openshift-metal3/terraform-provider-ironic/blob/master/ironic/provider.go#L368

2020-11-30 07:39:23.468 38 DEBUG ironic.common.hash_ring [req-5004b5b9-ce6b-4016-92d7-fbc52ff23766 bootstrap-user - - - -] Finished rebuilding hash rings, available drivers are :fake-hardware, :idrac, :ipmi ring /usr/lib/python3.6/site-packages/ironic/common/hash_ring.py:61
2020-11-30 07:39:23.468 38 DEBUG ironic.api.expose [req-5004b5b9-ce6b-4016-92d7-fbc52ff23766 bootstrap-user - - - -] Client-side error: No valid host was found. Reason: No conductor service registered which supports driver redfish for conductor group "". format_exception /usr/lib/python3.6/site-packages/ironic/api/expose.py:184

Then later we see irmc show up, but not yet redfish:

2020-11-30 07:39:23.485 39 DEBUG ironic.common.hash_ring [req-39c8210c-df22-4ce4-8087-72626a72ec50 bootstrap-user - - - -] Finished rebuilding hash rings, available drivers are :fake-hardware, :idrac, :ipmi, :irmc ring /usr/lib/python3.6/site-packages/ironic/common/hash_ring.py:61
2020-11-30 07:39:23.504 36 DEBUG ironic.common.hash_ring [req-1611e54a-6576-4b4e-8806-80df0da09e57 bootstrap-user - - - -] Finished rebuilding hash rings, available drivers are :fake-hardware, :idrac, :ipmi, :irmc ring /usr/lib/python3.6/site-packages/ironic/common/hash_ring.py:61
2020-11-30 07:39:23.512 36 DEBUG ironic.common.hash_ring [req-1611e54a-6576-4b4e-8806-80df0da09e57 bootstrap-user - - - -] Finished rebuilding hash rings, available drivers are :fake-hardware, :idrac, :ipmi, :irmc ring /usr/lib/python3.6/site-packages/ironic/common/hash_ring.py:61

In the conductor logs we see:

$ grep Loaded ironic-conductor.logs
2020-11-30 07:39:22.211 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following hardware types: ['fake-hardware', 'idrac', 'ipmi', 'irmc', 'redfish']
2020-11-30 07:39:22.213 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following network interfaces: ['flat', 'noop']
2020-11-30 07:39:22.213 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following storage interfaces: ['cinder', 'noop']
2020-11-30 07:39:22.280 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following vendor interfaces: ['fake', 'idrac', 'ipmitool', 'no-vendor']
2020-11-30 07:39:22.287 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following management interfaces: ['fake', 'idrac', 'idrac-redfish', 'ipmitool', 'irmc', 'redfish']
2020-11-30 07:39:22.292 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following raid interfaces: ['agent', 'fake', 'idrac-wsman', 'irmc', 'no-raid']
2020-11-30 07:39:22.296 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following bios interfaces: ['idrac-redfish', 'idrac-wsman', 'irmc', 'no-bios', 'redfish']
2020-11-30 07:39:22.296 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following rescue interfaces: ['no-rescue']
2020-11-30 07:39:22.298 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following deploy interfaces: ['direct', 'fake']
2020-11-30 07:39:22.299 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following power interfaces: ['fake', 'idrac', 'idrac-redfish', 'ipmitool', 'irmc', 'redfish']
2020-11-30 07:39:22.304 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following boot interfaces: ['fake', 'idrac-redfish-virtual-media', 'ipxe', 'pxe', 'redfish-virtual-media']
2020-11-30 07:39:22.304 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following console interfaces: ['no-console']
2020-11-30 07:39:22.306 1 INFO ironic.common.driver_factory [req-375c4856-d9eb-41d8-a340-fff0091e084c - - - - -] Loaded the following inspect interfaces: ['fake', 'idrac', 'inspector', 'irmc', 'redfish']

So I think this is racy behavior: we need to pass the full expected list into the terraform provider, or find some way of having Ironic tell us what the expected/configured interfaces are.
Ok, this was discussed; also note the previous discussion at https://github.com/openshift/installer/issues/2880#issuecomment-572547395

There seem to be two issues:

1. The Ironic code loading the drivers isn't atomic, so each API call can return different results until the final list of available drivers is built.
2. The terraform-provider-ironic (and BMO) code treats any non-empty driver list as ready, so we can get a false positive when the API returns a partially populated driver list.

It sounds like fixes may be possible on the Ironic side to resolve (1), in which case we may no longer need to fix (2), so moving this back to the Ironic component for discussion around that.
Please note that the same failure occurs on 4.4 and 4.5 as well, so the fix will be required on all versions.
Verified on 4.7.0-0.nightly-2021-01-07-034013. Ran the deployment a few times; the issue isn't reproduced. Will re-open if it happens again.
(In reply to yigal dalal from comment #5)
> Please consider that same failure occurs on 4.4, 4.5 as well so the fix will
> be required on all versions

I've cloned this to 4.6 here: https://bugzilla.redhat.com/show_bug.cgi?id=1917481

Does it happen frequently on 4.5/4.4? IIRC it used to only be a problem on systems where the host of the bootstrap VM is itself a VM.
@derekh Yes it occurred on upgrade from 4.4 with this build: registry.ci.openshift.org/ocp/release:4.4.32
(In reply to yigal dalal from comment #9)
> @derekh
> Yes it occurred on upgrade from 4.4 with this build:
> registry.ci.openshift.org/ocp/release:4.4.32

OK, it's yet to merge into 4.6; if it does, I can backport it further.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633