Created attachment 1683652 [details]
Screenshot of node that failed inspection

After deploying a cluster with 3 masters, I attempted to register 40 worker baremetalhosts (only 2 of them to be deployed via the machineset). A number of them failed inspection, see below. This is all on virt, so it is possible that the system was overloaded, but it feels like there is a timing bug to fix somewhere.

[kni@r640-u07 ~]$ oc version
Client Version: 4.4.0-0.nightly-2020-04-30-145451
Server Version: 4.4.0-0.nightly-2020-04-30-145451
Kubernetes Version: v1.17.1

[kni@r640-u07 ~]$ oc get baremetalhosts -n openshift-machine-api
NAME             STATUS   PROVISIONING STATUS      CONSUMER                BMC                        HARDWARE PROFILE   ONLINE   ERROR
vipi-master-0    OK       externally provisioned   ostest-master-0         ipmi://192.168.34.1:6230                      true
vipi-master-1    OK       externally provisioned   ostest-master-1         ipmi://192.168.34.1:6231                      true
vipi-master-2    OK       externally provisioned   ostest-master-2         ipmi://192.168.34.1:6232                      true
vipi-worker-0    OK       ready                                            ipmi://192.168.34.1:6233   unknown            true
vipi-worker-1    OK       ready                                            ipmi://192.168.34.1:6234   unknown            true
vipi-worker-10   OK       ready                                            ipmi://192.168.34.1:6243   unknown            true
vipi-worker-11   error    inspecting               ostest-worker-0-d5qrm   ipmi://192.168.34.1:6244                      true     Introspection timeout
vipi-worker-12   OK       ready                                            ipmi://192.168.34.1:6245   unknown            true
vipi-worker-13   OK       ready                                            ipmi://192.168.34.1:6246   unknown            true
vipi-worker-14   OK       ready                                            ipmi://192.168.34.1:6247   unknown            true
vipi-worker-15   error    inspecting                                       ipmi://192.168.34.1:6248                      true     Introspection timeout
vipi-worker-16   OK       ready                                            ipmi://192.168.34.1:6249   unknown            true
vipi-worker-17   error    inspecting                                       ipmi://192.168.34.1:6250                      true     Introspection timeout
vipi-worker-18   OK       ready                                            ipmi://192.168.34.1:6251   unknown            true
vipi-worker-19   OK       ready                                            ipmi://192.168.34.1:6252   unknown            true
vipi-worker-2    OK       ready                                            ipmi://192.168.34.1:6235   unknown            true
vipi-worker-20   error    inspecting                                       ipmi://192.168.34.1:6253                      true     Introspection timeout
vipi-worker-21   error    inspecting                                       ipmi://192.168.34.1:6254                      true     Introspection timeout
vipi-worker-22   OK       ready                                            ipmi://192.168.34.1:6255   unknown            true
vipi-worker-23   OK       ready                                            ipmi://192.168.34.1:6256   unknown            true
vipi-worker-24   error    inspecting                                       ipmi://192.168.34.1:6257                      true     Introspection timeout
vipi-worker-25   OK       ready                                            ipmi://192.168.34.1:6258   unknown            true
vipi-worker-26   OK       ready                                            ipmi://192.168.34.1:6259   unknown            true
vipi-worker-27   OK       ready                                            ipmi://192.168.34.1:6260   unknown            true
vipi-worker-28   error    inspecting                                       ipmi://192.168.34.1:6261                      true     Introspection timeout
vipi-worker-29   error    inspecting                                       ipmi://192.168.34.1:6262                      true     Introspection timeout
vipi-worker-3    error    inspecting                                       ipmi://192.168.34.1:6236                      true     Introspection timeout
vipi-worker-30   error    inspecting                                       ipmi://192.168.34.1:6263                      true     Introspection timeout
vipi-worker-31   OK       ready                                            ipmi://192.168.34.1:6264   unknown            true
vipi-worker-32   OK       ready                                            ipmi://192.168.34.1:6265   unknown            true
vipi-worker-33   OK       ready                                            ipmi://192.168.34.1:6266   unknown            true
vipi-worker-34   error    registration error                               ipmi://192.168.34.1:6267                      true     Failed to get power state for node 63058ec9-7807-4ffd-963e-0576d88fed3d. Error: IPMI call failed: power status.
vipi-worker-35   error    inspecting                                       ipmi://192.168.34.1:6268                      true     Introspection timeout
vipi-worker-36   error    inspecting                                       ipmi://192.168.34.1:6269                      true     Introspection timeout
vipi-worker-37   OK       ready                                            ipmi://192.168.34.1:6270   unknown            true
vipi-worker-38   OK       ready                                            ipmi://192.168.34.1:6271   unknown            true
vipi-worker-39   error    inspecting                                       ipmi://192.168.34.1:6272                      true     Introspection timeout
vipi-worker-4    OK       provisioned              ostest-worker-0-mlz56   ipmi://192.168.34.1:6237   unknown            true
vipi-worker-5    error    inspecting                                       ipmi://192.168.34.1:6238                      true     Introspection timeout
vipi-worker-6    OK       ready                                            ipmi://192.168.34.1:6239   unknown            true
vipi-worker-7    OK       ready                                            ipmi://192.168.34.1:6240   unknown            true
vipi-worker-8    OK       ready                                            ipmi://192.168.34.1:6241   unknown            true
vipi-worker-9    OK       ready                                            ipmi://192.168.34.1:6242   unknown            true

[kni@r640-u07 ~]$ oc get machinesets -n openshift-machine-api
NAME              DESIRED   CURRENT   READY   AVAILABLE   AGE
ostest-worker-0   2         2         1       1           12h

[derekh@r640-u07 vipi]$ openstack baremetal node list
Management cluster forced to be minikube when container runtime is not docker
+--------------------------------------+----------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name           | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+----------------+--------------------------------------+-------------+--------------------+-------------+
| 99c333b8-ad97-46a1-9382-e0ce59c767fd | vipi-worker-23 | None                                 | power on    | manageable         | False       |
| d8914849-912e-4f5d-948d-48bebfbf08c2 | vipi-worker-13 | None                                 | power on    | manageable         | False       |
| 37c7f391-d18c-4afb-8f13-55736c8af2c6 | vipi-worker-10 | None                                 | power on    | manageable         | False       |
| 63058ec9-7807-4ffd-963e-0576d88fed3d | vipi-worker-34 | None                                 | None        | enroll             | False       |
| b1b6c671-bde3-4d5a-a5ca-a2ce09c86b6a | vipi-worker-30 | None                                 | power on    | inspect failed     | False       |
| 4d431d32-8847-44cb-a146-b57d6d687360 | vipi-worker-12 | None                                 | power on    | manageable         | False       |
| 7aead464-d9b3-4a84-ad4d-d16d8b9171e8 | vipi-worker-8  | None                                 | power on    | manageable         | False       |
| 245a84f5-92bb-45d4-a159-c08df8cf5fb7 | vipi-worker-6  | None                                 | power on    | manageable         | False       |
| f69c3aba-b766-44fa-ba2f-1f2c12368936 | vipi-worker-37 | None                                 | power on    | manageable         | False       |
| 6ba6e7ac-4e13-48ff-bb4a-9cd5959c7ae7 | vipi-worker-18 | None                                 | power on    | manageable         | False       |
| 38375d0a-759f-4d79-82f9-4903b8450ab1 | vipi-worker-25 | None                                 | power on    | manageable         | False       |
| f9337c1c-906b-436b-90f8-15a0fa2c03a8 | vipi-worker-28 | None                                 | power on    | inspect failed     | False       |
| f81457d0-a9c0-434d-bace-4cb2ca020542 | vipi-worker-16 | None                                 | power on    | manageable         | False       |
| 57688c4f-e1ed-43b4-a706-0798b302eda4 | vipi-worker-22 | None                                 | power on    | manageable         | False       |
| 111db998-b33d-4692-9cb5-6c9d884e2907 | vipi-master-1  | 48f7a4f1-f851-4536-bc55-565408f14840 | power on    | active             | False       |
| 1ef0d09d-9799-4dd8-b4e3-7e14dccdee3a | vipi-worker-26 | None                                 | power on    | manageable         | False       |
| da354df9-7a98-4f42-b39c-d3469b2b55a6 | vipi-worker-1  | None                                 | power on    | manageable         | False       |
| 3bbd9398-69c4-4465-be7b-ebec386459f4 | vipi-worker-27 | None                                 | power on    | manageable         | False       |
| 7ecce66d-6bab-4823-b0d0-7db7e951f008 | vipi-worker-31 | None                                 | power on    | manageable         | False       |
| 393abd26-be1c-43bd-9178-2d749a625813 | vipi-worker-20 | None                                 | power on    | inspect failed     | False       |
| 3c624d25-2fb7-48c6-ac8e-04d379144602 | vipi-worker-9  | None                                 | power on    | manageable         | False       |
| bba342c0-2adc-4848-9f8e-bb0c11eb245b | vipi-worker-2  | None                                 | power on    | manageable         | False       |
| f94a9d50-0a78-4ef6-a7a3-6d834d576a95 | vipi-worker-33 | None                                 | power on    | manageable         | False       |
| 8b553c95-7a4a-4f63-a21d-60367d48de35 | vipi-worker-17 | None                                 | power on    | inspect failed     | False       |
| 118a01d5-d5a1-4993-8137-c076a88c05e1 | vipi-worker-14 | None                                 | power on    | manageable         | False       |
| 53c1b4b7-f4eb-487c-add1-69cbe038eaa8 | vipi-worker-36 | None                                 | power on    | inspect failed     | False       |
| 3651afda-d779-42f2-b08b-375c82fc0b6b | vipi-worker-7  | None                                 | power on    | manageable         | False       |
| 5c9c6905-6c53-48f5-8e75-73862a5578ee | vipi-worker-0  | None                                 | power on    | manageable         | False       |
| 6b370d0d-13fe-441e-b56c-013acdd822be | vipi-worker-29 | None                                 | power on    | inspect failed     | False       |
| 22d1ff4e-bef7-490b-980a-2ed353dd94f8 | vipi-worker-32 | None                                 | power on    | manageable         | False       |
| fa8ec1c2-4c11-47ee-a267-40ab2fbe5e82 | vipi-worker-19 | None                                 | power on    | manageable         | False       |
| 44163e8a-727e-4db7-927a-936543fae486 | vipi-master-0  | 03a434be-6b16-4bd2-a58b-1979a59a2e89 | power on    | active             | False       |
| 621a024f-57ef-4f24-b697-0e70dd1544f8 | vipi-worker-38 | None                                 | power on    | manageable         | False       |
| 4b22549b-0bee-4f23-9115-5eac95e938f5 | vipi-worker-4  | 05e14ed2-fbd3-4f78-a9c9-bc8d1fdbea4f | power on    | active             | False       |
| ffcd18d7-1221-4ebc-a9f0-d985679830c8 | vipi-worker-15 | None                                 | power on    | inspect failed     | False       |
| 35bec02f-5e97-419d-93d9-4d10604a7355 | vipi-master-2  | 9415e5ea-bd90-49a8-afc3-c66544d6710f | power on    | active             | False       |
| 3d3e2be8-46eb-4a9e-b7f4-b626c9d53556 | vipi-worker-21 | None                                 | power off   | inspect failed     | False       |
| f2d3f65e-f592-44da-bf77-bc1a95aebd32 | vipi-worker-24 | None                                 | power on    | inspect failed     | False       |
| 3095240e-3429-4caf-a32f-f439bafee786 | vipi-worker-35 | None                                 | power on    | inspect failed     | False       |
| 47575d02-8a06-45cd-bd18-365d9da4d9fc | vipi-worker-3  | None                                 | power on    | inspect failed     | False       |
| aadcb21b-1cc9-449b-8301-674c59bf0d59 | vipi-worker-39 | None                                 | power on    | inspect failed     | False       |
| 62781c4a-2bad-4765-b5dd-6f10a006e760 | vipi-worker-5  | None                                 | power on    | inspect failed     | False       |
| 146589c9-6453-4bcf-b588-fdaa7b922669 | vipi-worker-11 | d38551ce-fa7d-4968-be3d-c20593ef2bf4 | power on    | inspect failed     | False       |
+--------------------------------------+----------------+--------------------------------------+-------------+--------------------+-------------+

A screenshot of one of the failed nodes is attached; the error being reported is "Port <MAC> already exists, uuid: <uuid>".
There appear to be two different types of errors going on in this setup (I'll keep this bug for error No. 2):

1. Intermittent IPMI errors

Some of these that I've seen are down to an ssh error (vbmc ssh's to the virt host to talk to libvirt). I'm investigating my virt setup and will open another bug should one be needed to deal with this.

2.

> A screenshot of one of the failed nodes is attached; the error being reported
> is "Port <MAC> already exists, uuid: <uuid>".

It looks like this is the error that can be expected if a node boots into the inspection image but the inspector service has yet to "whitelist" the node.

As far as I can see, ironic powers on a node and triggers inspection with the inspector at roughly the same time, but the inspector whitelists each node for inspection asynchronously. It only whitelists one node at a time, with "introspection_delay" (5 seconds) between each node. So the more nodes you have, the higher the chance that nodes at the end of the list will boot and call back to the inspector before they get whitelisted (see the timing sketch after this comment).

With the example above, the last nodes on the inspection list won't get whitelisted until over 2 minutes after they are powered on.

In my case, reducing introspection_delay to 1 eliminates the problem and I can inspect all of the nodes, but I've presumably just moved the limit higher. The problem is presumably worse on virt, as nodes don't have a long POST time to delay the process.
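To make the timing concrete, here is a back-of-the-envelope sketch. This is my own illustration, not code from ironic or ironic-inspector; the 5-second delay and the 40-node count come from the report above, and the strictly linear queue is an assumption:

# Hypothetical illustration: the inspector whitelists one node every
# introspection_delay seconds, while ironic powers all nodes on at ~t=0.
INTROSPECTION_DELAY = 5   # seconds, the inspector default mentioned above
NUM_NODES = 40            # worker hosts registered in this report

def whitelist_time(position, delay=INTROSPECTION_DELAY):
    """Approximate seconds after power-on until the node at the given
    (0-based) queue position gets whitelisted."""
    return position * delay

print(whitelist_time(NUM_NODES - 1))  # -> 195 seconds for the last node

So the last nodes can sit PXE-booted for minutes before their callbacks would be accepted, consistent with the "over 2 minutes" observed above.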
A little more detail here

(In reply to Derek Higgins from comment #1)
> As far as I can see, ironic powers on a node and triggers inspection with the
> inspector at roughly the same time, but the inspector whitelists each node
> for inspection asynchronously. It only whitelists one node at a time, with
> "introspection_delay" (5 seconds) between each node. So the more nodes you
> have, the higher the chance that nodes at the end of the list will boot and
> call back to the inspector before they get whitelisted.
>
> With the example above, the last nodes on the inspection list won't get
> whitelisted until over 2 minutes after they are powered on.

In ironic/drivers/modules/inspector.py:

def _start_managed_inspection(task):
    ...
    client.start_introspection(task.node.uuid, manage_boot=False)
    cond_utils.node_power_action(task, states.POWER_ON)

The problem is that client.start_introspection isn't a synchronous call; it actually queues the node to the inspector (which processes one node every 5 seconds).
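For illustration, here is a minimal sketch of the asynchronous pattern at play. This is my own simplification, not the actual ironic-inspector code; names such as _pending and the printed "whitelisting" step are made up:

import queue
import threading
import time

INTROSPECTION_DELAY = 5  # seconds between nodes, as described above
_pending = queue.Queue()

def start_introspection(node_uuid, manage_boot=False):
    """Asynchronous: only enqueues the node, then returns immediately."""
    _pending.put(node_uuid)

def _introspection_worker():
    while True:
        node_uuid = _pending.get()
        print(f"whitelisting MACs for {node_uuid}")  # must happen before the
        time.sleep(INTROSPECTION_DELAY)              # node's PXE callback

threading.Thread(target=_introspection_worker, daemon=True).start()

# The caller (ironic's _start_managed_inspection) powers the node on right
# after this returns, so a node late in the queue can PXE-boot and call back
# minutes before its MACs are whitelisted.
start_introspection("node-40-uuid", manage_boot=False)
time.sleep(1)  # let the worker run before this demo exits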
Ugh. You're right, the delay should not be used when boot is managed by ironic.
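In code terms, the direction of that fix might look something like the following. This is a hypothetical sketch continuing the toy model above, not the actual ironic-inspector patch:

import queue
import time

INTROSPECTION_DELAY = 5
_pending = queue.Queue()  # holds (node_uuid, manage_boot) tuples

def whitelist_macs_for_pxe(node_uuid):
    """Hypothetical stand-in for the inspector's PXE whitelist step."""
    print(f"whitelisting MACs for {node_uuid}")

def _introspection_worker():
    while True:
        node_uuid, manage_boot = _pending.get()
        whitelist_macs_for_pxe(node_uuid)
        if manage_boot:
            # the inspector drives the power-on itself in this mode, so
            # pacing the queue does no harm
            time.sleep(INTROSPECTION_DELAY)
        # with manage_boot=False, ironic powers the node on immediately
        # after queuing, so skipping the delay closes the race window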
From ironic-inspector logs:

2020-09-03 07:05:25.166 1 INFO ironic_inspector.introspect [-] [node: 07fd191c-e6df-4d09-88ff-68b7a037f36e state starting] Whitelisting MAC's ['52:54:00:59:ad:33'] for a PXE boot
2020-09-03 07:05:25.200 1 INFO ironic_inspector.introspect [-] [node: d22a9923-ddc7-400e-b7a8-d8acabcef820 state starting] Whitelisting MAC's ['52:54:00:1f:44:07'] for a PXE boot
2020-09-03 07:05:25.205 1 INFO ironic_inspector.introspect [-] [node: c2529f25-9f9d-4328-a0f6-5b7fafb97d0d state starting] Whitelisting MAC's ['52:54:00:6e:a9:c9'] for a PXE boot
2020-09-03 07:05:25.253 1 INFO ironic_inspector.introspect [-] [node: 8e3e0776-532e-4b4b-8919-64c85c210885 state starting] Whitelisting MAC's ['52:54:00:71:1d:a1'] for a PXE boot

From ironic-conductor logs:

2020-09-03 07:05:27.241 1 INFO ironic.conductor.utils [req-4f43849f-a9d9-4f58-abd6-f8d8fa1eaa6d ironic-user - - - -] Successfully set node c2529f25-9f9d-4328-a0f6-5b7fafb97d0d power state to power on by power on.
2020-09-03 07:05:27.259 1 INFO ironic.conductor.utils [req-869b395b-9312-4299-b656-05648d5932fd ironic-user - - - -] Successfully set node d22a9923-ddc7-400e-b7a8-d8acabcef820 power state to power on by power on.
2020-09-03 07:05:27.272 1 INFO ironic.conductor.utils [req-f4631f8d-7ba1-4868-9484-660418bd8fc7 ironic-user - - - -] Successfully set node 07fd191c-e6df-4d09-88ff-68b7a037f36e power state to power on by power on.
2020-09-03 07:05:27.369 1 INFO ironic.conductor.utils [req-bd9307dd-29e4-4a24-8f85-60474999957d ironic-user - - - -] Successfully set node 8e3e0776-532e-4b4b-8919-64c85c210885 power state to power on by power on.

ironic-inspector now approves the MAC addresses before ironic-conductor turns on the hosts; the logs show that it does so, confirming the bug is fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196