Created attachment 1719903 [details]
example of configuration yaml

Version:
$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.6.0-0.nightly-2020-10-07-022140
built from commit bff124c9941762d2532490774b3f910241bd63f6
release image registry.svc.ci.openshift.org/ocp/release@sha256:0dd1ac669443d84d925546b20b2e66f6b1febb1afc8c6550ffdf700d840cf65a

Platform: baremetal IPI

What happened?
Trying to add a worker node using an incorrect value in rootDeviceHints (e.g. a non-existing device in deviceName). The node is not added, but the bmh and machine become provisioned.

$ oc get machineset -A
NAMESPACE               NAME                                DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   ocp-edge-cluster-0-bd2cl-worker-0   3         3         2       2           7h14m

$ oc get bmh -A
NAMESPACE               NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                     HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-bd2cl-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/a9243185-5e8e-4b57-83ed-3a06ed6f8607                      true
openshift-machine-api   openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-bd2cl-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/e1706dbf-6579-40b3-bb65-09db6ab76578                      true
openshift-machine-api   openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-bd2cl-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/f9391532-c0e9-49aa-999a-428686f052fc                      true
openshift-machine-api   openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-bd2cl-worker-0-6sl69   redfish://192.168.123.1:8000/redfish/v1/Systems/84f8be25-4c45-487b-9739-2e2c716110f3   unknown            true
openshift-machine-api   openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-bd2cl-worker-0-r6nrf   redfish://192.168.123.1:8000/redfish/v1/Systems/5e08cfdd-3117-40e2-a842-d894a25b732e   unknown            true
openshift-machine-api   openshift-worker-0-2   OK       provisioned              ocp-edge-cluster-0-bd2cl-worker-0-xvcxq   redfish://192.168.123.1:8000/redfish/v1/Systems/b4898e66-fefd-4caa-a630-84bc6943a3cf   unknown            true

$ oc get machine -o wide -A
NAMESPACE               NAME                                      PHASE         TYPE   REGION   ZONE   AGE     NODE         PROVIDERID                                                     STATE
openshift-machine-api   ocp-edge-cluster-0-bd2cl-master-0         Running                              7h15m   master-0-0   baremetalhost:///openshift-machine-api/openshift-master-0-0
openshift-machine-api   ocp-edge-cluster-0-bd2cl-master-1         Running                              7h15m   master-0-1   baremetalhost:///openshift-machine-api/openshift-master-0-1
openshift-machine-api   ocp-edge-cluster-0-bd2cl-master-2         Running                              7h15m   master-0-2   baremetalhost:///openshift-machine-api/openshift-master-0-2
openshift-machine-api   ocp-edge-cluster-0-bd2cl-worker-0-6sl69   Running                              7h      worker-0-0   baremetalhost:///openshift-machine-api/openshift-worker-0-0
openshift-machine-api   ocp-edge-cluster-0-bd2cl-worker-0-r6nrf   Running                              7h      worker-0-1   baremetalhost:///openshift-machine-api/openshift-worker-0-1
openshift-machine-api   ocp-edge-cluster-0-bd2cl-worker-0-xvcxq   Provisioned                          50m                  baremetalhost:///openshift-machine-api/openshift-worker-0-2

$ oc get nodes
NAME         STATUS   ROLES    AGE     VERSION
master-0-0   Ready    master   7h6m    v1.19.0+db1fc96
master-0-1   Ready    master   7h6m    v1.19.0+db1fc96
master-0-2   Ready    master   7h6m    v1.19.0+db1fc96
worker-0-0   Ready    worker   6h43m   v1.19.0+db1fc96
worker-0-1   Ready    worker   6h43m   v1.19.0+db1fc96

What did you expect to happen?
Expected some problem to be reported; at the very least the status of the bmh should be 'error', not OK.

How to reproduce it (as minimally and precisely as possible)?
1. Deploy an OCP 4.6 cluster with 3 masters and 2 workers.
2. Create a configuration yaml file for adding the new bmh.
   Set rootDeviceHints: deviceName to a non-existing device (see the attachment; a hypothetical sketch is also given below).
3. Add the bmh for the worker using the created yaml file.
   $ oc create -f new-node2.yaml -n openshift-machine-api
4. Wait till the bmh becomes ready.
5. Scale up the machineset to add the new machine.
   $ oc scale machineset MACHINESETNAME -n openshift-machine-api --replicas=3

Anything else we need to know?
In the metal3-ironic-conductor log there is a report of the failure:

Error finding the disk or partition device to deploy the image onto', 'details': "No suitable device was found for deployment using these hints {'name': 's== /dev/sdc'

Attaching metal3-ironic-conductor log
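For reference, a minimal sketch of what such a BareMetalHost definition could look like. This is not the actual attachment: the secret name, credentials, MAC address, and BMC system id below are illustrative placeholders; only the rootDeviceHints value matches the failing hint seen in the conductor log.

  ---
  apiVersion: v1
  kind: Secret
  metadata:
    name: openshift-worker-0-2-bmc-secret   # hypothetical name
    namespace: openshift-machine-api
  type: Opaque
  data:
    username: YWRtaW4=       # base64 placeholder
    password: cGFzc3dvcmQ=   # base64 placeholder
  ---
  apiVersion: metal3.io/v1alpha1
  kind: BareMetalHost
  metadata:
    name: openshift-worker-0-2
    namespace: openshift-machine-api
  spec:
    online: true
    bootMACAddress: 52:54:00:00:00:01   # placeholder
    bmc:
      address: redfish://192.168.123.1:8000/redfish/v1/Systems/<system-id>
      credentialsName: openshift-worker-0-2-bmc-secret
    rootDeviceHints:
      deviceName: /dev/sdc   # intentionally points at a device the host does not have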
Yes, it looks like ironic logs the invalid hint:

2020-10-07 15:41:40.777 1 DEBUG ironic.drivers.modules.agent_client [req-c5bc2237-cbea-4bec-9d5d-1fed728fa86c - - - - -] Command image.install_bootloader has finished for node d096adf1-e78a-4f2a-8b59-d33324447ddc with result {'id': 'e7d05fcb-17ef-4b00-9613-48caece507e1', 'command_name': 'install_bootloader', 'command_params': {'root_uuid': None, 'efi_system_part_uuid': None, 'prep_boot_part_uuid': None, 'target_boot_mode': 'uefi'}, 'command_status': 'FAILED', 'command_error': {'type': 'DeviceNotFound', 'code': 404, 'message': 'Error finding the disk or partition device to deploy the image onto', 'details': "No suitable device was found for deployment using these hints {'name': 's== /dev/sdc'}"}, 'command_result': None} _wait_for_command /usr/lib/python3.6/site-packages/ironic/drivers/modules/agent_client.py:132

but it does deploy this node:

2020-10-07 15:41:48.360 1 INFO ironic.conductor.utils [req-c5bc2237-cbea-4bec-9d5d-1fed728fa86c - - - - -] Successfully set node d096adf1-e78a-4f2a-8b59-d33324447ddc power state to power on by power on.
2020-10-07 15:41:48.360 1 INFO ironic.conductor.deployments [req-c5bc2237-cbea-4bec-9d5d-1fed728fa86c - - - - -] Node d096adf1-e78a-4f2a-8b59-d33324447ddc finished deploy step {'step': 'boot_instance', 'priority': 20, 'argsinfo': None, 'interface': 'deploy'}
2020-10-07 15:41:48.388 1 INFO ironic.conductor.task_manager [req-c5bc2237-cbea-4bec-9d5d-1fed728fa86c - - - - -] Node d096adf1-e78a-4f2a-8b59-d33324447ddc moved to provision state "active" from state "deploying"; target provision state is "None"
2020-10-07 15:41:48.389 1 INFO ironic.conductor.deployments [req-c5bc2237-cbea-4bec-9d5d-1fed728fa86c - - - - -] Successfully deployed node d096adf1-e78a-4f2a-8b59-d33324447ddc with instance a9c484b9-6daa-4530-b276-5b0ba23a98d0.

Can we get the ramdisk logs for this node (d096adf1-e78a-4f2a-8b59-d33324447ddc) or for another node that doesn't fail? You can find the ramdisk logs in /shared/log/ironic/deploy/ in the ironic-conductor container.
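As an aside on the hint format in that DEBUG line: the 's== /dev/sdc' value is ironic's exact-string-match operator applied to the name hint. The mapping below from the BareMetalHost field to the ironic hint is inferred from the log above, not quoted from operator source:

  # BareMetalHost side (what the reporter's yaml sets):
  spec:
    rootDeviceHints:
      deviceName: /dev/sdc
  # Ironic side (as seen in the conductor log; 's==' means exact string match):
  #   {'name': 's== /dev/sdc'}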
Setting to 4.7, I don't think this blocks 4.6.
*** Bug 1886735 has been marked as a duplicate of this bug. ***
Moving to 4.6.z.
Lubov - you can disregard the request for the ramdisk log in Comment 2. Thanks.
While provisioning, getting an error for bmh openshift-worker-0-2:

openshift-worker-0-2   error   provisioning error   ocp-edge-cluster-0-276ht-worker-0-5n2rn   redfish://192.168.123.1:8000/redfish/v1/Systems/677473a6-6dad-4c8d-a5b5-d6576f2cd0a6   unknown   true   Image provisioning failed: Agent returned error for deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on node 95684b67-d5a1-4061-85d9-d55bc7dd5f1b : Error performing deploy_step write_image: Error finding the disk or partition device to deploy the image onto: No suitable device was found for deployment using these hints {'name': 's== /dev/sdc'}.

The bmh then enters deprovisioning status and loops indefinitely through ready -> provisioning -> provisioning error -> deprovisioning. The machine stays in the Provisioned phase.

Opening a new bz. Cannot finish verifying this bz till the new one is solved.
> Cannot finish verifying this bz till the new one is solved

Up to you, but I think if we do get an error, this can be called done (and continued in the new bug).
Agreed, closing this one.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633