Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1886327

Summary: Attempt to add a worker using a bad rootDeviceHint: bmh and machine become Provisioned, no error in status
Product: OpenShift Container Platform Reporter: Lubov <lshilin>
Component: Bare Metal Hardware Provisioning    Assignee: Dmitry Tantsur <dtantsur>
Bare Metal Hardware Provisioning sub component: ironic QA Contact: Lubov <lshilin>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: bfournie, derekh, lshilin, nstielau
Version: 4.6    Keywords: Triaged
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: Bug Fix
Doc Text:
Baremetal IPI no longer silently skips writing an image when invalid root device hints are provided.
Story Points: ---
Clone Of:
: 1886769 (view as bug list) Environment:
Last Closed: 2021-02-24 15:24:05 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1900378    
Bug Blocks: 1886769    
Attachments:
example of configuration yaml (Flags: none)

Description Lubov 2020-10-08 08:14:43 UTC
Created attachment 1719903 [details]
example of configuration yaml

Version:
$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.6.0-0.nightly-2020-10-07-022140
built from commit bff124c9941762d2532490774b3f910241bd63f6
release image registry.svc.ci.openshift.org/ocp/release@sha256:0dd1ac669443d84d925546b20b2e66f6b1febb1afc8c6550ffdf700d840cf65a

Platform:
baremetal IPI

What happened?
Trying to add a worker node using an incorrect value in rootDeviceHints (e.g. a non-existent device in deviceName).
The node is not added, but the bmh and machine become Provisioned.

$ oc get machineset -A
NAMESPACE               NAME                                DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   ocp-edge-cluster-0-bd2cl-worker-0   3         3         2       2           7h14m

$ oc get bmh -A
NAMESPACE               NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-bd2cl-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/a9243185-5e8e-4b57-83ed-3a06ed6f8607                      true     
openshift-machine-api   openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-bd2cl-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/e1706dbf-6579-40b3-bb65-09db6ab76578                      true     
openshift-machine-api   openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-bd2cl-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/f9391532-c0e9-49aa-999a-428686f052fc                      true     
openshift-machine-api   openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-bd2cl-worker-0-6sl69   redfish://192.168.123.1:8000/redfish/v1/Systems/84f8be25-4c45-487b-9739-2e2c716110f3   unknown            true     
openshift-machine-api   openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-bd2cl-worker-0-r6nrf   redfish://192.168.123.1:8000/redfish/v1/Systems/5e08cfdd-3117-40e2-a842-d894a25b732e   unknown            true     
openshift-machine-api   openshift-worker-0-2   OK       provisioned              ocp-edge-cluster-0-bd2cl-worker-0-xvcxq   redfish://192.168.123.1:8000/redfish/v1/Systems/b4898e66-fefd-4caa-a630-84bc6943a3cf   unknown            true     

$ oc get machine -o wide -A
NAMESPACE               NAME                                      PHASE         TYPE   REGION   ZONE   AGE     NODE         PROVIDERID                                                    STATE
openshift-machine-api   ocp-edge-cluster-0-bd2cl-master-0         Running                              7h15m   master-0-0   baremetalhost:///openshift-machine-api/openshift-master-0-0   
openshift-machine-api   ocp-edge-cluster-0-bd2cl-master-1         Running                              7h15m   master-0-1   baremetalhost:///openshift-machine-api/openshift-master-0-1   
openshift-machine-api   ocp-edge-cluster-0-bd2cl-master-2         Running                              7h15m   master-0-2   baremetalhost:///openshift-machine-api/openshift-master-0-2   
openshift-machine-api   ocp-edge-cluster-0-bd2cl-worker-0-6sl69   Running                              7h      worker-0-0   baremetalhost:///openshift-machine-api/openshift-worker-0-0   
openshift-machine-api   ocp-edge-cluster-0-bd2cl-worker-0-r6nrf   Running                              7h      worker-0-1   baremetalhost:///openshift-machine-api/openshift-worker-0-1   
openshift-machine-api   ocp-edge-cluster-0-bd2cl-worker-0-xvcxq   Provisioned                          50m                  baremetalhost:///openshift-machine-api/openshift-worker-0-2  
 
$ oc get nodes
NAME         STATUS   ROLES    AGE     VERSION
master-0-0   Ready    master   7h6m    v1.19.0+db1fc96
master-0-1   Ready    master   7h6m    v1.19.0+db1fc96
master-0-2   Ready    master   7h6m    v1.19.0+db1fc96
worker-0-0   Ready    worker   6h43m   v1.19.0+db1fc96
worker-0-1   Ready    worker   6h43m   v1.19.0+db1fc96


What did you expect to happen?
Expected some problem to be reported; at the very least, the status of the bmh should be 'error' and not OK.
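(For reference, a minimal way to check this, assuming the standard BareMetalHost status fields operationalStatus and errorMessage:)

$ oc get bmh openshift-worker-0-2 -n openshift-machine-api -o jsonpath='{.status.operationalStatus}{"\n"}'
$ oc get bmh openshift-worker-0-2 -n openshift-machine-api -o jsonpath='{.status.errorMessage}{"\n"}'

Here both come back clean (OK / empty) even though the deploy step failed.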

How to reproduce it (as minimally and precisely as possible)?
1. Deploy an OCP 4.6 cluster with 3 masters and 2 workers.
2. Create a configuration yaml file for adding a new bmh; set rootDeviceHints.deviceName to a non-existent device (see the attachment and the sketch after these steps).
3. Add the bmh for the worker using the created yaml file.
$ oc create -f new-node2.yaml -n openshift-machine-api
4. Wait until the bmh becomes ready.
5. Scale up the machineset to add the new machine.
$ oc scale machineset MACHINESETNAME -n openshift-machine-api --replicas=3
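A minimal sketch of such a configuration yaml, assuming the usual metal3.io BareMetalHost layout (the MAC address, BMC system ID, and secret contents below are placeholders, not the actual attachment):

---
apiVersion: v1
kind: Secret
metadata:
  name: openshift-worker-0-2-bmc-secret
  namespace: openshift-machine-api
type: Opaque
data:
  username: <base64-encoded BMC username>   # placeholder
  password: <base64-encoded BMC password>   # placeholder
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: openshift-worker-0-2
  namespace: openshift-machine-api
spec:
  online: true
  bootMACAddress: 52:54:00:00:00:01          # placeholder MAC
  bmc:
    address: redfish://192.168.123.1:8000/redfish/v1/Systems/<system-id>   # placeholder system ID
    credentialsName: openshift-worker-0-2-bmc-secret
  rootDeviceHints:
    deviceName: /dev/sdc                     # device that does not exist on the host -> triggers this bug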

Anything else we need to know?
In the metal3-ironic-conductor log there is a report of the failure:
Error finding the disk or partition device to deploy the image onto', 'details': "No suitable device was found for deployment using these hints {'name': 's== /dev/sdc'

Attaching the metal3-ironic-conductor log.
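(For reference, a sketch of how the conductor log can be collected, assuming the conductor runs as the metal3-ironic-conductor container in the metal3 pod:)

$ METAL3_POD=$(oc get pods -n openshift-machine-api -o name | grep metal3)
$ oc logs -n openshift-machine-api "$METAL3_POD" -c metal3-ironic-conductor > metal3-ironic-conductor.log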

Comment 2 Bob Fournier 2020-10-08 19:06:15 UTC
Yes, it looks like ironic logs the invalid hint:
2020-10-07 15:41:40.777 1 DEBUG ironic.drivers.modules.agent_client [req-c5bc2237-cbea-4bec-9d5d-1fed728fa86c - - - - -] Command image.install_bootloader has finished for node d096adf1-e78a-4f2a-8b59-d33324447ddc with result {'id': 'e7d05fcb-17ef-4b00-9613-48caece507e1', 'command_name': 'install_bootloader', 'command_params': {'root_uuid': None, 'efi_system_part_uuid': None, 'prep_boot_part_uuid': None, 'target_boot_mode': 'uefi'}, 'command_status': 'FAILED', 'command_error': {'type': 'DeviceNotFound', 'code': 404, 'message': 'Error finding the disk or partition device to deploy the image onto', 'details': "No suitable device was found for deployment using these hints {'name': 's== /dev/sdc'}"}, 'command_result': None} _wait_for_command /usr/lib/python3.6/site-packages/ironic/drivers/modules/agent_client.py:132

but it does deploy this node:
2020-10-07 15:41:48.360 1 INFO ironic.conductor.utils [req-c5bc2237-cbea-4bec-9d5d-1fed728fa86c - - - - -] Successfully set node d096adf1-e78a-4f2a-8b59-d33324447ddc power state to power on by power on.
2020-10-07 15:41:48.360 1 INFO ironic.conductor.deployments [req-c5bc2237-cbea-4bec-9d5d-1fed728fa86c - - - - -] Node d096adf1-e78a-4f2a-8b59-d33324447ddc finished deploy step {'step': 'boot_instance', 'priority': 20, 'argsinfo': None, 'interface': 'deploy'}
2020-10-07 15:41:48.388 1 INFO ironic.conductor.task_manager [req-c5bc2237-cbea-4bec-9d5d-1fed728fa86c - - - - -] Node d096adf1-e78a-4f2a-8b59-d33324447ddc moved to provision state "active" from state "deploying"; target provision state is "None"
2020-10-07 15:41:48.389 1 INFO ironic.conductor.deployments [req-c5bc2237-cbea-4bec-9d5d-1fed728fa86c - - - - -] Successfully deployed node d096adf1-e78a-4f2a-8b59-d33324447ddc with instance a9c484b9-6daa-4530-b276-5b0ba23a98d0.

Can we get the ramdisk logs for this node (d096adf1-e78a-4f2a-8b59-d33324447ddc) or another node that doesn't fail?  You can get the ramdisk logs in /shared/log/ironic/deploy/ in the ironic-conductor container.
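(A possible way to pull them, under the same assumptions about the metal3 pod and container names; <metal3-pod> is a placeholder:)

$ oc exec -n openshift-machine-api <metal3-pod> -c metal3-ironic-conductor -- ls /shared/log/ironic/deploy/
$ oc cp openshift-machine-api/<metal3-pod>:/shared/log/ironic/deploy/ ./deploy-logs/ -c metal3-ironic-conductor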

Comment 3 Nick Stielau 2020-10-08 21:49:27 UTC
Setting to 4.7, I don't think this blocks 4.6.

Comment 4 Derek Higgins 2020-10-09 10:33:42 UTC
*** Bug 1886735 has been marked as a duplicate of this bug. ***

Comment 5 Bob Fournier 2020-10-09 10:39:51 UTC
Moving to 4.6.z.

Comment 6 Bob Fournier 2020-10-09 10:44:33 UTC
Lubov - you can disregard the request for the ramdisk logs in Comment 2. Thanks.

Comment 9 Lubov 2020-11-22 16:28:47 UTC
While provisioning, getting an error for the bmh:
openshift-worker-0-2   error   provisioning error       ocp-edge-cluster-0-276ht-worker-0-5n2rn   redfish://192.168.123.1:8000/redfish/v1/Systems/677473a6-6dad-4c8d-a5b5-d6576f2cd0a6   unknown            true     Image provisioning failed: Agent returned error for deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on node 95684b67-d5a1-4061-85d9-d55bc7dd5f1b : Error performing deploy_step write_image: Error finding the disk or partition device to deploy the image onto: No suitable device was found for deployment using these hints {'name': 's== /dev/sdc'}.

The BMH enters deprovisioning status, then loops indefinitely through ready -> provisioning -> provisioning error -> deprovisioning.

The machine remains in the Provisioned phase.

Opening a new bz. Cannot finish verifying this bz until the new one is solved.
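(For reference, the loop and the machine phase can be observed with, e.g.:)

$ oc get bmh -n openshift-machine-api -w
$ oc get machine -n openshift-machine-api -o wide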

Comment 10 Dmitry Tantsur 2020-11-24 17:12:19 UTC
> Cannot finish verifying this bz until the new one is solved

Up to you, but I think if we do get an error, this can be called done (and continued in the new bug).

Comment 11 Lubov 2020-11-24 17:29:41 UTC
Agreed, closing this one.

Comment 14 errata-xmlrpc 2021-02-24 15:24:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633