Bug 1886327 - Attempt to add a worker using bad rootDeviceHints: bmh and machine become Provisioned, no error in status
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Dmitry Tantsur
QA Contact: Lubov
URL:
Whiteboard:
Duplicates: 1886735 (view as bug list)
Depends On: 1900378
Blocks: 1886769
 
Reported: 2020-10-08 08:14 UTC by Lubov
Modified: 2021-02-24 15:24 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Baremetal IPI no longer silently skips writing an image when invalid root device hints are provided.
Clone Of:
Clones: 1886769 (view as bug list)
Environment:
Last Closed: 2021-02-24 15:24:05 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
example of configuration yaml (581 bytes, text/plain)
2020-10-08 08:14 UTC, Lubov
no flags Details


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 757037 0 None MERGED Do not silently swallow errors in the write_image deploy step 2021-02-18 19:21:01 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:24:29 UTC

Description Lubov 2020-10-08 08:14:43 UTC
Created attachment 1719903 [details]
example of configuration yaml

Version:
$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.6.0-0.nightly-2020-10-07-022140
built from commit bff124c9941762d2532490774b3f910241bd63f6
release image registry.svc.ci.openshift.org/ocp/release@sha256:0dd1ac669443d84d925546b20b2e66f6b1febb1afc8c6550ffdf700d840cf65a

Platform:
baremetal IPI

What happened?
Trying to add a worker node using an incorrect value in rootDeviceHints (e.g. a non-existent device in deviceName).
The node is not added, but the bmh and machine become provisioned.

$ oc get machineset -A
NAMESPACE               NAME                                DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   ocp-edge-cluster-0-bd2cl-worker-0   3         3         2       2           7h14m

$ oc get bmh -A
NAMESPACE               NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-bd2cl-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/a9243185-5e8e-4b57-83ed-3a06ed6f8607                      true     
openshift-machine-api   openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-bd2cl-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/e1706dbf-6579-40b3-bb65-09db6ab76578                      true     
openshift-machine-api   openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-bd2cl-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/f9391532-c0e9-49aa-999a-428686f052fc                      true     
openshift-machine-api   openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-bd2cl-worker-0-6sl69   redfish://192.168.123.1:8000/redfish/v1/Systems/84f8be25-4c45-487b-9739-2e2c716110f3   unknown            true     
openshift-machine-api   openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-bd2cl-worker-0-r6nrf   redfish://192.168.123.1:8000/redfish/v1/Systems/5e08cfdd-3117-40e2-a842-d894a25b732e   unknown            true     
openshift-machine-api   openshift-worker-0-2   OK       provisioned              ocp-edge-cluster-0-bd2cl-worker-0-xvcxq   redfish://192.168.123.1:8000/redfish/v1/Systems/b4898e66-fefd-4caa-a630-84bc6943a3cf   unknown            true     

$ oc get machine -o wide -A
NAMESPACE               NAME                                      PHASE         TYPE   REGION   ZONE   AGE     NODE         PROVIDERID                                                    STATE
openshift-machine-api   ocp-edge-cluster-0-bd2cl-master-0         Running                              7h15m   master-0-0   baremetalhost:///openshift-machine-api/openshift-master-0-0   
openshift-machine-api   ocp-edge-cluster-0-bd2cl-master-1         Running                              7h15m   master-0-1   baremetalhost:///openshift-machine-api/openshift-master-0-1   
openshift-machine-api   ocp-edge-cluster-0-bd2cl-master-2         Running                              7h15m   master-0-2   baremetalhost:///openshift-machine-api/openshift-master-0-2   
openshift-machine-api   ocp-edge-cluster-0-bd2cl-worker-0-6sl69   Running                              7h      worker-0-0   baremetalhost:///openshift-machine-api/openshift-worker-0-0   
openshift-machine-api   ocp-edge-cluster-0-bd2cl-worker-0-r6nrf   Running                              7h      worker-0-1   baremetalhost:///openshift-machine-api/openshift-worker-0-1   
openshift-machine-api   ocp-edge-cluster-0-bd2cl-worker-0-xvcxq   Provisioned                          50m                  baremetalhost:///openshift-machine-api/openshift-worker-0-2  
 
$ oc get nodes
NAME         STATUS   ROLES    AGE     VERSION
master-0-0   Ready    master   7h6m    v1.19.0+db1fc96
master-0-1   Ready    master   7h6m    v1.19.0+db1fc96
master-0-2   Ready    master   7h6m    v1.19.0+db1fc96
worker-0-0   Ready    worker   6h43m   v1.19.0+db1fc96
worker-0-1   Ready    worker   6h43m   v1.19.0+db1fc96


What did you expect to happen?
Expected some problem to be reported; at a minimum, the status of the bmh should be 'error' and not OK.
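For reference, once such a failure is reported, it would be expected to surface in the BareMetalHost status roughly as follows (an illustrative sketch only; the field values are hypothetical and not taken from this cluster):

# illustrative sketch of the expected BareMetalHost status, not actual output
status:
  operationalStatus: error
  errorType: provisioning error
  errorMessage: >-
    Image provisioning failed: No suitable device was found for deployment
    using these hints {'name': 's== /dev/sdc'}
  provisioning:
    state: provisioning error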

How to reproduce it (as minimally and precisely as possible)?
1. Deploy an OCP 4.6 cluster with 3 masters and 2 workers
2. Create a configuration yaml file for adding a new bmh. Set rootDeviceHints: deviceName to a non-existent device (see the attachment; a hedged sketch of such a file also follows these steps)
3. Add the bmh for the worker using the created yaml file.
$ oc create -f new-node2.yaml -n openshift-machine-api
4. Wait until the bmh becomes ready
5. Scale up the machineset to add the new machine
$ oc scale machineset MACHINESETNAME -n openshift-machine-api --replicas=3
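
A minimal sketch of the kind of yaml used in step 2 is below. It is not the attached file; the host name, MAC address, BMC address and credentials secret are hypothetical. The relevant part is rootDeviceHints.deviceName pointing at a device that does not exist on the host.

---
apiVersion: v1
kind: Secret
metadata:
  name: openshift-worker-0-2-bmc-secret
  namespace: openshift-machine-api
type: Opaque
data:
  username: YWRtaW4=        # base64 "admin" (hypothetical)
  password: cGFzc3dvcmQ=    # base64 "password" (hypothetical)
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: openshift-worker-0-2
  namespace: openshift-machine-api
spec:
  online: true
  bootMACAddress: 52:54:00:00:00:01          # hypothetical
  bmc:
    address: redfish://192.168.123.1:8000/redfish/v1/Systems/<system-id>
    credentialsName: openshift-worker-0-2-bmc-secret
  rootDeviceHints:
    deviceName: /dev/sdc                     # device that does not exist on the host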

Anything else we need to know?
In the metal3-ironic-conductor log there is a report of the failure:
Error finding the disk or partition device to deploy the image onto', 'details': "No suitable device was found for deployment using these hints {'name': 's== /dev/sdc'

Attaching metal3-ironic-conductor log

Comment 2 Bob Fournier 2020-10-08 19:06:15 UTC
Yes, it looks like Ironic logs the invalid hint:
2020-10-07 15:41:40.777 1 DEBUG ironic.drivers.modules.agent_client [req-c5bc2237-cbea-4bec-9d5d-1fed728fa86c - - - - -] Command image.install_bootloader has finished for node d096adf1-e78a-4f2a-8b59-d33324447ddc with result {'id': 'e7d05fcb-17ef-4b00-9613-48caece507e1', 'command_name': 'install_bootloader', 'command_params': {'root_uuid': None, 'efi_system_part_uuid': None, 'prep_boot_part_uuid': None, 'target_boot_mode': 'uefi'}, 'command_status': 'FAILED', 'command_error': {'type': 'DeviceNotFound', 'code': 404, 'message': 'Error finding the disk or partition device to deploy the image onto', 'details': "No suitable device was found for deployment using these hints {'name': 's== /dev/sdc'}"}, 'command_result': None} _wait_for_command /usr/lib/python3.6/site-packages/ironic/drivers/modules/agent_client.py:132

but it does deploy this node:
2020-10-07 15:41:48.360 1 INFO ironic.conductor.utils [req-c5bc2237-cbea-4bec-9d5d-1fed728fa86c - - - - -] Successfully set node d096adf1-e78a-4f2a-8b59-d33324447ddc power state to power on by power on.
2020-10-07 15:41:48.360 1 INFO ironic.conductor.deployments [req-c5bc2237-cbea-4bec-9d5d-1fed728fa86c - - - - -] Node d096adf1-e78a-4f2a-8b59-d33324447ddc finished deploy step {'step': 'boot_instance', 'priority': 20, 'argsinfo': None, 'interface': 'deploy'}
2020-10-07 15:41:48.388 1 INFO ironic.conductor.task_manager [req-c5bc2237-cbea-4bec-9d5d-1fed728fa86c - - - - -] Node d096adf1-e78a-4f2a-8b59-d33324447ddc moved to provision state "active" from state "deploying"; target provision state is "None"
2020-10-07 15:41:48.389 1 INFO ironic.conductor.deployments [req-c5bc2237-cbea-4bec-9d5d-1fed728fa86c - - - - -] Successfully deployed node d096adf1-e78a-4f2a-8b59-d33324447ddc with instance a9c484b9-6daa-4530-b276-5b0ba23a98d0.

Can we get the ramdisk logs for this node (d096adf1-e78a-4f2a-8b59-d33324447ddc) or another node that doesn't fail?  You can get the ramdisk logs in /shared/log/ironic/deploy/ in the ironic-conductor container.

Comment 3 Nick Stielau 2020-10-08 21:49:27 UTC
Setting to 4.7, I don't think this blocks 4.6.

Comment 4 Derek Higgins 2020-10-09 10:33:42 UTC
*** Bug 1886735 has been marked as a duplicate of this bug. ***

Comment 5 Bob Fournier 2020-10-09 10:39:51 UTC
Moving to 4.6.z.

Comment 6 Bob Fournier 2020-10-09 10:44:33 UTC
Lubov - you can disregard request for ramdisk log in Comment 2. Thanks.

Comment 9 Lubov 2020-11-22 16:28:47 UTC
While provisioning, an error is reported for the bmh:
openshift-worker-0-2   error   provisioning error       ocp-edge-cluster-0-276ht-worker-0-5n2rn   redfish://192.168.123.1:8000/redfish/v1/Systems/677473a6-6dad-4c8d-a5b5-d6576f2cd0a6   unknown            true     Image provisioning failed: Agent returned error for deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on node 95684b67-d5a1-4061-85d9-d55bc7dd5f1b : Error performing deploy_step write_image: Error finding the disk or partition device to deploy the image onto: No suitable device was found for deployment using these hints {'name': 's== /dev/sdc'}.

The BMH then enters deprovisioning status and loops indefinitely: ready -> provisioning -> provisioning error -> deprovisioning.

The machine stays in the Provisioned phase.

Opening a new bz. Cannot finish verifying this bz until the new one is solved.

Comment 10 Dmitry Tantsur 2020-11-24 17:12:19 UTC
> Cannot finish to verify this bz till the new is solved

Up to you, but I think if we do get an error, this can be called done (and continued in the new bug).

Comment 11 Lubov 2020-11-24 17:29:41 UTC
Agree, closing this one

Comment 14 errata-xmlrpc 2021-02-24 15:24:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

