Description of problem:

Hi, this customer is trying to deploy RHOSP 17.0 on diskless baremetal. The disks are on a FC 3PAR array. Introspection seems to go fine, but once he tries to deploy the overcloud, it throws:

---
2023-06-07 00:13:41,067 p=862197 u=stack n=ansible | 2023-06-07 00:13:41.066541 | 00215a9b-ab32-7762-1a16-00000000001a | FATAL | Provision instances | localhost | error={"changed": false, "logging": "Created port controller-0-ctlplane (UUID 2697361b-8819-4987-96a0-d6f2c79229a3) for node overcloud-controller (UUID 555208e1-0912-4210-aef1-756956750fe8) with {'network_id': '3b5b70a8-9a90-4ef2-9469-9ba2af2ec2c8', 'name': 'controller-0-ctlplane'}\nCreated port storage-0-ctlplane (UUID b4a116b4-fbb0-4771-8593-40df04a752d3) for node overcloud-storage (UUID 97c58772-0973-4f65-837b-e30305c1a89d) with {'network_id': '3b5b70a8-9a90-4ef2-9469-9ba2af2ec2c8', 'name': 'storage-0-ctlplane'}\nCreated port compute-0-ctlplane (UUID d6449978-5254-49c3-8da6-5ba2d1cee3aa) for node overcloud-compute (UUID 973d3023-08c2-430a-8468-704b7c937af0) with {'network_id': '3b5b70a8-9a90-4ef2-9469-9ba2af2ec2c8', 'name': 'compute-0-ctlplane'}\nAttached port controller-0-ctlplane (UUID 2697361b-8819-4987-96a0-d6f2c79229a3) to node overcloud-controller (UUID 555208e1-0912-4210-aef1-756956750fe8)\nAttached port storage-0-ctlplane (UUID b4a116b4-fbb0-4771-8593-40df04a752d3) to node overcloud-storage (UUID 97c58772-0973-4f65-837b-e30305c1a89d)\nProvisioning started on node overcloud-controller (UUID 555208e1-0912-4210-aef1-756956750fe8)\nProvisioning started on node overcloud-storage (UUID 97c58772-0973-4f65-837b-e30305c1a89d)\nAttached port compute-0-ctlplane (UUID d6449978-5254-49c3-8da6-5ba2d1cee3aa) to node overcloud-compute (UUID 973d3023-08c2-430a-8468-704b7c937af0)\nProvisioning started on node overcloud-compute (UUID 973d3023-08c2-430a-8468-704b7c937af0)\n", "msg": "Node 97c58772-0973-4f65-837b-e30305c1a89d reached failure state \"deploy failed\"; the last error is Agent returned error for deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on node 97c58772-0973-4f65-837b-e30305c1a89d : Error performing deploy_step write_image: Command execution failed: Failed to create config drive on disk /dev/dm-0 for node 97c58772-0973-4f65-837b-e30305c1a89d. Error: Unexpected error while running command.\nCommand: test -e /dev/dm-0p4\nExit code: 1\nStdout: ''\nStderr: ''."}
---

It looks like it cannot retrieve partitions from the boot device. Boot mode is set to UEFI, so by default it should use GPT as disk_label.

From the journal log in the ironic/deploy dir:

---
Jun 06 14:43:34 host-192-168-24-8 ironic-python-agent[1799]: 2023-06-06 14:43:34.150 1799 ERROR ironic_lib.disk_utils [-] Failed to create config drive on disk /dev/dm-0 for node 97c58772-0973-4f65-837b-e30305c1a89d. Error: Unexpected error while running command. Command: test -e /dev/dm-0p4 Exit code: 1 Stdout: '' Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Jun 06 14:43:35 host-192-168-24-8 ironic-python-agent[1799]: 2023-06-06 14:43:34.303 1799 ERROR root [-] Command failed: prepare_image, error: Failed to create config drive on disk /dev/dm-0 for node 97c58772-0973-4f65-837b-e30305c1a89d. Error: Unexpected error while running command. Command: test -e /dev/dm-0p4 Exit code: 1 Stdout: '' Stderr: '': ironic_lib.exception.InstanceDeployFailure: Failed to create config drive on disk /dev/dm-0 for node 97c58772-0973-4f65-837b-e30305c1a89d. Error: Unexpected error while running command. Command: test -e /dev/dm-0p4 Exit code: 1
---

Multipath looks to be working fine:

---
[ldavidde@supportshell-1 97c58772-0973-4f65-837b-e30305c1a89d_overcloud-storage_6c142113-6070-4ad6-96d7-a237259b187b_2023-06-06-18-43-24]$ cat multipath
mpatha (360002ac00000000000003c970001e2a6) dm-0 3PARdata,VV
size=200G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 1:0:0:0 sda 8:0   active ready running
  |- 1:0:1:0 sdc 8:32  active ready running
  |- 2:0:0:0 sde 8:64  active ready running
  `- 2:0:1:0 sdg 8:96  active ready running
mpathb (360002ac000000000000063f70001e2a6) dm-1 3PARdata,VV
size=4.0T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 1:0:0:1 sdb 8:16  active ready running
  |- 1:0:1:1 sdd 8:48  active ready running
  |- 2:0:0:1 sdf 8:80  active ready running
  `- 2:0:1:1 sdh 8:112 active ready running
---

but I don't see partitions on the device:

---
[ldavidde@supportshell-1 97c58772-0973-4f65-837b-e30305c1a89d_overcloud-storage_6c142113-6070-4ad6-96d7-a237259b187b_2023-06-06-18-43-24]$ cat lsblk
KNAME MODEL           SIZE ROTA TYPE  UUID                   PARTUUID
sda   VV              200G    0 disk
sdb   VV                4T    0 disk
sdc   VV              200G    0 disk
sdd   VV                4T    0 disk
sde   VV              200G    0 disk
sdf   VV                4T    0 disk
sdg   VV              200G    0 disk
sdh   VV                4T    0 disk
sr0   Virtual DVD-ROM 9.1G    1 rom   2023-05-08-06-07-49-00
dm-0                  200G    0 mpath
dm-0                  200G    0 mpath
dm-0                  200G    0 mpath
dm-0                  200G    0 mpath
dm-1                    4T    0 mpath
dm-1                    4T    0 mpath
dm-1                    4T    0 mpath
dm-1                    4T    0 mpath
---

Version-Release number of selected component (if applicable):


How reproducible:
every time, on the customer environment

Steps to Reproduce:
1.
2.
3.

Actual results:
deployment fails

Expected results:
deployment succeeds

Additional info:
In the sosreport are present:
sosreport from the director
full tarball of /var/log from the director
ansible.log of the deployment
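In case it helps, something like the following could be run from the IPA ramdisk console on the failing node to double-check whether any partition table is visible on the multipath device. This is only a rough sketch; the device name is taken from the lsblk output above.

---
# Print whatever partition table sgdisk can read from the multipath device.
sgdisk -p /dev/dm-0

# List the partition mappings kpartx would create for it. On a multipath
# device the partition nodes typically show up under /dev/mapper/, not as
# /dev/dm-0pN, which is the path the failing "test -e" command checked.
kpartx -l /dev/dm-0
ls -l /dev/mapper/
---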
Okay, the underlying issue appears to be that we're missing https://review.opendev.org/c/openstack/ironic-lib/+/839949 in OSP 17.0, which makes sense timeline-wise since 17.0 had already been frozen when the patch was created. The issue appears to be fixed in 17.1, at least based upon the change logs.

The base underlying issue is that usage of cloud tools with backend SAN storage is very rare. The software was designed to operate on commodity hardware, not highly specialized storage hardware. From a redundancy standpoint, multipath IO management causes everything to be presented as device mapper devices instead of the older-style partition paths, and prior to https://review.opendev.org/c/openstack/ironic-lib/+/839949 we didn't have a way to navigate that for device mapper devices.

I'm recommending the following actions:

1) Clean the nodes, at least with metadata erasure. Upon reviewing the logs, it is clear we've attempted to write to both LUNs being presented by the SAN, which is... confusing, and can cause other issues depending on the desired end state of the node.

2) Use RHOSP 17.1 when it releases instead of 17.0. Another 17.0 release is not scheduled at this time, so this issue is not expected to be fixed in OSP 17.0.

3) As a path forward, the customer can create a 64 MB partition with a "config-2" label inside the whole-disk image being deployed. Using guestfish is likely the best path to do so. Something like the following would be required, although these are not precise steps, and always make a copy of the disk image before editing it (see the sketch after this list):

sgdisk -n 0:-64M:0 /dev/vda
mkfs.ext3 -L config-2 /dev/vda4

With a partition carrying the "config-2" label already inside the disk image, it will be chosen by Ironic *instead of* Ironic attempting to create a new partition. This should completely bypass the failing logic in 17.0, allowing the customer to deploy until they can move to OSP 17.1 and resume using a stock image.
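To make item 3 a bit more concrete, below is a minimal sketch of one way to run those exact sgdisk/mkfs steps against a copy of the image on the undercloud, using a loop device rather than guestfish. The image file name, the need to grow the image, and the resulting partition number are all assumptions that need to be verified against the actual image in use.

---
# Rough sketch only -- always work on a copy of the image.
qemu-img convert -O raw overcloud-full.qcow2 overcloud-config2.raw

# If the GPT already fills the virtual disk, grow the raw copy first so there
# is room to append the new partition:
#   qemu-img resize -f raw overcloud-config2.raw +64M

# Map the raw copy through a loop device with partition scanning enabled.
LOOP=$(sudo losetup --show --find --partscan overcloud-config2.raw)

# Move the backup GPT to the (possibly new) end of disk, then append a
# ~64 MB partition and give it the "config-2" label.
sudo sgdisk -e "$LOOP"
sudo sgdisk -n 0:-64M:0 "$LOOP"
sudo partprobe "$LOOP"
sudo sgdisk -p "$LOOP"                     # confirm the new partition number
sudo mkfs.ext3 -L config-2 "${LOOP}p4"     # p4 is an assumption; adjust to match

# Detach the loop device and convert back to qcow2 for deployment.
sudo losetup -d "$LOOP"
qemu-img convert -O qcow2 overcloud-config2.raw overcloud-config2.qcow2
---

guestfish could accomplish the same edit without loop devices; the loop-device form is shown here only because it reuses the sgdisk/mkfs commands quoted above essentially unchanged.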
Many thanks for the response, Julia. I shared your analysis with the customer.

Luca
To help debug growvols, could you provide the output of the following:

lsblk -Po kname,pkname,name,label,type,fstype,mountpoint

However, the fact that the script failed when the command "vgs --noheadings --options vg_name" returned zero volume groups is very concerning. Could you also paste the full output of "vgs"?
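For convenience, these are the exact commands to run on the affected node and attach here; the output file names are just suggestions.

---
sudo lsblk -Po kname,pkname,name,label,type,fstype,mountpoint > lsblk-pairs.out
sudo vgs > vgs.out 2>&1
sudo vgs --noheadings --options vg_name > vgs-names.out 2>&1
---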