Bug 2213443

Summary: [RHOSP 17.0] Deployment fails with "Error performing deploy_step write_image: Command execution failed: Failed to create config drive on disk /dev/dm-0"
Product: Red Hat OpenStack
Component: openstack-ironic-python-agent
Version: 17.0 (Wallaby)
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: high
Reporter: Luca Davidde <ldavidde>
Assignee: Julia Kreger <jkreger>
CC: bporwal, jkreger, sbaker
Keywords: Triaged
Target Milestone: ---
Target Release: ---
Flags: jkreger: needinfo-; sbaker: needinfo? (bporwal)
Hardware: Unspecified   
OS: Unspecified   
Type: Bug
Last Closed: 2023-08-07 19:50:12 UTC

Description Luca Davidde 2023-06-08 07:31:01 UTC
Description of problem:
Hi,
this customer is trying to deploy RHOSP 17.0 on diskless bare metal; the disks are presented from an FC 3PAR array. Introspection seems to go fine, but once they try to deploy the overcloud, it fails with:
---
2023-06-07 00:13:41,067 p=862197 u=stack n=ansible | 2023-06-07 00:13:41.066541 | 00215a9b-ab32-7762-1a16-00000000001a |      FATAL | Provision instances | localhost | error={"changed": false, "logging": "Created port controller-0-ctlplane (UUID 2697361b-8819-4987-96a0-d6f2c79229a3) for node overcloud-controller (UUID 555208e1-0912-4210-aef1-756956750fe8) with {'network_id': '3b5b70a8-9a90-4ef2-9469-9ba2af2ec2c8', 'name': 'controller-0-ctlplane'}\nCreated port storage-0-ctlplane (UUID b4a116b4-fbb0-4771-8593-40df04a752d3) for node overcloud-storage (UUID 97c58772-0973-4f65-837b-e30305c1a89d) with {'network_id': '3b5b70a8-9a90-4ef2-9469-9ba2af2ec2c8', 'name': 'storage-0-ctlplane'}\nCreated port compute-0-ctlplane (UUID d6449978-5254-49c3-8da6-5ba2d1cee3aa) for node overcloud-compute (UUID 973d3023-08c2-430a-8468-704b7c937af0) with {'network_id': '3b5b70a8-9a90-4ef2-9469-9ba2af2ec2c8', 'name': 'compute-0-ctlplane'}\nAttached port controller-0-ctlplane (UUID 2697361b-8819-4987-96a0-d6f2c79229a3) to node overcloud-controller (UUID 555208e1-0912-4210-aef1-756956750fe8)\nAttached port storage-0-ctlplane (UUID b4a116b4-fbb0-4771-8593-40df04a752d3) to node overcloud-storage (UUID 97c58772-0973-4f65-837b-e30305c1a89d)\nProvisioning started on node overcloud-controller (UUID 555208e1-0912-4210-aef1-756956750fe8)\nProvisioning started on node overcloud-storage (UUID 97c58772-0973-4f65-837b-e30305c1a89d)\nAttached port compute-0-ctlplane (UUID d6449978-5254-49c3-8da6-5ba2d1cee3aa) to node overcloud-compute (UUID 973d3023-08c2-430a-8468-704b7c937af0)\nProvisioning started on node overcloud-compute (UUID 973d3023-08c2-430a-8468-704b7c937af0)\n", "msg": "Node 97c58772-0973-4f65-837b-e30305c1a89d reached failure state \"deploy failed\"; the last error is Agent returned error for deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on node 97c58772-0973-4f65-837b-e30305c1a89d : Error performing deploy_step write_image: Command execution failed: Failed to create config drive on disk /dev/dm-0 for node 97c58772-0973-4f65-837b-e30305c1a89d. Error: Unexpected error while running command.\nCommand: test -e /dev/dm-0p4\nExit code: 1\nStdout: ''\nStderr: ''."}
---

It looks like the agent cannot find the partitions it expects on the boot device.
Boot mode is set to UEFI, so by default it should use GPT as the disk_label.
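
To double-check that expectation, the node capabilities can be inspected (and, if needed, GPT requested explicitly) from the undercloud; the node name below is just a placeholder:

---
openstack baremetal node show overcloud-storage -f value -c properties
# request GPT explicitly via the capabilities property if it is not already set
# (note: this overwrites any existing capabilities string)
openstack baremetal node set overcloud-storage \
    --property capabilities='boot_mode:uefi,disk_label:gpt'
---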

From the journal log in the ironic/deploy dir:

---
Jun 06 14:43:34 host-192-168-24-8 ironic-python-agent[1799]: 2023-06-06 14:43:34.150 1799 ERROR ironic_lib.disk_utils [-] Failed to create config drive on disk /dev/dm-0 for node 97c58772-0973-4f65-837b-e30305c1a89d. Error: Unexpected error while running command.
                                                             Command: test -e /dev/dm-0p4
                                                             Exit code: 1
                                                             Stdout: ''
                                                             Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Jun 06 14:43:35 host-192-168-24-8 ironic-python-agent[1799]: 2023-06-06 14:43:34.303 1799 ERROR root [-] Command failed: prepare_image, error: Failed to create config drive on disk /dev/dm-0 for node 97c58772-0973-4f65-837b-e30305c1a89d. Error: Unexpected error while running command.
                                                             Command: test -e /dev/dm-0p4
                                                             Exit code: 1
                                                             Stdout: ''
                                                             Stderr: '': ironic_lib.exception.InstanceDeployFailure: Failed to create config drive on disk /dev/dm-0 for node 97c58772-0973-4f65-837b-e30305c1a89d. Error: Unexpected error while running command.
                                                             Command: test -e /dev/dm-0p4
                                                             Exit code: 1
---

Multipath appears to be working correctly:

---
[ldavidde@supportshell-1 97c58772-0973-4f65-837b-e30305c1a89d_overcloud-storage_6c142113-6070-4ad6-96d7-a237259b187b_2023-06-06-18-43-24]$ cat multipath 
mpatha (360002ac00000000000003c970001e2a6) dm-0 3PARdata,VV
size=200G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 1:0:0:0 sda 8:0   active ready running
  |- 1:0:1:0 sdc 8:32  active ready running
  |- 2:0:0:0 sde 8:64  active ready running
  `- 2:0:1:0 sdg 8:96  active ready running
mpathb (360002ac000000000000063f70001e2a6) dm-1 3PARdata,VV
size=4.0T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 1:0:0:1 sdb 8:16  active ready running
  |- 1:0:1:1 sdd 8:48  active ready running
  |- 2:0:0:1 sdf 8:80  active ready running
  `- 2:0:1:1 sdh 8:112 active ready running
---

but I don't see partitions on the device:

---
[ldavidde@supportshell-1 97c58772-0973-4f65-837b-e30305c1a89d_overcloud-storage_6c142113-6070-4ad6-96d7-a237259b187b_2023-06-06-18-43-24]$ cat lsblk 
KNAME MODEL            SIZE ROTA TYPE  UUID                                   PARTUUID
sda   VV               200G    0 disk                                         
sdb   VV                 4T    0 disk                                         
sdc   VV               200G    0 disk                                         
sdd   VV                 4T    0 disk                                         
sde   VV               200G    0 disk                                         
sdf   VV                 4T    0 disk                                         
sdg   VV               200G    0 disk                                         
sdh   VV                 4T    0 disk                                         
sr0   Virtual DVD-ROM  9.1G    1 rom   2023-05-08-06-07-49-00                 
dm-0                   200G    0 mpath                                        
dm-0                   200G    0 mpath                                        
dm-0                   200G    0 mpath                                        
dm-0                   200G    0 mpath                                        
dm-1                     4T    0 mpath                                        
dm-1                     4T    0 mpath                                        
dm-1                     4T    0 mpath                                        
dm-1                     4T    0 mpath                                        
---
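
If it helps with further diagnosis, the partition table and partition device nodes on the multipath device could be inspected from the ramdisk with something like the following (assuming sgdisk and the usual multipath tooling are present in the IPA image):

---
sgdisk -p /dev/dm-0        # print the GPT partition table, if one exists
lsblk /dev/mapper/mpatha   # list partitions as the kernel sees them
ls -l /dev/mapper/         # device-mapper partition nodes appear here, e.g. mpatha1, mpatha2, ...
---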

Version-Release number of selected component (if applicable):


How reproducible:

Every time, on the customer's environment.
Steps to Reproduce:
1.
2.
3.

Actual results:
deployment fails
Expected results:
deployment succeeds

Additional info:
The following are attached:
sosreport from the director
full tarball of /var/log from the director
ansible.log of the deployment

Comment 1 Julia Kreger 2023-06-08 14:30:15 UTC
Okay, the underlying issue appears to be that we're missing https://review.opendev.org/c/openstack/ironic-lib/+/839949 in OSP 17.0, which makes sense timeline-wise since 17.0 had already been frozen when the patch was created.

The issue appears to be fixed in 17.1, at least based upon the change logs.

The underlying issue is that use of these cloud tools with backend SAN storage is very rare. The software was designed to operate on commodity hardware, not highly specialized storage hardware. From a redundancy standpoint, multipath I/O management causes everything to be presented as device-mapper devices instead of the older-style partition paths, and prior to https://review.opendev.org/c/openstack/ironic-lib/+/839949 we did not have a way to navigate that for device-mapper devices.
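
As a rough illustration of the mismatch (device and partition numbers taken from the log above), the agent's existence check assumes a partition naming scheme that device-mapper devices do not use; from the ramdisk, something like this shows where the partitions actually appear:

---
test -e /dev/dm-0p4    # the check that fails: this node never exists for dm devices
ls /dev/mapper/        # partitions of dm-0 show up here instead, typically as mpatha4 or mpathap4
multipath -ll mpatha   # confirms which dm-N device backs mpatha
---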

I'm recommending the following actions:

1) Clean the nodes, at least with metadata erasure (an example manual cleaning sequence is sketched below, after the image steps). Upon reviewing the logs, it is clear we've attempted to write to both LUNs presented by the SAN, which is confusing and can cause other issues depending on the desired end state of the node.
2) Use RHOSP 17.1 when it releases instead of 17.0. Another 17.0 release is not scheduled at this time, so this issue is not expected to be fixed in OSP 17.0.
3) As a path forward, the customer can create a 64 MB partition with a "config-2" label inside the whole-disk image being deployed. Using guestfish is likely the best way to do so. Something like the following would be required, although these are not precise steps; always make a copy of the disk image before editing it.

sgdisk -n 0:-64M:0 /dev/vda      # append a 64 MB partition at the end of the disk (device name depends on how the image is attached)
mkfs.ext3 -L config-2 /dev/vda4  # format it as ext3 with the config-2 label
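
As one possible concrete sequence (illustrative only; the image name, the resulting partition index, and the use of a loop device are assumptions to verify against the actual image), the same change can be made on a copy of the image via a loop device:

---
qemu-img convert -O raw overcloud-full.qcow2 overcloud-custom.raw   # work on a copy, converted to raw
loopdev=$(losetup --show -f -P overcloud-custom.raw)
sgdisk -n 0:-64M:0 "$loopdev"                # append a 64 MB partition at the end of the disk
partprobe "$loopdev"
lsblk "$loopdev"                             # confirm the number of the new partition
mkfs.ext3 -L config-2 "${loopdev}p4"         # assuming it shows up as partition 4
losetup -d "$loopdev"
qemu-img convert -O qcow2 overcloud-custom.raw overcloud-custom.qcow2
---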

With a "config-2"-labelled partition already inside the disk image, Ironic will use it *instead* of attempting to create a new partition. This should completely bypass the failing logic in 17.0, allowing the customer to deploy until they can move to OSP 17.1 and resume using a stock image.
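
Regarding point 1, a manual cleaning cycle with metadata erasure could look roughly like this (node names are placeholders, and the node must be moved to the manageable state for manual cleaning):

---
openstack baremetal node manage overcloud-storage
openstack baremetal node clean overcloud-storage \
    --clean-steps '[{"interface": "deploy", "step": "erase_devices_metadata"}]'
openstack baremetal node provide overcloud-storage
---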

Comment 2 Luca Davidde 2023-06-08 15:13:29 UTC
Many thanks for the response, Julia.
I shared your analysis with the customer.



Luca

Comment 13 Steve Baker 2023-07-17 20:20:12 UTC
To help debug growvols, could you provide the output for the following:
lsblk -Po kname,pkname,name,label,type,fstype,mountpoint

However, the fact that the script failed when "vgs --noheadings --options vg_name" returned zero volume groups is very concerning. Could you also paste the full output of "vgs"?
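
For convenience, something like the following on the affected node would capture everything requested (output file names are arbitrary):

---
lsblk -Po kname,pkname,name,label,type,fstype,mountpoint > lsblk_growvols.txt
vgs --noheadings --options vg_name > vg_names.txt
vgs > vgs_full.txt
---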