Bug 2252076

Summary:          [RHOSP 17.1] Couldn't deploy 17.1 on SAN-backed baremetal
Product:          Red Hat OpenStack
Component:        diskimage-builder
Version:          17.1 (Wallaby)
Status:           CLOSED ERRATA
Severity:         high
Priority:         high
Reporter:         Luca Davidde <ldavidde>
Assignee:         Steve Baker <sbaker>
CC:               apevec, fpiccion, hjensas, jjoyce, jlabarre, mariel, pgrist, pweeks, sbaker
Keywords:         Triaged
Target Milestone: z3
Target Release:   17.1
Hardware:         Unspecified
OS:               Unspecified
Fixed In Version: diskimage-builder-3.31.1-17.1.20231013080820.0576fad.el9ost
Doc Type:         If docs needed, set a value
Type:             Bug
Last Closed:      2024-05-22 20:39:49 UTC

Description Luca Davidde 2023-11-29 10:40:15 UTC
Description of problem:
Hi,
this customer is trying to deploy RHOSP 17.1 on diskless (SAN-booted) baremetal nodes; the only disk is a 1 TB disk on a 3PAR storage array.
Introspection went fine, but after provisioning the nodes did not boot.
In a remote session with the customer we booted a RHEL 9.2 ISO and had a look at the installed OS. The SAN device carried the expected LVM logical volumes, but multipath -ll showed no devices because /etc/multipath.conf was missing. After we added a standard multipath.conf the devices started to be shown, so we regenerated the initramfs with multipath support and rebooted the node, which came back up correctly.
We then copied the patched initramfs to the director, added it together with multipath.conf to the overcloud image, and re-uploaded the image to OpenStack.
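A rough sketch of the kind of commands involved (the image name is the usual 17.1 default, and the exact mpathconf/dracut/virt-customize invocations are assumptions, not a verbatim record of what was run):

~~~
# On the rescued node: enable multipath with a default config and rebuild the initramfs
mpathconf --enable --with_multipathd y
dracut --force --add multipath /boot/initramfs-$(uname -r).img $(uname -r)

# On the director: inject multipath.conf and the patched initramfs into the
# overcloud image, then re-upload it
virt-customize -a overcloud-hardened-uefi-full.qcow2 \
    --upload multipath.conf:/etc/multipath.conf \
    --upload initramfs-<kver>.img:/boot/initramfs-<kver>.img
openstack overcloud image upload --update-existing
~~~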

The customer then unprovisioned the nodes and tried again:

openstack overcloud node provision --stack overcloud --network-config /home/stack/templates/baremetal_deployment.yaml --output deployed_baremetal_deployment.yaml

This time the nodes do boot, but provisioning still fails with an error in the /usr/local/sbin/growvols step:

~~~
2023-11-27 09:05:05,291 p=1011676 u=stack n=ansible | 2023-11-27 09:05:05.291000 | 00215a9b-df09-6b1b-6b8e-000000000013 |      FATAL | Running /usr/local/sbin/growvols /=500GB /tmp=20GB /var/log=30GB /var/log/audit=5GB /home=400GB /var=100% | osp-ctrl03 | error={"changed": true, "cmd": "/usr/local/sbin/growvols --yes /=500GB /tmp=20GB /var/log=30GB /var/log/audit=5GB /home=400GB /var=100%", "delta": "0:00:00.097199", "end": "2023-11-27 02:06:20.704117", "msg": "non-zero return code", "rc": 1, "start": "2023-11-27 02:06:20.606918", "stderr": "Traceback (most recent call last):\n  File \"/usr/local/sbin/growvols\", line 624, in <module>\n    sys.exit(main(sys.argv))\n  File \"/usr/local/sbin/growvols\", line 524, in main\n    devname = find_next_device_name(devices, disk_name, partnum)\n  File \"/usr/local/sbin/growvols\", line 338, in find_next_device_name\n    raise Exception('Could not find partition naming scheme for %s'\nException: Could not find partition naming scheme for sda", "stderr_lines": ["Traceback (most recent call last):", "  File \"/usr/local/sbin/growvols\", line 624, in <module>", "    sys.exit(main(sys.argv))", "  File \"/usr/local/sbin/growvols\", line 524, in main", "    devname = find_next_device_name(devices, disk_name, partnum)", "  File \"/usr/local/sbin/growvols\", line 338, in find_next_device_name", "    raise Exception('Could not find partition naming scheme for %s'", "Exception: Could not find partition naming scheme for sda"], "stdout": "", "stdout_lines": []}
~~~
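For context, growvols builds its view of the disks from lsblk output; a quick way to see roughly what it sees on a failing node is below (the column list is illustrative, not necessarily the exact set the script queries):

~~~
lsblk -Po NAME,KNAME,PKNAME,TYPE,FSTYPE,MOUNTPOINT,SIZE
# On a SAN/multipath node the LVM volume group sits on the dm/mpath device
# (e.g. mpatha) rather than directly on sdX, which is presumably why the
# sdX-based partition-name lookup in find_next_device_name() fails.
~~~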

Version-Release number of selected component (if applicable):


How reproducible:
In customer environment

Steps to Reproduce:
1.
2.
3.

Actual results:
Deployment fails.

Expected results:
Deployment succeeds.

Additional info:

Comment 8 Steve Baker 2023-12-04 20:44:51 UTC
Moving to diskimage-builder, where the growvols script is maintained

Comment 13 Steve Baker 2023-12-07 20:14:19 UTC
Actually, let's stick with testing changes via patch files for now.

I think growvols should be run manually while logged into the machine while we're debugging SAN support. This patch should get past the current error, but there may be other issues.

Once growvols is patched please run the following and attach the output:

/usr/local/sbin/growvols --debug --device mpatha /=500GB /tmp=20GB /var/log=30GB /var/log/audit=5GB /home=400GB /var=100%

Once we get a successful run then I can provide instructions to modify the overcloud-hardened-uefi-full.qcow2 image to include this fix.
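For anyone reproducing this, applying the proposed change on a node with a plain patch file and re-running by hand looks roughly like the following (the patch filename is hypothetical):

~~~
# Apply the proposed patch to the installed script (keeps a backup), then re-run
sudo patch -b /usr/local/sbin/growvols < growvols-multipath.patch
sudo /usr/local/sbin/growvols --debug --device mpatha \
    /=500GB /tmp=20GB /var/log=30GB /var/log/audit=5GB /home=400GB /var=100%
~~~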

Comment 53 James E. LaBarre 2024-04-11 14:43:32 UTC
Looking at the growvols file in diskimage-builder-3.31.1-17.1.20231013080820.0576fad.el9ost.noarch, there are duplicated lines between L255->290, L338->413, and L593->657.

Will this break the way the tool works?

@hjensas

Comment 54 James E. LaBarre 2024-04-11 21:12:29 UTC
Re-examining the RPM package (extracted with rpm2cpio | cpio rather than through Archive Manager), growvols and test_growvols.py compare correctly against the Gerrit versions.
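Roughly what that comparison looks like (the in-RPM path for growvols varies, hence the find; the Gerrit-side path is wherever the review files were downloaded):

~~~
mkdir dib-rpm && cd dib-rpm
rpm2cpio ../diskimage-builder-3.31.1-17.1.20231013080820.0576fad.el9ost.noarch.rpm | cpio -idmv
# growvols ships inside the element's static tree, so locate it rather than guessing the path
find . -name growvols -o -name test_growvols.py
diff -u "$(find . -name growvols)" /path/to/gerrit/growvols
~~~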

Marking as validated against diskimage-builder-3.31.1-17.1.20231013080820.0576fad.el9ost.noarch

Comment 62 errata-xmlrpc 2024-05-22 20:39:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 17.1.3 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:2741

Comment 63 Red Hat Bugzilla 2024-09-20 04:25:15 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days