Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2222981

Summary: Overcloud deploy fails when mounting config drive on 4k disks
Product: Red Hat OpenStack
Component: openstack-ironic-python-agent
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Status: CLOSED MIGRATED
Severity: medium
Priority: medium
Reporter: nalmond
Assignee: Julia Kreger <jkreger>
CC: jkreger, pweeks, sbaker
Keywords: Triaged
Flags: jkreger: needinfo-
Target Milestone: ---
Target Release: ---
Doc Type: No Doc Update
Last Closed: 2024-01-04 17:25:47 UTC
Type: Bug

Description nalmond 2023-07-14 17:34:36 UTC
Description of problem:
After a node with a 4k root disk is provisioned, the config drive partition cannot be mounted. This prevents cloud-init from creating the heat-admin user and copying the SSH keys.

Version-Release number of selected component (if applicable):
openstack-ironic-conductor:16.2.4-15

How reproducible:
Consistently with a 4k root disk; nodes with 512-byte sector size disks work fine.

Steps to Reproduce:
1. Attempt to scale out a new node with a 4k root disk

Actual results:
Unable to mount config drive, deploy fails.
In dmesg:
[   82.511907] isofs_fill_super: bread failed, dev=sda2, iso_blknum=17, block=-2147483648

cloud-init logs:
2023-07-11 05:42:14,601 - subp.py[DEBUG]: Running command ['mount', '-o', 'ro', '-t', 'auto', '/dev/sda2', '/run/cloud-init/tmp/tmpkf7q2n3s'] with allowed return codes [0] (shell=False, capture=True)
2023-07-11 05:42:14,621 - util.py[DEBUG]: Failed mount of '/dev/sda2' as 'auto': Unexpected error while running command.
Command: ['mount', '-o', 'ro', '-t', 'auto', '/dev/sda2', '/run/cloud-init/tmp/tmpkf7q2n3s']
Exit code: 32
Reason: -
Stdout:
Stderr: mount: /run/cloud-init/tmp/tmpkf7q2n3s: wrong fs type, bad option, bad superblock on /dev/sda2, missing codepage or helper program, or other error.
2023-07-11 05:42:14,621 - __init__.py[DEBUG]: Datasource DataSourceConfigDrive [net,ver=None][source=None] not updated for events: New instance first boot
2023-07-11 05:42:14,622 - handlers.py[DEBUG]: finish: init-local/search-ConfigDrive: SUCCESS: no local data found from DataSourceConfigDrive

Expected results:
config drive mounts, cloud-init runs, deploy succeeds

Additional info:
Manual mount fails as well.

The default filesystem for the config drive as configured by ironic is iso9660, which uses 2048-byte logical blocks and therefore cannot be mounted on disks whose logical sector size is larger than 2048 bytes (such as 4k-native disks).
Changing this to vfat in ironic.conf also fails, but with a different mount error in dmesg:

config_drive_format=vfat

[   83.346563] FAT-fs (sda2): logical sector size too small for device (logical sector size = 512)
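
Both failures track the device's logical sector size. A minimal sketch of a pre-flight compatibility check follows; the helper functions and the exact limits are assumptions drawn from the errors above, not existing ironic code:

```python
# Hypothetical pre-flight check (not part of ironic); the limits below are
# assumptions based on the two mount failures observed in this bug.

ISO9660_BLOCK_SIZE = 2048   # iso9660 logical block size
VFAT_SECTOR_SIZE = 512      # sector size the vfat config drive image is built with

def config_drive_mountable(fmt: str, device_sector_size: int) -> bool:
    """Return True if a config drive of the given format should be
    mountable on a device with this logical sector size."""
    if fmt == "iso9660":
        # The kernel cannot set a filesystem block size below the device's
        # logical sector size, so sectors above 2048 bytes fail to mount.
        return device_sector_size <= ISO9660_BLOCK_SIZE
    if fmt == "vfat":
        # The FAT logical sector size must be at least the device sector
        # size; an image built with 512-byte sectors fails on a 4k device.
        return VFAT_SECTOR_SIZE >= device_sector_size
    return False

def device_logical_sector_size(dev: str) -> int:
    """Read the logical sector size from sysfs (e.g. dev='sda')."""
    with open(f"/sys/block/{dev}/queue/logical_block_size") as f:
        return int(f.read())
```

With 4096-byte sectors this check rejects both formats, matching the dmesg output above.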

Comment 2 Julia Kreger 2023-07-17 16:21:52 UTC
There doesn't seem to be a single clear path forward.

There are a few distinct things going on here.

1) Obviously, a filesystem-to-underlying-block-device incompatibility. Realistically there is no "fix" for this; we can only work around it and prevent such a case later in the code path.
2) Changing the default type to vfat fails, because the configuration drive ends up being too small on a non-4k system and promptly explodes.

The inherent challenge is we support a few different ways of getting a configuration drive:

1) We get a pre-prepared binary payload from the client, be it Nova, Metalsmith, OpenStackSDK, or even python-ironicclient, and the contents are written out byte for byte as requested by the original requester.
2) We can be sent chunks of the data, and then assemble a fresh configuration drive payload to write to disk.

There is a third issue with this bug, though: Ironic doesn't expose a configuration parameter named ``config_drive_format``. Nova does[0].

Which leaves us in an odd place.

Thoughts on paths forward:

1) I do suspect we should clone this out to RHEL and see if they can make iso9660 friendlier to 4k devices, since there is already such a large installed base of writers of such volumes.
2) I also think we might need to look at transforming the payload, given we have so many *different* ways of getting payloads to support. Further team discussion and research is required.


[0]: https://opendev.org/openstack/nova/src/branch/master/nova/conf/configdrive.py#L18
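
Option 2 above (transforming the payload) could be sketched roughly as follows. This is illustrative only: none of these helpers exist in ironic, and the invocation details are simplified. The `-S` flag of mkfs.fat sets the FAT logical sector size, which must be at least the device's logical sector size.

```python
# Hypothetical sketch of choosing and rebuilding the config drive filesystem
# based on the target disk's logical sector size. Illustrative only; not
# existing ironic code.
import subprocess

def choose_format(device_sector_size: int) -> str:
    # iso9660 works up to 2048-byte logical sectors; above that, fall back
    # to vfat built with a matching logical sector size.
    return "iso9660" if device_sector_size <= 2048 else "vfat"

def mkfs_command(fmt: str, device_sector_size: int, path: str) -> list:
    if fmt == "vfat":
        # mkfs.fat -S sets the FAT logical sector size; it must match (or
        # exceed) the device's logical sector size to be mountable.
        return ["mkfs.fat", "-S", str(device_sector_size), path]
    return ["mkisofs", "-o", path]  # simplified; the real invocation differs

def rebuild_config_drive(device_sector_size: int, path: str) -> None:
    fmt = choose_format(device_sector_size)
    subprocess.run(mkfs_command(fmt, device_sector_size, path), check=True)
```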

Comment 3 Julia Kreger 2023-07-17 16:41:55 UTC
Adding an upstream bug.

Comment 5 Julia Kreger 2023-07-17 17:30:19 UTC
Could we please get the output of the following command from the customer's system:

sudo blockdev --report /path/to/device

Specifically, we need to make sure we understand which field differs, since it seems odd that this would only be presenting now, and in this way. If we can get it from an existing machine that deployed without issues, as well as from the machine they are attempting to deploy to, that would be helpful.

Thanks!

Comment 8 nalmond 2023-08-07 16:28:17 UTC
Here are the blockdev outputs:

working node:
[heat-admin@ctrl1 ~]$ sudo blockdev --report /dev/sda2
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw  8192   512  4096     411648         1048576   /dev/sda2

non-working (4k) node:
[root@gen16gpu0 ~]# sudo blockdev --report /dev/sda2
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw  8192  4096  4096     411648         1048576   /dev/sda2
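
The differing column is SSZ, the logical sector size: 512 bytes on the working node versus 4096 on the failing one (BSZ, the block size, is 4096 on both). A small sketch of pulling that field out of the `blockdev --report` output, assuming the header+data layout shown above:

```python
# Parse the SSZ (logical sector size) column from `blockdev --report <dev>`
# output. Assumes the two-line header + data layout shown above.

def logical_sector_size(report: str) -> int:
    header, data = (line.split() for line in report.strip().splitlines())
    return int(data[header.index("SSZ")])
```

Run against the two outputs above, this returns 512 for the working node and 4096 for the failing one.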

Comment 9 Julia Kreger 2023-08-24 17:00:49 UTC
Greetings, we've updated the upstream patch, which is pending CI and review upstream. If upstream agrees on the path forward, it will take a little time to get this into the product. Given we've not seen this exact behavior with in-kernel block device drivers yet, I suspect we may be paving over a kernel bug in the third-party driver. Regardless, initial upstream feedback was positive and in agreement, since it seems like a logical constraint of behavior in some applications.

Comment 10 Julia Kreger 2023-09-05 15:10:19 UTC
Greetings, we won't be able to fix this in OSP 17.x, much less OSP 16.x, as it involves changes to stable libraries which are essentially frozen in time at this point. We anticipate this fix will be available in OSP 18. As this issue was rooted in a third-party driver, we recommend you engage the hardware manufacturer regarding their supplied driver.