Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2222981

Summary: Overcloud deploy fails when mounting config drive on 4k disks
Product: Red Hat OpenStack
Component: openstack-ironic-python-agent
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Status: CLOSED MIGRATED
Severity: medium
Priority: medium
Reporter: nalmond
Assignee: Julia Kreger <jkreger>
CC: jkreger, pweeks, sbaker
Keywords: Triaged
Flags: jkreger: needinfo-
Target Milestone: ---
Target Release: ---
Doc Type: No Doc Update
Last Closed: 2024-01-04 17:25:47 UTC
Type: Bug

Description nalmond 2023-07-14 17:34:36 UTC
Description of problem:
After a node with a 4k root disk is provisioned, the config drive partition cannot be mounted. This prevents cloud-init from creating the heat-admin user and copying the SSH keys.

Version-Release number of selected component (if applicable):
openstack-ironic-conductor:16.2.4-15

How reproducible:
Consistently with a 4k root disk; nodes with 512-byte sector size disks work fine.

Steps to Reproduce:
1. Attempt to scale out a new node with a 4k root disk

Actual results:
Unable to mount config drive, deploy fails.
In dmesg:
[   82.511907] isofs_fill_super: bread failed, dev=sda2, iso_blknum=17, block=-2147483648

cloud-init logs:
2023-07-11 05:42:14,601 - subp.py[DEBUG]: Running command ['mount', '-o', 'ro', '-t', 'auto', '/dev/sda2', '/run/cloud-init/tmp/tmpkf7q2n3s'] with allowed return codes [0] (shell=False, capture=True)
2023-07-11 05:42:14,621 - util.py[DEBUG]: Failed mount of '/dev/sda2' as 'auto': Unexpected error while running command.
Command: ['mount', '-o', 'ro', '-t', 'auto', '/dev/sda2', '/run/cloud-init/tmp/tmpkf7q2n3s']
Exit code: 32
Reason: -
Stdout:
Stderr: mount: /run/cloud-init/tmp/tmpkf7q2n3s: wrong fs type, bad option, bad superblock on /dev/sda2, missing codepage or helper program, or other error.
2023-07-11 05:42:14,621 - __init__.py[DEBUG]: Datasource DataSourceConfigDrive [net,ver=None][source=None] not updated for events: New instance first boot
2023-07-11 05:42:14,622 - handlers.py[DEBUG]: finish: init-local/search-ConfigDrive: SUCCESS: no local data found from DataSourceConfigDrive

Expected results:
config drive mounts, cloud-init runs, deploy succeeds

Additional info:
Manual mount fails as well.

The default filesystem for the config drive as configured by ironic is iso9660, which uses 2048-byte logical blocks and therefore cannot be mounted on disks whose logical sector size is larger than 2048 bytes (such as 4k-native disks).
Changing this to vfat in ironic.conf also fails, but with a different mount error in dmesg:

config_drive_format=vfat

[   83.346563] FAT-fs (sda2): logical sector size too small for device (logical sector size = 512)
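
Both failures track the device's logical sector size. A minimal sketch of a pre-flight compatibility check follows; the helper functions and the exact limits are assumptions drawn from the errors above, not existing ironic code:

```python
# Hypothetical pre-flight check (not part of ironic); the limits below are
# assumptions based on the two mount failures observed in this bug.

ISO9660_BLOCK_SIZE = 2048   # iso9660 logical block size
VFAT_SECTOR_SIZE = 512      # sector size the vfat config drive image is built with

def config_drive_mountable(fmt: str, device_sector_size: int) -> bool:
    """Return True if a config drive of the given format should be
    mountable on a device with this logical sector size."""
    if fmt == "iso9660":
        # The kernel cannot set a filesystem block size below the device's
        # logical sector size, so sectors above 2048 bytes fail to mount.
        return device_sector_size <= ISO9660_BLOCK_SIZE
    if fmt == "vfat":
        # The FAT logical sector size must be at least the device sector
        # size; an image built with 512-byte sectors fails on a 4k device.
        return VFAT_SECTOR_SIZE >= device_sector_size
    return False

def device_logical_sector_size(dev: str) -> int:
    """Read the logical sector size from sysfs (e.g. dev='sda')."""
    with open(f"/sys/block/{dev}/queue/logical_block_size") as f:
        return int(f.read())
```

With 4096-byte sectors this check rejects both formats, matching the dmesg output above.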

Comment 2 Julia Kreger 2023-07-17 16:21:52 UTC
There doesn't seem to be a single clear path forward.

There are a few distinct things going on here.

1) Obviously, a filesystem-to-underlying-block-device incompatibility. Realistically there is no "fix" for this; we can only work around it and prevent such a case later in the code path.
2) Changing the default type to vfat fails, because the configuration drive ends up being too small on a non-4k system and promptly explodes.

The inherent challenge is we support a few different ways of getting a configuration drive:

1) We get a pre-prepared binary payload from the client, be it Nova, Metalsmith, OpenStackSDK, or even python-ironicclient, and the contents are written out byte for byte as requested by the original requester.
2) We can be sent chunks of the data, and then assemble a fresh configuration drive payload to write to disk.

There is a third issue with this bug, though: Ironic doesn't expose a configuration parameter named ``config_drive_format``. Nova does[0].

Which leaves us in an odd place.

Thoughts on paths forward:

1) I do suspect we should clone this out to RHEL and see if they can make iso9660 friendlier to 4k devices, since there is already such a large installed base of writers of such volumes.
2) I also think we might need to look at transforming the payload, given we have so many *different* ways of getting payloads to support. Further team discussion and research is required.


[0]: https://opendev.org/openstack/nova/src/branch/master/nova/conf/configdrive.py#L18
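
Option 2 above (transforming the payload) could be sketched roughly as follows. This is illustrative only: none of these helpers exist in ironic, and the invocation details are simplified. The `-S` flag of mkfs.fat sets the FAT logical sector size, which must be at least the device's logical sector size.

```python
# Hypothetical sketch of choosing and rebuilding the config drive filesystem
# based on the target disk's logical sector size. Illustrative only; not
# existing ironic code.
import subprocess

def choose_format(device_sector_size: int) -> str:
    # iso9660 works up to 2048-byte logical sectors; above that, fall back
    # to vfat built with a matching logical sector size.
    return "iso9660" if device_sector_size <= 2048 else "vfat"

def mkfs_command(fmt: str, device_sector_size: int, path: str) -> list:
    if fmt == "vfat":
        # mkfs.fat -S sets the FAT logical sector size; it must match (or
        # exceed) the device's logical sector size to be mountable.
        return ["mkfs.fat", "-S", str(device_sector_size), path]
    return ["mkisofs", "-o", path]  # simplified; the real invocation differs

def rebuild_config_drive(device_sector_size: int, path: str) -> None:
    fmt = choose_format(device_sector_size)
    subprocess.run(mkfs_command(fmt, device_sector_size, path), check=True)
```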

Comment 3 Julia Kreger 2023-07-17 16:41:55 UTC
Adding an upstream bug.

Comment 5 Julia Kreger 2023-07-17 17:30:19 UTC
Could we please get the output of the following command from the customer's system:

sudo blockdev --report /path/to/device

Specifically, we need to make sure we understand which field differs, since it seems odd that this would only be presenting now, and in this way. If we can get it from an existing machine that deployed without issues, as well as from the machine they are attempting to deploy to, that would be helpful.

Thanks!

Comment 8 nalmond 2023-08-07 16:28:17 UTC
Here are the blockdev outputs:

working node:
[heat-admin@ctrl1 ~]$ sudo blockdev --report /dev/sda2
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw  8192   512  4096     411648         1048576   /dev/sda2

non-working (4k) node:
[root@gen16gpu0 ~]# sudo blockdev --report /dev/sda2
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw  8192  4096  4096     411648         1048576   /dev/sda2
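
The differing column is SSZ, the logical sector size: 512 bytes on the working node versus 4096 on the failing one (BSZ, the block size, is 4096 on both). A small sketch of pulling that field out of the `blockdev --report` output, assuming the header+data layout shown above:

```python
# Parse the SSZ (logical sector size) column from `blockdev --report <dev>`
# output. Assumes the two-line header + data layout shown above.

def logical_sector_size(report: str) -> int:
    header, data = (line.split() for line in report.strip().splitlines())
    return int(data[header.index("SSZ")])
```

Run against the two outputs above, this returns 512 for the working node and 4096 for the failing one.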

Comment 9 Julia Kreger 2023-08-24 17:00:49 UTC
Greetings, we've updated the upstream patch, which is pending CI and review upstream. If upstream agrees on the path forward, it will take a little time to get this into the product. Given we've not seen this exact behavior with in-kernel block device drivers yet, I suspect we may be paving over a kernel bug in the third-party driver. Regardless, initial upstream feedback was positive and in agreement, since it seems like a logical constraint of behavior in some applications.

Comment 10 Julia Kreger 2023-09-05 15:10:19 UTC
Greetings, we won't be able to fix this in OSP 17.x, much less OSP 16.x, as it involves changes to stable libraries which are essentially frozen in time at this point. We anticipate this fix will be available in OSP 18. As this issue was rooted in a third-party driver, we recommend you engage the hardware manufacturer regarding their supplied driver.