Bug 1430435

Summary: overcloud image support for 4K native disks
Product: Red Hat OpenStack
Reporter: Matt Flusche <mflusche>
Component: openstack-tripleo-common
Assignee: Ryan Brady <rbrady>
Status: CLOSED ERRATA
QA Contact: Alexander Chuzhoy <sasha>
Severity: high
Priority: high
Version: 10.0 (Newton)
CC: bfournie, bschmaus, dchinner, derekh, dtantsur, esandeen, jmelvin, lmartins, mburns, mlammon, pgsousa, racedoro, rhel-osp-director-maint, slinaber, smykhail, srevivo
Target Milestone: beta
Keywords: Triaged
Target Release: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Fixed In Version: openstack-tripleo-common-8.6.1-0.20180410165747.4d8ca16.el7ost
Last Closed: 2018-06-27 13:29:18 UTC
Type: Bug
Bug Blocks: 1473267

Description Matt Flusche 2017-03-08 15:18:49 UTC
Description of problem:
Overcloud deployment fails for hardware using 4K native sectors.

Ironic deployment failure:

   XFS (sda2): device supports 4096 byte sectors (not 512)

Version-Release number of selected component (if applicable):
rhosp-director-images-10.0-20170201.1.el7ost.noarch

How reproducible:


Steps to Reproduce:
1. Deploy overcloud to hardware using 4K native disks

Actual results:
XFS (sda2): device supports 4096 byte sectors (not 512)

Expected results:
Overcloud deployment succeeds on 4K native disks.

Additional info:
Looking to get more details on the state of support for 4K native disks and OSP.

Is using 512e (emulated) sectors the only option for deployment?

Comment 2 Matt Flusche 2017-03-08 15:24:12 UTC
Created attachment 1261320 [details]
ironic deployment error with 4K native disks

Comment 3 Matt Flusche 2017-03-08 15:35:04 UTC
Reference, 4K native sector support for RHEL:  https://access.redhat.com/solutions/56494

Comment 5 Lucas Alvares Gomes 2017-03-16 14:30:20 UTC
Hi,

Sorry for the delay on this. I have never seen or tried deploying a node with 4K native disks.

From the screenshot it seems that the problem occurs while we are installing the bootloader, specifically when the deploy ramdisk tries to mount the image partition (Ironic uses the GRUB contained in the image being deployed).

So I would suggest, just for testing purposes, disabling local booting to see if it works, so we can narrow down the problem. To disable it, we need to update the node's "boot_option" capability. If you run "ironic node-show <name>" and look at the properties/capabilities field you will see "boot_option:local"; we need to change that to "boot_option:netboot", e.g.:

$ ironic node-update <node uuid/name> add properties/capabilities="...,boot_option:netboot"

The flavor in nova needs to be modified to match that behavior as well.

That would disable the boot loader installation. With "netboot" the nodes will always PXE boot (and chainload to the local disk after they have been deployed).
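Since Ironic stores capabilities as a single comma-separated string of key:value pairs, the edit described above can be sketched as a small helper (the helper and the sample string below are illustrative, not Ironic code):

```python
def set_capability(capabilities, key, value):
    """Return a capabilities string with `key` set to `value`."""
    caps = dict(item.split(":", 1) for item in capabilities.split(",") if item)
    caps[key] = value
    return ",".join("%s:%s" % kv for kv in sorted(caps.items()))

# Flip a node's boot_option from local to netboot:
print(set_capability("boot_mode:bios,boot_option:local",
                     "boot_option", "netboot"))
# -> boot_mode:bios,boot_option:netboot
```

The resulting string is what gets passed to `ironic node-update <node> add properties/capabilities=...`, and the nova flavor needs a matching boot_option property so the scheduler picks compatible nodes.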

...

@Matt, the link you posted is also interesting [0]. It mentions that "traditional BIOS cannot boot from 4k Native devices". This is something we could try as well: booting the nodes in UEFI mode.

Ironic does support UEFI deployments; similar to the "boot_option" capability described above, we need to set a "boot_mode:uefi" capability on the nodes and flavor. Here's the upstream documentation about it:

https://docs.openstack.org/project-install-guide/baremetal/draft/setup-drivers.html

[0] https://access.redhat.com/solutions/56494

Hope that helps,
Lucas

Comment 6 Matt Flusche 2017-03-16 19:34:48 UTC
Hi Lucas,

Thanks for getting back to me.

I agree that the error is occurring during the procedure to install the boot loader, but the issue seems to be that /dev/sda2 cannot even be mounted, due to the mismatch between the sector size of the image and that of the physical disk.  I don't believe GRUB has even tried to install a boot loader yet, as it needs to mount and chroot into the root file system first.  It would seem the same error would occur during boot regardless of where the boot loader is; do you agree?

Thanks,

Matt

Comment 17 Eric Sandeen 2017-06-20 21:51:51 UTC
As a workaround, it should be relatively easy to script a conversion by extracting the raw fs image from the qcow2 image, creating a new xfs filesystem of similar size and geometry but with 4k sectors, loop mounting it, copying files to it, unmounting it, and recreating a qcow image from that new xfs filesystem.  I'm not sure how feasible that is in actual deployments...
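A rough sketch of that conversion as a shell function (untested; all paths and mount points are placeholders, and mounting a 4k-sector filesystem via a loop device needs util-linux >= 2.30 for `losetup --sector-size`):

```shell
# Sketch of the qcow2 -> 4k-sector-XFS -> qcow2 conversion described above.
# Requires qemu-img, xfsprogs and root; paths are illustrative.
convert_to_4k_sectors() {
    src=$1; dst=$2
    qemu-img convert -O raw "$src" old.raw          # extract the raw fs image
    truncate -s "$(stat -c %s old.raw)" new.raw     # same-sized target image
    mkfs.xfs -s size=4096 new.raw                   # new XFS with 4k sectors
    old_dev=$(losetup --find --show old.raw)
    new_dev=$(losetup --find --show --sector-size 4096 new.raw)
    mkdir -p /mnt/old /mnt/new
    mount -o ro "$old_dev" /mnt/old
    mount "$new_dev" /mnt/new
    cp -a /mnt/old/. /mnt/new/                      # copy the tree across
    umount /mnt/old /mnt/new
    losetup -d "$old_dev" "$new_dev"
    qemu-img convert -O qcow2 new.raw "$dst"        # repack as qcow2
}
```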

Comment 23 Dmitry Tantsur 2017-11-17 17:45:50 UTC
Just got another case of the same problem.

Comment 27 Dmitry Tantsur 2017-11-20 16:27:17 UTC
Derek was kind enough to agree to take a look at this. We may also get hardware reasonably soon, so stay tuned.

Any help with triaging is still appreciated.

Comment 30 Dmitry Tantsur 2017-11-20 18:18:23 UTC
I was told that the problem can be worked around using whole disk images as described in https://teknoarticles.blogspot.de/2016/12/start-using-whole-disk-images-with.html. Just make sure to change the sector size to 4k in the script you end up with.

Comment 31 Derek Higgins 2017-11-21 14:53:42 UTC
I've reproduced this in virt by adding a block size to the disk exposed to the guest
<blockio logical_block_size='4096' physical_block_size='4096'/>

(I also tried using scsi_debug on the host and exposing this to the guest but qemu presented it with a 512 sector size)
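For reference, the `blockio` element sits inside a disk definition in the libvirt domain XML; a minimal sketch of the relevant fragment (the source path and target values are illustrative):

```xml
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/overcloud-node.qcow2'/>
  <target dev='sda' bus='scsi'/>
  <blockio logical_block_size='4096' physical_block_size='4096'/>
</disk>
```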

ironic-python-agent[629]: 2017-11-21 00:21:14.925 629 DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): mount /dev/sda3 /tmp/tmpdmWBzG execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:367
kernel: SGI XFS with ACLs, security attributes, no debug enabled
kernel: XFS (sda3): device supports 4096 byte sectors (not 512)
ironic-python-agent[629]: 2017-11-21 00:21:14.996 629 DEBUG oslo_concurrency.processutils [-] CMD "mount /dev/sda3 /tmp/tmpdmWBzG" returned: 32 in 0.071s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:404
ironic-python-agent[629]: 2017-11-21 00:21:14.997 629 DEBUG oslo_concurrency.processutils [-] u'mount /dev/sda3 /tmp/tmpdmWBzG' failed. Not Retrying. execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:452
ironic-python-agent[629]: 2017-11-21 00:21:14.997 629 ERROR ironic_python_agent.extensions.image [-] Installing GRUB2 boot loader to device /dev/sda failed with Unexpected error while running command.
ironic-python-agent[629]: Command: mount /dev/sda3 /tmp/tmpdmWBzG
ironic-python-agent[629]: Exit code: 32
ironic-python-agent[629]: Stdout: u''
ironic-python-agent[629]: Stderr: u'mount: mount /dev/sda3 on /tmp/tmpdmWBzG failed: Function not implemented\n'.: ProcessExecutionError: Unexpected error while running command.



I then built an overcloud-full image with "-s size=4096" passed into mkfs (mentioned in comment 16)

The resulting image appears to solve the problem, and can still be used with disks that have a 512-byte sector size.

======== Image used with 4096 sector sizes =========
[root@t2 centos]# echo p | fdisk /dev/sda
Disk /dev/sda: 85.9 GB, 85899345920 bytes, 20971520 sectors
Units = sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: gpt
Disk identifier: C50C47EF-30DA-4BBF-9140-010A5B76E1F7
#         Start          End    Size  Type            Name
 1          256        51455    200M  EFI System      primary
 2        51456        51711      1M  Microsoft basic primary
 3        51712     20971514   79.8G  Microsoft basic primary
[root@t2 centos]# xfs_info /
meta-data=/dev/sda3              isize=512    agcount=69, agsize=305284 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=20919803, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


======== The same image used with 512 sector sizes =========
[root@t3 centos]# echo p | fdisk /dev/sda
Disk /dev/sda: 85.9 GB, 85899345920 bytes, 167772160 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: gpt
Disk identifier: 480F61BE-53C5-414E-AAF6-7ADBFEE90C57
#         Start          End    Size  Type            Name
 1         2048       411647    200M  EFI System      primary
 2       411648       413695      1M  Microsoft basic primary
 3       413696    167772126   79.8G  Microsoft basic primary
[root@t3 centos]# xfs_info /
meta-data=/dev/sda3              isize=512    agcount=69, agsize=305284 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=20919803, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
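
The two fdisk listings above describe the same on-disk byte layout, just expressed in different sector units; a quick check using the partition start sectors from each listing:

```python
# Partition start sectors from the 4096-byte and 512-byte fdisk listings above.
starts_4k  = [256, 51456, 51712]
starts_512 = [2048, 411648, 413696]

byte_offsets = [s * 4096 for s in starts_4k]
assert byte_offsets == [s * 512 for s in starts_512]  # identical byte layout
print(byte_offsets)  # [1048576, 210763776, 211812352]
```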

Once we assess that there aren't any significant performance problems etc. introduced by using a 4k XFS filesystem on a traditional 512b disk, I'd suggest we just switch the default on the overcloud-full image.

Comment 32 Dmitry Tantsur 2017-11-21 15:03:01 UTC
Derek, could you please provide an example command you used to build an image? Such a workaround could help a lot.

Comment 33 Eric Sandeen 2017-11-21 15:13:32 UTC
cc: dchinner - Dave, can you spare any brain cycles on making sure this is headed in a safe direction?

"Once we assess that there isn't any signifigant performance problems etc
introduced by using a 4k xfs filesystem on a traditional
512b disk, I'd suggest we just switch the default on the overcloud-full image."

My worry is atomicity of 4k metadata writes on 512 disk, but I /think/ the torn write detection bfoster did makes that safe.

Thanks,
-Eric

Comment 34 Derek Higgins 2017-11-21 16:05:24 UTC
(In reply to Dmitry Tantsur from comment #32)
> Derek, could you please provide an example command you used to build an
> image? Such a workaround could help a lot.

Note: I was testing this upstream, so I generated a CentOS image; the RHEL equivalent is similar, except:
1. use overcloud-images-rhel7.yaml in place of overcloud-images-centos7.yaml
2. set the various REG_* env variables relevant to your RHEL subscription (documented in the rhel-common diskimage-builder element); there is no need for DIB_YUM_REPO_CONF

To create the image with a 4096 sector size, I had to patch two files in tripleo-common

1. Add -s size=4096 to the mkfs command (if building a rhel image then patch overcloud-images-rhel7.yaml instead)
diff -r /usr/share_/tripleo-common/image-yaml/overcloud-images-centos7.yaml /usr/share/tripleo-common/image-yaml/overcloud-images-centos7.yaml
9a10,11
>     options:
>       - "--mkfs-options '-s size=4096'"
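In context, the resulting entry in the image-yaml file would look roughly like this (the surrounding keys are a sketch of the file's structure, not an exact copy):

```yaml
disk_images:
  - imagename: overcloud-full
    # ... existing type/elements/packages keys unchanged ...
    options:
      - "--mkfs-options '-s size=4096'"
```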

2. Support spaces in mkfs options correctly:
diff -r /usr/lib/python2.7/site-packages_/tripleo_common/image/image_builder.py /usr/lib/python2.7/site-packages/tripleo_common/image/image_builder.py
19a20
> import shlex
97c98
<                 cmd.extend(option.split(' '))
---
>                 cmd.extend(shlex.split(option))
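The shlex change matters because the option value itself contains a quoted, space-separated argument; splitting on plain spaces tears the quoted value apart:

```python
import shlex

option = "--mkfs-options '-s size=4096'"

# Naive splitting breaks the quoted argument into fragments:
print(option.split(' '))    # ['--mkfs-options', "'-s", "size=4096'"]

# shlex honours shell-style quoting and keeps it whole:
print(shlex.split(option))  # ['--mkfs-options', '-s size=4096']
```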

Then to build a centos overcloud-full image
$ export DIB_YUM_REPO_CONF="/etc/yum.repos.d/delorean-current.repo  /etc/yum.repos.d/delorean-queens-deps.repo  /etc/yum.repos.d/delorean.repo  /etc/yum.repos.d/quickstart-centos-qemu.repo" 
$ openstack overcloud image build \
    --config-file /usr/share/openstack-tripleo-common/image-yaml/overcloud-images.yaml \
    --config-file /usr/share/openstack-tripleo-common/image-yaml/overcloud-images-centos7.yaml

Once built, you can register it with 
$ openstack overcloud image upload --update-existing

Comment 35 Dave Chinner 2017-11-21 22:58:49 UTC
(In reply to Eric Sandeen from comment #33)
> cc: dchinner - Dave, can you spare any brain cycles on making sure this is
> headed in a safe direction?
> 
> "Once we assess that there isn't any signifigant performance problems etc
> introduced by using a 4k xfs filesystem on a traditional
> 512b disk, I'd suggest we just switch the default on the overcloud-full
> image."

Performance will not be any different, because almost all of XFS's IO is filesystem block sized and aligned, which in both cases is already 4kB.

> My worry is atomicity of 4k metadata writes on 512 disk, but I /think/ the
> torn write detection bfoster did makes that safe.

Traditional 512b disk, or a 512e disk? Two very different behaviours between them....

With traditional (native) 512b sector disks, the torn log write issue for sectors larger than 512b should be solved (need to check which RHEL 7.x version that was introduced in). Log replay protects against torn metadata writes, so there should be no issues with undetected torn log or metadata writes. Torn data writes will still be an issue, but that indicates unsafe application data integrity practices, so torn writes are the least of the user's worries here.

512e disks use 4k sectors on the media, so 512 byte writes are actually RMW cycles hidden by the hardware. Using 4k sectors will stop this RMW cycle from happening for filesystem metadata and the log, but applications will still be able to do 512 byte aligned direct IO because the logical sector size is still 512. The page cache should insulate apps from this, so it's only direct IO where this gets exposed. IOWs, the torn write problem has a smaller "user data w/ direct IO" scope because of the internal RMW the drive does. The problem is still there, but very few users are likely to be exposed to it on 512e drives.
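The hidden RMW cycle described here can be illustrated with a toy model of a 512e drive (purely illustrative, not how any driver or firmware is implemented):

```python
PHYS = 4096  # physical sector size of a 512e drive

def media_bytes_touched(offset, length, phys=PHYS):
    """Bytes the media must read+write to service a logical write."""
    start = (offset // phys) * phys                      # round down to sector
    end = ((offset + length + phys - 1) // phys) * phys  # round up to sector
    return end - start

print(media_bytes_touched(512, 512))    # 4096: a 512b write costs a full 4k RMW
print(media_bytes_touched(4096, 4096))  # 4096: aligned 4k writes map 1:1
print(media_bytes_touched(4094, 4))     # 8192: a straddling write touches two sectors
```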

Cheers,

Dave.

Comment 36 pgsousa 2017-11-22 11:11:08 UTC
Hi, 

thanks to Derek's procedure (https://bugzilla.redhat.com/show_bug.cgi?id=1430435#c34) I've managed to get this working on OSP-11 with my 4K disks with UEFI enabled.

Here's the procedure I used to create the overcloud image:

1. Add -s size=4096 to the mkfs command in /usr/share/openstack-tripleo-common/image-yaml/overcloud-images-rhel7.yaml:

>     options:
>       - "--mkfs-options '-s size=4096'"

2. Support spaces in mkfs options correctly:
diff -r /usr/lib/python2.7/site-packages_/tripleo_common/image/image_builder.py /usr/lib/python2.7/site-packages/tripleo_common/image/image_builder.py
19a20
> import shlex
97c98
<                 cmd.extend(option.split(' '))
---
>                 cmd.extend(shlex.split(option))

3. Generate image

export DIB_LOCAL_IMAGE=rhel-server-7.4-x86_64-kvm.qcow2
export REG_USER=user
export REG_PASSWORD=password
export REG_METHOD=portal
export REG_POOL_ID="poolid"
export REG_SERVER_URL="subscription.rhn.redhat.com"
export REG_SERVICE_LEVEL="Self-Support"
export REG_REPOS="rhel-7-server-rpms rhel-7-server-extras-rpms rhel-ha-for-rhel-7-server-rpms rhel-7-server-optional-rpms rhel-7-server-openstack-11-rpms"

openstack overcloud image build --config-file /usr/share/openstack-tripleo-common/image-yaml/overcloud-images.yaml --config-file /usr/share/openstack-tripleo-common/image-yaml/overcloud-images-rhel7.yaml

Thanks Derek :)

Comment 39 Dave Chinner 2017-11-22 22:41:20 UTC
(In reply to Derek Higgins from comment #38)
> > Torn data writes will still be an issue, but that indicates unsafe application
> > data integrity practices so torn writes are the least of the user's worries
> > here.
> 
> Dave can you elaborate on this, what kind of practices would cause problems
> and what would the problems be?

The problem is applications not using fsync/fdatasync() where they need to to guarantee data is on stable storage. If they don't do this, then torn writes don't matter - critical data is going to be lost on crash/power loss regardless of the storage setup.
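
For reference, the pattern alluded to here is write, flush, fsync, and (for newly created files) fsync of the containing directory; a minimal sketch in Python:

```python
import os
import tempfile

def durable_write(path, data):
    """Write `data` to `path` and push it to stable storage before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()               # user-space buffers -> kernel page cache
        os.fsync(f.fileno())    # page cache -> stable storage
    dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)         # persist the new directory entry itself
    finally:
        os.close(dirfd)

tmpdir = tempfile.mkdtemp()
target = os.path.join(tmpdir, "state")
durable_write(target, b"critical data")
print(open(target, "rb").read())
```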

Really, bugzilla is not the place to discuss this. We've documented how to write data to stable storage safely in our developer guides and all over the web, such as this excellent article by Jeff Moyer:

https://lwn.net/Articles/457667/

-Dave.

Comment 40 pgsousa 2017-11-22 22:51:01 UTC
(In reply to Dave Chinner from comment #39)
> (In reply to Derek Higgins from comment #38)
> > > Torn data writes will still be an issue, but that indicates unsafe application
> > > data integrity practices so torn writes are the least of the user's worries
> > > here.
> > 
> > Dave can you elaborate on this, what kind of practices would cause problems
> > and what would the problems be?
> 
> The problem is applications not using fsync/fdatasync() where they need to
> to guarantee data is on stable storage. If they don't do this, then torn
> writes don't matter - critical data is going to be lost on crash/power loss
> regardless of the storage setup.
> 
> Really, bugzilla is not the place to discuss this. We've documented how to
> write data to stable storage safely in our developer guides and all over the
> web, such as this excellent article by Jeff Moyer:
> 
> https://lwn.net/Articles/457667/
> 
> -Dave.


Hi,

in my case I will use 512-byte-sector disks for the Controller nodes and 4k disks for the Computes. Is this safe?

Thanks

Comment 41 Benjamin Schmaus 2018-01-23 19:51:33 UTC
Any update on when we might see this in a RHOSP release?

Comment 42 Derek Higgins 2018-02-02 10:11:19 UTC
I've submitted this change to tripleo-common upstream; once it has landed we can consider potential backports.

Comment 48 Alexander Chuzhoy 2018-05-18 19:59:51 UTC
Verified:

Environment:
instack-undercloud-8.4.1-4.el7ost.noarch
openstack-tripleo-common-containers-8.6.1-12.el7ost.noarch
openstack-tripleo-common-8.6.1-12.el7ost.noarch


Notes:
It requires UEFI.

Successfully booted a server with 4096 block size disk.

Comment 51 errata-xmlrpc 2018-06-27 13:29:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086