If GFS2 utilities ever partition or create file systems directly, they should know about the device topology information now exported from the kernel.
Good summary page:
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release. Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux major release. This request is not yet committed for
inclusion.
Is this the same as bug #512171 ?
In answer to comment #3, yes it is. That bug was filed against Fedora, though, since it's an RFE and should go into upstream first.
Created attachment 396422 [details]
Here is an early prototype patch. Likely needs more work.
Created attachment 397094 [details]
Another prototype patch
This patch is untested, but more advanced than the previous:
1. This one doesn't die if it gets errors trying to fetch topology
information, as the previous one did; it just moves forward.
2. This one assumes physical_block_size or the default, nothing
smaller.
3. This one checks for block size specified < physical_block_size
because that's an inefficient configuration. It asks permission
to use the smaller block size, unless the override flag is set
(used for automated testing).
4. It squirrels away the topology information into new fields in
the GFS2 superblock, if defined by the kernel's gfs2_ondisk
include. This way the kernel code can make use of the info if
it deems fit.
5. It allows displaying / printing of the new superblock values
6. As part of the topology info, I'm also saving off the rgrp
size numbers used by mkfs.gfs2: first rgrp length and subsequent
rgrp length. This will help fsck repair damaged rgrps more
accurately at some point in the future.
Right now, GFS2 does most of its IO using buffer_heads, which use
4K pages of memory. The current mkfs.gfs2 assumes 4K block sizes
to take advantage of that. However, raw SCSI disks often report
physical_block_size == 512. Last Friday there was some discussion
on some of the IRC channels about whether or not we should change
the default from 4K.
From talking to Mike Snitzer, it sounds like GFS2 might be best
left with 4K as the default. It sounds like the most useful of
the topology numbers is actually the stripe size for RAID for
alignment purposes. I need to discuss this with Steve Whitehouse
and get his input on how we can make the best use of these numbers.
Created attachment 397167 [details]
Try 3 patch
Here's another prototype patch. I spoke with Steve Whitehouse
this morning. We decided to _not_ save any of the numbers in
the superblock. We also decided to modify how the numbers are
used to decide the block size.
logical_block_size = the smallest unit we can address on the
                     device
minimum_io_size = physical_block_size = the smallest unit we can
                  write without incurring a read-modify-write penalty
optimal_io_size = the biggest I/O we can submit without incurring a
                  penalty (stall, cache or queue full). A multiple
                  of minimum_io_size.
alignment_offset = padding to the start of the lowest aligned
                   logical block
I investigated how we can best make use of the alignment info
for RAID controllers and such. The bottom line is that there
are problems and it will require a lot more thought and research.
One of the problems is best illustrated by an example from one
of the web pages mentioned earlier. In it, Alasdair gives this example:
Example 5-disk RAID5 array:
- 512-byte logical_block_size (was hardsect_size)
- 64-Kbyte physical_block_size (minimum_io_size == chunk size)
- 256-Kbyte optimal_io_size (4 * minimum_io_size == full stripe)
- 3584-byte alignment_offset (lowest logical block aligned to a
64-Kbyte boundary is LBA 7)
Ideally for performance, we want to align our rgrps and subsequent
bitmaps to fall on the alignment boundary, because the bitmaps are
the heaviest hit blocks. In this example, the alignment_offset is
3584, which means that ideally we should align everything to be
7 "basic blocks" of 512b into the device. The trouble is, neither
the kernel code nor the utils are designed to add an offset; they
do all their calculations based on block number * block size, and
3584 is not a multiple of 4096.
GFS2 does most of its IO with buffer_heads, which in turn use
page_size. In this example, GFS2 can't use a block size of
physical_block_size (64K) because it's greater than page_size.
It certainly can't write optimal_io_size because that's even
bigger (256K). It should definitely not use logical_block_size
because that's highly inefficient. Which means its ideal block
size is the default 4K. But the offset of 3584 is 7/8ths of a
4K block, and as I said, there's no way to add in an offset
unless we change the kernel and utils to use an offset. I
think this is the way to go, but not today.
The other problem is that gfs2 starts with a desired rgrp size,
then calculates how many rgrps of that size can fit on the device:
nrgrp = how_many_rgrps(sdp, dev, rgsize_specified);
rglength = dev->length / nrgrp;
Essentially carving the device into equally-spaced rgrps.
Ideally, it should do just the opposite: It should calculate the
best rglength based on optimal alignment information and stripe
size and just carve the file system into however many pieces fit:
rglength = optimized based on stripe size for performance and
           alignment;
nrgrp = dev->length / rglength;
The patch in comment #11 looks good to me.
(In reply to comment #13)
> The patch in comment #11 looks good to me.
Looks good to me too.
The patch from comment #11 was tested on system west-09. I pushed
the patch to the master branch of the gfs2-utils git tree and the
STABLE3 branch of the cluster git tree for inclusion into RHEL6.
I'm also going to clone this bug in order to do the follow-on work,
that is: we need to make gfs2 aware of RAID stripes and such.
For now, changing this to POST.
Bob, is there anything GFS2 will do differently using topology information besides complaining if the user selects a block size smaller than the physical block size?
It does a bunch of things, but it's designed to not have wildly
different behavior from before on x86_64.
1. If the debug option is specified, mkfs.gfs2 will print the
topology values reported by the hardware.
2. If block size is specified less than logical block size,
it complains and exits with an error.
3. If block size is specified less than physical block size,
it will give you a warning that it's inefficient and give an
option to abort.
4. If the hardware reports an optimal IO size, it sets the
default block size to that.
5. If the hardware reports a valid physical block size, it
sets block size to that (if the test in step 4 fails).
So you could run it on ppc to see if it gives you a different
block size and run it on different storage hardware if you want.
My original version saved the topology values in the superblock
in order to determine those values when the metadata is saved,
but Steve talked me out of that, so I backed it out.
I ran through all the scenarios by hand with scsi_debug options and found mkfs.gfs2 to follow the behavior Bob described with gfs2-utils-3.0.12-13.el6.
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.