Bug 519491 - GFS2 utilities should make use of exported device topology information
Summary: GFS2 utilities should make use of exported device topology information
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.0
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Robert Peterson
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On: 512171
Blocks: 519834 569845 576381 894348
Reported: 2009-08-26 19:48 UTC by Valerie Aurora Henson
Modified: 2013-01-11 14:13 UTC (History)

Fixed In Version: cluster-3.0.9-1.el6
Doc Type: Enhancement
Doc Text:
Clone Of:
: 569845 (view as bug list)
Environment:
Last Closed: 2010-11-10 19:58:19 UTC


Attachments
Early prototype (5.35 KB, patch)
2010-02-25 23:07 UTC, Robert Peterson
Another prototype patch (9.98 KB, patch)
2010-03-01 15:42 UTC, Robert Peterson
Try 3 patch (6.38 KB, patch)
2010-03-01 20:19 UTC, Robert Peterson

Description Valerie Aurora Henson 2009-08-26 19:48:44 UTC
If GFS2 utilities ever partition or create file systems directly, they should know about the device topology information now exported from the kernel.

Good summary page:

http://osdir.com/ml/linux-raid/2009-06/msg00309.html

Comment 2 RHEL Product and Program Management 2009-08-26 20:16:32 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 3 Robert Peterson 2009-08-26 20:35:59 UTC
Is this the same as bug #512171 ?

Comment 4 Steve Whitehouse 2009-08-27 07:45:13 UTC
In answer to comment #3, yes it is. That bug was filed against Fedora, though, since it's an RFE and should go upstream first.

Comment 9 Robert Peterson 2010-02-25 23:07:55 UTC
Created attachment 396422
Early prototype

Here is an early prototype patch.  Likely needs more work.

Comment 10 Robert Peterson 2010-03-01 15:42:04 UTC
Created attachment 397094
Another prototype patch

This patch is untested, but more advanced than the previous:

1. Unlike the previous patch, this one doesn't die if it gets
   errors trying to fetch topology information.  It just moves
   forward.
2. This one uses physical_block_size as the block size, or falls
   back to the default, nothing more.
3. This one checks for block size specified < physical_block_size
   because that's an inefficient configuration.  It asks permission
   to use the smaller block size, unless the override flag is set
   (used for automated testing).
4. It squirrels away the topology information into new fields in
   the GFS2 superblock, if defined by the kernel's gfs2_ondisk
   include.  This way the kernel code can make use of the info if
   it deems fit.
5. It allows displaying / printing of the new superblock values
   from gfs2_edit.
6. As part of the topology info, I'm also saving off the rgrp
   size numbers used by mkfs.gfs2: first rgrp length and subsequent
   rgrp length.  This will be helpful and more accurate for fsck
   to repair damaged rgrps at some time in the future.

Known issue:

Right now, GFS2 does most of its IO using buffer_heads, which use
4K pages of memory.  The current mkfs.gfs2 assumes 4K block sizes
to take advantage of that.  However, raw SCSI disks often report
physical_block_size == 512.  Last Friday there was some discussion
on some of the IRC channels about whether or not we should change
the default from 4K.

From talking to Mike Snitzer, it sounds like GFS2 might be best
left with 4K as the default.  It sounds like the most useful of
the topology numbers is actually the stripe size for RAID for
alignment purposes.  I need to discuss this with Steve Whitehouse
and get his input on how we can make the best use of these numbers.

Comment 11 Robert Peterson 2010-03-01 20:19:46 UTC
Created attachment 397167
Try 3 patch

Here's another prototype patch.  I spoke with Steve Whitehouse
this morning.  We decided to _not_ save any of the numbers in
the superblock.  We also decided to modify how the numbers are
used to decide the block size.

Comment 12 Robert Peterson 2010-03-01 20:54:01 UTC
Notes:

logical_block_size = the smallest unit we can address using the
                     programming interface

minimum_io_size = physical_block_size = the smallest unit we can
                  write without incurring read-modify-write penalty

optimal_io_size = the biggest I/O we can submit without incurring a
                  penalty (stall, cache or queue full). A multiple
                  of minimum_io_size.

alignment_offset = padding to the start of the lowest aligned
                   logical block.
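
On Linux these values are exported through sysfs, for example
/sys/block/<dev>/queue/physical_block_size and
/sys/block/<dev>/alignment_offset.  The following is only a minimal
sketch of reading one such value without aborting on failure; the
helper name read_topology_value is hypothetical and this is not the
code from the attached patches:

```c
#include <stdio.h>

/*
 * Read a single unsigned decimal value from a sysfs-style file.
 * Returns 0 and stores the value in *out on success.  Returns -1 and
 * leaves *out untouched on any error, so the caller keeps its default.
 */
int read_topology_value(const char *path, unsigned long *out)
{
	FILE *fp = fopen(path, "r");
	unsigned long value;
	int matched;

	if (fp == NULL)
		return -1;
	matched = fscanf(fp, "%lu", &value);
	fclose(fp);
	if (matched != 1)
		return -1;
	*out = value;
	return 0;
}
```

A caller could initialise a variable to its default, call
read_topology_value() with the sysfs path for the device in
question, and simply carry on with the default if -1 comes back.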

I investigated how we can best make use of the alignment info
for RAID controllers and such.  The bottom line is that there
are problems and it will require a lot more thought and research.

One of the problems is best explained with an example from one
of the web pages pointed out.  In it, Alasdair gives this example:

Example 5-disk RAID5 array:
- 512-byte logical_block_size (was hardsect_size)
- 64-Kbyte physical_block_size (minimum_io_size == chunk size)
- 256-Kbyte optimal_io_size (4 * minimum_io_size == full stripe)
- 3584-byte alignment_offset (lowest logical block aligned to a
64-Kbyte boundary is LBA 7)

Ideally for performance, we want to align our rgrps and subsequent
bitmaps to fall on the alignment boundary, because the bitmaps are
the heaviest hit blocks.  In this example, the alignment_offset is
3584, which means that ideally we should align everything to be
7 "basic blocks" of 512 bytes into the device.  The trouble is, neither
the kernel code nor the utils are designed to add an offset; they
do all their calculations based on block number * block size, and
3584 is not a multiple of 4096.
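
The arithmetic behind that example can be sketched as follows.  The
function names are hypothetical and exist only to illustrate the
numbers above:

```c
#include <stdint.h>

/* Index (LBA) of the lowest naturally aligned logical block. */
uint64_t first_aligned_lba(uint64_t alignment_offset,
			   uint64_t logical_block_size)
{
	return alignment_offset / logical_block_size;
}

/*
 * Nonzero if the offset is a whole number of file system blocks,
 * i.e. if "block number * block size" arithmetic could express it.
 */
int offset_fits_block_grid(uint64_t alignment_offset,
			   uint64_t fs_block_size)
{
	return alignment_offset % fs_block_size == 0;
}
```

With alignment_offset = 3584 and logical_block_size = 512, the first
function gives LBA 7.  With a 4096-byte file system block the second
function reports a misfit, because 3584 % 4096 leaves a remainder of
3584.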

GFS2 does most of its IO with buffer_heads, which in turn use
page_size.  In this example, GFS2 can't use a block size of
physical_block_size (64K) because it's greater than page_size.
It certainly can't write optimal_io_size because that's even
bigger (256K).  It should definitely not use logical_block_size
because that's highly inefficient.  That means its ideal block
size is the default 4K.  But the offset of 3584 is 7/8ths of a
4K block, and as I said, there's no way to add in an offset
unless we change the kernel and utils to use an offset.  I
think this is the way to go, but not today.

The other problem is that gfs2 starts with a desired rgrp size,
then calculates how many rgrps of that size can fit on the device:

    nrgrp = how_many_rgrps(sdp, dev, rgsize_specified);
    rglength = dev->length / nrgrp;

This essentially carves the device into equally spaced rgrps.
Ideally, it should do just the opposite:  It should calculate the
best rglength based on optimal alignment information and stripe
size and just carve the file system into however many pieces fit:

    rglength = optimized based on stripe size for performance and
               alignment;
    nrgrp = dev->length / rglength;
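
To make the difference concrete, here is a rough sketch of both
orderings.  Everything in it is an assumption made for illustration:
the struct, the function names, the simple division standing in for
how_many_rgrps(), and rounding down to a stripe multiple as the
"optimized" length.  All lengths are in file system blocks.

```c
#include <stdint.h>

struct rgrp_layout {
	uint64_t nrgrp;    /* number of resource groups */
	uint64_t rglength; /* length of each resource group */
};

/* Current ordering: fix the count first, then divide evenly. */
struct rgrp_layout carve_by_count(uint64_t dev_length,
				  uint64_t desired_rglength)
{
	struct rgrp_layout l;

	l.nrgrp = dev_length / desired_rglength;
	if (l.nrgrp == 0)
		l.nrgrp = 1;
	l.rglength = dev_length / l.nrgrp;
	return l;
}

/*
 * Proposed ordering: fix the length first (a multiple of the stripe
 * size), then see how many fit.  stripe_blocks must be nonzero.
 */
struct rgrp_layout carve_by_stripe(uint64_t dev_length,
				   uint64_t desired_rglength,
				   uint64_t stripe_blocks)
{
	struct rgrp_layout l;
	uint64_t len = desired_rglength - desired_rglength % stripe_blocks;

	if (len == 0)
		len = stripe_blocks;
	l.rglength = len;
	l.nrgrp = dev_length / len;
	return l;
}
```

For a 1,000,000-block device, a desired rgrp length of 65,000 blocks
and a 64-block stripe (256K / 4K), the first function yields 15 rgrps
of 66,666 blocks, which is not a stripe multiple.  The second yields
15 rgrps of 64,960 blocks, each a stripe multiple, and leaves some
blocks at the end of the device that the sketch does not deal with.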

Comment 13 Steve Whitehouse 2010-03-01 23:34:49 UTC
The patch in comment #11 looks good to me.

Comment 14 Mike Snitzer 2010-03-02 02:10:18 UTC
(In reply to comment #13)
> The patch in comment #11 looks good to me.    

Looks good to me too.

Comment 15 Robert Peterson 2010-03-02 15:03:29 UTC
The patch from comment #11 was tested on system west-09.  I pushed
the patch to the master branch of the gfs2-utils git tree and the
STABLE3 branch of the cluster git tree for inclusion into RHEL6.
I'm also going to clone this bug in order to do the follow-on work,
that is: we need to make gfs2 aware of RAID stripes and such.
For now, changing this to POST.

Comment 21 Nate Straz 2010-07-28 18:50:11 UTC
Bob, is there anything GFS2 will do differently using topology information besides complaining if the user selects a block size smaller than the physical block size?

Comment 22 Robert Peterson 2010-07-28 20:03:44 UTC
It does a bunch of things, but it's designed to not have wildly
different behavior from before on x86_64.

1. If the debug option is specified, mkfs.gfs2 will print the
   topology values reported by the hardware.
2. If the specified block size is less than the logical block
   size, it complains and exits with an error.
3. If the specified block size is less than the physical block
   size, it warns that this is inefficient and gives you the
   option to abort.
4. If the hardware reports an optimal IO size, it sets the
   default block size to that.
5. If the hardware reports a valid physical block size, it
   sets block size to that (if the test in step 4 fails).
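
Steps 2 through 5 can be sketched roughly as below.  This is not the
mkfs.gfs2 source.  All names are hypothetical, and what counts as a
"valid" or usable size is an assumption: a power of two no larger
than the page size, and for the physical block size also no smaller
than the 4K default.

```c
#include <stdint.h>

#define DEFAULT_BSIZE 4096u /* the 4K default discussed in this bug */

enum bsize_status {
	BSIZE_OK,
	BSIZE_WARN_BELOW_PHYSICAL, /* step 3: warn, offer to abort */
	BSIZE_ERR_BELOW_LOGICAL    /* step 2: complain and exit */
};

struct topology {
	uint32_t logical_block_size;
	uint32_t physical_block_size;
	uint32_t optimal_io_size; /* 0 when not reported */
};

/* Assumed validity rule: power of two between 512 and the page size. */
static int usable(uint32_t size, uint32_t page_size)
{
	return size >= 512 && size <= page_size &&
	       (size & (size - 1)) == 0;
}

/* Steps 4 and 5: pick a default when the user gave no block size. */
uint32_t default_block_size(const struct topology *t, uint32_t page_size)
{
	if (usable(t->optimal_io_size, page_size))
		return t->optimal_io_size;
	if (t->physical_block_size >= DEFAULT_BSIZE &&
	    usable(t->physical_block_size, page_size))
		return t->physical_block_size;
	return DEFAULT_BSIZE;
}

/* Steps 2 and 3: judge a block size the user asked for. */
enum bsize_status check_requested_block_size(const struct topology *t,
					     uint32_t requested)
{
	if (requested < t->logical_block_size)
		return BSIZE_ERR_BELOW_LOGICAL;
	if (requested < t->physical_block_size)
		return BSIZE_WARN_BELOW_PHYSICAL;
	return BSIZE_OK;
}
```

Under these assumptions the RAID5 example from comment #12 still
gets a 4K default with 4K pages, while the same device with 64K
pages would get a 64K default.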

So you could run it on ppc to see if it gives you a different
block size and run it on different storage hardware if you want.

My original version saved the topology values in the superblock
in order to determine those values when the metadata is saved,
but Steve talked me out of that, so I backed it out.

Comment 23 Nate Straz 2010-07-30 14:55:40 UTC
I ran through all the scenarios by hand with scsi_debug options and found mkfs.gfs2 to follow the behavior Bob described with gfs2-utils-3.0.12-13.el6.

Comment 24 releng-rhel@redhat.com 2010-11-10 19:58:19 UTC
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.

