Bug 519491
Summary: | GFS2 utilities should make use of exported device topology information | |
---|---|---|---
Product: | Red Hat Enterprise Linux 6 | Reporter: | Valerie Aurora Henson <vaurora>
Component: | cluster | Assignee: | Robert Peterson <rpeterso>
Status: | CLOSED CURRENTRELEASE | QA Contact: | Cluster QE <mspqa-list>
Severity: | medium | Priority: | low
Version: | 6.0 | Target Milestone: | rc
Hardware: | All | OS: | Linux
Keywords: | FutureFeature | Doc Type: | Enhancement
Fixed In Version: | cluster-3.0.9-1.el6 | Last Closed: | 2010-11-10 19:58:19 UTC
Bug Depends On: | 512171 | Bug Blocks: | 519834, 569845, 576381, 894348
Clones: | 569845 | CC: | borgan, ccaulfie, cluster-maint, esandeen, fdinitto, kzhang, lhh, msnitzer, nstraz, rpeterso, rwheeler, swhiteho, teigland, yanwang
Description
Valerie Aurora Henson
2009-08-26 19:48:44 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux major release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux major release. This request is not yet committed for inclusion.
Is this the same as bug #512171?
In answer to comment #3, yes it is. That bug was filed against Fedora, though, since it's an RFE and should go into upstream first.
Created attachment 396422 [details]
Early prototype
Here is an early prototype patch. Likely needs more work.
Created attachment 397094 [details]
Another prototype patch
This patch is untested, but more advanced than the previous:
1. This one doesn't die if it gets errors trying to fetch topology
information; it just moves forward. The previous one died.
2. This one assumes physical_block_size or the default, nothing
more.
3. This one checks for block size specified < physical_block_size
because that's an inefficient configuration. It asks permission
to use the smaller block size, unless the override flag is set
(used for automated testing).
4. It squirrels away the topology information into new fields in
the GFS2 superblock, if defined by the kernel's gfs2_ondisk
include. This way the kernel code can make use of the info if
it deems fit.
5. It allows displaying / printing of the new superblock values
from gfs2_edit.
6. As part of the topology info, I'm also saving off the rgrp
size numbers used by mkfs.gfs2: first rgrp length and subsequent
rgrp length. This will be helpful and more accurate for fsck
to repair damaged rgrps at some time in the future.
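Items 1 and 2 of the list above (tolerate errors while fetching topology, then fall back to physical_block_size or the default) can be sketched roughly as follows. This is an illustration only, not the code in the attached patch: it reads the values the Linux kernel exports under /sys/block/<dev>/, and the function names `read_topology` and `default_block_size`, the `sysfs_root` parameter, and the `DEFAULT_BLOCK_SIZE` constant are all hypothetical.

```python
import os

# Assumed fallback when nothing usable can be read: the 4K default block size.
DEFAULT_BLOCK_SIZE = 4096

# Linux sysfs attribute locations, relative to /sys/block/<dev>/.
TOPOLOGY_ATTRS = {
    "logical_block_size": "queue/logical_block_size",
    "physical_block_size": "queue/physical_block_size",
    "minimum_io_size": "queue/minimum_io_size",
    "optimal_io_size": "queue/optimal_io_size",
    "alignment_offset": "alignment_offset",
}


def read_topology(dev_name, sysfs_root="/sys/block"):
    """Return a dict of topology values; None marks a value we could not read."""
    topology = {}
    for name, rel_path in TOPOLOGY_ATTRS.items():
        path = os.path.join(sysfs_root, dev_name, rel_path)
        try:
            with open(path) as attr:
                topology[name] = int(attr.read().strip())
        except (OSError, ValueError):
            # Do not die on errors: record the gap and move forward.
            topology[name] = None
    return topology


def default_block_size(topology):
    """Use physical_block_size if it was read, otherwise the default."""
    return topology.get("physical_block_size") or DEFAULT_BLOCK_SIZE
```

For a real device this would be called as `read_topology("sda")`; a device that exports no topology at all simply yields the 4K default.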
Known issue:
Right now, GFS2 does most of its IO using buffer_heads, which use
4K pages of memory. The current mkfs.gfs2 assumes 4K block sizes
to take advantage of that. However, raw SCSI disks often report
physical_block_size == 512. Last Friday there was some discussion
on some of the IRC channels about whether or not we should change
the default from 4K.
From talking to Mike Snitzer, it sounds like GFS2 might be best
left with 4K as the default. It sounds like the most useful of
the topology numbers is actually the stripe size for RAID for
alignment purposes. I need to discuss this with Steve Whitehouse
and get his input on how we can make the best use of these numbers.
Created attachment 397167 [details]
Try 3 patch
Here's another prototype patch. I spoke with Steve Whitehouse
this morning. We decided to _not_ save any of the numbers in
the superblock. We also decided to modify how the numbers are
used to decide the block size.
Notes:
logical_block_size = the smallest unit we can address using the programming interface
minimum_io_size = physical_block_size = the smallest unit we can write without incurring read-modify-write penalty
optimal_io_size = the biggest I/O we can submit without incurring a penalty (stall, cache or queue full). A multiple of minimum_io_size.
alignment_offset = padding to the start of the lowest aligned logical block.
I investigated how we can best make use of the alignment info for RAID controllers and such. The bottom line is that there are problems and it will require a lot more thought and research.
One of the problems is best explained with an example from one of the web pages pointed out. In it, Alasdair gives this example:
Example 5-disk RAID5 array:
- 512-byte logical_block_size (was hardsect_size)
- 64-Kbyte physical_block_size (minimum_io_size == chunk size)
- 256-Kbyte optimal_io_size (4 * minimum_io_size == full stripe)
- 3584-byte alignment_offset (lowest logical block aligned to a 64-Kbyte boundary is LBA 7)
Ideally for performance, we want to align our rgrps and subsequent bitmaps to fall on the alignment boundary, because the bitmaps are the heaviest hit blocks. In this example, the alignment_offset is 3584, which means that ideally we should align everything to be 7 "basic blocks" of 512b into the device. The trouble is, neither the kernel code nor the utils are designed to add an offset; they do all their calculations based on block number * block size, and 3584 is not a multiple of 4096.
GFS2 does most of its IO with buffer_heads, which in turn use page_size. In this example, GFS2 can't use a block size of physical_block_size (64K) because it's greater than page_size. It certainly can't write optimal_io_size because that's even bigger (256K). It should definitely not use logical_block_size because that's highly inefficient. Which means its ideal block size is the default 4K.
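The arithmetic behind the RAID5 example can be checked with a few lines of Python. The device numbers are taken from the example; the 4096-byte page size is an assumption (typical for x86_64), and the variable names are only illustrative.

```python
KIB = 1024

# Values from the 5-disk RAID5 example.
logical_block_size = 512          # bytes
physical_block_size = 64 * KIB    # chunk size
optimal_io_size = 256 * KIB       # full stripe
alignment_offset = 3584           # bytes

gfs2_block_size = 4096            # the 4K default
page_size = 4096                  # assumption: x86_64-style 4K pages

# The offset expressed in 512-byte logical blocks.
first_aligned_lba = alignment_offset // logical_block_size

# Data chunks per full stripe (5 disks minus one for parity).
chunks_per_stripe = optimal_io_size // physical_block_size

# The offset is not a whole number of 4K blocks, so a calculation of the
# form "block number * block size" can never land on it.
offset_in_fs_blocks = alignment_offset / gfs2_block_size
offset_is_block_aligned = alignment_offset % gfs2_block_size == 0

# Reported sizes that would fit in one page. Only logical_block_size
# survives, and that one is too small to be efficient.
usable = [size
          for size in (logical_block_size, physical_block_size, optimal_io_size)
          if size <= page_size]

print(first_aligned_lba)        # 7
print(chunks_per_stripe)        # 4
print(offset_in_fs_blocks)      # 0.875
print(offset_is_block_aligned)  # False
print(usable)                   # [512]
```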
But the offset of 3584 is 7/8ths of a 4K block, and as I said, there's no way to add in an offset unless we change the kernel and utils to use an offset. I think this is the way to go, but not today.
The other problem is that gfs2 starts with a desired rgrp size, then calculates how many rgrps of that size can fit on the device:
    nrgrp = how_many_rgrps(sdp, dev, rgsize_specified);
    rglength = dev->length / nrgrp;
Essentially carving the device into equally-spaced rgrps. Ideally, it should do just the opposite: It should calculate the best rglength based on optimal alignment information and stripe size and just carve the file system into however many pieces fit:
    rglength = optimized based on stripe size for performance and alignment;
    nrgrp = dev->length / rglength;
The patch in comment #11 looks good to me.
(In reply to comment #13)
> The patch in comment #11 looks good to me.
Looks good to me too.
The patch from comment #11 was tested on system west-09. I pushed the patch to the master branch of the gfs2-utils git tree and the STABLE3 branch of the cluster git tree for inclusion into RHEL6. I'm also going to clone this bug in order to do the follow-on work, that is: we need to make gfs2 aware of RAID stripes and such. For now, changing this to POST.
Bob, is there anything GFS2 will do differently using topology information besides complaining if the user selects a block size smaller than the physical block size?
It does a bunch of things, but it's designed to not have wildly different behavior from before on x86_64.
1. If the debug option is specified, mkfs.gfs2 will print the topology values reported by the hardware.
2. If block size is specified less than logical block size, it complains and exits with an error.
3. If block size is specified less than physical block size, it will give you a warning that it's inefficient and give an option to abort.
4. If the hardware reports an optimal IO size, it sets the default block size to that.
5. If the hardware reports a valid physical block size, it sets block size to that (if the test in step 4 fails).
So you could run it on ppc to see if it gives you a different block size and run it on different storage hardware if you want. My original version saved the topology values in the superblock in order to determine those values when the metadata is saved, but Steve talked me out of that, so I backed it out.
I ran through all the scenarios by hand with scsi_debug options and found mkfs.gfs2 to follow the behavior Bob described with gfs2-utils-3.0.12-13.el6.
Red Hat Enterprise Linux 6.0 is now available and should resolve the problem described in this bug report. This report is therefore being closed with a resolution of CURRENTRELEASE. You may reopen this bug report if the solution does not work for you.
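The block-size rules in Bob's reply (steps 2 to 5) can be sketched as below. This is a reading of the comment, not the mkfs.gfs2 source: the names `choose_block_size`, `is_valid_block_size`, `BlockSizeError` and the `confirm` callback are hypothetical. The thread does not define what counts as a "valid" reported size; the sketch assumes a power of two between the 4K default and the page size, which is consistent with the remarks that behavior should not change on x86_64 and may differ on ppc.

```python
class BlockSizeError(Exception):
    """Raised when the requested block size is rejected."""


def is_valid_block_size(size, page_size, default=4096):
    # Assumption: "valid" means a power of two, at least the 4K default,
    # and no larger than a page.
    return (size is not None
            and default <= size <= page_size
            and size & (size - 1) == 0)


def choose_block_size(topology, requested=None, page_size=4096,
                      confirm=lambda message: False, default=4096):
    """Pick a block size from topology values (a dict; missing keys are OK)."""
    logical = topology.get("logical_block_size")
    physical = topology.get("physical_block_size")
    optimal = topology.get("optimal_io_size")

    if requested is not None:
        # Step 2: smaller than the logical block size is an error.
        if logical and requested < logical:
            raise BlockSizeError("block size is below logical_block_size")
        # Step 3: smaller than the physical block size is inefficient;
        # warn and let the user abort.
        if physical and requested < physical:
            if not confirm("block size is below physical_block_size; continue?"):
                raise BlockSizeError("aborted by user")
        return requested

    # Step 4: prefer a usable optimal IO size.
    if is_valid_block_size(optimal, page_size, default):
        return optimal
    # Step 5: otherwise a usable physical block size.
    if is_valid_block_size(physical, page_size, default):
        return physical
    return default
```

With 4K pages and a disk reporting 512-byte physical blocks, the result stays at 4096. With 64K pages and the RAID5 example values, the sketch returns 65536, because the 256K optimal IO size does not fit in a page.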