Red Hat Bugzilla – Bug 216902
mkfs.gfs2 allows non-4K block size
Last modified: 2010-01-11 22:36:49 EST
Description of problem:
I've been told that gfs2 can only handle a 4K block size.
Therefore, I made gfs2_convert reject gfs1 file systems
with non-4K block sizes. See bz214513. However, mkfs.gfs2
still allows the user to create file systems with non-4K blocks.
We need to determine if the gfs2 kernel can handle these non-4K
blocks. If it can, we need to revert the change to gfs2_convert.
If it can't, we need to remove the -b parameter from mkfs.gfs2
and its man page.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
mkfs.gfs2 -b 2048 -t bob_cluster2:lv7 -p lock_dlm -j 2 /dev/bobs_vg/lvol7
mkfs.gfs2: invalid option -- b
I spoke with Steve Whitehouse about this, and we need to understand
where the 4K block size limitation is. Perhaps I've been misled and
the kernel handles non-4K blocks properly.
Originally reported by Russell Cattelan, so I'm adding him to the cc list.
We need to understand which direction we are going with this one prior to RHEL5 GA.
Assuming that gfs2 should run in mixed page size clusters,
limiting gfs2 to file system block sizes == page size seems
like a major regression/limitation from gfs1.
I assume page size == 4k on x86_* is the reason that gfs2 is asserting
4k blocks only.
Note this is the error reported from the kernel when trying to mount a
1k gfs2 file system.
bear-02.lab.msp.redhat.com login: GFS2: fsid=sda7.0: fatal: invalid metadata block
GFS2: fsid=sda7.0: bh = 15886 (magic number)
GFS2: fsid=sda7.0: function = gfs2_meta_indirect_buffer, file =
fs/gfs2/meta_io.c, line = 666
GFS2: fsid=sda7.0: about to withdraw from the cluster
GFS2: fsid=sda7.0: waiting for outstanding I/O
GFS2: fsid=sda7.0: telling LM to withdraw
GFS2: fsid=sda7.0: withdrawn
[<f8f7404f>] gfs2_lm_withdraw+0x9a/0xa5 [gfs2]
[<f8f86280>] gfs2_meta_check_ii+0x51/0x5d [gfs2]
[<f8f77884>] gfs2_meta_indirect_buffer+0x1e1/0x283 [gfs2]
[<f8f68101>] gfs2_block_pointers+0x1a2/0x35f [gfs2]
[<f8f68378>] gfs2_extent_map+0xba/0xff [gfs2]
[<f8f68554>] gfs2_write_alloc_required+0x197/0x1d3 [gfs2]
[<f8f83eac>] gfs2_jdesc_check+0x90/0xc5 [gfs2]
[<f8f7b7a2>] init_journal+0x250/0x3f5 [gfs2]
[<f8f7b99b>] init_inodes+0x54/0x1da [gfs2]
[<f8f7c484>] fill_super+0x50e/0x632 [gfs2]
[<f8f7b16e>] gfs2_get_sb+0x21/0x3e [gfs2]
Seems like we should allow multiple block sizes unless it muddies the design or
manifestly makes the fs less stable. Looking for opinions on this.
If we are going to support mixed page size nodes in a cluster,
I don't see how we have any other option but to support sub-page fs block sizes.
I have a feeling that page based glocking is the biggest reason
sub page IO is going to be difficult.
Page based glocking is probably responsible for much of the IO performance
problems, since the overhead of grabbing and releasing glocks for each page is
much higher than file based glocking.
I agree that support for multiple block sizes would be the best way forward at
this stage. So far as I know the current situation is this: we support only
4k block sizes, and we therefore support multiple blocks per page only when
PAGE_SIZE > 4k. I don't have a suitable machine on which to test that,
but I don't recall seeing anything in the code to say otherwise.
Using the smaller block sizes is not recommended, mainly due to the large size
of (for example) the common metadata header, which means that the tree of
indirect pointers for an inode would potentially be much deeper. This may result
in us having to review the current policy of allocating the path through
the indirect pointer tree on the stack at bmap time. We currently get away with
this, although it's not ideal, by virtue of using 16 bit offsets into each block.
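The effect of block size on tree depth can be roughed out numerically. This is
a back-of-the-envelope sketch: the 24-byte header and 8-byte pointer sizes are
assumptions for illustration, not a statement of the exact on-disk format.

```python
def ptrs_per_indirect_block(block_size, header_size=24, ptr_size=8):
    """Pointers that fit in one indirect block after the common
    metadata header (sizes here are illustrative assumptions)."""
    return (block_size - header_size) // ptr_size

def tree_height_for(file_blocks, block_size):
    """Indirect-tree height needed to address file_blocks data blocks."""
    fanout = ptrs_per_indirect_block(block_size)
    height, reach = 0, 1
    while reach < file_blocks:
        reach *= fanout
        height += 1
    return height

# Addressing a 1 TiB file:
tib = 1 << 40
print(tree_height_for(tib // 4096, 4096))  # height 4 with 4k blocks
print(tree_height_for(tib // 1024, 1024))  # height 5 with 1k blocks
```

With these assumed sizes, a 4k block holds roughly four times the pointers of a
1k block, so the smaller block size needs an extra level of indirection for the
same file, which is what makes the on-stack bmap path allocation a concern.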
I can't think of any other places in the code which might cause a tricky problem
to solve, but that's not to say that there aren't any, so careful testing is
required in this area.
I very much doubt that the fact that our locking is page based (well, it's not
entirely, in fact) is a great concern performance-wise. The lock state is cached,
after all, so it should be no worse than grabbing a mutex, which the VFS does
for us on write, for example, for the most part (and certainly that's true on
single node setups). The glock code could probably do with a bit of optimisation
in this area, but I don't think it's responsible for any big performance problem.
Nate did a quick test on a ppc box with 64k pages and it was able to mount
a 4k file system, and run a few basic tests.
The fact that a 1k filesystem does not work on a 4k page system is probably due
to an assumption being made about 4k filesystems. This should probably be regarded
as a bug, since it may come back and bite > 4k filesystems.
Glocks might be no more expensive than mutexes (I have not tried timing them).
But given that there is no per-page mutex, it seems hard to make the argument
that a per-page glock is not a big performance hit.
Ken even makes note of the fact (in his "thanks for the fish" document) that gfs
does way too much fine-grained locking and that expensive cluster lock/releases
should be reduced as much as possible.
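The cost argument can be made concrete with a simple count of lock operations
for a streaming write (a hedged sketch: it deliberately ignores the lock-state
caching mentioned above, which is exactly the point under dispute):

```python
def lock_ops(io_bytes, page_size=4096, per_page=True):
    """Lock acquire/release pairs for a streaming write:
    one pair per page touched with per-page locking, one pair
    total with per-file locking. Ignores lock-state caching."""
    if per_page:
        return -(-io_bytes // page_size)  # ceil division: pages touched
    return 1

mib = 1 << 20
print(lock_ops(mib, per_page=True))   # 256 pairs for a 1 MiB write
print(lock_ops(mib, per_page=False))  # 1 pair
```

Whether those 256 operations are cheap cached transitions or expensive cluster
round trips is the crux of the disagreement between the two comments above.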
Created attachment 142315 [details]
Patch to disable the -b option in mkfs.gfs2
The executive decision was made to remove the -b option in mkfs.gfs2
until we can get all of this sorted out with the gfs2 kernel.
Fix was committed to CVS in HEAD, RHEL5 and RHEL50.
A package has been built which should help the problem described in
this bug report. This report is therefore being closed with a resolution
of CURRENTRELEASE. You may reopen this bug report if the solution does
not work for you.