Description of problem:
I've been told that gfs2 can only handle a 4K block size, so I made gfs2_convert reject gfs1 file systems with non-4K block sizes (see bz214513). However, mkfs.gfs2 still allows the user to create file systems with non-4K blocks. We need to determine whether the gfs2 kernel can handle these non-4K blocks. If it can, we need to revert the change to gfs2_convert. If it can't, we need to remove the -b parameter from mkfs.gfs2 and its man page.

Version-Release number of selected component (if applicable):
RHEL5

How reproducible:
Always

Steps to Reproduce:
mkfs.gfs2 -b 2048 -t bob_cluster2:lv7 -p lock_dlm -j 2 /dev/bobs_vg/lvol7

Actual results:
mkfs.gfs2 works.

Expected results:
mkfs.gfs2: invalid option -- b

Additional info:
I spoke with Steve Whitehouse about this, and we need to understand where the 4K block size limitation is. Perhaps I've been misled and it handles it properly.
Originally reported by Russell Cattelan, so I'm adding him to the cc list.
We need to understand which direction we are going with this one prior to RHEL5 GA.
Assuming that gfs2 should run in mixed page-size clusters, limiting gfs2 to file system block sizes == page size seems like a major regression/limitation from gfs1. I assume page size == 4k on x86_* is the reason that gfs2 is asserting 4k blocks only.

Note this is the error reported from the kernel when trying to mount a 1k gfs2 file system:

bear-02.lab.msp.redhat.com login:
GFS2: fsid=sda7.0: fatal: invalid metadata block
GFS2: fsid=sda7.0:   bh = 15886 (magic number)
GFS2: fsid=sda7.0:   function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 666
GFS2: fsid=sda7.0: about to withdraw from the cluster
GFS2: fsid=sda7.0: waiting for outstanding I/O
GFS2: fsid=sda7.0: telling LM to withdraw
GFS2: fsid=sda7.0: withdrawn
 [<c04051db>] dump_trace+0x69/0x1af
 [<c0405339>] show_trace_log_lvl+0x18/0x2c
 [<c04058ed>] show_trace+0xf/0x11
 [<c04059ea>] dump_stack+0x15/0x17
 [<f8f7404f>] gfs2_lm_withdraw+0x9a/0xa5 [gfs2]
 [<f8f86280>] gfs2_meta_check_ii+0x51/0x5d [gfs2]
 [<f8f77884>] gfs2_meta_indirect_buffer+0x1e1/0x283 [gfs2]
 [<f8f68101>] gfs2_block_pointers+0x1a2/0x35f [gfs2]
 [<f8f68378>] gfs2_extent_map+0xba/0xff [gfs2]
 [<f8f68554>] gfs2_write_alloc_required+0x197/0x1d3 [gfs2]
 [<f8f83eac>] gfs2_jdesc_check+0x90/0xc5 [gfs2]
 [<f8f7b7a2>] init_journal+0x250/0x3f5 [gfs2]
 [<f8f7b99b>] init_inodes+0x54/0x1da [gfs2]
 [<f8f7c484>] fill_super+0x50e/0x632 [gfs2]
 [<c04756c1>] get_sb_bdev+0xce/0x11c
 [<f8f7b16e>] gfs2_get_sb+0x21/0x3e [gfs2]
 [<c0475279>] vfs_kern_mount+0x83/0xf6
 [<c047532e>] do_kern_mount+0x2d/0x3e
 [<c0488564>] do_mount+0x5fa/0x66d
 [<c048864e>] sys_mount+0x77/0xae
 [<c0404013>] syscall_call+0x7/0xb
Seems like we should allow multiple block sizes unless it muddies the design or manifestly makes the fs less stable. Looking for opinions on this.
If we are going to support mixed page-size nodes in a cluster, I don't see how we have any other option but to support sub-page fs block sizes. I have a feeling that page-based glocking is the biggest reason sub-page IO is going to be difficult. Page-based glocking is probably also responsible for much of the IO performance problems, since the overhead of grabbing and releasing glocks for each page is much higher than file-based glocking.
I agree that support for multiple block sizes would be the best way forward at this stage. So far as I know the current situation is this: we support block sizes only of 4k, and we therefore support multiple blocks per page only when PAGE_SIZE > 4k. I don't have a suitable machine on which to test that, but I don't recall seeing anything in the code to say otherwise.

Using the smaller block sizes is not recommended, mainly due to the large size of (for example) the common metadata header, which means that the tree of indirect pointers for an inode would potentially be much deeper. This may result in us having to review the current policy of allocating the path through the indirect pointer tree on the stack at bmap time. We currently get away with this, although it's not ideal, by virtue of using 16-bit offsets into each block. I can't think of any other places in the code which might cause a tricky problem to solve, but that's not to say that there aren't any, so careful testing is required in this area.

I very much doubt that the fact that our locking is page based (well, it's not entirely, in fact) is a great performance concern. The lock state is cached after all, so for the most part it should be no worse than grabbing a mutex, which the VFS does for us on write, for example (and certainly that's true on single-node setups). The glock code could probably do with a bit of optimisation in this area, but I don't think it's responsible for any big performance problem.
Nate did a quick test on a ppc box with 64k pages and it was able to mount a 4k file system and run a few basic tests. The fact that a 1k file system does not work on a 4k page system is probably an assumption being made about 4k file systems. This should probably be regarded as a bug, since it may come back and bite > 4k file systems.

Glocks might be no more expensive than mutexes (I have not tried timing mutex_lock/unlock yet). But given that there is no per-page mutex, it seems hard to make the argument that a per-page glock is not a big performance hit. Ken even makes note of the fact (in his "thanks for the fish" document) that gfs does way too much fine-grained locking and that expensive cluster lock/releases should be reduced as much as possible.
Created attachment 142315 [details]
Patch to disable the -b option in mkfs.gfs2

The executive decision was made to remove the -b option from mkfs.gfs2 until we can get all of this sorted out with the gfs2 kernel.
Fix was committed to CVS in HEAD, RHEL5 and RHEL50.
A package has been built which should help the problem described in this bug report. This report is therefore being closed with a resolution of CURRENTRELEASE. You may reopen this bug report if the solution does not work for you.