Description of problem:
I've been told that gfs2 can only handle a 4K block size, so I made gfs2_convert reject gfs1 file systems with non-4K block sizes (see bz214513). However, mkfs.gfs2 still allows the user to create file systems with non-4K blocks. We need to determine whether the gfs2 kernel can handle these non-4K blocks. If it can, we need to revert the change to gfs2_convert. If it can't, we need to remove the -b parameter from mkfs.gfs2 and its man page.

Version-Release number of selected component (if applicable):
RHEL5

How reproducible:
Always

Steps to Reproduce:
mkfs.gfs2 -b 2048 -t bob_cluster2:lv7 -p lock_dlm -j 2 /dev/bobs_vg/lvol7

Actual results:
mkfs.gfs2 works.

Expected results:
mkfs.gfs2: invalid option -- b

Additional info:
I spoke with Steve Whitehouse about this, and we need to understand where the 4K block size limitation is. Perhaps I've been misled and it handles it properly.
Originally reported by Russell Cattelan, so I'm adding him to the cc list.
We need to understand which direction we are going with this one prior to RHEL5 GA.
Assuming that gfs2 should run in mixed page-size clusters, limiting gfs2 to file system block sizes == page size seems like a major regression/limitation from gfs1. I assume page size == 4k on x86_* is the reason that gfs2 is asserting 4k blocks only.

Note this is the error reported from the kernel when trying to mount a 1k gfs2 file system:

bear-02.lab.msp.redhat.com login:
GFS2: fsid=sda7.0: fatal: invalid metadata block
GFS2: fsid=sda7.0:   bh = 15886 (magic number)
GFS2: fsid=sda7.0:   function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 666
GFS2: fsid=sda7.0: about to withdraw from the cluster
GFS2: fsid=sda7.0: waiting for outstanding I/O
GFS2: fsid=sda7.0: telling LM to withdraw
GFS2: fsid=sda7.0: withdrawn
 [<c04051db>] dump_trace+0x69/0x1af
 [<c0405339>] show_trace_log_lvl+0x18/0x2c
 [<c04058ed>] show_trace+0xf/0x11
 [<c04059ea>] dump_stack+0x15/0x17
 [<f8f7404f>] gfs2_lm_withdraw+0x9a/0xa5 [gfs2]
 [<f8f86280>] gfs2_meta_check_ii+0x51/0x5d [gfs2]
 [<f8f77884>] gfs2_meta_indirect_buffer+0x1e1/0x283 [gfs2]
 [<f8f68101>] gfs2_block_pointers+0x1a2/0x35f [gfs2]
 [<f8f68378>] gfs2_extent_map+0xba/0xff [gfs2]
 [<f8f68554>] gfs2_write_alloc_required+0x197/0x1d3 [gfs2]
 [<f8f83eac>] gfs2_jdesc_check+0x90/0xc5 [gfs2]
 [<f8f7b7a2>] init_journal+0x250/0x3f5 [gfs2]
 [<f8f7b99b>] init_inodes+0x54/0x1da [gfs2]
 [<f8f7c484>] fill_super+0x50e/0x632 [gfs2]
 [<c04756c1>] get_sb_bdev+0xce/0x11c
 [<f8f7b16e>] gfs2_get_sb+0x21/0x3e [gfs2]
 [<c0475279>] vfs_kern_mount+0x83/0xf6
 [<c047532e>] do_kern_mount+0x2d/0x3e
 [<c0488564>] do_mount+0x5fa/0x66d
 [<c048864e>] sys_mount+0x77/0xae
 [<c0404013>] syscall_call+0x7/0xb
Seems like we should allow multiple block sizes unless it muddies the design or manifestly makes the fs less stable. Looking for opinions on this.
If we are going to support mixed page-size nodes in a cluster, I don't see how we have any other option but to support sub-page fs block sizes. I have a feeling that page-based glocking is the biggest reason sub-page IO is going to be difficult. Page-based glocking is probably also responsible for much of the IO performance problems, since the overhead of grabbing and releasing glocks for each page is much higher than file-based glocking.
I agree that support for multiple block sizes would be the best way forward at this stage. So far as I know the current situation is this: we support block sizes only of 4k, and we therefore support multiple blocks per page only when PAGE_SIZE > 4k. I don't have a suitable machine on which to test that, but I don't recall seeing anything in the code to say otherwise.

Using the smaller block sizes is not recommended, mainly due to the large size of (for example) the common metadata header, which means that the tree of indirect pointers for an inode would potentially be much deeper. This may result in us having to review the current policy of allocating the path through the indirect pointer tree on the stack at bmap time. We currently get away with this, although it's not ideal, by virtue of using 16-bit offsets into each block. I can't think of any other places in the code which might cause a tricky problem to solve, but that's not to say that there aren't any, so careful testing is required in this area.

I very much doubt that the fact that our locking is page based (well, it's not entirely, in fact) is a great performance concern. The lock state is cached after all, so for the most part it should be no worse than grabbing a mutex, which the VFS does for us on write, for example (and certainly that's true on single-node setups). The glock code could probably do with a bit of optimisation in this area, but I don't think it's responsible for any big performance problem.
Nate did a quick test on a ppc box with 64k pages and it was able to mount a 4k file system and run a few basic tests. The fact that a 1k file system does not work on a 4k page system is probably an assumption being made about 4k file systems. This should probably be regarded as a bug, since it may come back and bite > 4k file systems.

Glocks might be no more expensive than mutexes (I have not tried timing mutex_lock/unlock yet). But given that there is no per-page mutex, it seems hard to make the argument that a per-page glock is not a big performance hit. Ken even makes note of the fact (in his "thanks for the fish" document) that gfs does way too much fine-grained locking and that expensive cluster lock/releases should be reduced as much as possible.
Created attachment 142315 [details]
Patch to disable the -b option in mkfs.gfs2

The executive decision was made to remove the -b option from mkfs.gfs2 until we can get all of this sorted out with the gfs2 kernel.
Fix was committed to CVS in HEAD, RHEL5 and RHEL50.
A package has been built which should help the problem described in this bug report. This report is therefore being closed with a resolution of CURRENTRELEASE. You may reopen this bug report if the solution does not work for you.