Created attachment 365392 [details]
GFS2 metadata from drive.

Description of problem:
The GFS2 filesystem appears to have lost a resource group: when we went to look it up, the lookup returned NULL instead of the rgd.

Observed behavior: the gfs2 filesystem withdraws with the following messages:

Sep 18 04:36:57 tchtom04test kernel: GFS2: fsid=testsystem2:gfs_pool2.3: fatal: filesystem consistency error
Sep 18 04:36:57 tchtom04test kernel: GFS2: fsid=testsystem2:gfs_pool2.3: function = rgblk_free, file = fs/gfs2/rgrp.c, line = 1401
Sep 18 04:36:57 tchtom04test kernel: GFS2: fsid=testsystem2:gfs_pool2.3: about to withdraw this file system
Sep 18 04:36:57 tchtom04test kernel: GFS2: fsid=testsystem2:gfs_pool2.3: telling LM to withdraw

Expected results:
The file system should remain healthy and not withdraw.
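For reference, here is a simplified reconstruction of the check that fires here, based only on the messages above (rgblk_free failing to map a block to a resource group); this is a sketch, not the verbatim RHEL5 source, and the surrounding details are assumptions:

/* Simplified sketch of the failing check in rgblk_free() (fs/gfs2/rgrp.c).
 * Types and helpers (gfs2_blk2rgrpd, gfs2_consist, fs_err) come from the
 * GFS2 internal headers; the body shown here is illustrative only. */
static struct gfs2_rgrpd *rgblk_free(struct gfs2_sbd *sdp, u64 bstart,
                                     u32 blen, unsigned char new_state)
{
	struct gfs2_rgrpd *rgd;

	/* Map the starting block to the resource group that covers it. */
	rgd = gfs2_blk2rgrpd(sdp, bstart);
	if (!rgd) {
		/* No rgrp covers this block: the fs is declared
		 * inconsistent and withdraws, producing the
		 * "fatal: filesystem consistency error" seen above. */
		if (gfs2_consist(sdp))
			fs_err(sdp, "block = %llu\n",
			       (unsigned long long)bstart);
		return NULL;
	}
	/* ... clear the bitmap bits for bstart..bstart+blen-1 ... */
	return rgd;
}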
I've run the saved metadata through fsck.gfs2 and it claims that there is nothing wrong with it. Was this metadata saved while the fs was still mounted? It seems that all the journals are dirty, but there are no blocks to replay.
Also, for some reason it looks as if the master inode was not saved. Perhaps an old version of gfs2_edit was used?
Created attachment 365537 [details] gfs2_fsck.log
Created attachment 365539 [details] gfs2_fsck.log2
Created attachment 365863 [details] messages file with backtrace
The consistency error indicates a problem with the free block count in the resource group information. I've recently added better integrity checking of this information in my latest and greatest fsck.gfs2 for bug #455300, which includes a HUGE number of code changes. This new version of fsck.gfs2 is not quite ready yet and needs more testing.

The good thing is that this version has a lot of additional bug fixes and serious performance enhancements, so when it is done, it will be a much better tool. I have been developing this new version for weeks, so it may require several more days of testing and possibly a few more small fixes. I'm hoping to run it through a battery of very nasty fsck.gfs2 tests this weekend.

I will test the new version on the customer's metadata shortly to see how it handles their file system. Is the customer willing to take a test version of fsck.gfs2 rather than a hotfix, at least temporarily, to get them running?
The experimental new RHEL5 version of fsck.gfs2 has passed a lot of tests and is now on my people page at this location:

http://people.redhat.com/rpeterso/Experimental/RHEL5.x/gfs2/fsck.gfs2

I highly recommend making backups and/or saving off the gfs2 metadata (with gfs2_edit savemeta /dev/device ~/some/file) before trying this version of fsck.gfs2.
I never heard back to see if the new fsck.gfs2 fixed the problem. Closing. Feel free to re-open if there's an issue.
While trying to reproduce bug #602025 on an i686 machine in a cluster, and then running my rename benchmark program, bobbench, Jaroslav Kortus seems to have recreated this problem by accident. It is possible that this bug is a duplicate of bug #602025. We will try to recreate the failure again and see if the patch for bug #602025 fixes this problem as well.
I tried running the reproducer for bug #602025 on my own i686 machine and it did not seem to fail. My guess is that it just silently corrupts kernel memory. Meanwhile, Jaroslav recreated the failure by following the same steps he did before. He got:

GFS2: fsid=Z_Cluster:vedder0.1: fatal: filesystem consistency error
GFS2: fsid=Z_Cluster:vedder0.1: function = rgblk_free, file = fs/gfs2/rgrp.c, line = 1401
GFS2: fsid=Z_Cluster:vedder0.1: about to withdraw this file system
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at fs/gfs2/lm.c:107
invalid opcode: 0000 [1] SMP
last sysfs file: /fs/gfs2/Z_Cluster:vedder0/lock_module/recover_done
CPU 0
Modules linked in: nfs fscache lock_dlm gfs2 dlm configfs nfsd exportfs lockd nfs_acl auth_rpcgss autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 xfrm_nalgo crypto_api cpufreq_ondemand acpi_cpufreq freq_table dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev pcspkr i2c_i801 i2c_core ide_cd serio_raw sg i5000_edac edac_mc cdrom bnx2 dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc ata_piix libata shpchp aacraid sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 3660, comm: br Not tainted 2.6.18-128.18.1.el5 #1
RIP: 0010:[<ffffffff8867d541>] [<ffffffff8867d541>] :gfs2:gfs2_lm_withdraw+0x84/0xd0
RSP: 0018:ffff81006440fa38 EFLAGS: 00010202
RAX: 0000000000000046 RBX: ffff81006a161000 RCX: ffffffff802f7aa8
RDX: ffffffff802f7aa8 RSI: 0000000000000000 RDI: ffffffff802f7aa0
RBP: ffff81006a161930 R08: ffffffff802f7aa8 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000080 R12: 0000000000000000
R13: ffff81006456dcc0 R14: 0000000000000001 R15: ffff8100644f39f0
FS: 00002b2d10dbf220(0000) GS:ffffffff803ac000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003364e7a000 CR3: 0000000064b71000 CR4: 00000000000006e0
Process br (pid: 3660, threadinfo ffff81006440e000, task ffff810064b18820)
Stack: 0000003000000030 ffff81006440fb30 ffff81006440fa58 0000000000000050
 000000000000882f ffffffff8002567f ffff81006a161930 ffff81006a161930
 ffffffff88692394 ffffffff88694921 ffff81006467c3d0 ffffffff80013631
Call Trace:
 [<ffffffff8002567f>] find_or_create_page+0x1e/0x75
 [<ffffffff80013631>] find_lock_page+0x26/0xa1
 [<ffffffff8002567f>] find_or_create_page+0x1e/0x75
 [<ffffffff886809e7>] :gfs2:gfs2_getbuf+0x106/0x115
 [<ffffffff886907ff>] :gfs2:gfs2_consist_i+0x27/0x2c
 [<ffffffff8868d2a5>] :gfs2:rgblk_free+0x46/0x159
 [<ffffffff8868d4cc>] :gfs2:gfs2_unlink_di+0x25/0x60
 [<ffffffff8867bd2a>] :gfs2:gfs2_change_nlink+0xf8/0x103
 [<ffffffff88687987>] :gfs2:gfs2_rename+0x467/0x643
 [<ffffffff8868761f>] :gfs2:gfs2_rename+0xff/0x643
 [<ffffffff88687666>] :gfs2:gfs2_rename+0x146/0x643
 [<ffffffff8868768b>] :gfs2:gfs2_rename+0x16b/0x643
 [<ffffffff80124386>] selinux_inode_rename+0x172/0x184
 [<ffffffff8002a52d>] vfs_rename+0x2f4/0x471
 [<ffffffff800366db>] sys_renameat+0x180/0x1eb
 [<ffffffff80066b32>] do_page_fault+0x4fe/0x830
 [<ffffffff80062ff8>] thread_return+0x62/0xfe
 [<ffffffff800b4697>] audit_syscall_entry+0x180/0x1b3
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

I've asked him to apply the patch to fix the problem and he will follow the same steps and see if that patch fixes this issue as well.
Note on recreating this problem, from Jaroslav:

<jkortus> bob: I do mkfs on one node (the only one having that mounted), mount it, run bobrenamer a couple of times (tens of runs), then I run bobbench also like 30 times
<jkortus> and then I mount the FS on 2nd node and run bobrenamer
<jkortus> 0000 103. <-- that's all bobrenamer prints :)
Created attachment 448555 [details] metadata with FS that will trigger the bug
How to trigger the bug:

Easy way:
1. restore metadata from bobrenamer.meta.gz
2. mount the FS and run bobrenamer on it

How to make your own broken metadata:
0. Form a cluster; only one node is required to have the FS mounted
1. create a fresh GFS2 FS (mkfs -t gfs2 -j 4 -J 32 -p lock_dlm -t Z_Cluster:vedder0 /dev/vedder/vedder0 -O) and mount it
2. get bobrenamer
3. run: while true; do ./bobrenamer; done
4. keep it running for 20-60 secs
5. hit CTRL-C
6. umount -at gfs2
7. mount the FS back
8. run bobrenamer once again

A "successful" run looks like this:

(16:50:21) [root@z1:~]$ ./bobrenamer
0000 103. open error! path:/mnt/vedder0/olddir/x

and the FS is withdrawn. If you don't get that error, go back to step 3 and increase the delay.

Reopening the bug; there should be enough data now.
Hi Jaroslav,

I tried the method described in comment #25, but the problem would not recreate for me. I was using the 2.6.18-222.el5 kernel on x86_64, which contains the latest gfs2 patches. In your comment you referred to the test program "bobrenamer", but I've actually got six different versions of that program. I tried the two versions that I thought were most likely to fail, and neither did. Can you try to recreate the problem with the newer kernel? I'd also like to know which version of the test program you were using when you had the problem; perhaps you could email it to me. It's very possible that the patch for bug #602025 fixed the corruption problem described by this bugzilla.
Sep 27 10:47:07 z1 kernel: GFS2: fsid=Z_Cluster:vedder0.0: fatal: filesystem consistency error
Sep 27 10:47:07 z1 kernel: GFS2: fsid=Z_Cluster:vedder0.0: function = rgblk_free, file = fs/gfs2/rgrp.c, line = 1406
Sep 27 10:47:07 z1 kernel: GFS2: fsid=Z_Cluster:vedder0.0: about to withdraw this file system
Sep 27 10:47:07 z1 kernel: GFS2: fsid=Z_Cluster:vedder0.0: telling LM to withdraw
Sep 27 10:47:08 z1 kernel: GFS2: fsid=Z_Cluster:vedder0.0: withdrawn
Sep 27 10:47:08 z1 kernel: [<f919d55c>] gfs2_lm_withdraw+0x8b/0xae [gfs2]
Sep 27 10:47:08 z1 kernel: [<f91afc18>] gfs2_consist_i+0x1c/0x20 [gfs2]
Sep 27 10:47:08 z1 kernel: [<f91ac94a>] rgblk_free+0x3c/0x15e [gfs2]
Sep 27 10:47:08 z1 kernel: [<f91accd5>] gfs2_unlink_di+0x22/0x55 [gfs2]
Sep 27 10:47:08 z1 kernel: [<f919b9b8>] gfs2_change_nlink+0xea/0xf3 [gfs2]
Sep 27 10:47:08 z1 kernel: [<f91a71e8>] gfs2_rename+0x443/0x622 [gfs2]
Sep 27 10:47:08 z1 kernel: [<f91a6e97>] gfs2_rename+0xf2/0x622 [gfs2]
Sep 27 10:47:08 z1 kernel: [<f91a6ee4>] gfs2_rename+0x13f/0x622 [gfs2]
Sep 27 10:47:08 z1 kernel: [<f91a6f0c>] gfs2_rename+0x167/0x622 [gfs2]
Sep 27 10:47:08 z1 kernel: [<c04c8377>] selinux_inode_rename+0x150/0x15a
Sep 27 10:47:08 z1 kernel: [<c0482bf4>] vfs_rename+0x2cd/0x430
Sep 27 10:47:08 z1 kernel: [<c0482ff8>] __lookup_hash+0xc5/0xf3
Sep 27 10:47:08 z1 kernel: [<c0484862>] sys_renameat+0x155/0x1af
Sep 27 10:47:08 z1 kernel: [<c061de82>] schedule+0x9c6/0xa4f
Sep 27 10:47:08 z1 kernel: [<c0407f4b>] do_syscall_trace+0xab/0xb1
Sep 27 10:47:08 z1 kernel: [<c0404f4b>] syscall_call+0x7/0xb
Sep 27 10:47:08 z1 kernel: =======================
Sep 27 10:47:08 z1 kernel: GFS2: fsid=Z_Cluster:vedder0.0: block = 42263

(10:47:47) [root@z1:~]$ uname -a
Linux z1 2.6.18-223.el5PAE #1 SMP Mon Sep 20 15:42:57 EDT 2010 i686 athlon i386 GNU/Linux
(10:47:49) [root@z1:~]$

The procedure was the same again.
Sep 27 10:50:47 z2 kernel: GFS2: fsid=Z_Cluster:vedder0.0: function = rgblk_free, file = fs/gfs2/rgrp.c, line = 1406
Sep 27 10:50:47 z2 kernel: GFS2: fsid=Z_Cluster:vedder0.0: about to withdraw this file system
Sep 27 10:50:47 z2 kernel: GFS2: fsid=Z_Cluster:vedder0.0: telling LM to withdraw
Sep 27 10:50:48 z2 kernel: GFS2: fsid=Z_Cluster:vedder0.0: withdrawn
Sep 27 10:50:48 z2 kernel:
Sep 27 10:50:48 z2 kernel: Call Trace:
Sep 27 10:50:48 z2 kernel: [<ffffffff886c5652>] :gfs2:gfs2_lm_withdraw+0xd1/0xfe
Sep 27 10:50:48 z2 kernel: [<ffffffff80025c33>] find_or_create_page+0x22/0x72
Sep 27 10:50:48 z2 kernel: [<ffffffff80025c33>] find_or_create_page+0x22/0x72
Sep 27 10:50:48 z2 kernel: [<ffffffff886c8c3b>] :gfs2:gfs2_getbuf+0x106/0x115
Sep 27 10:50:48 z2 kernel: [<ffffffff886d8d07>] :gfs2:gfs2_consist_i+0x27/0x2c
Sep 27 10:50:48 z2 kernel: [<ffffffff886d55c7>] :gfs2:rgblk_free+0x46/0x15b
Sep 27 10:50:48 z2 kernel: [<ffffffff886d595c>] :gfs2:gfs2_unlink_di+0x25/0x60
Sep 27 10:50:48 z2 kernel: [<ffffffff886c3c3d>] :gfs2:gfs2_change_nlink+0xf8/0x102
Sep 27 10:50:48 z2 kernel: [<ffffffff886cfc9f>] :gfs2:gfs2_rename+0x470/0x652
Sep 27 10:50:48 z2 kernel: [<ffffffff886cf92f>] :gfs2:gfs2_rename+0x100/0x652
Sep 27 10:50:48 z2 kernel: [<ffffffff886cf975>] :gfs2:gfs2_rename+0x146/0x652
Sep 27 10:50:48 z2 kernel: [<ffffffff886cf99a>] :gfs2:gfs2_rename+0x16b/0x652
Sep 27 10:50:48 z2 kernel: [<ffffffff8012d347>] selinux_inode_rename+0x172/0x184
Sep 27 10:50:48 z2 kernel: [<ffffffff8002a9d7>] vfs_rename+0x2f4/0x471
Sep 27 10:50:48 z2 kernel: [<ffffffff80036d07>] sys_renameat+0x180/0x1eb
Sep 27 10:50:48 z2 kernel: [<ffffffff8006723e>] do_page_fault+0x4fe/0x874
Sep 27 10:50:48 z2 kernel: [<ffffffff8003a859>] tty_ldisc_deref+0x68/0x7b
Sep 27 10:50:48 z2 kernel: [<ffffffff800b94f5>] audit_syscall_entry+0x1a4/0x1cf
Sep 27 10:50:48 z2 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Sep 27 10:50:48 z2 kernel:
Sep 27 10:50:48 z2 kernel: GFS2: fsid=Z_Cluster:vedder0.0: block = 42405

(10:51:06) [root@z2:~]$ uname -a
Linux z2 2.6.18-223.el5 #1 SMP Mon Sep 20 15:37:50 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
I recreated the failure on roth-02. It seems to happen only on the first rename after mount. Possibly while journal recovery is still happening?

[root@roth-02 /home/bob/tools]# umount -at gfs2
[root@roth-02 /home/bob/tools]# mount -tgfs2 /dev/roth_vg/roth_lv /mnt/gfs2
[root@roth-02 /home/bob/tools]# while true ; do ./bobrenamer5 ; done
0000 103. open error! path:/mnt/gfs2/olddir/x

"0000" means it's the very first rename done by that run of the program. The "103" means the file name length is 103. There also might already be a file with that name. I'm going to try to narrow down the problem.
Created attachment 450010 [details]
Try 1 Patch

Well, I figured out what the problem is, at least in Jaroslav's case. The problem is that function gfs2_rename is called before the rgrp list has been brought into memory after a mount. Therefore, the most-recently-used (mru) list of rgrps is empty, and function gfs2_blk2rgrpd() does not find the rgrp it needs in order to unlink the dinode once its directory's link count drops to zero. Having not found the rgrp it needs, it asserts.

This patch circumvents the problem by checking whether the rgrps have been read into memory when the gfs2_rename function starts. This is not an elegant solution, and we may want to consider other solutions. I thought I'd toss this one out there and let Steve Whitehouse devise a better way, as he often does.
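As a minimal sketch of what that idea amounts to, using the structures and helpers named above (sd_rgrps, gfs2_rindex_hold); the actual attachment may differ in detail:

/* Sketch of the "Try 1" workaround: make sure the rgrp list is in core
 * before gfs2_rename() does any block-to-rgrp lookups.  Illustrative
 * only; not the attached patch. */
static int gfs2_rename(struct inode *odir, struct dentry *odentry,
                       struct inode *ndir, struct dentry *ndentry)
{
	struct gfs2_sbd *sdp = GFS2_SB(odir);
	struct gfs2_holder ri_gh;
	int error;

	/* First operation after mount: the rgrp list may still be empty,
	 * in which case gfs2_blk2rgrpd() returns NULL for a perfectly
	 * valid block and the fs withdraws.  Read the rindex now. */
	if (!sdp->sd_rgrps) {
		error = gfs2_rindex_hold(sdp, &ri_gh);
		if (error)
			return error;
		gfs2_glock_dq_uninit(&ri_gh);
	}

	/* ... the rest of the rename path is unchanged ... */
	return 0;
}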
This is a pretty serious bug. I think we should try to squeeze a patch into 5.6 if we can.

I asked Steve about my patch and he said we may need to hold the rindex throughout the rename operation. That's what the unlink code does. My concern is that it may impact rename performance; maybe that's why the unlink code is so slow. My patch may already impact rename performance.

We really should have an overarching approach that makes gfs2 read in the rindex whenever it's needed, and the rgrps and bitmaps only on demand. That would likely make everything run faster. The problem is that's a design change and would be pervasive to the code: something for upstream, perhaps, but we don't have time for that in 5.6. Plus, reading rgrps on an "as needed" basis is very tricky because we need to maintain the proper lock ordering.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
I had some design discussions with Steve Whitehouse about this issue. The thing is, gfs2_rename's call to gfs2_blk2rgrpd, and its resulting scan of the rgrp list, is unprotected. Since it's unprotected, an operation like gfs2_grow, which results in the rgrp list being re-read, could cause a rename to get into serious trouble. At least theoretically; how much trouble is still unknown.

I added some debug instrumentation to tell whether there were other places that called gfs2_blk2rgrpd unprotected. There were, even in other parts of the rename code: it calls gfs2_change_nlink(-1), which sometimes calls gfs2_unlink_di, which calls rgblk_free, which calls gfs2_blk2rgrpd (unprotected).

I tried to fix the situation by making gfs2_rename hold the rindex lock throughout. After all, that's what gfs2_unlink does, so it can't be all bad, right? But that failed miserably. The problem is, rename itself may end up allocating blocks for a new dirent, which calls gfs2_inplace_reserve_i, which also tries to hold the rindex. That results in an assert.

The thing is, the call to gfs2_rindex_hold is a central point of glock contention for anyone doing block allocations, which means it's a big performance pitfall. If we can eliminate those holds we might speed up gfs2 block allocations, renames, and deletes considerably. (That might be why my changes to make the rgrp list into an rbtree didn't show much performance improvement.)

Steve and I did a little brainstorming about how we can improve this situation. We came up with an idea, but it's too big a design change to squeeze into 5.6, especially given the timeframe: we're already well past the 5.6 cutoff for kernel changes. So our current plan is this: for 5.6, I'll go forward with my kludge patch as given above, minus the printk instrumentation I added to make sure it was doing its job properly. For upstream and RHEL6.1, I'll write a prototype of the new idea.

The thing is, gfs2_rindex_hold has two basic uses: (1) it makes sure the rgrp linked list is in memory, and (2) it makes sure the list is the latest version in case someone did a gfs2_grow. Use (1) can probably be protected by a rwsem or spin_lock (i.e., not clustered); we only need the individual rgrp locks to be clustered. Use (2) is very unlikely (how often do gfs2_grows happen?) and can be protected by a cluster-wide glock that takes the same spin_lock to protect the list.

The other idea Steve had was to not re-read the entire list, but just the new rgrps; that way functions like gfs2_blk2rgrpd don't have to worry about the linked list disappearing while it's being used. Of course, that means we would need a "grow-only" policy for gfs2, which we already have (by virtue of having no shrink tool). In other words, we'd have to make special provisions if gfs2_shrink is ever implemented, because that makes it a lot more problematic.

Bottom line: I'm going to implement the kludge in 5.6, ignore the "unprotected" calls to gfs2_blk2rgrpd for now, and open a RHEL6.1 bug to implement a better solution that might speed up gfs2 a bit.
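To make the proposed split concrete, here is a rough model of uses (1) and (2); every name below is invented for the sketch and nothing here exists in GFS2 as written:

#include <linux/rwsem.h>

/* (1) A local, non-clustered rwsem protects the in-core rgrp list, so a
 *     list walk can't race with a re-read.  Only the per-rgrp locks need
 *     to be clustered glocks. */
static DECLARE_RWSEM(rindex_list_sem);

/* Readers: any path that scans the rgrp list (gfs2_blk2rgrpd callers). */
static struct gfs2_rgrpd *blk2rgrpd_protected(struct gfs2_sbd *sdp, u64 blk)
{
	struct gfs2_rgrpd *rgd;

	down_read(&rindex_list_sem);
	rgd = gfs2_blk2rgrpd(sdp, blk);	/* the existing lookup */
	up_read(&rindex_list_sem);
	return rgd;
}

/* (2) A gfs2_grow-driven re-read is rare.  It would take a cluster-wide
 *     glock (not shown) and then the same rwsem for writing while it
 *     updates the list.  With the "append-only" variant Steve suggested,
 *     only new rgrps are added, so readers never see entries vanish. */
static void rindex_update(struct gfs2_sbd *sdp)
{
	down_write(&rindex_list_sem);
	/* ... re-read or append rindex entries here ... */
	up_write(&rindex_list_sem);
}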
Patch posted to rhkernel-list; changing status to POST.
Concerns raised about the patch; resetting status to ASSIGNED.
Created attachment 450571 [details]
Try 3 patch

Steve raised the concern that the call to function gfs2_blk2rgrpd should be protected by the rindex glock. Based on his feedback, I went through the code and backtracked all the code paths that result in a call to gfs2_blk2rgrpd. There were many. The vast majority were protected by the rindex lock. The only code paths that were not protected were in function gfs2_rename, which calls gfs2_blk2rgrpd in three ways:

(1) The failing case, where it tries to find and lock the rgrp associated with the existing file being removed by the rename.
(2) The rename function's call to gfs2_change_nlink(-1) to unlink the dinode being renamed.
(3) The call to gfs2_change_nlink(-1) for directory renames.

To ensure correct lock ordering, I put the rindex lock at the beginning of the function, as it is in the unlink function. However, that left the problem where the new dirent needs to reserve a new block and call gfs2_inplace_reserve, which also tries to hold the rindex. So I had to introduce a way to tell gfs2_inplace_reserve that the rindex is already held. I think I'll wait for Steve to review this patch before posting it.
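As a hedged illustration of that "already held" signal (the parameter name and shape are assumptions, not the actual Try 3 patch):

/* Sketch of the "Try 3" idea: gfs2_rename takes the rindex glock up
 * front and tells the reservation path not to take it again, since
 * taking it twice from the same task trips a glock assert. */
static int inplace_reserve_sketch(struct gfs2_sbd *sdp,
                                  struct gfs2_holder *ri_gh,
                                  int rindex_already_held)
{
	int error = 0;

	/* Skip the rindex hold when the caller (gfs2_rename) already
	 * holds it; otherwise behave as gfs2_inplace_reserve always has. */
	if (!rindex_already_held) {
		error = gfs2_rindex_hold(sdp, ri_gh);
		if (error)
			return error;
	}

	/* ... find an rgrp with enough free blocks and reserve them ... */
	return error;
}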
I POSTed the revised patch to rhkernel-list. Changing status back to POST.
in kernel-2.6.18-227.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
I used the saved metadata to verify that I could recreate the bug on the -226.el5 kernel (pre-fix). I re-ran the test case on -229.el5 and was not able to recreate the bug. Marking VERIFIED.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html