Created attachment 365392 [details]
GFS2 metadata from drive.

Description of problem:
The GFS2 filesystem appears to have lost a resource group: when we went to look it up, the lookup returned NULL instead of the rgd.

Observed behavior: the gfs2 filesystem withdraws with the following messages:

Sep 18 04:36:57 tchtom04test kernel: GFS2: fsid=testsystem2:gfs_pool2.3: fatal: filesystem consistency error
Sep 18 04:36:57 tchtom04test kernel: GFS2: fsid=testsystem2:gfs_pool2.3: function = rgblk_free, file = fs/gfs2/rgrp.c, line = 1401
Sep 18 04:36:57 tchtom04test kernel: GFS2: fsid=testsystem2:gfs_pool2.3: about to withdraw this file system
Sep 18 04:36:57 tchtom04test kernel: GFS2: fsid=testsystem2:gfs_pool2.3: telling LM to withdraw

Expected results:
The file system should remain healthy and not withdraw.
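For reference, here is a simplified reconstruction of the check that fires here, based only on the messages above (rgblk_free failing to map a block to a resource group); this is a sketch, not the verbatim RHEL5 source, and the surrounding details are assumptions:

/* Simplified sketch of the failing check in rgblk_free() (fs/gfs2/rgrp.c).
 * Types and helpers (gfs2_blk2rgrpd, gfs2_consist, fs_err) come from the
 * GFS2 internal headers; the body shown here is illustrative only. */
static struct gfs2_rgrpd *rgblk_free(struct gfs2_sbd *sdp, u64 bstart,
                                     u32 blen, unsigned char new_state)
{
	struct gfs2_rgrpd *rgd;

	/* Map the starting block to the resource group that covers it. */
	rgd = gfs2_blk2rgrpd(sdp, bstart);
	if (!rgd) {
		/* No rgrp covers this block: the fs is declared
		 * inconsistent and withdraws, producing the
		 * "fatal: filesystem consistency error" seen above. */
		if (gfs2_consist(sdp))
			fs_err(sdp, "block = %llu\n",
			       (unsigned long long)bstart);
		return NULL;
	}
	/* ... clear the bitmap bits for bstart..bstart+blen-1 ... */
	return rgd;
}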
I've run the saved metadata through fsck.gfs2 and it claims that there is nothing wrong with it. Was this metadata saved while the fs was still mounted? It seems that all the journals are dirty, but there are no blocks to replay.
Also, for some reason it looks as if the master inode was not saved. Perhaps an old version of gfs2_edit was used?
Created attachment 365537 [details] gfs2_fsck.log
Created attachment 365539 [details] gfs2_fsck.log2
Created attachment 365863 [details] messages file with backtrace
The consistency error indicates a problem with the free block count in the resource group information. I've recently added better integrity checking of this information in my latest and greatest fsck.gfs2 for bug #455300, which includes a HUGE number of code changes. This new version of fsck.gfs2 is not quite ready yet and needs more testing.

The good thing is that this version has a lot of additional bug fixes and serious performance enhancements, so when it is done, it will be a much better tool. I have been developing this new version for weeks, so it may require several more days of testing and possibly a few more small fixes. I'm hoping to run it through a battery of very nasty fsck.gfs2 tests this weekend.

I will test the new version on the customer's metadata shortly to see how it handles their file system. Is the customer willing to take a test version of fsck.gfs2 rather than a hotfix, at least temporarily, to get them running?
The experimental new RHEL5 version of fsck.gfs2 has passed a lot of tests and is now on my people page at this location:

http://people.redhat.com/rpeterso/Experimental/RHEL5.x/gfs2/fsck.gfs2

I highly recommend making backups and/or saving off the gfs2 metadata (with gfs2_edit savemeta /dev/device ~/some/file) before trying this version of fsck.gfs2.
I never heard back to see if the new fsck.gfs2 fixed the problem. Closing. Feel free to re-open if there's an issue.
While trying to reproduce bug #602025 on an i686 machine in a cluster, and then running my rename benchmark program, bobbench, Jaroslav Kortus seems to have recreated this problem by accident. It is possible that this bug is a duplicate of bug #602025. We will try to recreate the failure again and see if the patch for bug #602025 fixes this problem as well.
I tried running the reproducer for bug #602025 on my own i686 machine and it did not seem to fail. My guess is that it just silently corrupts kernel memory. Meanwhile, Jaroslav recreated the failure by following the same steps he did before. He got:

GFS2: fsid=Z_Cluster:vedder0.1: fatal: filesystem consistency error
GFS2: fsid=Z_Cluster:vedder0.1: function = rgblk_free, file = fs/gfs2/rgrp.c, line = 1401
GFS2: fsid=Z_Cluster:vedder0.1: about to withdraw this file system
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at fs/gfs2/lm.c:107
invalid opcode: 0000 [1] SMP
last sysfs file: /fs/gfs2/Z_Cluster:vedder0/lock_module/recover_done
CPU 0
Modules linked in: nfs fscache lock_dlm gfs2 dlm configfs nfsd exportfs lockd nfs_acl auth_rpcgss autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 xfrm_nalgo crypto_api cpufreq_ondemand acpi_cpufreq freq_table dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev pcspkr i2c_i801 i2c_core ide_cd serio_raw sg i5000_edac edac_mc cdrom bnx2 dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc ata_piix libata shpchp aacraid sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 3660, comm: br Not tainted 2.6.18-128.18.1.el5 #1
RIP: 0010:[<ffffffff8867d541>] [<ffffffff8867d541>] :gfs2:gfs2_lm_withdraw+0x84/0xd0
RSP: 0018:ffff81006440fa38 EFLAGS: 00010202
RAX: 0000000000000046 RBX: ffff81006a161000 RCX: ffffffff802f7aa8
RDX: ffffffff802f7aa8 RSI: 0000000000000000 RDI: ffffffff802f7aa0
RBP: ffff81006a161930 R08: ffffffff802f7aa8 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000080 R12: 0000000000000000
R13: ffff81006456dcc0 R14: 0000000000000001 R15: ffff8100644f39f0
FS: 00002b2d10dbf220(0000) GS:ffffffff803ac000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003364e7a000 CR3: 0000000064b71000 CR4: 00000000000006e0
Process br (pid: 3660, threadinfo ffff81006440e000, task ffff810064b18820)
Stack: 0000003000000030 ffff81006440fb30 ffff81006440fa58 0000000000000050
 000000000000882f ffffffff8002567f ffff81006a161930 ffff81006a161930
 ffffffff88692394 ffffffff88694921 ffff81006467c3d0 ffffffff80013631
Call Trace:
 [<ffffffff8002567f>] find_or_create_page+0x1e/0x75
 [<ffffffff80013631>] find_lock_page+0x26/0xa1
 [<ffffffff8002567f>] find_or_create_page+0x1e/0x75
 [<ffffffff886809e7>] :gfs2:gfs2_getbuf+0x106/0x115
 [<ffffffff886907ff>] :gfs2:gfs2_consist_i+0x27/0x2c
 [<ffffffff8868d2a5>] :gfs2:rgblk_free+0x46/0x159
 [<ffffffff8868d4cc>] :gfs2:gfs2_unlink_di+0x25/0x60
 [<ffffffff8867bd2a>] :gfs2:gfs2_change_nlink+0xf8/0x103
 [<ffffffff88687987>] :gfs2:gfs2_rename+0x467/0x643
 [<ffffffff8868761f>] :gfs2:gfs2_rename+0xff/0x643
 [<ffffffff88687666>] :gfs2:gfs2_rename+0x146/0x643
 [<ffffffff8868768b>] :gfs2:gfs2_rename+0x16b/0x643
 [<ffffffff80124386>] selinux_inode_rename+0x172/0x184
 [<ffffffff8002a52d>] vfs_rename+0x2f4/0x471
 [<ffffffff800366db>] sys_renameat+0x180/0x1eb
 [<ffffffff80066b32>] do_page_fault+0x4fe/0x830
 [<ffffffff80062ff8>] thread_return+0x62/0xfe
 [<ffffffff800b4697>] audit_syscall_entry+0x180/0x1b3
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

I've asked him to apply the patch to fix the problem and he will follow the same steps and see if that patch fixes this issue as well.
Note on recreating this problem, from Jaroslav:

<jkortus> bob: I do mkfs on one node (the only one having that mounted), mount it, run bobrenamer a couple of times (tens of runs), then I run bobbench also like 30 times
<jkortus> and then I mount the FS on 2nd node and run bobrenamer
<jkortus> 0000 103. <-- that's all bobrenamer prints :)
Created attachment 448555 [details] metadata with FS that will trigger the bug
How to trigger the bug:

Easy way:
1. restore metadata from bobrenamer.meta.gz
2. mount the FS and run bobrenamer on it

How to make your own broken metadata:
0. Form a cluster; only one node is required to have the FS mounted
1. create a fresh GFS2 FS (mkfs -t gfs2 -j 4 -J 32 -p lock_dlm -t Z_Cluster:vedder0 /dev/vedder/vedder0 -O) and mount it
2. get bobrenamer
3. run: while true; do ./bobrenamer; done
4. keep it running for 20-60 secs
5. hit CTRL-C
6. umount -at gfs2
7. mount the FS back
8. run bobrenamer once again

A "successful" run looks like this:

(16:50:21) [root@z1:~]$ ./bobrenamer
0000 103. open error! path:/mnt/vedder0/olddir/x

and the FS is withdrawn. If you don't get that error, go back to step 3 and increase the delay.

Reopening the bug; there should be enough data now.
Hi Jaroslav,

I tried the method described in comment #25, but the problem would not recreate for me. I was using the 2.6.18-222.el5 kernel on x86_64, which contains the latest gfs2 patches. In your comment you referred to the test program "bobrenamer", but I've actually got six different versions of that program. I tried the two versions that I thought were most likely to fail, and neither did. Can you try to recreate the problem with the newer kernel? I'd also like to know which version of the test program you were using when you had the problem; perhaps you could email it to me. It's very possible that the patch for bug #602025 fixed the corruption problem described by this bugzilla.
Sep 27 10:47:07 z1 kernel: GFS2: fsid=Z_Cluster:vedder0.0: fatal: filesystem consistency error
Sep 27 10:47:07 z1 kernel: GFS2: fsid=Z_Cluster:vedder0.0: function = rgblk_free, file = fs/gfs2/rgrp.c, line = 1406
Sep 27 10:47:07 z1 kernel: GFS2: fsid=Z_Cluster:vedder0.0: about to withdraw this file system
Sep 27 10:47:07 z1 kernel: GFS2: fsid=Z_Cluster:vedder0.0: telling LM to withdraw
Sep 27 10:47:08 z1 kernel: GFS2: fsid=Z_Cluster:vedder0.0: withdrawn
Sep 27 10:47:08 z1 kernel: [<f919d55c>] gfs2_lm_withdraw+0x8b/0xae [gfs2]
Sep 27 10:47:08 z1 kernel: [<f91afc18>] gfs2_consist_i+0x1c/0x20 [gfs2]
Sep 27 10:47:08 z1 kernel: [<f91ac94a>] rgblk_free+0x3c/0x15e [gfs2]
Sep 27 10:47:08 z1 kernel: [<f91accd5>] gfs2_unlink_di+0x22/0x55 [gfs2]
Sep 27 10:47:08 z1 kernel: [<f919b9b8>] gfs2_change_nlink+0xea/0xf3 [gfs2]
Sep 27 10:47:08 z1 kernel: [<f91a71e8>] gfs2_rename+0x443/0x622 [gfs2]
Sep 27 10:47:08 z1 kernel: [<f91a6e97>] gfs2_rename+0xf2/0x622 [gfs2]
Sep 27 10:47:08 z1 kernel: [<f91a6ee4>] gfs2_rename+0x13f/0x622 [gfs2]
Sep 27 10:47:08 z1 kernel: [<f91a6f0c>] gfs2_rename+0x167/0x622 [gfs2]
Sep 27 10:47:08 z1 kernel: [<c04c8377>] selinux_inode_rename+0x150/0x15a
Sep 27 10:47:08 z1 kernel: [<c0482bf4>] vfs_rename+0x2cd/0x430
Sep 27 10:47:08 z1 kernel: [<c0482ff8>] __lookup_hash+0xc5/0xf3
Sep 27 10:47:08 z1 kernel: [<c0484862>] sys_renameat+0x155/0x1af
Sep 27 10:47:08 z1 kernel: [<c061de82>] schedule+0x9c6/0xa4f
Sep 27 10:47:08 z1 kernel: [<c0407f4b>] do_syscall_trace+0xab/0xb1
Sep 27 10:47:08 z1 kernel: [<c0404f4b>] syscall_call+0x7/0xb
Sep 27 10:47:08 z1 kernel: =======================
Sep 27 10:47:08 z1 kernel: GFS2: fsid=Z_Cluster:vedder0.0: block = 42263

(10:47:47) [root@z1:~]$ uname -a
Linux z1 2.6.18-223.el5PAE #1 SMP Mon Sep 20 15:42:57 EDT 2010 i686 athlon i386 GNU/Linux
(10:47:49) [root@z1:~]$

The procedure was the same again.
Sep 27 10:50:47 z2 kernel: GFS2: fsid=Z_Cluster:vedder0.0: function = rgblk_free, file = fs/gfs2/rgrp.c, line = 1406
Sep 27 10:50:47 z2 kernel: GFS2: fsid=Z_Cluster:vedder0.0: about to withdraw this file system
Sep 27 10:50:47 z2 kernel: GFS2: fsid=Z_Cluster:vedder0.0: telling LM to withdraw
Sep 27 10:50:48 z2 kernel: GFS2: fsid=Z_Cluster:vedder0.0: withdrawn
Sep 27 10:50:48 z2 kernel:
Sep 27 10:50:48 z2 kernel: Call Trace:
Sep 27 10:50:48 z2 kernel: [<ffffffff886c5652>] :gfs2:gfs2_lm_withdraw+0xd1/0xfe
Sep 27 10:50:48 z2 kernel: [<ffffffff80025c33>] find_or_create_page+0x22/0x72
Sep 27 10:50:48 z2 kernel: [<ffffffff80025c33>] find_or_create_page+0x22/0x72
Sep 27 10:50:48 z2 kernel: [<ffffffff886c8c3b>] :gfs2:gfs2_getbuf+0x106/0x115
Sep 27 10:50:48 z2 kernel: [<ffffffff886d8d07>] :gfs2:gfs2_consist_i+0x27/0x2c
Sep 27 10:50:48 z2 kernel: [<ffffffff886d55c7>] :gfs2:rgblk_free+0x46/0x15b
Sep 27 10:50:48 z2 kernel: [<ffffffff886d595c>] :gfs2:gfs2_unlink_di+0x25/0x60
Sep 27 10:50:48 z2 kernel: [<ffffffff886c3c3d>] :gfs2:gfs2_change_nlink+0xf8/0x102
Sep 27 10:50:48 z2 kernel: [<ffffffff886cfc9f>] :gfs2:gfs2_rename+0x470/0x652
Sep 27 10:50:48 z2 kernel: [<ffffffff886cf92f>] :gfs2:gfs2_rename+0x100/0x652
Sep 27 10:50:48 z2 kernel: [<ffffffff886cf975>] :gfs2:gfs2_rename+0x146/0x652
Sep 27 10:50:48 z2 kernel: [<ffffffff886cf99a>] :gfs2:gfs2_rename+0x16b/0x652
Sep 27 10:50:48 z2 kernel: [<ffffffff8012d347>] selinux_inode_rename+0x172/0x184
Sep 27 10:50:48 z2 kernel: [<ffffffff8002a9d7>] vfs_rename+0x2f4/0x471
Sep 27 10:50:48 z2 kernel: [<ffffffff80036d07>] sys_renameat+0x180/0x1eb
Sep 27 10:50:48 z2 kernel: [<ffffffff8006723e>] do_page_fault+0x4fe/0x874
Sep 27 10:50:48 z2 kernel: [<ffffffff8003a859>] tty_ldisc_deref+0x68/0x7b
Sep 27 10:50:48 z2 kernel: [<ffffffff800b94f5>] audit_syscall_entry+0x1a4/0x1cf
Sep 27 10:50:48 z2 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Sep 27 10:50:48 z2 kernel:
Sep 27 10:50:48 z2 kernel: GFS2: fsid=Z_Cluster:vedder0.0: block = 42405

(10:51:06) [root@z2:~]$ uname -a
Linux z2 2.6.18-223.el5 #1 SMP Mon Sep 20 15:37:50 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
I recreated the failure on roth-02. It seems to happen only on the first rename after mount. Possibly while journal recovery is still happening?

[root@roth-02 /home/bob/tools]# umount -at gfs2
[root@roth-02 /home/bob/tools]# mount -tgfs2 /dev/roth_vg/roth_lv /mnt/gfs2
[root@roth-02 /home/bob/tools]# while true ; do ./bobrenamer5 ; done
0000 103. open error! path:/mnt/gfs2/olddir/x

"0000" means it's the very first rename done by that run of the program. The "103" means the file name length is 103. There also might already be a file with that name. I'm going to try to narrow down the problem.
Created attachment 450010 [details]
Try 1 Patch

Well, I figured out what the problem is, at least in Jaroslav's case. The problem is that function gfs2_rename is called before the rgrp list has been brought into memory after a mount. Therefore, the most-recently-used (mru) list of rgrps is empty, and function gfs2_blk2rgrpd() does not find the rgrp it needs in order to unlink the dinode once its directory's link count drops to zero. Having not found the rgrp it needs, it asserts.

This patch circumvents the problem by checking whether the rgrps have been read into memory when the gfs2_rename function starts. This is not an elegant solution, and we may want to consider other solutions. I thought I'd toss this one out there and let Steve Whitehouse devise a better way, as he often does.
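As a minimal sketch of what that idea amounts to, using the structures and helpers named above (sd_rgrps, gfs2_rindex_hold); the actual attachment may differ in detail:

/* Sketch of the "Try 1" workaround: make sure the rgrp list is in core
 * before gfs2_rename() does any block-to-rgrp lookups.  Illustrative
 * only; not the attached patch. */
static int gfs2_rename(struct inode *odir, struct dentry *odentry,
                       struct inode *ndir, struct dentry *ndentry)
{
	struct gfs2_sbd *sdp = GFS2_SB(odir);
	struct gfs2_holder ri_gh;
	int error;

	/* First operation after mount: the rgrp list may still be empty,
	 * in which case gfs2_blk2rgrpd() returns NULL for a perfectly
	 * valid block and the fs withdraws.  Read the rindex now. */
	if (!sdp->sd_rgrps) {
		error = gfs2_rindex_hold(sdp, &ri_gh);
		if (error)
			return error;
		gfs2_glock_dq_uninit(&ri_gh);
	}

	/* ... the rest of the rename path is unchanged ... */
	return 0;
}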
This is a pretty serious bug. I think we should try to squeeze a patch into 5.6 if we can.

I asked Steve about my patch and he said we may need to hold the rindex throughout the rename operation. That's what the unlink code does. My concern is that it may impact rename performance; maybe that's why the unlink code is so slow. My patch may already impact rename performance.

We really should have an overarching approach that makes gfs2 read in the rindex whenever it's needed, and the rgrps and bitmaps only on demand. That would likely make everything run faster. The problem is that's a design change and would be pervasive to the code: something for upstream, perhaps, but we don't have time for that in 5.6. Plus, reading rgrps on an "as needed" basis is very tricky because we need to maintain the proper lock ordering.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
I had some design discussions with Steve Whitehouse about this issue. The thing is, gfs2_rename's call to gfs2_blk2rgrpd, and its resulting scan of the rgrp list, is unprotected. Since it's unprotected, an operation like gfs2_grow, which results in the rgrp list being re-read, could cause a rename to get into serious trouble. At least theoretically; how much trouble is still unknown.

I added some debug instrumentation to tell whether there were other places that called gfs2_blk2rgrpd unprotected. There were, even in other parts of the rename code: it calls gfs2_change_nlink(-1), which sometimes calls gfs2_unlink_di, which calls rgblk_free, which calls gfs2_blk2rgrpd (unprotected).

I tried to fix the situation by making gfs2_rename hold the rindex lock throughout. After all, that's what gfs2_unlink does, so it can't be all bad, right? But that failed miserably. The problem is, rename itself may end up allocating blocks for a new dirent, which calls gfs2_inplace_reserve_i, which also tries to hold the rindex. That results in an assert.

The thing is, the call to gfs2_rindex_hold is a central point of glock contention for anyone doing block allocations, which means it's a big performance pitfall. If we can eliminate those holds we might speed up gfs2 block allocations, renames, and deletes considerably. (That might be why my changes to make the rgrp list into an rbtree didn't show much performance improvement.)

Steve and I did a little brainstorming about how we can improve this situation. We came up with an idea, but it's too big a design change to squeeze into 5.6, especially given the timeframe: we're already well past the 5.6 cutoff for kernel changes. So our current plan is this: for 5.6, I'll go forward with my kludge patch as given above, minus the printk instrumentation I added to make sure it was doing its job properly. For upstream and RHEL6.1, I'll write a prototype of the new idea.

The thing is, gfs2_rindex_hold has two basic uses: (1) it makes sure the rgrp linked list is in memory, and (2) it makes sure the list is the latest version in case someone did a gfs2_grow. Use (1) can probably be protected by a rwsem or spin_lock (i.e., not clustered); we only need the individual rgrp locks to be clustered. Use (2) is very unlikely (how often do gfs2_grows happen?) and can be protected by a cluster-wide glock that takes the same spin_lock to protect the list.

The other idea Steve had was to not re-read the entire list, but just the new rgrps; that way functions like gfs2_blk2rgrpd don't have to worry about the linked list disappearing while it's being used. Of course, that means we would need a "grow-only" policy for gfs2, which we already have (by virtue of having no shrink tool). In other words, we'd have to make special provisions if gfs2_shrink is ever implemented, because that makes it a lot more problematic.

Bottom line: I'm going to implement the kludge in 5.6, ignore the "unprotected" calls to gfs2_blk2rgrpd for now, and open a RHEL6.1 bug to implement a better solution that might speed up gfs2 a bit.
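To make the proposed split concrete, here is a rough model of uses (1) and (2); every name below is invented for the sketch and nothing here exists in GFS2 as written:

#include <linux/rwsem.h>

/* (1) A local, non-clustered rwsem protects the in-core rgrp list, so a
 *     list walk can't race with a re-read.  Only the per-rgrp locks need
 *     to be clustered glocks. */
static DECLARE_RWSEM(rindex_list_sem);

/* Readers: any path that scans the rgrp list (gfs2_blk2rgrpd callers). */
static struct gfs2_rgrpd *blk2rgrpd_protected(struct gfs2_sbd *sdp, u64 blk)
{
	struct gfs2_rgrpd *rgd;

	down_read(&rindex_list_sem);
	rgd = gfs2_blk2rgrpd(sdp, blk);	/* the existing lookup */
	up_read(&rindex_list_sem);
	return rgd;
}

/* (2) A gfs2_grow-driven re-read is rare.  It would take a cluster-wide
 *     glock (not shown) and then the same rwsem for writing while it
 *     updates the list.  With the "append-only" variant Steve suggested,
 *     only new rgrps are added, so readers never see entries vanish. */
static void rindex_update(struct gfs2_sbd *sdp)
{
	down_write(&rindex_list_sem);
	/* ... re-read or append rindex entries here ... */
	up_write(&rindex_list_sem);
}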
Patch posted to rhkernel-list; changing status to POST.
Concerns raised about the patch; resetting status to ASSIGNED.
Created attachment 450571 [details]
Try 3 patch

Steve raised the concern that the call to function gfs2_blk2rgrpd should be protected by the rindex glock. Based on his feedback, I went through the code and backtracked all the code paths that result in a call to gfs2_blk2rgrpd. There were many. The vast majority were protected by the rindex lock. The only code paths that were not protected were in function gfs2_rename, which calls gfs2_blk2rgrpd in three ways:

(1) The failing case, where it tries to find and lock the rgrp associated with the existing file being removed by the rename.
(2) The rename function's call to gfs2_change_nlink(-1) to unlink the dinode being renamed.
(3) The call to gfs2_change_nlink(-1) for directory renames.

To ensure correct lock ordering, I put the rindex lock at the beginning of the function, as it is in the unlink function. However, that left the problem where the new dirent needs to reserve a new block and call gfs2_inplace_reserve, which also tries to hold the rindex. So I had to introduce a way to tell gfs2_inplace_reserve that the rindex is already held. I think I'll wait for Steve to review this patch before posting it.
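As a hedged illustration of that "already held" signal (the parameter name and shape are assumptions, not the actual Try 3 patch):

/* Sketch of the "Try 3" idea: gfs2_rename takes the rindex glock up
 * front and tells the reservation path not to take it again, since
 * taking it twice from the same task trips a glock assert. */
static int inplace_reserve_sketch(struct gfs2_sbd *sdp,
                                  struct gfs2_holder *ri_gh,
                                  int rindex_already_held)
{
	int error = 0;

	/* Skip the rindex hold when the caller (gfs2_rename) already
	 * holds it; otherwise behave as gfs2_inplace_reserve always has. */
	if (!rindex_already_held) {
		error = gfs2_rindex_hold(sdp, ri_gh);
		if (error)
			return error;
	}

	/* ... find an rgrp with enough free blocks and reserve them ... */
	return error;
}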
I POSTed the revised patch to rhkernel-list. Changing status back to POST.
in kernel-2.6.18-227.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
I used the saved metadata to verify that I could recreate the bug on the -226.el5 kernel (pre-fix). I re-ran the test case on -229.el5 and was not able to recreate the bug. Marking VERIFIED.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html