Bug 471141
| Summary: | GFS: filesystem consistency error, function = gfs_setbit |
|---|---|
| Product: | Red Hat Enterprise Linux 5 |
| Reporter: | Nate Straz <nstraz> |
| Component: | gfs-kmod |
| Assignee: | Robert Peterson <rpeterso> |
| Status: | CLOSED INSUFFICIENT_DATA |
| QA Contact: | Cluster QE <mspqa-list> |
| Severity: | medium |
| Priority: | medium |
| Version: | 5.3 |
| CC: | ben.yarwood, iannis, lmcilroy, swhiteho, tao |
| Target Milestone: | rc |
| Target Release: | 5.5 |
| Hardware: | All |
| OS: | Linux |
| Doc Type: | Bug Fix |
| Last Closed: | 2010-09-08 10:04:44 UTC |
| Bug Blocks: | 590060 |
| Attachments: | Preliminary Instrumentation (doesn't work); Better Instrumentation |
Description (Nate Straz, 2008-11-11 23:06:44 UTC)
Since the block was being freed (according to the backtrace), it looks like the block was, for some reason, found to already be marked free in the rgrp. This sounds like it might well be related to the other problem, where blocks were being corrupted. Maybe the rgrp's bitmaps are somehow incorrect?

Looks like I may have hit this while doing testing with MarkLogic. I'm capturing metadata now with gfs2_edit to hand to Bob for analysis.

GFS: fsid=north:northlv.11: fatal: filesystem consistency error
GFS: fsid=north:northlv.11: RG = 140760734
GFS: fsid=north:northlv.11: function = gfs_setbit
GFS: fsid=north:northlv.11: file = /builddir/build/BUILD/gfs-kmod-0.1.31/_kmod_build_/src/gfs/bits.c, line = 81
GFS: fsid=north:northlv.11: time = 1238647878
GFS: fsid=north:northlv.11: about to withdraw from the cluster
GFS: fsid=north:northlv.11: telling LM to withdraw
GFS: fsid=north:northlv.11: withdrawn

Call Trace:
 [<ffffffff8867afc0>] :gfs:gfs_lm_withdraw+0xc4/0xd3
 [<ffffffff80013617>] find_lock_page+0x26/0xa1
 [<ffffffff800d6d4c>] __cache_alloc_node+0x9d/0xd2
 [<ffffffff88692836>] :gfs:gmalloc+0xf/0x3a
 [<ffffffff88668e76>] :gfs:gfs_dpin+0x1b2/0x20c
 [<ffffffff88692ad0>] :gfs:gfs_consist_rgrpd_i+0x3c/0x41
 [<ffffffff8868e210>] :gfs:blkfree_internal+0x125/0x145
 [<ffffffff8868e24a>] :gfs:gfs_difree_uninit+0x1a/0xc9
 [<ffffffff8868e30f>] :gfs:gfs_difree+0x16/0x45
 [<ffffffff88675956>] :gfs:gfs_inode_dealloc+0x4f8/0x53c
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff88692023>] :gfs:gfs_unlinked_dealloc+0x31/0xb3
 [<ffffffff886671e8>] :gfs:gfs_inoded+0x0/0x44
 [<ffffffff886671f6>] :gfs:gfs_inoded+0xe/0x44
 [<ffffffff80032360>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032262>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Version-Release number of selected component (if applicable):
kernel-2.6.18-128.el5
kmod-gfs-0.1.31-3.el5

While investigating the corruption with gfs_fsck, Bob noted:

<bob> Well, it's hard to say how it got that way. I'm trying to find the files that correspond to the dinodes in order to possibly track down what was happening to them when they got the same blocks
<bob> For example, the first duplicate block found is 0x86478d0. That block is claimed as data by a dinode at 0x19a77a0f, but it's also claimed as data by a dinode at 0x1eb112e3

This corresponds to the following two files:

430406159 2856 -rw-r--r-- 1 daemon daemon 2911568 Apr 2 00:49 ./NORTH_F6/00000190/d853d1ac0be720f2_string_
514921187  216 -rw-r--r-- 1 daemon daemon  213888 Apr 2 00:49 ./NORTH_F10/00000190/a6cc4def48d472ab_date

The file in NORTH_F10 is one that north-10 could have been modifying just before it panicked as per comment #2.

A dump of the inode structures with gfs2_edit shows that both nodes were allocating to the same resource group within 20 seconds of each other; note that di_goal_data is 140760734 (0x863d69e) in both dinodes, the same RG reported in the consistency error above.

Block #430406159 (0x19a77a0f) of 609826816 (0x24593800) (disk inode)
...
di_size      2911568     0x2c6d50
di_blocks    714         0x2ca
di_atime     1238647797  0x49d443f5
di_mtime     1238647797  0x49d443f5
di_ctime     1238647797  0x49d443f5
di_goal_meta 430405676   0x19a7782c
di_goal_data 140760734   0x863d69e

Block #514921187 (0x1eb112e3) of 609826816 (0x24593800) (disk inode)
...
di_size      213888      0x34380
di_blocks    54          0x36
di_atime     1238647778  0x49d443e2
di_mtime     1238647778  0x49d443e2
di_ctime     1238647778  0x49d443e2
di_goal_meta 514873846   0x1eb059f6
di_goal_data 140760734   0x863d69e

The block maps show that the entirety of inode 0x1eb112e3's allocations fall within inode 0x19a77a0f's allocations.
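For context on the assertion that fires: GFS tracks each block's state with a couple of bits in the rgrp bitmap, and freeing a block the bitmap already shows as free is exactly the inconsistency reported above before the withdraw (the backtrace shows blkfree_internal hitting it while gfs_inoded deallocates an unlinked inode). Below is a minimal userspace sketch of that kind of check; the state names, widths, and helpers are invented for illustration and are not the actual code in src/gfs/bits.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical state values; the real gfs-kmod definitions differ. */
#define BLKST_FREE 0
#define BLKST_USED 1

#define STATES_PER_BYTE 4 /* 8 bits / 2 state bits per fs block */

static unsigned get_state(const uint8_t *bitmap, uint32_t blk)
{
    unsigned shift = (blk % STATES_PER_BYTE) * 2;
    return (bitmap[blk / STATES_PER_BYTE] >> shift) & 0x3;
}

static void set_state(uint8_t *bitmap, uint32_t blk, unsigned state)
{
    unsigned shift = (blk % STATES_PER_BYTE) * 2;
    uint8_t *byte = &bitmap[blk / STATES_PER_BYTE];

    /* The gfs_setbit-style consistency check: freeing a block the
     * bitmap already shows as free means the bitmap disagrees with
     * the rest of the metadata, so the fs withdraws rather than
     * compound the damage. */
    if (state == BLKST_FREE)
        assert(get_state(bitmap, blk) == BLKST_USED);

    *byte = (uint8_t)((*byte & ~(0x3 << shift)) | (state << shift));
}

int main(void)
{
    uint8_t bitmap[16];
    memset(bitmap, 0, sizeof(bitmap)); /* all blocks start free */

    set_state(bitmap, 5, BLKST_USED);  /* allocate block 5 */
    set_state(bitmap, 5, BLKST_FREE);  /* free it: fine */
    set_state(bitmap, 5, BLKST_FREE);  /* double free: assert fires */
    printf("never reached\n");
    return 0;
}
```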
Created attachment 338905
Preliminary Instrumentation (doesn't work)
This patch adds instrumentation to GFS to sanity-check the rgrp
information using lvbs passed back in the rgrp glocks. It doesn't
behave as I expect it should. The goal here is to try to detect
when gfs is getting the wrong data for an rgrp that was written on
a different node.
My theory is that the rgrp glock is released after the rgrp has been
committed to the journal, but the "real" block hasn't yet landed on
disk. Thus, gfs would get the previous "generation" of the rgrp and
perhaps assign a now-used block for another purpose.
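To make the instrumentation idea concrete, here is a minimal userspace sketch of a generation check carried in the lock value block. Everything here is an assumption for illustration: the struct layout, the magic value, and the idea that a generation counter is stamped both into the LVB at glock release and into the rgrp when it is committed. It is not the code in the attachment.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical LVB payload; packed because alignment mismatches are
 * exactly the kind of thing the follow-up attachment had to fix. */
struct rgrp_lvb {
    uint32_t magic;       /* marks the LVB as initialized */
    uint32_t generation;  /* bumped each time a node dirties the rgrp */
    uint32_t free_blocks; /* free-block count at last release */
} __attribute__((packed));

#define RGRP_LVB_MAGIC 0x72677242u /* invented value */

/* On releasing the rgrp glock: stamp what we committed. */
static void rgrp_lvb_stamp(struct rgrp_lvb *lvb,
                           uint32_t generation, uint32_t free_blocks)
{
    lvb->magic = RGRP_LVB_MAGIC;
    lvb->generation = generation;
    lvb->free_blocks = free_blocks;
}

/* On acquiring the rgrp glock: compare the LVB against what we just
 * read from disk. A disk copy older than the LVB means we are about
 * to allocate from a stale "generation" of the rgrp. */
static int rgrp_lvb_verify(const struct rgrp_lvb *lvb,
                           uint32_t disk_generation, uint32_t disk_free)
{
    if (lvb->magic != RGRP_LVB_MAGIC)
        return 0; /* no node has stamped it yet; nothing to compare */

    if (disk_generation < lvb->generation) {
        fprintf(stderr, "stale rgrp: disk gen %u < lvb gen %u\n",
                disk_generation, lvb->generation);
        return -1;
    }
    if (disk_generation == lvb->generation &&
        disk_free != lvb->free_blocks) {
        fprintf(stderr, "rgrp free count mismatch: disk %u, lvb %u\n",
                disk_free, lvb->free_blocks);
        return -1;
    }
    return 0;
}

int main(void)
{
    struct rgrp_lvb lvb = { 0 };

    rgrp_lvb_stamp(&lvb, 7, 12345);  /* node A releases the glock */
    /* node B then reads a stale rgrp still at generation 6 */
    return rgrp_lvb_verify(&lvb, 6, 12345) ? 1 : 0;
}
```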
Created attachment 338931
Better Instrumentation
This version corrects a few problems, including an alignment problem
that was causing all kinds of havoc.
I'd like for Nate or someone to try to recreate similar file system corruption using the instrumentation in the previous patch, and to monitor and report any kernel messages it kicks out. BTW, the patch applies to the gfs-kernel source on top of the patch for bug #491369. There's a remote possibility that the problem is fixed by the patch for bug #491369, so if we can't recreate the corruption with the instrumented version, we should back out the instrumentation and run with just that patch. I can provide a binary version of gfs.ko if desired. I'm setting the NEEDINFO bit until I can get that info.

I have another theory as to how we can get gfs file system corruption. This is just a theory; perhaps someone can punch holes in it.

Suppose we have two processes waiting for an rgrp: one for block allocation and another just for statfs, i.e. a df process. Suppose they request the rgrp at roughly the same time, but df is actually the first holder, and yet the rgrp is locked on a different node. The df waits for the glock by calling glock_wait_internal() (and so does the other process). Eventually, the glock is unlocked on the other node, so df gets control. Since he's the first holder, he has the HIF_FIRST bit set in the holder structure, but since it's a statfs operation, the caller has the GL_SKIP flag set, which prevents rgrp_go_lock from re-reading the rgrp in order to save time in the df. The df completes happily and the second process gets control of the glock, but now he's not the first holder in, so HIF_FIRST is not set, so glops->go_lock is never called, so the invalid rgrp is STILL not read in. Now the allocation takes place on the invalid rgrp, thus re-allocating bits that were already allocated on the second node.

If my theory is correct, we should be able to recreate all kinds of file system corruption just by running df (and waiting a half-second) in an infinite loop while gfs writes are going on in multiple nodes (a minimal sketch of such a loop appears after the log below). A simple multi-node d_io should work, provided we have rgrp collisions. It's more likely to occur on a small file system, since there will be fewer RGs to allocate from, and fewer RGs to lock/unlock. Also, this would only occur without statfs_fast.

I hit this over the weekend on our PPC cluster running 2.6.18-154.el5:
GFS: fsid=lpars:brawl0.0: fatal: filesystem consistency error
GFS: fsid=lpars:brawl0.0: RG = 405798912
GFS: fsid=lpars:brawl0.0: function = gfs_setbit
GFS: fsid=lpars:brawl0.0: file = /builddir/build/BUILD/gfs-kmod-0.1.33/_kmod_build_/src/gfs/bits.c, line = 81
GFS: fsid=lpars:brawl0.0: time = 1245471616
GFS: fsid=lpars:brawl0.0: about to withdraw from the cluster
kernel BUG in gfs_lm_withdraw at /builddir/build/BUILD/gfs-kmod-0.1.33/_kmod_build_/src/gfs/lm.c:110!
cpu 0x0: Vector: 700 (Program Check) at [c00000001c7ef1c0]
pc: d000000000e2d4a8: .gfs_lm_withdraw+0x9c/0xfc [gfs]
lr: d000000000e2d4a0: .gfs_lm_withdraw+0x94/0xfc [gfs]
sp: c00000001c7ef440
msr: 8000000000029032
current = 0xc00000007fcb19e0
paca = 0xc00000000053c500
pid = 16346, comm = gfs_inoded
kernel BUG in gfs_lm_withdraw at /builddir/build/BUILD/gfs-kmod-0.1.33/_kmod_build_/src/gfs/lm.c:110!
enter ? for help
Unfortunately our pSeries machines drop into a debugger instead of spitting out a backtrace. The system was fenced before I got to the system.
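To make the comment #7 reproducer concrete: the df loop amounts to calling statfs(2), which is what df does, every half second against the gfs mount while multi-node writes run. A minimal userspace sketch; the mount point is a placeholder:

```c
#include <stdio.h>
#include <sys/vfs.h>   /* statfs(2) on Linux */
#include <unistd.h>

int main(void)
{
    struct statfs sbuf;
    const char *mnt = "/mnt/gfs"; /* placeholder: the gfs mount under test */

    /* Loop forever, as suggested in comment #7, while multi-node
     * gfs writes (e.g. d_io) are in progress. */
    for (;;) {
        if (statfs(mnt, &sbuf) != 0) {
            perror("statfs");
            return 1;
        }
        printf("f_bfree=%ld\n", (long)sbuf.f_bfree); /* df-style output */
        usleep(500000); /* the "waiting a half-second" from comment #7 */
    }
}
```

Per comment #7 this should only matter with statfs_fast off; note that the rebuttal below reports hits with statfs_fast on and no statfs callers at all.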
The information in comment #8 tells me nothing. What I'd really like is for someone to try to figure out a consistent way to recreate the failure, and I'd start with the procedure I recommended in comment #7.

(In reply to comment #7)
> If my theory is correct, we should be able to recreate all kinds of
> file system corruption just by running df (and waiting a half-second)
> in an infinite loop while gfs writes are going on in multiple nodes.
> A simple multi-node d_io should work, provided we have rgrp collisions.
> It's more likely to occur on a small file system, since there will
> be fewer RGs to allocate from, and fewer RGs to lock/unlock.
> Also, this would only occur without statfs_fast.

The problem with the theory is that I hit this with near-default options, meaning statfs_fast was on. Comment #8 happened while running coherency (multi-node I/O to the same file) on a 1k block size file system on PPC (64k pages). df does not get run during this, nor any calls to statfs.

I hit this again last night on the PPC cluster, but this time with a 4k fs block size. I was running kernel-2.6.18-160.el5.

Added issue 334119.

We don't have a fix yet; retargeting to 5.6. Do we have a reliable reproducer?

This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

It is over 12 months since the issue was reproduced. We have no information on how it happened, and the bug has been in needinfo for well over 6 months. On that basis I think we can reasonably close this. Please reopen if there is more info available.