Description of problem:

While running coherency on GFS with a 1k block size (on x86 nodes), I found that one node panicked with the following message:

GFS: fsid=marathon:brawl0.0: fatal: filesystem consistency error
GFS: fsid=marathon:brawl0.0: RG = 431174610
GFS: fsid=marathon:brawl0.0: function = gfs_setbit
GFS: fsid=marathon:brawl0.0: file = /builddir/build/BUILD/gfs-kmod-0.1.28/_kmod_build_/src/gfs/bits.c, line = 81
GFS: fsid=marathon:brawl0.0: time = 1226438282
GFS: fsid=marathon:brawl0.0: about to withdraw from the cluster
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ...ld/BUILD/gfs-kmod-0.1.28/_kmod_build_/src/gfs/lm.c:110
invalid opcode: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:09.0/irq
CPU 2
Modules linked in: sctp lock_nolock gfs(U) lock_dlm gfs2 dlm gnbd(U) configfs autofs4 hidp rfcomm l2cap bluetooth dm_log_clustered(U) sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg i2c_nforce2 k8_edac forcedeth ide_cd i2c_core shpchp edac_mc cdrom serio_raw k8temp hwmon pcspkr dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc sata_nv libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 9626, comm: gfs_inoded Tainted: G 2.6.18-122.el5 #1
RIP: 0010:[<ffffffff885fef4b>]  [<ffffffff885fef4b>] :gfs:gfs_lm_withdraw+0x87/0xd3
RSP: 0018:ffff81005eef9920  EFLAGS: 00010202
RAX: 0000000000000043 RBX: ffffc20000491000 RCX: ffffffff802fdaa8
RDX: ffffffff802fdaa8 RSI: 0000000000000000 RDI: ffffffff802fdaa0
RBP: ffffc200004c99ac R08: ffffffff802fdaa8 R09: 0000000000000046
R10: ffff81005eef95c0 R11: 0000000000000280 R12: 000000000000014c
R13: ffff81013db50048 R14: 0000000000005d6d R15: 0000000000000002
FS:  00002b7c3d7c7bd0(0000) GS:ffff810102b277c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002af34c528000 CR3: 00000000766d3000 CR4: 00000000000006e0
Process gfs_inoded (pid: 9626, threadinfo ffff81005eef8000, task ffff81013e150820)
Stack:
 0000003000000030 ffff81005eef9a40 ffff81005eef9940 ffff81012eb1fa60
 ffff81012eb1fa60 ffffc20000491000 ffffc200004c99ac ffffc200004c99ac
 0000000019b333d2 ffffc200004c99ac ffff810068a98800 ffffffff800d6d56
Call Trace:
 [<ffffffff800d6d56>] __cache_alloc_node+0x9d/0xd2
 [<ffffffff800d6d56>] __cache_alloc_node+0x9d/0xd2
 [<ffffffff885ecde0>] :gfs:gfs_dpin+0x11c/0x20c
 [<ffffffff886169ac>] :gfs:gfs_consist_rgrpd_i+0x3c/0x41
 [<ffffffff886120ec>] :gfs:blkfree_internal+0x125/0x145
 [<ffffffff88612324>] :gfs:gfs_blkfree+0x20/0x94
 [<ffffffff885eadf4>] :gfs:do_strip+0x216/0x2fd
 [<ffffffff885e9aae>] :gfs:recursive_scan+0xe7/0x16c
 [<ffffffff885e9af8>] :gfs:recursive_scan+0x131/0x16c
 [<ffffffff885eabde>] :gfs:do_strip+0x0/0x2fd
 [<ffffffff885e9af8>] :gfs:recursive_scan+0x131/0x16c
 [<ffffffff885eabde>] :gfs:do_strip+0x0/0x2fd
 [<ffffffff885ea2ff>] :gfs:gfs_shrink+0xd7/0x41d
 [<ffffffff885eabde>] :gfs:do_strip+0x0/0x2fd
 [<ffffffff80089a8c>] complete+0x38/0x4b
 [<ffffffff885f504d>] :gfs:lock_on_glock+0x66/0x6d
 [<ffffffff8009d986>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8009d986>] keventd_create_kthread+0x0/0xc4
 [<ffffffff885f9641>] :gfs:gfs_inode_dealloc+0x21e/0x53c
 [<ffffffff8009d986>] keventd_create_kthread+0x0/0xc4
 [<ffffffff88615eff>] :gfs:gfs_unlinked_dealloc+0x31/0xb3
 [<ffffffff885eb1e8>] :gfs:gfs_inoded+0x0/0x44
 [<ffffffff885eb1f6>] :gfs:gfs_inoded+0xe/0x44
 [<ffffffff800324a9>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009d986>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800323ab>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11
Code: 0f 0b 68 cf a2 61 88 c2 6e 00 48 89 ee 48 c7 c7 0f a3 61 88

Version-Release number of selected component (if applicable):
kernel-2.6.18-122.el5
kmod-gfs-0.1.28-1.el5

How reproducible:
Not easily. I've seen this twice in the past two months. The other time I filed it in bug 452274 comment #26.

Steps to Reproduce:
No reliable steps yet.
Since the block was being freed (according to the backtrace), it looks like the block was, for some reason, already marked free in the rgrp bitmap. This sounds like it might well be related to the other problem, where blocks were being corrupted. Maybe the rgrp's bitmaps are somehow incorrect?
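Purely to illustrate that failure mode (a free path landing on a block whose bitmap entry is already free), here is a rough userspace sketch assuming GFS-style two-bits-per-block rgrp bitmaps. The names and layout are illustrative only, not the actual GFS code; the real code raises a filesystem consistency error and withdraws rather than asserting.

#include <stdint.h>
#include <assert.h>

#define BLKST_FREE 0u   /* hypothetical two-bit state meaning "free" */

static unsigned int get_block_state(const uint8_t *bitmap, uint32_t block)
{
        unsigned int byte = block / 4;          /* 4 two-bit entries per byte */
        unsigned int shift = (block % 4) * 2;
        return (bitmap[byte] >> shift) & 0x3;
}

static void free_block(uint8_t *bitmap, uint32_t block)
{
        /* Freeing a block that is already free implies a corrupt or stale
         * bitmap; this is the situation the panic message is reporting. */
        assert(get_block_state(bitmap, block) != BLKST_FREE);

        unsigned int byte = block / 4;
        unsigned int shift = (block % 4) * 2;
        bitmap[byte] &= ~(0x3u << shift);       /* set the state back to free */
}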
Looks like I may have hit this while doing testing with MarkLogic. I'm capturing metadata now with gfs2_edit to hand to Bob for analysis.

GFS: fsid=north:northlv.11: fatal: filesystem consistency error
GFS: fsid=north:northlv.11: RG = 140760734
GFS: fsid=north:northlv.11: function = gfs_setbit
GFS: fsid=north:northlv.11: file = /builddir/build/BUILD/gfs-kmod-0.1.31/_kmod_build_/src/gfs/bits.c, line = 81
GFS: fsid=north:northlv.11: time = 1238647878
GFS: fsid=north:northlv.11: about to withdraw from the cluster
GFS: fsid=north:northlv.11: telling LM to withdraw
GFS: fsid=north:northlv.11: withdrawn
Call Trace:
 [<ffffffff8867afc0>] :gfs:gfs_lm_withdraw+0xc4/0xd3
 [<ffffffff80013617>] find_lock_page+0x26/0xa1
 [<ffffffff800d6d4c>] __cache_alloc_node+0x9d/0xd2
 [<ffffffff88692836>] :gfs:gmalloc+0xf/0x3a
 [<ffffffff88668e76>] :gfs:gfs_dpin+0x1b2/0x20c
 [<ffffffff88692ad0>] :gfs:gfs_consist_rgrpd_i+0x3c/0x41
 [<ffffffff8868e210>] :gfs:blkfree_internal+0x125/0x145
 [<ffffffff8868e24a>] :gfs:gfs_difree_uninit+0x1a/0xc9
 [<ffffffff8868e30f>] :gfs:gfs_difree+0x16/0x45
 [<ffffffff88675956>] :gfs:gfs_inode_dealloc+0x4f8/0x53c
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff88692023>] :gfs:gfs_unlinked_dealloc+0x31/0xb3
 [<ffffffff886671e8>] :gfs:gfs_inoded+0x0/0x44
 [<ffffffff886671f6>] :gfs:gfs_inoded+0xe/0x44
 [<ffffffff80032360>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032262>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Version-Release number of selected component (if applicable):
kernel-2.6.18-128.el5
kmod-gfs-0.1.31-3.el5
While investigating the corruption with gfs_fsck, Bob noted:

<bob> Well, it's hard to say how it got that way. I'm trying to find the files that correspond to the dinodes in order to possibly track down what was happening to them when they got the same blocks
<bob> For example, the first duplicate block found is 0x86478d0. That block is claimed as data by a dinode at 0x19a77a0f, but it's also claimed as data by a dinode at 0x1eb112e3

This corresponds to the following two files:

430406159 2856 -rw-r--r-- 1 daemon daemon 2911568 Apr  2 00:49 ./NORTH_F6/00000190/d853d1ac0be720f2_string_
514921187  216 -rw-r--r-- 1 daemon daemon  213888 Apr  2 00:49 ./NORTH_F10/00000190/a6cc4def48d472ab_date

The file in NORTH_F10 is one that north-10 could have been modifying just before it panicked as per comment #2.

A dump of the inode structures with gfs2_edit shows that both nodes were allocating to the same resource group within 20 seconds of each other.

Block #430406159 (0x19a77a0f) of 609826816 (0x24593800) (disk inode)
...
  di_size       2911568     0x2c6d50
  di_blocks     714         0x2ca
  di_atime      1238647797  0x49d443f5
  di_mtime      1238647797  0x49d443f5
  di_ctime      1238647797  0x49d443f5
  di_goal_meta  430405676   0x19a7782c
  di_goal_data  140760734   0x863d69e

Block #514921187 (0x1eb112e3) of 609826816 (0x24593800) (disk inode)
...
  di_size       213888      0x34380
  di_blocks     54          0x36
  di_atime      1238647778  0x49d443e2
  di_mtime      1238647778  0x49d443e2
  di_ctime      1238647778  0x49d443e2
  di_goal_meta  514873846   0x1eb059f6
  di_goal_data  140760734   0x863d69e

The block maps show that the entirety of inode 0x1eb112e3's allocations are within inode 0x19a77a0f's allocations.
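To make that containment claim concrete, this is the kind of check that can be run over the two block lists pulled from the gfs2_edit block maps. It is a hypothetical helper, not part of gfs2_edit or gfs_fsck: given two ascending lists of block numbers, it reports whether every block of the smaller inode also appears in the larger inode's list.

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Both arrays must be sorted in ascending order. */
static bool blocks_contained(const uint64_t *small, size_t ns,
                             const uint64_t *large, size_t nl)
{
        size_t i = 0, j = 0;

        while (i < ns && j < nl) {
                if (small[i] == large[j]) {
                        i++;            /* this block is present in both */
                        j++;
                } else if (small[i] > large[j]) {
                        j++;            /* keep scanning the larger list */
                } else {
                        return false;   /* block missing from the larger list */
                }
        }
        return i == ns;
}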
Created attachment 338905 [details]
Preliminary Instrumentation (doesn't work)

This patch adds instrumentation to GFS to sanity-check the rgrp information using the LVBs passed back in the rgrp glocks. It doesn't yet behave as I expect it should. The goal is to detect when GFS gets stale data for an rgrp that was last written on a different node. My theory is that the rgrp glock is released after the rgrp has been committed to the journal, but before the "real" block has landed on disk. Another node would then read the previous "generation" of the rgrp and could assign a block that is already in use for another purpose.
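For anyone reading along, here is a conceptual sketch of the kind of LVB check the instrumentation is aiming for. This is not the attached patch; the structure and field names (rgrp_lvb_check, lv_generation, lv_free) are made up for illustration. The releasing node records a generation number and free count in the lock value block; the next node to acquire the glock compares them against what it just read from disk and complains if the disk copy is behind.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout of the data carried in the rgrp glock's LVB. */
struct rgrp_lvb_check {
        uint32_t lv_generation;   /* bumped every time the rgrp is dirtied */
        uint32_t lv_free;         /* free-block count for cross-checking */
};

/* Releasing node: record what it believes is on (or headed to) disk. */
static void rgrp_save_lvb(struct rgrp_lvb_check *lvb,
                          uint32_t gen, uint32_t free_blocks)
{
        lvb->lv_generation = gen;
        lvb->lv_free = free_blocks;
}

/* Acquiring node: compare the LVB against the freshly read rgrp header.
 * If the disk copy is behind the LVB, the previous holder's write had
 * not landed yet and the rgrp we are about to use is stale. */
static int rgrp_check_lvb(const struct rgrp_lvb_check *lvb,
                          uint32_t disk_gen, uint32_t disk_free)
{
        if (disk_gen < lvb->lv_generation || disk_free != lvb->lv_free) {
                fprintf(stderr, "stale rgrp: disk gen %u behind lvb gen %u\n",
                        disk_gen, lvb->lv_generation);
                return -1;
        }
        return 0;
}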
Created attachment 338931 [details]
Better Instrumentation

This version corrects a few problems, including an alignment problem that was causing all kinds of havoc.
I'd like for Nate or someone to try to recreate similar file system corruption using the instrumentation given with the previous patch, and monitor and report any kernel messages it kicks out. BTW, the patch applies to the gfs-kernel source on top of the patch for bug #491369. There's a remote possibility that the problem is fixed by the patch for bug #491369, so if we can't recreate the corruption with the instrumented version, we should back out the instrumentation and run with just that patch. I can provide a binary version of gfs.ko if desired. I'm setting the NEEDINFO bit until I can get that info.
I have another theory as to how we can get GFS file system corruption. This is just a theory; perhaps someone can punch holes in it.

Suppose we have two processes waiting for an rgrp: one for block allocation and another just for statfs, i.e. a df process. Suppose they request the rgrp at roughly the same time, df is actually the first holder, and the rgrp is currently locked on a different node. The df process waits for the glock by calling glock_wait_internal() (and so does the other process). Eventually the glock is unlocked on the other node, so df gets control. Since it is the first holder, it has the HIF_FIRST bit set in its holder structure, but because this is a statfs operation the caller also has the GL_SKIP flag set, which prevents rgrp_go_lock from re-reading the rgrp (a shortcut to save time in df). The df completes happily and the second process gets control of the glock, but it is not the first holder in, so HIF_FIRST is not set, so glops->go_lock is never called, and the stale rgrp is STILL not read in. The allocation then takes place against the stale rgrp, re-allocating bits that were already allocated on the other node.

If my theory is correct, we should be able to recreate all kinds of file system corruption just by running df (and waiting a half-second) in an infinite loop while GFS writes are going on on multiple nodes. A simple multi-node d_io run should work, provided we have rgrp collisions. It's more likely to occur on a small file system, since there will be fewer RGs to allocate from, and fewer RGs to lock/unlock. Also, this would only occur without statfs_fast.
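To make the sequence easier to follow, here is a small self-contained model of the first-holder logic described above. It is not the real glock code; the flag and callback names (HIF_FIRST, GL_SKIP, go_lock) are taken from the theory, and everything else is simplified for illustration.

#include <stdbool.h>
#include <stdio.h>

struct rgrp  { bool uptodate; };        /* in-memory copy of the rgrp */
struct glock { struct rgrp *rg; };

struct holder {
        struct glock *gl;
        bool first;   /* HIF_FIRST: first holder after the glock moved nodes */
        bool skip;    /* GL_SKIP: caller asked to skip go_lock (statfs path) */
};

/* Stand-in for glops->go_lock on an rgrp: re-read the bitmaps from disk. */
static void go_lock(struct glock *gl)
{
        gl->rg->uptodate = true;
}

/* Grant path: only the first holder triggers go_lock, and only when it
 * did not pass GL_SKIP. */
static void glock_granted(struct holder *gh)
{
        if (gh->first && !gh->skip)
                go_lock(gh->gl);
}

int main(void)
{
        struct rgrp rg = { .uptodate = false };  /* stale: last written remotely */
        struct glock gl = { .rg = &rg };

        struct holder df    = { .gl = &gl, .first = true,  .skip = true  };
        struct holder alloc = { .gl = &gl, .first = false, .skip = false };

        glock_granted(&df);     /* df: first holder, but GL_SKIP => no re-read */
        glock_granted(&alloc);  /* allocator: not first => go_lock never called */

        if (!rg.uptodate)
                printf("allocator would work against a stale rgrp bitmap\n");
        return 0;
}

Running the model prints the failure case: neither holder ever re-reads the rgrp, so the allocator works from whatever was cached before the other node's changes.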
I hit this over the weekend on our PPC cluster running 2.6.18-154.el5.

GFS: fsid=lpars:brawl0.0: fatal: filesystem consistency error
GFS: fsid=lpars:brawl0.0: RG = 405798912
GFS: fsid=lpars:brawl0.0: function = gfs_setbit
GFS: fsid=lpars:brawl0.0: file = /builddir/build/BUILD/gfs-kmod-0.1.33/_kmod_build_/src/gfs/bits.c, line = 81
GFS: fsid=lpars:brawl0.0: time = 1245471616
GFS: fsid=lpars:brawl0.0: about to withdraw from the cluster
kernel BUG in gfs_lm_withdraw at /builddir/build/BUILD/gfs-kmod-0.1.33/_kmod_build_/src/gfs/lm.c:110!
cpu 0x0: Vector: 700 (Program Check) at [c00000001c7ef1c0]
    pc: d000000000e2d4a8: .gfs_lm_withdraw+0x9c/0xfc [gfs]
    lr: d000000000e2d4a0: .gfs_lm_withdraw+0x94/0xfc [gfs]
    sp: c00000001c7ef440
    msr: 8000000000029032
    current = 0xc00000007fcb19e0
    paca = 0xc00000000053c500
    pid = 16346, comm = gfs_inoded
kernel BUG in gfs_lm_withdraw at /builddir/build/BUILD/gfs-kmod-0.1.33/_kmod_build_/src/gfs/lm.c:110!
enter ? for help

Unfortunately, our pSeries machines drop into a debugger instead of spitting out a backtrace, and the node was fenced before I could get to it.
The information in comment #8 tells me nothing. What I'd really like is for someone to try to figure out a consistent way to recreate the failure, and I'd start with the procedure I recommended in comment #7.
(In reply to comment #7)
> If my theory is correct, we should be able to recreate all kinds of
> file system corruption just by running df (and waiting a half-second)
> in an infinite loop while GFS writes are going on on multiple nodes.
> A simple multi-node d_io run should work, provided we have rgrp
> collisions. It's more likely to occur on a small file system, since
> there will be fewer RGs to allocate from, and fewer RGs to
> lock/unlock. Also, this would only occur without statfs_fast.

The problem with this theory is that I hit the bug with near-default mount options, meaning statfs_fast was on. Comment #8 happened while running coherency (multi-node I/O to the same file) on a 1k block size file system on PPC (64k pages). df does not get run during that test, nor does anything else call statfs.
I hit this again last night on the PPC cluster, but this time with 4k fs block size. I was running kernel-2.6.18-160.el5.
Added issue 334119.
We don't have a fix yet; retargeting to 5.6.
Do we have a reliable reproducer?
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
It is over 12 months since the issue was reproduced. We have no information on how it happened and the bug has been in needinfo for well over 6 months. On that basis I think we can reasonably close this. Please reopen if there is more info available.