Description of problem:

While running coherency on GFS with a 1k block size (on x86 nodes), I found that one node panicked with the following message:

GFS: fsid=marathon:brawl0.0: fatal: filesystem consistency error
GFS: fsid=marathon:brawl0.0: RG = 431174610
GFS: fsid=marathon:brawl0.0: function = gfs_setbit
GFS: fsid=marathon:brawl0.0: file = /builddir/build/BUILD/gfs-kmod-0.1.28/_kmod_build_/src/gfs/bits.c, line = 81
GFS: fsid=marathon:brawl0.0: time = 1226438282
GFS: fsid=marathon:brawl0.0: about to withdraw from the cluster
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ...ld/BUILD/gfs-kmod-0.1.28/_kmod_build_/src/gfs/lm.c:110
invalid opcode: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:09.0/irq
CPU 2
Modules linked in: sctp lock_nolock gfs(U) lock_dlm gfs2 dlm gnbd(U) configfs autofs4 hidp rfcomm l2cap bluetooth dm_log_clustered(U) sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg i2c_nforce2 k8_edac forcedeth ide_cd i2c_core shpchp edac_mc cdrom serio_raw k8temp hwmon pcspkr dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc sata_nv libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 9626, comm: gfs_inoded Tainted: G 2.6.18-122.el5 #1
RIP: 0010:[<ffffffff885fef4b>]  [<ffffffff885fef4b>] :gfs:gfs_lm_withdraw+0x87/0xd3
RSP: 0018:ffff81005eef9920  EFLAGS: 00010202
RAX: 0000000000000043 RBX: ffffc20000491000 RCX: ffffffff802fdaa8
RDX: ffffffff802fdaa8 RSI: 0000000000000000 RDI: ffffffff802fdaa0
RBP: ffffc200004c99ac R08: ffffffff802fdaa8 R09: 0000000000000046
R10: ffff81005eef95c0 R11: 0000000000000280 R12: 000000000000014c
R13: ffff81013db50048 R14: 0000000000005d6d R15: 0000000000000002
FS:  00002b7c3d7c7bd0(0000) GS:ffff810102b277c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002af34c528000 CR3: 00000000766d3000 CR4: 00000000000006e0
Process gfs_inoded (pid: 9626, threadinfo ffff81005eef8000, task ffff81013e150820)
Stack:
 0000003000000030 ffff81005eef9a40 ffff81005eef9940 ffff81012eb1fa60
 ffff81012eb1fa60 ffffc20000491000 ffffc200004c99ac ffffc200004c99ac
 0000000019b333d2 ffffc200004c99ac ffff810068a98800 ffffffff800d6d56
Call Trace:
 [<ffffffff800d6d56>] __cache_alloc_node+0x9d/0xd2
 [<ffffffff800d6d56>] __cache_alloc_node+0x9d/0xd2
 [<ffffffff885ecde0>] :gfs:gfs_dpin+0x11c/0x20c
 [<ffffffff886169ac>] :gfs:gfs_consist_rgrpd_i+0x3c/0x41
 [<ffffffff886120ec>] :gfs:blkfree_internal+0x125/0x145
 [<ffffffff88612324>] :gfs:gfs_blkfree+0x20/0x94
 [<ffffffff885eadf4>] :gfs:do_strip+0x216/0x2fd
 [<ffffffff885e9aae>] :gfs:recursive_scan+0xe7/0x16c
 [<ffffffff885e9af8>] :gfs:recursive_scan+0x131/0x16c
 [<ffffffff885eabde>] :gfs:do_strip+0x0/0x2fd
 [<ffffffff885e9af8>] :gfs:recursive_scan+0x131/0x16c
 [<ffffffff885eabde>] :gfs:do_strip+0x0/0x2fd
 [<ffffffff885ea2ff>] :gfs:gfs_shrink+0xd7/0x41d
 [<ffffffff885eabde>] :gfs:do_strip+0x0/0x2fd
 [<ffffffff80089a8c>] complete+0x38/0x4b
 [<ffffffff885f504d>] :gfs:lock_on_glock+0x66/0x6d
 [<ffffffff8009d986>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8009d986>] keventd_create_kthread+0x0/0xc4
 [<ffffffff885f9641>] :gfs:gfs_inode_dealloc+0x21e/0x53c
 [<ffffffff8009d986>] keventd_create_kthread+0x0/0xc4
 [<ffffffff88615eff>] :gfs:gfs_unlinked_dealloc+0x31/0xb3
 [<ffffffff885eb1e8>] :gfs:gfs_inoded+0x0/0x44
 [<ffffffff885eb1f6>] :gfs:gfs_inoded+0xe/0x44
 [<ffffffff800324a9>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009d986>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800323ab>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11
Code: 0f 0b 68 cf a2 61 88 c2 6e 00 48 89 ee 48 c7 c7 0f a3 61 88

Version-Release number of selected component (if applicable):
kernel-2.6.18-122.el5
kmod-gfs-0.1.28-1.el5

How reproducible:
Not easily. I've seen this twice in the past two months. The other time I filed it in bug 452274 comment #26.

Steps to Reproduce:
No reliable steps yet.
Since the block was being freed (according to the backtrace), it looks like the block was, for some reason, already marked free in the rgrp bitmap. This sounds like it might well be related to the other problem, where blocks were being corrupted. Maybe the rgrp's bitmaps are somehow incorrect?
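Purely to illustrate that failure mode (a free path landing on a block whose bitmap entry is already free), here is a rough userspace sketch assuming GFS-style two-bits-per-block rgrp bitmaps. The names and layout are illustrative only, not the actual GFS code; the real code raises a filesystem consistency error and withdraws rather than asserting.

#include <stdint.h>
#include <assert.h>

#define BLKST_FREE 0u   /* hypothetical two-bit state meaning "free" */

static unsigned int get_block_state(const uint8_t *bitmap, uint32_t block)
{
        unsigned int byte = block / 4;          /* 4 two-bit entries per byte */
        unsigned int shift = (block % 4) * 2;
        return (bitmap[byte] >> shift) & 0x3;
}

static void free_block(uint8_t *bitmap, uint32_t block)
{
        /* Freeing a block that is already free implies a corrupt or stale
         * bitmap; this is the situation the panic message is reporting. */
        assert(get_block_state(bitmap, block) != BLKST_FREE);

        unsigned int byte = block / 4;
        unsigned int shift = (block % 4) * 2;
        bitmap[byte] &= ~(0x3u << shift);       /* set the state back to free */
}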
Looks like I may have hit this while doing testing with MarkLogic. I'm capturing metadata now with gfs2_edit to hand to Bob for analysis.

GFS: fsid=north:northlv.11: fatal: filesystem consistency error
GFS: fsid=north:northlv.11: RG = 140760734
GFS: fsid=north:northlv.11: function = gfs_setbit
GFS: fsid=north:northlv.11: file = /builddir/build/BUILD/gfs-kmod-0.1.31/_kmod_build_/src/gfs/bits.c, line = 81
GFS: fsid=north:northlv.11: time = 1238647878
GFS: fsid=north:northlv.11: about to withdraw from the cluster
GFS: fsid=north:northlv.11: telling LM to withdraw
GFS: fsid=north:northlv.11: withdrawn
Call Trace:
 [<ffffffff8867afc0>] :gfs:gfs_lm_withdraw+0xc4/0xd3
 [<ffffffff80013617>] find_lock_page+0x26/0xa1
 [<ffffffff800d6d4c>] __cache_alloc_node+0x9d/0xd2
 [<ffffffff88692836>] :gfs:gmalloc+0xf/0x3a
 [<ffffffff88668e76>] :gfs:gfs_dpin+0x1b2/0x20c
 [<ffffffff88692ad0>] :gfs:gfs_consist_rgrpd_i+0x3c/0x41
 [<ffffffff8868e210>] :gfs:blkfree_internal+0x125/0x145
 [<ffffffff8868e24a>] :gfs:gfs_difree_uninit+0x1a/0xc9
 [<ffffffff8868e30f>] :gfs:gfs_difree+0x16/0x45
 [<ffffffff88675956>] :gfs:gfs_inode_dealloc+0x4f8/0x53c
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff88692023>] :gfs:gfs_unlinked_dealloc+0x31/0xb3
 [<ffffffff886671e8>] :gfs:gfs_inoded+0x0/0x44
 [<ffffffff886671f6>] :gfs:gfs_inoded+0xe/0x44
 [<ffffffff80032360>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032262>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Version-Release number of selected component (if applicable):
kernel-2.6.18-128.el5
kmod-gfs-0.1.31-3.el5
While investigating the corruption with gfs_fsck, Bob noted:

<bob> Well, it's hard to say how it got that way. I'm trying to find the files that correspond to the dinodes in order to possibly track down what was happening to them when they got the same blocks
<bob> For example, the first duplicate block found is 0x86478d0. That block is claimed as data by a dinode at 0x19a77a0f, but it's also claimed as data by a dinode at 0x1eb112e3

This corresponds to the following two files:

430406159 2856 -rw-r--r-- 1 daemon daemon 2911568 Apr  2 00:49 ./NORTH_F6/00000190/d853d1ac0be720f2_string_
514921187  216 -rw-r--r-- 1 daemon daemon  213888 Apr  2 00:49 ./NORTH_F10/00000190/a6cc4def48d472ab_date

The file in NORTH_F10 is one that north-10 could have been modifying just before it panicked as per comment #2.

A dump of the inode structures with gfs2_edit shows that both nodes were allocating to the same resource group within 20 seconds of each other.

Block #430406159 (0x19a77a0f) of 609826816 (0x24593800) (disk inode)
...
  di_size       2911568     0x2c6d50
  di_blocks     714         0x2ca
  di_atime      1238647797  0x49d443f5
  di_mtime      1238647797  0x49d443f5
  di_ctime      1238647797  0x49d443f5
  di_goal_meta  430405676   0x19a7782c
  di_goal_data  140760734   0x863d69e

Block #514921187 (0x1eb112e3) of 609826816 (0x24593800) (disk inode)
...
  di_size       213888      0x34380
  di_blocks     54          0x36
  di_atime      1238647778  0x49d443e2
  di_mtime      1238647778  0x49d443e2
  di_ctime      1238647778  0x49d443e2
  di_goal_meta  514873846   0x1eb059f6
  di_goal_data  140760734   0x863d69e

The block maps show that the entirety of inode 0x1eb112e3's allocations are within inode 0x19a77a0f's allocations.
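To make that containment claim concrete, this is the kind of check that can be run over the two block lists pulled from the gfs2_edit block maps. It is a hypothetical helper, not part of gfs2_edit or gfs_fsck: given two ascending lists of block numbers, it reports whether every block of the smaller inode also appears in the larger inode's list.

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Both arrays must be sorted in ascending order. */
static bool blocks_contained(const uint64_t *small, size_t ns,
                             const uint64_t *large, size_t nl)
{
        size_t i = 0, j = 0;

        while (i < ns && j < nl) {
                if (small[i] == large[j]) {
                        i++;            /* this block is present in both */
                        j++;
                } else if (small[i] > large[j]) {
                        j++;            /* keep scanning the larger list */
                } else {
                        return false;   /* block missing from the larger list */
                }
        }
        return i == ns;
}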
Created attachment 338905 [details]
Preliminary Instrumentation (doesn't work)

This patch adds instrumentation to GFS to sanity-check the rgrp information using the LVBs passed back in the rgrp glocks. It doesn't yet behave as I expect it should. The goal is to detect when GFS gets stale data for an rgrp that was last written on a different node. My theory is that the rgrp glock is released after the rgrp has been committed to the journal, but before the "real" block has landed on disk. Another node would then read the previous "generation" of the rgrp and could assign a block that is already in use for another purpose.
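For anyone reading along, here is a conceptual sketch of the kind of LVB check the instrumentation is aiming for. This is not the attached patch; the structure and field names (rgrp_lvb_check, lv_generation, lv_free) are made up for illustration. The releasing node records a generation number and free count in the lock value block; the next node to acquire the glock compares them against what it just read from disk and complains if the disk copy is behind.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout of the data carried in the rgrp glock's LVB. */
struct rgrp_lvb_check {
        uint32_t lv_generation;   /* bumped every time the rgrp is dirtied */
        uint32_t lv_free;         /* free-block count for cross-checking */
};

/* Releasing node: record what it believes is on (or headed to) disk. */
static void rgrp_save_lvb(struct rgrp_lvb_check *lvb,
                          uint32_t gen, uint32_t free_blocks)
{
        lvb->lv_generation = gen;
        lvb->lv_free = free_blocks;
}

/* Acquiring node: compare the LVB against the freshly read rgrp header.
 * If the disk copy is behind the LVB, the previous holder's write had
 * not landed yet and the rgrp we are about to use is stale. */
static int rgrp_check_lvb(const struct rgrp_lvb_check *lvb,
                          uint32_t disk_gen, uint32_t disk_free)
{
        if (disk_gen < lvb->lv_generation || disk_free != lvb->lv_free) {
                fprintf(stderr, "stale rgrp: disk gen %u behind lvb gen %u\n",
                        disk_gen, lvb->lv_generation);
                return -1;
        }
        return 0;
}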
Created attachment 338931 [details]
Better Instrumentation

This version corrects a few problems, including an alignment problem that was causing all kinds of havoc.
I'd like for Nate or someone to try to recreate similar file system corruption using the instrumentation given with the previous patch, and monitor and report any kernel messages it kicks out. BTW, the patch applies to the gfs-kernel source on top of the patch for bug #491369. There's a remote possibility that the problem is fixed by the patch for bug #491369, so if we can't recreate the corruption with the instrumented version, we should back out the instrumentation and run with just that patch. I can provide a binary version of gfs.ko if desired. I'm setting the NEEDINFO bit until I can get that info.
I have another theory as to how we can get GFS file system corruption. This is just a theory; perhaps someone can punch holes in it.

Suppose we have two processes waiting for an rgrp: one for block allocation and another just for statfs, i.e. a df process. Suppose they request the rgrp at roughly the same time, df is actually the first holder, and the rgrp is currently locked on a different node. The df process waits for the glock by calling glock_wait_internal() (and so does the other process). Eventually the glock is unlocked on the other node, so df gets control. Since it is the first holder, it has the HIF_FIRST bit set in its holder structure, but because this is a statfs operation the caller also has the GL_SKIP flag set, which prevents rgrp_go_lock from re-reading the rgrp (a shortcut to save time in df). The df completes happily and the second process gets control of the glock, but it is not the first holder in, so HIF_FIRST is not set, so glops->go_lock is never called, and the stale rgrp is STILL not read in. The allocation then takes place against the stale rgrp, re-allocating bits that were already allocated on the other node.

If my theory is correct, we should be able to recreate all kinds of file system corruption just by running df (and waiting a half-second) in an infinite loop while GFS writes are going on on multiple nodes. A simple multi-node d_io run should work, provided we have rgrp collisions. It's more likely to occur on a small file system, since there will be fewer RGs to allocate from, and fewer RGs to lock/unlock. Also, this would only occur without statfs_fast.
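To make the sequence easier to follow, here is a small self-contained model of the first-holder logic described above. It is not the real glock code; the flag and callback names (HIF_FIRST, GL_SKIP, go_lock) are taken from the theory, and everything else is simplified for illustration.

#include <stdbool.h>
#include <stdio.h>

struct rgrp  { bool uptodate; };        /* in-memory copy of the rgrp */
struct glock { struct rgrp *rg; };

struct holder {
        struct glock *gl;
        bool first;   /* HIF_FIRST: first holder after the glock moved nodes */
        bool skip;    /* GL_SKIP: caller asked to skip go_lock (statfs path) */
};

/* Stand-in for glops->go_lock on an rgrp: re-read the bitmaps from disk. */
static void go_lock(struct glock *gl)
{
        gl->rg->uptodate = true;
}

/* Grant path: only the first holder triggers go_lock, and only when it
 * did not pass GL_SKIP. */
static void glock_granted(struct holder *gh)
{
        if (gh->first && !gh->skip)
                go_lock(gh->gl);
}

int main(void)
{
        struct rgrp rg = { .uptodate = false };  /* stale: last written remotely */
        struct glock gl = { .rg = &rg };

        struct holder df    = { .gl = &gl, .first = true,  .skip = true  };
        struct holder alloc = { .gl = &gl, .first = false, .skip = false };

        glock_granted(&df);     /* df: first holder, but GL_SKIP => no re-read */
        glock_granted(&alloc);  /* allocator: not first => go_lock never called */

        if (!rg.uptodate)
                printf("allocator would work against a stale rgrp bitmap\n");
        return 0;
}

Running the model prints the failure case: neither holder ever re-reads the rgrp, so the allocator works from whatever was cached before the other node's changes.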
I hit this over the weekend on our PPC cluster running 2.6.18-154.el5.

GFS: fsid=lpars:brawl0.0: fatal: filesystem consistency error
GFS: fsid=lpars:brawl0.0: RG = 405798912
GFS: fsid=lpars:brawl0.0: function = gfs_setbit
GFS: fsid=lpars:brawl0.0: file = /builddir/build/BUILD/gfs-kmod-0.1.33/_kmod_build_/src/gfs/bits.c, line = 81
GFS: fsid=lpars:brawl0.0: time = 1245471616
GFS: fsid=lpars:brawl0.0: about to withdraw from the cluster
kernel BUG in gfs_lm_withdraw at /builddir/build/BUILD/gfs-kmod-0.1.33/_kmod_build_/src/gfs/lm.c:110!
cpu 0x0: Vector: 700 (Program Check) at [c00000001c7ef1c0]
    pc: d000000000e2d4a8: .gfs_lm_withdraw+0x9c/0xfc [gfs]
    lr: d000000000e2d4a0: .gfs_lm_withdraw+0x94/0xfc [gfs]
    sp: c00000001c7ef440
    msr: 8000000000029032
    current = 0xc00000007fcb19e0
    paca = 0xc00000000053c500
    pid = 16346, comm = gfs_inoded
kernel BUG in gfs_lm_withdraw at /builddir/build/BUILD/gfs-kmod-0.1.33/_kmod_build_/src/gfs/lm.c:110!
enter ? for help

Unfortunately, our pSeries machines drop into a debugger instead of spitting out a backtrace, and the node was fenced before I could get to it.
The information in comment #8 tells me nothing. What I'd really like is for someone to try to figure out a consistent way to recreate the failure, and I'd start with the procedure I recommended in comment #7.
(In reply to comment #7)
> If my theory is correct, we should be able to recreate all kinds of
> file system corruption just by running df (and waiting a half-second)
> in an infinite loop while GFS writes are going on on multiple nodes.
> A simple multi-node d_io run should work, provided we have rgrp
> collisions. It's more likely to occur on a small file system, since
> there will be fewer RGs to allocate from, and fewer RGs to
> lock/unlock. Also, this would only occur without statfs_fast.

The problem with this theory is that I hit the bug with near-default mount options, meaning statfs_fast was on. Comment #8 happened while running coherency (multi-node I/O to the same file) on a 1k block size file system on PPC (64k pages). df does not get run during that test, nor does anything else call statfs.
I hit this again last night on the PPC cluster, but this time with 4k fs block size. I was running kernel-2.6.18-160.el5.
Added issue 334119.
We don't have a fix yet; retargeting to 5.6.
Do we have a reliable reproducer?
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
It is over 12 months since the issue was reproduced. We have no information on how it happened and the bug has been in needinfo for well over 6 months. On that basis I think we can reasonably close this. Please reopen if there is more info available.