Description of problem:
Eric Sandeen had a 2TB GFS2 volume (/dev/sdc) on system east-10.lab.bos.redhat.com. Apparently the system was rebooted in the middle of a gfs2_fsck. When the system came back, he tried to mount the volume, and it panicked the kernel. Eric says he did not use any special mount parameters.

Version-Release number of selected component (if applicable):
RHEL5 running the 2.6.26-rc2 kernel (Linus's kernel) which is pretty recent wrt the mounting code (ops_fstype.c, mount.c and such).

How reproducible:
Unknown. I tried editing the superblock on one of my gfs2 volumes so it looked the same, but I got an error message rather than a kernel panic.

Steps to Reproduce:
1.
2.
3.

Actual results:
Kernel panic

Expected results:
Error message

Additional info:
GFS2 (built May 30 2008 16:40:57) installed
BUG: unable to handle kernel NULL pointer dereference at 000000000000082c
IP: [<ffffffff804738e5>] _spin_lock_irq+0x6/0x16
PGD 11e829067 PUD 11dcdf067 PMD 0
Oops: 0002 [1] SMP
CPU 2
Modules linked in: gfs2 autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 cpufreq_ondemand dm_multipath sbs sbshc battery acpi_memhotplugd
Pid: 3904, comm: mount.gfs2 Not tainted 2.6.26-rc2 #2
RIP: 0010:[<ffffffff804738e5>]  [<ffffffff804738e5>] _spin_lock_irq+0x6/0x16
RSP: 0018:ffff81011d5efba0  EFLAGS: 00010092
RAX: 0000000000000100 RBX: 0000000000000828 RCX: 0000000000000001
RDX: ffff81021e9d8288 RSI: 0000000000000000 RDI: 000000000000082c
RBP: 0000000000000000 R08: 8000000000000000 R09: ffff81011f5cd220
R10: ffff81021e88a400 R11: ffff81011f5cd220 R12: 0000000000000828
R13: 0000000000000000 R14: ffff81021e8b3c00 R15: 0000000000000002
FS:  00007fbaf75756e0(0000) GS:ffff81011fa876c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000000082c CR3: 000000021bc9a000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process mount.gfs2 (pid: 3904, threadinfo ffff81011d5ee000, task ffff81011d997040)
Stack:  ffffffff804736e9 ffff81011f47a460 0000000000000007 ffff81011f47a280
 0000000000000282 0000000000000000 0000000000000000 0000000000000000
 ffffffffa039c639 ffff81011f5d3a40 ffffffff8029b896 ffff81011f5cd220
Call Trace:
 [<ffffffff804736e9>] __down_write_nested+0x12/0x8b
 [<ffffffffa039c639>] :gfs2:__gfs2_log_flush+0x1f/0x43a
 [<ffffffff8029b896>] d_kill+0x2e/0x43
 [<ffffffffa03a5baa>] :gfs2:gfs2_sync_fs+0x1a/0x1e
 [<ffffffff802c46c1>] vfs_quota_off+0x450/0x53e
 [<ffffffffa03a353f>] :gfs2:fill_super+0x0/0x731
 [<ffffffff8028dc0c>] deactivate_super+0x50/0x78
 [<ffffffff8028e2b3>] get_sb_bdev+0x10f/0x145
 [<ffffffffa03a2631>] :gfs2:gfs2_get_sb+0x13/0x2f
 [<ffffffff8028dcc7>] vfs_kern_mount+0x93/0x11b
 [<ffffffff8028dda2>] do_kern_mount+0x43/0xdc
 [<ffffffff802a23d1>] do_new_mount+0x5b/0x94
 [<ffffffff802a25c7>] do_mount+0x1bd/0x1e7
 [<ffffffff8026930a>] __alloc_pages_internal+0xe2/0x3c2
 [<ffffffff802a267b>] sys_mount+0x8a/0xcf
 [<ffffffff8020bee2>] tracesys+0xd5/0xda

Code: dc ff fe 07 48 8b 3c 24 e9 2e 3a dc ff 9c 58 fa ba 00 01 00 00 f0 66 0f c1 17 38 f2 74 06 f3 90 8a 17 eb f6 c3 fa b8 00 01 00 00
RIP  [<ffffffff804738e5>] _spin_lock_irq+0x6/0x16
 RSP <ffff81011d5efba0>
CR2: 000000000000082c
The metadata file is just over 20MB: too big to attach.
Created attachment 308727
Proposed patch to fix the problem

This started with a not-too-improbable mount failure: the locking protocol was never set back to its proper "lock_dlm" after the system was rebooted in the middle of a gfs2_fsck. That left a (purposely) invalid locking protocol in the superblock, which caused an error the next time the file system was mounted.

When there's an error mounting, vfs calls DQUOT_OFF, which calls vfs_quota_off, which calls gfs2_sync_fs. Next, gfs2_sync_fs calls gfs2_log_flush, passing s_fs_info. But because of the mount error, s_fs_info had already been set to NULL, and so we get the kernel oops.

My solution in this patch is to test s_fs_info for NULL before passing it. I tested this patch and it fixes the problem. I will post it to cluster-devel shortly.

I believe the problem was caused by changes in what the DQUOT_OFF macro does in newer kernels. That's why I couldn't recreate the problem on a RHEL kernel. I don't believe this affects RHEL.
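For readers without the attachment, here is a minimal user-space sketch of the kind of guard described above. It is not the attached patch: the `mock_*` types and functions are hypothetical stand-ins for the superblock, the GFS2 private data, gfs2_log_flush and gfs2_sync_fs, and the `wait` argument is an assumption about the sync call's signature. The assert in the mock flush stands in for the NULL dereference that produced the oops.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the GFS2 per-mount data; not the kernel type. */
struct mock_sbd {
    int flush_count;            /* how many times the log was "flushed" */
};

/* Hypothetical stand-in for the VFS superblock. */
struct mock_super_block {
    void *s_fs_info;            /* NULL after a failed mount in this model */
};

/* Models the log flush: it touches fields of sdp, so sdp must be valid. */
static void mock_log_flush(struct mock_sbd *sdp)
{
    assert(sdp != NULL);        /* in the kernel this would be the oops */
    sdp->flush_count++;
}

/* Models the guarded sync path: skip the flush if there is no private data. */
static int mock_sync_fs(struct mock_super_block *sb, int wait)
{
    if (wait && sb->s_fs_info != NULL)
        mock_log_flush(sb->s_fs_info);
    return 0;
}
```

With the guard in place, calling `mock_sync_fs` on a superblock whose `s_fs_info` is NULL simply returns 0 without flushing, which is the behaviour wanted on the failed-mount cleanup path.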
This patch is now posted to the -nmw upstream git tree for GFS2, so I'm closing this bug as UPSTREAM.