Red Hat Bugzilla – Bug 508806
GFS2 panics while shrinking the glock cache.
Last modified: 2009-09-03 10:12:54 EDT
GFS2 is occasionally having panics like the following:
13:34 <bmarson> BUG: unable to handle kernel paging request at virtual address
13:34 <bmarson> printing eip:
13:34 <bmarson> f8d63663
13:34 <bmarson> *pde = 37369001
13:34 <bmarson> Oops: 0002 [#1]
13:34 <bmarson> SMP
13:34 <bmarson> last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
13:34 <bmarson> Modules linked in: lock_nolock gfs2 gfs(U) dlm configfs ext4
jbd2 crc16 autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc
ipv6 xfrm_nalgo crypto_api acpi_cpufreq dm_multipath scsi_dh
video hwmon backlight sbs i2c_ec i2c_core button battery
asus_acpi ac parport_pc lp parport sg i5000_edac edac_mc pcspkr
serio_raw bnx2 dm_raid45 dm_message dm_region_hash dm_mem_cache
dm_snapshot dm_zero dm_mirror dm_log dm_mod
13:35 <bmarson> mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod
ext3 jbd uhci_hcd ohci_hcd ehci_hcd
13:35 <bmarson> CPU: 0
13:35 <bmarson> EIP: 0060:[<f8d63663>] Tainted: G VLI
13:35 <bmarson> EFLAGS: 00010206 (2.6.18-152.el5PAE #1)
13:35 <bmarson> EIP is at gfs2_glock_put+0x2e/0x137 [gfs2]
13:35 <bmarson> eax: 00100100 ebx: eb306344 ecx: c813e940 edx: 00200200
13:35 <bmarson> esi: eb306344 edi: 00000001 ebp: 00000080 esp: f7f96f00
13:35 <bmarson> ds: 007b es: 007b ss: 0068
13:35 <bmarson> Process kswapd0 (pid: 223, ti=f7f96000 task=f7fa3000
13:35 <bmarson> Stack: eb3063a0 eb306344 f8d64936 000055ed eb3063a0 e58d0ee0
13:35 <bmarson> 000000dd 000000d0 c045d096 01241700 00000000 01241700
13:35 <bmarson> 00000000 00000002 00000002 c068b580 c0689080 c045d421
13:35 <bmarson> Call Trace:
13:35 <bmarson> [<f8d64936>] gfs2_shrink_glock_memory+0x128/0x1a6 [gfs2]
13:35 <bmarson> [<c045d096>] shrink_slab+0xd3/0x13c
13:35 <bmarson> [<c045d421>] kswapd+0x2a6/0x3ab
13:35 <bmarson> [<c0434d17>] autoremove_wake_function+0x0/0x2d
13:35 <bmarson> [<c045d17b>] kswapd+0x0/0x3ab
13:35 <bmarson> [<c0434c55>] kthread+0xc0/0xeb
13:35 <bmarson> [<c0434b95>] kthread+0x0/0xeb
13:35 <bmarson> [<c0405c53>] kernel_thread_helper+0x7/0x10
13:35 <bmarson> =======================
13:35 <bmarson> Code: c3 8b 40 2c 25 ff 0f 00 00 8d 04 85 40 b6 da f8 e8 53 36
8b c7 f0 ff 4b 18 0f 94 c0 84 c0 0f 84 b4 00 00 00 8b 03 8b 53
04 85 c0 <89> 02 74 03 89 50 04 8b 43 2c c7 03 00 01 10 00 c7
43 04 00 02
13:35 <bmarson> EIP: [<f8d63663>] gfs2_glock_put+0x2e/0x137 [gfs2] SS:ESP
13:36 <bmarson> <0>Kernel panic - not syncing: Fatal exception
gfs2_glock_put+0x2e is at the
call in gfs2_glock_put(). Specifically, it is dereferencing the prev pointer. Looking at the registers, that pointer's value is 0x00200200 (LIST_POISON2).
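For anyone unfamiliar with list poisoning, here is a minimal userspace sketch of what the register dump suggests. It assumes the call in question is a hash-list deletion (the name of the call is missing above, so this is a guess based on the mention of the prev pointer and the hb_list). The structures imitate the kernel's hlist helpers but are redefined here for illustration; node_is_poisoned() is a hypothetical helper, not a kernel function. After a first delete, the node's pointers hold the poison values (eax: 00100100 and edx: 00200200 in the oops), so a second delete of the same node would write through 0x00200200.

```c
#include <assert.h>
#include <stddef.h>

/* Poison values; 0x00200200 is the LIST_POISON2 value seen in edx above. */
#define LIST_POISON1 ((void *)0x00100100)
#define LIST_POISON2 ((void *)0x00200200)

/* Userspace imitation of the kernel's hash-list node and head. */
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
    assert(n != NULL && h != NULL);
    struct hlist_node *first = h->first;

    n->next = first;
    if (first)
        first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/* Hypothetical helper: true if the node was already deleted once. */
static int node_is_poisoned(const struct hlist_node *n)
{
    return n->pprev == LIST_POISON2;
}

static void hlist_del(struct hlist_node *n)
{
    struct hlist_node *next = n->next;
    struct hlist_node **pprev = n->pprev;

    *pprev = next;            /* faults here if pprev is LIST_POISON2 */
    if (next)
        next->pprev = pprev;
    n->next = LIST_POISON1;   /* mark the node as deleted */
    n->pprev = LIST_POISON2;
}
```

Calling hlist_del() twice on the same node in this sketch would reproduce the faulting store, which is consistent with "Oops: 0002" (a write fault) in the trace.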
What appears to be happening is this:
-Glock X gets put on the lru list, and then unlocked.
-Process A in gfs2_shrink_glock_memory starts checking for locks to demote, locking the lru_lock.
-Process B in gfs2_glock_put() drops the final reference to Glock X, removes it from the hb_list, and blocks waiting for the lru_lock.
-Process A checks Glock X, drops the lru_lock, grabs a reference to the glock which is in the process of being freed, realizes that it doesn't need to demote it, and starts to free the glock in gfs2_glock_put() itself.
-Bad things happen.
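The sequence above can be replayed in a single thread to make the double free visible. This is only a sketch: struct glock_model and the model_* functions are hypothetical stand-ins, not the gfs2 code, and the lru_lock is represented only by the order of the calls.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a glock; not the kernel's struct. */
struct glock_model {
    int  ref;          /* reference count */
    bool hashed;       /* still on the hash bucket list */
    bool on_lru;       /* still on the LRU list */
    int  unhash_calls; /* times it was removed from the hash list */
};

/* Drop one reference; on the final put, remove the glock from the hash
 * list. Returns true if this call was a final put. */
static bool model_put(struct glock_model *gl)
{
    assert(gl->ref > 0);
    if (--gl->ref != 0)
        return false;
    gl->hashed = false;
    gl->unhash_calls++;   /* a second unhash corresponds to the panic */
    return true;
}

/* Unfixed shrinker step: takes a reference without checking for zero. */
static void model_shrink_take_ref_unchecked(struct glock_model *gl)
{
    gl->on_lru = false;
    gl->ref++;            /* revives a glock that is already being freed */
}

/* Replay the interleaving; returns how often Glock X was unhashed. */
static int replay_race(void)
{
    struct glock_model x = {
        .ref = 1, .hashed = true, .on_lru = true, .unhash_calls = 0
    };

    /* Process B: final put, unhashes X, then blocks on the lru_lock. */
    model_put(&x);
    /* Process A: picks X off the LRU and takes a reference. */
    model_shrink_take_ref_unchecked(&x);
    /* Process A: decides no demote is needed and puts its reference. */
    model_put(&x);

    return x.unhash_calls;
}
```

In this model replay_race() reports two removals from the hash list for the same glock; the second one is the write through the poisoned pointer.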
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
Created attachment 349891
Fix for the panic
This fix makes gfs2_shrink_glock_memory check that the reference count on the glock isn't zero before unlocking the lru lock. If it is zero, the glock is simply skipped.
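As a rough illustration of the check described in this comment (not the attached patch itself), the scan can be modelled in userspace like this. struct lru_entry_model and shrink_scan_fixed() are hypothetical names, the loop body is assumed to run with the lru lock held, and leaving a skipped glock on the LRU for the process doing the final put is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a glock sitting on the LRU list. */
struct lru_entry_model {
    int  ref;     /* reference count */
    bool on_lru;  /* still on the LRU list */
};

/* Model of the fixed scan. The whole loop body is assumed to run while
 * the lru lock is held. Returns the number of glocks the shrinker took
 * a reference on. */
static size_t shrink_scan_fixed(struct lru_entry_model *lru, size_t n)
{
    size_t taken = 0;

    assert(lru != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct lru_entry_model *gl = &lru[i];

        if (!gl->on_lru)
            continue;
        /* The fix: a zero count means another process is in its final
         * put, so skip the glock and leave it to that process. */
        if (gl->ref == 0)
            continue;
        gl->on_lru = false;
        gl->ref++;   /* safe: the count was non-zero under the lock */
        taken++;
    }
    return taken;
}
```

With entries whose counts are 1, 0 and 2, the scan in this sketch takes references on the first and third and leaves the second untouched, so the glock being freed is never revived.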
ok, looks like a good fix. can you post that to cluster-devel for upstream?
Just for reference, the workload that has generated this panic 3 times, each on i386, x86_64, and ia64, is the RHTS version of the postmark test.
Posted to cluster-devel and rhkernel-list
You can download this test kernel from http://people.redhat.com/dzickus/el5
Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so. However feel free
to provide a comment indicating that this fix has been verified.
We actually didn't see this issue on the -156 kernel. Preliminary tests on -157 showed no problems either. Doing a full perf. regression on the -158 kernel starting this evening. Will update status when I have the most recent results.
Patch is in -158.el5. Adding SanityOnly.
Tested the -158 kernel on multiple systems and saw no panics.
An advisory has been issued which should help resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.