GFS2 is occasionally having panics like the following:

13:34 <bmarson> BUG: unable to handle kernel paging request at virtual address 00200200
13:34 <bmarson> printing eip:
13:34 <bmarson> f8d63663
13:34 <bmarson> *pde = 37369001
13:34 <bmarson> Oops: 0002 [#1]
13:34 <bmarson> SMP
13:34 <bmarson> last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
13:34 <bmarson> Modules linked in: lock_nolock gfs2 gfs(U) dlm configfs ext4 jbd2 crc16 autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ipv6 xfrm_nalgo crypto_api acpi_cpufreq dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi ac parport_pc lp parport sg i5000_edac edac_mc pcspkr serio_raw bnx2 dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod
13:35 <bmarson> mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
13:35 <bmarson> CPU: 0
13:35 <bmarson> EIP: 0060:[<f8d63663>] Tainted: G VLI
13:35 <bmarson> EFLAGS: 00010206 (2.6.18-152.el5PAE #1)
13:35 <bmarson> EIP is at gfs2_glock_put+0x2e/0x137 [gfs2]
13:35 <bmarson> eax: 00100100 ebx: eb306344 ecx: c813e940 edx: 00200200
13:35 <bmarson> esi: eb306344 edi: 00000001 ebp: 00000080 esp: f7f96f00
13:35 <bmarson> ds: 007b es: 007b ss: 0068
13:35 <bmarson> Process kswapd0 (pid: 223, ti=f7f96000 task=f7fa3000 task.ti=f7f96000)
13:35 <bmarson> Stack: eb3063a0 eb306344 f8d64936 000055ed eb3063a0 e58d0ee0 00018574 e29cd1e0
13:35 <bmarson> 000000dd 000000d0 c045d096 01241700 00000000 01241700 00028430 00000180
13:35 <bmarson> 00000000 00000002 00000002 c068b580 c0689080 c045d421 00000000 0000000b
13:35 <bmarson> Call Trace:
13:35 <bmarson> [<f8d64936>] gfs2_shrink_glock_memory+0x128/0x1a6 [gfs2]
13:35 <bmarson> [<c045d096>] shrink_slab+0xd3/0x13c
13:35 <bmarson> [<c045d421>] kswapd+0x2a6/0x3ab
13:35 <bmarson> [<c0434d17>] autoremove_wake_function+0x0/0x2d
13:35 <bmarson> [<c045d17b>] kswapd+0x0/0x3ab
13:35 <bmarson> [<c0434c55>] kthread+0xc0/0xeb
13:35 <bmarson> [<c0434b95>] kthread+0x0/0xeb
13:35 <bmarson> [<c0405c53>] kernel_thread_helper+0x7/0x10
13:35 <bmarson> =======================
13:35 <bmarson> Code: c3 8b 40 2c 25 ff 0f 00 00 8d 04 85 40 b6 da f8 e8 53 36 8b c7 f0 ff 4b 18 0f 94 c0 84 c0 0f 84 b4 00 00 00 8b 03 8b 53 04 85 c0 <89> 02 74 03 89 50 04 8b 43 2c c7 03 00 01 10 00 c7 43 04 00 02
13:35 <bmarson> EIP: [<f8d63663>] gfs2_glock_put+0x2e/0x137 [gfs2] SS:ESP 0068:f7f96f00
13:36 <bmarson> <0>Kernel panic - not syncing: Fatal exception

gfs2_glock_put+0x2e is at the hlist_del(&gl->gl_list); call in gfs2_glock_put(). Specifically, it is dereferencing the prev pointer (pprev). Looking at the registers, that pointer's value is 0x00200200 (LIST_POISON2), and eax, which holds the next pointer, is 0x00100100 (LIST_POISON1). Both are the values hlist_del() leaves behind, so the glock had already been removed from its hash list once.

What appears to be happening is this:

- Glock X gets put on the lru list, and then unlocked.
- Process A, in gfs2_shrink_glock_memory(), starts checking for locks to demote, taking the lru_lock.
- Process B, in gfs2_glock_put(), drops the final reference to Glock X, removes it from the hb_list, and blocks waiting for the lru_lock.
- Process A checks Glock X, drops the lru_lock, grabs a reference to the glock (which is in the process of being freed), realizes that it doesn't need to demote it, and starts to free the glock in gfs2_glock_put() itself.
- Bad things happen: the glock is removed from its hash list a second time, and hlist_del() writes through the poisoned pointer.
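For anyone reading the register dump, here is a minimal user-space sketch of why a second hlist_del() on the same node faults at 0x00200200. The structures and poison values mirror the kernel's list.h and poison.h for this era, but this is a simplified model and not the kernel source; hlist_node_poisoned() is a hypothetical helper added only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Poison values written into a node by hlist_del(); these match the
 * eax/edx values in the oops above. */
#define LIST_POISON1 ((void *)0x00100100)
#define LIST_POISON2 ((void *)0x00200200)

struct hlist_node {
    struct hlist_node *next;
    struct hlist_node **pprev;   /* address of the pointer that points at us */
};

struct hlist_head {
    struct hlist_node *first;
};

/* Insert n at the front of the list headed by h. */
static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
    struct hlist_node *first = h->first;

    n->next = first;
    if (first)
        first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/* Unlink n, then poison it. Calling this twice on the same node makes the
 * "*pprev = next" store write through LIST_POISON2, which is the faulting
 * instruction in the oops. */
static void hlist_del(struct hlist_node *n)
{
    struct hlist_node *next = n->next;
    struct hlist_node **pprev = n->pprev;

    *pprev = next;
    if (next)
        next->pprev = pprev;
    n->next = LIST_POISON1;
    n->pprev = LIST_POISON2;
}

/* Hypothetical helper: true if n carries the marks of a prior hlist_del(). */
static int hlist_node_poisoned(const struct hlist_node *n)
{
    return n->next == LIST_POISON1 && n->pprev == LIST_POISON2;
}
```

After one hlist_del(), the node's next and pprev hold exactly the values seen in eax and edx, which is what you would expect if both Process A and Process B end up running the final-put path on the same glock.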
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 349891
Fix for the panic

This fix makes gfs2_shrink_glock_memory() check that the reference count on the glock isn't zero before unlocking the lru lock. If it is zero, the glock is simply skipped.
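To illustrate the idea (this is not the attached patch), here is a rough user-space model of a shrinker scan that skips glocks whose reference count has already hit zero. Everything in it is an assumption made for the sketch: struct glock, its ref and lru_next fields, and both function names are hypothetical stand-ins, and the caller is assumed to hold the equivalent of the lru_lock. The sketch also uses an increment-unless-zero loop so the check and the increment cannot be separated, which goes slightly beyond the plain zero check described above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct gfs2_glock; only what the sketch needs. */
struct glock {
    atomic_int ref;            /* models the glock reference count */
    struct glock *lru_next;    /* models the glock's link on the lru list */
};

/* Take a reference only if the count is still non-zero.
 * Returns 1 if a reference was taken, 0 if the glock is already dying. */
static int get_ref_unless_zero(struct glock *gl)
{
    int old = atomic_load(&gl->ref);

    while (old != 0) {
        /* On failure, old is reloaded with the current count. */
        if (atomic_compare_exchange_weak(&gl->ref, &old, old + 1))
            return 1;
    }
    return 0;
}

/* Walk the lru list (caller is assumed to hold the lru lock).
 * A glock with a zero count is being freed by someone else, so it is
 * skipped instead of being handed to the put path a second time.
 * Returns the number of glocks a reference was taken on. */
static int scan_lru_locked(struct glock *lru_head)
{
    int held = 0;

    for (struct glock *gl = lru_head; gl; gl = gl->lru_next) {
        if (!get_ref_unless_zero(gl))
            continue;
        held++;
    }
    return held;
}
```

In the scenario from the description, Glock X would have a count of zero by the time Process A looks at it, so the scan would leave it for Process B to finish freeing.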
ok, looks like a good fix. can you post that to cluster-devel for upstream?
Just for reference, the workload that has generated this panic 3 times, each on i386, x86_64, and ia64, is the RHTS version of the postmark test.

Barry
Posted to cluster-devel and rhkernel-list
in kernel-2.6.18-157.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
We actually didn't see this issue on the -156 kernel. Preliminary tests on -157 showed no problems either. Doing a full perf. regression on the -158 kernel starting this evening. Will update the status when I have the most recent results.

Barry
Patch is in -158.el5. Adding SanityOnly.
Tested the -158 kernel on multiple systems and saw no panics.

Barry
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html