Bug 508806 - GFS2 panics while shrinking the glock cache.
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Ben Marzinski
QA Contact: Cluster QE
Depends On:
Reported: 2009-06-29 20:30 EDT by Ben Marzinski
Modified: 2009-09-03 10:12 EDT (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2009-09-02 04:17:32 EDT

Attachments
Fix for the panic (516 bytes, patch)
2009-06-29 20:34 EDT, Ben Marzinski

External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2009:1243 | normal | SHIPPED_LIVE | Important: Red Hat Enterprise Linux 5.4 kernel security and bug fix update | 2009-09-01 04:53:34 EDT

Description Ben Marzinski 2009-06-29 20:30:27 EDT
GFS2 is occasionally having panics like the following:

13:34 <bmarson> BUG: unable to handle kernel paging request at virtual address 
13:34 <bmarson>  printing eip:
13:34 <bmarson> f8d63663
13:34 <bmarson> *pde = 37369001
13:34 <bmarson> Oops: 0002 [#1]
13:34 <bmarson> SMP 
13:34 <bmarson> last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
13:34 <bmarson> Modules linked in: lock_nolock gfs2 gfs(U) dlm configfs ext4 
                jbd2 crc16 autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc 
                ipv6 xfrm_nalgo crypto_api acpi_cpufreq dm_multipath scsi_dh 
                video hwmon backlight sbs i2c_ec i2c_core button battery 
                asus_acpi ac parport_pc lp parport sg i5000_edac edac_mc pcspkr 
                serio_raw bnx2 dm_raid45 dm_message dm_region_hash dm_mem_cache 
                dm_snapshot dm_zero dm_mirror dm_log dm_mod
13:35 <bmarson>  mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod 
                ext3 jbd uhci_hcd ohci_hcd ehci_hcd
13:35 <bmarson> CPU:    0
13:35 <bmarson> EIP:    0060:[<f8d63663>]    Tainted: G      VLI
13:35 <bmarson> EFLAGS: 00010206   (2.6.18-152.el5PAE #1) 
13:35 <bmarson> EIP is at gfs2_glock_put+0x2e/0x137 [gfs2]
13:35 <bmarson> eax: 00100100   ebx: eb306344   ecx: c813e940   edx: 00200200
13:35 <bmarson> esi: eb306344   edi: 00000001   ebp: 00000080   esp: f7f96f00
13:35 <bmarson> ds: 007b   es: 007b   ss: 0068
13:35 <bmarson> Process kswapd0 (pid: 223, ti=f7f96000 task=f7fa3000 
13:35 <bmarson> Stack: eb3063a0 eb306344 f8d64936 000055ed eb3063a0 e58d0ee0 
                00018574 e29cd1e0 
13:35 <bmarson>        000000dd 000000d0 c045d096 01241700 00000000 01241700 
                00028430 00000180 
13:35 <bmarson>        00000000 00000002 00000002 c068b580 c0689080 c045d421 
                00000000 0000000b 
13:35 <bmarson> Call Trace:
13:35 <bmarson>  [<f8d64936>] gfs2_shrink_glock_memory+0x128/0x1a6 [gfs2]
13:35 <bmarson>  [<c045d096>] shrink_slab+0xd3/0x13c
13:35 <bmarson>  [<c045d421>] kswapd+0x2a6/0x3ab
13:35 <bmarson>  [<c0434d17>] autoremove_wake_function+0x0/0x2d
13:35 <bmarson>  [<c045d17b>] kswapd+0x0/0x3ab
13:35 <bmarson>  [<c0434c55>] kthread+0xc0/0xeb
13:35 <bmarson>  [<c0434b95>] kthread+0x0/0xeb
13:35 <bmarson>  [<c0405c53>] kernel_thread_helper+0x7/0x10
13:35 <bmarson>  =======================
13:35 <bmarson> Code: c3 8b 40 2c 25 ff 0f 00 00 8d 04 85 40 b6 da f8 e8 53 36 
                8b c7 f0 ff 4b 18 0f 94 c0 84 c0 0f 84 b4 00 00 00 8b 03 8b 53 
                04 85 c0 <89> 02 74 03 89 50 04 8b 43 2c c7 03 00 01 10 00 c7 
                43 04 00 02 
13:35 <bmarson> EIP: [<f8d63663>] gfs2_glock_put+0x2e/0x137 [gfs2] SS:ESP 
13:36 <bmarson>  <0>Kernel panic - not syncing: Fatal exception

gfs2_glock_put+0x2e is at the


call in gfs2_glock_put(). Specifically, it is dereferencing the prev pointer. Looking at the registers, that pointer's value is 0x00200200 (LIST_POISON2).
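
To illustrate why a poisoned prev pointer points at a repeated removal, here is a minimal user-space sketch of a list node whose delete routine poisons its links. The names (`struct node`, `node_add_head`, `node_del`, `node_is_poisoned`) are hypothetical and only loosely modeled on the kernel's hash-list helpers; the two constants are the 32-bit values that 2.6.18-era kernels define as LIST_POISON1 and LIST_POISON2, which match eax and edx in the oops above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model; only the two poison values come from the kernel. */
#define POISON_NEXT  ((struct node *)(uintptr_t)0x00100100)   /* LIST_POISON1 */
#define POISON_PPREV ((struct node **)(uintptr_t)0x00200200)  /* LIST_POISON2 */

struct node {
    struct node *next;    /* next entry, or NULL at the end of the chain */
    struct node **pprev;  /* address of the pointer that points at this node */
};

/* Insert n at the front of the chain rooted at *head. */
static void node_add_head(struct node *n, struct node **head)
{
    assert(head != 0);
    struct node *first = *head;

    n->next = first;
    if (first)
        first->pprev = &n->next;
    *head = n;
    n->pprev = head;
}

/* Unlink n and poison its links so that later misuse is recognizable. */
static void node_del(struct node *n)
{
    struct node *next = n->next;
    struct node **pprev = n->pprev;

    *pprev = next;            /* this store faults if pprev is already poisoned */
    if (next)
        next->pprev = pprev;
    n->next = POISON_NEXT;
    n->pprev = POISON_PPREV;
}

/* True if n has already been removed by node_del(). */
static int node_is_poisoned(const struct node *n)
{
    return n->next == POISON_NEXT && n->pprev == POISON_PPREV;
}
```

Calling `node_del()` a second time on the same node would write through 0x00200200, which is the kind of faulting store the oops shows. In this model, `node_is_poisoned()` is the check a caller could use to detect that state before touching the node again.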

What appears to be happening is this:

-Glock X gets put on the lru list, and then unlocked.
-Process A in gfs2_shrink_glock_memory starts checking for locks to demote, locking the lru_lock.
-Process B in gfs2_glock_put() drops the final reference to Glock X, removes it from the hb_list, and blocks waiting for the lru_lock.
-Process A checks Glock X, drops the lru_lock, grabs a reference to the glock, which is in the process of being freed, realizes that it doesn't need to demote it, and starts to free the glock in gfs2_glock_put() itself.
-Bad things happen.
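
The interleaving above can be replayed deterministically in a small single-threaded model. This is only a sketch under stated assumptions: `struct model_glock` and the `model_*` functions are hypothetical stand-ins, not the GFS2 code, and the locks are represented by comments rather than real locking. It shows that taking a reference on a glock whose count has already reached zero makes the final-put path run twice.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a glock. */
struct model_glock {
    int ref;         /* reference count */
    bool hashed;     /* still on the hash bucket list? */
    bool on_lru;     /* still on the LRU list? */
    int teardowns;   /* how many times the final-put path started */
    int bad_unhash;  /* removals attempted on an already removed glock */
};

/* First half of a put: drop a reference and, if it was the last, unhash.
 * Returns true when the caller would go on to take the lru_lock and free. */
static bool model_put_begin(struct model_glock *gl)
{
    assert(gl->ref > 0);
    if (--gl->ref != 0)
        return false;
    gl->teardowns++;
    if (gl->hashed)
        gl->hashed = false;
    else
        gl->bad_unhash++;  /* in the kernel: a store through a poisoned pointer */
    return true;
}

/* Unfixed shrinker step: takes a reference without looking at the count. */
static void model_shrinker_unfixed(struct model_glock *gl)
{
    /* lru_lock is held here */
    gl->on_lru = false;
    gl->ref++;             /* revives the count from 0 to 1 */
    /* lru_lock is dropped; the shrinker decides no demote is needed */
    model_put_begin(gl);   /* 1 -> 0: a second "final" put */
}

/* Replays the sequence from the description and returns the end state. */
static struct model_glock model_replay_race(void)
{
    struct model_glock gl = { .ref = 1, .hashed = true, .on_lru = true };

    model_put_begin(&gl);         /* Process B: final put, unhash, then block */
    model_shrinker_unfixed(&gl);  /* Process A: scan while B waits for the lock */
    return gl;
}
```

In the end state returned by `model_replay_race()`, `teardowns` is 2 and `bad_unhash` is 1: the glock is torn down twice, and the second removal from the hash list is the one that would hit the poisoned pointer.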
Comment 1 RHEL Product and Program Management 2009-06-29 20:32:26 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
Comment 2 Ben Marzinski 2009-06-29 20:34:14 EDT
Created attachment 349891 [details]
Fix for the panic

This fix makes gfs2_shrink_glock_memory check that the reference count on the glock isn't zero before unlocking the lru lock.  If it is zero, the glock is simply skipped.
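
As a rough illustration of the check described here (not the attached patch itself), the sketch below scans a small array standing in for the LRU list and skips any entry whose reference count is already zero. `struct lru_entry` and `model_scan_lru` are hypothetical names, and the lru lock is only indicated by comments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a glock sitting on the LRU list. */
struct lru_entry {
    int ref;               /* reference count */
    int held_by_shrinker;  /* set when the scan took its own reference */
};

/* Scan n entries and return how many the shrinker took a reference on.
 * Entries whose count is already zero are left to the put path that is
 * in the middle of freeing them. */
static size_t model_scan_lru(struct lru_entry *lru, size_t n)
{
    size_t taken = 0;

    assert(lru != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* the lru lock would still be held at this point */
        if (lru[i].ref == 0)
            continue;      /* a final put is in flight: skip this entry */
        lru[i].ref++;      /* safe: the count cannot already be zero here */
        lru[i].held_by_shrinker = 1;
        taken++;
        /* the lru lock would be dropped only after the reference is held */
    }
    return taken;
}
```

With counts of 1, 0 and 2, the scan takes references on the first and third entries and leaves the second untouched, so the shrinker never revives a glock that is already being freed.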
Comment 3 Steve Whitehouse 2009-06-30 10:14:48 EDT
ok, looks like a good fix. can you post that to cluster-devel for upstream?
Comment 4 Barry Marson 2009-06-30 15:48:57 EDT
Just for reference, the workload that has generated this panic 3 times, each on i386, x86_64, and ia64, is the RHTS version of the postmark test.

Comment 5 Ben Marzinski 2009-06-30 16:37:42 EDT
Posted to cluster-devel and rhkernel-list
Comment 7 Don Zickus 2009-07-07 11:06:01 EDT
in kernel-2.6.18-157.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 9 Barry Marson 2009-07-14 17:29:26 EDT
We actually didn't see this issue on the -156 kernel.  Preliminary tests on -157 showed no problems either.  Doing a full perf. regression on the -158 kernel starting this evening.  Will update the status when I have the most recent results.

Comment 10 Jan Tluka 2009-07-20 11:32:57 EDT
Patch is in -158.el5. Adding SanityOnly.
Comment 11 Barry Marson 2009-07-23 11:29:59 EDT
Tested the -158 kernel on multiple systems and saw no panics.

Comment 13 errata-xmlrpc 2009-09-02 04:17:32 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

