Bug 242720 - GFS panic due to inode cache corruption
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: gfs
Hardware: All
OS: Linux
Priority: high
Severity: high
Assigned To: Wendy Cheng
QA Contact: GFS Bugs
Depends On:
Blocks: 243146
Reported: 2007-06-05 10:53 EDT by Wendy Cheng
Modified: 2010-01-11 22:16 EST
CC List: 2 users

Fixed In Version: RHBA-2007-0998
Doc Type: Bug Fix
Last Closed: 2007-11-21 16:14:27 EST

Attachments: None
Description Wendy Cheng 2007-06-05 10:53:56 EDT
Description of problem:

This is a clone of bugzilla 236565 with a rewritten problem statement
to make it easier to find by search. There is a race between the GFS
lookup code and the VM inode cache reclaim logic that opens a window in
which GFS can corrupt its own inode cache. The race is rare and only
occurs when the system is under memory pressure, so that the VM starts
freeing its inode cache entries. Depending on who gets the freed memory,
the result is unpredictable. In the case where this bug was found (RHEL5
NFS benchmark runs), the kernel panicked with the following stack backtrace:

[<ffffffff800100b7>] generic_file_buffered_write+0x496/0x6a3
[<ffffffff800641fa>] _spin_unlock_irq+0x9/0xc
[<ffffffff8000e2dd>] current_fs_time+0x3b/0x40
[<ffffffff80062350>] wait_for_completion+0x99/0xa2
[<ffffffff80016476>] __generic_file_aio_write_nolock+0x370/0x3bb
[<ffffffff80012a2f>] poison_obj+0x26/0x2f
[<ffffffff800bba91>] generic_file_aio_write_nolock+0x20/0x6c
[<ffffffff800bbeaa>] generic_file_write_nolock+0x8f/0xa8
[<ffffffff8009d3ee>] autoremove_wake_function+0x0/0x2e
[<ffffffff88641c8a>] :gfs:gfs_trans_begin_i+0x13c/0x1b2
[<ffffffff88634c50>] :gfs:do_write_buf+0x456/0x696
[<ffffffff88634452>] :gfs:walk_vm+0x10e/0x311
[<ffffffff886347fa>] :gfs:do_write_buf+0x0/0x696
[<ffffffff88634701>] :gfs:__gfs_write+0xac/0xc6
[<ffffffff800d3903>] do_readv_writev+0x198/0x295
[<ffffffff88634744>] :gfs:gfs_write+0x0/0x8
[<ffffffff88635ce8>] :gfs:gfs_open+0x12c/0x15e
[<ffffffff884e7709>] :nfsd:nfsd_vfs_write+0xf2/0x2e1
[<ffffffff88635bbc>] :gfs:gfs_open+0x0/0x15e
[<ffffffff8001e7c0>] __dentry_open+0x104/0x1e2
[<ffffffff884e7f89>] :nfsd:nfsd_write+0xb5/0xd5
[<ffffffff884ee778>] :nfsd:nfsd3_proc_write+0xea/0x109
[<ffffffff884e40e9>] :nfsd:nfsd_dispatch+0xd7/0x198
[<ffffffff884154f3>] :sunrpc:svc_process+0x42e/0x6ec
[<ffffffff80063cc1>] __down_read+0x34/0x96
[<ffffffff884e4471>] :nfsd:nfsd+0x0/0x32b
[<ffffffff884e4626>] :nfsd:nfsd+0x1b5/0x32b
[<ffffffff8005d665>] child_rip+0xa/0x11
[<ffffffff884e4471>] :nfsd:nfsd+0x0/0x32b
[<ffffffff884e4471>] :nfsd:nfsd+0x0/0x32b
[<ffffffff8005d65b>] child_rip+0x0/0x11

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
Actual results:

Expected results:

Additional info:
NFSD (which calls into the lookup code frequently) and the GFS glock
trimming logic (which invokes the inode cache release logic at regular
intervals) make this bug more likely to show up.
Comment 1 Wendy Cheng 2007-06-05 11:01:07 EDT
I should have said that this happens with all versions of the GFS1 code
(I haven't checked GFS2 yet).

The bug lurks at the end of the lookup code (gfs_lookup and gfs_get_dentry),
where the inode glock is released prematurely. This creates a window in the
bottom portion of the logic in which gfs_iget can update an associated GFS
inode structure that has already been freed. Depending on who gets the
reallocated memory, unpredictable corruption occurs. In the RHEL5 case, it
corrupts a TCP buffer head, which ends up overrunning the NFSD kernel stack.
An almost identical report was found at:
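The ordering problem described in Comment 1 can be illustrated with a small single-threaded C model. This is a hypothetical sketch, not the actual GFS fix: `struct model_inode`, `glock_acquire`/`glock_release`, `reclaim`, and both lookup functions are invented names, the glock is modeled as a plain flag, and "freeing" just sets a flag so the effect can be observed. The reclaim interleaving is replayed deterministically through a callback that runs in the window after the lock is dropped. It shows why releasing the lock before pinning the inode lets reclaim free it first, while taking the reference under the lock closes the window.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a cached GFS inode (not the real struct gfs_inode). */
struct model_inode {
    int refcount; /* references pinned by lookups */
    int freed;    /* set by reclaim instead of calling free(), so the model can observe it */
};

/* Models the inode glock as a flag; the interleaving is scripted, so no real lock is needed. */
static int glock_held;

static void glock_acquire(void) { assert(!glock_held); glock_held = 1; }
static void glock_release(void) { assert(glock_held); glock_held = 0; }

/* Models VM inode cache reclaim: under the glock, free only entries nobody references. */
static void reclaim(struct model_inode *ip)
{
    glock_acquire();
    if (ip->refcount == 0)
        ip->freed = 1;
    glock_release();
}

/* Hook that runs in the window after the glock is dropped (e.g. reclaim on another CPU). */
typedef void (*window_fn)(struct model_inode *);

/* Return values for both lookups:
 * -1  inode already freed, lookup fails cleanly
 *  0  caller holds a reference to a live inode
 *  1  caller "holds" a reference to an inode that was freed (the corruption case) */

/* Buggy ordering: the glock is released before the inode is pinned. */
static int lookup_early_release(struct model_inode *ip, window_fn window)
{
    glock_acquire();
    if (ip->freed) {
        glock_release();
        return -1;
    }
    glock_release();          /* released prematurely */
    if (window)
        window(ip);           /* reclaim can free the inode here */
    ip->refcount++;           /* in the kernel this write would land on freed memory */
    return ip->freed ? 1 : 0;
}

/* Fixed ordering: take the reference while still holding the glock. */
static int lookup_pin_then_release(struct model_inode *ip, window_fn window)
{
    glock_acquire();
    if (ip->freed) {
        glock_release();
        return -1;
    }
    ip->refcount++;           /* pinned: reclaim now skips this inode */
    glock_release();
    if (window)
        window(ip);           /* reclaim runs here but sees refcount > 0 */
    return ip->freed ? 1 : 0;
}
```

For example, calling `lookup_early_release(&ip, reclaim)` on a fresh inode returns 1 (the reference lands on a freed object), while `lookup_pin_then_release(&ip, reclaim)` returns 0 and leaves the inode's refcount at 1.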

Comment 5 Benjamin Kahn 2007-06-07 11:14:55 EDT
This bug has been copied as z-stream (EUS) bug #243146 and now must be resolved
in the current update release; blocker flag set.
Comment 8 errata-xmlrpc 2007-11-21 16:14:27 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

