Bug 242720

Summary: GFS panic due to inode cache corruption
Product: [Retired] Red Hat Cluster Suite Reporter: Wendy Cheng <nobody+wcheng>
Component: gfs Assignee: Wendy Cheng <nobody+wcheng>
Status: CLOSED ERRATA QA Contact: GFS Bugs <gfs-bugs>
Severity: high Docs Contact:
Priority: high    
Version: 4 CC: cfeist, edamato
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHBA-2007-0998 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2007-11-21 21:14:27 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
Embargoed:
Bug Depends On:    
Bug Blocks: 243146    

Description Wendy Cheng 2007-06-05 14:53:56 UTC
Description of problem:

This is a clone of bugzilla 236565 with the problem statement rewritten
to make it easier to search for. There is a race between the GFS lookup
code and the VM inode cache reclaim logic that opens a window in which
GFS can corrupt the (GFS) inode cache. The occurrence is rare and only
happens when the system is under memory pressure such that the VM starts
to free its inode cache entries. Depending on who gets the freed memory,
the result is unpredictable. In the case where this bug was found (in
RHEL5 NFS benchmark runs), the kernel panicked with the following stack
backtrace:

[<ffffffff800100b7>] generic_file_buffered_write+0x496/0x6a3
[<ffffffff800641fa>] _spin_unlock_irq+0x9/0xc
[<ffffffff8000e2dd>] current_fs_time+0x3b/0x40
[<ffffffff80062350>] wait_for_completion+0x99/0xa2
[<ffffffff80016476>] __generic_file_aio_write_nolock+0x370/0x3bb
[<ffffffff80012a2f>] poison_obj+0x26/0x2f
[<ffffffff800bba91>] generic_file_aio_write_nolock+0x20/0x6c
[<ffffffff800bbeaa>] generic_file_write_nolock+0x8f/0xa8
[<ffffffff8009d3ee>] autoremove_wake_function+0x0/0x2e
[<ffffffff88641c8a>] :gfs:gfs_trans_begin_i+0x13c/0x1b2
[<ffffffff88634c50>] :gfs:do_write_buf+0x456/0x696
[<ffffffff88634452>] :gfs:walk_vm+0x10e/0x311
[<ffffffff886347fa>] :gfs:do_write_buf+0x0/0x696
[<ffffffff88634701>] :gfs:__gfs_write+0xac/0xc6
[<ffffffff800d3903>] do_readv_writev+0x198/0x295
[<ffffffff88634744>] :gfs:gfs_write+0x0/0x8
[<ffffffff88635ce8>] :gfs:gfs_open+0x12c/0x15e
[<ffffffff884e7709>] :nfsd:nfsd_vfs_write+0xf2/0x2e1
[<ffffffff88635bbc>] :gfs:gfs_open+0x0/0x15e
[<ffffffff8001e7c0>] __dentry_open+0x104/0x1e2
[<ffffffff884e7f89>] :nfsd:nfsd_write+0xb5/0xd5
[<ffffffff884ee778>] :nfsd:nfsd3_proc_write+0xea/0x109
[<ffffffff884e40e9>] :nfsd:nfsd_dispatch+0xd7/0x198
[<ffffffff884154f3>] :sunrpc:svc_process+0x42e/0x6ec
[<ffffffff80063cc1>] __down_read+0x34/0x96
[<ffffffff884e4471>] :nfsd:nfsd+0x0/0x32b
[<ffffffff884e4626>] :nfsd:nfsd+0x1b5/0x32b
[<ffffffff8005d665>] child_rip+0xa/0x11
[<ffffffff884e4471>] :nfsd:nfsd+0x0/0x32b
[<ffffffff884e4471>] :nfsd:nfsd+0x0/0x32b
[<ffffffff8005d65b>] child_rip+0x0/0x11

Additional info:
NFSD (which calls into the lookup code frequently) and the GFS glock
trimming logic (which invokes the inode cache release logic at a regular
time interval) are more likely to hit this bug.

Comment 1 Wendy Cheng 2007-06-05 15:01:07 UTC
I should have said that this happens with all versions of the GFS1 code
(I haven't checked GFS2 yet).

The bug lurks at the end of the lookup code (gfs_lookup and gfs_get_dentry),
where the inode glock is released prematurely. This creates a window in the
bottom portion of the logic in which gfs_iget can update an associated GFS
inode structure that has already been freed. Depending on who gets the
reallocated memory, unspecified corruption occurs. In the RHEL5 case, it
corrupts a TCP buffer head, which ends up overrunning the NFSD kernel stack.
An almost identical report was found at:

http://www.redhat.com/archives/linux-cluster/2005-June/msg00124.html

Comment 5 Benjamin Kahn 2007-06-07 15:14:55 UTC
This bug has been copied as z-stream (EUS) bug #243146 and must now be
resolved in the current update release; setting the blocker flag.


Comment 8 errata-xmlrpc 2007-11-21 21:14:27 UTC
An advisory has been issued which should address the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0998.html