Description of problem:
This is a clone of bugzilla 236565 with a re-written problem statement for easier searching. There is a race between the GFS lookup code and the VM inode cache reclaim logic that opens a window in which GFS can corrupt its (GFS) inode cache. The occurrence is rare and only happens when the system is under memory pressure such that the VM starts to free its inode cache entries. Depending on who gets the freed memory, the result is unpredictable. In the case where this bug was found (in RHEL5 NFS benchmark runs), the kernel panicked with the following stack backtrace:

[<ffffffff800100b7>] generic_file_buffered_write+0x496/0x6a3
[<ffffffff800641fa>] _spin_unlock_irq+0x9/0xc
[<ffffffff8000e2dd>] current_fs_time+0x3b/0x40
[<ffffffff80062350>] wait_for_completion+0x99/0xa2
[<ffffffff80016476>] __generic_file_aio_write_nolock+0x370/0x3bb
[<ffffffff80012a2f>] poison_obj+0x26/0x2f
[<ffffffff800bba91>] generic_file_aio_write_nolock+0x20/0x6c
[<ffffffff800bbeaa>] generic_file_write_nolock+0x8f/0xa8
[<ffffffff8009d3ee>] autoremove_wake_function+0x0/0x2e
[<ffffffff88641c8a>] :gfs:gfs_trans_begin_i+0x13c/0x1b2
[<ffffffff88634c50>] :gfs:do_write_buf+0x456/0x696
[<ffffffff88634452>] :gfs:walk_vm+0x10e/0x311
[<ffffffff886347fa>] :gfs:do_write_buf+0x0/0x696
[<ffffffff88634701>] :gfs:__gfs_write+0xac/0xc6
[<ffffffff800d3903>] do_readv_writev+0x198/0x295
[<ffffffff88634744>] :gfs:gfs_write+0x0/0x8
[<ffffffff88635ce8>] :gfs:gfs_open+0x12c/0x15e
[<ffffffff884e7709>] :nfsd:nfsd_vfs_write+0xf2/0x2e1
[<ffffffff88635bbc>] :gfs:gfs_open+0x0/0x15e
[<ffffffff8001e7c0>] __dentry_open+0x104/0x1e2
[<ffffffff884e7f89>] :nfsd:nfsd_write+0xb5/0xd5
[<ffffffff884ee778>] :nfsd:nfsd3_proc_write+0xea/0x109
[<ffffffff884e40e9>] :nfsd:nfsd_dispatch+0xd7/0x198
[<ffffffff884154f3>] :sunrpc:svc_process+0x42e/0x6ec
[<ffffffff80063cc1>] __down_read+0x34/0x96
[<ffffffff884e4471>] :nfsd:nfsd+0x0/0x32b
[<ffffffff884e4626>] :nfsd:nfsd+0x1b5/0x32b
[<ffffffff8005d665>] child_rip+0xa/0x11
[<ffffffff884e4471>] :nfsd:nfsd+0x0/0x32b
[<ffffffff884e4471>] :nfsd:nfsd+0x0/0x32b
[<ffffffff8005d65b>] child_rip+0x0/0x11

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
NFSD (which makes frequent calls into the lookup code) and the GFS glock trimming logic (which invokes the inode cache release logic on a regular time interval) are more likely to hit this bug.
Should have said: this happens with all versions of the GFS1 code (GFS2 has not been checked yet). The bug lurks at the end of the lookup code (gfs_lookup and gfs_get_dentry), where the inode glock is released prematurely. This creates a window in the bottom portion of the logic that can allow gfs_iget to update an associated GFS inode structure that has already been freed. Depending on who gets the reused memory, unspecified corruption occurs. In RHEL5's case, it corrupted a TCP buffer head, which ended up overrunning the NFSD kernel stack. An almost identical report was found at: http://www.redhat.com/archives/linux-cluster/2005-June/msg00124.html
This bug has been copied as z-stream (EUS) bug #243146 and must now be resolved in the current update release; setting the blocker flag.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0998.html