Bug 164324

Summary:	gfs oops in gfs_wipe_buffers
Product:	[Retired] Red Hat Cluster Suite	Reporter:	Curtis Zinzilieta <curtisz>
Component:	gfs	Assignee:	Ben Marzinski <bmarzins>
Status:	CLOSED ERRATA	QA Contact:	GFS Bugs <gfs-bugs>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4	CC:	djansa, rkenna
Target Milestone:	---
Target Release:	---
Hardware:	i686
OS:	Linux
Whiteboard:
Fixed In Version:	RHBA-2005-740	Doc Type:	Bug Fix
Last Closed:	2005-10-07 16:57:08 UTC	Type:	---
Bug Blocks:	165449

Description Curtis Zinzilieta 2005-07-26 22:02:35 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050512 Red Hat/1.0.4-1.4.1 Firefox/1.0.4

Description of problem:
While running regression tests on a 5-node cluster, one of the nodes oopsed with:

Jul 26 14:14:17 tank-01 kernel: Unable to handle kernel paging request at virtual address 00200214
Jul 26 14:14:17 tank-01 kernel:  printing eip:
Jul 26 14:14:17 tank-01 kernel: f8cf4d02
Jul 26 14:14:17 tank-01 kernel: *pde = 00004001
Jul 26 14:14:17 tank-01 kernel: Oops: 0000 [#1]
Jul 26 14:14:17 tank-01 kernel: SMP
Jul 26 14:14:17 tank-01 kernel: Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) lpfc qla2300 qla2xxx parport_pc lp parport autofs4 i2c_dev i2c_core dlm(U) cman(U) md5 ipv6 sunrpc button battery ac uhci_hcd hw_random e1000 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod scsi_transport_fc sd_mod scsi_mod
Jul 26 14:14:17 tank-01 kernel: CPU:    2
Jul 26 14:14:17 tank-01 kernel: EIP:    0060:[<f8cf4d02>]    Tainted: GF     VLI
Jul 26 14:14:17 tank-01 kernel: EFLAGS: 00010206   (2.6.9-11.28.ELsmp)
Jul 26 14:14:17 tank-01 kernel: EIP is at depend_sync_old+0x40/0x59 [gfs]
Jul 26 14:14:17 tank-01 kernel: eax: 04549df9   ebx: f8c7724c   ecx: f7fff800   edx: f8c7724c
Jul 26 14:14:17 tank-01 kernel: esi: 0000ea60   edi: f8c77000   ebp: 002001f8   esp: f68acea4
Jul 26 14:14:17 tank-01 kernel: ds: 007b   es: 007b   ss: 0068
Jul 26 14:14:17 tank-01 kernel: Process gfs_inoded (pid: 4265, threadinfo=f68ac000 task=f6654630)
Jul 26 14:14:17 tank-01 kernel: Stack: c330e600 f8c77000 ed0aa258 ed0aa22c f8c77000 00000000 f8cce3a8 00000001
Jul 26 14:14:17 tank-01 kernel:        eb96ec48 c330e600 c5246dac cb3c3590 c5246dac c330e600 c330e600 f8c77000
Jul 26 14:14:17 tank-01 kernel:        f8cf6aec 004ed57a 00000000 00000001 00000000 00000000 00000000 c5246dac
Jul 26 14:14:17 tank-01 kernel: Call Trace:
Jul 26 14:14:17 tank-01 kernel:  [<f8cce3a8>] gfs_wipe_buffers+0x2a6/0x2ae [gfs]
Jul 26 14:14:17 tank-01 kernel:  [<f8cf6aec>] gfs_difree+0x39/0x3f [gfs]
Jul 26 14:14:17 tank-01 kernel:  [<f8cdb170>] dinode_dealloc+0x113/0x164 [gfs]
Jul 26 14:14:17 tank-01 kernel:  [<f8cdb351>] inode_dealloc+0x190/0x1d6 [gfs]
Jul 26 14:14:17 tank-01 kernel:  [<f8cd8093>] gfs_glock_dq+0x111/0x11f [gfs]
Jul 26 14:14:17 tank-01 kernel:  [<f8cdb3e8>] inode_dealloc_init+0x51/0x64 [gfs]
Jul 26 14:14:17 tank-01 kernel:  [<f8cf96a6>] .text.lock.unlinked+0x1a/0x74 [gfs]
Jul 26 14:14:17 tank-01 kernel:  [<f8cf9602>] gfs_unlinked_dealloc+0x2b/0xb5 [gfs]
Jul 26 14:14:17 tank-01 kernel:  [<f8ccc209>] gfs_inoded+0x3a/0xbc [gfs]
Jul 26 14:14:17 tank-01 kernel:  [<f8ccc1cf>] gfs_inoded+0x0/0xbc [gfs]
Jul 26 14:14:17 tank-01 kernel:  [<c01041f1>] kernel_thread_helper+0x5/0xb
Jul 26 14:14:17 tank-01 kernel: Code: 02 00 00 8b a8 20 01 00 00 89 d8 e8 2a 8d 5d c7 8b b7 78 02 00 00 83 ed 08 89 d8 e8 8b 8d 5d c7 a1 20 e9 31 c0 69 f6 e8 03 00 00 <03> 75 1c 39 f0 78 0b 89 ea 89 f8 e8 f3 fe ff ff eb bd 5b 5e 5b
Jul 26 14:14:17 tank-01 kernel:  <0>Fatal exception: panic in 5 seconds


Version-Release number of selected component (if applicable):
GFS-kernel-smp-2.6.9-36.2

How reproducible:
Didn't try

Steps to Reproduce:
Currently trying to reproduce.

Additional info:

Comment 1 Ben Marzinski 2005-08-29 22:35:59 UTC

Please let me know if this is reproducible.

Comment 2 Ben Marzinski 2005-08-29 22:48:52 UTC

*** Bug 166293 has been marked as a duplicate of this bug. ***

Comment 3 Ben Marzinski 2005-08-29 22:50:19 UTC

O.k. I guess it is reproducible.

Comment 4 Ben Marzinski 2005-09-16 14:58:29 UTC

O.k. I found a bug in the depend_sync_old code that could definitely cause this
error.  The only problem is that I'm not totally sure that it *IS* causing this
error, and I'm even fuzzier on how it would cause 166293. My best guess is that
the stack trace for 166293 is incomplete, and that it is exactly the same bug.

Here's the dilemma.  In depend_sync_old, if it takes longer than "depend_secs"
(a tunable parameter, set to 60 seconds by default) to sync all the old
dependent inodes to disk, bad things happen, and you end up overwriting the
resource group descriptor (rgd) structure.  If you manage to trash this
structure without crashing, then on the next loop this oops is exactly what you
would see. This explains why we saw it with gnbd: using gnbd, it takes longer
to sync the inodes to disk.
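
To illustrate the general class of failure described above, here is a minimal userspace sketch. It is not the GFS code: demo_dep, demo_rgd and the demo_* helpers are hypothetical names, and the assumption that the dependent entries hang off a list head embedded in the descriptor is made only for the sake of the example. It shows that if a list walk steps past the last real entry and treats the embedded head as one more entry, the resulting "entry" pointer lands inside the descriptor itself, so any write through it overwrites descriptor fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal userspace copy of the kernel's doubly linked list idiom. */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-ins; these are NOT the real GFS structures. */
struct demo_dep {
	unsigned long id;
	struct list_head list;     /* links the entry into demo_rgd.depend */
	unsigned long timestamp;
};

struct demo_rgd {
	unsigned long flags;
	unsigned long free_blocks;
	struct list_head depend;   /* list head embedded in the descriptor */
	unsigned long other_state;
};

static void demo_list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

static void demo_list_add_tail(struct list_head *entry, struct list_head *head)
{
	assert(entry != NULL && head != NULL);
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Returns 1 when p points somewhere inside *rgd, 0 otherwise. */
static int demo_points_into_rgd(const struct demo_rgd *rgd, const void *p)
{
	uintptr_t lo = (uintptr_t)rgd;
	uintptr_t addr = (uintptr_t)p;

	return addr >= lo && addr < lo + sizeof(*rgd);
}

/*
 * Unsafe step: converts node->next to an entry pointer without first
 * checking whether node->next is the list head. The result is only
 * computed here, never dereferenced.
 */
static struct demo_dep *demo_entry_after(struct list_head *node)
{
	return container_of(node->next, struct demo_dep, list);
}
```

With one demo_dep queued on demo_rgd.depend, demo_entry_after(&rgd.depend) returns the real entry, while demo_entry_after(&dep.list) returns a pointer for which demo_points_into_rgd() reports 1. A safe walk compares node->next against the head before converting it.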

I knocked depend_secs down to 0, and I can hit this bug within minutes, every
time.  The problem is that I always crash while mucking with the structure.
However, I don't think that you must always crash (i.e. when you access what
you think should be a pointer, it is actually a pointer in the rgd structure;
there's no place where the memory that you access will never have a valid
value).  I think the reason that I always crash early has something to do with
knocking depend_secs down, so that other parts of the rgd don't have time to be
set to valid values.
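
As a rough illustration of why setting depend_secs to 0 makes the timeout path trigger immediately, here is a small sketch of a tick-based deadline check. The helper name demo_deadline_passed and its parameters are hypothetical, and the signed-difference comparison is simply the usual wraparound-safe way of comparing tick counters; this is an assumption about the shape of the check, not a copy of the GFS code. For what it is worth, the register dump shows esi: 0000ea60, which is 60000 and would be consistent with 60 seconds at 1000 ticks per second, though that reading is an inference and not something stated in the report.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical helper, not GFS code: returns 1 once `now` has reached
 * start + secs * hz, using a signed difference so that the comparison
 * still works when the tick counter wraps around.
 */
static int demo_deadline_passed(unsigned long now, unsigned long start,
				unsigned long secs, unsigned long hz)
{
	unsigned long deadline = start + secs * hz;

	return (long)(now - deadline) >= 0;
}

/* Small usage example covering the cases discussed in the comment. */
static void demo_self_check(void)
{
	/* secs == 0: the deadline has passed the moment the clock starts. */
	assert(demo_deadline_passed(1000UL, 1000UL, 0UL, 1000UL) == 1);

	/* Default-like setting: 60 seconds at an assumed 1000 ticks/second. */
	assert(demo_deadline_passed(60999UL, 1000UL, 60UL, 1000UL) == 0);
	assert(demo_deadline_passed(61000UL, 1000UL, 60UL, 1000UL) == 1);

	/* The deadline wraps past ULONG_MAX and the check still holds. */
	assert(demo_deadline_passed(ULONG_MAX, ULONG_MAX - 10UL, 1UL, 1000UL) == 0);
	assert(demo_deadline_passed(989UL, ULONG_MAX - 10UL, 1UL, 1000UL) == 1);
}
```

With a timeout of 0 every pass through such a check takes the "too slow" branch, which matches the observation that the bug reproduces within minutes once the tunable is lowered.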

If we could reproduce the original oops reliably, we could verify a fix. But I
can't see another way for this error to happen, and the depend_sync_old bug
could definitely cause it.

Comment 6 Ben Marzinski 2005-09-19 18:58:30 UTC

Unless someone can recreate this problem with my change in, I'm calling this bug
fixed.

Comment 7 Red Hat Bugzilla 2005-10-07 16:57:08 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-740.html