Bug 307971

Summary:	GFS: kernel panic when filesystem withdrawn after storage controller reset
Product:	Red Hat Enterprise Linux 5	Reporter:	Corey Marthaler <cmarthal>
Component:	gfs-kmod	Assignee:	Robert Peterson <rpeterso>
Status:	CLOSED WORKSFORME	QA Contact:	Cluster QE <mspqa-list>
Severity:	low	Docs Contact:
Priority:	low
Version:	5.0	CC:	bmr
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2007-12-19 20:28:51 UTC	Type:	---

Description Corey Marthaler 2007-09-26 20:29:38 UTC

Description of problem:
This is a dup of the GFS2 bug 307861.

I started I/O and then killed the devices, and all the nodes panicked. This
was fixed back in RHEL4 at one time (I remember verifying the fix), but the
behavior has now regressed in RHEL5, and in RHEL4 too.

Buffer I/O error on device dm-2, logical block 16086
lost page write due to I/O error on dm-2
Buffer I/O error on device dm-2, logical block 16087
lost page write due to I/O error on dm-2
Buffer I/O error on device dm-2, logical block 16088
lost page write due to I/O error on dm-2
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
GFS: fsid=TAFT_CLUSTER:gfs.0: fatal: I/O error
GFS: fsid=TAFT_CLUSTER:gfs.0:   block = 1272819
GFS: fsid=TAFT_CLUSTER:gfs.0:   function = gfs_logbh_wait
GFS: fsid=TAFT_CLUSTER:gfs.0:   file = /builddir/build/BUILD/gfs-kmod-0.1.19/_kmod_build_/src/g5
GFS: fsid=TAFT_CLUSTER:gfs.0:   time = 1190836614
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at fs/locks.c:1991
invalid opcode: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:06.0/0000:08:00.2/0000:0b:02.0/host1/rport-1:0-1/te
CPU 0
Modules linked in: gfs(U) lock_dlm gfs2 dlm configfs autofs4 hidp rfcomm l2cap
bluetooth sunrpcd
Pid: 8391, comm: doio Not tainted 2.6.18-48.el5 #1
RIP: 0010:[<ffffffff8002703f>]  [<ffffffff8002703f>] locks_remove_flock+0xe4/0x122
RSP: 0018:ffff81020a577db8  EFLAGS: 00010246
RAX: ffff81021ff5f6b8 RBX: ffff810208cfd7e0 RCX: ffff81020a577db8
RDX: 0000000000000000 RSI: ffff81020a577db8 RDI: ffffffff802fde80
RBP: ffff81021a53c180 R08: 0000000000000000 R09: 0000000000000000
R10: ffff81020a577db8 R11: 00000000000000b0 R12: ffff810208cfd6e0
R13: ffff810208cfd6e0 R14: ffff810107770480 R15: ffff81020adad8e8
FS:  0000000000000000(0000) GS:ffffffff80396000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003d72095770 CR3: 0000000000201000 CR4: 00000000000006e0
Process doio (pid: 8391, threadinfo ffff81020a576000, task ffff810219da1080)
Stack:  0000000000000000 0000000000000000 0000000000000000 0000000000000000
 0000000000000000 0000000000000000 00000000000020c7 0000000000000000
 0000000000000000 0000000000000000 ffff81021a53c180 0000000000000202
Call Trace:
 [<ffffffff800122c4>] __fput+0x94/0x198
 [<ffffffff800237b3>] filp_close+0x5c/0x64
 [<ffffffff8003851e>] put_files_struct+0x6c/0xc3
 [<ffffffff80014f70>] do_exit+0x2d2/0x89d
 [<ffffffff80046e60>] cpuset_exit+0x0/0x6c
 [<ffffffff8005b28d>] tracesys+0xd5/0xe0


Code: 0f 0b 68 8a de 28 80 c2 c7 07 48 89 c3 48 8b 03 48 85 c0 75
RIP  [<ffffffff8002703f>] locks_remove_flock+0xe4/0x122
 RSP <ffff81020a577db8>
 <0>Kernel panic - not syncing: Fatal exception


Version-Release number of selected component (if applicable):
2.6.18-48.el5

Comment 1 Robert Peterson 2007-09-26 20:42:50 UTC

Do you remember the bugzilla number of the RHEL4 bug where this was
previously fixed?

Comment 2 Corey Marthaler 2007-09-26 20:52:02 UTC

No, I've been looking for it but can't find it. I wonder if it was in the old
Sistina bz system, because it was filed a long time ago, even though it was
fixed not that long ago.

Comment 3 Corey Marthaler 2007-09-26 21:21:26 UTC

This does appear to be limited to the flock case (for which, I just found out,
a bug may already exist): I attempted this scenario with just a dd writing
to the gfs, and the withdraw worked without the panic.

Comment 4 Robert Peterson 2007-10-09 21:21:36 UTC

I tried for a long time to recreate this problem on both gfs and
gfs2 by mounting the file system, taking flocks, then doing
"gfs_tool withdraw /mnt/gfs2" for gfs1, or
'echo "1" > /sys/fs/gfs2/sdb1/withdraw' for gfs2.

I tried a couple ways of taking out flocks:
(1) xiogen -t 10k -T 10k -o -m random -F 50k:file | xdoio -n 6 -K
(2) genesis -S RANDSEED -i RUN_TIME -n 100 -d 100 -p 10 -L flock -s 665600
(3) locktests (a tool I got for a bugzilla a while back).

I tried a local disk (/dev/sdb1) and a SAN through LVM2 from multiple
nodes, but it didn't recreate.  The withdraw happened as expected, but
there was no subsequent panic.

I'm looking for a good way to recreate this.  I haven't actually written
my own flock test program but that's probably the next step.  I suppose
I could also try pulling the FC cable from my box rather than
doing a controlled withdraw from the command line.
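
A minimal sketch of what such a flock test program might look like, in
Python. This is not one of the tools listed above; the function names, the
file names and the timing are all hypothetical. The idea is only to have
several processes hold flocks on files on the GFS mount and keep writing, so
that the withdraw (or cable pull) happens while the locks are held and the
processes then exit without unlocking, which is the exit path shown in the
call trace in the description (do_exit ... __fput, locks_remove_flock).

```python
import fcntl
import os
import time


def hold_flock_and_write(path, seconds):
    """Take an exclusive flock on `path` and keep writing for `seconds`.

    The lock is deliberately never released with LOCK_UN, so it is only
    dropped when the descriptor is closed at process exit.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.flock(fd, fcntl.LOCK_EX)
    deadline = time.time() + seconds
    while time.time() < deadline:
        os.write(fd, b"x" * 4096)
        os.fsync(fd)  # push the data out so I/O errors surface quickly
        time.sleep(0.01)
    return fd


def spawn_lockers(directory, count, seconds):
    """Fork `count` children, each locking its own file in `directory`.

    Returns the children's exit codes (0 = ran to completion,
    1 = stopped early, e.g. on an I/O error).
    """
    pids = []
    for i in range(count):
        path = os.path.join(directory, "flock.%d" % i)  # hypothetical naming
        pid = os.fork()
        if pid == 0:
            status = 1
            try:
                hold_flock_and_write(path, seconds)
                status = 0
            except OSError:
                pass  # I/O errors are expected once the filesystem is withdrawn
            finally:
                os._exit(status)  # exit with the flock still held
        pids.append(pid)
    # Assumes the children exit normally rather than being killed by a signal.
    return [os.WEXITSTATUS(os.waitpid(pid, 0)[1]) for pid in pids]
```

It could be run as, for example, spawn_lockers("/mnt/gfs", 6, 600) (the mount
point is an assumption), with the withdraw or device failure triggered from
another shell while the children are still writing.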

Comment 6 Corey Marthaler 2007-12-19 20:28:51 UTC

I was unable to reproduce this issue; will reopen if seen again. It does seem
odd, though, that this stack is also reported in other bz's 253347 and
253948, hmmm...