155597 – filesystem consistency error after recovery

Summary: filesystem consistency error after recovery

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	gfs
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Ben Marzinski
QA Contact:	GFS Bugs
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	156973
Depends On:
Blocks:

Reported:	2005-04-21 18:01 UTC by Corey Marthaler
Modified:	2010-01-12 03:04 UTC
CC List:	0 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-02-01 17:04:02 UTC
Embargoed:

Description Corey Marthaler 2005-04-21 18:01:33 UTC

Description of problem:
Before everyone asks, this was seen with:

- the "good" MSA1000 hardware
- no SCSI errors
- no I/O errors
- no "Info fld=0x0, Current sda: sense key No Sense" message
- all qla2300 fc drivers
- no errors or beeping on the RAID 

Three-node cluster (link-10, link-12, link-08) with three GFS filesystems. After
the 5th iteration of revolver, link-12 and link-10 were shot. When they were
brought back up and were mounting their filesystems, link-10 hit this bug.

Apr 21 09:53:29 link-10 ccsd[3916]: cluster.conf (cluster name = link-cluster,
version = 1) found.
Apr 21 09:53:29 link-10 ccsd[3916]: Remote copy of cluster.conf is from quorate
node.
Apr 21 09:53:29 link-10 ccsd[3916]:  Local version # : 1
Apr 21 09:53:29 link-10 ccsd[3916]:  Remote version #: 1
Apr 21 09:53:29 link-10 kernel: CMAN: Waiting to join or form a Linux-cluster
Apr 21 09:53:30 link-10 ccsd[3916]: Connected to cluster infrastruture via:
CMAN/SM Plugin v1.1.2
Apr 21 09:53:30 link-10 ccsd[3916]: Initial status:: Inquorate
Apr 21 09:53:32 link-10 kernel: CMAN: sending membership request
Apr 21 09:53:33 link-10 kernel: CMAN: got node link-12.lab.msp.redhat.com
Apr 21 09:53:33 link-10 kernel: CMAN: got node link-08
Apr 21 09:53:33 link-10 kernel: CMAN: quorum regained, resuming activity
Apr 21 09:53:33 link-10 ccsd[3916]: Cluster is quorate.  Allowing connections.
Apr 21 09:53:45 link-10 clvmd: Cluster LVM daemon started - connected to CMAN
Apr 21 09:53:57 link-10 kernel: GFS: Trying to join cluster "lock_dlm",
"link-cluster:gfs0"
Apr 21 09:53:59 link-10 kernel: GFS: fsid=link-cluster:gfs0.1: Joined cluster.
Now mounting FS...
Apr 21 09:53:59 link-10 kernel: GFS: fsid=link-cluster:gfs0.1: jid=1: Trying to
acquire journal lock...
Apr 21 09:53:59 link-10 kernel: GFS: fsid=link-cluster:gfs0.1: jid=1: Looking at
journal...
Apr 21 09:53:59 link-10 kernel: GFS: fsid=link-cluster:gfs0.1: jid=1: Done
Apr 21 09:53:59 link-10 kernel: GFS: fsid=link-cluster:gfs0.1: Scanning for log
elements...
Apr 21 09:53:59 link-10 kernel: GFS: fsid=link-cluster:gfs0.1: Found 10 unlinked
inodes
Apr 21 09:53:59 link-10 kernel: GFS: fsid=link-cluster:gfs0.1: Found quota
changes for 2 IDs
Apr 21 09:53:59 link-10 kernel: GFS: fsid=link-cluster:gfs0.1: Done
Apr 21 09:54:00 link-10 kernel: GFS: Trying to join cluster "lock_dlm",
"link-cluster:gfs1"
Apr 21 09:54:02 link-10 kernel: GFS: fsid=link-cluster:gfs1.1: Joined cluster.
Now mounting FS...
Apr 21 09:54:02 link-10 kernel: GFS: fsid=link-cluster:gfs1.1: jid=1: Trying to
acquire journal lock...
Apr 21 09:54:02 link-10 kernel: GFS: fsid=link-cluster:gfs1.1: jid=1: Looking at
journal...
Apr 21 09:54:02 link-10 kernel: GFS: fsid=link-cluster:gfs1.1: jid=1: Done
Apr 21 09:54:02 link-10 kernel: GFS: fsid=link-cluster:gfs1.1: Scanning for log
elements...
Apr 21 09:54:02 link-10 kernel: GFS: fsid=link-cluster:gfs1.1: fatal: filesystem
consistency error
Apr 21 09:54:02 link-10 kernel: GFS: fsid=link-cluster:gfs1.1:   function =
gfs_increment_blkno
Apr 21 09:54:02 link-10 kernel: GFS: fsid=link-cluster:gfs1.1:   file =
/usr/src/build/553783-i686/BUILD/smp/src/gfs/recovery.c, line = 326
Apr 21 09:54:02 link-10 kernel: GFS: fsid=link-cluster:gfs1.1:   time = 1114091642
Apr 21 09:54:02 link-10 kernel: GFS: fsid=link-cluster:gfs1.1: about to withdraw
from the cluster
Apr 21 09:54:02 link-10 kernel: GFS: fsid=link-cluster:gfs1.1: waiting for
outstanding I/O
Apr 21 09:54:02 link-10 kernel: ------------[ cut here ]------------
Apr 21 09:54:02 link-10 kernel: kernel BUG at
/usr/src/build/553783-i686/BUILD/smp/src/gfs/lm.c:190!
Apr 21 09:54:02 link-10 kernel: invalid operand: 0000 [#1]
Apr 21 09:54:02 link-10 kernel: SMP
Apr 21 09:54:02 link-10 kernel: Modules linked in: gnbd(U) lock_nolock(U) gfs(U)
lock_dlm(U) dlm(U) cman(U) lock_harness(U) md5 ipv6 parport_pc lp parport
autofs4 sunrpc button battery ac uhci_hcd ehci_hcd e1000 floppy dm_snapshot
dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Apr 21 09:54:02 link-10 kernel: CPU:    0
Apr 21 09:54:02 link-10 kernel: EIP:    0060:[<e03d9f4f>]    Not tainted VLI
Apr 21 09:54:02 link-10 kernel: EFLAGS: 00010202   (2.6.9-6.37.ELsmp)
Apr 21 09:54:02 link-10 kernel: EIP is at gfs_lm_withdraw+0x51/0xc0 [gfs]
Apr 21 09:54:02 link-10 kernel: eax: 0000003b   ebx: e042c728   ecx: da448a34  
edx: e03f5915
Apr 21 09:54:02 link-10 kernel: esi: e0408000   edi: da448a94   ebp: 00000000  
esp: da448a48
Apr 21 09:54:02 link-10 kernel: ds: 007b   es: 007b   ss: 0068
Apr 21 09:54:02 link-10 kernel: Process mount (pid: 6260, threadinfo=da448000
task=d6311430)
Apr 21 09:54:02 link-10 kernel: Stack: 00000000 da448c48 e03f2377 e0408000
e03f8ca8 e042c728 e042c728 e03f35fa
Apr 21 09:54:02 link-10 kernel:        e042c728 e03f7cb0 00000146 e042c728
4267b07a e03ec40b e03f7cb0 00000146
Apr 21 09:54:02 link-10 kernel:        d9eae0f4 e042c574 e0408000 01161970
00000008 00000000 00000000 00000320
Apr 21 09:54:02 link-10 kernel: Call Trace:
Apr 21 09:54:02 link-10 kernel:  [<e03f2377>] gfs_consist_i+0x24/0x28 [gfs]
Apr 21 09:54:02 link-10 kernel:  [<e03ec40b>] gfs_increment_blkno+0x184/0x1da [gfs]
Apr 21 09:54:02 link-10 kernel:  [<e03ec7b1>] foreach_descriptor+0x350/0x365 [gfs]
Apr 21 09:54:02 link-10 kernel:  [<e03ecfe8>] gfs_recover_dump+0xd3/0x156 [gfs]
Apr 21 09:54:02 link-10 kernel:  [<e03f0823>] gfs_make_fs_rw+0xc3/0x11b [gfs]
Apr 21 09:54:02 link-10 kernel:  [<e03e618f>] fill_super+0x9b7/0xf3f [gfs]
Apr 21 09:54:02 link-10 kernel:  [<c01446f4>] pagevec_lookup+0x17/0x1d
Apr 21 09:54:02 link-10 kernel:  [<c015c336>] set_blocksize+0x77/0x7c
Apr 21 09:54:02 link-10 kernel:  [<e03e6846>] gfs_get_sb+0x114/0x159 [gfs]
Apr 21 09:54:02 link-10 kernel:  [<c015bfc3>] do_kern_mount+0x8a/0x13d
Apr 21 09:54:02 link-10 kernel:  [<c016ea23>] do_new_mount+0x61/0x90
Apr 21 09:54:02 link-10 kernel:  [<c016f070>] do_mount+0x178/0x190
Apr 21 09:54:02 link-10 kernel:  [<c02c7e28>] common_interrupt+0x18/0x20
Apr 21 09:54:02 link-10 kernel:  [<c016ee37>] exact_copy_from_user+0x20/0x4f
Apr 21 09:54:02 link-10 kernel:  [<c016f3c7>] sys_mount+0x91/0x108
Apr 21 09:54:02 link-10 kernel:  [<c02c746b>] syscall_call+0x7/0xb
Apr 21 09:54:02 link-10 kernel: Code: ff 74 24 14 e8 ae 7a d4 df 53 68 e3 58 3f
e0 e8 92 7a d4 df 53 68 15 59 3f e0 e8 87 7a d4 df 83 c4 18 83 be 34 02 00 00 00
74 08 <0f> 0b be 00 2c 58 3f e0 8b 86 08 47 02 00 85 c0 74 1c ba 02 00
Apr 21 09:54:02 link-10 kernel:  <0>Fatal exception: panic in 5 seconds
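The `time = 1114091642` field in the withdraw message above appears to be a Unix epoch value in seconds. As a quick cross-check against the syslog timestamps, it can be converted with a few lines of Python (a minimal sketch; the variable name `stamp` is only illustrative):

```python
from datetime import datetime, timezone

# Value copied from the "time = ..." line of the GFS withdraw message.
stamp = datetime.fromtimestamp(1114091642, tz=timezone.utc)

print(stamp.isoformat())  # 2005-04-21T13:54:02+00:00
```

That is 13:54:02 UTC on 2005-04-21, which lines up with the 09:54:02 syslog entries if the node's clock runs four hours behind UTC.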


Version-Release number of selected component (if applicable):
GFS 2.6.9-28.5 (built Apr 11 2005 15:30:01) installed

How reproducible:
seen it once so far

Comment 1 Ken Preslan 2005-04-21 18:20:32 UTC

This looks like a bug I discovered when working on the new logging code.  Log space is reclaimed when 
the FS's in-core data indicates that the space is no longer needed.  It should be reclaimed only when 
the on-disk data says it's no longer needed.
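
The distinction described here can be illustrated with a toy model. This is only a sketch under assumptions, not GFS code: `ToyLog`, its fields, and both methods are hypothetical names, and positions are plain increasing block counters rather than real journal addresses.

```python
from dataclasses import dataclass


@dataclass
class ToyLog:
    """Toy circular log; all positions are monotonically increasing counters."""
    size: int          # total number of blocks in the log
    head: int          # next block position to be written
    incore_tail: int   # oldest block the in-memory state still needs
    ondisk_tail: int   # oldest block the last on-disk log header says is needed

    def free_by_incore(self) -> int:
        # Risky: trusts in-memory state that a crash would lose.
        return self.size - (self.head - self.incore_tail)

    def free_by_ondisk(self) -> int:
        # Conservative: only space the on-disk header has released is reusable.
        return self.size - (self.head - self.ondisk_tail)


log = ToyLog(size=8, head=6, incore_tail=5, ondisk_tail=2)
print(log.free_by_incore(), log.free_by_ondisk())  # 7 4
```

In this example the in-core view reports 7 free blocks while the on-disk view reports 4. Reusing the extra 3 blocks would overwrite data that a node replaying the journal after a crash would still expect to find.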

Comment 2 Ben Marzinski 2005-04-29 20:58:22 UTC

GFS can no longer modify the part of the on-disk log that has already been
written out to its in-place location until the on-disk log head points to the
new tail. However, the code that I wrote to do this does memory allocations while
holding the log_lock. This could cause lockups in low-memory situations.

Comment 3 Ben Marzinski 2005-05-09 19:23:08 UTC

OK, this is not the bug that Ken had discovered.  I'm still trying to recreate
it. All I really know is the symptom that we've seen: when GFS is replaying the
journal, it finds a log header that claims to be the start of a new transaction.
However, according to the previous log descriptor, it isn't.
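
The symptom can be sketched as a check over a simplified journal. This is a toy model under stated assumptions, not the GFS on-disk format or the code in recovery.c: `Block`, its fields, and `find_inconsistency` are hypothetical, and a descriptor here simply announces how many data blocks its transaction still covers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    kind: str                 # "header", "descriptor", or "data"
    starts_txn: bool = False  # header only: claims a new transaction starts here
    length: int = 0           # descriptor only: data blocks in this transaction


def find_inconsistency(blocks: List[Block]) -> Optional[int]:
    """Return the index of the first header that claims to start a new
    transaction while the previous descriptor still expects data blocks."""
    remaining = 0
    for i, blk in enumerate(blocks):
        if blk.kind == "descriptor":
            remaining = blk.length
        elif blk.kind == "data":
            remaining = max(remaining - 1, 0)
        elif blk.kind == "header" and blk.starts_txn and remaining > 0:
            return i
    return None


good = [Block("header", starts_txn=True), Block("descriptor", length=2),
        Block("data"), Block("data"), Block("header", starts_txn=True)]
bad = [Block("header", starts_txn=True), Block("descriptor", length=3),
       Block("data"), Block("header", starts_txn=True), Block("data")]

print(find_inconsistency(good), find_inconsistency(bad))  # None 3
```

In the `bad` sequence the descriptor announces three data blocks, but a header claiming a new transaction appears after only one of them, which is the kind of disagreement described above.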

Comment 4 Ben Marzinski 2005-05-09 19:27:46 UTC

*** Bug 156973 has been marked as a duplicate of this bug. ***

Comment 5 Corey Marthaler 2005-05-10 16:38:34 UTC

Putting back into assigned state.

Comment 6 Ben Marzinski 2005-05-31 23:19:54 UTC

I have never been able to recreate this bug. QA has seen it twice. If anyone
sees this bug again, the most useful thing that they can do is to send me a copy
of the entire corrupted journal.

Comment 7 Ben Marzinski 2006-02-01 17:04:02 UTC

I have never seen this. No one has seen this in over half a year. If someone
does see it again, they should reopen this bug.
