Description of problem:
Before everyone asks, this was seen with:
- the "good" MSA1000 hardware
- no SCSI errors
- no I/O errors
- no "Info fld=0x0, Current sda: sense key No Sense" message
- all qla2300 fc drivers
- no errors or beeping on the RAID

Three node cluster (link-10,link-12,link-08) and three GFS, after the 5th iteration of revolver, link-12 and link-10 were shot. When brought back and mounting filesystems, link-10 hit this bug.

Apr 21 09:53:29 link-10 ccsd[3916]: cluster.conf (cluster name = link-cluster, version = 1) found.
Apr 21 09:53:29 link-10 ccsd[3916]: Remote copy of cluster.conf is from quorate node.
Apr 21 09:53:29 link-10 ccsd[3916]: Local version # : 1
Apr 21 09:53:29 link-10 ccsd[3916]: Remote version #: 1
Apr 21 09:53:29 link-10 kernel: CMAN: Waiting to join or form a Linux-cluster
Apr 21 09:53:30 link-10 ccsd[3916]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.2
Apr 21 09:53:30 link-10 ccsd[3916]: Initial status:: Inquorate
Apr 21 09:53:32 link-10 kernel: CMAN: sending membership request
Apr 21 09:53:33 link-10 kernel: CMAN: got node link-12.lab.msp.redhat.com
Apr 21 09:53:33 link-10 kernel: CMAN: got node link-08
Apr 21 09:53:33 link-10 kernel: CMAN: quorum regained, resuming activity
Apr 21 09:53:33 link-10 ccsd[3916]: Cluster is quorate. Allowing connections.
Apr 21 09:53:45 link-10 clvmd: Cluster LVM daemon started - connected to CMAN
Apr 21 09:53:57 link-10 kernel: GFS: Trying to join cluster "lock_dlm", "link-cluster:gfs0"
Apr 21 09:53:59 link-10 kernel: GFS: fsid=link-cluster:gfs0.1: Joined cluster. Now mounting FS...
Apr 21 09:53:59 link-10 kernel: GFS: fsid=link-cluster:gfs0.1: jid=1: Trying to acquire journal lock...
Apr 21 09:53:59 link-10 kernel: GFS: fsid=link-cluster:gfs0.1: jid=1: Looking at journal...
Apr 21 09:53:59 link-10 kernel: GFS: fsid=link-cluster:gfs0.1: jid=1: Done
Apr 21 09:53:59 link-10 kernel: GFS: fsid=link-cluster:gfs0.1: Scanning for log elements...
Apr 21 09:53:59 link-10 kernel: GFS: fsid=link-cluster:gfs0.1: Found 10 unlinked inodes
Apr 21 09:53:59 link-10 kernel: GFS: fsid=link-cluster:gfs0.1: Found quota changes for 2 IDs
Apr 21 09:53:59 link-10 kernel: GFS: fsid=link-cluster:gfs0.1: Done
Apr 21 09:54:00 link-10 kernel: GFS: Trying to join cluster "lock_dlm", "link-cluster:gfs1"
Apr 21 09:54:02 link-10 kernel: GFS: fsid=link-cluster:gfs1.1: Joined cluster. Now mounting FS...
Apr 21 09:54:02 link-10 kernel: GFS: fsid=link-cluster:gfs1.1: jid=1: Trying to acquire journal lock...
Apr 21 09:54:02 link-10 kernel: GFS: fsid=link-cluster:gfs1.1: jid=1: Looking at journal...
Apr 21 09:54:02 link-10 kernel: GFS: fsid=link-cluster:gfs1.1: jid=1: Done
Apr 21 09:54:02 link-10 kernel: GFS: fsid=link-cluster:gfs1.1: Scanning for log elements...
Apr 21 09:54:02 link-10 kernel: GFS: fsid=link-cluster:gfs1.1: fatal: filesystem consistency error
Apr 21 09:54:02 link-10 kernel: GFS: fsid=link-cluster:gfs1.1: function = gfs_increment_blkno
Apr 21 09:54:02 link-10 kernel: GFS: fsid=link-cluster:gfs1.1: file = /usr/src/build/553783-i686/BUILD/smp/src/gfs/recovery.c, line = 326
Apr 21 09:54:02 link-10 kernel: GFS: fsid=link-cluster:gfs1.1: time = 1114091642
Apr 21 09:54:02 link-10 kernel: GFS: fsid=link-cluster:gfs1.1: about to withdraw from the cluster
Apr 21 09:54:02 link-10 kernel: GFS: fsid=link-cluster:gfs1.1: waiting for outstanding I/O
Apr 21 09:54:02 link-10 kernel: ------------[ cut here ]------------
Apr 21 09:54:02 link-10 kernel: kernel BUG at /usr/src/build/553783-i686/BUILD/smp/src/gfs/lm.c:190!
Apr 21 09:54:02 link-10 kernel: invalid operand: 0000 [#1]
Apr 21 09:54:02 link-10 kernel: SMP
Apr 21 09:54:02 link-10 kernel: Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_dlm(U) dlm(U) cman(U) lock_harness(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc button battery ac uhci_hcd ehci_hcd e1000 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Apr 21 09:54:02 link-10 kernel: CPU: 0
Apr 21 09:54:02 link-10 kernel: EIP: 0060:[<e03d9f4f>] Not tainted VLI
Apr 21 09:54:02 link-10 kernel: EFLAGS: 00010202 (2.6.9-6.37.ELsmp)
Apr 21 09:54:02 link-10 kernel: EIP is at gfs_lm_withdraw+0x51/0xc0 [gfs]
Apr 21 09:54:02 link-10 kernel: eax: 0000003b ebx: e042c728 ecx: da448a34 edx: e03f5915
Apr 21 09:54:02 link-10 kernel: esi: e0408000 edi: da448a94 ebp: 00000000 esp: da448a48
Apr 21 09:54:02 link-10 kernel: ds: 007b es: 007b ss: 0068
Apr 21 09:54:02 link-10 kernel: Process mount (pid: 6260, threadinfo=da448000 task=d6311430)
Apr 21 09:54:02 link-10 kernel: Stack: 00000000 da448c48 e03f2377 e0408000 e03f8ca8 e042c728 e042c728 e03f35fa
Apr 21 09:54:02 link-10 kernel: e042c728 e03f7cb0 00000146 e042c728 4267b07a e03ec40b e03f7cb0 00000146
Apr 21 09:54:02 link-10 kernel: d9eae0f4 e042c574 e0408000 01161970 00000008 00000000 00000000 00000320
Apr 21 09:54:02 link-10 kernel: Call Trace:
Apr 21 09:54:02 link-10 kernel: [<e03f2377>] gfs_consist_i+0x24/0x28 [gfs]
Apr 21 09:54:02 link-10 kernel: [<e03ec40b>] gfs_increment_blkno+0x184/0x1da [gfs]
Apr 21 09:54:02 link-10 kernel: [<e03ec7b1>] foreach_descriptor+0x350/0x365 [gfs]
Apr 21 09:54:02 link-10 kernel: [<e03ecfe8>] gfs_recover_dump+0xd3/0x156 [gfs]
Apr 21 09:54:02 link-10 kernel: [<e03f0823>] gfs_make_fs_rw+0xc3/0x11b [gfs]
Apr 21 09:54:02 link-10 kernel: [<e03e618f>] fill_super+0x9b7/0xf3f [gfs]
Apr 21 09:54:02 link-10 kernel: [<c01446f4>] pagevec_lookup+0x17/0x1d
Apr 21 09:54:02 link-10 kernel: [<c015c336>] set_blocksize+0x77/0x7c
Apr 21 09:54:02 link-10 kernel: [<e03e6846>] gfs_get_sb+0x114/0x159 [gfs]
Apr 21 09:54:02 link-10 kernel: [<c015bfc3>] do_kern_mount+0x8a/0x13d
Apr 21 09:54:02 link-10 kernel: [<c016ea23>] do_new_mount+0x61/0x90
Apr 21 09:54:02 link-10 kernel: [<c016f070>] do_mount+0x178/0x190
Apr 21 09:54:02 link-10 kernel: [<c02c7e28>] common_interrupt+0x18/0x20
Apr 21 09:54:02 link-10 kernel: [<c016ee37>] exact_copy_from_user+0x20/0x4f
Apr 21 09:54:02 link-10 kernel: [<c016f3c7>] sys_mount+0x91/0x108
Apr 21 09:54:02 link-10 kernel: [<c02c746b>] syscall_call+0x7/0xb
Apr 21 09:54:02 link-10 kernel: Code: ff 74 24 14 e8 ae 7a d4 df 53 68 e3 58 3f e0 e8 92 7a d4 df 53 68 15 59 3f e0 e8 87 7a d4 df 83 c4 18 83 be 34 02 00 00 00 74 08 <0f> 0b be 00 2c 58 3f e0 8b 86 08 47 02 00 85 c0 74 1c ba 02 00
Apr 21 09:54:02 link-10 kernel: <0>Fatal exception: panic in 5 seconds

Version-Release number of selected component (if applicable):
GFS 2.6.9-28.5 (built Apr 11 2005 15:30:01) installed

How reproducible:
seen it once so far
This looks like a bug I discovered when working on the new logging code. Log space is reclaimed when the FS's in-core data thinks that the space is no longer needed. It should happen when the on-disk data says it's no longer needed.
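To illustrate the distinction, here is a minimal sketch of gating log-space reuse on the on-disk tail rather than the in-core one. It is a simplified model, not the GFS code: `struct log_state`, `can_reuse`, and the use of monotonically increasing block sequence numbers are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified journal state (not the real GFS structures).
 * Blocks are identified by a sequence number that only ever increases,
 * which sidesteps wrap-around handling in this sketch. */
struct log_state {
    uint64_t incore_tail;  /* oldest block the in-memory state still needs */
    uint64_t ondisk_tail;  /* tail recorded by the last log header on disk */
};

/* A log block may be overwritten only once the on-disk tail has moved
 * past it. The in-core tail is deliberately ignored: it can run ahead of
 * what a recovering node would read back from disk after a crash. */
bool can_reuse(const struct log_state *ls, uint64_t block_seq)
{
    return block_seq < ls->ondisk_tail;
}
```

With `incore_tail = 100` and `ondisk_tail = 40`, block 39 is reusable but blocks 40 through 99 are not, even though the in-core state no longer needs them.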
GFS can now no longer modify the part of the on-disk log that has already been written out to its in-place location until the on-disk log head points to the new tail. However, the code that I wrote to do this does memory allocations while holding the log_lock. This could possibly cause lockups in low-memory situations.
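One common way to avoid allocating under a lock is to allocate first and take the lock only to link the new object in. The sketch below shows that pattern in userspace C with a pthread mutex standing in for the log_lock; `struct log`, `struct pending_buf`, and `log_track_block` are hypothetical names, and this is not necessarily how the GFS code was or should be changed.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical record of a block that must not be overwritten yet. */
struct pending_buf {
    unsigned long blkno;
    struct pending_buf *next;
};

/* Hypothetical log object; log_lock stands in for the GFS log_lock. */
struct log {
    pthread_mutex_t log_lock;
    struct pending_buf *pending;
};

/* Allocate before taking the lock, so that an allocation which has to
 * wait for memory never does so while log_lock is held. */
int log_track_block(struct log *lg, unsigned long blkno)
{
    struct pending_buf *pb = malloc(sizeof(*pb));
    if (pb == NULL)
        return -1;              /* fail without ever touching the lock */
    pb->blkno = blkno;

    pthread_mutex_lock(&lg->log_lock);
    pb->next = lg->pending;     /* only pointer updates happen under the lock */
    lg->pending = pb;
    pthread_mutex_unlock(&lg->log_lock);
    return 0;
}
```

The critical section is reduced to two pointer assignments, so nothing that can block on memory reclaim runs with the lock held.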
OK, this is not the bug that Ken had discovered. I'm still trying to recreate it. All I really know is the symptom that we've seen: when GFS is replaying the journal, it finds a log header that claims to be the start of a new transaction. However, according to the previous log descriptor, it isn't.
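The symptom can be modelled roughly as follows. This is only an illustration of the kind of check that trips during replay, under the assumption that a log descriptor announces how many blocks belong to it; `struct replay_state` and the three functions are hypothetical and do not mirror the actual logic in recovery.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical replay cursor: how many blocks the most recent log
 * descriptor said still belong to the current transaction. */
struct replay_state {
    uint32_t remaining;
};

/* Called when a log descriptor is read; nblocks is the length it claims. */
void begin_descriptor(struct replay_state *rs, uint32_t nblocks)
{
    rs->remaining = nblocks;
}

/* Called for each block consumed on behalf of the current descriptor. */
void consume_block(struct replay_state *rs)
{
    if (rs->remaining > 0)
        rs->remaining--;
}

/* A header that claims to start a new transaction is only plausible if
 * the previous descriptor has been fully consumed. Anything else is the
 * inconsistency described above. */
bool header_is_consistent(const struct replay_state *rs, bool claims_new_txn)
{
    return !(claims_new_txn && rs->remaining > 0);
}
```

In this model, a "new transaction" header seen after a descriptor that announced three blocks, with none of them consumed yet, is reported as inconsistent.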
*** Bug 156973 has been marked as a duplicate of this bug. ***
Putting back into assigned state.
I have never been able to recreate this bug. QA has seen it twice. If anyone sees this bug again, the most useful thing that they can do is to send me a copy of the entire corrupted journal.
I have never seen this. No one has seen this in over half a year. If someone sees this again, they should reopen it.