Description of problem:
I had the morph cluster (morph-01 - morph-06) running a light I/O load all night long for about 16 hours before I had revolver take down 2 nodes (morph-03 & morph-05). Those two nodes ended up getting fenced by morph-01, brought back into the cluster, and had the filesystems mounted back up. In the meantime, while replaying their journals, morph-04 sees a bad gfs_meta_check and then ends up getting fenced by morph-01 as well. The remaining nodes continue the I/O load.

Jan 4 10:13:14 morph-04 kernel: CMAN: no HELLO from morph-03, removing from the cluster
Jan 4 10:13:18 morph-04 kernel: CMAN: node morph-05 is not responding - removing from the cluster
Jan 4 10:13:19 morph-04 fenced[3641]: fencing deferred to 1
Jan 4 10:13:48 morph-04 kernel: GFS: fsid=morph-cluster:gfs1.3: jid=4: Trying to acquire journal lock...
Jan 4 10:13:48 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: jid=4: Trying to acquire journal lock...
Jan 4 10:13:48 morph-04 kernel: GFS: fsid=morph-cluster:gfs2.3: jid=4: Trying to acquire journal lock...
Jan 4 10:13:48 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: jid=4: Looking at journal...
Jan 4 10:13:48 morph-04 kernel: GFS: fsid=morph-cluster:gfs1.3: jid=4: Busy
Jan 4 10:13:48 morph-04 kernel: GFS: fsid=morph-cluster:gfs1.3: jid=2: Trying to acquire journal lock...
Jan 4 10:13:48 morph-04 kernel: GFS: fsid=morph-cluster:gfs1.3: jid=2: Busy
Jan 4 10:13:48 morph-04 kernel: GFS: fsid=morph-cluster:gfs2.3: jid=4: Looking at journal...
Jan 4 10:13:51 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: jid=4: Acquiring the transaction lock...
Jan 4 10:13:51 morph-04 kernel: GFS: fsid=morph-cluster:gfs2.3: jid=4: Acquiring the transaction lock...
Jan 4 10:14:34 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: jid=4: Replaying journal...
Jan 4 10:14:38 morph-04 kernel: GFS: fsid=morph-cluster:gfs2.3: jid=4: Replaying journal...
Jan 4 10:15:04 morph-04 kernel: Info fld=0x0, Current sda: sense key No Sense
Jan 4 10:15:05 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: fatal: invalid metadata block
Jan 4 10:15:05 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: bh = 89487074 (magic)
Jan 4 10:15:05 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: function = replay_block
Jan 4 10:15:05 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: file = /usr/src/cluster/gfs-kernel/src/gfs/lops.c, line = 559
Jan 4 10:15:05 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: time = 1104855305
Jan 4 10:15:05 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: about to withdraw from the cluster
Jan 4 10:15:05 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: waiting for outstanding I/O
Jan 4 10:15:22 morph-04 kernel: Info fld=0x0, Current sda: sense key No Sense
Jan 4 10:15:22 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: telling LM to withdraw
Jan 4 10:15:22 morph-04 kernel: name " 5 593b786" flags 2 nodeid 0 ref 1
Jan 4 10:15:22 morph-04 kernel: G 01d1037e gr 3 rq -1 flg 0 sts 2 node 0 remid 0 lq 0,0
Jan 4 10:15:22 morph-04 kernel: name " 11 59db78b" flags 2 nodeid 0 ref 1
. . .

This lock info continues to spit out until morph-04 gets fenced.

Version-Release number of selected component (if applicable):
Jan 3 17:54:29 morph-04 kernel: GFS <CVS> (built Jan 3 2005 16:30:15) installed

How reproducible:
Didn't try
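For reference, the check that fired here is GFS's metadata magic check. This is a minimal sketch of its logic, not the actual lops.c source; the real on-disk header layout lives in gfs_ondisk.h and the fields are big-endian on disk (byte swapping omitted here for brevity):

#include <stdint.h>

#define GFS_MAGIC 0x01161970u      /* magic number at the start of every GFS metadata block */

/* Simplified stand-in for the on-disk gfs_meta_header. */
struct gfs_meta_header {
        uint32_t mh_magic;         /* must be GFS_MAGIC */
        uint32_t mh_type;          /* GFS_METATYPE_* code for the block kind */
};

/* replay_block() reads each journaled block back and runs a check of
 * this shape on it; "(magic)" in the log above means mh_magic didn't
 * match, i.e. the buffer held something that isn't GFS metadata at all. */
static int meta_check(const struct gfs_meta_header *mh)
{
        return mh->mh_magic == GFS_MAGIC ? 0 : -1;  /* -1 => "fatal: invalid metadata block" */
}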
Reproduced this again during recovery. This time it was a failed gfs_metatype_check in /usr/src/cluster/gfs-kernel/src/gfs/recovery.c.

Jan 4 14:17:36 morph-01 kernel: GFS: fsid=morph-cluster:gfs1.0: jid=4: Replayed 1006 of 1044 blocks
Jan 4 14:17:36 morph-01 kernel: GFS: fsid=morph-cluster:gfs1.0: jid=4: replays = 1006, skips = 10, sames = 28
Jan 4 14:17:36 morph-01 kernel: GFS: fsid=morph-cluster:gfs0.0: jid=4: Replaying journal...
Jan 4 14:18:24 morph-01 kernel: Info fld=0x0, Current sda: sense key No Sense
Jan 4 14:18:24 morph-01 kernel: GFS: fsid=morph-cluster:gfs0.0: fatal: invalid metadata block
Jan 4 14:18:24 morph-01 kernel: GFS: fsid=morph-cluster:gfs0.0: bh = 134226627 (type: exp=9, found=4)
Jan 4 14:18:24 morph-01 kernel: GFS: fsid=morph-cluster:gfs0.0: function = foreach_descriptor
Jan 4 14:18:24 morph-01 kernel: GFS: fsid=morph-cluster:gfs0.0: file = /usr/src/cluster/gfs-kernel/src/gfs/recovery.c, line = 396
Jan 4 14:18:24 morph-01 kernel: GFS: fsid=morph-cluster:gfs0.0: time = 1104869904
Jan 4 14:18:24 morph-01 kernel: Buffer I/O error on device diapered_dm-0, logical block 178928990
Jan 4 14:18:24 morph-01 kernel: lost page write due to I/O error on diapered_dm-0
Jan 4 14:18:24 morph-01 kernel: Buffer I/O error on device diapered_dm-0, logical block 178928995
Jan 4 14:18:24 morph-01 kernel: lost page write due to I/O error on diapered_dm-0
Jan 4 14:18:24 morph-01 kernel: Buffer I/O error on device diapered_dm-0, logical block 178928996
Jan 4 14:18:24 morph-01 kernel: lost page write due to I/O error on diapered_dm-0
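The metatype variant adds one more comparison on top of the magic check. A sketch of that logic, reusing the simplified gfs_meta_header from the sketch above (not the exact macro from recovery.c): foreach_descriptor() walks the journal expecting a particular block kind at each point, so "type: exp=9, found=4" means the block was valid GFS metadata but of the wrong kind for that journal offset.

/* Sketch of the gfs_metatype_check logic. */
static int metatype_check(const struct gfs_meta_header *mh, uint32_t expected)
{
        if (mh->mh_magic != GFS_MAGIC)
                return -1;                 /* not metadata at all */
        if (mh->mh_type != expected)
                return -1;                 /* logged as "type: exp=<expected>, found=<mh_type>" */
        return 0;
}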
Jan 4 10:15:04 morph-04 kernel: Info fld=0x0, Current sda: sense key No Sense

What is this? This isn't the first time you've posted a bug with this line right before something goes terribly wrong. Is sda the device this filesystem is on?
The filesystem is on sda. I'm not exactly sure what that message means. What I don't like is that it always seems to show up while recovery is taking place, in all those bugs I've posted. I know that we panic as a result of read or write errors, and if I ever see SCSI read or write errors followed by a panic/issue I ignore it. But I can run I/O for hours without I/O errors, and the second I try recovery that error shows up and a problem occurs; at times the filesystem even ends up corrupted as a result.
I did not see those kernel messages after trying Dean's suggestion of having I/O on all but one or two nodes in the cluster and continually loading and unloading the FC driver on the node(s) that were not doing I/O.
I tried a bunch of recovery attempts with filesystems mounted and _no_ I/O going and couldn't hit the kernel messages either.
What I'm worried about is that if your hardware is flaky enough to generate spontaneous I/O errors regularly, it is also causing silent corruption. That "sense key" message might be indicative of one of them. All the bugs that have this "sense key" message (139696, 140736, and 140736) are errors where a block is just totally corrupted. This would fit with the I/O layer reading or writing garbage, or mismatching I/O requests and responses.

139696 is hard to see as general filesystem corruption -- GFS read block #35835474 and got back contents that said it was block #82515456. Lock consistency problems might cause a block to get overwritten with incorrect metadata or a different type of metadata, but they wouldn't cause a block to get written to the wrong place. A read request whose response gets routed as the response to a different request would cause this type of error. It could also be caused by a data mixup in the controller. It could even be caused by a read request that supposedly completed but didn't -- the data in the buffer would be whatever was in that memory from the last time it was used for something.

I'm not saying it definitely isn't a GFS problem, but it doesn't seem like it is.
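To make the block-number mismatch in 139696 concrete: a GFS dinode is self-describing and records the address it was written to, so a read can cross-check the buffer contents against the block number it asked for. A rough sketch, with the struct simplified from gfs_ondisk.h:

#include <stdint.h>

/* Simplified from gfs_ondisk.h: a dinode carries its own address. */
struct gfs_inum {
        uint64_t no_formal_ino;
        uint64_t no_addr;          /* block the dinode believes it lives at */
};

/* If we read block X and the contents claim to be block Y, the data was
 * written for (or routed from) somewhere else entirely -- e.g. bug 139696,
 * where blkno_read was 35835474 but the buffer said 82515456. */
static int check_self_address(uint64_t blkno_read, const struct gfs_inum *di_num)
{
        return di_num->no_addr == blkno_read ? 0 : -1;
}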
Which FC controller is this?
I've seen this on nodes with qla2300 and lpfc.
FWIW, I saw this issue on my cluster without the lpfc nodes (only qla2300s). Also, there was no "kernel: Info fld=0x0, Current sda: sense key No Sense" message before it happened. I'll try with other hardware to see if mine is causing this.
Still happening?
*** This bug has been marked as a duplicate of 175589 ***
We got it last night:

Feb 9 16:11:37 igrid01 clurgmgrd: [7540]: <info> Executing /mnt/crosswalk/snapshot/run_snap status
Feb 9 16:11:46 igrid01 kernel: GFS: fsid=igrid:NFS_ROOT2.1: fatal: invalid metadata block
Feb 9 16:11:46 igrid01 kernel: GFS: fsid=igrid:NFS_ROOT2.1: bh = 13241964 (magic)
Feb 9 16:11:46 igrid01 kernel: GFS: fsid=igrid:NFS_ROOT2.1: function = buf_build_bhlist
Feb 9 16:11:46 igrid01 kernel: GFS: fsid=igrid:NFS_ROOT2.1: file = fs/gfs/lops.c, line = 453
Feb 9 16:11:46 igrid01 kernel: GFS: fsid=igrid:NFS_ROOT2.1: time = 1139526706
Feb 9 16:11:46 igrid01 kernel: GFS: fsid=igrid:NFS_ROOT2.1: about to withdraw from the cluster
Feb 9 16:11:46 igrid01 kernel: GFS: fsid=igrid:NFS_ROOT2.1: waiting for outstanding I/O
Feb 9 16:11:46 igrid01 kernel: GFS: fsid=igrid:NFS_ROOT2.1: telling LM to withdraw
Feb 9 16:11:49 igrid01 kernel: lock_dlm: withdraw abandoned memory
Feb 9 16:11:49 igrid01 kernel: GFS: fsid=igrid:NFS_ROOT2.1: withdrawn