Description of problem:
I had the morph cluster (morph-01 - morph-06) running a light I/O load all night long for about 16 hours before I had revolver take down 2 nodes (morph-03 & morph-05). Those two nodes ended up getting fenced by morph-01, brought back into the cluster, and had the filesystems mounted back up. In the meantime, while replaying their journals, morph-04 sees a bad gfs_meta_check and then ends up getting fenced by morph-01 as well. The remaining nodes continue the I/O load.

Jan 4 10:13:14 morph-04 kernel: CMAN: no HELLO from morph-03, removing from the cluster
Jan 4 10:13:18 morph-04 kernel: CMAN: node morph-05 is not responding - removing from the cluster
Jan 4 10:13:19 morph-04 fenced[3641]: fencing deferred to 1
Jan 4 10:13:48 morph-04 kernel: GFS: fsid=morph-cluster:gfs1.3: jid=4: Trying to acquire journal lock...
Jan 4 10:13:48 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: jid=4: Trying to acquire journal lock...
Jan 4 10:13:48 morph-04 kernel: GFS: fsid=morph-cluster:gfs2.3: jid=4: Trying to acquire journal lock...
Jan 4 10:13:48 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: jid=4: Looking at journal...
Jan 4 10:13:48 morph-04 kernel: GFS: fsid=morph-cluster:gfs1.3: jid=4: Busy
Jan 4 10:13:48 morph-04 kernel: GFS: fsid=morph-cluster:gfs1.3: jid=2: Trying to acquire journal lock...
Jan 4 10:13:48 morph-04 kernel: GFS: fsid=morph-cluster:gfs1.3: jid=2: Busy
Jan 4 10:13:48 morph-04 kernel: GFS: fsid=morph-cluster:gfs2.3: jid=4: Looking at journal...
Jan 4 10:13:51 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: jid=4: Acquiring the transaction lock...
Jan 4 10:13:51 morph-04 kernel: GFS: fsid=morph-cluster:gfs2.3: jid=4: Acquiring the transaction lock...
Jan 4 10:14:34 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: jid=4: Replaying journal...
Jan 4 10:14:38 morph-04 kernel: GFS: fsid=morph-cluster:gfs2.3: jid=4: Replaying journal...
Jan 4 10:15:04 morph-04 kernel: Info fld=0x0, Current sda: sense key No Sense
Jan 4 10:15:05 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: fatal: invalid metadata block
Jan 4 10:15:05 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: bh = 89487074 (magic)
Jan 4 10:15:05 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: function = replay_block
Jan 4 10:15:05 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: file = /usr/src/cluster/gfs-kernel/src/gfs/lops.c, line = 559
Jan 4 10:15:05 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: time = 1104855305
Jan 4 10:15:05 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: about to withdraw from the cluster
Jan 4 10:15:05 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: waiting for outstanding I/O
Jan 4 10:15:22 morph-04 kernel: Info fld=0x0, Current sda: sense key No Sense
Jan 4 10:15:22 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: telling LM to withdraw
Jan 4 10:15:22 morph-04 kernel: name " 5 593b786" flags 2 nodeid 0 ref 1
Jan 4 10:15:22 morph-04 kernel: G 01d1037e gr 3 rq -1 flg 0 sts 2 node 0 remid 0 lq 0,0
Jan 4 10:15:22 morph-04 kernel: name " 11 59db78b" flags 2 nodeid 0 ref 1
. . .

This lock info continues to spit out until morph-04 gets fenced.

Version-Release number of selected component (if applicable):
Jan 3 17:54:29 morph-04 kernel: GFS <CVS> (built Jan 3 2005 16:30:15) installed

How reproducible:
Didn't try
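For reference, the check that fired here is GFS's metadata magic check. This is a minimal sketch of its logic, not the actual lops.c source; the real on-disk header layout lives in gfs_ondisk.h and the fields are big-endian on disk (byte swapping omitted here for brevity):

#include <stdint.h>

#define GFS_MAGIC 0x01161970u      /* magic number at the start of every GFS metadata block */

/* Simplified stand-in for the on-disk gfs_meta_header. */
struct gfs_meta_header {
        uint32_t mh_magic;         /* must be GFS_MAGIC */
        uint32_t mh_type;          /* GFS_METATYPE_* code for the block kind */
};

/* replay_block() reads each journaled block back and runs a check of
 * this shape on it; "(magic)" in the log above means mh_magic didn't
 * match, i.e. the buffer held something that isn't GFS metadata at all. */
static int meta_check(const struct gfs_meta_header *mh)
{
        return mh->mh_magic == GFS_MAGIC ? 0 : -1;  /* -1 => "fatal: invalid metadata block" */
}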
Reproduced this again during recovery. This time it was a failed gfs_metatype_check in /usr/src/cluster/gfs-kernel/src/gfs/recovery.c.

Jan 4 14:17:36 morph-01 kernel: GFS: fsid=morph-cluster:gfs1.0: jid=4: Replayed 1006 of 1044 blocks
Jan 4 14:17:36 morph-01 kernel: GFS: fsid=morph-cluster:gfs1.0: jid=4: replays = 1006, skips = 10, sames = 28
Jan 4 14:17:36 morph-01 kernel: GFS: fsid=morph-cluster:gfs0.0: jid=4: Replaying journal...
Jan 4 14:18:24 morph-01 kernel: Info fld=0x0, Current sda: sense key No Sense
Jan 4 14:18:24 morph-01 kernel: GFS: fsid=morph-cluster:gfs0.0: fatal: invalid metadata block
Jan 4 14:18:24 morph-01 kernel: GFS: fsid=morph-cluster:gfs0.0: bh = 134226627 (type: exp=9, found=4)
Jan 4 14:18:24 morph-01 kernel: GFS: fsid=morph-cluster:gfs0.0: function = foreach_descriptor
Jan 4 14:18:24 morph-01 kernel: GFS: fsid=morph-cluster:gfs0.0: file = /usr/src/cluster/gfs-kernel/src/gfs/recovery.c, line = 396
Jan 4 14:18:24 morph-01 kernel: GFS: fsid=morph-cluster:gfs0.0: time = 1104869904
Jan 4 14:18:24 morph-01 kernel: Buffer I/O error on device diapered_dm-0, logical block 178928990
Jan 4 14:18:24 morph-01 kernel: lost page write due to I/O error on diapered_dm-0
Jan 4 14:18:24 morph-01 kernel: Buffer I/O error on device diapered_dm-0, logical block 178928995
Jan 4 14:18:24 morph-01 kernel: lost page write due to I/O error on diapered_dm-0
Jan 4 14:18:24 morph-01 kernel: Buffer I/O error on device diapered_dm-0, logical block 178928996
Jan 4 14:18:24 morph-01 kernel: lost page write due to I/O error on diapered_dm-0
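The metatype variant adds one more comparison on top of the magic check. A sketch of that logic, reusing the simplified gfs_meta_header from the sketch above (not the exact macro from recovery.c): foreach_descriptor() walks the journal expecting a particular block kind at each point, so "type: exp=9, found=4" means the block was valid GFS metadata but of the wrong kind for that journal offset.

/* Sketch of the gfs_metatype_check logic. */
static int metatype_check(const struct gfs_meta_header *mh, uint32_t expected)
{
        if (mh->mh_magic != GFS_MAGIC)
                return -1;                 /* not metadata at all */
        if (mh->mh_type != expected)
                return -1;                 /* logged as "type: exp=<expected>, found=<mh_type>" */
        return 0;
}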
Jan 4 10:15:04 morph-04 kernel: Info fld=0x0, Current sda: sense key No Sense

What is this? This isn't the first time you've posted a bug with this line right before something goes terribly wrong. Is sda the device this filesystem is on?
The filesystem is on sda. I'm not exactly sure what that message means. What I don't like is that it always seems to show up while recovery is taking place, in all those bugs I've posted. I know that we panic as a result of read or write errors, and if I ever see SCSI read or write errors followed by a panic/issue I ignore it. But I can run I/O for hours without I/O errors, and the second I try recovery that error shows up and a problem occurs; at times the filesystem even ends up corrupted as a result.
I did not see those kernel messages after trying Dean's suggestion of having I/O on all but one or two nodes in the cluster and continually loading and unloading the FC driver on the node(s) that were not doing I/O.
I tried a bunch of recovery attempts with filesystems mounted and _no_ I/O going and couldn't hit the kernel messages either.
What I'm worried about is that if your hardware is flaky enough to generate spontaneous I/O errors regularly, it is also causing silent corruption. That "sense key" message might be indicative of one of them. All the bugs that have this "sense key" message (139696, 140736, and 140736) are errors where a block is just totally corrupted. This would fit with the I/O layer reading or writing garbage, or mismatching I/O requests and responses.

139696 is hard to see as general filesystem corruption -- GFS read block #35835474 and got back contents that said it was block #82515456. Lock consistency problems might cause a block to get overwritten with incorrect metadata or a different type of metadata, but they wouldn't cause a block to get written to the wrong place. A read request whose response gets routed as the response to a different request would cause this type of error. It could also be caused by a data mixup in the controller. It could even be caused by a read request that supposedly completed but didn't -- the data in the buffer would be whatever was in that memory from the last time it was used for something.

I'm not saying it definitely isn't a GFS problem, but it doesn't seem like it is.
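To make the block-number mismatch in 139696 concrete: a GFS dinode is self-describing and records the address it was written to, so a read can cross-check the buffer contents against the block number it asked for. A rough sketch, with the struct simplified from gfs_ondisk.h:

#include <stdint.h>

/* Simplified from gfs_ondisk.h: a dinode carries its own address. */
struct gfs_inum {
        uint64_t no_formal_ino;
        uint64_t no_addr;          /* block the dinode believes it lives at */
};

/* If we read block X and the contents claim to be block Y, the data was
 * written for (or routed from) somewhere else entirely -- e.g. bug 139696,
 * where blkno_read was 35835474 but the buffer said 82515456. */
static int check_self_address(uint64_t blkno_read, const struct gfs_inum *di_num)
{
        return di_num->no_addr == blkno_read ? 0 : -1;
}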
Which FC controller is this?
I've seen this on nodes with qla2300 and lpfc.
FWIW, I saw this issue on my cluster without the lpfc nodes (only qla2300s). Also, there was no "kernel: Info fld=0x0, Current sda: sense key No Sense" message before it happened. I'll try with other hardware to see if mine is causing this.
Still happening?
*** This bug has been marked as a duplicate of 175589 ***
We got it last night:

Feb 9 16:11:37 igrid01 clurgmgrd: [7540]: <info> Executing /mnt/crosswalk/snapshot/run_snap status
Feb 9 16:11:46 igrid01 kernel: GFS: fsid=igrid:NFS_ROOT2.1: fatal: invalid metadata block
Feb 9 16:11:46 igrid01 kernel: GFS: fsid=igrid:NFS_ROOT2.1: bh = 13241964 (magic)
Feb 9 16:11:46 igrid01 kernel: GFS: fsid=igrid:NFS_ROOT2.1: function = buf_build_bhlist
Feb 9 16:11:46 igrid01 kernel: GFS: fsid=igrid:NFS_ROOT2.1: file = fs/gfs/lops.c, line = 453
Feb 9 16:11:46 igrid01 kernel: GFS: fsid=igrid:NFS_ROOT2.1: time = 1139526706
Feb 9 16:11:46 igrid01 kernel: GFS: fsid=igrid:NFS_ROOT2.1: about to withdraw from the cluster
Feb 9 16:11:46 igrid01 kernel: GFS: fsid=igrid:NFS_ROOT2.1: waiting for outstanding I/O
Feb 9 16:11:46 igrid01 kernel: GFS: fsid=igrid:NFS_ROOT2.1: telling LM to withdraw
Feb 9 16:11:49 igrid01 kernel: lock_dlm: withdraw abandoned memory
Feb 9 16:11:49 igrid01 kernel: GFS: fsid=igrid:NFS_ROOT2.1: withdrawn