Red Hat Bugzilla – Bug 185677
md multipath errors and corrupt fs
Last modified: 2010-01-11 22:10:31 EST
Description of problem:
Original report is bz 172944 comment 9.
GFS nodes are doing i/o over several paths.
One or more paths are disabled.
Many scsi errors (meaning/severity unknown).
Many md multipath errors (meaning/severity unknown).
Node(s) rebooted before md errors are resolved.
Cluster brought back up.
Corruption is found in one of three gfs fs's.
I was given /var/log/messages from two different nodes
(sqaone01, sqaone02) that showed the final two steps in
that sequence. Oddly, only the log from sqaone01 showed
the previous steps; it appears sqaone02 was not being
used during the previous steps. Were there other nodes
in the cluster during the previous steps when the failures
occurred?
There are quite a lot of md error messages that I have
no experience interpreting. That's a big area of uncertainty.
Some of the errors appear to relate to updating md's superblock;
superblock updates are an area where md is especially vulnerable
to problems in a cluster since the different nodes' updates clobber
each other. It's unknown what the actual effects of this would be,
whether directly within md or, worse, indirectly on the data/storage.
Apart from learning what the md errors mean, my suggestion would be
to break this problem down into smaller pieces or swap in different
pieces to narrow down which subsystem is at fault. At this point
there's no indication that GFS is at fault, but the i/o layers
below the fs look questionable. Perhaps you could replace gfs
with ext3 or gfs+nolock, have a single machine using the fs and
try the same tests killing paths. Or, run a test that eliminates
the fs entirely: have one or more nodes write known patterns to
the device, verify the data on read-back, and kill paths.
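A minimal sketch of that fs-free pattern test (Python; the block size,
block count, and target path are placeholder assumptions -- point it at a
scratch device or file, never one holding real data):

```python
import os

BLOCK_SIZE = 4096   # one pattern block (assumed; match your device's i/o size)
NUM_BLOCKS = 1024   # total data written: 4 MiB (assumed test size)

def pattern(block_num: int) -> bytes:
    """Deterministic per-block pattern: the block number repeated as 8-byte words."""
    word = block_num.to_bytes(8, "little")
    return word * (BLOCK_SIZE // 8)

def write_patterns(path: str) -> None:
    """Write a known pattern to every block, then fsync to push it through the i/o stack."""
    with open(path, "wb") as dev:
        for n in range(NUM_BLOCKS):
            dev.write(pattern(n))
        dev.flush()
        os.fsync(dev.fileno())

def verify_patterns(path: str) -> list:
    """Read everything back; return the block numbers whose contents no longer match."""
    bad = []
    with open(path, "rb") as dev:
        for n in range(NUM_BLOCKS):
            if dev.read(BLOCK_SIZE) != pattern(n):
                bad.append(n)
    return bad
```

The idea would be to run write_patterns, kill one or more paths while it is
in flight or between runs, then run verify_patterns once the paths are back;
any block numbers it returns point at corruption happening below the fs layer.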
Version-Release number of selected component (if applicable):
Steps to Reproduce:
These folks have switched to using dm multipathing.