Bug 185677 - md multipath errors and corrupt fs
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: gfs
Hardware: x86_64 Linux
Priority: medium
Severity: medium
Assigned To: David Teigland
QA Contact: GFS Bugs
Depends On:
Blocks: 180185
Reported: 2006-03-16 15:32 EST by David Teigland
Modified: 2010-01-11 22:10 EST (History)
Doc Type: Bug Fix
Last Closed: 2006-04-11 14:46:24 EDT

Attachments: None
Description David Teigland 2006-03-16 15:32:30 EST
Description of problem:

Original report is bz 172944 comment 9.

GFS nodes are doing i/o over several paths.
One or more paths are disabled.
Many scsi errors (meaning/severity unknown).
Many md multipath errors (meaning/severity unknown).
Node(s) rebooted before md errors are resolved.
Cluster brought back up.
Corruption is found in one of three gfs fs's.

I was given /var/log/messages from two different nodes
(sqaone01, sqaone02) that showed the final two steps in
that sequence.  Oddly, only the log from sqaone01 showed
the previous steps.  It appears sqaone02 was not being
used during those steps?  Were there other nodes in the
cluster during the previous steps, when the failures
actually occurred?

There are quite a lot of md error messages that I have
no experience interpreting.  That's a big area of uncertainty.
Some of the errors appear to relate to updating md's superblock;
superblock updates are an area where md is especially vulnerable
to problems in a cluster, since the different nodes' updates clobber
each other.  It's unknown what the actual effects of this would be,
either directly within md or, worse, indirectly on the data/storage.

Apart from learning what the md errors mean, my suggestion would be
to break this problem down into smaller pieces or swap in different
pieces to narrow down which subsystem is at fault.  At this point
there's no indication that GFS is at fault, but the i/o layers
below the fs look questionable.  Perhaps you could replace gfs
with ext3 or gfs+nolock, have a single machine use the fs, and
run the same path-killing tests.  Or run a test that eliminates
the fs entirely: one or more nodes write patterns to the device,
read them back, and verify the data while paths are killed.
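
The last suggestion, a pattern test with no fs in the way, could be
sketched roughly as below.  This is only an illustration under my own
assumptions, not an existing test tool: the function names, the 4096-byte
block size and the seed-based pattern are all made up for the example.
It writes deterministic blocks, flushes them, then reads them back and
reports which block numbers no longer match.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed block size; adjust for the device under test


def pattern_block(index, seed=0, block_size=BLOCK_SIZE):
    """Return the deterministic contents expected for block number `index`."""
    out = bytearray()
    counter = 0
    while len(out) < block_size:
        # Chain SHA-256 digests so every (seed, index) pair yields distinct data.
        out += hashlib.sha256(f"{seed}:{index}:{counter}".encode()).digest()
        counter += 1
    return bytes(out[:block_size])


def write_pattern(path, n_blocks, seed=0, block_size=BLOCK_SIZE):
    """Write n_blocks pattern blocks from offset 0 and push them to storage."""
    with open(path, "r+b") as dev:
        for i in range(n_blocks):
            dev.seek(i * block_size)
            dev.write(pattern_block(i, seed, block_size))
        dev.flush()
        os.fsync(dev.fileno())  # ask the kernel to write the data out


def verify_pattern(path, n_blocks, seed=0, block_size=BLOCK_SIZE):
    """Read the blocks back and return the indices that do not match."""
    bad = []
    with open(path, "rb") as dev:
        for i in range(n_blocks):
            dev.seek(i * block_size)
            if dev.read(block_size) != pattern_block(i, seed, block_size):
                bad.append(i)
    return bad
```

For example, write_pattern("/dev/md0", 1024) followed by
verify_pattern("/dev/md0", 1024) while paths are being killed, where
/dev/md0 is a hypothetical device name and the run destroys whatever is
on it.  Giving each node its own seed and its own region of the device
would be one way to tell the writers apart.  Note that reads here can be
served from the page cache, so a real test would need to bypass or drop
the cache (O_DIRECT, for instance) to actually exercise the paths.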

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
Actual results:

Expected results:

Additional info:
Comment 2 David Teigland 2006-04-11 14:46:24 EDT
These folks have switched to using dm multipathing.
