Red Hat Bugzilla – Bug 185677
md multipath errors and corrupt fs
Last modified: 2010-01-11 22:10:31 EST
Description of problem:
Original report is bz 172944 comment 9.
GFS nodes are doing i/o over several paths.
One or more paths are disabled.
Many scsi errors (meaning/severity unknown).
Many md multipath errors (meaning/severity unknown).
Node(s) rebooted before md errors are resolved.
Cluster brought back up.
Corruption is found in one of three gfs fs's.
I was given /var/log/messages from two different nodes
(sqaone01, sqaone02) that showed the final two steps in
that sequence. Oddly, only the log from sqaone01 showed
the previous steps; it appears sqaone02 was not being
used during the previous steps. Were there other nodes
in the cluster during the previous steps when the failures
occurred?
There are quite a lot of md error messages that I have
no experience interpreting. That's a big area of uncertainty.
Some of the errors appear to relate to updating md's superblock;
superblock updates are an area where md is especially vulnerable
to problems in a cluster since the different nodes' updates clobber
each other. It's unknown what the actual effects of this would be,
whether directly within md or, worse, indirectly on the data/storage.
Apart from learning what the md errors mean, my suggestion would be
to break this problem down into smaller pieces or swap in different
pieces to narrow down which subsystem is at fault. At this point
there's no indication that GFS is at fault, but the i/o layers
below the fs look questionable. Perhaps you could replace gfs
with ext3 or gfs+nolock, have a single machine using the fs and
try the same tests killing paths. Or, run a test that eliminates
the fs entirely: have one or more nodes write known patterns to
the device, verify the data on read-back, and kill paths.
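A minimal sketch of that fs-free pattern test (Python; the block size,
block count, and target path are placeholder assumptions -- point it at a
scratch device or file, never one holding real data):

```python
import os

BLOCK_SIZE = 4096   # one pattern block (assumed; match your device's i/o size)
NUM_BLOCKS = 1024   # total data written: 4 MiB (assumed test size)

def pattern(block_num: int) -> bytes:
    """Deterministic per-block pattern: the block number repeated as 8-byte words."""
    word = block_num.to_bytes(8, "little")
    return word * (BLOCK_SIZE // 8)

def write_patterns(path: str) -> None:
    """Write a known pattern to every block, then fsync to push it through the i/o stack."""
    with open(path, "wb") as dev:
        for n in range(NUM_BLOCKS):
            dev.write(pattern(n))
        dev.flush()
        os.fsync(dev.fileno())

def verify_patterns(path: str) -> list:
    """Read everything back; return the block numbers whose contents no longer match."""
    bad = []
    with open(path, "rb") as dev:
        for n in range(NUM_BLOCKS):
            if dev.read(BLOCK_SIZE) != pattern(n):
                bad.append(n)
    return bad
```

The idea would be to run write_patterns, kill one or more paths while it is
in flight or between runs, then run verify_patterns once the paths are back;
any block numbers it returns point at corruption happening below the fs layer.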
Version-Release number of selected component (if applicable):
Steps to Reproduce:
These folks have switched to using dm multipathing.