Red Hat Bugzilla – Bug 185677
md multipath errors and corrupt fs
Last modified: 2010-01-11 22:10:31 EST
Description of problem:
Original report is bz 172944 comment 9.
GFS nodes are doing i/o over several paths.
One or more paths are disabled.
Many scsi errors (meaning/severity unknown).
Many md multipath errors (meaning/severity unknown).
Node(s) rebooted before md errors are resolved.
Cluster brought back up.
Corruption is found in one of three gfs fs's.
I was given /var/log/messages from two different nodes
(sqaone01, sqaone02) that showed the final two steps in
that sequence. Oddly, only the log from sqaone01 showed
the previous steps; it appears sqaone02 was not being
used during the previous steps. Were there other nodes
in the cluster during the previous steps when the failures
occurred?
There are quite a lot of md error messages that I have
no experience interpreting. That's a big area of uncertainty.
Some of the errors appear to relate to updating md's superblock;
superblock updates are an area where md is especially vulnerable
to problems in a cluster since the different nodes' updates clobber
each other. It's unknown what the actual effects of this would be,
whether directly within md or, worse, indirectly on the data/storage.
Apart from learning what the md errors mean, my suggestion would be
to break this problem down into smaller pieces or swap in different
pieces to narrow down which subsystem is at fault. At this point
there's no indication that GFS is at fault, but the i/o layers
below the fs look questionable. Perhaps you could replace gfs
with ext3 or gfs+nolock, have a single machine using the fs and
try the same tests killing paths. Or, run a test that eliminates
the fs entirely: have one or more nodes write known patterns to
the device, verify the data on read-back, and kill paths.
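A minimal sketch of that fs-free pattern test (Python; the block size,
block count, and target path are placeholder assumptions -- point it at a
scratch device or file, never one holding real data):

```python
import os

BLOCK_SIZE = 4096   # one pattern block (assumed; match your device's i/o size)
NUM_BLOCKS = 1024   # total data written: 4 MiB (assumed test size)

def pattern(block_num: int) -> bytes:
    """Deterministic per-block pattern: the block number repeated as 8-byte words."""
    word = block_num.to_bytes(8, "little")
    return word * (BLOCK_SIZE // 8)

def write_patterns(path: str) -> None:
    """Write a known pattern to every block, then fsync to push it through the i/o stack."""
    with open(path, "wb") as dev:
        for n in range(NUM_BLOCKS):
            dev.write(pattern(n))
        dev.flush()
        os.fsync(dev.fileno())

def verify_patterns(path: str) -> list:
    """Read everything back; return the block numbers whose contents no longer match."""
    bad = []
    with open(path, "rb") as dev:
        for n in range(NUM_BLOCKS):
            if dev.read(BLOCK_SIZE) != pattern(n):
                bad.append(n)
    return bad
```

The idea would be to run write_patterns, kill one or more paths while it is
in flight or between runs, then run verify_patterns once the paths are back;
any block numbers it returns point at corruption happening below the fs layer.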
Version-Release number of selected component (if applicable):
Steps to Reproduce:
These folks have switched to using dm multipathing.