Description of problem:
With LVM configured on top of device-mapper multipathed storage, if the devices shown in the configuration header on the disks (/dev/dm-*) do not match the UUIDs shown, then buffer I/O errors occur. Correcting the device to match the UUID solves the problem.

Version-Release number of selected component (if applicable):

How reproducible:
Every time

Steps to Reproduce:
1. Use LVM to create a volume group on two dm devices and a striped logical volume across the two devices in the volume group
2. Use dd to look at the LVM header on the dm device and verify that the two devices used in the vgcreate command are shown in the configuration
3. Create a GFS file system on the logical volume
4. Run traffic to the GFS file system
5. Fail an array controller to cause all paths to the logical volume to fail, which causes the GFS file system to withdraw
6. Reboot the node
7. Use dd to look at the LVM headers on the dm devices and observe that the dm devices shown in the configuration do not match the dm devices used to create the volume group
8. Run traffic to the GFS file system and observe buffer I/O errors
(A command-level sketch of steps 1-3 appears at the end of this comment.)

Actual results:
Buffer I/O errors

Expected results:
No I/O errors

Additional info:
There are several things happening with this bug. Here is the scenario that led us to recognize that the LVM metadata inconsistency was causing the buffer I/O errors.

After creating completely new PVs, VGs, and LVs on a GFS clustered system with three nodes, we ran traffic for two days with no problems. We then failed one of the controllers on the storage array. Two of the nodes reported SCSI I/O errors on the active path group only, as expected, but the third node got SCSI I/O errors on all 8 paths (4 paths on each of the active and passive path groups). When all the paths to a LUN on the third node failed, the GFS file system withdrew on that node. This was not the expected behavior, since only the active controller, not the standby controller, was reset. The other two nodes recovered from the controller reset and ran without error until we rebooted the third node. When the third node came up and remounted the GFS file system, all three nodes started reporting buffer I/O errors. When we looked at the LVM metadata, we found the dm device inconsistency. After using a disk editor to correct the inconsistency, we were able to again run for two days with no errors.

I will attach the system logs from these three nodes.
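For reference, here is a minimal command-level sketch of steps 1-3. The device names (/dev/dm-0, /dev/dm-1), the volume group and logical volume names (vg_test, lv_test), and the GFS cluster/filesystem names are placeholders, not the actual values from this system:

  pvcreate /dev/dm-0 /dev/dm-1
  vgcreate vg_test /dev/dm-0 /dev/dm-1
  # -i 2 stripes the LV across both PVs; -I 64 is a 64KB stripe size
  lvcreate -i 2 -I 64 -L 10G -n lv_test vg_test
  gfs_mkfs -p lock_dlm -t cluster1:gfs1 -j 3 /dev/vg_test/lv_test
  # Step 2: dump the start of a PV and look for the LVM text metadata
  dd if=/dev/dm-0 bs=512 count=16 2>/dev/null | strings | less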
Created attachment 130918 [details]
Additional info

This attachment shows the output of a pvs command indicating that the volume group vg_igrid_01 is on /dev/dm-2 and /dev/dm-3. It also shows the output of a dd command reading from /dev/dm-2, which contains the LVM configuration data; that data says the volume group is on /dev/dm-0 and /dev/dm-1. Note, however, that the UUID shown in the configuration for /dev/dm-1 is actually the same as the UUID shown at the beginning of the dd output for /dev/dm-2.
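For anyone repeating this comparison, commands along these lines should show the mismatch (illustrative only; the number of sectors read is just a guess at how much covers the metadata area):

  pvs -o pv_name,vg_name,pv_uuid
  # The LVM text metadata records each PV as a pair of "id =" (UUID)
  # and "device =" lines; compare these against the pvs output.
  dd if=/dev/dm-2 bs=512 count=16 2>/dev/null | strings | grep -E 'id =|device ='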
Created attachment 130919 [details] Multipath configuration
Created attachment 130920 [details] LVM configuration
Created attachment 130921 [details]
System log

This is one of the two nodes that recovered from the controller reset, which occurred at about 9:00am on June 9.
Created attachment 130922 [details]
System log

This is the third node, which had the GFS file system withdrawal at about 9:00am on June 9 and was rebooted at about 9:30am.
This sounds like it might be a multipath-related problem. Does that sound reasonable?
Yes, it does. It looks like multipath does not always create the same dm device for a given LUN. I noticed on the multipath-tools website that there is something called devmap_name that udev can use to name the dm devices. However, this has been commented out in /etc/udev/rules.d/50-udev.rules. Here is the quote from http://christophe.varoqui.free.fr/wiki/wakka.php?wiki=ReferenceBook:

"The udev userspace tool is triggered upon every block sysfs entry creation and suppression, and assume the responsibility of the associated device node creation and naming. Udev default naming policies can be complemented by add-on scripts or binaries. As it does not currently have a default policy for device maps naming, we plug a little tool named devmap_name that resolve the sysfs dm-[0-9]* names in map names as set at map creation time. Provided the map naming is rightly done, this plugin provides the naming stability and meaningfulness required for a proper multipath implementation."

Any idea why that was commented out in this release?
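For context, the disabled rule in /etc/udev/rules.d/50-udev.rules is along these lines (quoted from memory, so treat the exact syntax as an approximation for the udev version in this release):

  # KERNEL="dm-[0-9]*", PROGRAM="/sbin/devmap_name %M %m", NAME="%k", SYMLINK="%c"

Uncommenting it would have udev run devmap_name with the device's major/minor numbers (%M %m) and attach the returned map name as a symlink (%c), giving the dm-N nodes stable, meaningful aliases.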
I have no idea. This needs assigning to someone who knows about multipath!
You should never use /dev/dm-* to reference the devices; those are not persistent names. Since you have user_friendly_names turned on, you should be able to use the user-friendly multipath names (/dev/mpath/mpath*). These are unique per machine (i.e. on nodeA, if a multipathed device is assigned mpath0, it will always be assigned mpath0). To make them unique across the cluster, start up multipathing on one machine, then copy the /var/lib/multipath/bindings file from that machine to all the other machines in the cluster. All the machines will then use the same user-friendly names for devices. Otherwise, you can turn the user_friendly_names feature off and just refer to the devices by their multipath-assigned WWID.
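A minimal sketch of that distribution step, assuming nodeB and nodeC are the other cluster members (hostnames are placeholders):

  # Run on the node where multipathd first generated the bindings:
  for node in nodeB nodeC; do
      scp /var/lib/multipath/bindings $node:/var/lib/multipath/bindings
  done
  # After restarting multipathd on those nodes, reference the stable
  # names instead of /dev/dm-*, e.g.:
  pvcreate /dev/mpath/mpath0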
I have been unable to recreate this. Are you able to reproduce either the I/O errors or the all-paths failure?
We have made changes in our system to prevent the LVM metadata inconsistency from occurring. We are not seeing buffer I/O errors or path failures now. Feel free to close this bug if you like.