Description of problem:
Vdsm monitors only the first LUN in the storage domain (the one used for the metadata LV). If a vm is using an LV on another LUN, and the vm is paused after an EIO error on that LUN, vdsm may never detect the storage issue, so it will never resume the paused vm.

Version-Release number of selected component (if applicable):
Any

How reproducible:
Unknown.

Steps to Reproduce:
1. Create a storage domain with 2 LUNs
2. Create enough disks to fill the first LUN
3. Create a new disk (it should be created on the second LUN)
4. Start a vm with the new disk
5. Take the second LUN offline (can be done using sysfs)
6. Perform some I/O in the vm until the vm pauses
7. Make the second LUN available again

Actual results:
The vm will remain paused

Expected results:
The vm will resume when the LUN becomes online again

Additional info:
Never tested, but this is the expected outcome since vdsm does not handle this case.

This can probably be improved using multipath events:
1. When any multipath device belonging to a storage domain goes down (all paths failed), we can switch the domain to INVALID state.
2. When all multipath devices belonging to a storage domain are up, we can switch the domain to VALID state, and resume vms using this storage domain.

This may be too coarse-grained: if one LUN is not accessible but no vm is using this LUN, we will never resume any vm on this storage domain.

We can be more specific, and use disk-based monitoring:
1. Keep a mapping from disk to multipath devices, based on lvm info (lvm reports which PVs are used by each LV).
2. When a multipath device needed by a disk goes down, mark the disk as invalid.
3. When all multipath devices needed by a disk are up, mark the disk as valid and resume the vm using it if needed.

Additionally, we can drop the code checking that we can read from the metadata LV, since multipath is already doing a similar check on the underlying paths of all PVs used by a storage domain.
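
The disk-based monitoring proposed above could be sketched roughly as follows. This is only an illustration of the bookkeeping, not vdsm code: the DiskMonitor class, its method names, and the disk and device names are all hypothetical, and the disk-to-device mapping is assumed to have been built elsewhere from lvm info. The caller would invoke device_down() and device_up() from whatever multipath event source is used.

```python
from collections import defaultdict


class DiskMonitor:
    """Hypothetical sketch: track disk validity from multipath device state."""

    def __init__(self, disk_to_devices):
        # disk_to_devices: {disk_id: iterable of multipath device names}.
        # Assumed to come from lvm info (which PVs back each LV).
        self._needed = {d: set(devs) for d, devs in disk_to_devices.items()}
        # Reverse index: multipath device -> disks that depend on it.
        self._by_device = defaultdict(set)
        for disk, devs in self._needed.items():
            for dev in devs:
                self._by_device[dev].add(disk)
        # Devices currently down (all paths failed).
        self._down = set()

    def is_valid(self, disk):
        # A disk is valid only if none of the devices it needs is down.
        return not (self._needed[disk] & self._down)

    def device_down(self, device):
        """Record a failed device; return disks that just became invalid."""
        if device in self._down:
            return []
        affected = sorted(
            d for d in self._by_device.get(device, ()) if self.is_valid(d))
        self._down.add(device)
        return affected

    def device_up(self, device):
        """Record a recovered device; return disks that just became valid.

        The caller would resume paused vms using the returned disks.
        """
        if device not in self._down:
            return []
        self._down.discard(device)
        return sorted(
            d for d in self._by_device.get(device, ()) if self.is_valid(d))
```

With this layout, a disk whose LV spans two LUNs becomes valid again only after both devices are back, while disks on unaffected LUNs are never marked invalid, which avoids the coarse-grained behavior of domain-level state.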
This bug hasn't received any attention for a while because we didn't have the capacity to make progress on it. If you care deeply about it or want to work on it, please assign/target accordingly.
OK, closing. Please reopen if this is still relevant or you want to work on it.