The iSCSI timeout set in /etc/iscsi/iscsid.conf on the node images is too short. A much higher iSCSI timeout is generally acceptable when root filesystems live on iSCSI (as they do for the oVirt guests). I'd suggest setting:

node.session.timeo.replacement_timeout = 86400

This would allow the node to recover gracefully even after the iSCSI target has died or rebooted. It doesn't protect the integrity of the guests, but it would allow them to be paused while the target is unavailable.
Isn't 86400 seconds (24 hours) going to be way too long? 600 seconds should be plenty for a reboot, but 20-30 minutes would give enough time for an alert and a reboot of a failed iSCSI target.

# To specify the length of time to wait for session re-establishment
# before failing SCSI commands back to the application when running
# the Linux SCSI Layer error handler, edit the line.
# The value is in seconds and the default is 120 seconds.
node.session.timeo.replacement_timeout = 120
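
If it helps to script the change, here is a minimal Python sketch of rewriting that one line in the text of iscsid.conf. It is only an illustration: the function name set_replacement_timeout is made up for this example, the 1800 is just the 30-minute figure mentioned above, and the sketch only edits a string, so reading and writing the real file (and restarting anything) is left out.

```python
import re

# Setting name as it appears in /etc/iscsi/iscsid.conf.
KEY = "node.session.timeo.replacement_timeout"


def set_replacement_timeout(conf_text: str, seconds: int) -> str:
    """Return conf_text with the replacement_timeout line set to `seconds`.

    Hypothetical helper for illustration only. Commented-out lines are left
    untouched; if no active line exists, one is appended at the end.
    """
    if seconds < 0:
        raise ValueError("timeout must be non-negative")

    # Match only an active (uncommented) assignment on a single line.
    pattern = re.compile(
        r"^[ \t]*" + re.escape(KEY) + r"[ \t]*=.*$", re.MULTILINE
    )
    new_line = f"{KEY} = {seconds}"

    if pattern.search(conf_text):
        return pattern.sub(new_line, conf_text)

    # No active line found: append one, keeping a trailing newline.
    sep = "" if (not conf_text or conf_text.endswith("\n")) else "\n"
    return conf_text + sep + new_line + "\n"


if __name__ == "__main__":
    sample = (
        "# The value is in seconds and the default is 120 seconds.\n"
        "node.session.timeo.replacement_timeout = 120\n"
    )
    # 30 minutes, the upper end of the range suggested in the reply.
    print(set_replacement_timeout(sample, 30 * 60), end="")
```

Restricting the match to uncommented lines keeps the explanatory comment block in the file intact, so only the active value changes.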