Description of problem:
We're using EMC CX500 and CX700 boxes on which we want to set up an Oracle RAC. During tests we discovered the following strange behaviour of multipath. While copying a big file from a local disk to a volume located on the EMC box, we trespassed the LUN from service processor A to service processor B. This resulted in an I/O error. We have to reboot the host to get access to the filesystem on the EMC box again. The volume on the EMC box is mounted via /dev/mapper, and multipathd is running. This behaviour does not happen if a cable is removed from the HBA. The only way to reproduce it is to trespass a LUN from one SP to another.

Version-Release number of selected component (if applicable):
RHEL4 U2
- kernel: 2.6.9-22.0.1.ELsmp
- device-mapper-1.01.04-1.0.RHEL4
- device-mapper-multipath-0.4.5-6.0.RHEL4

How reproducible:
Copy a big enough file to a volume on an EMC box and trespass the LUN from one SP to another during the copy. You get an I/O error and the mount point isn't accessible anymore.

Steps to Reproduce:
1. Mount a LUN via device mapper to e.g. /vol1.
2. Copy a big enough file from local disk to /vol1.
3. Trespass the LUN that /vol1 resides on from one SP to the other on a CX500 or CX700 EMC box.
4. Get an I/O error on /vol1.

Actual results:

Expected results:

Additional info:
Ed, any thoughts?
This works fine for me when copying an entire 5 GB block device with dd(1) using upstream code (2.6.14-rc4 and multipath-tools at git head) while reassigning the block device's logical unit via my own utility. I've been exercising this use case in order to test a fix to multipathd(8) that reduces the number of events that cause it to fail back to the highest-priority path group. The fix is needed to keep multipathd from failing back to the default group when a block device is reassigned to a different path group (e.g., a CLARiiON trespass) by software external to the current multipathing software (SAN management software, another cluster node, or storage services software on the CLARiiON itself). Possibly they are not running with the queue_if_no_path attribute, and the combination of a trespass followed by a multipathd-induced failback is causing a short period during which all paths are down. I'll think about it some more.
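One way to check the queue_if_no_path guess is to look at the map's table as printed by `dmsetup table`: for a multipath target, the first argument after the word "multipath" is the number of feature arguments, followed by the features themselves. Below is a minimal Python sketch of that check. The helper names and the two sample table lines are made up for illustration and are not taken from the reporter's host.

```python
def multipath_features(table_line):
    """Return the feature arguments from one `dmsetup table` line
    describing a multipath target."""
    tokens = table_line.split()
    if "multipath" not in tokens:
        raise ValueError("not a multipath target line")
    start = tokens.index("multipath") + 1
    count = int(tokens[start])  # number of feature arguments that follow
    return tokens[start + 1 : start + 1 + count]


def queues_if_no_path(table_line):
    """True if the map queues I/O instead of failing it when no path is usable."""
    return "queue_if_no_path" in multipath_features(table_line)


# Hypothetical table lines, for illustration only.
SAMPLE_QUEUEING = (
    "mpath0: 0 10485760 multipath 1 queue_if_no_path 1 emc 2 1 "
    "round-robin 0 2 1 8:16 1000 8:32 1000 "
    "round-robin 0 2 1 8:48 1000 8:64 1000"
)
SAMPLE_FAILING = (
    "mpath0: 0 10485760 multipath 0 1 emc 2 1 "
    "round-robin 0 2 1 8:16 1000 8:32 1000 "
    "round-robin 0 2 1 8:48 1000 8:64 1000"
)

if __name__ == "__main__":
    for line in (SAMPLE_QUEUEING, SAMPLE_FAILING):
        print(queues_if_no_path(line))
```

If the reporter's map looks like the second sample (feature count 0), a brief window with all paths down would surface as an I/O error rather than a stall, which would fit the report.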
Is this still an issue?
Hi, sorry for the delay in answering, but I was on holiday for a few days. I think this is no longer an issue. The mentioned Oracle RACs are in production and we have not seen this behaviour again. In the meantime we had FLARE code updates on the EMC boxes, which result in a trespass of the LUNs to the other service processor during the update. There were no strange results.

Kind regards
Thomas