Red Hat Bugzilla – Bug 172435
multipath fails in case of service processor failure on emc box
Last modified: 2010-01-11 21:22:32 EST
Description of problem:
We're using EMC CX500 and CX700 boxes on which we want to set up an Oracle RAC.
During tests we discovered the following strange behaviour of multipath. While
copying a big file from a local disk to a volume located on the EMC box, we
trespassed the LUN from service processor A to service processor B. This
resulted in an I/O error. We had to reboot the host to get access to the
filesystem on the EMC box again. The volume on the EMC box is mounted via
/dev/mapper. multipathd is running.
This behaviour does not happen if a cable is removed from the HBA. The only way
to reproduce this is to trespass a LUN from one SP to another.
Version-Release number of selected component (if applicable):
- kernel: 2.6.9-22.0.1.ELsmp
Copy a big enough file to a volume on an EMC box and trespass the LUN from one
SP to another during the copy: you get an I/O error and the mount point isn't
accessible anymore.
Steps to Reproduce:
1. mount a LUN via device mapper to e.g. /vol1
2. copy a big enough file from local disk to /vol1
3. trespass the LUN /vol1 resides on from one SP to the other on a CX500 or
CX700 EMC box
4. get an I/O error on /vol1
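For reference, the steps above could be scripted roughly as follows. This is a hardware-specific sketch, not something runnable outside a CLARiiON environment: the multipath device name, mount point, LUN number, and SP hostname are all placeholders, and it assumes EMC's Navisphere CLI (naviseccli) is installed on the host to drive the trespass.

```shell
# Reproduction sketch (placeholders throughout; adjust to the actual setup)
mount /dev/mapper/mpath0 /vol1                        # 1. mount the multipathed LUN
dd if=/dev/zero of=/vol1/bigfile bs=1M count=4096 &   # 2. start copying a big file
naviseccli -h sp_a trespass lun 5                     # 3. trespass the LUN to the other SP mid-copy
wait                                                  # 4. check whether dd fails with an I/O error
```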
Ed, any thoughts?
This works fine for me copying an entire 5GB block device with dd(1) using
upstream code (2.6.14-rc4 & multipath-tools in git head) while re-assigning
the block device's logical unit via my own utility. I've been testing this
use case in order to test a fix to multipathd(8) which reduces the number
of events that cause it to fail back to the highest priority path group.
This is needed to keep multipathd from failing back to the default group when a
block device is reassigned to a different path group (e.g., a CLARiiON trespass)
by software external to the current multipathing software (SAN management
software, another cluster node, or storage services software on the CLARiiON).
Possibly they are not running with the queue_if_no_path attribute, and the
combination of a trespass followed by a multipathd-induced failback is causing a
small time period where all paths are down. I'll think about it some more.
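For the record, queue_if_no_path can be enabled per device type in /etc/multipath.conf, so that I/O is queued rather than failed while all paths are momentarily down. A minimal sketch of such a device section for a CLARiiON array (the exact vendor/product strings and checker name should be verified against the installed multipath-tools defaults):

```
devices {
        device {
                vendor            "DGC"
                product           "*"
                features          "1 queue_if_no_path"
                hardware_handler  "1 emc"
                path_checker      emc_clariion
        }
}
```

With this in place, a short all-paths-down window during trespass plus failback should stall I/O instead of returning an I/O error to the filesystem.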
Is this still an issue?
Sorry for the delay in answering, but I was on holiday for a few days.
I think this is no longer an issue. The mentioned Oracle RACs are in production
and we have not seen this behaviour again.
In the meantime we had FLARE code updates on the EMC boxes, which result in a
trespass of the LUNs to the other service processor during the update. There were
no strange results.