Bug 172435 - multipath fails in case of service processor failure on emx box
Summary: multipath fails in case of service processor failure on emx box
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: device-mapper-multipath
Version: 4.0
Hardware: i386
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ben Marzinski
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2005-11-04 13:04 UTC by Thomas Krieger
Modified: 2010-01-12 02:22 UTC
CC: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-07-24 10:39:44 UTC
Target Upstream Version:
Embargoed:



Description Thomas Krieger 2005-11-04 13:04:23 UTC
Description of problem:
We're using EMC CX500 and CX700 boxes on which we want to set up an Oracle RAC.
During tests we discovered the following strange behaviour of multipath. While
copying a big file from a local disk to a volume located on the EMC box, we
trespassed the LUN from service processor A to service processor B. This
resulted in an I/O error. We had to reboot the host to get access to the
filesystem on the EMC box again. The volume on the EMC box is mounted via
/dev/mapper. multipathd is running.
This behaviour does not happen if a cable is removed from the HBA. The only
way to reproduce this is to trespass a LUN from one SP to another.
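
For reference, the path state around the trespass can be checked with the
standard tools. A minimal sketch, assuming a map named mpath0 (a placeholder,
not the actual name from our setup):

  multipath -l                      # show multipath topology and path states
  dmsetup info /dev/mapper/mpath0   # device-mapper state for the map
  dmesg | tail -n 50                # look for SCSI/dm I/O errors after the trespass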

Version-Release number of selected component (if applicable):
RHEL4 U2
- kernel: 2.6.9-22.0.1.ELsmp
- device-mapper-1.01.04-1.0.RHEL4
- device-mapper-multipath-0.4.5-6.0.RHEL4

How reproducible:
Copy a big enough file to a volume on an EMC box and trespass the LUN from
one SP to the other during the copy: you get an I/O error and the mount point
isn't accessible anymore.

Steps to Reproduce:
1. Mount a LUN via device mapper at e.g. /vol1
2. Copy a big enough file from a local disk to /vol1
3. Trespass the LUN /vol1 resides on from one SP to the other on a CX500 or
CX700 EMC box
4. Get an I/O error on /vol1 (a command-level sketch of these steps follows)
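
A hypothetical command-level version of these steps (device, mount point and
file names are placeholders, not taken from our configuration):

  mount /dev/mapper/mpath0 /vol1
  dd if=/dev/zero of=/vol1/bigfile bs=1M count=4096 &  # sustained ~4GB write
  # while the copy runs, trespass the LUN to the other SP from the array
  # side (e.g. via Navisphere), then check the result:
  wait                              # wait for dd; it fails if I/O errored
  dmesg | tail -n 20                # the I/O errors show up here
  ls /vol1                          # mount point is no longer accessible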
  
Actual results:


Expected results:


Additional info:

Comment 1 Alasdair Kergon 2006-03-08 20:35:41 UTC
Ed, any thoughts?

Comment 2 Ed Goggin 2006-03-08 20:58:22 UTC
This works fine for me copying an entire 5GB block device with dd(1) using
upstream code (2.6.14-rc4 & multipath-tools in git head) while re-assigning
the block device's logical unit via my own utility.  I've been testing this
use case in order to test a fix to multipathd(8) which will reduce the number
of events that cause it to fail back to the highest priority path group.
This is needed to keep multipathd from failing back to the default group when a
block device is reassigned to a different path group (e.g., CLARiiON trespass)
by software external to the current multipathing software (SAN management
software, another cluster node, or storage services software on the CLARiiON
itself).

Possibly they are not running with the queue_if_no_path attribute, and the
combination of a trespass followed by a multipathd-induced failback is causing
a small window where all paths are down.  I'll think about it some more.
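
For what it's worth, queue_if_no_path is set per device type in
/etc/multipath.conf. A minimal sketch for a CLARiiON-class array (illustrative
values, not a verified RHEL4 U2 default stanza):

  devices {
          device {
                  vendor               "DGC"    # EMC CLARiiON family
                  product              "*"
                  path_grouping_policy group_by_prio
                  prio_callout         "/sbin/mpath_prio_emc /dev/%n"
                  hardware_handler     "1 emc"  # drives the trespass on failover
                  features             "1 queue_if_no_path"  # queue I/O while no paths are up
                  failback             immediate
          }
  }

With queueing enabled, I/O issued during the trespass/failback window should
block rather than return an error, which matches the failure mode described
in the report.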

Comment 3 Ben Marzinski 2007-05-31 18:52:04 UTC
Is this still an issue?

Comment 4 Thomas Krieger 2007-06-05 06:46:41 UTC
Hi,

sorry for the delay in answering but I was on holidays for a few days.

I think it's no longer an issue. The mentioned Oracle RACs are in production
and we have not seen this behavior again.
In the meantime we had FLARE code updates on the EMC boxes, which resulted in
a trespass of the LUNs to the other service processor during the update. There
were no strange results.

Kind regards

Thomas

