Bug 172435

Summary: multipath fails in case of service processor failure on EMC box
Product: Red Hat Enterprise Linux 4
Component: device-mapper-multipath
Version: 4.0
Hardware: i386
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: medium
Keywords: Reopened
Reporter: Thomas Krieger <thomas.krieger>
Assignee: Ben Marzinski <bmarzins>
CC: agk, bmarzins, christophe.varoqui, dmo, dwysocha, egoggin, lmb, mbroz, thomas.krieger, tranlan
Doc Type: Bug Fix
Last Closed: 2007-07-24 10:39:44 UTC

Description Thomas Krieger 2005-11-04 13:04:23 UTC
Description of problem:
We're using an EMC CX500 and a CX700 box on which we want to set up an Oracle
RAC. During tests we discovered the following strange behaviour of multipath.
While copying a big file from a local disk to a volume located on the EMC box,
we trespassed the LUN from service processor A to service processor B. This
resulted in an I/O error. We have to reboot the host to get access to the
filesystem on the EMC box again. The volume on the EMC box is mounted via
/dev/mapper. multipathd is running.
This behaviour does not happen if a cable is removed from the HBA. The only way
to reproduce it is to trespass a LUN from one SP to the other.

Version-Release number of selected component (if applicable):
RHEL4 U2
- kernel: 2.6.9-22.0.1.ELsmp
- device-mapper-1.01.04-1.0.RHEL4
- device-mapper-multipath-0.4.5-6.0.RHEL4

How reproducible:
copy a big enough file to a volume on an EMC box
trespass the LUN from one SP to the other during the copy
you get an I/O error and the mount point isn't accessible anymore

Steps to Reproduce:
1. mount a LUN via device mapper to e.g. /vol1
2. copy a big enough file from local disk to /vol1
3. trespass the LUN that /vol1 resides on from one SP to the other on a CX500
or CX700 EMC box
4. get an I/O error on /vol1
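
Step 2 can be made easier to observe with a small copy helper that forces each chunk out to the device and reports how far it got before an error. This is only a sketch: the function name copy_until_error and the chunk size are illustrative assumptions, not part of the original report, and the trespass itself (step 3) still has to be triggered separately on the array.

```python
import os


def copy_until_error(src, dst, chunk_size=1 << 20):
    """Copy src to dst chunk by chunk.

    Returns (bytes_written, error), where error is None on success or the
    OSError raised while opening, reading, writing, or syncing.
    """
    written = 0
    try:
        with open(src, "rb") as fin, open(dst, "wb") as fout:
            while True:
                chunk = fin.read(chunk_size)
                if not chunk:
                    break
                fout.write(chunk)
                # Push the data through the page cache so that a path
                # failure surfaces here instead of at some later point.
                fout.flush()
                os.fsync(fout.fileno())
                written += len(chunk)
    except OSError as exc:
        return written, exc
    return written, None
```

Calling it with a large local file as src and a file under /vol1 as dst while the LUN is trespassed would show the byte offset at which the I/O error is reported, if it occurs.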
  
Actual results:


Expected results:


Additional info:

Comment 1 Alasdair Kergon 2006-03-08 20:35:41 UTC
Ed, any thoughts?

Comment 2 Ed Goggin 2006-03-08 20:58:22 UTC
This works fine for me copying entire 5GB block device with dd(1) using 
upstream code (2.6.14-rc4 & multipath-tools in git head) while re-assigning
the block device's logical unit via my own utility.  I've been testing this
use case in order to test a fix to multipathd(8) which will reduce the number 
of events which will cause it to failback to the highest priority path group.  
This is needed to keep multipathd from failing back to the default group when a 
block device is reassigned to a different path group (e.g., CLARiiON trespass) 
by software external to the current multipathing software (SAN management 
software, another cluster node, or storage services software on the CLARiiON 
itself).

Possibly they are not running with the queue_if_no_path attribute and the 
combination of trespass followed by multipathd induced failback is causing a 
small time period where all paths are down.  I'll think about it some more.
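
Whether a map is running with queue_if_no_path can be read from its device-mapper table, where the feature arguments follow the "multipath" target name. The sketch below parses one such line as printed by dmsetup table for a single map; the helper names and both sample lines are hypothetical and are not taken from the reporter's system.

```python
def multipath_features(table_line):
    """Return the feature arguments of one multipath table line.

    Expected layout:
    <start> <length> multipath <num_feature_args> <feature args...> ...
    """
    fields = table_line.split()
    if len(fields) < 4 or fields[2] != "multipath":
        raise ValueError("not a multipath table line")
    count = int(fields[3])
    return fields[4:4 + count]


def queues_when_all_paths_fail(table_line):
    """True if the map queues I/O instead of failing it when no path is left."""
    return "queue_if_no_path" in multipath_features(table_line)


# Hypothetical sample lines for a two-path-group map using the emc handler.
QUEUEING = ("0 10485760 multipath 1 queue_if_no_path 1 emc 2 1 "
            "round-robin 0 1 1 8:16 1000 round-robin 0 1 1 8:32 1000")
FAILING = ("0 10485760 multipath 0 1 emc 2 1 "
           "round-robin 0 1 1 8:16 1000 round-robin 0 1 1 8:32 1000")

print(queues_when_all_paths_fail(QUEUEING))  # True
print(queues_when_all_paths_fail(FAILING))   # False
```

A map of the second kind returns I/O errors to the filesystem as soon as every path is marked failed, which would match the short all-paths-down window described above.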

Comment 3 Ben Marzinski 2007-05-31 18:52:04 UTC
Is this still an issue?

Comment 4 Thomas Krieger 2007-06-05 06:46:41 UTC
Hi,

sorry for the delay in answering but I was on holidays for a few days.

I think that's no longer an issue. The mentioned Oracle RACs are in production
and we have not seen this behaviour again.
In the meantime we had FLARE code updates on the EMC boxes, which result in a
trespass of the LUNs to the other service processor during the update. There
were no strange results.

Kind regards

Thomas