Bug 195685

Summary: RFE: Add dm-hp-sw to kernel to allow use of active/passive sans with dm multipathing
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.0
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Josef Bacik <jbacik>
Assignee: Dave Wysochanski <dwysocha>
CC: agk, bmarzins, christophe.varoqui, coughlan, cww, ddomingo, dwysocha, edamato, egoggin, hgarcia, jfenal, jlayton, j-nomura, jplans, k-ueda, lmb, mbroz, mceci, mchristi, nstrug, rlerch, slevine, steve.reilly, tao, tranlan
Keywords: Documentation, FutureFeature
Target Milestone: ---
Target Release: ---
Fixed In Version: RHSA-2008-0665
Doc Type: Enhancement
Last Closed: 2008-07-24 19:11:22 UTC
Bug Blocks: 208261, 214809, 226791, 246627, 391231, 438037, 458752
Attachments (description, then flags):
  module to add. (flags: none)
  dm-hp-sw patch that applies to 2.6.9-39 (flags: none)
  patch for the config files. (flags: none)
  dm-hp-sw patch that applies to 2.6.9-39 with appropriate kconfig changes. (flags: none)
  patch to fix panic, error path (flags: none)
  dm-hp-sw patch that applies to 2.6.9-42 (rhel4 u4) with appropriate kconfig changes (flags: none)
  v0.91 dm-hp-sw patch that applies to 2.6.9-42 (rhel4 u4) with appropriate kconfig changes (flags: none)
  Patch to add retry flag in dm-mpath.c (flags: none)
  v0.961 dm-hp-sw patch that applies to 2.6.9-42 (rhel4 u4) with appropriate kconfig changes (flags: none)
  Latest upstream patch against 2.6.20 (flags: none)
  Latest upstream patch against 2.6.20 (flags: none)
  Simpler patch against 2.6.22-rc1 (does not have retries or anything) (flags: none)
  Initial backport of upstream 3 patches (flags: none)
  Updated rhel4.7 patch - currently under test and looking promising (flags: none)

Description Josef Bacik 2006-06-16 16:02:03 UTC
Description of problem:
The HP/StorageWorks active/passive SANs do not work with dm multipathing because
of the nature of their operation.  The module I'm attaching is a backport of
the upstream module, which has been used and tested upstream.  I have yet to
get a customer to test this, but it should work.  I will continue trying to find
a customer willing to test and confirm whether this module works.

Comment 1 Josef Bacik 2006-06-16 16:02:04 UTC
Created attachment 131053 [details]
module to add.

Comment 2 Josef Bacik 2006-06-16 16:04:24 UTC
adding mchristi to the cc list as he's the original author.

Comment 3 Josef Bacik 2006-06-16 19:05:55 UTC
Created attachment 131073 [details]
dm-hp-sw patch that applies to 2.6.9-39

Comment 4 Josef Bacik 2006-06-16 19:07:02 UTC
Created attachment 131074 [details]
patch for the config files.

Comment 5 Josef Bacik 2006-06-28 15:13:18 UTC
putting on the RHEL4.5 proposed list.  A customer has confirmed that this does work.

Comment 6 Josef Bacik 2006-07-13 17:21:23 UTC
Created attachment 132388 [details]
dm-hp-sw patch that applies to 2.6.9-39 with appropriate kconfig changes.

Comment 42 Mike Christie 2006-10-25 21:20:49 UTC
(In reply to comment #40)
> (In reply to comment #35)
> > Note also that the start cmd takes ~3.5s on my setup.
> 
> Upstream, I retry the command 5 times (it is just a dumb hardcode). Does this
> work? If you send IO before the 3.5 secs to the path that is becoming active
> what is returned? Do READs/WRITEs get NOT_READY? I think one of us should look
> at the qlogic fo driver again to confirm what it did. I thought it only retried
> the START_STOP command a couple times if it got NOT_READY, but I do not remember
> the code. It may have returned success on NOT_READY and then internally handled
> if IO got sense that indicated that the device was still becoming ready.
> 

Oh yeah we could also just ask Andrew if there was an upper bound on how long it
takes to complete a failover and add a timer :)
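
For illustration, the two ideas discussed here (a fixed retry count and an upper bound on failover time) can be combined in one loop. This is only a userspace sketch, not the handler code: `activate_path`, `send_start`, the `GOOD`/`NOT_READY` strings and the 10 second deadline are all hypothetical stand-ins, and the real upper bound would have to come from the array vendor.

```python
import time

# Stand-ins for command results; these are not real kernel constants.
GOOD = "GOOD"
NOT_READY = "NOT_READY"


def activate_path(send_start, max_retries=5, deadline_s=10.0, delay_s=1.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Send the start command until it succeeds, retries run out,
    or the overall deadline passes.

    send_start is a hypothetical callable returning GOOD or NOT_READY.
    max_retries=5 means one initial attempt plus up to five retries.
    Returns (ok, attempts).
    """
    started = clock()
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        if send_start() == GOOD:
            return True, attempts
        if clock() - started >= deadline_s:
            break  # the assumed failover upper bound has been exceeded
        sleep(delay_s)
    return False, attempts


# Example: a path that reports NOT_READY twice before becoming active.
replies = iter([NOT_READY, NOT_READY, GOOD])
result = activate_path(lambda: next(replies), delay_s=0)
```

The deadline guards against a slow array consuming every retry, while the retry count guards against a broken clock or a zero delay spinning forever.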

Comment 53 Dave Wysochanski 2006-11-01 23:51:16 UTC
Created attachment 140056 [details]
patch to fix panic, error path

Here's the one patch I'm using on top of the dm-hp-sw.patch for unit testing
rhel4 u4 code.

Comment 58 Dave Wysochanski 2006-11-08 22:01:35 UTC
Created attachment 140721 [details]
dm-hp-sw patch that applies to 2.6.9-42 (rhel4 u4) with appropriate kconfig changes

Patch which adds dm-hp-sw module - currently under unit testing.

Comment 59 Dave Wysochanski 2006-11-09 03:54:45 UTC
Created attachment 140741 [details]
v0.91 dm-hp-sw patch that applies to 2.6.9-42 (rhel4 u4) with appropriate kconfig changes

Fix dumb error with cmd_timeout units

Comment 60 Dave Wysochanski 2006-11-21 20:53:25 UTC
Just an update.  I am testing some error recovery paths with retries that I
added to the code and trying to provoke various check conditions from the A/P
MSA1000.  We're also trying to obtain documentation on check conditions and/or
collect them from existing kernel code snippets.

Also, the boot is not pretty, though I'm not sure any of the problems are show
stoppers.  There are basically 3 boot issues I'm seeing:
1) Lots of I/O errors on standby paths because LVM or something else is scanning them
2) Some thrashing with a lot of paths issuing start/stop (probably because the
failover is controller based and active/passive paths get seen by udev/multipath
in a nondeterministic order).
3) Sometimes not all maps get populated with all paths (might be bz 205781
though), so you have to re-run multipath after boot (saw this with 14 devices -
28 paths - so it's not an unreasonable configuration).


Comment 61 Dave Wysochanski 2006-11-27 19:09:51 UTC
Will attach my latest code, which adds a retry flag to dm-mpath.c.  The flag is
passed to dm_pg_init_complete() and allows dm-mpath to retry the pg_init.  Work
is still in progress, but basic retries seem to be ok.





Comment 62 Dave Wysochanski 2006-11-27 19:11:39 UTC
Created attachment 142204 [details]
Patch to add retry flag in dm-mpath.c

Comment 63 Dave Wysochanski 2006-11-27 19:14:23 UTC
Created attachment 142207 [details]
v0.961 dm-hp-sw patch that applies to 2.6.9-42 (rhel4 u4) with appropriate kconfig changes

Latest dm-hp-sw code that uses dm-mpath.c retries via MP_RETRY_PG_INIT flag.

Comment 65 Dave Wysochanski 2007-02-05 21:07:24 UTC
*** Bug 175197 has been marked as a duplicate of this bug. ***

Comment 66 Dave Wysochanski 2007-02-16 00:00:28 UTC
Created attachment 148162 [details]
Latest upstream patch against 2.6.20

This patch is on top of the retry flag patch and applies cleanly to 2.6.20.

Fixes the problem of multiple pg_inits in progress at the same time by using a
simple list keyed on the FC node_name (unique per MSA1000).  Gets closer to the
ideal of controller-based failover without more extensive surgery to dm-mp.
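
As a rough illustration of the idea (not the patch itself), the bookkeeping amounts to allowing one pg_init in flight per array and parking later requests for the same node_name.  The class and method names below are hypothetical, and a dictionary stands in for the kernel list:

```python
import threading


class PgInitTracker:
    """Allow at most one pg_init in flight per array, keyed by FC node_name."""

    def __init__(self):
        self._lock = threading.Lock()
        # node_name -> paths that asked for an init while one was running
        self._in_progress = {}

    def request(self, node_name, path):
        """Return True if the caller should send the failover command now."""
        with self._lock:
            if node_name in self._in_progress:
                self._in_progress[node_name].append(path)
                return False
            self._in_progress[node_name] = []
            return True

    def complete(self, node_name):
        """Mark the init as finished and return the paths that waited on it."""
        with self._lock:
            return self._in_progress.pop(node_name, [])


tracker = PgInitTracker()
first = tracker.request("node-a", "sda")    # first request starts the init
second = tracker.request("node-a", "sdb")   # same array, so this one waits
waiters = tracker.complete("node-a")
```

Keying on the node_name rather than on the path is what makes the behaviour controller based: every path to the same MSA1000 shares one entry.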

Still todo:
1) I/O errors on passive paths (would like to propose something even though it
may get rejected by maintainers)
2) retry logic and check conditions (make final call on what to do here - is it
worth it to do retries?)
3) boot issues (some paths don't get added to multipath maps on bootup - might
be a driver / hotplug / udev issue)
4) misc code cleanup (comments, debug code / printk's)

Comment 67 Dave Wysochanski 2007-02-16 16:15:37 UTC
Created attachment 148204 [details]
Latest upstream patch against 2.6.20

A few fixes w/locking, etc.

Comment 68 Issue Tracker 2007-02-23 16:43:06 UTC
Hello,

Is it possible to get the backported patch for the current RHEL4 kernel, to
create an updated test package/hotfix?

This event sent from IssueTracker by adreyer 
 issue 109951

Comment 69 Dave Wysochanski 2007-02-23 18:01:22 UTC
Fix is not quite complete and not upstream.

Do you view issue #1 as important/essential?  
1) I/O errors on passive paths

I was viewing this as an essential component until I heard otherwise (it is part
of equivalent functionality with existing mp solutions, so its absence would be a
regression).  If it is not handled, you will see a lot of I/O errors from various
tools and in /var/log/messages, which may mask or even cause other real issues (at
the very least they will cause undue alarm and look scary).

Comment 72 RHEL Product and Program Management 2007-03-10 01:02:00 UTC
This bugzilla had previously been approved for engineering
consideration but Red Hat Product Management is currently reevaluating
this issue for inclusion in RHEL4.6.

Comment 77 Dave Wysochanski 2007-05-22 22:45:32 UTC
Created attachment 155210 [details]
Simpler patch against 2.6.22-rc1 (does not have retries or anything)

Only brief testing

Comment 79 Dave Wysochanski 2007-05-23 20:44:12 UTC
Baseline patch (no retries, check conditions, etc) against 2.6.22-rc1 submitted
to dm-devel.

Comment 80 Dave Wysochanski 2007-05-31 17:26:59 UTC
Patch set submitted to dm-devel against 2.6.22-rc1.  Mostly very basic support
with some retries and handling of check conditions.  No handling of I/O errors
(future work).
https://www.redhat.com/archives/dm-devel/2007-May/msg00105.html

Comment 81 Michael Hideo 2007-06-06 04:42:47 UTC
Adding cc ecs-dev-list@redhat.com for tracking

Comment 86 Dave Wysochanski 2007-07-26 04:53:20 UTC
Latest patches against 2.6.23-rc1 posted to dm-devel:
https://www.redhat.com/archives/dm-devel/2007-July/msg00187.html

Code has been decently tested with cable pulls during I/O runs and no major
issues seen.

Comment 87 Issue Tracker 2007-08-01 11:17:49 UTC
Any chance to patch our RHEL4 kernel?


Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by racedo 
 issue 109951

Comment 88 Dave Wysochanski 2007-08-02 13:26:10 UTC
Still waiting for upstream acceptance.

Comment 96 Michael Hideo 2007-10-23 02:44:08 UTC
Removing automation notification

Comment 97 Dave Wysochanski 2007-10-23 14:42:23 UTC
Three patches which implement the hp-sw handler are now in Linus's kernel:
1) generic retry support: http://tinyurl.com/yw6q2e
2) basic hp-sw support: http://tinyurl.com/22tw4c
3) add retries to hp: http://tinyurl.com/yt7abn

Comment 102 Nick Strugnell 2007-12-14 11:30:48 UTC
Just come back from another client engagement with the 7.0 firmware upgrade and
can confirm that this _does_ work with group_by_prio and mpath_prio_alua. So,
dm-hp-sw will only be needed for older arrays which cannot be upgraded to this
firmware revision.



Comment 106 Dave Wysochanski 2008-01-21 06:26:23 UTC
Created attachment 292333 [details]
Initial backport of upstream 3 patches

Initial patch against 2.6.9-68.7.  Only compile tested.  I did not run this
code but looked at previous rhel4u5 patch and upstream patch and took my best
guess.  Will do some tests early this week.

Comment 107 Dave Wysochanski 2008-01-22 16:20:59 UTC
Patch in #106 has at least one critical error (reversed logic in completion
handler) that makes it non-functional.  Working on an updated patch.

Comment 108 Dave Wysochanski 2008-01-22 19:46:29 UTC
Created attachment 292553 [details]
Updated rhel4.7 patch - currently under test and looking promising

Fixes various bugs in initial backport, testing going ok so far.

Interfaces used for failover:
1) to_scsi_device: gets the scsi_device pointer (needed for the following APIs)
2) scsi_allocate_request: allocates a request for the failover (START_STOP) command
3) scsi_do_req: sends the failover command
4) scsi_release_request: releases the scsi request used for the failover command

If you look at the history of this bug, you'll see I arrived at these
interfaces  because of the differences between the hp and emc hw handlers.  The
EMC handler is more complicated since it sends a MODE_SELECT.  It must allocate
a page, a bio, and a request.  Since the HP handler is only sending a
START_STOP command, I tried using a request directly, but then needed a bio for
the completion callback.  I then got a panic because apparently you need a page
attached to the bio.
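
For readers unfamiliar with the command itself: START STOP UNIT is a 6-byte SCSI CDB with no data phase, which is why it needs less setup than a MODE_SELECT.  The sketch below only builds the CDB bytes in userspace.  The function name is hypothetical, and whether the handler sets IMMED is not stated in this bug, so that parameter is an assumption left off by default.

```python
START_STOP_UNIT = 0x1B  # SCSI operation code for START STOP UNIT


def build_start_stop_cdb(start=True, immed=False):
    """Return the 6-byte START STOP UNIT CDB.

    start: set the START bit (byte 4, bit 0) to make the unit ready.
    immed: set the IMMED bit (byte 1, bit 0) so status is returned
           before the operation completes (assumption: off by default).
    """
    cdb = bytearray(6)
    cdb[0] = START_STOP_UNIT
    if immed:
        cdb[1] |= 0x01
    if start:
        cdb[4] |= 0x01
    return bytes(cdb)


cdb = build_start_stop_cdb()
```

Because the command carries no payload, there is nothing to map into a page, in contrast to the EMC handler's MODE_SELECT.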

Comment 109 Dave Wysochanski 2008-01-22 20:16:45 UTC
Note that to utilize the previous patch, something like the following should be
placed in /etc/multipath.conf:

devices
{
        device {
                vendor                  "COMPAQ  "
                product                 "MSA1000 VOLUME  "
                path_grouping_policy    failover
                hardware_handler        "1 hp-sw"
                path_selector           "round-robin 0"
                path_checker            hp_sw
                features                "2 pg_init_retries 7"
                no_path_retry           60
                failback                manual
        }
}
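
A note on the quoted values: the hardware_handler and features strings start with a count of the words that follow, so "2 pg_init_retries 7" is two arguments and "1 hp-sw" is one.  A small check like the following can catch a miscounted string before it reaches multipath.  The helper name is hypothetical and this is only a sketch of the count-prefix convention, not a parser for multipath.conf.

```python
def parse_counted(value):
    """Split a count-prefixed string such as "2 pg_init_retries 7".

    Returns the argument words, or raises ValueError when the leading
    count does not match the number of words that follow it.
    """
    words = value.split()
    if not words:
        raise ValueError("empty value")
    count = int(words[0])  # raises ValueError if the prefix is not a number
    args = words[1:]
    if count != len(args):
        raise ValueError(
            "count %d does not match %d argument(s)" % (count, len(args)))
    return args


features = parse_counted("2 pg_init_retries 7")
handler = parse_counted("1 hp-sw")
```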


Comment 110 Dave Wysochanski 2008-01-23 22:28:50 UTC
Series posted to rhkernel:
http://post-office.corp.redhat.com/archives/rhkernel-list/2008-January/msg01170.html


Comment 120 Vivek Goyal 2008-03-18 16:24:28 UTC
Committed in 68.23. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 124 Don Domingo 2008-05-21 03:57:54 UTC
this bug has been tagged for inclusion in the RHEL4.7 release notes. please post
the necessary content for it. thanks!

Comment 125 Dave Wysochanski 2008-05-30 18:43:11 UTC
The main thing that needs to be added is a summary of comment #29, and a note
that an updated userspace device-mapper-multipath package (included in rhel4.7)
is required to utilize the kernel module.  Here's a first attempt.

An updated device-mapper-multipath package is required for utilization of the
hp_sw kernel module.

In addition, the HP array must be configured properly for active/passive mode
and recognition of connections from a Linux machine.  The following is an
example of configuration of an HP MSA1000 array with two connections.

CLI> show version
     Firmware version:         4.48 build 342
     Hardware Revision:        7 [AutoRev: 0x010000]
     Internal EMU Rev:         1.86 (9J33JN71778P)

CLI> show connections

Connection Name: <Unknown>
   Host WWNN = 200100E0-8B3C0A65
   Host WWPN = 210100E0-8B3C0A65
   Profile Name = Default
   Unit Offset = 0
   Controller 2 Port 1 Status = Online

Connection Name: <Unknown>
   Host WWNN = 200000E0-8B1C0A65
   Host WWPN = 210000E0-8B1C0A65
   Profile Name = Default
   Unit Offset = 0
   Controller 1 Port 1 Status = Online

CLI> add connection foo-p2 WWPN=210000E0-8B1C0A65 profile=Linux OFFSET=0
Connection has been added successfully.
Profile Linux is set for the new connection.

CLI> add connection foo-p1 WWPN=210100E0-8B3C0A65 profile=Linux OFFSET=0
Connection has been added successfully.
Profile Linux is set for the new connection.

CLI> show connections

Connection Name: foo-p2
   Host WWNN = 200000E0-8B1C0A65
   Host WWPN = 210000E0-8B1C0A65
   Profile Name = Linux
   Unit Offset = 0
   Controller 1 Port 1 Status = Online

Connection Name: foo-p1
   Host WWNN = 200100E0-8B3C0A65
   Host WWPN = 210100E0-8B3C0A65
   Profile Name = Linux
   Unit Offset = 0
   Controller 2 Port 1 Status = Online
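
With more than a couple of host ports, the add connection lines can be generated rather than typed.  The sketch below only formats the command strings seen in the example above; the helper name and the WWPN pattern (8 hex digits, a dash, 8 hex digits, inferred from the sample output) are assumptions, and nothing here talks to the array.

```python
import re

# Pattern inferred from the sample "show connections" output (assumption).
WWPN_RE = re.compile(r"^[0-9A-Fa-f]{8}-[0-9A-Fa-f]{8}$")


def add_connection_cmd(name, wwpn, profile="Linux", offset=0):
    """Format one MSA1000 CLI 'add connection' command line."""
    if not WWPN_RE.match(wwpn):
        raise ValueError("unexpected WWPN format: %r" % wwpn)
    return f"add connection {name} WWPN={wwpn} profile={profile} OFFSET={offset}"


# Connection names are arbitrary; these match the example above.
hosts = {
    "foo-p2": "210000E0-8B1C0A65",
    "foo-p1": "210100E0-8B3C0A65",
}
commands = [add_connection_cmd(name, wwpn) for name, wwpn in hosts.items()]
```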


Comment 126 Don Domingo 2008-06-01 23:00:18 UTC
thanks Dave. adding to "Known Issues" of RHEL4.7 release notes:

<quote>
If you need to use the hp_sw kernel module, install the updated
device-mapper-multipath package.

You also need to properly configure the HP array to correctly use active/passive
mode and recognize connections from a Linux machine. To do this, perform the
following steps:

   1. Determine what the world wide port name (WWPN) of each connection is using
show connections. Below is a sample output of show connections on an HP MSA1000
array with two connections:

      Connection Name: <Unknown>
         Host WWNN = 200100E0-8B3C0A65
         Host WWPN = 210100E0-8B3C0A65
         Profile Name = Default
         Unit Offset = 0
         Controller 2 Port 1 Status = Online

      Connection Name: <Unknown>
         Host WWNN = 200000E0-8B1C0A65
         Host WWPN = 210000E0-8B1C0A65
         Profile Name = Default
         Unit Offset = 0
         Controller 1 Port 1 Status = Online

   2. Configure each connection properly using the following command:

      add connection [connection name] WWPN=[WWPN ID] profile=Linux OFFSET=[unit offset]

Note that [connection name] can be set arbitrarily.

Using the given example, the proper commands should be:

      add connection foo-p2 WWPN=210000E0-8B1C0A65 profile=Linux OFFSET=0

      add connection foo-p1 WWPN=210100E0-8B3C0A65 profile=Linux OFFSET=0

   3. Run show connections again to verify that each connection is properly
configured. In our example, the correct configuration should be:

      Connection Name: foo-p2
         Host WWNN = 200000E0-8B1C0A65
         Host WWPN = 210000E0-8B1C0A65
         Profile Name = Linux
         Unit Offset = 0
         Controller 1 Port 1 Status = Online

      Connection Name: foo-p1
         Host WWNN = 200100E0-8B3C0A65
         Host WWPN = 210100E0-8B3C0A65
         Profile Name = Linux
         Unit Offset = 0
         Controller 2 Port 1 Status = Online
</quote>

please advise if any further revisions are required. also, will a kbase article
be needed for this?

thanks!

Comment 127 Don Domingo 2008-06-02 23:14:40 UTC
Hi,

the RHEL4.7 release notes deadline is on June 17, 2008 (Tuesday). they will
undergo a final proofread before being dropped to translation, at which point no
further additions or revisions will be entertained.

a mockup of the RHEL4.7 release notes can be viewed here:
http://intranet.corp.redhat.com/ic/intranet/RHEL4u7relnotesmockup.html

please use the aforementioned link to verify if your bugzilla is already in the
release notes (if it needs to be). each item in the release notes contains a
link to its original bug; as such, you can search through the release notes by
bug number.

Cheers,
Don

Comment 130 errata-xmlrpc 2008-07-24 19:11:22 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0665.html