Bug 195685 - RFE: Add dm-hp-sw to kernel to allow use of active/passive sans with dm multipathing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Dave Wysochanski
QA Contact:
URL:
Whiteboard:
Duplicates: 175197
Depends On:
Blocks: 208261 214809 226791 246627 RHEL4u7_relnotes 438037 RHEL4u8_relnotes
 
Reported: 2006-06-16 16:02 UTC by Josef Bacik
Modified: 2018-10-19 22:51 UTC (History)
CC List: 25 users

Fixed In Version: RHSA-2008-0665
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-07-24 19:11:22 UTC
Target Upstream Version:
Embargoed:


Attachments
module to add. (3.69 KB, text/plain)
2006-06-16 16:02 UTC, Josef Bacik
dm-hp-sw patch that applies to 2.6.9-39 (4.54 KB, patch)
2006-06-16 19:05 UTC, Josef Bacik
patch for the config files. (5.67 KB, patch)
2006-06-16 19:07 UTC, Josef Bacik
dm-hp-sw patch that applies to 2.6.9-39 with appropriate kconfig changes. (5.02 KB, patch)
2006-07-13 17:21 UTC, Josef Bacik
patch to fix panic, error path (649 bytes, patch)
2006-11-01 23:51 UTC, Dave Wysochanski
dm-hp-sw patch that applies to 2.6.9-42 (rhel4 u4) with appropriate kconfig changes (8.01 KB, patch)
2006-11-08 22:01 UTC, Dave Wysochanski
v0.91 dm-hp-sw patch that applies to 2.6.9-42 (rhel4 u4) with appropriate kconfig changes (8.02 KB, patch)
2006-11-09 03:54 UTC, Dave Wysochanski
Patch to add retry flag in dm-mpath.c (1000 bytes, patch)
2006-11-27 19:11 UTC, Dave Wysochanski
v0.961 dm-hp-sw patch that applies to 2.6.9-42 (rhel4 u4) with appropriate kconfig changes (10.00 KB, patch)
2006-11-27 19:14 UTC, Dave Wysochanski
Latest upstream patch against 2.6.20 (15.85 KB, patch)
2007-02-16 00:00 UTC, Dave Wysochanski
Latest upstream patch against 2.6.20 (16.22 KB, text/x-patch)
2007-02-16 16:15 UTC, Dave Wysochanski
Simpler patch against 2.6.22-rc1 (does not have retries or anything) (7.60 KB, patch)
2007-05-22 22:45 UTC, Dave Wysochanski
Initial backport of upstream 3 patches (13.04 KB, patch)
2008-01-21 06:26 UTC, Dave Wysochanski
Updated rhel4.7 patch - currently under test and looking promising (12.43 KB, text/x-patch)
2008-01-22 19:46 UTC, Dave Wysochanski


Links
System: Red Hat Product Errata
ID: RHSA-2008:0665
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Updated kernel packages for Red Hat Enterprise Linux 4.7
Last Updated: 2008-07-24 16:41:06 UTC

Description Josef Bacik 2006-06-16 16:02:03 UTC
Description of problem:
The HP/StorageWorks active/passive SANs do not work with dm multipathing because
of the nature of their operation.  The module I'm attaching is a backport of
the upstream module, which has been tested upstream.  I have yet to
get a customer to test this, but it should work.  I will keep trying to find
a customer willing to test and confirm that this module works.

Comment 1 Josef Bacik 2006-06-16 16:02:04 UTC
Created attachment 131053 [details]
module to add.

Comment 2 Josef Bacik 2006-06-16 16:04:24 UTC
adding mchristi to the cc list as he's the original author.

Comment 3 Josef Bacik 2006-06-16 19:05:55 UTC
Created attachment 131073 [details]
dm-hp-sw patch that applies to 2.6.9-39

Comment 4 Josef Bacik 2006-06-16 19:07:02 UTC
Created attachment 131074 [details]
patch for the config files.

Comment 5 Josef Bacik 2006-06-28 15:13:18 UTC
putting on the RHEL4.5 proposed list.  A customer has confirmed that this does work.

Comment 6 Josef Bacik 2006-07-13 17:21:23 UTC
Created attachment 132388 [details]
dm-hp-sw patch that applies to 2.6.9-39 with appropriate kconfig changes.

Comment 42 Mike Christie 2006-10-25 21:20:49 UTC
(In reply to comment #40)
> (In reply to comment #35)
> > Note also that the start cmd takes ~3.5s on my setup.
> 
> Upstream, I retry the command 5 times (it is just a dumb hardcode). Does this
> work? If you send IO before the 3.5 secs to the path that is becoming active
> what is returned? Do READs/WRITEs get NOT_READY? I think one of us should look
> at the qlogic fo driver again to confirm what it did. I thought it only retried
> the START_STOP command a couple times if it got NOT_READY, but I do not remember
> the code. It may have returned success on NOT_READY and then internally handled
> if IO got sense that indicated that the device was still becoming ready.
> 

Oh yeah we could also just ask Andrew if there was an upper bound on how long it
takes to complete a failover and add a timer :)
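
The two ideas above (a hardcoded retry count and a timer with an upper bound) can be combined. The following is only a userspace sketch, not the handler code: activate_path, the fake device, and the GOOD/NOT_READY strings are hypothetical names made up for illustration, and max_retries=5 means one initial send plus up to five retries.

```python
import time

# Hypothetical status values standing in for SCSI command results.
GOOD = "GOOD"
NOT_READY = "NOT_READY"


def activate_path(send_start, max_retries=5, deadline_s=10.0, delay_s=1.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Send the start command until it succeeds, the retry budget is
    used up, or the overall deadline has passed.

    Returns (ok, attempts)."""
    start = clock()
    attempts = 0
    while attempts <= max_retries:      # 1 initial send + max_retries retries
        attempts += 1
        if send_start() == GOOD:
            return True, attempts
        if clock() - start >= deadline_s:
            break                       # upper bound on total failover time
        sleep(delay_s)
    return False, attempts


def make_fake_device(not_ready_count):
    """Fake device that reports NOT_READY a fixed number of times."""
    remaining = [not_ready_count]

    def send_start():
        if remaining[0] > 0:
            remaining[0] -= 1
            return NOT_READY
        return GOOD

    return send_start


if __name__ == "__main__":
    print(activate_path(make_fake_device(2), delay_s=0))
```

Whether a count, a deadline, or both is the right bound depends on how long the array really takes to fail over, which is the open question in this comment.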

Comment 53 Dave Wysochanski 2006-11-01 23:51:16 UTC
Created attachment 140056 [details]
patch to fix panic, error path

Here's the one patch I'm using on top of the dm-hp-sw.patch for unit testing
rhel4 u4 code.

Comment 58 Dave Wysochanski 2006-11-08 22:01:35 UTC
Created attachment 140721 [details]
dm-hp-sw patch that applies to 2.6.9-42 (rhel4 u4) with appropriate kconfig changes

Patch which adds dm-hp-sw module - currently under unit testing.

Comment 59 Dave Wysochanski 2006-11-09 03:54:45 UTC
Created attachment 140741 [details]
v0.91 dm-hp-sw patch that applies to 2.6.9-42 (rhel4 u4) with appropriate kconfig changes

Fix dumb error with cmd_timeout units

Comment 60 Dave Wysochanski 2006-11-21 20:53:25 UTC
Just an update.  I am testing some error recovery paths with retries that I
added to the code and trying to invoke various check conditions from the A/P
MSA1000.  We're also trying to obtain documentation on check conditions and/or
get them from existing kernel code snippets.

Also the boot is not pretty, though I'm not sure any of these issues are show
stoppers.  There are basically 3 boot issues I'm seeing:
1) Lots of I/O errors on standby paths b/c of LVM or something else scanning
2) Some thrashing with a lot of paths issuing start/stop (probably because the
failover is controller based and active/passive paths get seen by udev/multipath
in a non-deterministic fashion).
3) Sometimes maps don't get populated with all paths (might be bz 205781
though) so you have to re-run multipath after boot (saw this with 14 devices -
28 paths - so it's not an unreasonable configuration).


Comment 61 Dave Wysochanski 2006-11-27 19:09:51 UTC
Will attach my latest code, which adds a retry flag to dm-mpath.c that is
passed to dm_pg_init_complete() and allows dm-mpath to retry the pg_init.  Work
is still in progress, but basic retries seem to be ok.





Comment 62 Dave Wysochanski 2006-11-27 19:11:39 UTC
Created attachment 142204 [details]
Patch to add retry flag in dm-mpath.c

Comment 63 Dave Wysochanski 2006-11-27 19:14:23 UTC
Created attachment 142207 [details]
v0.961 dm-hp-sw patch that applies to 2.6.9-42 (rhel4 u4) with appropriate kconfig changes

Latest dm-hp-sw code that uses dm-mpath.c retries via MP_RETRY_PG_INIT flag.
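
A rough model of the retry-flag idea, for readers who have not opened the patches. This is a sketch under assumptions: only the name MP_RETRY_PG_INIT comes from this bug, while its numeric value, MP_OTHER_ERROR, and run_pg_init are hypothetical. The real completion is asynchronous (the handler calls dm_pg_init_complete()); here it is collapsed into a return value.

```python
# Flag name taken from the comment above; the value is hypothetical.
MP_RETRY_PG_INIT = 0x1
# Hypothetical stand-in for any non-retryable error flag.
MP_OTHER_ERROR = 0x2


def run_pg_init(hw_handler_pg_init, pg_init_retries):
    """Model of the multipath core: call the hardware handler and repeat
    the pg_init while the handler asks for a retry and budget remains.

    Returns (state, retries_used)."""
    retries = 0
    while True:
        flags = hw_handler_pg_init()
        if flags == 0:
            return "active", retries
        if (flags & MP_RETRY_PG_INIT) and retries < pg_init_retries:
            retries += 1
            continue
        return "failed", retries


def handler_needing(n_retries):
    """Fake handler that requests a retry n_retries times, then succeeds."""
    left = [n_retries]

    def pg_init():
        if left[0] > 0:
            left[0] -= 1
            return MP_RETRY_PG_INIT
        return 0

    return pg_init


if __name__ == "__main__":
    print(run_pg_init(handler_needing(2), pg_init_retries=7))
```

The point of the split is that the handler only decides whether a retry makes sense, while the retry budget stays in dm-mpath.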

Comment 65 Dave Wysochanski 2007-02-05 21:07:24 UTC
*** Bug 175197 has been marked as a duplicate of this bug. ***

Comment 66 Dave Wysochanski 2007-02-16 00:00:28 UTC
Created attachment 148162 [details]
Latest upstream patch against 2.6.20

This patch is on top of the retry flag patch and applies cleanly to 2.6.20.

Fixes multiple pg_inits in progress at the same time using a simple list based
on the FC node_name (unique per MSA1000).  Gets closer to the ideal of
controller-based failover without more extensive surgery to dm-mp.
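
One way to picture the node_name bookkeeping is sketched below. It is an illustration only, not the patch: PgInitTracker and its methods are hypothetical, a dict stands in for the list, and the behaviour for a second request (park it as a waiter until the first pg_init completes) is an assumption that the comment does not spell out.

```python
class PgInitTracker:
    """Tracks which arrays (keyed by FC node_name) have a pg_init in flight."""

    def __init__(self):
        # node_name -> paths that asked for a pg_init while one was in flight
        self._in_progress = {}

    def request(self, node_name, path):
        """Return True if the caller should send the failover command itself,
        False if one is already in progress for this array."""
        waiters = self._in_progress.get(node_name)
        if waiters is None:
            self._in_progress[node_name] = []
            return True
        waiters.append(path)
        return False

    def complete(self, node_name):
        """Mark the pg_init as finished and return the paths that waited."""
        return self._in_progress.pop(node_name, [])


if __name__ == "__main__":
    tracker = PgInitTracker()
    print(tracker.request("msa-a", "sda"))   # first request for this array
    print(tracker.request("msa-a", "sdb"))   # same array, already in flight
    print(tracker.complete("msa-a"))
```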

Still todo:
1) I/O errors on passive paths (would like to propose something even though it
may get rejected by maintainers)
2) retry logic and check conditions (make final call on what to do here - is it
worth it to do retries?)
3) boot issues (some paths don't get added to multipath maps on bootup - might
be a driver / hotplug / udev issue)
4) misc code cleanup (comments, debug code / printk's)

Comment 67 Dave Wysochanski 2007-02-16 16:15:37 UTC
Created attachment 148204 [details]
Latest upstream patch against 2.6.20

A few fixes w/locking, etc.

Comment 68 Issue Tracker 2007-02-23 16:43:06 UTC
Hello,

Is it possible to get the backported patch for the current RHEL4 kernel, to
create an updated test package/hotfix?



This event sent from IssueTracker by adreyer 
 issue 109951

Comment 69 Dave Wysochanski 2007-02-23 18:01:22 UTC
Fix is not quite complete and not upstream.

Do you view issue #1 as important/essential?  
1) I/O errors on passive paths

I was viewing this as an essential component until I heard otherwise (it is part
of equivalent functionality with existing mp solutions, so lacking it would be a
regression).  If it is not addressed, you will see a lot of I/O errors with
various tools and in /var/log/messages which may mask or even cause other real
issues (at the very least they will cause undue alarm & look scary).

Comment 72 RHEL Program Management 2007-03-10 01:02:00 UTC
This bugzilla had previously been approved for engineering
consideration but Red Hat Product Management is currently reevaluating
this issue for inclusion in RHEL4.6.

Comment 77 Dave Wysochanski 2007-05-22 22:45:32 UTC
Created attachment 155210 [details]
Simpler patch against 2.6.22-rc1 (does not have retries or anything)

Only brief testing

Comment 79 Dave Wysochanski 2007-05-23 20:44:12 UTC
Baseline patch (no retries, check conditions, etc) against 2.6.22-rc1 submitted
to dm-devel.

Comment 80 Dave Wysochanski 2007-05-31 17:26:59 UTC
Patch set submitted to dm-devel against 2.6.22-rc1.  Mostly very basic support
with some retries and handling of check conditions.  No handling of I/O errors
(future work).
https://www.redhat.com/archives/dm-devel/2007-May/msg00105.html

Comment 81 Michael Hideo 2007-06-06 04:42:47 UTC
Adding ecs-dev-list to the cc list for tracking

Comment 86 Dave Wysochanski 2007-07-26 04:53:20 UTC
Latest patches against 2.6.23-rc1 posted to dm-devel:
https://www.redhat.com/archives/dm-devel/2007-July/msg00187.html

Code has been decently tested with cable pulls during I/O runs and no major
issues seen.

Comment 87 Issue Tracker 2007-08-01 11:17:49 UTC
Any chance to patch our RHEL4 kernel?


Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by racedo 
 issue 109951

Comment 88 Dave Wysochanski 2007-08-02 13:26:10 UTC
Still waiting for upstream acceptance.

Comment 96 Michael Hideo 2007-10-23 02:44:08 UTC
Removing automation notification

Comment 97 Dave Wysochanski 2007-10-23 14:42:23 UTC
Three patches which implement the hp-sw handler are now in Linus's kernel:
1) generic retry support: http://tinyurl.com/yw6q2e
2) basic hp-sw support: http://tinyurl.com/22tw4c
3) add retries to hp: http://tinyurl.com/yt7abn

Comment 102 Nick Strugnell 2007-12-14 11:30:48 UTC
Just come back from another client engagement with the 7.0 firmware upgrade and
can confirm that this _does_ work with group_by_prio and mpath_prio_alua. So,
dm-hp-sw will only be needed for older arrays which cannot be upgraded to this
firmware revision.



Comment 106 Dave Wysochanski 2008-01-21 06:26:23 UTC
Created attachment 292333 [details]
Initial backport of upstream 3 patches

Initial patch against 2.6.9-68.7.  Only compile tested.  I did not run this
code but looked at the previous rhel4u5 patch and the upstream patch and took
my best guess.  Will do some tests early this week.

Comment 107 Dave Wysochanski 2008-01-22 16:20:59 UTC
Patch in #106 has at least one critical error (reversed logic in completion
handler) that makes it non-functional.  Working on an updated patch.

Comment 108 Dave Wysochanski 2008-01-22 19:46:29 UTC
Created attachment 292553 [details]
Updated rhel4.7 patch - currently under test and looking promising

Fixes various bugs in initial backport, testing going ok so far.

Interfaces used for failover:
1) to_scsi_device: get scsi_device pointer (needed for following APIs)
2) scsi_allocate_request: allocates a request for failover (START_STOP) command
3) scsi_do_req: sends the failover command
4) scsi_release_request: release scsi request used for failover command

If you look at the history of this bug, you'll see I arrived at these
interfaces because of the differences between the hp and emc hw handlers.  The
EMC handler is more complicated since it sends a MODE_SELECT.  It must allocate
a page, a bio, and a request.  Since the HP handler is only sending a
START_STOP command, I tried using a request directly, but then needed a bio for
the completion callback.  I then got a panic because apparently you need a page
attached to the bio.
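
For reference, the START_STOP command mentioned above is small, which is part of why the HP handler can be simpler than the EMC one. The sketch below builds the standard 6-byte START STOP UNIT CDB in userspace. The opcode (0x1B) and the IMMED and START bit positions follow the SCSI command layout; build_start_stop_cdb is a hypothetical helper, and which bits the hp-sw handler actually sets is not stated in this bug.

```python
START_STOP = 0x1B  # SCSI START STOP UNIT opcode


def build_start_stop_cdb(start=True, immed=False):
    """Build a 6-byte START STOP UNIT command descriptor block.

    Byte 1 bit 0 is IMMED (return before the operation completes),
    byte 4 bit 0 is START (1 = start the unit, 0 = stop it)."""
    cdb = bytearray(6)
    cdb[0] = START_STOP
    cdb[1] = 0x01 if immed else 0x00
    cdb[4] = 0x01 if start else 0x00
    return bytes(cdb)


if __name__ == "__main__":
    print(build_start_stop_cdb().hex())
```

There is no data phase, so unlike a MODE_SELECT nothing has to be allocated for a payload.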

Comment 109 Dave Wysochanski 2008-01-22 20:16:45 UTC
Note that to utilize the previous patch, something like the following should be
placed in /etc/multipath.conf:

devices
{
        device {
                vendor                  "COMPAQ  "
                product                 "MSA1000 VOLUME  "
                path_grouping_policy    failover
                hardware_handler        "1 hp-sw"
                path_selector           "round-robin 0"
                path_checker            hp_sw
                features                "2 pg_init_retries 7"
                no_path_retry           60
                failback                manual
        }
}
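
Two details of the stanza above are easy to miss. The vendor and product strings carry trailing spaces because they mirror the fixed-width SCSI INQUIRY fields (8 bytes for vendor, 16 for product), and the leading number in the features and hardware_handler values is a count of the words that follow. The sketch below just checks both; pad_inquiry and parse_counted_words are hypothetical helpers, not part of multipath-tools.

```python
VENDOR_WIDTH = 8    # width of the INQUIRY vendor identification field
PRODUCT_WIDTH = 16  # width of the INQUIRY product identification field


def pad_inquiry(value, width):
    """Space-pad a string to the width of its INQUIRY field."""
    if len(value) > width:
        raise ValueError("value longer than field width")
    return value.ljust(width)


def parse_counted_words(setting):
    """Split a '<count> <word> ...' setting and check the count."""
    tokens = setting.split()
    count = int(tokens[0])
    words = tokens[1:]
    if count != len(words):
        raise ValueError("count does not match number of words")
    return words


if __name__ == "__main__":
    print(repr(pad_inquiry("COMPAQ", VENDOR_WIDTH)))
    print(repr(pad_inquiry("MSA1000 VOLUME", PRODUCT_WIDTH)))
    print(parse_counted_words("2 pg_init_retries 7"))
    print(parse_counted_words("1 hp-sw"))
```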


Comment 110 Dave Wysochanski 2008-01-23 22:28:50 UTC
Series posted to rhkernel:
http://post-office.corp.redhat.com/archives/rhkernel-list/2008-January/msg01170.html


Comment 120 Vivek Goyal 2008-03-18 16:24:28 UTC
Committed in 68.23. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 124 Don Domingo 2008-05-21 03:57:54 UTC
this bug has been tagged for inclusion in the RHEL4.7 release notes. please post
the necessary content for it. thanks!

Comment 125 Dave Wysochanski 2008-05-30 18:43:11 UTC
The main thing that needs to be added is a summary of comment #29, and a note that an
updated userspace device-mapper-multipath package (included in rhel4.7) is
required to utilize the kernel module.  Here's a first attempt.

An updated device-mapper-multipath package is required for utilization of the
hp_sw kernel module.

In addition, the HP array must be configured properly for active/passive mode
and recognition of connections from a Linux machine.  The following is an
example of configuration of an HP MSA1000 array with two connections.

CLI> show version
     Firmware version:         4.48 build 342
     Hardware Revision:        7 [AutoRev: 0x010000]
     Internal EMU Rev:         1.86 (9J33JN71778P)

CLI> show connections

Connection Name: <Unknown>
   Host WWNN = 200100E0-8B3C0A65
   Host WWPN = 210100E0-8B3C0A65
   Profile Name = Default
   Unit Offset = 0
   Controller 2 Port 1 Status = Online

Connection Name: <Unknown>
   Host WWNN = 200000E0-8B1C0A65
   Host WWPN = 210000E0-8B1C0A65
   Profile Name = Default
   Unit Offset = 0
   Controller 1 Port 1 Status = Online

CLI> add connection foo-p2 WWPN=210000E0-8B1C0A65 profile=Linux OFFSET=0
Connection has been added successfully.
Profile Linux is set for the new connection.

CLI> add connection foo-p1 WWPN=210100E0-8B3C0A65 profile=Linux OFFSET=0
Connection has been added successfully.
Profile Linux is set for the new connection.

CLI> show connections

Connection Name: foo-p2
   Host WWNN = 200000E0-8B1C0A65
   Host WWPN = 210000E0-8B1C0A65
   Profile Name = Linux
   Unit Offset = 0
   Controller 1 Port 1 Status = Online

Connection Name: foo-p1
   Host WWNN = 200100E0-8B3C0A65
   Host WWPN = 210100E0-8B3C0A65
   Profile Name = Linux
   Unit Offset = 0
   Controller 2 Port 1 Status = Online
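
With more than a couple of connections, the add connection lines can be generated from the show connections output instead of typed by hand. The following is a sketch under assumptions: it only knows the output layout shown in the transcript above, parse_connections and add_connection_commands are hypothetical helpers, and the foo-pN naming is arbitrary.

```python
def parse_connections(text):
    """Parse 'show connections' output into a list of dicts."""
    connections = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Connection Name:"):
            current = {"name": line.split(":", 1)[1].strip()}
            connections.append(current)
        elif current is not None and "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            current[key] = value
    return connections


def add_connection_commands(connections, prefix="foo"):
    """Emit one 'add connection' command per connection not yet on the
    Linux profile. Connection names are arbitrary labels."""
    commands = []
    for index, conn in enumerate(connections, start=1):
        if conn.get("Profile Name") == "Linux":
            continue
        commands.append("add connection %s-p%d WWPN=%s profile=Linux OFFSET=%s"
                        % (prefix, index, conn["Host WWPN"], conn["Unit Offset"]))
    return commands


SAMPLE = """
Connection Name: <Unknown>
   Host WWNN = 200100E0-8B3C0A65
   Host WWPN = 210100E0-8B3C0A65
   Profile Name = Default
   Unit Offset = 0
   Controller 2 Port 1 Status = Online

Connection Name: <Unknown>
   Host WWNN = 200000E0-8B1C0A65
   Host WWPN = 210000E0-8B1C0A65
   Profile Name = Default
   Unit Offset = 0
   Controller 1 Port 1 Status = Online
"""

if __name__ == "__main__":
    for command in add_connection_commands(parse_connections(SAMPLE)):
        print(command)
```

Running show connections again afterwards, as in the transcript, remains the real check that the profile was applied.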


Comment 126 Don Domingo 2008-06-01 23:00:18 UTC
thanks Dave. adding to "Known Issues" of RHEL4.7 release notes:

<quote>
If you need to use the hp_sw kernel module, install the updated
device-mapper-multipath package.

You also need to properly configure the HP array to correctly use active/passive
mode and recognize connections from a Linux machine. To do this, perform the
following steps:

   1. Use show connections to determine the world wide port name (WWPN) of each
connection. Below is a sample output of show connections on an HP MSA1000
array with two connections:

      Connection Name: <Unknown>
         Host WWNN = 200100E0-8B3C0A65
         Host WWPN = 210100E0-8B3C0A65
         Profile Name = Default
         Unit Offset = 0
         Controller 2 Port 1 Status = Online

      Connection Name: <Unknown>
         Host WWNN = 200000E0-8B1C0A65
         Host WWPN = 210000E0-8B1C0A65
         Profile Name = Default
         Unit Offset = 0
         Controller 1 Port 1 Status = Online

   2. Configure each connection properly using the following command:

      add connection [connection name] WWPN=[WWPN ID] profile=Linux OFFSET=[unit offset]

Note that [connection name] can be set arbitrarily.

Using the given example, the proper commands should be:

      add connection foo-p2 WWPN=210000E0-8B1C0A65 profile=Linux OFFSET=0

      add connection foo-p1 WWPN=210100E0-8B3C0A65 profile=Linux OFFSET=0

   3. Run show connections again to verify that each connection is properly
configured. In our example, the correct configuration should be:

      Connection Name: foo-p2
         Host WWNN = 200000E0-8B1C0A65
         Host WWPN = 210000E0-8B1C0A65
         Profile Name = Linux
         Unit Offset = 0
         Controller 1 Port 1 Status = Online

      Connection Name: foo-p1
         Host WWNN = 200100E0-8B3C0A65
         Host WWPN = 210100E0-8B3C0A65
         Profile Name = Linux
         Unit Offset = 0
         Controller 2 Port 1 Status = Online
</quote>

please advise if any further revisions are required. also, will a kbase article
be needed for this?

thanks!

Comment 127 Don Domingo 2008-06-02 23:14:40 UTC
Hi,

the RHEL4.7 release notes deadline is on June 17, 2008 (Tuesday). they will
undergo a final proofread before being dropped to translation, at which point no
further additions or revisions will be entertained.

a mockup of the RHEL4.7 release notes can be viewed here:
http://intranet.corp.redhat.com/ic/intranet/RHEL4u7relnotesmockup.html

please use the aforementioned link to verify if your bugzilla is already in the
release notes (if it needs to be). each item in the release notes contains a
link to its original bug; as such, you can search through the release notes by
bug number.

Cheers,
Don

Comment 130 errata-xmlrpc 2008-07-24 19:11:22 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0665.html

