Description of problem: The HP/StorageWorks active/passive SANs do not work with dm multipathing because of the nature of their operation. The module I'm attaching is a backport of the upstream module, which has been tested upstream. I have yet to get a customer to test this, but it should work. I will continue trying to find a customer willing to test it and confirm whether this module works.
Created attachment 131053 [details] module to add.
adding mchristi to the cc list as he's the original author.
Created attachment 131073 [details] dm-hp-sw patch that applies to 2.6.9-39
Created attachment 131074 [details] patch for the config files.
putting on the RHEL4.5 proposed list. A customer has confirmed that this does work.
Created attachment 132388 [details] dm-hp-sw patch that applies to 2.6.9-39 with appropriate kconfig changes.
(In reply to comment #40)
> (In reply to comment #35)
> > Note also that the start cmd takes ~3.5s on my setup.
>
> Upstream, I retry the command 5 times (it is just a dumb hardcode). Does this
> work? If you send IO before the 3.5 secs to the path that is becoming active
> what is returned? Do READs/WRITEs get NOT_READY? I think one of us should look
> at the qlogic fo driver again to confirm what it did. I thought it only retried
> the START_STOP command a couple times if it got NOT_READY, but I do not remember
> the code. It may have returned success on NOT_READY and then internally handled
> if IO got sense that indicated that the device was still becoming ready.
>

Oh yeah we could also just ask Andrew if there was an upper bound on how long it takes to complete a failover and add a timer :)
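For illustration only, the bounded retry discussed in this comment can be sketched in userspace C. This is not the handler code: `start_result`, `fake_path`, `fake_start` and `start_path_with_retries` are hypothetical names, the real handler decides from SCSI sense data, and treating "5 times" as five attempts in total is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes; the real handler inspects SCSI sense data. */
enum start_result { START_OK, START_NOT_READY, START_FAILED };

/* Hypothetical callback that issues one START_STOP attempt on a path. */
typedef enum start_result (*start_fn)(void *path);

#define HP_SW_MAX_ATTEMPTS 5 /* assumed meaning of "retry the command 5 times" */

/* Simulated path that reports NOT_READY until it has been asked ready_after times. */
struct fake_path {
    int ready_after;
    int calls;
};

enum start_result fake_start(void *path)
{
    struct fake_path *p = path;
    p->calls++;
    return p->calls >= p->ready_after ? START_OK : START_NOT_READY;
}

/*
 * Send the start command until the path is active, a hard failure is
 * reported, or max_attempts commands have been sent. The number of
 * commands actually sent is stored in *attempts when it is not NULL.
 */
bool start_path_with_retries(start_fn start, void *path,
                             int max_attempts, int *attempts)
{
    int sent = 0;
    bool ok = false;

    while (sent < max_attempts) {
        enum start_result r = start(path);
        sent++;
        if (r == START_OK) {
            ok = true;
            break;
        }
        if (r == START_FAILED)
            break;
        /* START_NOT_READY: the controller is still failing over, try again. */
    }
    if (attempts != NULL)
        *attempts = sent;
    return ok;
}
```

A timer with an upper bound on failover time, as suggested above, would replace the fixed attempt count with a deadline check in the same loop.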
Created attachment 140056 [details] patch to fix panic, error path Here's the one patch I'm using on top of the dm-hp-sw.patch for unit testing rhel4 u4 code.
Created attachment 140721 [details] dm-hp-sw patch that applies to 2.6.9-42 (rhel4 u4) with appropriate kconfig changes Patch which adds dm-hp-sw module - currently under unit testing.
Created attachment 140741 [details] v0.91 dm-hp-sw patch that applies to 2.6.9-42 (rhel4 u4) with appropriate kconfig changes Fix dumb error with cmd_timeout units
Just an update. I am testing some error recovery paths with retries that I added to the code and trying to invoke various check conditions from the A/P MSA1000. We're also trying to obtain documentation on check conditions and/or get them from existing kernel code snippets. Also, the boot is not pretty, though I'm not sure any of the problems are show stoppers. There are basically 3 boot issues I'm seeing:
1) Lots of I/O errors on standby paths because of LVM or something else scanning.
2) Some thrashing with a lot of paths issuing start/stop (probably because the failover is controller based and active/passive paths get seen by udev/multipath in a non-deterministic order).
3) Sometimes not all maps get populated with all paths (might be bz 205781 though), so you have to re-run multipath after boot (saw this with 14 devices, 28 paths, so it's not an unreasonable configuration).
Will attach my latest code, which adds a retry flag to dm-mpath.c which is passed to dm_pg_init_complete() and allows dm-mpath to retry the pg_init. Work still is in progress, but basic retries seem to be ok.
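A rough model of how such a retry flag could be consumed is sketched below. It is a simplified userspace stand-in, not the dm-mpath.c change itself: the flag name `MP_RETRY_PG_INIT` comes from the patch description, but its value, the `mp_state` struct and `model_pg_init_complete()` are assumptions made for illustration.

```c
#include <stdbool.h>

#define MP_RETRY_PG_INIT 0x1u /* name from the patch; the value is an assumption */

/* Hypothetical, simplified stand-in for the state dm-mpath keeps per map. */
struct mp_state {
    unsigned pg_init_retries;  /* configured retry budget */
    unsigned pg_init_count;    /* retries already used */
    bool     pg_init_required; /* true when pg_init should be issued again */
    bool     path_failed;      /* true once the path has been given up on */
};

/*
 * Model of the completion callback. err_flags == 0 means the hardware
 * handler succeeded. A handler that hit a transient condition sets
 * MP_RETRY_PG_INIT so the core re-issues pg_init instead of failing
 * the path, as long as the retry budget has not been used up.
 */
void model_pg_init_complete(struct mp_state *m, unsigned err_flags)
{
    m->pg_init_required = false;

    if (err_flags == 0) {
        m->pg_init_count = 0;
        return;
    }
    if ((err_flags & MP_RETRY_PG_INIT) &&
        m->pg_init_count < m->pg_init_retries) {
        m->pg_init_count++;
        m->pg_init_required = true;
        return;
    }
    m->path_failed = true;
}
```

Keeping the counter in the core rather than in each hardware handler lets every handler share one retry policy.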
Created attachment 142204 [details] Patch to add retry flag in dm-mpath.c
Created attachment 142207 [details] v0.961 dm-hp-sw patch that applies to 2.6.9-42 (rhel4 u4) with appropriate kconfig changes Latest dm-hp-sw code that uses dm-mpath.c retries via MP_RETRY_PG_INIT flag.
*** Bug 175197 has been marked as a duplicate of this bug. ***
Created attachment 148162 [details] Latest upstream patch against 2.6.20
This patch is on top of the retry flag patch and applies cleanly to 2.6.20. It fixes multiple pg_inits being in progress at the same time by using a simple list keyed on the FC node_name (unique per MSA1000). This gets closer to the ideal of controller-based failover without more extensive surgery to dm-mp. Still todo:
1) I/O errors on passive paths (would like to propose something even though it may get rejected by maintainers)
2) retry logic and check conditions (make final call on what to do here: is it worth it to do retries?)
3) boot issues (some paths don't get added to multipath maps on bootup; might be a driver / hotplug / udev issue)
4) misc code cleanup (comments, debug code / printk's)
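The idea of a list keyed on the FC node_name can be sketched as follows. This is a userspace illustration under stated assumptions, not the patch: `inflight`, `pg_init_begin()` and `pg_init_end()` are hypothetical names, and the sketch omits the locking that kernel code would need around the list.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical entry: one per array (FC node_name) with a pg_init in flight. */
struct inflight {
    uint64_t node_name;
    struct inflight *next;
};

struct inflight *inflight_head = NULL;

/*
 * Returns true if the caller may start a pg_init for this array, false
 * if one is already in progress for the same node_name (or if the
 * allocation fails). No locking here; real code would need it.
 */
bool pg_init_begin(uint64_t node_name)
{
    struct inflight *e;

    for (e = inflight_head; e != NULL; e = e->next)
        if (e->node_name == node_name)
            return false;

    e = malloc(sizeof(*e));
    if (e == NULL)
        return false;
    e->node_name = node_name;
    e->next = inflight_head;
    inflight_head = e;
    return true;
}

/* Removes the entry once the failover command for that array completes. */
void pg_init_end(uint64_t node_name)
{
    struct inflight **pp;

    for (pp = &inflight_head; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->node_name == node_name) {
            struct inflight *dead = *pp;
            *pp = dead->next;
            free(dead);
            return;
        }
    }
}
```

Because every path to one MSA1000 reports the same node_name, a second path asking for a pg_init while the first is still running is turned away instead of sending another start command to the same array.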
Created attachment 148204 [details] Latest upstream patch against 2.6.20 A few fixes w/locking, etc.
Hello, Is it possible to get the backported patch for the current RHEL4 kernel, to create an updated test package/hotfix? This event sent from IssueTracker by adreyer issue 109951
Fix is not quite complete and not upstream. Do you view issue #1 as important/essential?
1) I/O errors on passive paths
I was viewing this as an essential component until I heard otherwise (it is part of equivalent functionality with existing mp solutions, so lacking it would be a regression). If it is not fixed, you will see a lot of I/O errors with various tools and in /var/log/messages, which may mask or even cause other real issues (at the very least they will cause undue alarm and look scary).
This bugzilla had previously been approved for engineering consideration but Red Hat Product Management is currently reevaluating this issue for inclusion in RHEL4.6.
Created attachment 155210 [details] Simpler patch against 2.6.22-rc1 (does not have retries or anything) Only brief testing
Baseline patch (no retries, check conditions, etc) against 2.6.22-rc1 submitted to dm-devel.
Patch set submitted to dm-devel against 2.6.22-rc1. Mostly very basic support with some retries and handling of check conditions. No handling of I/O errors (future work). https://www.redhat.com/archives/dm-devel/2007-May/msg00105.html
Adding 'cc ecs-dev-list for tracking
Latest patches against 2.6.23-rc1 posted to dm-devel: https://www.redhat.com/archives/dm-devel/2007-July/msg00187.html Code has been decently tested with cable pulls during I/O runs and no major issues seen.
Any chance to patch our RHEL4 kernel? Internal Status set to 'Waiting on SEG' This event sent from IssueTracker by racedo issue 109951
Still waiting for upstream acceptance.
Removing automation notification
Three patches which implement hp-sw handler now in linus's kernel: 1) generic retry support: http://tinyurl.com/yw6q2e 2) basic hp-sw support: http://tinyurl.com/22tw4c 3) add retries to hp: http://tinyurl.com/yt7abn
Just come back from another client engagement with the 7.0 firmware upgrade and can confirm that this _does_ work with group_by_prio and mpath_prio_alua. So, dm-hp-sw will only be needed for older arrays which cannot be upgraded to this firmware revision.
Created attachment 292333 [details] Initial backport of upstream 3 patches Initial patch against 2.6.9-68.7. Only compile tested. I did not run this code but looked at previous rhel4u5 patch and upstream patch and took my best guess. Will do some tests early this week.
Patch in #106 has at least one critical error (reversed logic in completion handler) that makes it non-functional. Working on an updated patch.
Created attachment 292553 [details] Updated rhel4.7 patch - currently under test and looking promising
Fixes various bugs in initial backport, testing going ok so far. Interfaces used for failover:
1) to_scsi_device: get scsi_device pointer (needed for following APIs)
2) scsi_allocate_request: allocates a request for failover (START_STOP) command
3) scsi_do_req: sends the failover command
4) scsi_release_request: release scsi request used for failover command
If you look at the history of this bug, you'll see I arrived at these interfaces because of the differences between the hp and emc hw handlers. The EMC handler is more complicated since it sends a MODE_SELECT. It must allocate a page, a bio, and a request. Since the HP handler is only sending a START_STOP command, I tried using a request directly, but then needed a bio for the completion callback. I then got a panic because apparently you need a page attached to the bio.
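To show why the START_STOP case needs no data buffer, here is a small sketch of the command descriptor block it sends. The opcode (0x1B), the 6-byte length and the START bit in byte 4 follow the SCSI START STOP UNIT definition; the macro names and `build_start_cdb()` are hypothetical and not taken from the patch.

```c
#include <stdint.h>
#include <string.h>

#define START_STOP_OPCODE  0x1B /* SCSI START STOP UNIT */
#define START_STOP_CDB_LEN 6    /* 6-byte CDB, no data phase */

/*
 * Fill in a START STOP UNIT CDB that asks the device to start.
 * Everything the command needs fits in the CDB itself, unlike
 * MODE_SELECT, which carries a parameter list in a data buffer.
 */
void build_start_cdb(uint8_t cdb[START_STOP_CDB_LEN])
{
    memset(cdb, 0, START_STOP_CDB_LEN);
    cdb[0] = START_STOP_OPCODE;
    cdb[4] = 0x01; /* START bit */
}
```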
Note that to utilize the previous patch, something like the following should be placed in /etc/multipath.conf:

devices {
        device {
                vendor                  "COMPAQ "
                product                 "MSA1000 VOLUME "
                path_grouping_policy    failover
                hardware_handler        "1 hp-sw"
                path_selector           "round-robin 0"
                path_checker            hp_sw
                features                "2 pg_init_retries 7"
                no_path_retry           60
                failback                manual
        }
}
Series posted to rhkernel: http://post-office.corp.redhat.com/archives/rhkernel-list/2008-January/msg01170.html
Committed in 68.23 . RPMS are available at http://people.redhat.com/vgoyal/rhel4/
this bug has been tagged for inclusion in the RHEL4.7 release notes. please post the necessary content for it. thanks!
The main thing that needs to be added is a summary of comment #29, and a note that an updated userspace device-mapper-multipath package (included in rhel4.7) is required to utilize the kernel module. Here's a first attempt.

An updated device-mapper-multipath package is required for utilization of the hp_sw kernel module. In addition, the HP array must be configured properly for active/passive mode and recognition of connections from a Linux machine. The following is an example of configuration of an HP MSA1000 array with two connections.

CLI> show version
Firmware version: 4.48 build 342
Hardware Revision: 7 [AutoRev: 0x010000]
Internal EMU Rev: 1.86 (9J33JN71778P)

CLI> show connections
Connection Name: <Unknown>
   Host WWNN = 200100E0-8B3C0A65
   Host WWPN = 210100E0-8B3C0A65
   Profile Name = Default
   Unit Offset = 0
   Controller 2 Port 1 Status = Online
Connection Name: <Unknown>
   Host WWNN = 200000E0-8B1C0A65
   Host WWPN = 210000E0-8B1C0A65
   Profile Name = Default
   Unit Offset = 0
   Controller 1 Port 1 Status = Online

CLI> add connection foo-p2 WWPN=210000E0-8B1C0A65 profile=Linux OFFSET=0
Connection has been added successfully.
Profile Linux is set for the new connection.

CLI> add connection foo-p1 WWPN=210100E0-8B3C0A65 profile=Linux OFFSET=0
Connection has been added successfully.
Profile Linux is set for the new connection.

CLI> show connections
Connection Name: foo-p2
   Host WWNN = 200000E0-8B1C0A65
   Host WWPN = 210000E0-8B1C0A65
   Profile Name = Linux
   Unit Offset = 0
   Controller 1 Port 1 Status = Online
Connection Name: foo-p1
   Host WWNN = 200100E0-8B3C0A65
   Host WWPN = 210100E0-8B3C0A65
   Profile Name = Linux
   Unit Offset = 0
   Controller 2 Port 1 Status = Online
thanks Dave. adding to "Known Issues" of RHEL4.7 release notes:

<quote>
If you need to use the hp_sw kernel module, install the updated device-mapper-multipath package. You also need to properly configure the HP array to correctly use active/passive mode and recognize connections from a Linux machine. To do this, perform the following steps:

1. Determine what the world wide port name (WWPN) of each connection is using show connections. Below is a sample output of show connections on an HP MSA1000 array with two connections:

Connection Name: <Unknown>
   Host WWNN = 200100E0-8B3C0A65
   Host WWPN = 210100E0-8B3C0A65
   Profile Name = Default
   Unit Offset = 0
   Controller 2 Port 1 Status = Online
Connection Name: <Unknown>
   Host WWNN = 200000E0-8B1C0A65
   Host WWPN = 210000E0-8B1C0A65
   Profile Name = Default
   Unit Offset = 0
   Controller 1 Port 1 Status = Online

2. Configure each connection properly using the following command:

add connection [connection name] WWPN=[WWPN ID] profile=Linux OFFSET=[unit offset]

Note that [connection name] can be set arbitrarily. Using the given example, the proper commands should be:

add connection foo-p2 WWPN=210000E0-8B1C0A65 profile=Linux OFFSET=0
add connection foo-p1 WWPN=210100E0-8B3C0A65 profile=Linux OFFSET=0

3. Run show connections again to verify that each connection is properly configured. In our example, the correct configuration should be:

Connection Name: foo-p2
   Host WWNN = 200000E0-8B1C0A65
   Host WWPN = 210000E0-8B1C0A65
   Profile Name = Linux
   Unit Offset = 0
   Controller 1 Port 1 Status = Online
Connection Name: foo-p1
   Host WWNN = 200100E0-8B3C0A65
   Host WWPN = 210100E0-8B3C0A65
   Profile Name = Linux
   Unit Offset = 0
   Controller 2 Port 1 Status = Online
</quote>

please advise if any further revisions are required. also, will a kbase article be needed for this? thanks!
Hi, the RHEL4.7 release notes deadline is on June 17, 2008 (Tuesday). they will undergo a final proofread before being dropped to translation, at which point no further additions or revisions will be entertained. a mockup of the RHEL4.7 release notes can be viewed here: http://intranet.corp.redhat.com/ic/intranet/RHEL4u7relnotesmockup.html please use the aforementioned link to verify if your bugzilla is already in the release notes (if it needs to be). each item in the release notes contains a link to its original bug; as such, you can search through the release notes by bug number. Cheers, Don
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2008-0665.html