multipath -ll output when all is OK:

lun1 (360026b900034975f000037f64c870ba7) dm-0 DELL,MD32xxi
[size=9.1T][features=2 pg_init_retries 1][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=400][active]
 \_ 12:0:0:0 sdc 8:32   [active][ready]
 \_ 6:0:0:0  sde 8:64   [active][ready]
 \_ 11:0:0:0 sdg 8:96   [active][ready]
 \_ 9:0:0:0  sdi 8:128  [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 10:0:0:0 sdb 8:16   [active][ghost]
 \_ 5:0:0:0  sdd 8:48   [active][ghost]
 \_ 7:0:0:0  sdf 8:80   [active][ghost]
 \_ 8:0:0:0  sdh 8:112  [active][ghost]
lun0 (360026b90002fc58a00002a544c870c82) dm-1 DELL,MD32xxi
[size=9.1T][features=2 pg_init_retries 1][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=400][active]
 \_ 10:0:0:1 sdr 65:16  [active][ready]
 \_ 5:0:0:1  sds 65:32  [active][ready]
 \_ 8:0:0:1  sdt 65:48  [active][ready]
 \_ 7:0:0:1  sdu 65:64  [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 6:0:0:1  sdj 8:144  [active][ghost]
 \_ 9:0:0:1  sdk 8:160  [active][ghost]
 \_ 11:0:0:1 sdl 8:176  [active][ghost]
 \_ 12:0:0:1 sdm 8:192  [active][ghost]

Scenario:
1) Do I/O on the multipath device (lunN)
2) /etc/init.d/iscsi stop

Expected Result: After retries the device fails and the application writing to it receives an error.

Actual Result: The application writing to the device gets stuck forever, and the machine has to be rebooted.

To test this we turned off queue_if_no_path by setting no_path_retry to 0. The problem still happens.
This is our device configuration in multipath.conf:

device {
        vendor "DELL"
        product "(MD32xx|MD32xxi|MD3000|MD3000i)"
        path_grouping_policy group_by_prio
        prio rdac
        prio_callout "/sbin/mpath_prio_rdac /dev/%n"
        polling_interval 5
        path_checker rdac
        path_selector "round-robin 0"
        hardware_handler "1 rdac"
        failback immediate
        no_path_retry 0
        features "2 pg_init_retries 1"
        rr_min_io 10
}

iscsid configuration timeouts/retries:

node.session.timeo.replacement_timeout = 20
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 20
node.session.initial_login_retry_max = 8
2.6.28+ kernels have an option to abort all requests on a request queue (blk_abort_queue). This function is called in dm-mpath.c from inside deactivate_path. This, together with generic device timeout handling, is what needs to be backported in order to fix this issue.
Created attachment 460359 [details]
Patch to offline devices before stopping iscsi

I have traced the problem to the device handler detach code. When doing iscsi stop, the device is removed together with the device handler. The multipath layer knows nothing about the device handler removal, and continues to allow I/O to be stacked up on the device request queue until multipathd fails the path. The result is an I/O hang: the queued requests will never be serviced, and the machine has to be rebooted to make things work again.

Although the queue abort functionality will solve this issue, it looks quite complex to apply on the current RHEL5 code base.

I have attached a patch to the iscsi script, which offlines all the devices currently associated with iscsi prior to logging out and removing the devices. udevsettle and udevtrigger are necessary to ensure that the device removal is propagated to user space.
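For illustration only, the offlining step could look roughly like the sketch below. It is not the attached patch. It assumes that an iSCSI-backed disk's sysfs device link resolves to a path containing a "session<N>" component, and SYSFS_BLOCK is a hypothetical override (not part of the iscsi script) that lets the loop be tried against a scratch directory. The logout and the udevtrigger/udevsettle calls are not shown.

```shell
#!/bin/sh
# Sketch: mark iSCSI-backed SCSI disks offline before the iSCSI logout.
# Assumption: the sysfs device link of an iSCSI disk resolves to a path
# with a "session<N>" directory in it. SYSFS_BLOCK is a hypothetical
# override for dry runs; on a real host it defaults to /sys/block.

offline_iscsi_devices() {
    sysfs_block=${SYSFS_BLOCK:-/sys/block}
    for dev in "$sysfs_block"/sd*; do
        # Skip unmatched globs and entries without a state attribute.
        [ -e "$dev/device/state" ] || continue
        # Resolve the device symlink to its physical sysfs path.
        target=$(cd -P "$dev/device" 2>/dev/null && pwd -P) || continue
        # Keep only disks that sit behind an iSCSI session.
        case "$target" in
            */session[0-9]*/*) ;;
            *) continue ;;
        esac
        echo "offlining ${dev##*/}"
        echo offline > "$dev/device/state"
    done
}
```

Calling offline_iscsi_devices as root before the logout would write "offline" to the state attribute of each matching disk; local disks are left untouched.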
Created attachment 468193 [details]
Call pg_init_done directly from mpath code when device is removed

When scsi_dh_activate returns SCSI_DH_NOSYS, the H/W handler callback is not called, pg_init_done is not called in the multipath layer, and pending I/O is requeued forever. This causes all userland processes currently performing I/O on the device to hang. A similar situation occurs when the device has transitioned to SDEV_CANCEL/SDEV_DEL and the device handler data has not yet been deleted.

The easiest way to reproduce this is in an iSCSI environment:

> dd if=/dev/dm-0 of=/dev/zero bs=8k count=10000 &
> /etc/init.d/iscsi stop

In this example, dd will hang forever and the only way to release it is to reboot the machine.

This patch calls pg_init_done directly from the mpath code when the scsi_dh_activate call returns either SCSI_DH_NOSYS or SCSI_DH_DEV_OFFLINED. Additional code was added in scsi_dh_rdac to make sure pg_init_done will not be invoked twice (by both scsi_dh_rdac and mpath), which would result in an undesirable situation where pg_init_in_progress becomes negative.

Note: When running an upstream kernel, the above scenario may not occur because the request queue is aborted in dm-mpath.c:fail_path. This patch makes sure the problem does not occur at all, rather than handling it when it does. In addition, it seems too risky to apply request queue abort functionality on RHEL5 at this stage.
Created attachment 468313 [details] call pg_init_done directly from mpath when H/W handler does not
Created attachment 468570 [details] missing pg_init_done - a more general implementation This can replace 468313, however I'm leaving the old one here since this is more general and may require testing other H/W handlers and perhaps other dm devices. The test for error on activate is moved into the scsi_dh layer.
Created attachment 468793 [details]
Propagate SCSI device deletion to multipath layer - SCSI H/W handler part

dm-devel requested that the patch be divided into two separate fixes. This is the SCSI H/W handler part of the fix.
Created attachment 468797 [details] Handle device deletion by multipath This is the multipath part of the fix.
(In reply to comment #1)
> 2.6.28+ kernels have an option to abort all requests on a request queue
> (blk_abort_queue). This function is called in dm-mpath.c from inside
> deactivate_path. This, together with generic device timeout handling, is what
> needs to be backported in order to fix this issue.

FYI, the mpath call to blk_abort_queue has proven problematic and will be reverted upstream (and in RHEL6). So it is good that you have a solution that doesn't rely on it.
(In reply to comment #7)
> Created attachment 468797 [details]
> Handle device deletion by multipath
>
> This is the multipath part of the fix.

As I shared on dm-devel: "Babu already pointed out that you don't need this mpath change as the default case already performs fail_path()."

Or are you saying that the block you're hooking SCSI_DH_DEV_OFFLINED off of (SCSI_DH_NOSYS) provides valuable functionality (via !m->hw_handler_name conditional and printk)?
Moving back to NEW and assigning bug to myself.

"MODIFIED" is reserved for internal Red Hat procedures. A bug flows like this:
1) change gets accepted upstream
2) patch is posted to an internal Red Hat list for inclusion in appropriate kernel(s), bug transitions to POST
3) once the patch is reviewed/ack'd internally and accepted, the bug will transition to MODIFIED
4) the bug then transitions to ON_QA
Thank you for the thorough explanation. Your comment is correct - the default case in pg_init_done does call fail_path. Adding this together with the SCSI_DH_NOSYS case is therefore not required. M.
Created attachment 473259 [details] Updated patch - same as the one applied over Vanilla
And here is a reference to the patch that was submitted to linux-scsi: http://marc.info/?l=dm-devel&m=129252985123755&w=2
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Patch(es) available in kernel-2.6.18-256.el5 You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.
Dell, we don't have hardware to test this bug. Can your team verify it with the newly built kernel? Any test details will be appreciated.
Confirm patch in kernel git tree
Can Dell confirm whether the kernel build fixes the problem?
We are currently using a RHEL5.6 based kernel in our product and not an upstream kernel, so testing this may require additional work.

During my testing I found that the following method gives the same results: instead of /etc/init.d/iscsi stop, one can write a script that goes through the sd devices and calls:

echo 1 > /sys/block/sdN/device/delete

Provided you have FC RDAC based storage (which is easier to find), I think this will give the same results.
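A minimal sketch of such a script is shown here, not the exact one used in testing. SYSFS_BLOCK and the optional name pattern are hypothetical additions so that the loop can be pointed at a scratch directory or restricted to the multipath member disks. On a real host the default pattern detaches every sd disk, including local ones, so narrow it before running as root.

```shell
#!/bin/sh
# Sketch: remove SCSI disks through sysfs to simulate the paths going away.
# SYSFS_BLOCK is a hypothetical override for dry runs; it defaults to
# /sys/block. The first argument is an optional glob of disk names.
# WARNING: the default pattern "sd*" matches local disks as well.

delete_sd_devices() {
    sysfs_block=${SYSFS_BLOCK:-/sys/block}
    pattern=${1:-sd*}
    # $pattern is left unquoted on purpose so the shell expands the glob.
    for dev in "$sysfs_block"/$pattern; do
        delete_file="$dev/device/delete"
        # Skip unmatched globs and entries without a delete attribute.
        [ -e "$delete_file" ] || continue
        echo "deleting ${dev##*/}"
        echo 1 > "$delete_file"
    done
}
```

For the configuration in this report, something like delete_sd_devices "sd[b-m]" run while I/O is in flight on the multipath device would remove the lun1 paths and the lun0 ghost paths; the exact device names differ per host.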
(In reply to comment #25)
>
> Providing you have FC RDAC based storage (which is easier to find), I think
> this will give the same results.

The problem is, I cannot find any RDAC based iscsi target in Red Hat.
(In reply to comment #26)
> (In reply to comment #25)
> >
> > Providing you have FC RDAC based storage (which is easier to find), I think
> > this will give the same results.
>
> The problem is, I cannot find any RDAC based iscsi target in Red Hat.

For FC, some teams have it. I will try my best.
No hardware. Sanity Only. patch applied into kernel -269.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html
*** Bug 674932 has been marked as a duplicate of this bug. ***
*** Bug 708914 has been marked as a duplicate of this bug. ***