Description of problem:
While doing a multipath -ll with RHEL 5.3 beta I received nothing back. I have had multiple engineers look at this and these are the findings.

The reason you get no output back from multipath -ll when the array is in ALUA mode AND the ALUA lines are active is probably related to the illegal CDB issue we saw yesterday (the "rtpg sense code 05/24/00"). Is it possible to get a fibre trace of that?

A Finisar trace of a multipath -ll command showed an illegal REPORT TARGET PORT GROUPS CDB. Bytes 2-5 should be reserved and set to 0s. Instead, it looks like the Inquiry page 0x83 CDB is re-used without zeroing, as bytes 2 and 4 contain 83 (page code) and 3C (allocation length).

The problem is in the RHEL 5.3 beta (2.6.18-120.el5) code. I pulled that code and there's no blk_rq_init() call in blk_alloc_request(), which would explain the symptoms you're seeing. James Bottomley added blk_rq_init() in 2.6.26 to fix several uninitialized buffer issues, including this one.

How reproducible:
multipath -ll

Actual results:
No output returned.

Expected results:
A view of my LUNs and their path information.

Additional info:
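To illustrate what the trace points at: a correct REPORT TARGET PORT GROUPS CDB zeroes the command buffer before filling in the opcode, service action and allocation length, so the reserved bytes 2-5 cannot inherit stale Inquiry page 0x83 data. A minimal sketch in the style of the upstream scsi_dh_alua submit path (the helper name, buffer length and surrounding request setup are assumptions for illustration, not the RHEL 5.3 source):

#include <linux/blkdev.h>
#include <linux/string.h>
#include <scsi/scsi.h>

/* Hypothetical helper: fill a 12-byte MAINTENANCE IN / REPORT TARGET PORT
 * GROUPS CDB.  Zeroing rq->cmd first keeps bytes 2-5 reserved (0) instead
 * of leaving 0x83/0x3C behind from a reused INQUIRY page 0x83 CDB. */
static void alua_prep_rtpg_cdb(struct request *rq, unsigned int bufflen)
{
	memset(rq->cmd, 0, BLK_MAX_CDB);
	rq->cmd[0] = MAINTENANCE_IN;		/* opcode 0xa3 */
	rq->cmd[1] = MI_REPORT_TARGET_PGS;	/* service action 0x0a */
	/* bytes 2-5 stay zero (reserved) */
	rq->cmd[6] = (bufflen >> 24) & 0xff;	/* allocation length, bytes 6-9 */
	rq->cmd[7] = (bufflen >> 16) & 0xff;
	rq->cmd[8] = (bufflen >>  8) & 0xff;
	rq->cmd[9] = bufflen & 0xff;
	rq->cmd_len = 12;
}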
(In reply to comment #0)
> The problem is in the RHEL 5.3 beta (2.6.18-120.el5) code. I pulled that code
> and there's no blk_rq_init() call in blk_alloc_request(), which would explain
> the symptoms you're seeing. James Bottomley added blk_rq_init() in 2.6.26 to
> fix several uninitialized buffer issues, including this one.

Mike, is there an upstream patch that we need to fix this? Is this a regression in 5.3, or has RHEL 5 always had this problem?
Not a regression. This is new code in 5.3 and the setup for requests in RHEL 5 was just different than it is upstream now. I do not think we want to go the upstream route for this one. I had thought for RHEL5 that the patch Jerry Levy (levy_jerome) had sent to linux-scsi was fine.
(In reply to comment #3)
> Not a regression. This is new code in 5.3 and the setup for requests in RHEL 5
> was just different than it is upstream now.

Well, it appears to be a regression in terms of the observed behavior... multipath -ll used to work in 5.2 and does not in 5.3. We'll need to get this approved for 5.3.
Since EMC is the only one working on this and they are shooting for RHEL 5.4, I am going to remove the module for now. When the RHEL 5.4 kernel window is closer, EMC can just send exactly what they want and not worry about these updates, which are not going to help them anyway: we never added the CLARiiONs to the scsi_dh_alua device list, so the module is not even being used in our kernel (EMC is only using the module by hacking support in themselves right now). So the patch that was posted for this BZ just removed the module.
in kernel-2.6.18-126.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
I have installed the i386 header rpm. Is there anything that needs to be done besides rpm -ivh? If not, I am getting the same results as reported originally.
~~~ Attention Partners ~~~
The *last* RHEL 5.3 Snapshot 6 is now available at partners.redhat.com. A fix for this bug should be present. Please test and update this bug with test results as soon as possible.

If the fix present in Snap6 meets all the expected requirements for this bug, please add the keyword PartnerVerified. If any new bugs are discovered, please CLONE this bug and describe the issues encountered there.
I have loaded the S6 kernel-2.6.18-126.el5 and there is no change in the behavior of multipath -ll while in ALUA mode; it displays nothing.
I don't know what to tell you :) As I said in comment #7, we dropped the module so you guys could hack on it and just send the final patch. I checked the 2.6.18-126.el5 kernel and it did drop the module.

In 5.2, what were you guys getting? Is this with explicit or implicit ALUA support? You guys added implicit support to the userspace tools, right?
I have another question. Just to double check this: you asked if there is anything you need to do besides installing the new kernel rpm. You need to boot into the new kernel. You can check if you are using the latest kernel by running the command

# uname -r

You should get 2.6.18-126.el5. If you don't see this, you need to reboot. When the grub boot loader comes up, you need to select the 2.6.18-126.el5 kernel (if it isn't already selected). Sorry if this was obvious.
I am not using the rpm; I loaded S6 from scratch. See comment 13.
We are going to try to reproduce this problem in Westford. We have a cx3 with

  Vendor: DGC   Model: RAID 5   Rev: 0326

I believe ALUA is enabled. Are you using a cx3? Are there any particular settings, or fw version, we will need to reproduce this?
I am using a CX4-960. I have just set the failover mode to 4. I am using Red Hat 5.3 Snapshot 6 kernel 2.6.18-126.el5 and the suggested settings in my /etc/multipath.conf file.

# Version : 1.0
#
defaults {
        user_friendly_names yes
}
#
# The blacklist is the enumeration of all devices that are to be
# excluded from multipath control
blacklist {
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^hd[a-z][[0-9]*]"
        devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
}
devices {
        # Device attributes for EMC CLARiiON
        device {
                vendor                  "DGC"
                product                 "*"
                path_grouping_policy    group_by_prio
                getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
#               prio_callout            "/sbin/mpath_prio_emc /dev/%n"
                prio_callout            "mpath_prio_alua /dev/%n"
#               path_checker            emc_clariion
                path_checker            alua
                features                "1 queue_if_no_path"
#               hardware_handler        "1 emc"
                hardware_handler        "1 alua"
                failback                immediate
                no_path_retry           12
        }
}
Created attachment 327767 [details]
/var/log/messages snippets using alua and emc in multipath.conf

Attachment is snippets from /var/log/messages on the test machine in Westford. The problem was reproduced in Westford on a cx3 with RHEL 5.3 (2.6.18-128.el5), using the device entry from the multipath.conf shown in the previous comment. Changing the 'alua' parameters to the commented-out 'emc' parameters enabled 'multipath -ll' to function.

Also, following the directions in the RHEL documentation at the following link, the 'multipath -ll' command functioned correctly:

http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/en-US/RHEL510/DM_Multipath/mpio_setup.html#setup_procedure

What are the default values for the CLARiiON for multipathd?
A bit more digging into the multipathd code reveals what appear to be the values set for DGC below. Note that alua is not used. This is from hwtable.c.

	{
		.vendor        = "DGC",
		.product       = ".*",
		.bl_product    = "LUNZ",
		.getuid        = DEFAULT_GETUID,
		.getprio       = "/sbin/mpath_prio_emc /dev/%n",
		.features      = "1 queue_if_no_path",
		.hwhandler     = "1 emc",
		.selector      = DEFAULT_SELECTOR,
		.pgpolicy      = GROUP_BY_PRIO,
		.pgfailback    = -FAILBACK_IMMEDIATE,
		.rr_weight     = RR_WEIGHT_NONE,
		.no_path_retry = (300 / DEFAULT_CHECKINT),
		.minio         = DEFAULT_MINIO,
		.checker_name  = EMC_CLARIION,
	},
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
The 5.3 release notes currently say:

* the SCSI device handler infrastructure (scsi_dh) has been updated, providing the following improvements:
    * a generic ALUA (asymmetric logical unit access) handler has been implemented.
    * added support for LSI RDAC SCSI based storage devices.

-----

Please remove the first bullet, about ALUA. Then re-write, since we will not need a list anymore.
I commented out the references to ALUA in the multipath.conf file and I get output from multipath -ll, but it is the same output whether I have the array in failover mode 1 or 4. Is this the expected behavior?

mpath1 (3600601607b9e1e00fac24f3410bbdd11) dm-2 DGC,RAID 5
[size=1.0G][features=1 queue_if_no_path][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=2][active]
 \_ 2:0:0:0 sdb 8:16  [active][ready]
 \_ 4:0:0:0 sdj 8:144 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 2:0:1:0 sdf 8:80  [active][ready]
 \_ 4:0:1:0 sdn 8:208 [active][ready]
(In reply to comment #25)
> I commented out the references to ALUA in the multipath.conf file and I get
> output from multipath -ll, but it is the same output whether I have the array
> in failover mode 1 or 4. Is this the expected behavior?
>
> mpath1 (3600601607b9e1e00fac24f3410bbdd11) dm-2 DGC,RAID 5
> [size=1.0G][features=1 queue_if_no_path][hwhandler=1 emc][rw]

Does your box support trespass and ALUA at the same time? It looks like with the above info we are creating a device that will send a trespass for failover/failback.

For RHEL 5.2, did this work? Could you have the scsi target set up to do ALUA, then on the initiator set it up for the emc hardware handler that did trespass, and did everything just work?
(In reply to comment #26)
> (In reply to comment #25)
> > I commented out the references to ALUA in the multipath.conf file and I get
> > output from multipath -ll, but it is the same output whether I have the
> > array in failover mode 1 or 4. Is this the expected behavior?
> >
> > mpath1 (3600601607b9e1e00fac24f3410bbdd11) dm-2 DGC,RAID 5
> > [size=1.0G][features=1 queue_if_no_path][hwhandler=1 emc][rw]
>
> Does your box support trespass and ALUA at the same time? It looks like with
> the above info we are creating a device that will send a trespass for
> failover/failback.
>
> For RHEL 5.2, did this work?

Or could you just tell us what worked in RHEL 5.2? What multipath config and how was the target configured?
The CLARiiON supports trespass and ALUA at the same time; as far as I know it will respond to a trespass command if sent even in ALUA mode. I believe using the straight ALUA parameters (with no reference to EMC / DGC at all) worked in 5.2, but we'd have to retest to verify; I don't have the setup any more.
(In reply to comment #27)
> (In reply to comment #26)
> > (In reply to comment #25)
> > > I commented out the references to ALUA in the multipath.conf file and I
> > > get output from multipath -ll, but it is the same output whether I have
> > > the array in failover mode 1 or 4. Is this the expected behavior?
> > >
> > > mpath1 (3600601607b9e1e00fac24f3410bbdd11) dm-2 DGC,RAID 5
> > > [size=1.0G][features=1 queue_if_no_path][hwhandler=1 emc][rw]
> >
> > Does your box support trespass and ALUA at the same time? It looks like
> > with the above info we are creating a device that will send a trespass for
> > failover/failback.
> >
> > For RHEL 5.2, did this work?
>
> Or could you just tell us what worked in RHEL 5.2? What multipath config and
> how was the target configured?

Just confirming what Jerry Levy already indicated in comment #28. A CLARiiON configured in ALUA failover mode (4) will respond to the SCSI commands used when the same CLARiiON is configured in PNR failover mode (1), that is, an inquiry of VPD page 0xC0 to get path/LU info and a SCSI mode select of page 0x22 to fail over an LU from one CLARiiON service processor to the other SP.

Not only will the CLARiiON support PNR and ALUA failover commands at the same time, it will allow this on a per-logical-unit basis. That is, the CLARiiON will allow a hybrid configuration whereby one initiator accesses a logical unit using PNR failover mode commands and a second initiator accesses the same logical unit using ALUA failover commands.
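For anyone reproducing this, the PNR-style path/LU query mentioned above is just an INQUIRY with the EVPD bit set and page code 0xC0. A minimal userspace sketch using the SG_IO ioctl (the device path, buffer sizes and the lack of sense/error decoding are assumptions for illustration, not EMC's recommended tooling):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int main(int argc, char **argv)
{
	/* INQUIRY (0x12), EVPD=1, page code 0xC0, allocation length 0xFF */
	unsigned char cdb[6] = { 0x12, 0x01, 0xC0, 0x00, 0xFF, 0x00 };
	unsigned char buf[255];
	unsigned char sense[32];
	struct sg_io_hdr io;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s /dev/sdX\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(&io, 0, sizeof(io));
	io.interface_id    = 'S';
	io.cmd_len         = sizeof(cdb);
	io.cmdp            = cdb;
	io.dxfer_direction = SG_DXFER_FROM_DEV;
	io.dxfer_len       = sizeof(buf);
	io.dxferp          = buf;
	io.mx_sb_len       = sizeof(sense);
	io.sbp             = sense;
	io.timeout         = 10000;	/* milliseconds */

	if (ioctl(fd, SG_IO, &io) < 0) {
		perror("SG_IO");
		return 1;
	}
	/* Byte 1 of a VPD response echoes the page code; the rest of page
	 * 0xC0 is vendor specific. */
	printf("VPD page 0x%02x returned\n", buf[1]);
	return 0;
}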
See comments #28 and #29.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Updating PM score.
EMC, is it OK to close this bug? We went all over the place with it.

The initial reason for the BZ, where we thought we might need something like blk_rq_init(), should not be an issue in our kernel. We do not copy over the rq->flags, so it did not need to be cleared.

There were bugs in scsi_dh_emc and scsi_dh_alua that you might have been hitting in some testing. We fixed the scsi_dh_alua bugs, and we ended up not shipping scsi_dh_emc (we still have dm_emc).

And then I think we hit the same type of multipath -ll issue as here in https://bugzilla.redhat.com/show_bug.cgi?id=482737. In comments 9 - 19 of that bug, I think we figured out the problem.
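For context, the upstream helper referenced in comment #0 clears the whole request before reinitializing its fields, which is why a stale CDB or stale flags cannot leak through on mainline. A rough, illustrative paraphrase of the 2.6.26-era helper (exact fields differ between kernel versions, and this is not the RHEL 5 code, where requests are set up differently):

#include <linux/blkdev.h>

void blk_rq_init(struct request_queue *q, struct request *rq)
{
	memset(rq, 0, sizeof(*rq));	/* wipes old CDB bytes, flags, etc. */

	INIT_LIST_HEAD(&rq->queuelist);
	rq->q = q;
	rq->cmd = rq->__cmd;		/* point at the embedded CDB buffer */
	rq->tag = -1;
	rq->ref_count = 1;
}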
I have no problem with closing this one.
Closing per recent comments.