Bug 471920 - [EMC 5.4 bug] multipath -ll issues requiring blk_rq_init() in 2.6.26
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: All Linux
Priority: high Severity: high
Target Milestone: rc
Target Release: 5.4
Assigned To: Mike Christie
QA Contact: Red Hat Kernel QE team
Keywords: OtherQA, Regression
Depends On:
Blocks: 459808 483701 483784 485920
Reported: 2008-11-17 12:32 EST by Don
Modified: 2010-03-14 17:28 EDT (History)
Doc Type: Bug Fix
Doc Text:
The 5.3 release notes currently say: * the SCSI device handler infrastructure (scsi_dh) has been updated, providing the following improvements: * a generic ALUA (asymmetric logical unit access) handler has been implemented. * added support for LSI RDAC SCSI based storage devices. ----- Please remove the first bullet, about ALUA. Then re-write, since we will not need a list anymore.
Last Closed: 2009-07-14 09:17:54 EDT

Attachments
/var/log/messages snippets using alua and emc in multipath.conf (6.22 KB, text/plain)
2008-12-23 14:33 EST, Rob Evers

Description Don 2008-11-17 12:32:27 EST
Description of problem:
While doing a multipath -ll with RHEL 5.3 beta I received nothing back.


I have had multiple engineers look at this and these are the findings.


The reason you get no output back from the multipath -ll when 
the array is in ALUA mode AND the ALUA lines are active is 
probably related to the illegal CDB issue we saw yesterday 
(the "rtpg sense code 05/24/00"). Is it possible to get a 
fibre trace of that?

A Finisar trace of a multipath -ll command showed an illegal
REPORT TARGET PORT GROUPS CDB.
Bytes 2-5 should be reserved and set to 0s.  Instead, it looks
like the Inq Page 0x83 CDB is re-used without zeroing, as bytes
2 and 4 contain 83 (page code) and 3C (allocation length).

The problem is in the RHEL 5.3 beta (2.6.18-120.el5) code. I pulled that code and there's no blk_rq_init() call in blk_alloc_request(), which would explain the symptoms you're seeing. James Bottomley added blk_rq_init() in 2.6.26 to fix several uninitialized buffer issues, including this one.
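
The failure mode described above can be illustrated with a small user-space sketch. This is not the kernel code path, and the helper names (build_inquiry_vpd, build_rtpg_stale, build_rtpg_clean, rtpg_reserved_ok) are hypothetical. The CDB layouts are assumed from SPC-3: INQUIRY is opcode 0x12 with the page code in byte 2 and the allocation length in bytes 3-4; REPORT TARGET PORT GROUPS is MAINTENANCE IN (0xA3) with service action 0x0A, bytes 2-5 reserved, and the allocation length in bytes 6-9.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CDB_LEN 12  /* room for a 12-byte CDB */

/* INQUIRY with EVPD=1: byte 2 = page code, bytes 3-4 = allocation length. */
void build_inquiry_vpd(uint8_t *cdb, uint8_t page, uint16_t alloc_len)
{
    memset(cdb, 0, CDB_LEN);
    cdb[0] = 0x12;                          /* INQUIRY */
    cdb[1] = 0x01;                          /* EVPD bit */
    cdb[2] = page;
    cdb[3] = (uint8_t)(alloc_len >> 8);
    cdb[4] = (uint8_t)(alloc_len & 0xff);
}

/* Set only the fields RTPG defines; other bytes keep their old contents. */
void fill_rtpg_fields(uint8_t *cdb, uint32_t alloc_len)
{
    cdb[0] = 0xA3;                          /* MAINTENANCE IN */
    cdb[1] = 0x0A;                          /* REPORT TARGET PORT GROUPS */
    cdb[6] = (uint8_t)(alloc_len >> 24);
    cdb[7] = (uint8_t)(alloc_len >> 16);
    cdb[8] = (uint8_t)(alloc_len >> 8);
    cdb[9] = (uint8_t)alloc_len;
}

/* Buggy pattern: reuse the previous command's buffer without clearing it. */
void build_rtpg_stale(uint8_t *cdb, uint32_t alloc_len)
{
    fill_rtpg_fields(cdb, alloc_len);
}

/* Fixed pattern: zero the whole CDB first so reserved bytes 2-5 are 0. */
void build_rtpg_clean(uint8_t *cdb, uint32_t alloc_len)
{
    memset(cdb, 0, CDB_LEN);
    fill_rtpg_fields(cdb, alloc_len);
}

/* Returns 1 if reserved bytes 2-5 of an RTPG CDB are all zero, else 0. */
int rtpg_reserved_ok(const uint8_t *cdb)
{
    assert(cdb != NULL);
    return cdb[2] == 0 && cdb[3] == 0 && cdb[4] == 0 && cdb[5] == 0;
}
```

Calling build_inquiry_vpd(cdb, 0x83, 0x3C) followed by build_rtpg_stale(cdb, 128) leaves 0x83 in byte 2 and 0x3C in byte 4, matching the trace; build_rtpg_clean() avoids that by clearing the buffer first.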


How reproducible:
multipath -ll

Actual results:
No output returned

Expected results:
View of my LUNs and their path information

Additional info:
Comment 2 Tom Coughlan 2008-11-17 14:40:47 EST
(In reply to comment #0)

> The problem is in the RHEL 5.3 beta (2.6.18-120.el5) code. I pulled that code
> and there's no blk_rq_init() call in blk_alloc_request(), which would explain
> the symptoms you're seeing. James Bottomley added blk_rq_init() in 2.6.26 to
> fix several uninitialized buffer issues, including this one.

Mike, is there an upstream patch that we need to fix this? 

Is this a regression in 5.3, or has RHEL 5 always had this problem?
Comment 3 Mike Christie 2008-11-18 14:44:11 EST
Not a regression. This is new code in 5.3 and the setup for requests in RHEL 5 was just different than it is upstream now. I do not think we want to go the upstream route for this one. I had thought for RHEL5 that the patch Jerry Levy (levy_jerome@emc.com) had sent to linux-scsi was fine.
Comment 5 Tom Coughlan 2008-11-25 13:49:16 EST
(In reply to comment #3)
> Not a regression. This is new code in 5.3 and the setup for requests in RHEL 5
> was just different than it is upstream now.

Well, it appears to be a regression in terms of the observed behavior... 
multipath -ll used to work in 5.2 and does not in 5.3. We'll need to get this approved for 5.3.
Comment 7 Mike Christie 2008-12-05 00:38:21 EST
Since EMC is the only one working on this and they are shooting for RHEL 5.4, I am going to remove the module for now. When the RHEL 5.4 kernel window is closer, EMC can just send exactly what they want and not worry about these updates, which are not going to help them: we never added the CLARiiONs to the scsi_dh_alua device list, so the module is not even being used in our kernel (EMC is only using the module by hacking support in themselves right now).

So the patch that was posted for this BZ just removed the module.
Comment 9 Don Zickus 2008-12-09 16:05:12 EST
in kernel-2.6.18-126.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 11 Don 2008-12-16 07:50:59 EST
I have installed the i386 header rpm. Is there anything that needs to be done besides rpm -ivh? If not, I am getting the same results as reported originally.
Comment 12 Chris Ward 2008-12-16 11:29:27 EST
~~~ Attention Partners ~~~ The *last* RHEL 5.3 Snapshot 6 is now available at partners.redhat.com. A fix for this bug should be present. Please test and update this bug with test results as soon as possible.  If the fix present in Snap6 meets all the expected requirements for this bug, please add the keyword PartnerVerified. If any new bugs are discovered, please CLONE this bug and describe the issues encountered there.
Comment 13 Don 2008-12-17 12:47:15 EST
I have loaded S6 kernel-2.6.18-126.el5 and there is no change in the behavior of the output from multipath -ll while in alua mode, it displays nothing.
Comment 14 Mike Christie 2008-12-17 13:02:27 EST
I don't know what to tell you :) As I said in comment #7, we dropped the module so you guys could hack on it and just send the final patch. I checked the 2.6.18-126.el5 kernel and that did drop the module.

In 5.2 what were you guys getting?

Is this with explicit or implicit ALUA support? You guys added implicit support to the userspace tools right?
Comment 15 Ben Marzinski 2008-12-18 13:11:11 EST
I have another question.  Just to double check this: you asked if there is anything you need to do besides installing the new kernel rpm.  You need to boot into the new kernel.  You can check if you are using the latest kernel by running the command

# uname -r

You should get

2.6.18-126.el5

If you don't see this, you need to reboot.  When the grub boot loader comes up, you need to select the 2.6.18-126.el5 kernel (if it isn't already selected).

Sorry if this was obvious.
Comment 17 Don 2008-12-22 07:59:30 EST
I am not using the rpm I loaded S6 from scratch, see comment 13
Comment 20 Tom Coughlan 2008-12-22 15:02:51 EST
We are going to try to reproduce this problem in Westford. We have a cx3 with 

  Vendor: DGC      Model: RAID 5           Rev: 0326

I believe ALUA is enabled. Are you using cx3? Are there any particular settings, or fw version, we will need to reproduce this?
Comment 21 Don 2008-12-23 08:02:57 EST
I am using a CX4-960. I have just set the failover mode to 4. I am using Red Hat 5.3 snapshot6 kernel 2.6.18-126.el5 and the suggested settings in my /etc/multipath.conf file.

# Version  : 1.0
# 
defaults {
user_friendly_names  yes
}
#
# The blacklist is the enumeration of all devices that are to be
# excluded from multipath control
blacklist {
       devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
       devnode "^hd[a-z][[0-9]*]"
       devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
}

devices {
#      Device attributed for EMC CLARiiON
        device {
                vendor                  "DGC"
                product                 "*"
                path_grouping_policy    group_by_prio
                getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
#               prio_callout            "/sbin/mpath_prio_emc /dev/%n"
                prio_callout            "mpath_prio_alua /dev/%n"
#               path_checker            emc_clariion 
                path_checker            alua  
                features                "1 queue_if_no_path"                   
#               hardware_handler        "1 emc"
                hardware_handler        "1 alua"
                failback                immediate
                no_path_retry           12
        }
}
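
As a side check on the blacklist section above, the first devnode pattern can be tested against device names with POSIX regcomp()/regexec(). This is only a sketch: it assumes the patterns are treated as POSIX extended regular expressions, and the helper name devnode_matches is hypothetical. It shows that dm-2 is excluded while sd* path devices such as sdb are not.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* First devnode pattern from the blacklist section quoted above. */
const char *BLACKLIST_PATTERN = "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*";

/* Returns 1 if devnode matches pattern, 0 if not, -1 if pattern is invalid. */
int devnode_matches(const char *pattern, const char *devnode)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && devnode != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, devnode, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this pattern, devnode_matches(BLACKLIST_PATTERN, "dm-2") returns 1 and devnode_matches(BLACKLIST_PATTERN, "sdb") returns 0, so the blacklist by itself would not hide the CLARiiON paths.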
Comment 22 Rob Evers 2008-12-23 14:33:02 EST
Created attachment 327767 [details]
/var/log/messages snippets using alua and emc in multipath.conf

Attachment is snippets from /var/log/messages on test machine in Westford.

Problem reproduced in Westford on a cx3 with rhel5.3 (2.6.18-128.el5), using the device entry from the multipath.conf in the previous comment.

Changing the 'alua' parameters to the commented out 'emc' parameters enabled 'multipath -ll' to function.

Also, following the directions in the rhel documentation using the following link, the 'multipath -ll' command functioned correctly.

http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/en-US/RHEL510/DM_Multipath/mpio_setup.html#setup_procedure.

What are the default values for the clariion for multipathd?
Comment 23 Rob Evers 2008-12-23 14:48:45 EST
A bit more digging into the multipathd code reveals what appears to be the values set for DGC below.  Note that alua is not used.  This is from hwtable.c.

        {
                .vendor        = "DGC",
                .product       = ".*",
                .bl_product    = "LUNZ",
                .getuid        = DEFAULT_GETUID,
                .getprio       = "/sbin/mpath_prio_emc /dev/%n",
                .features      = "1 queue_if_no_path",
                .hwhandler     = "1 emc",
                .selector      = DEFAULT_SELECTOR,
                .pgpolicy      = GROUP_BY_PRIO,
                .pgfailback    = -FAILBACK_IMMEDIATE,
                .rr_weight     = RR_WEIGHT_NONE,
                .no_path_retry = (300 / DEFAULT_CHECKINT),
                .minio         = DEFAULT_MINIO,
                .checker_name  = EMC_CLARIION,
        },
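
For comparison with the explicit no_path_retry 12 in the configuration from comment 21, the built-in value (300 / DEFAULT_CHECKINT) can be worked out as below. This is a rough sketch: the value 5 for the check interval is an assumption rather than something shown in this bug, the helper names are hypothetical, and treating retries times the check interval as the queueing time is an approximation.

```c
#include <assert.h>

/* Assumed check interval in seconds; the real DEFAULT_CHECKINT lives in
 * the multipath-tools headers and is not quoted in this bug. */
#define ASSUMED_CHECKINT_SECS 5

/* Retries needed to cover roughly queue_secs of queueing. */
int retries_for(int queue_secs, int checkint_secs)
{
    assert(checkint_secs > 0);
    return queue_secs / checkint_secs;
}

/* Approximate queueing time covered by a given retry count. */
int queue_secs_for(int retries, int checkint_secs)
{
    assert(retries >= 0 && checkint_secs > 0);
    return retries * checkint_secs;
}
```

Under that assumption the built-in entry works out to 60 retries (about 300 seconds of queueing), while no_path_retry 12 covers about 60 seconds.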
Comment 24 Tom Coughlan 2008-12-23 14:51:38 EST
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
The 5.3 release notes currently say:

* the SCSI device handler infrastructure (scsi_dh) has been updated, providing the following improvements:

    * a generic ALUA (asymmetric logical unit 
      access) handler has been implemented.
    * added support for LSI RDAC SCSI based
      storage devices. 

-----

Please remove the first bullet, about ALUA. Then re-write, since we will not need a list anymore.
Comment 25 Don 2008-12-24 09:44:35 EST
I commented out the references to ALUA in the multipath.conf file and I get an output from multipath -ll, but it is the same output if I have the array in failover mode 1 or 4. Is this the expected behavior?


mpath1 (3600601607b9e1e00fac24f3410bbdd11) dm-2 DGC,RAID 5
[size=1.0G][features=1 queue_if_no_path][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=2][active]
 \_ 2:0:0:0 sdb 8:16  [active][ready]
 \_ 4:0:0:0 sdj 8:144 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 2:0:1:0 sdf 8:80  [active][ready]
 \_ 4:0:1:0 sdn 8:208 [active][ready]
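
When comparing output like the above between failover modes, it can help to pull the path lines apart programmatically. The following is a minimal sketch that assumes the line layout shown in this comment (" \_ H:C:T:L dev major:minor [state]..."); the struct and function names are hypothetical, and other multipath-tools versions may print a different layout.

```c
#include <assert.h>
#include <stdio.h>

struct mpath_path {
    int host, channel, id, lun;   /* SCSI H:C:T:L address */
    char dev[32];                 /* block device name, e.g. "sdb" */
    int major, minor;             /* device number */
};

/* Returns 1 and fills *p if line is a path line, otherwise returns 0.
 * Map header lines and path-group lines ("\_ round-robin ...") are
 * rejected because the numeric H:C:T:L field fails to parse. */
int parse_path_line(const char *line, struct mpath_path *p)
{
    int n;

    assert(line != NULL && p != NULL);
    n = sscanf(line, " \\_ %d:%d:%d:%d %31s %d:%d",
               &p->host, &p->channel, &p->id, &p->lun,
               p->dev, &p->major, &p->minor);
    return n == 7;
}
```

Feeding each line of the output through parse_path_line() yields four paths here (sdb, sdj, sdf, sdn); the path-group lines that carry the prio values are skipped.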
Comment 26 Mike Christie 2009-01-05 00:29:30 EST
(In reply to comment #25)
> I commented out the references to ALUA in the multipath.conf file and I get an
> output from multipath -ll, but it is the same output if I have the array in
> failover mode 1 or 4. Is this the expected behavior?
> 
> 
> mpath1 (3600601607b9e1e00fac24f3410bbdd11) dm-2 DGC,RAID 5
> [size=1.0G][features=1 queue_if_no_path][hwhandler=1 emc][rw]


Does your box support trespass and alua at the same time? It looks like with the above info we are creating a device that will send a trespass for failover/failback.

For RHEL 5.2, did this work? Could you have the scsi target set up to do alua, then on the initiator set it up for the emc hardware handler that did trespass, and did everything just work?
Comment 27 Mike Christie 2009-01-05 00:34:38 EST
(In reply to comment #26)
> (In reply to comment #25)
> > I commented out the references to ALUA in the multipath.conf file and I get an
> > output from multipath -ll, but it is the same output if I have the array in
> > failover mode 1 or 4. Is this the expected behavior?
> > 
> > 
> > mpath1 (3600601607b9e1e00fac24f3410bbdd11) dm-2 DGC,RAID 5
> > [size=1.0G][features=1 queue_if_no_path][hwhandler=1 emc][rw]
> 
> 
> Does your box support trespass and alua at the same time? It looks like with
> the above info we are creating a device that will send a trespass for
> failover/failback.
> 
> For RHEL 5.2, did this work?


Or could you just tell us what worked in RHEL 5.2? What multipath config and how was the target configured?
Comment 28 Jerry Levy 2009-01-05 07:12:15 EST
The Clariion supports trespass and alua at the same time; as far as I know it will respond to a trespass command if sent even in alua mode. I believe using the straight ALUA parameters (with no reference to EMC / DGC at all) worked in 5.2 but we'd have to retest to verify; I don't have the setup any more.
Comment 29 Ed Goggin 2009-01-06 12:34:49 EST
(In reply to comment #27)
> (In reply to comment #26)
> > (In reply to comment #25)
> > > I commented out the references to ALUA in the multipath.conf file and I get an
> > > output from multipath -ll, but it is the same output if I have the array in
> > > failover mode 1 or 4. Is this the expected behavior?
> > > 
> > > 
> > > mpath1 (3600601607b9e1e00fac24f3410bbdd11) dm-2 DGC,RAID 5
> > > [size=1.0G][features=1 queue_if_no_path][hwhandler=1 emc][rw]
> > 
> > 
> > Does your box support trespass and alua at the same time? It looks like with
> > the above info we are creating a device that will send a trespass for
> > failover/failback.
> > 
> > For RHEL 5.2, did this work?
> Or could you just tell us what worked in RHEL 5.2? What multipath config and
> how was the target configured?

Just confirming what Jerry Levy already indicated in comment #28.  A CLARiiON configured in ALUA failover mode (4) will respond to the SCSI commands used when the same CLARiiON is configured in PNR failover mode (1), that is, inquiry VPD page 0xC0 to get path/LU info and a SCSI mode select page 0x22 to fail over an LU from one CLARiiON service processor to the other SP.

Not only will the CLARiiON support PNR and ALUA failover commands at the same time, it will allow this on a per logical unit basis.  That is, the CLARiiON will allow a hybrid configuration whereby one initiator accesses a logical unit using PNR failover mode commands and a second initiator accesses the same logical unit using ALUA failover commands.
Comment 30 Wayne Berthiaume 2009-01-27 10:51:08 EST
See comment #28 and #29
Comment 31 RHEL Product and Program Management 2009-01-27 15:43:53 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 32 RHEL Product and Program Management 2009-02-16 10:04:09 EST
Updating PM score.
Comment 33 Mike Christie 2009-07-09 12:01:50 EDT
EMC,

Is it ok to close this bug? We went all over the place with it.

The initial reason for the BZ, where we thought we might need something like blk_rq_init(), should not be an issue in our kernel. We do not copy over the rq->flags so it did not need to be cleared.

There were bugs in scsi_dh_emc and scsi_dh_alua that you might have been hitting in some testing. We fixed the scsi_dh_alua bugs, and we ended up not shipping scsi_dh_emc (still have dm_emc).

And then I think we hit the same type of multipath -ll issue as here in https://bugzilla.redhat.com/show_bug.cgi?id=482737. In comments 9 - 19 there, I think we figured out the problem.
Comment 34 Don 2009-07-09 14:27:51 EDT
I have no problem with closing this one.
Comment 35 Andrius Benokraitis 2009-07-14 09:17:54 EDT
Closing per recent comments.
