Bug 471920

Summary: [EMC 5.4 bug] multipath -ll issues requiring blk_rq_init() in 2.6.26
Product: Red Hat Enterprise Linux 5
Reporter: Don <blood_donald>
Component: kernel
Assignee: Mike Christie <mchristi>
Status: CLOSED NOTABUG
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Priority: high
Version: 5.3
CC: agk, andriusb, berthiaume_wayne, bmarzins, bmr, bubrown, christophe.varoqui, coughlan, dwysocha, edamato, egoggin, heinzm, junichi.nomura, kueda, levy_jerome, lmb, mbroz, mchristi, mgahagan, peterm, prockai, revers, syeghiay, torel, tranlan
Keywords: OtherQA, Regression
Target Release: 5.4
Hardware: All
OS: Linux
Doc Type: Bug Fix
Doc Text:
The 5.3 release notes currently say:

* the SCSI device handler infrastructure (scsi_dh) has been updated, providing the following improvements:
    * a generic ALUA (asymmetric logical unit access) handler has been implemented.
    * added support for LSI RDAC SCSI based storage devices.

-----

Please remove the first bullet, about ALUA. Then re-write, since we will not need a list anymore.

Last Closed: 2009-07-14 13:17:54 UTC
Bug Blocks: 459808, 483701, 483784, 485920
Attachments: /var/log/messages snippets using alua and emc in multipath.conf (flags: none)

Description Don 2008-11-17 17:32:27 UTC

Description of problem:
While doing a multipath -ll with RHEL 5.3 beta I received nothing back.


I have had multiple engineers look at this, and these are the findings.


The reason you get no output back from the multipath -ll when 
the array is in ALUA mode AND the ALUA lines are active is 
probably related to the illegal CDB issue we saw yesterday 
(the "rtpg sense code 05/24/00"). Is it possible to get a 
fibre trace of that?

A Finisar trace of a multipath -ll command showed an illegal
REPORT TARGET PORT GROUPS CDB.
Bytes 2-5 should be reserved and set to 0.  Instead, it looks
as if the INQUIRY page 0x83 CDB was re-used without zeroing, as
bytes 2 and 4 contain 0x83 (page code) and 0x3C (allocation length).

The problem is in the RHEL 5.3 beta (2.6.18-120.el5) code. I pulled that code and there's no blk_rq_init() call in blk_alloc_request(), which would explain the symptoms you're seeing. James Bottomley added blk_rq_init() in 2.6.26 to fix several uninitialized buffer issues, including this one.
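
For readers who want to see how a reused buffer could produce the bytes seen in the trace, here is a minimal user-space sketch in Python. It only models CDB bytes and is not the kernel request path. The function names and the 128-byte allocation length for the second command are assumptions made for illustration; the opcodes and byte layout follow the SPC definitions of INQUIRY and REPORT TARGET PORT GROUPS.

```python
# Illustrative only: models CDB bytes in user space, not the kernel code.
INQUIRY = 0x12               # SPC opcode
MAINTENANCE_IN = 0xA3        # SPC opcode
MI_REPORT_TARGET_PGS = 0x0A  # service action: REPORT TARGET PORT GROUPS


def fill_inquiry_vpd(cdb, page, alloc_len):
    """Write a 6-byte INQUIRY (EVPD=1) into the start of cdb."""
    cdb[0] = INQUIRY
    cdb[1] = 0x01                      # EVPD bit
    cdb[2] = page                      # VPD page code
    cdb[3] = (alloc_len >> 8) & 0xFF   # allocation length, MSB
    cdb[4] = alloc_len & 0xFF          # allocation length, LSB
    cdb[5] = 0x00                      # control


def fill_rtpg(cdb, alloc_len, zero_first):
    """Write a 12-byte REPORT TARGET PORT GROUPS CDB.

    Only bytes 0, 1 and 6-9 are set explicitly, so bytes 2-5 keep
    whatever the buffer held before unless zero_first is True.
    """
    if zero_first:
        cdb[:] = bytes(len(cdb))       # clear stale contents
    cdb[0] = MAINTENANCE_IN
    cdb[1] = MI_REPORT_TARGET_PGS
    cdb[6:10] = alloc_len.to_bytes(4, "big")


def rtpg_after_inquiry(zero_first):
    """Issue INQUIRY page 0x83, then build RTPG in the same buffer."""
    cdb = bytearray(12)
    fill_inquiry_vpd(cdb, page=0x83, alloc_len=0x3C)
    fill_rtpg(cdb, alloc_len=128, zero_first=zero_first)  # 128 is arbitrary
    return bytes(cdb)


stale = rtpg_after_inquiry(zero_first=False)
clean = rtpg_after_inquiry(zero_first=True)
print("stale:", stale.hex(" "))  # bytes 2 and 4 still hold 83 and 3c
print("clean:", clean.hex(" "))  # reserved bytes 2-5 are zero
```

Without the clearing step, the reserved bytes of the second command carry the page code and allocation length of the first, which is the pattern described in the trace.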


How reproducible:
multipath -ll

Actual results:
No output returned

Expected results:
A view of my LUNs and their path information

Additional info:

Comment 2 Tom Coughlan 2008-11-17 19:40:47 UTC

(In reply to comment #0)

> The problem is in the RHEL 5.3 beta (2.6.18-120.el5) code. I pulled that code
> and there's no blk_rq_init() call in blk_alloc_request(), which would explain
> the symptoms you're seeing. James Bottomley added blk_rq_init() in 2.6.26 to
> fix several uninitialized buffer issues, including this one.

Mike, is there an upstream patch that we need to fix this? 

Is this a regression in 5.3, or has RHEL 5 always had this problem?

Comment 3 Mike Christie 2008-11-18 19:44:11 UTC

Not a regression. This is new code in 5.3 and the setup for requests in RHEL 5 was just different than it is upstream now. I do not think we want to go the upstream route for this one. I had thought for RHEL5 that the patch Jerry Levy (levy_jerome) had sent to linux-scsi was fine.

Comment 5 Tom Coughlan 2008-11-25 18:49:16 UTC

(In reply to comment #3)
> Not a regression. This is new code in 5.3 and the setup for requests in RHEL 5
> was just different than it is upstream now.

Well, it appears to be a regression in terms of the observed behavior... 
multipath -ll used to work in 5.2 and does not in 5.3. We'll need to get this approved for 5.3.

Comment 7 Mike Christie 2008-12-05 05:38:21 UTC

Since EMC is the only one working on this and they are shooting for RHEL 5.4, I am going to remove the module for now. When the RHEL 5.4 kernel window is closer, EMC can just send exactly what they want and not worry about these updates, which are not going to help them: we never added the CLARiiONs to the scsi_dh_alua device list, so the module is not even being used in our kernel (EMC is only using the module by hacking support in themselves right now).

So the patch that was posted for this BZ just removed the module.

Comment 9 Don Zickus 2008-12-09 21:05:12 UTC

in kernel-2.6.18-126.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 11 Don 2008-12-16 12:50:59 UTC

I have installed the i386 header rpm. Is there anything that needs to be done besides rpm -ivh? If not, I am getting the same results as originally reported.

Comment 12 Chris Ward 2008-12-16 16:29:27 UTC

~~~ Attention Partners ~~~ The *last* RHEL 5.3 Snapshot 6 is now available at partners.redhat.com. A fix for this bug should be present. Please test and update this bug with test results as soon as possible.  If the fix present in Snap6 meets all the expected requirements for this bug, please add the keyword PartnerVerified. If any new bugs are discovered, please CLONE this bug and describe the issues encountered there.

Comment 13 Don 2008-12-17 17:47:15 UTC

I have loaded S6 (kernel-2.6.18-126.el5) and there is no change in the output of multipath -ll while in ALUA mode: it displays nothing.

Comment 14 Mike Christie 2008-12-17 18:02:27 UTC

I don't know what to tell you :) As I said in comment #7, we dropped the module so you guys could hack on it and just send the final patch. I checked the 2.6.18-126.el5 kernel and it did drop the module.

In 5.2 what were you guys getting?

Is this with explicit or implicit ALUA support? You guys added implicit support to the userspace tools right?

Comment 15 Ben Marzinski 2008-12-18 18:11:11 UTC

I have another question.  Just to double check: you asked if there is anything you need to do besides installing the new kernel rpm.  You need to boot into the new kernel.  You can check whether you are running the latest kernel with the command

# uname -r

You should get

2.6.18-126.el5

If you don't see this, you need to reboot.  When the grub boot loader comes up, you need to select the 2.6.18-126.el5 kernel (if it isn't already selected).

Sorry if this was obvious.
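
The same check can be scripted. The sketch below is a hypothetical helper, not part of any Red Hat tooling: it compares the running kernel release, as reported by Python's platform.release() (the same string as uname -r on Linux), against the test kernel named in this bug. The exact-match comparison is an assumption; variant kernels may report a different release string.

```python
import platform

EXPECTED_RELEASE = "2.6.18-126.el5"  # test kernel named in this bug


def running_expected_kernel(expected, running=None):
    """Return True if the running kernel release equals `expected`.

    `running` defaults to the live value of platform.release();
    pass a string explicitly to check a value captured elsewhere.
    """
    if running is None:
        running = platform.release()
    return running == expected


if running_expected_kernel(EXPECTED_RELEASE):
    print("Running the expected kernel.")
else:
    print("Running %s; reboot into %s." % (platform.release(), EXPECTED_RELEASE))
```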

Comment 17 Don 2008-12-22 12:59:30 UTC

I am not using the rpm; I loaded S6 from scratch. See comment 13.

Comment 20 Tom Coughlan 2008-12-22 20:02:51 UTC

We are going to try to reproduce this problem in Westford. We have a cx3 with 

  Vendor: DGC      Model: RAID 5           Rev: 0326

I believe ALUA is enabled. Are you using cx3? Are there any particular settings, or fw version, we will need to reproduce this?

Comment 21 Don 2008-12-23 13:02:57 UTC

I am using a CX4-960. I have just set the failover mode to 4. I am using Red Hat 5.3 snapshot 6, kernel 2.6.18-126.el5, and the suggested settings in my /etc/multipath.conf file.

# Version  : 1.0
# 
defaults {
user_friendly_names  yes
}
#
# The blacklist is the enumeration of all devices that are to be
# excluded from multipath control
blacklist {
       devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
       devnode "^hd[a-z][[0-9]*]"
       devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
}

devices {
#      Device attributes for EMC CLARiiON
        device {
                vendor                  "DGC"
                product                 "*"
                path_grouping_policy    group_by_prio
                getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
#               prio_callout            "/sbin/mpath_prio_emc /dev/%n"
                prio_callout            "mpath_prio_alua /dev/%n"
#               path_checker            emc_clariion 
                path_checker            alua  
                features                "1 queue_if_no_path"                   
#               hardware_handler        "1 emc"
                hardware_handler        "1 alua"
                failback                immediate
                no_path_retry           12
        }
}

Comment 22 Rob Evers 2008-12-23 19:33:02 UTC

Created attachment 327767 [details]
/var/log/messages snippets using alua and emc in multipath.conf

The attachment contains snippets from /var/log/messages on the test machine in Westford.

Problem reproduced in Westford on a cx3 with RHEL 5.3 (2.6.18-128.el5), using the device entry from the multipath.conf in the previous comment.

Changing the 'alua' parameters to the commented out 'emc' parameters enabled 'multipath -ll' to function.

Also, following the directions in the rhel documentation using the following link, the 'multipath -ll' command functioned correctly.

http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/en-US/RHEL510/DM_Multipath/mpio_setup.html#setup_procedure.

What are the default values for the clariion for multipathd?

Comment 23 Rob Evers 2008-12-23 19:48:45 UTC

A bit more digging into the multipathd code reveals what appears to be the values set for DGC below.  Note that alua is not used.  This is from hwtable.c.

        {
                .vendor        = "DGC",
                .product       = ".*",
                .bl_product    = "LUNZ",
                .getuid        = DEFAULT_GETUID,
                .getprio       = "/sbin/mpath_prio_emc /dev/%n",
                .features      = "1 queue_if_no_path",
                .hwhandler     = "1 emc",
                .selector      = DEFAULT_SELECTOR,
                .pgpolicy      = GROUP_BY_PRIO,
                .pgfailback    = -FAILBACK_IMMEDIATE,
                .rr_weight     = RR_WEIGHT_NONE,
                .no_path_retry = (300 / DEFAULT_CHECKINT),
                .minio         = DEFAULT_MINIO,
                .checker_name  = EMC_CLARIION,
        },
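
To make the difference between this built-in DGC entry and the device section posted in comment 21 easier to see, here is a small Python sketch that compares the two. The dictionary keys are informal labels chosen for this example, not multipath-tools identifiers, and rendering the EMC_CLARIION checker constant as "emc_clariion" is an assumption based on the commented-out path_checker line in comment 21.

```python
# Values transcribed from this thread; keys are informal labels.
BUILTIN_DGC = {
    "hardware_handler": "1 emc",
    "prio_callout": "/sbin/mpath_prio_emc /dev/%n",
    "path_checker": "emc_clariion",   # assumed spelling of EMC_CLARIION
    "features": "1 queue_if_no_path",
    "path_grouping_policy": "group_by_prio",
}

REPORTED_CONF = {
    "hardware_handler": "1 alua",
    "prio_callout": "mpath_prio_alua /dev/%n",
    "path_checker": "alua",
    "features": "1 queue_if_no_path",
    "path_grouping_policy": "group_by_prio",
}


def differing_keys(a, b):
    """Return the sorted keys whose values differ between two settings dicts."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


for key in differing_keys(BUILTIN_DGC, REPORTED_CONF):
    print("%-18s built-in=%r reported=%r"
          % (key, BUILTIN_DGC.get(key), REPORTED_CONF.get(key)))
```

Only the three ALUA-related settings differ, which matches the observation that switching them back to the emc values made multipath -ll work.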

Comment 24 Tom Coughlan 2008-12-23 19:51:38 UTC

Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
The 5.3 release notes currently say:

* the SCSI device handler infrastructure (scsi_dh) has been updated, providing the following improvements:

    * a generic ALUA (asymmetric logical unit 
      access) handler has been implemented.
    * added support for LSI RDAC SCSI based
      storage devices. 

-----

Please remove the first bullet, about ALUA. Then re-write, since we will not need a list anymore.

Comment 25 Don 2008-12-24 14:44:35 UTC

I commented out the references to ALUA in the multipath.conf file and I get an output from multipath -ll, but it is the same output if I have the array in failover mode 1 or 4. Is this the expected behavior?


mpath1 (3600601607b9e1e00fac24f3410bbdd11) dm-2 DGC,RAID 5
[size=1.0G][features=1 queue_if_no_path][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=2][active]
 \_ 2:0:0:0 sdb 8:16  [active][ready]
 \_ 4:0:0:0 sdj 8:144 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 2:0:1:0 sdf 8:80  [active][ready]
 \_ 4:0:1:0 sdn 8:208 [active][ready]
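
When comparing output between failover modes, it can help to reduce the listing to its path groups. The following Python sketch parses the output above into groups with their priority, state and member devices. The regular expressions are written against this one sample only and are an assumption about the format; other multipath-tools versions may print different layouts.

```python
import re

SAMPLE = r"""mpath1 (3600601607b9e1e00fac24f3410bbdd11) dm-2 DGC,RAID 5
[size=1.0G][features=1 queue_if_no_path][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=2][active]
 \_ 2:0:0:0 sdb 8:16  [active][ready]
 \_ 4:0:0:0 sdj 8:144 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 2:0:1:0 sdf 8:80  [active][ready]
 \_ 4:0:1:0 sdn 8:208 [active][ready]
"""

# Path-group line: starts at column 0 with "\_ <selector> <n> [prio=N][state]".
GROUP_RE = re.compile(r"^\\_ (\S+) \d+ \[prio=(\d+)\]\[(\w+)\]")
# Path line: indented by one space, "\_ H:C:T:L <dev> <maj:min> [..][..]".
PATH_RE = re.compile(r"^ \\_ (\d+:\d+:\d+:\d+) (\w+) (\d+:\d+)\s+\[(\w+)\]\[(\w+)\]")


def parse_path_groups(text):
    """Return a list of path groups, each with prio, state and device names."""
    groups = []
    for line in text.splitlines():
        g = GROUP_RE.match(line)
        if g:
            groups.append({"selector": g.group(1),
                           "prio": int(g.group(2)),
                           "state": g.group(3),
                           "paths": []})
            continue
        p = PATH_RE.match(line)
        if p and groups:
            groups[-1]["paths"].append(p.group(2))
    return groups


for group in parse_path_groups(SAMPLE):
    print(group["prio"], group["state"], group["paths"])
```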

Comment 26 Mike Christie 2009-01-05 05:29:30 UTC

(In reply to comment #25)
> I commented out the references to ALUA in the multipath.conf file and I get an
> output from multipath -ll, but it is the same output if I have the array in
> failover mode 1 or 4. Is this the expected behavior?
> 
> 
> mpath1 (3600601607b9e1e00fac24f3410bbdd11) dm-2 DGC,RAID 5
> [size=1.0G][features=1 queue_if_no_path][hwhandler=1 emc][rw]


Does your box support trespass and ALUA at the same time? It looks like with the above info we are creating a device that will send a trespass for failover/failback.

For RHEL 5.2, did this work? Could you have the SCSI target set up to do ALUA, then set up the initiator for the emc hardware handler that did trespass, and did everything just work?

Comment 27 Mike Christie 2009-01-05 05:34:38 UTC

(In reply to comment #26)
> (In reply to comment #25)
> > I commented out the references to ALUA in the multipath.conf file and I get an
> > output from multipath -ll, but it is the same output if I have the array in
> > failover mode 1 or 4. Is this the expected behavior?
> > 
> > 
> > mpath1 (3600601607b9e1e00fac24f3410bbdd11) dm-2 DGC,RAID 5
> > [size=1.0G][features=1 queue_if_no_path][hwhandler=1 emc][rw]
> 
> 
> Does your box support trespass and ALUA at the same time? It looks like with
> the above info we are creating a device that will send a trespass for
> failover/failback.
> 
> For RHEL 5.2, did this work?


Or could you just tell us what worked in RHEL 5.2? What multipath config and how was the target configured?

Comment 28 Jerry Levy 2009-01-05 12:12:15 UTC

The Clariion supports trespass and alua at the same time; as far as I know it will respond to a trespass command if sent even in alua mode. I believe using the straight ALUA parameters (with no reference to EMC / DGC at all) worked in 5.2 but we'd have to retest to verify; I don't have the setup any more.

Comment 29 Ed Goggin 2009-01-06 17:34:49 UTC

(In reply to comment #27)
> (In reply to comment #26)
> > (In reply to comment #25)
> > > I commented out the references to ALUA in the multipath.conf file and I get an
> > > output from multipath -ll, but it is the same output if I have the array in
> > > failover mode 1 or 4. Is this the expected behavior?
> > > 
> > > 
> > > mpath1 (3600601607b9e1e00fac24f3410bbdd11) dm-2 DGC,RAID 5
> > > [size=1.0G][features=1 queue_if_no_path][hwhandler=1 emc][rw]
> > 
> > 
> > Does your box support trespass and ALUA at the same time? It looks like with
> > the above info we are creating a device that will send a trespass for
> > failover/failback.
> > 
> > For RHEL 5.2, did this work?
> Or could you just tell us what worked in RHEL 5.2? What multipath config and
> how was the target configured?

Just confirming what Jerry Levy already indicated in comment #28.  A CLARiiON configured in ALUA failover mode (4) will respond to the SCSI commands used when the same CLARiiON is configured in PNR failover mode (1), that is, an inquiry of VPD page 0xC0 to get path/LU info and a SCSI mode select of page 0x22 to fail over an LU from one CLARiiON service processor (SP) to the other.

Not only will the CLARiiON support PNR and ALUA failover commands at the same time, it will also allow this on a per logical unit basis.  That is, the CLARiiON will allow a hybrid configuration whereby one initiator accesses a logical unit using PNR failover mode commands and a second initiator accesses the same logical unit using ALUA failover commands.

Comment 30 Wayne Berthiaume 2009-01-27 15:51:08 UTC

See comment #28 and #29

Comment 31 RHEL Program Management 2009-01-27 20:43:53 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 32 RHEL Program Management 2009-02-16 15:04:09 UTC

Updating PM score.

Comment 33 Mike Christie 2009-07-09 16:01:50 UTC

EMC,

Is it ok to close this bug? We went all over the place with it.

The initial reason for the BZ, where we thought we might need something like blk_rq_init(), should not be an issue in our kernel. We do not copy over the rq->flags, so it did not need to be cleared.

There were bugs in scsi_dh_emc and scsi_dh_alua that you might have been hitting in some testing. We fixed the scsi_dh_alua bugs, and we ended up not shipping scsi_dh_emc (still have dm_emc).

And then I think we hit the same type of multipath -ll issue as this one in https://bugzilla.redhat.com/show_bug.cgi?id=482737. In comments 9 - 19, I think we figured out the problem.

Comment 34 Don 2009-07-09 18:27:51 UTC

I have no problem with closing this one.

Comment 35 Andrius Benokraitis 2009-07-14 13:17:54 UTC

Closing per recent comments.