Bug 510665

Summary: megaraid sas driver in rhel5.4-beta fails to scan for SAS tape drive (HP Ultrium 4-SCSI)
Product: Red Hat Enterprise Linux 5
Reporter: Mark Goodwin <mgoodwin>
Component: kernel
Assignee: Tomas Henzl <thenzl>
Status: CLOSED ERRATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Priority: high
Version: 5.4
CC: andriusb, bo.yang, coughlan, cward, dzickus, hjia, ishida-sxc, jtluka, ltroan, mgoodwin, revers, tao, thenzl
Keywords: Regression
Target Release: 5.4
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-09-02 08:37:30 UTC
Bug Blocks: 475518

Attachments:

Description	Flags
syslog snippets from RHEL5.3, where tape device is discovered correctly	none
syslog snippets from RHEL5.4-beta, where tape device is NOT discovered	none
This patch is supposed to fix the tape drive issue reported by NEC.	none

Description Mark Goodwin 2009-07-10 06:06:27 UTC

The megaraid SAS driver in the RHEL5.4-beta kernel (2.6.18-155.el5
through 2.6.18-157.el5) fails to identify an HP Ultrium-4 SAS tape
drive.

Customer says this worked with the RHEL5.3 / 2.6.18-128 series kernels,
and it also works with the -155 kernel if they build their own driver
from the LSI website (I'm getting the source tarball they used for this
from them to compare).

So this may be a regression with the LSI driver update feature
for RHEL5.4 in BZ 475574.

Version-Release number of selected component (if applicable):
2.6.18-155.el5 (which has megaraid sas driver "00.00.04.08-RH1")

How reproducible:
always

Steps to Reproduce:
1. Install 5.4-beta and run: cat /proc/scsi/scsi
  
Actual results:
sequential access device does not show up in /proc/scsi/scsi
and the st0 device is not registered.

More info on the src diff when I get it.

Comment 1 Mark Goodwin 2009-07-10 06:12:40 UTC

The HBA is :

Host: scsi0 Channel: 02 Id: 00 Lun: 00
  Vendor: LSI      Model: MegaRAID 8708EM2 Rev: 1.40
  Type:   Direct-Access                    ANSI SCSI revision: 05


With the RHEL5.3 -128 kernel, the HBA and tape device show up as:

Host: scsi0 Channel: 00 Id: 07 Lun: 00
  Vendor: HP       Model: Ultrium 4-SCSI   Rev: U2AN
  Type:   Sequential-Access                ANSI SCSI revision: 05
Host: scsi0 Channel: 02 Id: 00 Lun: 00
  Vendor: LSI      Model: MegaRAID 8708EM2 Rev: 1.40
  Type:   Direct-Access                    ANSI SCSI revision: 05
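
One way to check for the symptom without reading the listing by eye is to look for a Sequential-Access entry in it. The following is only a minimal C sketch, not part of the driver or of any Red Hat tool. The function name listing_has_tape is made up for illustration, and the sketch assumes the text of /proc/scsi/scsi has already been read into a string.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: returns true if a /proc/scsi/scsi-style listing
 * contains at least one entry whose "Type:" field is Sequential-Access,
 * which is how a tape drive is reported in the listings above. */
bool listing_has_tape(const char *listing)
{
    static const char wanted[] = "Sequential-Access";
    const char *p = listing;

    while ((p = strstr(p, "Type:")) != NULL) {
        p += strlen("Type:");
        /* Skip the padding between the field name and its value. */
        while (*p == ' ' || *p == '\t')
            p++;
        if (strncmp(p, wanted, strlen(wanted)) == 0)
            return true;
    }
    return false;
}
```

Fed the RHEL5.3 listing above it should report a tape; fed the listing that shows only the MegaRAID 8708EM2 entry it should not.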

Comment 2 Mark Goodwin 2009-07-17 01:55:05 UTC

Apparently LSI driver versions 00.00.04.07 and 00.00.04.09 from the
LSI website do not have the issue. It is unclear at present what the
difference is between the LSI drivers and what's in the 5.4-beta driver.

Comment 5 Tom Coughlan 2009-07-22 19:30:06 UTC

Mark,

Is there anything in /var/log/messages indicating a failure when the driver loads, and scans the tape drive?  Please post the boot log. 

Bo,

Do you have any reports of trouble with this driver version configuring tapes? Have you tested this configuration? 

Tom

Comment 6 bo yang 2009-07-22 19:44:56 UTC

Tom/Mark,

I am downloading the source and will test it.  I should have the result by tomorrow.

Thanks,

Bo Yang

Comment 7 Mark Goodwin 2009-07-23 00:46:35 UTC

(In reply to comment #5)
> Mark,
> 
> Is there anything in /var/log/messages indicating a failure when the driver
> loads, and scans the tape drive?  Please post the boot log. 
> 
> Tom 

Tom & Bo, I will attach messages snippets from RHEL5.3 (where the tape was
correctly discovered and initialized), and from RHEL5.4-beta (where the
tape was not found). A few things stand out: for RHEL5.4, the only
reference to the tape device is :

Jul  7 20:03:58 localhost raidsrv[4068]: [PD:3(ID=7 SLT=8)] HP      Ultrium 4-SCSI  U2AN

which appears in the log *before* the SCSI subsystem has been initialized.
In the RHEL5.4-beta messages, there are no scsi tape (st driver) messages
at all.

So, perhaps raidsrv in RHEL5.4-beta has "claimed" the device somehow, and
the scsi_tape driver doesn't even get to look for it?

It would probably be worth enabling scsi_scan debugging on rhel5.4-beta.
That can be done by :

# echo 448 > /sys/module/scsi_mod/parameters/scsi_logging_level
or perhaps better to set this in /etc/modprobe.conf and then do
a full reboot.
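
For reference, 448 is 7 << 6, which sets the 3-bit scan logging field to its maximum verbosity. The sketch below only illustrates that arithmetic. The shift values are assumed to match the kernel's scsi_logging.h and should be checked against the kernel source in use; the helper scsi_log_field is hypothetical.

```c
#include <stdint.h>

/* Bit offsets of the 3-bit logging fields. Assumed to match the
 * kernel's scsi_logging.h; verify against your kernel tree. */
enum scsi_log_shift {
    SCSI_LOG_ERROR_SHIFT   = 0,
    SCSI_LOG_TIMEOUT_SHIFT = 3,
    SCSI_LOG_SCAN_SHIFT    = 6
};

/* Hypothetical helper: place a verbosity level (0..7) into one field
 * of the scsi_logging_level word. Levels above 7 are masked off. */
uint32_t scsi_log_field(unsigned shift, unsigned level)
{
    return (uint32_t)(level & 0x7u) << shift;
}
```

With these assumptions, scsi_log_field(SCSI_LOG_SCAN_SHIFT, 7) gives the 448 used in the echo command above.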

Cheers
-- Mark Goodwin

Comment 8 Mark Goodwin 2009-07-23 00:49:24 UTC

Created attachment 354798
syslog snippets from RHEL5.3, where tape device is discovered correctly

Comment 9 Mark Goodwin 2009-07-23 00:50:16 UTC

Created attachment 354799
syslog snippets from RHEL5.4-beta, where tape device is NOT discovered

Comment 10 Mark Goodwin 2009-07-24 00:57:12 UTC

Please note we are waiting for a response from NEC engineering - the issue
in this BZ *might* be due to the NEC raidsrv daemon, where we see :

Jul  7 20:03:58 localhost raidsrv[4068]: [PD:3(ID=7 SLT=8)] HP      Ultrium 4-SCSI  U2AN

in syslog on the RHEL5.4-beta system, *before* scsi_mod and the megaraid
driver have even loaded. We need NEC to disable raidsrv and then reboot
RHEL5.4-beta to see if this resolves the issue. Note that raidsrv is not
part of RHEL5.4.

Given the above, it's plausible that the LSI megaraid driver in RHEL5.4-beta
has not actually regressed. We need to confirm this fairly urgently.

Thanks
-- Mark Goodwin

Comment 11 Mark Goodwin 2009-07-24 06:48:23 UTC

NEC disabled the raidsrv daemon on their RHEL5.4-beta test box
and this made no difference. So much for that theory ... I have
asked them to enable scsi_scan debugging and send the logs.

-- Mark

Comment 12 bo yang 2009-07-24 13:57:26 UTC

I am testing the tape drive.  I will give an update soon.

Comment 13 Tom Coughlan 2009-07-27 22:19:25 UTC

Created attachment 355325
This patch is supposed to fix the tape drive issue reported by NEC.

This is a copy of comment: 
 
https://bugzilla.redhat.com/show_bug.cgi?id=475574#c52

That BZ is for the general driver update. That code is checked in. The BZ should stay in its current state. This BZ is for the problem scanning tape drives seen with the driver update.

Comment 14 Mark Goodwin 2009-07-27 23:48:14 UTC

A test kernel with Bo's patch was brewed yesterday and made available to
the customer: http://people.redhat.com/~mgoodwin/BZ510665

This is going on over in the IT ticket. I'll update here as soon as we
have test results, hopefully today sometime.

Thanks
-- Mark Goodwin

Comment 15 Mark Goodwin 2009-07-28 06:45:30 UTC

Tom, the customer has verified that the test kernel with Bo Yang's patch fixes
the issue reported in this BZ, i.e. the tape is now correctly discovered.
They are now running additional tests to verify nothing else has regressed.

So, as far as managing this BZ goes, it's over to engineering, PM and QA
to decide the fate of the fix for RHEL5.4. I have set CustomerVerified.

I'm going to set the IT to WoENG now.

Thanks
-- Mark Goodwin

Comment 16 RHEL Program Management 2009-07-28 13:55:39 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 17 Tomas Henzl 2009-07-28 16:07:20 UTC

Bo,
thanks for the patch. We are now very late in the 5.4 process and this patch is
rather big and comes with no explanation. It would be good to understand why each part is needed and where the problem was. Could you provide some comments for us?
I know it is painful work for you, but we do not have time to test this patch for other regressions, so at least a good understanding is necessary.

Thanks,
Tomas

Comment 18 bo yang 2009-07-28 16:35:23 UTC

Tomas,

When I went through the source, I saw that some fixes we had made for earlier issues were not included in it.  It is important to add them.  Here are the details; please let me know if you need more info.

1. A spin lock needs to be held when firing the cmd to the FW, to fix a potential system hang issue.

-	writel((frame_phys_addr | (frame_count<<1))|1,
+	unsigned long flags;
+	spin_lock_irqsave(&instance->fire_lock, flags);
+	writel((frame_phys_addr | (frame_count<<1))|1, 
 			&(regs)->inbound_queue_port);
+	spin_unlock_irqrestore(&instance->fire_lock, flags);


2. System PDs are of the DISK type.

-	if (sdev->channel < MEGASAS_MAX_PD_CHANNELS) {
+	if (sdev->channel < MEGASAS_MAX_PD_CHANNELS && sdev->type == TYPE_DISK) {

3. If the command is of the abort EVENT type, the driver must not register for EVENTs again.
-	
-	if (instance->unload == 0) {
+	if ((!cmd->abort_aen) && (instance->unload == 0 )) {


4. For SAS2 controllers, unmap_sgbuf is different from other controllers.

+		if ((instance->pdev->device == PCI_DEVICE_ID_LSI_SAS0073SKINNY) ||
+		(instance->pdev->device == PCI_DEVICE_ID_LSI_SAS0071SKINNY)) {
+			buf_h = cmd->frame->io.sgl.sge_skinny[0].phys_addr;
+
+		} else if (IS_DMA64)
 

5. Remove the extra printout.

-		printk(KERN_INFO "%s[%d]: event code 0x%04x\n", __FUNCTION__,
-			instance->host->host_no, instance->evt_detail->code);
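
Of the hunks listed above, hunk 2 looks like the one most directly tied to the missing tape drive. The sketch below is a userspace model of that condition only, assuming the branch it guards hides the device from the SCSI midlayer (the hunk does not show the branch body). The constant values are assumed from the upstream kernel headers and should be checked against the tree in use; the function names are hypothetical.

```c
#include <stdbool.h>

/* Values assumed from the upstream kernel headers; check your tree. */
#define MEGASAS_MAX_PD_CHANNELS 2   /* channels 0..1 carry physical devices */
#define TYPE_DISK 0x00              /* SCSI peripheral type: direct access */
#define TYPE_TAPE 0x01              /* SCSI peripheral type: sequential access */

/* Model of the condition before the patch: any device on a physical-device
 * channel matches, whatever its type. */
bool hidden_before_patch(unsigned channel, unsigned type)
{
    (void)type; /* the old condition did not look at the device type */
    return channel < MEGASAS_MAX_PD_CHANNELS;
}

/* Model of the condition after the patch: only disks on a physical-device
 * channel match, so a tape drive on channel 0 no longer does. */
bool hidden_after_patch(unsigned channel, unsigned type)
{
    return channel < MEGASAS_MAX_PD_CHANNELS && type == TYPE_DISK;
}
```

Under these assumptions, the HP Ultrium drive from comment 1 (channel 00, Sequential-Access) matches the old condition but not the new one, while the logical drive on channel 02 matches neither.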

Comment 19 Tom Coughlan 2009-07-28 19:47:32 UTC

(In reply to comment #18)
> Tomas,
> 
> When I went through the source, I saw that some fixes we had made for earlier
> issues were not included in it.  It is important to add them.

Bo, 

Please be aware that, if we take a patch for megaraid sas, it will be going in the 5.4 release candidate. This is the last build. There will be no chance to fix any mistakes. No field test. 

Normally, we would only take patches for critical show stoppers (like the regression reported here, where tapes are not configured). The other bugs addressed by the patch have not been reported to us during beta test. I expect that you have a better view of the risk/benefit for these, but we have not seen them. We just need to be sure you understand the situation we are in with the 5.4 end-game. Are the other changes included in the patch really worth the risk at this point? 

Tom

Comment 20 Tomas Henzl 2009-07-29 12:36:47 UTC

(In reply to comment #15)
> They are now running additional tests to verify nothing else has regressed.

Mark, have these tests finished without problems?

Thanks, Tomas

Comment 22 Tomas Henzl 2009-07-29 14:10:22 UTC

(In reply to comment #18)
> Tomas,
> 
> When I went through the source, I saw that some fixes we had made for earlier
> issues were not included in it.  It is important to add them.  Here are the
> details; please let me know if you need more info.
> 
Thanks Bo,
does your comment mean that hunks 1-5 are unrelated to the tape drive issue?
We have to take the whole patch or nothing, because the customer's testing was done on the whole patch. I don't see problems in the patch itself; it is only that we can't test it enough (see comment #19).

> 1. A spin lock needs to be held when firing the cmd to the FW, to fix a
> potential system hang issue.
Interesting that this hasn't caused problems without the lock.
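
The thread does not say exactly what races without the lock. As an illustration only, the sketch below models hunk 1 in userspace: a pthread mutex stands in for spin_lock_irqsave() on instance->fire_lock, and a plain variable stands in for the inbound_queue_port register. The struct and function names are hypothetical; only the encoding of the written word is taken from the writel() argument in the hunk.

```c
#include <pthread.h>
#include <stdint.h>

/* Userspace stand-in for the controller state; names are hypothetical. */
struct fake_instance {
    pthread_mutex_t fire_lock;            /* plays the role of instance->fire_lock */
    volatile uint32_t inbound_queue_port; /* plays the role of the MMIO register */
};

/* Same encoding as the writel() argument in hunk 1:
 * frame address, frame count shifted left by one, low bit set. */
uint32_t doorbell_word(uint32_t frame_phys_addr, uint32_t frame_count)
{
    return (frame_phys_addr | (frame_count << 1)) | 1u;
}

/* Serialize the register write so that two callers cannot interleave.
 * The real driver uses spin_lock_irqsave(); a mutex is swapped in here
 * because this is a userspace model. */
void fire_cmd(struct fake_instance *inst, uint32_t frame_phys_addr,
              uint32_t frame_count)
{
    pthread_mutex_lock(&inst->fire_lock);
    inst->inbound_queue_port = doorbell_word(frame_phys_addr, frame_count);
    pthread_mutex_unlock(&inst->fire_lock);
}
```

A caller would initialize fire_lock with pthread_mutex_init() once, then call fire_cmd() from any thread.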

Comment 23 Tomas Henzl 2009-07-29 14:11:12 UTC

Posted - http://post-office.corp.redhat.com/archives/rhkernel-list/2009-July/msg00774.html

Comment 24 Tomas Henzl 2009-07-30 08:55:30 UTC

Bo,
I forgot to ask you before: have you already posted this upstream?
If not, please do so soon. That, and some sign of upstream acceptance, are also of interest.
Thanks, Tomas

Comment 26 bo yang 2009-07-30 16:18:20 UTC

Tomas,

I will post them, possibly early next week, after I clear my current work.  You will be copied when I post them.

Thanks,

Bo Yang

Comment 27 Tomas Henzl 2009-08-02 21:32:24 UTC

Bo,
please also explain this part of your patch.
I was asked about it on our internal list.
It was not explained in comment #18.
Thanks.
@@ -3604,6 +3614,7 @@ megasas_mgmt_fw_ioctl(struct megasas_ins
 	 */
 	memcpy(cmd->frame, ioc->frame.raw, 2 * MEGAMFI_FRAME_SIZE);
 	cmd->frame->hdr.context = cmd->index;
+	cmd->frame->hdr.pad_0 = 0;
 
 	/*
 	 * The management interface between applications and the fw uses
@@ -4034,19 +4045,11 @@ megasas_aen_polling(void *arg)
 	}
 
 	if (instance->evt_detail) {
-		printk(KERN_INFO "%s[%d]: event code 0x%04x\n", __FUNCTION__,
-			instance->host->host_no, instance->evt_detail->code);
 
 		switch (instance->evt_detail->code) {
 
-		case MR_EVT_LD_CREATED:
 		case MR_EVT_PD_INSERTED:
-		case MR_EVT_LD_DELETED:
-		case MR_EVT_LD_OFFLINE:

Comment 28 bo yang 2009-08-03 18:52:52 UTC

Tomas,

1. The driver needs to clear the pad_0 field, because some of the application cmds don't clear it (the FW takes a long time to process the cmd if it is not set to 0).
  
+ cmd->frame->hdr.pad_0 = 0;

2. For the LD cases, the driver doesn't need to scan the devices because our megaraid sas application already does.

-  case MR_EVT_LD_CREATED:
   case MR_EVT_PD_INSERTED:
-  case MR_EVT_LD_DELETED:
-  case MR_EVT_LD_OFFLINE:

3. Remove the printout; the application will have all of that information.

-  printk(KERN_INFO "%s[%d]: event code 0x%04x\n", __FUNCTION__,
-   instance->host->host_no, instance->evt_detail->code);
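
Item 1 above can be sketched as follows. This is a simplified userspace model, not the driver code: the header layout is invented for illustration (the real one is defined in the driver's megaraid_sas.h), and prepare_frame is a hypothetical name. It only shows the pattern from the hunk: copy what the application supplied, then overwrite the fields the driver is responsible for.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified frame header. Only the context and pad_0
 * field names come from the hunk; the rest is made up. */
struct frame_hdr {
    uint8_t  cmd;
    uint8_t  reserved[3];
    uint32_t context;
    uint32_t pad_0;
};

/* Copy an application-supplied header, then set the driver-owned fields. */
void prepare_frame(struct frame_hdr *dst, const struct frame_hdr *from_app,
                   uint32_t cmd_index)
{
    memcpy(dst, from_app, sizeof(*dst));
    dst->context = cmd_index; /* driver-assigned tag, as in the hunk */
    dst->pad_0 = 0;           /* discard whatever the application left here */
}
```

Whatever value the application leaves in pad_0, the frame handed on has it cleared, while the other application-supplied fields pass through unchanged.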

Comment 29 Don Zickus 2009-08-05 14:08:57 UTC

in kernel-2.6.18-162.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 35 Tomas Henzl 2009-08-19 10:37:11 UTC

*** Bug 506510 has been marked as a duplicate of this bug. ***

Comment 36 errata-xmlrpc 2009-09-02 08:37:30 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html