Bug 510665 - megaraid sas driver in rhel5.4-beta fails to scan for SAS tape drive (HP Ultrium 4-SCSI)
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 5.4
Assigned To: Tomas Henzl
QA Contact: Red Hat Kernel QE team
Keywords: Regression
Duplicates: 506510
Blocks: 475518
Reported: 2009-07-10 02:06 EDT by Mark Goodwin
Modified: 2010-10-23 06:40 EDT
Doc Type: Bug Fix
Last Closed: 2009-09-02 04:37:30 EDT

Attachments
syslog snippets from RHEL5.3, where tape device is discovered correctly (3.32 KB, application/octet-stream)
2009-07-22 20:49 EDT, Mark Goodwin
syslog snippets from RHEL5.4-beta, where tape device is NOT discovered (10.38 KB, application/octet-stream)
2009-07-22 20:50 EDT, Mark Goodwin
This patch is supposed to fix the tape drive issue reported by NEC. (5.15 KB, patch)
2009-07-27 18:19 EDT, Tom Coughlan

External Trackers
Tracker: Red Hat Product Errata
ID: RHSA-2009:1243
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.4 kernel security and bug fix update
Last Updated: 2009-09-01 04:53:34 EDT

Description Mark Goodwin 2009-07-10 02:06:27 EDT
The megaraid SAS driver in the RHEL5.4-beta kernel (2.6.18-155.el5
through 2.6.18-157.el5) fails to identify an HP Ultrium-4 SAS tape
drive.

Customer says this worked for RHEL5.3 / 2.6.18-128 series kernels,
and also works with the -155 kernel if they build their own driver
from the LSI website (I'm getting the src tarball they used for this
from them to compare).

So this may be a regression with the LSI driver update feature
for RHEL5.4 in BZ 475574.

Version-Release number of selected component (if applicable):
2.6.18-155.el5 (which has megaraid sas driver "00.00.04.08-RH1")

How reproducible:
always

Steps to Reproduce:
1. Install 5.4-beta and run: cat /proc/scsi/scsi
  
Actual results:
sequential access device does not show up in /proc/scsi/scsi
and the st0 device is not registered.

More info on the src diff when I get it.
Comment 1 Mark Goodwin 2009-07-10 02:12:40 EDT
The HBA is :

Host: scsi0 Channel: 02 Id: 00 Lun: 00
  Vendor: LSI      Model: MegaRAID 8708EM2 Rev: 1.40
  Type:   Direct-Access                    ANSI SCSI revision: 05


With the RHEL5.3 -128 kernel, the HBA and tape device show up as:

Host: scsi0 Channel: 00 Id: 07 Lun: 00
  Vendor: HP       Model: Ultrium 4-SCSI   Rev: U2AN
  Type:   Sequential-Access                ANSI SCSI revision: 05
Host: scsi0 Channel: 02 Id: 00 Lun: 00
  Vendor: LSI      Model: MegaRAID 8708EM2 Rev: 1.40
  Type:   Direct-Access                    ANSI SCSI revision: 05
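
The symptom can also be checked programmatically by looking for a Sequential-Access entry in the listing. The following is only a sketch: the function name count_tape_entries is made up for illustration, and it assumes the text is laid out like the /proc/scsi/scsi output quoted above.

```c
#include <string.h>

/*
 * Count entries whose "Type:" field is Sequential-Access in a buffer
 * laid out like /proc/scsi/scsi. Hypothetical helper, not part of any
 * Red Hat or LSI tool.
 */
int count_tape_entries(const char *listing)
{
    const char *tape = "Sequential-Access";
    const char *p = listing;
    int count = 0;

    while ((p = strstr(p, "Type:")) != NULL) {
        p += strlen("Type:");
        /* Skip the padding between the label and the value. */
        while (*p == ' ' || *p == '\t')
            p++;
        if (strncmp(p, tape, strlen(tape)) == 0)
            count++;
    }
    return count;
}
```

Given the RHEL5.3 listing above, this would count one tape entry; given a listing with only the MegaRAID 8708EM2 entry, it would count none.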
Comment 2 Mark Goodwin 2009-07-16 21:55:05 EDT
Apparently LSI driver versions 00.00.04.07 and 00.00.04.09 from the
LSI website do not have the issue. Unclear at present what the difference
is between the LSI drivers and what's in the 5.4-beta driver.
Comment 5 Tom Coughlan 2009-07-22 15:30:06 EDT
Mark,

Is there anything in /var/log/messages indicating a failure when the driver loads, and scans the tape drive?  Please post the boot log. 

Bo,

Do you have any reports of trouble with this driver version configuring tapes? Have you tested this configuration? 

Tom
Comment 6 bo yang 2009-07-22 15:44:56 EDT
Tom/Mark,

I am downloading the src and testing it.  I should have the result by tomorrow.

Thanks,

Bo Yang
Comment 7 Mark Goodwin 2009-07-22 20:46:35 EDT
(In reply to comment #5)
> Mark,
> 
> Is there anything in /var/log/messages indicating a failure when the driver
> loads, and scans the tape drive?  Please post the boot log. 
> 
> Tom 

Tom & Bo, I will attach messages snippets from RHEL5.3 (where the tape was
correctly discovered and initialized), and from RHEL5.4-beta (where the
tape was not found). A few things stand out: For RHEL5.4, the only
reference to the tape device is :

Jul  7 20:03:58 localhost raidsrv[4068]: [PD:3(ID=7 SLT=8)] HP      Ultrium 4-SCSI  U2AN

which appears in the log *before* the SCSI subsystem has been initialized.
In the RHEL5.4-beta messages, there are no scsi tape (st driver) messages
at all.

So, perhaps raidsrv in RHEL5.4-beta has "claimed" the device somehow, and
the scsi_tape driver doesn't even get to look for it?

It would probably be worth enabling scsi_scan debugging on rhel5.4-beta.
That can be done by :

# echo 448 > /sys/module/scsi_mod/parameters/scsi_logging_level
or perhaps better to set this in /etc/modprobe.conf and then do
a full reboot.
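
For anyone wondering where 448 comes from, here is a small sketch of the arithmetic. It assumes the layout I recall from drivers/scsi/scsi_logging.h, where each logging facility gets a 3-bit field and the scan facility starts at bit 6; please check those values against your kernel source. The function name is made up for illustration.

```c
/* Assumed from drivers/scsi/scsi_logging.h; verify against your tree. */
#define SCSI_LOG_SCAN_SHIFT 6     /* bit offset of the scan facility */
#define SCSI_LOG_FIELD_MASK 0x7u  /* each facility is a 3-bit level, 0..7 */

/*
 * Build a scsi_logging_level value that enables only scan logging at
 * the given verbosity. Hypothetical helper for illustration.
 */
unsigned int scsi_scan_logging_level(unsigned int verbosity)
{
    return (verbosity & SCSI_LOG_FIELD_MASK) << SCSI_LOG_SCAN_SHIFT;
}
```

With the maximum verbosity of 7 this gives 7 << 6 = 448, i.e. full scan logging with every other facility left at 0.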

Cheers
-- Mark Goodwin
Comment 8 Mark Goodwin 2009-07-22 20:49:24 EDT
Created attachment 354798 [details]
syslog snippets from RHEL5.3, where tape device is discovered correctly
Comment 9 Mark Goodwin 2009-07-22 20:50:16 EDT
Created attachment 354799 [details]
syslog snippets from RHEL5.4-beta, where tape device is NOT discovered
Comment 10 Mark Goodwin 2009-07-23 20:57:12 EDT
Please note we are waiting for a response from NEC engineering - the issue
in this BZ *might* be due to the NEC raidsrv daemon, where we see :

Jul  7 20:03:58 localhost raidsrv[4068]: [PD:3(ID=7 SLT=8)] HP      Ultrium 4-SCSI  U2AN

in syslog on the RHEL5.4-beta system, *before* scsi_mod and the megaraid
driver have even loaded. We need NEC to disable raidsrv and then reboot
RHEL5.4-beta to see if this resolves the issue. Note that raidsrv is not
part of RHEL5.4.

Given the above, it's plausible that the LSI megaraid driver in RHEL5.4-beta
has not actually regressed. We need to confirm this fairly urgently.

Thanks
-- Mark Goodwin
Comment 11 Mark Goodwin 2009-07-24 02:48:23 EDT
NEC disabled the raidsrv daemon on their RHEL5.4-beta test box
and this made no difference. So much for that theory ... I have
asked for scsi_scan debugging to be enabled and to send the logs.

-- Mark
Comment 12 bo yang 2009-07-24 09:57:26 EDT
I am testing the tape drive.  I will give an update soon.
Comment 13 Tom Coughlan 2009-07-27 18:19:25 EDT
Created attachment 355325 [details]
This patch is supposed to fix the tape drive issue reported by NEC.

This is a copy of comment: 
 
https://bugzilla.redhat.com/show_bug.cgi?id=475574#c52

That BZ is for the general driver update. That code is checked in. The BZ should stay in its current state. This BZ is for the problem scanning tape drives seen with the driver update.
Comment 14 Mark Goodwin 2009-07-27 19:48:14 EDT
A test kernel with Bo's patch was brewed yesterday and made available to
the customer: http://people.redhat.com/~mgoodwin/BZ510665

This is going on over in the IT ticket. I'll update here as soon as we
have test results, hopefully today sometime.

Thanks
-- Mark Goodwin
Comment 15 Mark Goodwin 2009-07-28 02:45:30 EDT
Tom, the customer has verified the test kernel with Bo Yang's patch fixes
the issue reported in this BZ, i.e. the tape is now correctly discovered.
They are now running additional tests to verify nothing else has regressed.

So, as far as managing this BZ goes, it's over to engineering, PM and QA
to decide the fate of the fix for RHEL5.4. I have set CustomerVerified.

I'm going to set the IT to WoENG now.

Thanks
-- Mark Goodwin
Comment 16 RHEL Product and Program Management 2009-07-28 09:55:39 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 17 Tomas Henzl 2009-07-28 12:07:20 EDT
Bo,
thanks for the patch. We are now very late in the 5.4 process, and this patch is
rather big and comes with no explanation. It would be good to understand why each part is needed and where the problem was. Could you provide some comments for us?
I know this is painful work for you, but we do not have time to test this patch for further regressions, so we need at least a good understanding of it.

Thanks,
Tomas
Comment 18 bo yang 2009-07-28 12:35:23 EDT
Tomas,

When I went through the src, I found some of our fixes that had not been included in it.  It is important to add them in.  Here are the details; please let me know if you need more info.

1. Need to add the spin lock when firing the cmd to FW, to fix a potential system hang issue.

-	writel((frame_phys_addr | (frame_count<<1))|1,
+	unsigned long flags;
+	spin_lock_irqsave(&instance->fire_lock, flags);
+	writel((frame_phys_addr | (frame_count<<1))|1, 
 			&(regs)->inbound_queue_port);
+	spin_unlock_irqrestore(&instance->fire_lock, flags);


2. System PDs are of the DISK type (TYPE_DISK).

-	if (sdev->channel < MEGASAS_MAX_PD_CHANNELS) {
+	if (sdev->channel < MEGASAS_MAX_PD_CHANNELS && sdev->type == TYPE_DISK) {

3. If it is an aborted EVENT cmd, the driver must not register for the EVENT again.
-	
-	if (instance->unload == 0) {
+	if ((!cmd->abort_aen) && (instance->unload == 0 )) {


4. For SAS2 controllers, unmap_sgbuf is different from other controllers.

+		if ((instance->pdev->device == PCI_DEVICE_ID_LSI_SAS0073SKINNY) ||
+		(instance->pdev->device == PCI_DEVICE_ID_LSI_SAS0071SKINNY)) {
+			buf_h = cmd->frame->io.sgl.sge_skinny[0].phys_addr;
+
+		} else if (IS_DMA64)
 

5. Take out the extra printout.

-		printk(KERN_INFO "%s[%d]: event code 0x%04x\n", __FUNCTION__,
-			instance->host->host_no, instance->evt_detail->code);
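
Of the five hunks, item 2 is the only one that looks at the device type, so it seems the most likely to bear on the tape drive, although nothing in this thread confirms that. The sketch below models just the condition before and after the change. The function names are made up, the value 2 for MEGASAS_MAX_PD_CHANNELS is an assumption, and what the driver does inside the branch is not shown in the hunk, so it is not modelled here.

```c
#include <stdbool.h>

#define TYPE_DISK 0x00  /* SCSI peripheral device type: direct-access */
#define TYPE_TAPE 0x01  /* SCSI peripheral device type: sequential-access */

/* Assumed value; check megaraid_sas.h in your tree. */
#define MEGASAS_MAX_PD_CHANNELS 2

/* Condition before the patch: any device on a PD channel takes the branch. */
bool pd_branch_taken_old(unsigned int channel, unsigned int type)
{
    (void)type; /* the old test ignores the device type */
    return channel < MEGASAS_MAX_PD_CHANNELS;
}

/* Condition after the patch: only disks on a PD channel take the branch. */
bool pd_branch_taken_new(unsigned int channel, unsigned int type)
{
    return channel < MEGASAS_MAX_PD_CHANNELS && type == TYPE_DISK;
}
```

For the tape in comment 1 (Channel 00, Sequential-Access), the old condition is true and the new one is false, so the tape no longer takes the system-PD path. The MegaRAID logical drive on Channel 02 takes neither.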
Comment 19 Tom Coughlan 2009-07-28 15:47:32 EDT
(In reply to comment #18)
> Tomas,
> 
> When I go through the src, I saw some of the src which we fixed the issues, but
> didn't included to the src.  It is important to add them in:

Bo, 

Please be aware that, if we take a patch for megaraid sas, it will be going in the 5.4 release candidate. This is the last build. There will be no chance to fix any mistakes. No field test. 

Normally, we would only take patches for critical show stoppers (like the regression reported here, where tapes are not configured). The other bugs addressed by the patch have not been reported to us during beta test. I expect that you have a better view of the risk/benefit for these, but we have not seen them. We just need to be sure you understand the situation we are in with the 5.4 end-game. Are the other changes included in the patch really worth the risk at this point? 

Tom
Comment 20 Tomas Henzl 2009-07-29 08:36:47 EDT
(In reply to comment #15)
> They are now running additional tests to verify nothing else has regressed.

Mark, have these tests now finished without problems?

Thanks, Tomas
Comment 22 Tomas Henzl 2009-07-29 10:10:22 EDT
(In reply to comment #18)
> Tomas,
> 
> When I go through the src, I saw some of the src which we fixed the issues, but
> didn't included to the src.  It is important to add them in:  Here are the
> details:  Please let me know, or if you need more info?
> 
Thanks Bo, 
does your comment mean that hunks 1-5 are unrelated to the tape drive issue?
We have to take the whole patch or nothing, because that is what the customer tested. I don't see problems in the patch itself; it is only that we can't test it enough (see comment #19).

> 1. Need add the spin lock when fire the cmd to FW to fix the potenial system
> hang issue.
Interesting that this hasn't caused problems without the lock.
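
For reference, the shape of hunk 1 can be sketched in user space. This is an analogy only: a C11 atomic_flag stands in for instance->fire_lock, a plain variable stands in for the inbound_queue_port register, and there is no user-space equivalent of the interrupt save/restore that spin_lock_irqsave performs. The struct and function names are made up for illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the driver instance; not the real struct megasas_instance. */
struct instance_sketch {
    atomic_flag fire_lock;                 /* stands in for instance->fire_lock */
    volatile uint32_t inbound_queue_port;  /* stands in for the MMIO register */
};

/* Same expression as in the hunk: address, frame count in bits 1 and up, bit 0 set. */
uint32_t encode_doorbell(uint32_t frame_phys_addr, uint32_t frame_count)
{
    return (frame_phys_addr | (frame_count << 1)) | 1;
}

/* Serialize the register write so two callers cannot interleave. */
void fire_cmd_sketch(struct instance_sketch *inst,
                     uint32_t frame_phys_addr, uint32_t frame_count)
{
    while (atomic_flag_test_and_set_explicit(&inst->fire_lock,
                                             memory_order_acquire))
        ; /* spin until the lock is free */

    inst->inbound_queue_port = encode_doorbell(frame_phys_addr, frame_count);

    atomic_flag_clear_explicit(&inst->fire_lock, memory_order_release);
}
```

For example, a frame at 0x1000 with a frame count of 3 would be posted as 0x1007.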
Comment 24 Tomas Henzl 2009-07-30 04:55:30 EDT
Bo,
I forgot to ask you before: have you already posted this upstream?
If not, please do so soon. This, and some sign of upstream acceptance, is also of interest.
Thanks, Tomas
Comment 26 bo yang 2009-07-30 12:18:20 EDT
Tomas,

I will post them, possibly early next week, after I clear all my current work.  You will be copied when I post them.

Thanks,

Bo Yang
Comment 27 Tomas Henzl 2009-08-02 17:32:24 EDT
Bo,
please also explain this part of your patch.
I was asked about it on our internal list.
It was not explained in comment #18.
Thanks.
@@ -3604,6 +3614,7 @@ megasas_mgmt_fw_ioctl(struct megasas_ins
 	 */
 	memcpy(cmd->frame, ioc->frame.raw, 2 * MEGAMFI_FRAME_SIZE);
 	cmd->frame->hdr.context = cmd->index;
+	cmd->frame->hdr.pad_0 = 0;
 
 	/*
 	 * The management interface between applications and the fw uses
@@ -4034,19 +4045,11 @@ megasas_aen_polling(void *arg)
 	}
 
 	if (instance->evt_detail) {
-		printk(KERN_INFO "%s[%d]: event code 0x%04x\n", __FUNCTION__,
-			instance->host->host_no, instance->evt_detail->code);
 
 		switch (instance->evt_detail->code) {
 
-		case MR_EVT_LD_CREATED:
 		case MR_EVT_PD_INSERTED:
-		case MR_EVT_LD_DELETED:
-		case MR_EVT_LD_OFFLINE:
Comment 28 bo yang 2009-08-03 14:52:52 EDT
Tomas,

1. The driver needs to clear the pad_0 field, because some of the application cmds don't clear it (FW takes a long time to process the cmd if it is not set to 0).
  
+ cmd->frame->hdr.pad_0 = 0;

2. For the LD cases, the driver doesn't need to scan the devices because our megaraid sas application already does.

-  case MR_EVT_LD_CREATED:
   case MR_EVT_PD_INSERTED:
-  case MR_EVT_LD_DELETED:
-  case MR_EVT_LD_OFFLINE:

3. Take out the printout; the application will have all of that information.

-  printk(KERN_INFO "%s[%d]: event code 0x%04x\n", __FUNCTION__,
-   instance->host->host_no, instance->evt_detail->code);
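
Item 1 amounts to sanitizing a frame that was copied in from an application before handing it to FW. A simplified sketch follows. The header layout here is a made-up stand-in rather than the real megasas frame header, FRAME_SIZE is an assumed placeholder for MEGAMFI_FRAME_SIZE, and the sketch copies one frame where the driver copies 2 * MEGAMFI_FRAME_SIZE.

```c
#include <stdint.h>
#include <string.h>

#define FRAME_SIZE 64  /* assumed placeholder for MEGAMFI_FRAME_SIZE */

/* Simplified, hypothetical header: only the fields the hunk touches. */
struct hdr_sketch {
    uint8_t  other[8];  /* remaining header bytes, unchanged by the driver */
    uint32_t context;
    uint32_t pad_0;
};

union frame_sketch {
    struct hdr_sketch hdr;
    uint8_t raw[FRAME_SIZE];
};

/*
 * Copy the application's frame, then overwrite the fields the driver
 * owns: context gets the command index, and pad_0 is forced to 0 in
 * case the application left stale data there.
 */
void prepare_frame_sketch(union frame_sketch *dst,
                          const uint8_t *app_raw, uint32_t cmd_index)
{
    memcpy(dst->raw, app_raw, FRAME_SIZE);
    dst->hdr.context = cmd_index;
    dst->hdr.pad_0 = 0;
}
```

Clearing the field in the driver covers every application at once instead of relying on each one to zero its own frames.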
Comment 29 Don Zickus 2009-08-05 10:08:57 EDT
in kernel-2.6.18-162.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 35 Tomas Henzl 2009-08-19 06:37:11 EDT
*** Bug 506510 has been marked as a duplicate of this bug. ***
Comment 36 errata-xmlrpc 2009-09-02 04:37:30 EDT
An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
