Bug 494658 - With Red Hat errata 128.1.6 installed system hangs with SATA drives installed.
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: x86_64 Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 5.4
Assigned To: David Milburn
QA Contact: Red Hat Kernel QE team
Keywords: ZStream
Blocks: 459515 506029
Reported: 2009-04-07 13:57 EDT by Oonkwee Lim
Modified: 2010-10-23 04:51 EDT
CC List: 16 users

Doc Type: Bug Fix
Last Closed: 2009-09-02 04:05:43 EDT


Attachments
SCSI errors from the messages file with one blank SATA disk installed. (236.45 KB, text/plain)
2009-06-03 20:35 EDT, Clinton.Speas
SCSI errors in messages file when single SATA drive installed with MD partitions. (163.33 KB, text/plain)
2009-06-03 20:36 EDT, Clinton.Speas
scsi_logging_level=0x1c0 (201.87 KB, text/plain)
2009-06-04 17:20 EDT, Clinton.Speas
scsi_logging_level=0x1ff (423.45 KB, text/plain)
2009-06-04 17:21 EDT, Clinton.Speas
Serial console output, 0x1c0 and kernel 128.1.6. (349.58 KB, text/plain)
2009-06-09 20:12 EDT, Clinton.Speas
Serial console output, 0x1c0 and kernel 152. (784.85 KB, text/plain)
2009-06-09 20:13 EDT, Clinton.Speas
Serial console output, 0x1c0 and kernel 152.bz494658.2 (315.01 KB, text/plain)
2009-06-10 20:56 EDT, Clinton.Speas
Serial console output, 0x1c0 and kernel 152.bz494658.3 (194.21 KB, text/plain)
2009-06-10 20:58 EDT, Clinton.Speas
Backport of upstream patch (3.63 KB, patch)
2009-06-11 13:40 EDT, David Milburn
Serial console output, 0x1c0 and kernel 152.bz494658.4 (29.03 KB, text/plain)
2009-06-11 17:51 EDT, Clinton.Speas
Serial console output, 0x1c0 and kernel 152.bz494658.5 (407.66 KB, text/plain)
2009-06-11 17:52 EDT, Clinton.Speas

Description Oonkwee Lim 2009-04-07 13:57:49 EDT
Description of problem:
With Red Hat errata 128.1.6 installed system hangs with SATA drives installed.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Upgrade with Red Hat Errata 128.1.6.
  
Actual results:
The system hangs with the SATA drive LEDs flashing in succession over and over again, as if the drives were constantly being scanned.

After more than 5 minutes, the following message appeared:

scsi_alloc_sdev  Allocation failure during SCSI scanning.  Some SCSI devices
might not be configured.


Expected results:
The system should not hang.

Additional info:
The hang disappears once the SATA drives are removed.
Comment 1 Oonkwee Lim 2009-04-07 14:22:00 EDT
This is a bit different from the bug reported in 483171, where the system
panicked when SATA drives were installed.
Comment 2 Robert N. Evans 2009-04-07 15:05:54 EDT
From bug 483171 it can be seen that this problem occurs when the SATA disk is connected to an HBA managed by aic94xx.ko.  The lspci output for the device on a system that exhibits the problem is:

07:00.0 Serial Attached SCSI controller: Adaptec AIC-9410W SAS (Razor ASIC non-RAID) (rev 09)
        Subsystem: NEC Corporation: Unknown device 8350
        Flags: bus master, 66Mhz, slow devsel, latency 32, IRQ 13
        Memory at 84200000 (64-bit, non-prefetchable) [size=256K]
        Memory at c4000000 (64-bit, prefetchable) [size=128K]
        I/O ports at 1000 [size=256]
        Capabilities: [40] PCI-X non-bridge device.
        Capabilities: [58] Power Management version 2
        Capabilities: [e0] Message Signalled Interrupts: 64bit+ Queue=0/2 Enable-
Comment 3 Andrius Benokraitis 2009-04-07 15:14:50 EDT
Have you tried the RHEL 5.4 kernel as well where this was included?

in kernel-2.6.18-131.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 4 Andrius Benokraitis 2009-04-15 00:33:29 EDT
Robert/Oonkwee - any testing status on the 5.4 test kernels?
Comment 6 Andrius Benokraitis 2009-04-15 11:23:19 EDT
Jim - I think this is in your court now - please let me know if you can generate this with the latest 5.4 kernel as well...
Comment 7 Oonkwee Lim 2009-04-15 14:45:52 EDT
I cannot find kernel-2.6.18-131.el5.  Can I use kernel-2.6.18-133.el5?
Comment 8 Andrius Benokraitis 2009-04-15 15:04:17 EDT
Yes please, -131 or newer.
Comment 9 Oonkwee Lim 2009-05-08 17:05:26 EDT
We have tested with version 144 and the problem still occurs.

It is doing the same thing the 5.3 errata did, cycling between the drives as if it were scanning them.  It is still cycling the LEDs 10 minutes later.
Comment 10 Andrius Benokraitis 2009-05-12 14:12:57 EDT
Oonkwee - do you happen to have a patch to fix this issue?
Comment 11 Andrius Benokraitis 2009-05-12 14:20:26 EDT
Jesse - you wouldn't happen to know if this is a known issue with aic94xx?
Comment 12 Oonkwee Lim 2009-05-12 18:09:11 EDT
No, I do not have a patch for this.  We have not encountered this problem yet, since I have not merged in the 5.3 errata or 5.4 changes.  I notice that the patch I wrote for the previous issue of the system crashing with SATA disks is incorporated in the latest 5.4 source.  Therefore I think this issue was introduced after the 5.3 GA.
Comment 13 Jesse Larrew 2009-05-12 19:27:07 EDT
(In reply to comment #11)
> Jesse - you wouldn't happen to know if this is a known issue with aic94xx?  

Adding Peter Bogdanovic to the CC list. Peter is a developer on the System X team at IBM, and is familiar with the aic94xx driver.

Peter, has IBM seen this before?
Comment 14 Peter Bogdanovic 2009-05-13 14:19:45 EDT
I did not see that problem when I recently tested swapping SATA drives on a system with an Adaptec Razor (aic94xx) running 2.6.18-141. I will try booting with the SATA disk plugged in to see if that makes a difference.
Comment 15 David Milburn 2009-05-15 17:40:35 EDT
For this problem, it may be helpful to turn SCSI scan logging on in
/etc/modprobe.conf and rebuild the initrd:

options scsi_mod scsi_logging_level=0x1c0
Comment 16 Clinton.Speas 2009-06-03 20:28:48 EDT
I have tried to set the scsi_logging_level as shown above.  I tried setting scsi_logging_level on the kernel command line and echoing 0x1c0 into /sys/module/scsi_mod/scsi_logging_level.  I am not getting any new messages on the console or in the messages file.  Is there another way to get more debug information while I am experiencing this problem?
Comment 17 Clinton.Speas 2009-06-03 20:35:32 EDT
Created attachment 346482
SCSI errors from the messages file with one blank SATA disk installed.
Comment 18 Clinton.Speas 2009-06-03 20:36:28 EDT
Created attachment 346483
SCSI errors in messages file when single SATA drive installed with MD partitions.
Comment 19 David Milburn 2009-06-04 13:29:05 EDT
Can you double-check that you rebuilt the initrd?

Here is an example:

# cat /etc/modprobe.conf
alias eth0 tg3
alias eth1 tg3
alias scsi_hostadapter mptbase
alias scsi_hostadapter1 mptsas
alias scsi_hostadapter2 ata_piix
options scsi_mod scsi_logging_level=0x1c0

# cp /boot/initrd-$(uname -r).img /boot/initrd-$(uname -r).img.bak
# mkinitrd -f -v /boot/initrd-$(uname -r).img $(uname -r)

You should see some additional debug on the console during boot, like

scsi scan: Sending REPORT LUNS to host 0 channel 0 id 1 (try 0)
scsi scan: REPORT LUNS successful (try 0) result 0x0

You can also try setting scsi_logging_level to 0x1ff; that should cover
(errors | timeout | scan) logging.
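
To see why 0x1c0 enables only scan logging while 0x1ff adds error and timeout logging, the value can be decoded as a bitfield. The sketch below is only an illustration: it assumes the layout in the kernel's drivers/scsi/scsi_logging.h, where each facility gets a 3-bit level and the first three facilities are error (shift 0), timeout (shift 3) and scan (shift 6). The function name decode_scsi_logging_level is made up for this example.

```python
# Facility names and bit offsets, assumed from drivers/scsi/scsi_logging.h.
# Each facility occupies 3 bits, giving a level from 0 (off) to 7 (most verbose).
SCSI_LOG_SHIFTS = {
    "error": 0,
    "timeout": 3,
    "scan": 6,
    "mlqueue": 9,
    "mlcomplete": 12,
    "llqueue": 15,
    "llcomplete": 18,
    "hlqueue": 21,
    "hlcomplete": 24,
    "ioctl": 27,
}
BITS_PER_FACILITY = 3


def decode_scsi_logging_level(value):
    """Return {facility: level} for every facility with a non-zero level."""
    mask = (1 << BITS_PER_FACILITY) - 1
    levels = {}
    for name, shift in SCSI_LOG_SHIFTS.items():
        level = (value >> shift) & mask
        if level:
            levels[name] = level
    return levels


if __name__ == "__main__":
    # The two values discussed in this bug.
    for value in (0x1C0, 0x1FF):
        print(hex(value), decode_scsi_logging_level(value))
```

Under that assumed layout, 0x1c0 sets only the scan facility to its maximum level, and 0x1ff sets error, timeout and scan to their maximum levels.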
Comment 20 Clinton.Speas 2009-06-04 17:20:58 EDT
Created attachment 346588
scsi_logging_level=0x1c0
Comment 21 Clinton.Speas 2009-06-04 17:21:25 EDT
Created attachment 346589
scsi_logging_level=0x1ff
Comment 22 Clinton.Speas 2009-06-04 17:22:49 EDT
Thank you for the update and example of mkinitrd command line.
Comment 23 Clinton.Speas 2009-06-04 18:23:41 EDT
This problem has been reproduced under the following conditions.

1. On a Stratus ftServer model 2510, a 4410, and a 6210 (code name Fusion-H).
2. With ftSSS for Linux 6.0.3.0 (RHEL 5.3 GA) installed, the problem occurs when upgrading to one of the Red Hat errata kernels.
   a. Seen with Red Hat errata kernels 2.6.18-128.1.6.el5 and 2.6.18-128.1.10.el5.
3. With internal SATA disks, either blank disks or disks with MD partitions.  Attached are the SCSI errors logged for both.
   a. Disks are Seagate ST3500630NS SATA drives (Stratus part number 260-01649-001).
   b. With 1, 2, and 4 disks.  If 4 SATA drives are configured, the problem is serious enough to prevent the system from finishing booting.
   c. Or with just Red Hat 5.3 kernel-2.6.18-128.1.6.el5 installed, and only the top enclosure powered on in the Stratus ftServer.  This ftServer configuration is as close as possible to a reference platform (with only one CPU/IO slice in the system).

REPRODUCTION:

Method 1 (with a minimal configuration):

1. Remove power from the bottom enclosure.
2. Install Red Hat 5.3 with only one SAS drive installed in the top enclosure (other disk slots empty).
3. Remove “quiet” from the kernel command line in grub.conf.
4. Upgrade to kernel-2.6.18-128.1.6.el5 or 2.6.18-128.1.10.el5.
5. Shut down and power off, then insert one or two blank SATA disks (or use disks with MD partitions).
6. Boot the system.  The SCSI errors will start scrolling down the console during the boot process and continue after the system is up.  The disks will not get added.

Method 2 (with a typical Stratus configuration):

1. With both CPU/IO slices powered on, install Red Hat 5.3.
2. Install ftSSS for Linux 6.0.3.0.
3. Remove “quiet” from the kernel command line in grub.conf.
4. Install 4 SATA drives, all configured with MD partitions.
5. Upgrade to 2.6.18-128.1.6.el5 or 2.6.18-128.1.10.el5.
6. Reboot the system.  The SCSI errors will start scrolling down the console during the boot process and the system will not boot up.
Comment 27 David Milburn 2009-06-09 18:35:30 EDT
I built a test kernel with some debugging turned on. Would you please boot
the kernel-2.6.18-152.el5 test kernel located here?

http://people.redhat.com/dmilburn/

I wasn't able to find any of the "scsi_alloc_sdev" failures in the logs
in Comment #20 and #21, so we may not be capturing everything in
/var/log/messages. Would it be possible to set up a serial console to capture
the output while booting? We need to see some debug output around the actual
scsi_alloc_sdev failures.

Also, before installing the test kernel, set "scsi_logging_level=0x1c0"; the
installation of the rpm will take care of the initrd.

Also, I looked over the change log for -128.1.6.el5. The only relevant patch
I can see is the one to fix up sas_sata_ops, which, if I understand correctly,
you are including in your Method 1 [2] step #1, since otherwise the system will crash.

[scsi] libata: sas_ata fixup sas_sata_ops
Comment 28 Clinton.Speas 2009-06-09 20:12:50 EDT
Created attachment 347115
Serial console output, 0x1c0 and kernel 128.1.6.
Comment 29 Clinton.Speas 2009-06-09 20:13:40 EDT
Created attachment 347116
Serial console output, 0x1c0 and kernel 152.
Comment 30 David Milburn 2009-06-10 20:10:13 EDT
Looking at the logs, it appears that sas_form_port() (sas_port.c) may be
continually hitting this condition:

                if (memcmp(port->attached_sas_addr, phy->attached_sas_addr,
                           SAS_ADDR_SIZE) != 0)
                        sas_deform_port(phy);
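
As an illustration of how a check like this could fire on every event, here is a toy model in Python. It is not the libsas code: the function count_deforms and the rewrite_port_addr hook are hypothetical, and the idea that the stored address ends up different from the one the phy reports is an assumption made only to show the looping behaviour.

```python
SAS_ADDR_SIZE = 8  # bytes per SAS address


def count_deforms(phy_addr, rewrite_port_addr, events):
    """Count deform operations over a number of form-port events.

    phy_addr          -- address the phy reports on every event
    rewrite_port_addr -- hypothetical hook applied to the address stored in
                         the port when it is formed; the identity function
                         models storing the address exactly as supplied
    events            -- number of form-port events to simulate
    """
    port_addr = None  # no port formed yet
    deforms = 0
    for _ in range(events):
        # Models the memcmp() != 0 check: stored address differs from the phy's.
        if port_addr is not None and port_addr != phy_addr:
            deforms += 1      # models sas_deform_port(phy)
            port_addr = None
        if port_addr is None:
            # Models (re)forming the port with whatever address gets stored.
            port_addr = rewrite_port_addr(phy_addr)
    return deforms


if __name__ == "__main__":
    addr = bytes.fromhex("50030130f1012611")
    assert len(addr) == SAS_ADDR_SIZE
    print("address kept as supplied:", count_deforms(addr, lambda a: a, 10))
    print("address altered on store:", count_deforms(addr, lambda a: bytes(8), 10))
```

In this model, a stored address that matches the phy never triggers a deform, while a stored address that always differs triggers one on every event after the first.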

I have built two more test kernels, kernel-2.6.18-152.el5.bz494658.2 adds some
more debugging to sas_form_port, please boot it and attach the serial console
output.

Also, I have built kernel-2.6.18-152.el5.bz494658.3 which is the same as .2,
but includes this upstream patch

commit 3b6e9fafc40e36f50f0bd0f1ee758eecd79f1098
Author: Darrick J. Wong <djwong@us.ibm.com>
Date:   Fri Jan 26 14:08:41 2007 -0800

    [SCSI] libsas: Fix incorrect sas_port deformation in sas_form_port

Please boot the .3 kernel and attach the serial console output.

http://people.redhat.com/dmilburn/
Comment 31 Clinton.Speas 2009-06-10 20:56:05 EDT
Created attachment 347319
Serial console output, 0x1c0 and kernel 152.bz494658.2
Comment 32 Clinton.Speas 2009-06-10 20:58:45 EDT
Created attachment 347320
Serial console output, 0x1c0 and kernel 152.bz494658.3

The SATA disks were added as the kernel was booting.  Once the system was up I confirmed I could see all the SCSI devices and "fdisk -l" displayed their empty partition tables.
Comment 33 David Milburn 2009-06-11 11:12:56 EDT
Based upon the Comment #31 output, it does look like we are failing the check
between port->attached_sas_addr and phy->attached_sas_addr. So basically
libsas constantly removes the phy from the port and then re-adds it, and this
never ends.

SAS_FORM_PORT: port 0x7fc04880 port->attached_sas_addr 0x7fc04a90, phy->attached_sas_addr 0x7fc00ee0, SAS_ADDR_SIZE 8
SAS_DEFORM_PORT: port 0x7fc04880
SAS_DEFORM_PORT: port->num_phys 1
SAS_FORM_PORT: num_phys 8
sas: phy1 added to port1, phy_mask:0x2
sas: phy-0:1 added to port-0:1, phy_mask:0x2 (50030130f1012611)

SAS_FORM_PORT: port 0x7fc04ad8 port->attached_sas_addr 0x7fc04ce8, phy->attached_sas_addr 0x7fc01730, SAS_ADDR_SIZE 8
SAS_DEFORM_PORT: port 0x7fc04ad8
SAS_DEFORM_PORT: port->num_phys 1
SAS_FORM_PORT: num_phys 8
sas: phy2 added to port2, phy_mask:0x4
sas: phy-0:2 added to port-0:2, phy_mask:0x4 (50030130f1012612)
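
When going through a long serial console capture, it can help to count how often each port is deformed. The sketch below assumes the capture contains lines in the same format as the debug output above (that format comes from the test kernel's extra debugging, not from a stock kernel); the function name count_deforms_per_port is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as "SAS_DEFORM_PORT: port 0x7fc04880".
# "SAS_DEFORM_PORT: port->num_phys 1" is intentionally not matched.
DEFORM_RE = re.compile(r"SAS_DEFORM_PORT: port (0x[0-9a-fA-F]+)")


def count_deforms_per_port(lines):
    """Return a Counter mapping port pointer -> number of deform events."""
    counts = Counter()
    for line in lines:
        match = DEFORM_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Excerpt taken from the log lines quoted in this comment.
    sample = [
        "SAS_DEFORM_PORT: port 0x7fc04880",
        "SAS_DEFORM_PORT: port->num_phys 1",
        "SAS_FORM_PORT: num_phys 8",
        "sas: phy1 added to port1, phy_mask:0x2",
        "SAS_DEFORM_PORT: port 0x7fc04ad8",
        "SAS_DEFORM_PORT: port->num_phys 1",
    ]
    for port, count in count_deforms_per_port(sample).items():
        print(port, count)
```

A count that keeps growing for the same port pointer over the length of the capture would be consistent with the remove and re-add loop described in this comment.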

The following is actually the patch in the .3 test kernel; the one I
referenced above (Comment #30) is already present in RHEL5.

commit a29c05153630b2cd5ea078c97c0abe084cd830d8
Author: James Bottomley <James.Bottomley@HansenPartnership.com>
Date:   Sat Feb 23 23:38:44 2008 -0600

    [SCSI] libsas: use the supplied address for SATA devices rather than changi

We will need to do one more test with all the debugging turned off. I should
have another kernel rpm ready for final testing soon.
Comment 35 David Milburn 2009-06-11 13:40:23 EDT
Created attachment 347445
Backport of upstream patch

commit a29c05153630b2cd5ea078c97c0abe084cd830d8
Author: James Bottomley <James.Bottomley@HansenPartnership.com>
Date:   Sat Feb 23 23:38:44 2008 -0600

    [SCSI] libsas: use the supplied address for SATA devices rather than changing it
Comment 37 RHEL Product and Program Management 2009-06-11 13:51:43 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 38 David Milburn 2009-06-11 15:32:56 EDT
Would you please verify kernel-2.6.18-152.el5.bz494658.4? It is the -152.el5
test kernel with the patch from Comment #35 applied. Based upon the successful
.3 test, this should work; please attach the serial console output.

I would also like to ask you to test kernel-2.6.18-152.el5.bz494658.5. This
one will fail, but I would like to look at the serial console output.

http://people.redhat.com/dmilburn/
Comment 39 Clinton.Speas 2009-06-11 17:51:01 EDT
Created attachment 347487
Serial console output, 0x1c0 and kernel 152.bz494658.4

The boot looked clean and the disks were added with no problem.
Comment 40 Clinton.Speas 2009-06-11 17:52:22 EDT
Created attachment 347488
Serial console output, 0x1c0 and kernel 152.bz494658.5
Comment 44 Don Zickus 2009-06-18 10:51:17 EDT
in kernel-2.6.18-154.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 46 Clinton.Speas 2009-06-19 15:13:50 EDT
I was able to verify that this fix is in kernel-2.6.18-154.el5.
Comment 48 errata-xmlrpc 2009-09-02 04:05:43 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
