Description of problem:
With Red Hat errata 128.1.6 installed, the system hangs when SATA drives are installed.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Upgrade with Red Hat Errata 128.1.6.
2.
3.

Actual results:
The system hangs with the SATA drive LEDs flashing in succession over and over again, as if the drives were constantly being scanned. After more than 5 minutes this message appeared:

  scsi_alloc_sdev Allocation failure during SCSI scanning. Some SCSI devices might not be configured.

Expected results:
The system should not hang.

Additional info:
The hang disappears once the SATA drives are removed.
This is a bit different from the bug reported in 483171, where the system panicked when SATA drives were installed.
From bug 483171 it can be seen that this problem occurs when the SATA disk is connected to an HBA managed by aic94xx.ko. lspci output for the device on a system that exhibits the problem:

07:00.0 Serial Attached SCSI controller: Adaptec AIC-9410W SAS (Razor ASIC non-RAID) (rev 09)
        Subsystem: NEC Corporation: Unknown device 8350
        Flags: bus master, 66Mhz, slow devsel, latency 32, IRQ 13
        Memory at 84200000 (64-bit, non-prefetchable) [size=256K]
        Memory at c4000000 (64-bit, prefetchable) [size=128K]
        I/O ports at 1000 [size=256]
        Capabilities: [40] PCI-X non-bridge device.
        Capabilities: [58] Power Management version 2
        Capabilities: [e0] Message Signalled Interrupts: 64bit+ Queue=0/2 Enable-
Have you tried the RHEL 5.4 kernel as well, where this was included (in kernel-2.6.18-131.el5)? You can download this test kernel from http://people.redhat.com/dzickus/el5
Robert/Oonkwee - any testing status on the 5.4 test kernels?
Jim - I think this is in your court now - please let me know if you can generate this with the latest 5.4 kernel as well...
I cannot find kernel-2.6.18-131.el5. Can I use kernel-2.6.18-133.el5?
Yes please, -131 or newer.
We have tested with version 144 and the problem still occurs. It is doing the same thing the 5.3 errata did, cycling between the drives as if it is scanning them. It is still cycling the LEDs 10 minutes later.
Oonkwee - do you happen to have a patch to fix this issue?
Jesse - you wouldn't happen to know if this is a known issue with aic94xx?
No, I do not have a patch for this. We have not encountered this problem yet, since I did not merge in the 5.3 errata or 5.4 changes. I notice that the patch I did for the previous issue of the system crashing with SATA disks is incorporated in the latest 5.4 source. Therefore I think this issue was introduced after the 5.3 GA.
(In reply to comment #11)
> Jesse - you wouldn't happen to know if this is a known issue with aic94xx?

Adding Peter Bogdanovic to the CC list. Peter is a developer on the System X team at IBM, and is familiar with the aic94xx driver. Peter, has IBM seen this before?
I did not see that problem when I recently tested swapping SATA drives on a system with an Adaptec Razor, aic94xx, running 2.6.18-141. I will try booting with the SATA disk plugged in to see if that makes a difference.
For this problem, it may be helpful to turn SCSI scan logging on in /etc/modprobe.conf and rebuild the initrd:

  options scsi_mod scsi_logging_level=0x1c0
I have tried to set the scsi_logging_level as shown above. I tried setting scsi_logging_level on the kernel command line and echoing 0x1c0 into /sys/module/scsi_mod/scsi_logging_level. I am not getting any new messages on the console or in the messages file. Is there another way to get more debug information while I am experiencing this problem?
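One way to confirm whether the setting actually took effect is to read the value back. The sketch below assumes the kernel was built with SCSI logging support and exposes the current level, in decimal, at /proc/sys/dev/scsi/logging_level. The function name is made up for illustration.

```shell
# Print the current SCSI logging level in hex.
# Assumption: the kernel exposes the value (in decimal) at
# /proc/sys/dev/scsi/logging_level. A different file can be passed as $1.
show_scsi_logging_level() {
    level_file=${1:-/proc/sys/dev/scsi/logging_level}
    printf '0x%x\n' "$(cat "$level_file")"
}
```

If this prints 0x1c0 the option was picked up. If it prints 0x0, the option did not reach scsi_mod, which would point at the initrd or the command line rather than at the logging itself.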
Created attachment 346482 [details] SCSI errors from the messages file with one blank SATA disk installed.
Created attachment 346483 [details] SCSI errors in messages file when single SATA drive installed with MD partitions.
Can you double-check that you rebuilt the initrd? Here is an example:

# cat /etc/modprobe.conf
alias eth0 tg3
alias eth1 tg3
alias scsi_hostadapter mptbase
alias scsi_hostadapter1 mptsas
alias scsi_hostadapter2 ata_piix
options scsi_mod scsi_logging_level=0x1c0

# cp /boot/initrd-(kernel-version).img /boot/initrd-(kernel-version).img.bak
# mkinitrd -f -v /boot/initrd-$(uname -r).img $(uname -r)

You should see some additional debug output on the console during boot, like:

scsi scan: Sending REPORT LUNS to host 0 channel 0 id 1 (try 0)
scsi scan: REPORT LUNS successful (try 0) result 0x0

You can also try setting scsi_logging_level to 0x1ff; that should cover (errors | timeout | scan) logging.
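For reference, the two masks used in this thread can be decoded with a few lines of shell. This is only a sketch: it assumes the layout in the upstream drivers/scsi/scsi_logging.h, where each facility gets a 3-bit level (error at bit 0, timeout at bit 3, scan at bit 6, and so on). The function name is hypothetical.

```shell
# Decode a scsi_logging_level mask into per-facility levels (0-7).
# Assumption: 3 bits per facility, in the order used by the upstream
# drivers/scsi/scsi_logging.h. Facilities at level 0 are not printed.
decode_scsi_logging_level() {
    mask=$(( $1 ))
    shift_bits=0
    for facility in error timeout scan mlqueue mlcomplete \
                    llqueue llcomplete hlqueue hlcomplete ioctl; do
        level=$(( (mask >> shift_bits) & 7 ))
        if [ "$level" -gt 0 ]; then
            echo "$facility=$level"
        fi
        shift_bits=$(( shift_bits + 3 ))
    done
}
```

With this layout, `decode_scsi_logging_level 0x1c0` reports only the scan facility at its maximum level, and `decode_scsi_logging_level 0x1ff` reports error, timeout and scan, all at level 7.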
Created attachment 346588 [details] scsi_logging_level=0x1c0
Created attachment 346589 [details] scsi_logging_level=0x1ff
Thank you for the update and example of mkinitrd command line.
This problem has been reproduced under the following conditions.

1. On Stratus ftServer models 2510, 4410, and 6210 (code name Fusion-H).
2. With ftSSS for Linux 6.0.3.0 (RHEL 5.3 GA) installed, the problem occurs when upgrading to one of the Red Hat errata.
   a. With Red Hat errata 2.6.18-128.1.6.el5 and 2.6.18-128.1.10.el5.
3. With internal SATA disks, either blank disks or disks with MD partitions. Attached are the SCSI errors logged for both.
   a. Disks are Seagate ST3500630NS SATA drives (Stratus part number 260-01649-001).
   b. With 1, 2, and 4 disks. If 4 SATA drives are configured, the problem is serious enough to prevent the system from finishing its boot.
   c. Or with just Red Hat 5.3 kernel-2.6.18-128.1.6.el5 installed, and only the top enclosure powered on in the Stratus ftServer. This configuration of the ftServer is as close as possible to a reference platform (with only one CPU/IO slice in the system).

REPRODUCTION:

Method 1 (with a minimal configuration):
1. Remove power from the bottom enclosure.
2. Install Red Hat 5.3 with only one SAS drive installed in the top enclosure (other disk slots empty).
3. Remove “quiet” from the kernel command line in grub.conf.
4. Upgrade to kernel-2.6.18-128.1.6.el5 or 2.6.18-128.1.10.el5.
5. Shut down and power off, then insert one or two blank SATA disks (or you can use disks with MD partitions).
6. Boot the system. The SCSI errors will start scrolling down the console during the boot process and continue after the system is up. The disks will not get added.

Method 2 (with a typical Stratus configuration):
1. With both CPU/IO slices powered on, install Red Hat 5.3.
2. Install ftSSS for Linux 6.0.3.0.
3. Remove “quiet” from the kernel command line in grub.conf.
4. Install 4 SATA drives, all configured with MD partitions.
5. Upgrade to 2.6.18-128.1.6.el5 or 2.6.18-128.1.10.el5.
6. Reboot the system. The SCSI errors will start scrolling down the console during the boot process and the system will not boot up.
I built a test kernel with some debugging turned on. Would you please boot the kernel-2.6.18-152.el5 test kernel located here?

http://people.redhat.com/dmilburn/

I wasn't able to find any of the "scsi_alloc_sdev" failures in the logs in Comment #20 and #21. We may not be capturing everything in /var/log/messages; would it be possible to set up a serial console to capture the output while booting? We need to see some debug output around the actual scsi_alloc_sdev failures.

Also, before installing the test kernel, set "scsi_logging_level=0x1c0"; the installation of the rpm will take care of the initrd.

Also, I looked over the change log for -128.1.6.el5. The only relevant patch I can see is the one to fix up sas_sata_ops, which, if I understand correctly, you are including in your Method 1[2] step #1, since otherwise the system will crash:

[scsi] libata: sas_ata fixup sas_sata_ops
Created attachment 347115 [details] Serial console output, 0x1c0 and kernel 128.1.6.
Created attachment 347116 [details] Serial console output, 0x1c0 and kernel 152.
Looking at the logs, it appears that sas_form_port() (sas_port.c) may be continually hitting this condition:

    if (memcmp(port->attached_sas_addr, phy->attached_sas_addr,
               SAS_ADDR_SIZE) != 0)
            sas_deform_port(phy);

I have built two more test kernels. kernel-2.6.18-152.el5.bz494658.2 adds some more debugging to sas_form_port; please boot it and attach the serial console output.

Also, I have built kernel-2.6.18-152.el5.bz494658.3, which is the same as .2 but includes this upstream patch:

commit 3b6e9fafc40e36f50f0bd0f1ee758eecd79f1098
Author: Darrick J. Wong <djwong.com>
Date: Fri Jan 26 14:08:41 2007 -0800

    [SCSI] libsas: Fix incorrect sas_port deformation in sas_form_port

Please boot the .3 kernel and attach the serial console output.

http://people.redhat.com/dmilburn/
Created attachment 347319 [details] Serial console output, 0x1c0 and kernel 152.bz494658.2
Created attachment 347320 [details] Serial console output, 0x1c0 and kernel 152.bz494658.3

The SATA disks were added as the kernel was booting. Once the system was up I confirmed I could see all the SCSI devices, and "fdisk -l" displayed their empty partition tables.
Based upon the Comment #31 output, it does look like we are failing the check between port->attached_sas_addr and phy->attached_sas_addr. So basically libsas constantly removes the phy from the port and then re-adds it, and this never ends.

SAS_FORM_PORT: port 0x7fc04880 port->attached_sas_addr 0x7fc04a90, phy->attached_sas_addr 0x7fc00ee0, SAS_ADDR_SIZE 8
SAS_DEFORM_PORT: port 0x7fc04880
SAS_DEFORM_PORT: port->num_phys 1
SAS_FORM_PORT: num_phys 8
sas: phy1 added to port1, phy_mask:0x2
sas: phy-0:1 added to port-0:1, phy_mask:0x2 (50030130f1012611)
SAS_FORM_PORT: port 0x7fc04ad8 port->attached_sas_addr 0x7fc04ce8, phy->attached_sas_addr 0x7fc01730, SAS_ADDR_SIZE 8
SAS_DEFORM_PORT: port 0x7fc04ad8
SAS_DEFORM_PORT: port->num_phys 1
SAS_FORM_PORT: num_phys 8
sas: phy2 added to port2, phy_mask:0x4
sas: phy-0:2 added to port-0:2, phy_mask:0x4 (50030130f1012612)

This is actually the patch in the .3 test kernel; the one I referenced above is already present in RHEL5.

commit a29c05153630b2cd5ea078c97c0abe084cd830d8
Author: James Bottomley <James.Bottomley>
Date: Sat Feb 23 23:38:44 2008 -0600

    [SCSI] libsas: use the supplied address for SATA devices rather than changi

We will need to do one more test with all the debugging turned off. I should have another kernel rpm ready for final testing soon.
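To see how often each port is being torn down, the deform events in a saved serial console capture can be counted per port. This is a rough sketch, assuming the capture contains the "SAS_DEFORM_PORT: port 0x..." debug lines quoted in this comment, which only the instrumented test kernels print. The function name and the file name in the usage note are placeholders.

```shell
# Count SAS_DEFORM_PORT events per port address in a saved console log.
# Assumption: the log contains debug lines of the form
#   SAS_DEFORM_PORT: port 0x7fc04880
# The "port->num_phys" lines do not match the pattern and are skipped.
count_port_deforms() {
    grep -o 'SAS_DEFORM_PORT: port 0x[0-9a-f]*' "$1" |
        awk '{ n[$3]++ } END { for (p in n) print p, n[p] }' |
        sort
}
```

For example, `count_port_deforms serial-console.log` prints one line per port address followed by its deform count. Counts that keep growing between two captures of the same boot would be consistent with the form/deform loop described here.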
Created attachment 347445 [details] Backport of upstream patch

commit a29c05153630b2cd5ea078c97c0abe084cd830d8
Author: James Bottomley <James.Bottomley>
Date: Sat Feb 23 23:38:44 2008 -0600

    [SCSI] libsas: use the supplied address for SATA devices rather than changing it
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Would you please verify kernel-2.6.18-152.el5.bz494658.4? It is the -152.el5 test kernel with the patch from Comment #35 applied. Based upon the successful .3 test, this should work; please attach the serial console output.

I would like to ask you to also test kernel-2.6.18-152.el5.bz494658.5. This one will fail, but I would like to look at the serial console output.

http://people.redhat.com/dmilburn/
Created attachment 347487 [details] Serial console output, 0x1c0 and kernel 152.bz494658.4

Boot looked clean and the disks were added with no problem.
Created attachment 347488 [details] Serial console output, 0x1c0 and kernel 152.bz494658.5
in kernel-2.6.18-154.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
I was able to verify that this fix is in kernel-2.6.18-154.el5.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html