Bug 589405 - (CR178148) Running a storage controller fail and drive fail test on an LSI storage array connected to a RHEL5.5 x64 host with QLE4062 host adapter gives a kernel panic
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: x86_64
OS: Linux
Target Milestone: rc
Assignee: Red Hat Kernel Manager
QA Contact: Red Hat Kernel QE team
Depends On:
Reported: 2010-05-06 05:03 UTC by Sujithkumary
Modified: 2011-10-10 13:36 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2011-10-10 13:36:10 UTC
Target Upstream Version:

Attachments
Host Serial Logs (522.07 KB, application/x-zip-compressed)
2010-05-06 05:03 UTC, Sujithkumary

Description Sujithkumary 2010-05-06 05:03:35 UTC
Created attachment 411810
Host Serial Logs

Description of problem:
Running a storage controller fail and drive fail test on an LSI storage array connected to a RHEL 5.5 x86_64 host with a QLE4062 host adapter causes a kernel panic after 12 hours of running the test. The panic is specific to SAN boot; it is not observed when the host is booted from an internal HDD. The issue was raised as soon as we hit it, and its reproducibility is unknown. The call trace of the panic is below:

NMI Watchdog detected LOCKUP on CPU 0
<8>CPU 0
<8>Modules linked in: ipv6 xfrm_nalgo crypto_api autofs4 hidp rfcomm l2cap bluetooth dm_log_clustered(U) lockd sunrpc loop dm_mirror dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev pcspkr shpchp qla3xxx bnx2 serio_raw i5000_edac edac_mc dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache qisioctl(U) mppVhba(U) ata_piix libata megaraid_sas qla4xxx scsi_transport_iscsi2 scsi_transport_iscsi mppUpper(U) sg sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
<8>Pid: 495, comm: scsi_eh_0 Tainted: G      2.6.18-194.el5 #1
<8>RIP: 0010:[<ffffffff80065c0b>]  [<ffffffff80065c0b>] .text.lock.spinlock+0x11/0x30
<8>RSP: 0018:ffffffff80448f00  EFLAGS: 00000086
<8>RAX: 0000000000000046 RBX: ffff81007eb404f8 RCX: 000000000000005a
<8>RDX: ffff81007eb9dd38 RSI: ffff81007eb404f8 RDI: ffff81007eb405f8
<8>RBP: 0000000000000000 R08: ffff81007eb9dd80 R09: 000000000000003c
<8>R10: 0000000000000002 R11: ffffffff8812226e R12: 000000000000005a
<8>R13: 0000000000000000 R14: ffff81007eb9dd38 R15: ffff81007eb9dd38
<8>FS:  0000000000000000(0000) GS:ffffffff803cb000(0000) knlGS:0000000000000000
<8>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<8>CR2: 000000000911d000 CR3: 00000000355aa000 CR4: 00000000000006e0
<8>Process scsi_eh_0 (pid: 495, threadinfo ffff81007eb9c000, task ffff81007f897100)
<8>Stack:  ffffffff88120687 0000000000000000 ffff81007f5b4440 000000000000005a
<8> ffffffff80010c00 ffffffff803e8d80 0000000000005a00 000000000000005a
<8> ffff81007f5b4440 ffffffff803e8dbc ffffffff800bbfaf ffffffff80012409
<8>Call Trace:
<8> <IRQ>  [<ffffffff88120687>] :qla4xxx:qla4xxx_intr_handler+0x3c/0x201
<8> [<ffffffff80010c00>] handle_IRQ_event+0x51/0xa6
<8> [<ffffffff800bbfaf>] __do_IRQ+0xa4/0x103
<8> [<ffffffff80012409>] __do_softirq+0x89/0x133
<8> [<ffffffff8006da2b>] do_IRQ+0xe7/0xf5
<8> [<ffffffff8005e615>] ret_from_intr+0x0/0xa
<8> <EOI>  [<ffffffff8811a74d>] :qla4xxx:qla4xxx_eh_device_reset+0x151/0x267
<8> [<ffffffff80151a1a>] kobject_release+0x0/0x9
<8> [<ffffffff880779a5>] :scsi_mod:scsi_try_bus_device_reset+0x21/0x42
<8> [<ffffffff88078850>] :scsi_mod:scsi_eh_ready_devs+0x1ad/0x493
<8> [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
<8> [<ffffffff88079237>] :scsi_mod:scsi_error_handler+0x323/0x4ac
<8> [<ffffffff88078f14>] :scsi_mod:scsi_error_handler+0x0/0x4ac
<8> [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
<8> [<ffffffff80032bdc>] kthread+0xfe/0x132
<8> [<ffffffff8005efb1>] child_rip+0xa/0x11
<8> [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
<8> [<ffffffff801f85a7>] hcd_submit_urb+0x0/0x752
<8> [<ffffffff80032ade>] kthread+0x0/0x132
<8> [<ffffffff8005efa7>] child_rip+0x0/0x11
<8>Code: 7e f9 e9 f9 fe ff ff f3 90 83 3f 00 7e f9 e9 f8 fe ff ff f3
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
Red Hat nash version starting
 session1: Cannot notify userspace of session event 106. Check iscsi daemon
 session2: Cannot notify userspace of session event 106. Check iscsi daemon
                        Welcome to Red Hat Enterprise Linux Server
                        Press 'I' to enter interactive startup.
Setting clock  (utc): Fri Apr 30 10:34:52 IST 2010 [  OK  ]

Starting udev: qla3xxx QLogic ISP3XXX Network Driver
qla3xxx Driver name: qla3xxx, Version: v2.
Kernel panic - not syncing: Out of memory and no killable processes...
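
For anyone triaging the trace above, a small script can pull the frames out of the serial log and list which ones belong to a given module. The following is only a sketch: the function names `parse_frames` and `frames_in_module` are made up for illustration, and the regular expression assumes the 2.6.18-style frame format seen in this log (it also tolerates copies of the log where the leading `<` of an address has been lost).

```python
import re

# Matches one frame such as
#   [<ffffffff88120687>] :qla4xxx:qla4xxx_intr_handler+0x3c/0x201
# The leading "<" is optional because some copies of the log have lost it.
FRAME_RE = re.compile(
    r"\[<?(?P<addr>[0-9a-f]{16})>\]\s+"
    r"(?::(?P<module>\w+):)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frames(text):
    """Return the frames found in `text`, in the order they appear."""
    frames = []
    for m in FRAME_RE.finditer(text):
        frames.append({
            "addr": int(m.group("addr"), 16),
            # Frames without a ":module:" prefix belong to the core kernel.
            "module": m.group("module") or "kernel",
            "symbol": m.group("symbol"),
            "offset": int(m.group("offset"), 16),
        })
    return frames


def frames_in_module(frames, module):
    """Return the symbol names of all frames that belong to `module`."""
    return [f["symbol"] for f in frames if f["module"] == module]
```

Run over the Call Trace section of this log with `module="qla4xxx"`, it should report `qla4xxx_intr_handler` followed by `qla4xxx_eh_device_reset`, which matches what the trace shows: the interrupt handler was entered on CPU 0 while the device-reset handler was running there.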

Version-Release number of selected component (if applicable):
RHEL 5.5 kernel 2.6.18-194.el5.ELsmp

How reproducible: Unknown

Steps to Reproduce:

1) The setup is a 1x2 (server x array) iSCSI configuration: a single server with a QLE4062c host adapter connected to an LSI 7900 array and an LSI 4988 array via a Dell 6224 switch.

2) Create a 50 GB LUN on one array and map it to the host. Shut down the host and remove its hard disk. Then power it up and configure the host adapter BIOS to boot from the presented LUN.

3) Install RHEL 5.5 x86_64 (2.6.18-194.el5) on the mapped LUN. At the point of installation, only one path to the LUN is presented to the host.

4) Install the IBM RDAC (LSI MPP) multipathing driver on the host. Once the multipathing driver is installed, present the second path to the LUN to the host.

5) Create 64 LUNs of 1 GB on each array and map them to the host. The host sees the volumes successfully. Create ext3 filesystems on 6 volumes and mount them.

6) Run a single thread of I/O to all volumes (excluding the 50 GB LUN on which the OS is installed) with block sizes from 32 to 4096. The I/O runs on the raw devices as well as on the 6 mounted filesystems.

7) Run the CFDF script, which sleeps five minutes between steps:
            a. Offline both A storage controllers.
            b. Fail a drive.
            c. Online both A storage controllers.
            d. Reconstruct the drive.
            e. Repeat the above steps, alternating controllers and using a number of different drives.

8) After I/O has run successfully for around 12 hours, the host hits the panic shown above.
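
The CFDF script itself is not attached to this report. As a rough sketch of the loop described in step 7, the cycle could look like the following. Everything here is hypothetical: the `ops` object and its four methods (`offline_controllers`, `fail_drive`, `online_controllers`, `reconstruct_drive`) stand in for whatever array management commands the real script issues, and the controller slot names "A" and "B" are assumed from the step description.

```python
import itertools
import time


def run_cfdf(ops, drives, cycles, pause_s=300, sleep=time.sleep):
    """Run `cycles` controller-fail/drive-fail passes.

    `ops` is a hypothetical object wrapping the array management commands.
    `pause_s` defaults to 300 seconds (the five-minute sleep from step 7).
    `sleep` is injectable so the loop can be dry-run without waiting.
    """
    slots = itertools.cycle(["A", "B"])    # alternate controllers each pass
    drive_iter = itertools.cycle(drives)   # rotate through the given drives
    for _ in range(cycles):
        slot = next(slots)
        drive = next(drive_iter)
        steps = [
            (ops.offline_controllers, slot),   # a. offline the controllers
            (ops.fail_drive, drive),           # b. fail a drive
            (ops.online_controllers, slot),    # c. online the controllers
            (ops.reconstruct_drive, drive),    # d. reconstruct the drive
        ]
        for action, arg in steps:
            action(arg)
            sleep(pause_s)                     # pause after every step
```

Passing a recording stub as `ops` and `sleep=lambda s: None` gives a dry run that prints or logs the step order without touching any hardware.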

Actual results:
The host hits a kernel panic.

Expected results:
I/O should run without any failures.

Additional info:
The host had the IBM RDAC (LSI MPP) multipathing driver installed on it.

Comment 1 Alan 2010-09-09 01:34:12 UTC
I am seeing a similar kernel panic with my QLA4010c, using the qisioctl driver instead of the IBM RDAC one. Since my qla4xxx module is included in my initrd file, this actually makes my machine unbootable with any kernel newer than 2.6.18-164.15.1.el5.
I just verified with kernel 2.6.18-194.11.3.el5 and the error persists. This should probably not be "low" priority, as it effectively precludes anyone from running RHEL 5.5 and booting off iSCSI with a qla4xxx card.

Comment 2 Abdel Sadek 2011-08-01 21:27:42 UTC
This can be closed from NetApp's perspective, as we have dropped use of and support for the QLogic QLE406x and QLE405x.

Comment 3 Chris Tatman 2011-10-10 13:36:10 UTC
This can be closed per comment #2.


