Bug 220783 - [RHEL5 beta2] Kernel Panic occurred during I/O stress test in EX8350 environment
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Assigned To: Jun'ichi Nomura (Red Hat)
QA Contact: Brian Brock
Keywords: Regression
Reported: 2006-12-26 15:00 EST by Issue Tracker
Modified: 2007-11-30 17:07 EST
Fixed In Version: RC
Doc Type: Bug Fix
Last Closed: 2007-02-07 21:03:01 EST

Attachments
zipfile from NEC containing patches provided by Promise (2.18 KB, application/octet-stream)
2006-12-26 15:04 EST, Jeff Layton
posted patch (3.17 KB, patch)
2007-01-16 11:08 EST, Jun'ichi Nomura (Red Hat)
Description Issue Tracker 2006-12-26 15:00:39 EST
Escalated to Bugzilla from IssueTracker
Comment 1 Issue Tracker 2006-12-26 15:00:53 EST
Description of Problem:
   A kernel panic occurred during a heavy I/O stress test (lisa) on a Promise EX8350 system.
   This issue occurs only with the stex.ko included in RHEL5 beta2 and later.
   Since the stex.ko included in RHEL4 Update4 already has Promise's patches for this issue applied, this is a regression.
 Version-Release of Selected Component:
   RHEL5 Beta2 (Kernel 2.6.18-1.2747.el5)
   stex.ko (
   We confirmed that this problem can be reproduced by running a heavy I/O stress test for a few hours.
 Steps to Reproduce:
   1. Install RHEL5 beta2 with EX8350 driver disk
      note1: RHEL5 doesn't yet contain the PCI id for EX8350, so we use a
             driver disk created from the inbox driver.
      note2: Please create many partitions for the I/O stress test.
             We created 15 partitions (see the sysreport).
   2. Boot system in run level 5
   3. Install WebPAM in the X Window environment.
     1) expand *tar.gz
        # tar zxvf WebPAM_Installer.tar.gz
     2) execute WebPAM_Installer.bin
        # ./WebPAM_Installer.bin
     3) reboot
   4. Boot system in run level 3
   5. Mount all partitions.
      note: We mounted sda1-15 and sdb1-4.
   6. Run lisa (the I/O stress test).
   7. The panic occurs after a few hours.
 Actual results:
   We got the following messages in the dump.
Kernel BUG at lib/list_debug.c:26
invalid opcode: 0000 [1] SMP 
last sysfs file: /devices/pci0000:00/0000:00:00.0/class
CPU 0 
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 dm_mirror dm_mod video sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg(U) i2c_i801 ide_cd e1000 i2c_core cdrom serio_raw pcspkr shpchp usb_storage stex sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 5909, comm: diff Not tainted 2.6.18-1.2747.el5 #1
RIP: 0010:[<ffffffff8013fa82>]  [<ffffffff8013fa82>] __list_add+0x24/0x68
RSP: 0018:ffff810024c557b8  EFLAGS: 00010082
RAX: 0000000000000058 RBX: ffff810014b8bc58 RCX: ffffffff80355da8
RDX: ffffffff80355da8 RSI: 0000000000000000 RDI: ffffffff80355da0
RBP: ffff81003f244710 R08: ffffffff80355da8 R09: 0000000000000046
R10: ffff810024c55458 R11: 0000000000000080 R12: ffff81001b1433d8
R13: ffff81003f676c98 R14: ffff81001b1433d8 R15: 0000000000001000
FS:  00002aaaaaad1f40(0000) GS:ffffffff80406000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaaadcf00f CR3: 000000001a860000 CR4: 00000000000006e0
Process diff (pid: 5909, threadinfo ffff810024c54000, task ffff81002c638040)
Stack:  ffff81001b1433d8 ffff81003f244700 ffff81003f181000 ffffffff801346e4
 0000000000000003 0000000000000003 ffff81003f13b000 ffffffff8807fab8
 0000000000000003 ffff81001b1433d8 ffff81003f676c98 00000000007dc050
Call Trace:
 [<ffffffff801346e4>] blk_queue_start_tag+0xbe/0xc8
 [<ffffffff8807fab8>] :scsi_mod:scsi_request_fn+0x13a/0x390
 [<ffffffff801318c4>] elv_insert+0x131/0x1fe
 [<ffffffff8000bb16>] __make_request+0x3da/0x429
 [<ffffffff8001b7e4>] generic_make_request+0x219/0x230
 [<ffffffff80032db2>] submit_bio+0xcd/0xd4
 [<ffffffff800e3686>] mpage_bio_submit+0x22/0x26
 [<ffffffff800387e7>] mpage_readpages+0x11d/0x131
 [<ffffffff800126e3>] __do_page_cache_readahead+0xfe/0x1ce
 [<ffffffff80031dc2>] blockable_page_cache_readahead+0x53/0xb2
 [<ffffffff8002ea23>] make_ahead_window+0x82/0x9e
 [<ffffffff800137ef>] page_cache_readahead+0x17f/0x1af
 [<ffffffff8000bc8b>] do_generic_mapping_read+0x126/0x3f8
 [<ffffffff8000c0c6>] __generic_file_aio_read+0x169/0x1bc
 [<ffffffff8001645a>] generic_file_aio_read+0x34/0x39
 [<ffffffff8000c8d8>] do_sync_read+0xc7/0x104
 [<ffffffff8000b1e0>] vfs_read+0xcb/0x171
 [<ffffffff80011451>] sys_read+0x45/0x6e
 [<ffffffff8005b641>] tracesys+0xd1/0xdc
DWARF2 unwinder stuck at tracesys+0xd1/0xdc
Leftover inexact backtrace:

Code: 0f 0b 68 d3 e9 28 80 c2 1a 00 48 8b 55 00 48 39 da 74 1b 48 
RIP  [<ffffffff8013fa82>] __list_add+0x24/0x68
 RSP <ffff810024c557b8>
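
The trace above stops in __list_add at lib/list_debug.c, the list-debugging variant that checks the neighbouring nodes before linking a new entry. The following is a minimal user-space sketch of that kind of check, not the kernel source. The names list_node, list_init and checked_list_add are hypothetical, and the sketch returns false where a debug kernel would call BUG().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal doubly linked list node, modelled on the kernel's struct list_head. */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

/* Make a node an empty circular list that points at itself. */
static void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

/*
 * Insert new_node between prev and next, but first verify that the two
 * neighbours still point at each other. Returns false (instead of calling
 * BUG() as a debug kernel would) when the list is already corrupted.
 */
static bool checked_list_add(struct list_node *new_node,
                             struct list_node *prev,
                             struct list_node *next)
{
    assert(new_node != NULL && prev != NULL && next != NULL);

    if (next->prev != prev || prev->next != next)
        return false;   /* corruption detected; leave the list untouched */

    next->prev = new_node;
    new_node->next = next;
    new_node->prev = prev;
    prev->next = new_node;
    return true;
}
```

A failed check of this kind only reports that the list was already damaged by some earlier, unserialized update. In the trace the caller is blk_queue_start_tag, so the corrupted list is presumably one that function maintains.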
 Expected results:
   The system should not panic.
 Driver and Hardware Environment:
    Machine                   : NEC Express5800/120Eh(Woodcrest)
    BIOS                      : 1.0.5G39
    Extended Memory           : 1GB
    Hyper-Threading Technology: none
    RAID configuration        : /dev/sda,/dev/sdb(RAID5 split configuration)
 Business impact:
    NEC expects to ship 2000 servers per year equipped with Promise SuperTrak EX8350/4350.  If the problem occurs in EX8350/EX4350 environments, these customers will not be able to use RHEL5.

 Additional Info:
  We received two patches from Promise that fix this problem.
  They are attached in "stex_for_rhel5.zip".
  Please apply these patches to the RHEL5b2 driver in the following order.

    1. stex_modify_function_stex_intr_because_irq_returns_prototype.txt
    2. stex_add_function_to_allocate_tag_rather_than_use_shared_tag.txt

  We confirmed that this issue does not occur once the patches for stex.ko are applied.

  Please open a Bugzilla for this issue and grant access to NEC Confidential and to Ed Lin, who is an engineer at Promise,
  so that Ed Lin can also add information if more is needed.
  Ed Lin's e-mail address is ed.lin@promise.com.

This event sent from IssueTracker by jlayton  [Support Engineering Group]
 issue 110001
Comment 2 Issue Tracker 2006-12-26 15:01:09 EST
File uploaded: Patch_stex_for_rhel5.zip

This event sent from IssueTracker by jlayton  [Support Engineering Group]
 issue 110001
it_file 78511
Comment 3 Issue Tracker 2006-12-26 15:01:23 EST
File uploaded: sysreport_stex_.tar.bz2

This event sent from IssueTracker by jlayton  [Support Engineering Group]
 issue 110001
it_file 78512
Comment 4 Issue Tracker 2006-12-26 15:01:38 EST


Proposed patches are available. It seems that NEC has already worked with
Promise on this issue.


> Please open a Bugzilla for this issue and grant access to NEC Confidential
> and to Ed Lin, who is an engineer at Promise,

So, please open the bugzilla and make it public.

Issue escalated to Support Engineering Group by: mmatsuya.
mmatsuya assigned to issue for NEC-Support.
Category set to: Kernel
Internal Status set to 'Waiting on SEG'
Priority set to: 5

This event sent from IssueTracker by jlayton  [Support Engineering Group]
 issue 110001
Comment 5 Jeff Layton 2006-12-26 15:03:12 EST
This may be related to BZ 219838, though the stack trace there looks different.
Comment 6 Jeff Layton 2006-12-26 15:04:29 EST
Created attachment 144379
zipfile from NEC containing patches provided by Promise
Comment 7 Larry Troan 2007-01-01 18:09:23 EST
Promise asks if this bug can be made public so that they can view it.

The bug is already open to NEC. If there is a problem with opening it to the
public, we can add specific names from Promise to the cc list instead.

Awaiting response from NEC.
Comment 8 Issue Tracker 2007-01-08 21:15:45 EST
    We have no problem with opening this bug to the public.
    Would you please change it to a public bug?


This event sent from IssueTracker by katou-txa 
 issue 110001
Comment 9 Larry Troan 2007-01-09 15:24:09 EST
Per comment #8, making this bug public and opening comments up with NEC's
Comment 10 Larry Troan 2007-01-09 15:30:07 EST
Note that this bug fix is not in RHEL5.0. 

Assuming the patch is accepted upstream, the earliest the fix will be available
is RHEL5.1. Flagging as a 5.1 candidate.
Comment 11 RHEL Product and Program Management 2007-01-09 15:44:59 EST
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.
Comment 18 Peter Martuccelli 2007-01-15 11:22:49 EST
As we are short on time please post the patch for review while development
determines the impact on taking this patch in at such a late time.

No devel ACK at this time.
Comment 19 Larry Troan 2007-01-15 15:04:00 EST
Per Engineering meeting on 1/15/07,
1) need to fix the "Shared TCQ Map" code rather than the current first patch
   which is a hack to a specific driver and does not address the real problem
   that's in common code.
2) second patch is OK but being held until we have a valid first patch.
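
The posted patch is not reproduced in this report, so the following is only a user-space illustration of the general hazard with a tag map shared by several queues: if finding a free tag and marking it busy are two separate steps, two queues holding different locks can claim the same tag. One way to avoid that is to claim each bit with a single atomic operation. The names shared_tag_map, tag_get, tag_put, TAG_DEPTH and NO_TAG are hypothetical, and this sketch should not be read as the fix that went into the kernel.

```c
#include <assert.h>
#include <stdatomic.h>

#define TAG_DEPTH 32          /* hypothetical queue depth, one bit per tag */
#define NO_TAG    (-1)

/* One tag bitmap shared by every request queue of the same host adapter. */
struct shared_tag_map {
    atomic_uint busy;         /* bit n set means tag n is in flight */
};

/*
 * Claim the lowest free tag. The fetch-or is atomic, so two queues that
 * race for the same bit cannot both see it as previously clear.
 */
static int tag_get(struct shared_tag_map *map)
{
    for (int tag = 0; tag < TAG_DEPTH; tag++) {
        unsigned int bit = 1u << tag;
        unsigned int old = atomic_fetch_or(&map->busy, bit);
        if (!(old & bit))
            return tag;       /* we flipped it from 0 to 1, so it is ours */
    }
    return NO_TAG;            /* every tag is in use */
}

/* Release a tag previously returned by tag_get(). */
static void tag_put(struct shared_tag_map *map, int tag)
{
    assert(tag >= 0 && tag < TAG_DEPTH);
    atomic_fetch_and(&map->busy, ~(1u << tag));
}
```

Any per-tag bookkeeping that lives next to the bitmap (for example a list of busy requests) needs its own serialization as well; making the bitmap atomic does not protect it.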
Comment 23 Jun'ichi Nomura (Red Hat) 2007-01-16 10:59:22 EST
Patch posted for review.
Comment 24 Jun'ichi Nomura (Red Hat) 2007-01-16 11:08:52 EST
Created attachment 145698
posted patch
Comment 26 Jay Turner 2007-01-16 13:30:53 EST
This one scares me . . . will need quite a bit of help with testing and
qualification.  QE ack for RHEL5.
Comment 27 Don Zickus 2007-01-18 11:35:33 EST
in 2.6.18-4.el5
Comment 29 RHEL Product and Program Management 2007-02-07 21:03:01 EST
A package has been built which should help the problem described in 
this bug report. This report is therefore being closed with a resolution 
of CURRENTRELEASE. You may reopen this bug report if the solution does 
not work for you.
