Bug 499782

Summary:	RHEL4 : HP-Japan : kernel BUG at drivers/block/cfq-iosched.c:630!
Product:	Red Hat Enterprise Linux 4	Reporter:	Lachlan McIlroy <lmcilroy>
Component:	kernel	Assignee:	Jeff Moyer <jmoyer>
Status:	CLOSED WONTFIX	QA Contact:	Red Hat Kernel QE team <kernel-qe>
Severity:	high	Docs Contact:
Priority:	high
Version:	4.0	CC:	coughlan, jmoyer, jwest, thenzl, vgaikwad
Target Milestone:	rc
Target Release:	4.8
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2012-06-14 20:21:26 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---

Description Lachlan McIlroy 2009-05-08 06:33:14 UTC

Description of problem:

SYSTEM MAP: System.map-2.6.9-34.0.2
DEBUG KERNEL: vmlinux-2.6.9-34.0.2.ELsmp (2.6.9-34.0.2.ELsmp)
   DUMPFILE: vmcore
       CPUS: 2
       DATE: Wed Apr 22 23:20:46 2009
     UPTIME: 585 days, 20:36:08
LOAD AVERAGE: 0.63, 0.42, 0.30
      TASKS: 331
   NODENAME: mc-ldp03
    RELEASE: 2.6.9-34.0.2.ELsmp
    VERSION: #1 SMP Fri Jun 30 10:33:58 EDT 2006
    MACHINE: i686  (3803 Mhz)
     MEMORY: 4.4 GB
      PANIC: "kernel BUG at drivers/block/cfq-iosched.c:630!"
        PID: 1616
    COMMAND: "kjournald"
       TASK: f76f83b0  [THREAD_INFO: c3720000]
        CPU: 0
      STATE: TASK_RUNNING (PANIC)


------------[ cut here ]------------
kernel BUG at drivers/block/cfq-iosched.c:630!
invalid operand: 0000 [#1]
SMP
Modules linked in: iptable_filter ip_tables parport_pc parport st seos(U) eAC_mini(U) sg cpqci(U) netconsole netdump dm_mirror dm_mod uhci_hcd ehci_hcd hw_random e1000(U) tg3 bond1(U) bonding(U) floppy ext3 jbd cciss sd_mod scsi_mod
CPU:    0
EIP:    0060:[<c022a96f>]    Tainted: P      VLI
EFLAGS: 00010046   (2.6.9-34.0.2.ELsmp)
EIP is at cfq_put_request+0x15/0x86
eax: f7d31028   ebx: c375ad6c   ecx: c3777f10   edx: f7d11c40
esi: f7d31028   edi: 00000001   ebp: 00000000   esp: c03eaf88
ds: 007b   es: 007b   ss: 0068
Process kjournald (pid: 1616, threadinfo=c03ea000 task=f76f83b0)
Stack: c375ad6c f7d31028 c02219bc c0223b19 f7f5cc80 c375ad6c 00000000 c02248f6
      00000000 f7400000 00000000 f7df3000 f885569c 00000001 00000001 00000000
      00000082 f7dd4640 00000001 00000000 c3720ce8 c0107472 c3720ccc c03ea000
Call Trace:
[<c02219bc>] elv_put_request+0x9/0xa
[<c0223b19>] __blk_put_request+0x56/0x73
[<c02248f6>] end_that_request_last+0xa7/0xbb
[<f885569c>] do_cciss_intr+0x341/0x4b4 [cciss]
[<c0107472>] handle_IRQ_event+0x25/0x4f
[<c01079d2>] do_IRQ+0x11c/0x1ae
=======================
[<c02d304c>] common_interrupt+0x18/0x20
[<c022007b>] show_pools+0x73/0xe2
[<c0224174>] __make_request+0x452/0x46c
[<c022431c>] generic_make_request+0x18e/0x19e
[<c0120291>] autoremove_wake_function+0x0/0x2d
[<c02243f6>] submit_bio+0xca/0xd2
[<c015e7c9>] bio_alloc+0x100/0x168
[<c015e180>] submit_bh+0x141/0x166
[<f8863a62>] journal_commit_transaction+0x847/0xfc1 [jbd]
[<c0120291>] autoremove_wake_function+0x0/0x2d
[<c0120291>] autoremove_wake_function+0x0/0x2d
[<f8865e8d>] kjournald+0xc7/0x219 [jbd]
[<c0120291>] autoremove_wake_function+0x0/0x2d
[<c0120291>] autoremove_wake_function+0x0/0x2d
[<c011d549>] schedule_tail+0x31/0xa7
[<f8865dc0>] commit_timeout+0x0/0x5 [jbd]
[<f8865dc6>] kjournald+0x0/0x219 [jbd]
[<c01041f5>] kernel_thread_helper+0x5/0xb
Code: 04 24 39 4c 86 18 b8 00 00 00 00 0f 4f e8 5e 89 e8 5b 5e 5f 5d c3 56 89 c6 53 89 d3 8b 4b 40 8b 50 4c 85 c9 74 2e 39 58 08 75 08 <0f> 0b 76 02 1c 7a 2f c0 8d 41 20 39 41 20 74 08 0f 0b 77 02 1c

The core can be found on core-i386.gsslab.rdu.redhat.com.
Log in with your Kerberos name and password:
$ cd /cores/20090429212851/work
/cores/20090429212851/work$ ./crash 

I think the bug was introduced by the patch linux-2.6.9-cciss-update.patch, which made this change:

-#define CCISS_LOCK(i)  (hba[i]->queue->queue_lock)
+#define CCISS_LOCK(i)  (&hba[i]->lock)

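I think this change causes do_cciss_intr() to acquire the driver's private lock instead of the queue lock. If so, the completion path in the call trace above (end_that_request_last() down to cfq_put_request()) runs without the queue lock held.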
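
To make the suspected effect of the macro change concrete, here is a minimal user-space sketch. It is not kernel or cciss driver code: every toy_* name is hypothetical, the "locks" are plain flags rather than spinlocks, and the only parts taken from this report are the two macro bodies (adapted to take a pointer instead of an index). It only illustrates the hypothesis that, with the new definition, the interrupt handler holds a different lock than the one the block layer's completion path expects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: a "lock" is just a flag recording whether it is held. */
struct toy_lock {
    bool held;
};

/* Hypothetical stand-in for the request queue and its queue_lock pointer. */
struct toy_queue {
    struct toy_lock *queue_lock;
};

/* Hypothetical stand-in for one hba[] entry: a private lock plus its queue. */
struct toy_hba {
    struct toy_lock lock;
    struct toy_queue *queue;
};

/* The two macro bodies from the patch, adapted to take a pointer. */
#define TOY_CCISS_LOCK_OLD(h) ((h)->queue->queue_lock)
#define TOY_CCISS_LOCK_NEW(h) (&(h)->lock)

/* Stand-in for the completion path: reports whether the queue lock is held. */
static bool toy_put_request(const struct toy_queue *q)
{
    assert(q != NULL && q->queue_lock != NULL);
    return q->queue_lock->held;
}

/* Stand-in for the interrupt handler: take the given lock, complete, release. */
static bool toy_intr(struct toy_hba *h, struct toy_lock *l)
{
    bool queue_lock_held;

    assert(h != NULL && l != NULL);
    l->held = true;
    queue_lock_held = toy_put_request(h->queue);
    l->held = false;
    return queue_lock_held;
}

/* Returns true if the completion path ran with the queue lock held. */
static bool toy_run(bool use_new_macro)
{
    struct toy_lock qlock = { false };
    struct toy_queue q = { &qlock };
    struct toy_hba h = { { false }, &q };
    struct toy_lock *l = use_new_macro ? TOY_CCISS_LOCK_NEW(&h)
                                       : TOY_CCISS_LOCK_OLD(&h);

    return toy_intr(&h, l);
}
```

In this model, toy_run(false) (old macro) returns true because the handler holds the queue lock while completing the request, and toy_run(true) (new macro) returns false because only the private lock is held. Whether the real driver behaves this way on the failing system is still an assumption.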

Comment 1 Jeff Moyer 2009-08-21 14:47:02 UTC

I'll take this bug, if that's okay with you, Tomas.

Cheers,
Jeff

Comment 2 Tomas Henzl 2009-08-24 10:19:24 UTC

(In reply to comment #1)
> I'll take this bug, if that's okay with you, Tomas.
> 
> Cheers,
> Jeff  
OK, thanks,
Tomas

Comment 3 Jeff Moyer 2010-10-14 16:56:12 UTC

I've seen a sprinkling of these bugs across RHEL 4 and RHEL 5, and they all seem to involve cciss devices (and multiple different driver versions).  I'm inclined to think that this is a firmware issue.  If someone is able to reproduce the problem reliably, then we can work with HP to zero in on the problem.  Does anyone have such an environment?