| Summary: | kernel BUG at fs/bio.c:221! | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Lachlan McIlroy <lmcilroy> |
| Component: | kernel | Assignee: | Jeff Moyer <jmoyer> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 5.3 | CC: | eguan, qcai, rwheeler, vgaikwad |
| Target Milestone: | rc | ||
| Target Release: | 5.7 | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2011-10-24 01:44:37 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
|
Description
Lachlan McIlroy
2011-02-21 06:00:15 UTC
vmcore is available from core-i386.gsslab.rdu.redhat.com:/cores/20110216224912
KERNEL: usr/lib/debug/lib/modules/2.6.18-128.el5PAE/vmlinux
DUMPFILE: ./00419896.vmcore
CPUS: 2
DATE: Tue Feb 8 02:26:04 2011
UPTIME: 83 days, 04:11:29
LOAD AVERAGE: 1.74, 1.40, 1.57
TASKS: 553
NODENAME: hostname
RELEASE: 2.6.18-128.el5PAE
VERSION: #1 SMP Wed Dec 17 12:02:33 EST 2008
MACHINE: i686 (3600 Mhz)
MEMORY: 4.4 GB
PANIC: "kernel BUG at fs/bio.c:221!"
crash> bt
PID: 22025 TASK: f4677550 CPU: 0 COMMAND: "smtp"
#0 [c0732e24] crash_kexec at c0440b3a
#1 [c0732e68] die at c04064bc
#2 [c0732e98] do_invalid_op at c0406c11
#3 [c0732f48] error_code (via invalid_op) at c0405a87
EAX: 00000000 EBX: f7fd4500 ECX: 00000002 EDX: f7fd4500 EBP: 00000000
DS: 007b ESI: 00000000 ES: 007b EDI: c0474173
CS: 0060 EIP: c0475828 ERR: ffffffff EFLAGS: 00210246
#4 [c0732f7c] bio_put at c0475828
#5 [c0732f80] end_bio_bh_io_sync at c04741a3
#6 [c0732f88] bio_endio at c0475b6f
#7 [c0732f9c] cciss_softirq_done at f88bcea7
#8 [c0732fd4] blk_done_softirq at c04dba2f
#9 [c0732fe4] __do_softirq at c042900d
--- <soft IRQ> ---
#0 [d4620f94] do_softirq at c04073d7
#1 [d4620fa0] do_IRQ at c04074d1
#2 [d4620fb8] common_interrupt at c0405941
EAX: 00000000 EBX: 09c772e0 ECX: 00000000 EDX: 00000093
DS: 007b ESI: 09c772e0 ES: 007b EDI: 00000093
SS: 007b ESP: bfafd700 EBP: bfafd738
CS: 0073 EIP: 08078944 ERR: ffffff36 EFLAGS: 00200206
crash> dis -r c04741a3
0xc0474173 <end_bio_bh_io_sync>: push %ebx
0xc0474174 <end_bio_bh_io_sync+1>: mov %eax,%ebx
0xc0474176 <end_bio_bh_io_sync+3>: cmpl $0x0,0x20(%ebx)
0xc047417a <end_bio_bh_io_sync+7>: mov %ecx,%edx
0xc047417c <end_bio_bh_io_sync+9>: mov 0x3c(%eax),%ecx
0xc047417f <end_bio_bh_io_sync+12>: mov $0x1,%eax
0xc0474184 <end_bio_bh_io_sync+17>: jne 0xc04741aa <end_bio_bh_io_sync+55>
0xc0474186 <end_bio_bh_io_sync+19>: cmp $0xffffffa1,%edx
0xc0474189 <end_bio_bh_io_sync+22>: jne 0xc0474196 <end_bio_bh_io_sync+35>
0xc047418b <end_bio_bh_io_sync+24>: lock btsl $0x7,0x10(%ebx)
0xc0474191 <end_bio_bh_io_sync+30>: lock btsl $0xd,(%ecx)
0xc0474196 <end_bio_bh_io_sync+35>: mov 0x10(%ebx),%edx
0xc0474199 <end_bio_bh_io_sync+38>: mov %ecx,%eax
0xc047419b <end_bio_bh_io_sync+40>: and $0x1,%edx
0xc047419e <end_bio_bh_io_sync+43>: call *0x20(%ecx)
0xc04741a1 <end_bio_bh_io_sync+46>: mov %ebx,%eax
0xc04741a3 <end_bio_bh_io_sync+48>: call 0xc047581f <bio_put>
crash> dis -r c0475828
0xc047581f <bio_put>: mov %eax,%edx
0xc0475821 <bio_put+2>: mov 0x38(%eax),%eax
0xc0475824 <bio_put+5>: test %eax,%eax
0xc0475826 <bio_put+7>: jne 0xc0475830 <bio_put+17>
0xc0475828 <bio_put+9>: ud2a
bio is still in %ebx
crash> bio f7fd4500
struct bio {
bi_sector = 10003368,
bi_next = 0x0,
bi_bdev = 0xf7d09440,
bi_flags = 9,
bi_rw = 1,
bi_vcnt = 1,
bi_idx = 0,
bi_phys_segments = 1,
bi_hw_segments = 1,
bi_size = 0,
bi_hw_front_size = 4096,
bi_hw_back_size = 4096,
bi_max_vecs = 1,
bi_io_vec = 0xd8dd5ac0,
bi_end_io = 0xc0474173 <end_bio_bh_io_sync>,
bi_cnt = {
counter = 0
},
bi_private = 0xf12691a4,
bi_destructor = 0xc0475b16 <bio_fs_destructor>
}
crash> kmem -s f7fd4500
CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE
f7fea980 bio 128 258 570 19 4k
SLAB MEMORY TOTAL ALLOCATED FREE
f7fd4000 f7fd4100 30 16 14
FREE / [ALLOCATED]
f7fd4500 (cpu 0 cache)
Yep, it's been freed. Try to find the command structure:
Will need to pull the request structure out of cciss_softirq_done and then get the command from it.
The request is the first argument (%eax) to cciss_softirq_done() and that comes from %ebx.
crash> dis -r c04dba2f
0xc04db9e5 <blk_done_softirq>: push %ebx
0xc04db9e6 <blk_done_softirq+1>: sub $0x8,%esp
0xc04db9e9 <blk_done_softirq+4>: cli
0xc04db9ea <blk_done_softirq+5>: mov %esp,%edx
0xc04db9ec <blk_done_softirq+7>: mov $0xc072b648,%eax
0xc04db9f1 <blk_done_softirq+12>: and $0xfffff000,%edx
0xc04db9f7 <blk_done_softirq+18>: mov 0x10(%edx),%edx
0xc04db9fa <blk_done_softirq+21>: add -0x3f913400(,%edx,4),%eax
0xc04dba01 <blk_done_softirq+28>: mov (%eax),%edx
0xc04dba03 <blk_done_softirq+30>: mov %edx,(%esp)
0xc04dba06 <blk_done_softirq+33>: mov %esp,0x4(%edx)
0xc04dba09 <blk_done_softirq+36>: mov 0x4(%eax),%edx
0xc04dba0c <blk_done_softirq+39>: mov %edx,0x4(%esp)
0xc04dba10 <blk_done_softirq+43>: mov %esp,(%edx)
0xc04dba12 <blk_done_softirq+45>: mov %eax,0x4(%eax)
0xc04dba15 <blk_done_softirq+48>: mov %eax,(%eax)
0xc04dba17 <blk_done_softirq+50>: sti
0xc04dba18 <blk_done_softirq+51>: jmp 0xc04dba32 <blk_done_softirq+77>
0xc04dba1a <blk_done_softirq+53>: mov 0x4(%ebx),%edx
0xc04dba1d <blk_done_softirq+56>: lea -0x8(%ebx),%eax
0xc04dba20 <blk_done_softirq+59>: mov (%ebx),%ecx
0xc04dba22 <blk_done_softirq+61>: mov %edx,0x4(%ecx)
0xc04dba25 <blk_done_softirq+64>: mov %ecx,(%edx)
0xc04dba27 <blk_done_softirq+66>: mov %ebx,0x4(%ebx)
0xc04dba2a <blk_done_softirq+69>: mov %ebx,(%ebx)
0xc04dba2c <blk_done_softirq+71>: mov 0x64(%eax),%edx
0xc04dba2f <blk_done_softirq+74>: call *0x6c(%edx)
%ebx is saved onto the stack in the 4th push.
crash> dis -r f88bcea7
0xf88bce3c <cciss_softirq_done>: push %ebp
0xf88bce3d <cciss_softirq_done+1>: push %edi
0xf88bce3e <cciss_softirq_done+2>: push %esi
0xf88bce3f <cciss_softirq_done+3>: push %ebx
0xf88bce40 <cciss_softirq_done+4>: sub $0x24,%esp
0xf88bce43 <cciss_softirq_done+7>: mov %eax,0x4(%esp)
0xf88bce47 <cciss_softirq_done+11>: mov 0x40(%eax),%eax
0xf88bce4a <cciss_softirq_done+14>: mov %eax,%edx
0xf88bce4c <cciss_softirq_done+16>: mov %eax,0xc(%esp)
0xf88bce50 <cciss_softirq_done+20>: mov 0x22c(%eax),%eax
0xf88bce56 <cciss_softirq_done+26>: movzbl 0x1(%edx),%ecx
0xf88bce5a <cciss_softirq_done+30>: mov -0x7737780(,%eax,4),%ebp
0xf88bce61 <cciss_softirq_done+37>: mov 0x15(%edx),%al
0xf88bce64 <cciss_softirq_done+40>: and $0xffffffc0,%eax
0xf88bce67 <cciss_softirq_done+43>: cmp $0x80,%al
0xf88bce69 <cciss_softirq_done+45>: sete %al
0xf88bce6c <cciss_softirq_done+48>: xor %edx,%edx
0xf88bce6e <cciss_softirq_done+50>: movzbl %al,%eax
0xf88bce71 <cciss_softirq_done+53>: inc %eax
0xf88bce72 <cciss_softirq_done+54>: jmp 0xf88bce82 <cciss_softirq_done+70>
0xf88bce74 <cciss_softirq_done+56>: cmp $0x3,%eax
0xf88bce77 <cciss_softirq_done+59>: jne 0xf88bce81 <cciss_softirq_done+69>
0xf88bce79 <cciss_softirq_done+61>: ud2a
0xf88bce81 <cciss_softirq_done+69>: inc %edx
0xf88bce82 <cciss_softirq_done+70>: cmp %ecx,%edx
0xf88bce84 <cciss_softirq_done+72>: jl 0xf88bce74 <cciss_softirq_done+56>
0xf88bce86 <cciss_softirq_done+74>: mov 0x4(%esp),%ecx
0xf88bce8a <cciss_softirq_done+78>: mov 0x48(%ecx),%esi
0xf88bce8d <cciss_softirq_done+81>: mov 0x34(%ecx),%eax
0xf88bce90 <cciss_softirq_done+84>: jmp 0xf88bceae <cciss_softirq_done+114>
0xf88bce92 <cciss_softirq_done+86>: mov 0x20(%eax),%edx
0xf88bce95 <cciss_softirq_done+89>: mov %esi,%ecx
0xf88bce97 <cciss_softirq_done+91>: mov 0x8(%eax),%ebx
0xf88bce9a <cciss_softirq_done+94>: movl $0x0,0x8(%eax)
0xf88bcea1 <cciss_softirq_done+101>: and $0xfffffe00,%edx
0xf88bcea7 <cciss_softirq_done+107>: call 0xc0475b21 <bio_endio>
crash> bt -f
...
#7 [c0732f9c] cciss_softirq_done at f88bcea7
[RA: c04dba32 SP: c0732fa0 FP: c0732fd4 SIZE: 56]
c0732fa0: 00000001 f6145b38 cb48d894 cb700000
c0732fb0: 00200286 00000001 0000000f f7cd1180
c0732fc0: 00000000 f6145b40 00000001 c06e4b20
c0732fd0: 0000000a c04dba32
...
Subtract 0x8 from the list pointer to get the request pointer:
crash> px 0xf6145b40-0x8
$2 = 0xf6145b38
crash> request 0xf6145b38
struct request {
queuelist = {
next = 0xf6145b38,
prev = 0xf6145b38
},
donelist = {
next = 0xf6145b40,
prev = 0xf6145b40
},
flags = 2373,
sector = 10003360,
nr_sectors = 8,
current_nr_sectors = 8,
hard_sector = 10003360,
hard_nr_sectors = 8,
hard_cur_sectors = 8,
bio = 0xf7fd4500,
biotail = 0xf7fd4500,
elevator_private = 0x0,
completion_data = 0xcb700000,
rq_status = -1,
errors = 0,
rq_disk = 0xf7cd1180,
start_time = 2891021760,
nr_phys_segments = 1,
nr_hw_segments = 1,
ioprio = 0,
tag = 0,
ref_count = 0,
q = 0xf7cb53e8,
rl = 0x0,
waiting = 0x0,
special = 0x0,
buffer = 0xf1cdf000 "\250\215\t",
cmd_len = 0,
cmd = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
data_len = 0,
sense_len = 0,
data = 0x0,
sense = 0x0,
timeout = 0,
retries = 0,
end_io = 0,
end_io_data = 0x0
}
The bio pointer matches so this is the correct request. Check if it's allocated too:
crash> kmem -s 0xf6145b38
CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE
f7fea280 blkdev_requests 172 22 230 10 4k
SLAB MEMORY TOTAL ALLOCATED FREE
f6145000 f6145078 23 0 23
FREE / [ALLOCATED]
f6145b38 (cpu 0 cache)
Nope, it's been freed too.
The command structure is in the completion_data field:
crash> CommandList_struct 0xcb700000
struct CommandList_struct {
...
busaddr = 191889408,
err_info = 0xcb6e0000,
ctlr = 0,
cmd_type = 0,
cmdindex = 0,
prev = 0xeb700000,
next = 0xcb700250,
rq = 0xdddd8328,
waiting = 0x0,
retry_count = 0,
scsi_cmd = 0x0,
pad = 0xcb700250 ""
}
The request pointer does not match the request above - this command has been reused. The command appears to have been allocated from the command pool with index 0.
Get the controller structure to check this:
crash> px hba
hba = $3 =
{0xcb488000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
crash> ctlr_info_t 0xcb488000
struct ctlr_info_t {
ctlr = 0,
devname = "cciss0\000",
product_name = 0xf88c10ab "Smart Array 6i",
firm_ver = "2.84",
...
commands_outstanding = 0,
...
reqQ = 0x0,
cmpQ = 0xeb700000,
Qdepth = 0,
maxQsinceinit = 159,
maxSG = 31,
lock = {
raw_lock = {
slock = 1
}
},
cmd_pool = 0xcb700000,
cmd_pool_dhandle = 191889408,
errinfo_pool = 0xcb6e0000,
errinfo_pool_dhandle = 191758336,
cmd_pool_bits = 0xf7cba3c0,
nr_allocs = -1193686898,
nr_frees = -1193686900,
...
commands_outstanding is 0 but there are still commands on the completion queue. (nr_allocs - nr_frees) is 2 which means there should be 2 I/Os in the queue yet the first command in the completion queue is an ioctl. Take a look at that command:
crash> CommandList_struct 0xeb700000
struct CommandList_struct {
...
busaddr = 0x2b700000,
err_info = 0xf50cc000,
ctlr = 0x0,
cmd_type = 0x1,
cmdindex = 0xffffffff,
prev = 0xeb700000,
next = 0xeb700000,
rq = 0x0,
waiting = 0xe6ce5e2c,
retry_count = 0x0,
scsi_cmd = 0x0,
pad = 0xeb700250 "\030\345\221", <incomplete sequence \357>
}
prev and next pointers point back to itself so that means there are no other commands on the completion queue. We've lost 2 I/Os?
Check the command pool which is only used by I/Os:
crash> rd 0xf7cba3c0
f7cba3c0: 00000003 ....
There are two commands allocated in the first two slots of the pool but neither are found in the completion queue or the request queue. This may explain why the request and command structures above don't match up - the command is in use by another I/O.
From what I can guess there are 3 valid commands in progress:
The command on the head of the completion queue is:
crash> CommandList_struct 0xeb700000
struct CommandList_struct {
...
busaddr = 728760320,
err_info = 0xf50cc000,
ctlr = 0,
cmd_type = 1,
cmdindex = -1,
prev = 0xeb700000,
next = 0xeb700000,
rq = 0x0,
waiting = 0xe6ce5e2c,
retry_count = 0,
scsi_cmd = 0x0,
pad = 0xeb700250 "\030\345\221", <incomplete sequence \357>
}
This is an ioctl from this process:
crash> bt 3810
PID: 3810 TASK: e6ccb000 CPU: 0 COMMAND: "cmaidad"
#0 [e6ce5d00] schedule at c060d94d
#1 [e6ce5d6c] wait_for_completion at c060dabb
#2 [e6ce5d94] cciss_ioctl at f88bddb6
#3 [e6ce5e50] blkdev_driver_ioctl at c04dca80
#4 [e6ce5e6c] blkdev_ioctl at c04dd0a5
#5 [e6ce5f64] block_ioctl at c04778c7
#6 [e6ce5f70] do_ioctl at c0480dfa
#7 [e6ce5f84] vfs_ioctl at c0481082
#8 [e6ce5fa0] sys_ioctl at c04810dc
#9 [e6ce5fb8] system_call at c0404f10
EAX: 00000036 EBX: 00000005 ECX: c054420b EDX: bf91b954
DS: 007b ESI: bf91b954 ES: 007b EDI: 00000005
SS: 007b ESP: bf91b914 EBP: bf91ba08
CS: 0073 EIP: 00ce8402 ERR: 00000036 EFLAGS: 00003283
The next two are I/Os:
crash> CommandList_struct 0xcb700000
struct CommandList_struct {
...
busaddr = 191889408,
err_info = 0xcb6e0000,
ctlr = 0,
cmd_type = 0,
cmdindex = 0,
prev = 0xeb700000,
next = 0xcb700250,
rq = 0xdddd8328,
waiting = 0x0,
retry_count = 0,
scsi_cmd = 0x0,
pad = 0xcb700250 ""
}
crash> CommandList_struct 0xcb700250
struct CommandList_struct {
...
busaddr = 191890000,
err_info = 0xcb6e0030,
ctlr = 0,
cmd_type = 0,
cmdindex = 1,
prev = 0xeb700000,
next = 0xeb700000,
rq = 0xf6d469e0,
waiting = 0x0,
retry_count = 0,
scsi_cmd = 0x0,
pad = 0xcb7004a0 ""
}
The list pointers on these 3 commands should form a complete circular double-linked list but they don't. The list pointers for the ioctl both point back to itself. For the first I/O the previous pointer points to the ioctl and the next pointer points to the other I/O (this seems to be the only command that makes sense). For the second I/O the previous and next pointers both point back to the ioctl (this would make sense if there were two commands in the list). I reckon that adding/removing an invalid command from the completion queue has messed up the list pointers.
I also think this has the exact same cause as the RHEL4 bug where the driver tries to complete an invalid command and it then inserts it into the block queue via block_complete_request() in complete_command().
Lachlan
This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.7 and Red Hat does not plan to fix this issue the currently developed update. Contact your manager or support representative in case you need to escalate this bug. Could you please ensure that the customer is running the latest cciss firmware? They are running the latest version: product_name = 0xf88c10ab "Smart Array 6i", firm_ver = "2.84", (In reply to comment #2) > From what I can guess there are 3 valid commands in progress: > > The command on the head of the completion queue is: > > crash> CommandList_struct 0xeb700000 > struct CommandList_struct { > ... > busaddr = 728760320, > err_info = 0xf50cc000, > ctlr = 0, > cmd_type = 1, > cmdindex = -1, > prev = 0xeb700000, > next = 0xeb700000, > rq = 0x0, > waiting = 0xe6ce5e2c, > retry_count = 0, > scsi_cmd = 0x0, > pad = 0xeb700250 "\030\345\221", <incomplete sequence \357> > } > > This is an ioctl from this process: > > crash> bt 3810 > PID: 3810 TASK: e6ccb000 CPU: 0 COMMAND: "cmaidad" > #0 [e6ce5d00] schedule at c060d94d > #1 [e6ce5d6c] wait_for_completion at c060dabb > #2 [e6ce5d94] cciss_ioctl at f88bddb6 > #3 [e6ce5e50] blkdev_driver_ioctl at c04dca80 > #4 [e6ce5e6c] blkdev_ioctl at c04dd0a5 > #5 [e6ce5f64] block_ioctl at c04778c7 > #6 [e6ce5f70] do_ioctl at c0480dfa > #7 [e6ce5f84] vfs_ioctl at c0481082 > #8 [e6ce5fa0] sys_ioctl at c04810dc > #9 [e6ce5fb8] system_call at c0404f10 > EAX: 00000036 EBX: 00000005 ECX: c054420b EDX: bf91b954 > DS: 007b ESI: bf91b954 ES: 007b EDI: 00000005 > SS: 007b ESP: bf91b914 EBP: bf91ba08 > CS: 0073 EIP: 00ce8402 ERR: 00000036 EFLAGS: 00003283 > > The next two are I/Os: > > crash> CommandList_struct 0xcb700000 > struct CommandList_struct { > ... > busaddr = 191889408, > err_info = 0xcb6e0000, > ctlr = 0, > cmd_type = 0, > cmdindex = 0, > prev = 0xeb700000, > next = 0xcb700250, > rq = 0xdddd8328, > waiting = 0x0, > retry_count = 0, > scsi_cmd = 0x0, > pad = 0xcb700250 "" > } > > crash> CommandList_struct 0xcb700250 > struct CommandList_struct { > ... > busaddr = 191890000, > err_info = 0xcb6e0030, > ctlr = 0, > cmd_type = 0, > cmdindex = 1, > prev = 0xeb700000, > next = 0xeb700000, > rq = 0xf6d469e0, > waiting = 0x0, > retry_count = 0, > scsi_cmd = 0x0, > pad = 0xcb7004a0 "" > } > > The list pointers on these 3 commands should form a complete circular > double-linked list but they don't. The list pointers for the ioctl both point > back to itself. For the first I/O the previous pointer points to the ioctl and > the next pointer points to the other I/O (this seems to be the only command > that makes sense). For the second I/O the previous and next pointers both > point back to the ioctl (this would make sense if there were two commands in > the list). I reckon that adding/removing an invalid command from the > completion queue has messed up the list pointers. You can get to this state via a set of valid operations on valid data structures: State 0: ioctl->cmd1->cmd2 list_del cmd1 State 1: ioctl->cmd2 list_del cmd2 State 2: ioctl > I also think this has the exact same cause as the RHEL4 bug where the driver > tries to complete an invalid command and it then inserts it into the block > queue via block_complete_request() in complete_command(). Lachlan, what do you mean by "invalid command" in the comments above? What is cmaidad? How often can this problem be reproduced? Does disabling management/monitoring services (such as smartd or that cmaidad) prevent the issue from occurring? > > You can get to this state via a set of valid operations on valid data > structures: > > State 0: ioctl->cmd1->cmd2 > list_del cmd1 > State 1: ioctl->cmd2 > list_del cmd2 > State 2: ioctl That's possibly what's happened here but there are a number of things that don't add up here like why the bio and request structures are being completed when they are already free or why the request points to a cciss command that refers to a different request. It's like the request is being completed a second time. > > > I also think this has the exact same cause as the RHEL4 bug where the driver > > tries to complete an invalid command and it then inserts it into the block > > queue via block_complete_request() in complete_command(). > > Lachlan, what do you mean by "invalid command" in the comments above? My theory is that the cciss driver received an interrupt in do_cciss_intr(), decoded it and indexed the command at slot 0 in the command pool. Ordinarily this would be correct but in this case I suspect there may have been an additional invalid interrupt or an interrupt with the wrong value causing the command at slot 0 to be completed twice. > > What is cmaidad? How often can this problem be reproduced? Does disabling > management/monitoring services (such as smartd or that cmaidad) prevent the > issue from occurring? cmaidad is part of HP's Insight Manager package and it monitors the health of the disks. I think there were ioctls in progress in the RHEL4 version of this problem too so this monitoring application might be triggering the issue. The problem occurs very rarely and so far has not been reproduced at will. Due to the infrequency of incidences it's not possible to confirm if disabling this service prevents the problem but it could be a workaround. I've asked HP for a test case that would issue the ioctls that their monitoring utilities use in rapid succession. Hopefully that will help trigger the problem, if that's the cause. |