Hide Forgot
Description of problem: This bug is the RHEL5 variant of BZ 594598. Customer's system crashed with the following panic: kernel BUG at fs/bio.c:221! invalid opcode: 0000 [#1] SMP last sysfs file: /block/ram0/dev Modules linked in: st sg autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 xfrm_nalgo crypto_api cpufreq_ondemand acpi_cpufreq dm_mirror dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi ac parport_pc lp parport floppy hpilo e752x_edac edac_mc ide_cd cdrom pcspkr serio_raw tg3(U) dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache ata_piix libata cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd CPU: 0 EIP: 0060:[<c0475828>] Tainted: G VLI EFLAGS: 00210246 (2.6.18-128.el5PAE #1) EIP is at bio_put+0x9/0x29 eax: 00000000 ebx: f7fd4500 ecx: 00000002 edx: f7fd4500 esi: 00000000 edi: c0474173 ebp: 00000000 esp: c0732f80 ds: 007b es: 007b ss: 0068 Process smtp (pid: 22025, ti=c0732000 task=f4677550 task.ti=d4620000) Stack: c04741a8 f7fd4500 c0475b71 00000000 00000000 c06e4b20 cb488000 f88bceac 00000001 f6145b38 cb48d894 cb700000 00200286 00000001 0000000f f7cd1180 00000000 f6145b40 00000001 c06e4b20 0000000a c04dba32 c0732fd8 c0732fd8 Call Trace: [<c04741a8>] end_bio_bh_io_sync+0x35/0x39 [<c0475b71>] bio_endio+0x50/0x55 [<f88bceac>] cciss_softirq_done+0x70/0x2a1 [cciss] [<c04dba32>] blk_done_softirq+0x4d/0x58 [<c042900f>] __do_softirq+0x87/0x114 [<c04073d7>] do_softirq+0x52/0x9c [<c044b060>] __do_IRQ+0x0/0xd6 [<c04074d6>] do_IRQ+0xb5/0xc3 [<c0405946>] common_interrupt+0x1a/0x20 ======================= Code: 04 00 00 00 00 8b 44 24 04 5a 59 5b 5e 5f 5d c3 53 89 c3 8b 00 e8 3f 7b ff ff 89 d8 5b e9 37 7b ff ff 89 c2 8b 40 38 85 c0 75 08 <0f> 0b dd 00 57 40 63 c0 f0 ff 4a 38 0f 94 c0 84 c0 74 0c c7 42 EIP: [<c0475828>] bio_put+0x9/0x29 SS:ESP 0068:c0732f80 Version-Release number of selected component (if applicable): kernel-2.6.18-128.el5PAE How reproducible: Infrequent, no known steps to reproduce.
vmcore is available from core-i386.gsslab.rdu.redhat.com:/cores/20110216224912 KERNEL: usr/lib/debug/lib/modules/2.6.18-128.el5PAE/vmlinux DUMPFILE: ./00419896.vmcore CPUS: 2 DATE: Tue Feb 8 02:26:04 2011 UPTIME: 83 days, 04:11:29 LOAD AVERAGE: 1.74, 1.40, 1.57 TASKS: 553 NODENAME: hostname RELEASE: 2.6.18-128.el5PAE VERSION: #1 SMP Wed Dec 17 12:02:33 EST 2008 MACHINE: i686 (3600 Mhz) MEMORY: 4.4 GB PANIC: "kernel BUG at fs/bio.c:221!" crash> bt PID: 22025 TASK: f4677550 CPU: 0 COMMAND: "smtp" #0 [c0732e24] crash_kexec at c0440b3a #1 [c0732e68] die at c04064bc #2 [c0732e98] do_invalid_op at c0406c11 #3 [c0732f48] error_code (via invalid_op) at c0405a87 EAX: 00000000 EBX: f7fd4500 ECX: 00000002 EDX: f7fd4500 EBP: 00000000 DS: 007b ESI: 00000000 ES: 007b EDI: c0474173 CS: 0060 EIP: c0475828 ERR: ffffffff EFLAGS: 00210246 #4 [c0732f7c] bio_put at c0475828 #5 [c0732f80] end_bio_bh_io_sync at c04741a3 #6 [c0732f88] bio_endio at c0475b6f #7 [c0732f9c] cciss_softirq_done at f88bcea7 #8 [c0732fd4] blk_done_softirq at c04dba2f #9 [c0732fe4] __do_softirq at c042900d --- <soft IRQ> --- #0 [d4620f94] do_softirq at c04073d7 #1 [d4620fa0] do_IRQ at c04074d1 #2 [d4620fb8] common_interrupt at c0405941 EAX: 00000000 EBX: 09c772e0 ECX: 00000000 EDX: 00000093 DS: 007b ESI: 09c772e0 ES: 007b EDI: 00000093 SS: 007b ESP: bfafd700 EBP: bfafd738 CS: 0073 EIP: 08078944 ERR: ffffff36 EFLAGS: 00200206 crash> dis -r c04741a3 0xc0474173 <end_bio_bh_io_sync>: push %ebx 0xc0474174 <end_bio_bh_io_sync+1>: mov %eax,%ebx 0xc0474176 <end_bio_bh_io_sync+3>: cmpl $0x0,0x20(%ebx) 0xc047417a <end_bio_bh_io_sync+7>: mov %ecx,%edx 0xc047417c <end_bio_bh_io_sync+9>: mov 0x3c(%eax),%ecx 0xc047417f <end_bio_bh_io_sync+12>: mov $0x1,%eax 0xc0474184 <end_bio_bh_io_sync+17>: jne 0xc04741aa <end_bio_bh_io_sync+55> 0xc0474186 <end_bio_bh_io_sync+19>: cmp $0xffffffa1,%edx 0xc0474189 <end_bio_bh_io_sync+22>: jne 0xc0474196 <end_bio_bh_io_sync+35> 0xc047418b <end_bio_bh_io_sync+24>: lock btsl $0x7,0x10(%ebx) 0xc0474191 <end_bio_bh_io_sync+30>: lock btsl $0xd,(%ecx) 0xc0474196 <end_bio_bh_io_sync+35>: mov 0x10(%ebx),%edx 0xc0474199 <end_bio_bh_io_sync+38>: mov %ecx,%eax 0xc047419b <end_bio_bh_io_sync+40>: and $0x1,%edx 0xc047419e <end_bio_bh_io_sync+43>: call *0x20(%ecx) 0xc04741a1 <end_bio_bh_io_sync+46>: mov %ebx,%eax 0xc04741a3 <end_bio_bh_io_sync+48>: call 0xc047581f <bio_put> crash> dis -r c0475828 0xc047581f <bio_put>: mov %eax,%edx 0xc0475821 <bio_put+2>: mov 0x38(%eax),%eax 0xc0475824 <bio_put+5>: test %eax,%eax 0xc0475826 <bio_put+7>: jne 0xc0475830 <bio_put+17> 0xc0475828 <bio_put+9>: ud2a bio is still in %ebx crash> bio f7fd4500 struct bio { bi_sector = 10003368, bi_next = 0x0, bi_bdev = 0xf7d09440, bi_flags = 9, bi_rw = 1, bi_vcnt = 1, bi_idx = 0, bi_phys_segments = 1, bi_hw_segments = 1, bi_size = 0, bi_hw_front_size = 4096, bi_hw_back_size = 4096, bi_max_vecs = 1, bi_io_vec = 0xd8dd5ac0, bi_end_io = 0xc0474173 <end_bio_bh_io_sync>, bi_cnt = { counter = 0 }, bi_private = 0xf12691a4, bi_destructor = 0xc0475b16 <bio_fs_destructor> } crash> kmem -s f7fd4500 CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE f7fea980 bio 128 258 570 19 4k SLAB MEMORY TOTAL ALLOCATED FREE f7fd4000 f7fd4100 30 16 14 FREE / [ALLOCATED] f7fd4500 (cpu 0 cache) Yep, it's been freed. Try to find the command structure: Will need to pull the request structure out of cciss_softirq_done and then get the command from it. The request is the first argument (%eax) to cciss_softirq_done() and that comes from %ebx. crash> dis -r c04dba2f 0xc04db9e5 <blk_done_softirq>: push %ebx 0xc04db9e6 <blk_done_softirq+1>: sub $0x8,%esp 0xc04db9e9 <blk_done_softirq+4>: cli 0xc04db9ea <blk_done_softirq+5>: mov %esp,%edx 0xc04db9ec <blk_done_softirq+7>: mov $0xc072b648,%eax 0xc04db9f1 <blk_done_softirq+12>: and $0xfffff000,%edx 0xc04db9f7 <blk_done_softirq+18>: mov 0x10(%edx),%edx 0xc04db9fa <blk_done_softirq+21>: add -0x3f913400(,%edx,4),%eax 0xc04dba01 <blk_done_softirq+28>: mov (%eax),%edx 0xc04dba03 <blk_done_softirq+30>: mov %edx,(%esp) 0xc04dba06 <blk_done_softirq+33>: mov %esp,0x4(%edx) 0xc04dba09 <blk_done_softirq+36>: mov 0x4(%eax),%edx 0xc04dba0c <blk_done_softirq+39>: mov %edx,0x4(%esp) 0xc04dba10 <blk_done_softirq+43>: mov %esp,(%edx) 0xc04dba12 <blk_done_softirq+45>: mov %eax,0x4(%eax) 0xc04dba15 <blk_done_softirq+48>: mov %eax,(%eax) 0xc04dba17 <blk_done_softirq+50>: sti 0xc04dba18 <blk_done_softirq+51>: jmp 0xc04dba32 <blk_done_softirq+77> 0xc04dba1a <blk_done_softirq+53>: mov 0x4(%ebx),%edx 0xc04dba1d <blk_done_softirq+56>: lea -0x8(%ebx),%eax 0xc04dba20 <blk_done_softirq+59>: mov (%ebx),%ecx 0xc04dba22 <blk_done_softirq+61>: mov %edx,0x4(%ecx) 0xc04dba25 <blk_done_softirq+64>: mov %ecx,(%edx) 0xc04dba27 <blk_done_softirq+66>: mov %ebx,0x4(%ebx) 0xc04dba2a <blk_done_softirq+69>: mov %ebx,(%ebx) 0xc04dba2c <blk_done_softirq+71>: mov 0x64(%eax),%edx 0xc04dba2f <blk_done_softirq+74>: call *0x6c(%edx) %ebx is saved onto the stack in the 4th push. crash> dis -r f88bcea7 0xf88bce3c <cciss_softirq_done>: push %ebp 0xf88bce3d <cciss_softirq_done+1>: push %edi 0xf88bce3e <cciss_softirq_done+2>: push %esi 0xf88bce3f <cciss_softirq_done+3>: push %ebx 0xf88bce40 <cciss_softirq_done+4>: sub $0x24,%esp 0xf88bce43 <cciss_softirq_done+7>: mov %eax,0x4(%esp) 0xf88bce47 <cciss_softirq_done+11>: mov 0x40(%eax),%eax 0xf88bce4a <cciss_softirq_done+14>: mov %eax,%edx 0xf88bce4c <cciss_softirq_done+16>: mov %eax,0xc(%esp) 0xf88bce50 <cciss_softirq_done+20>: mov 0x22c(%eax),%eax 0xf88bce56 <cciss_softirq_done+26>: movzbl 0x1(%edx),%ecx 0xf88bce5a <cciss_softirq_done+30>: mov -0x7737780(,%eax,4),%ebp 0xf88bce61 <cciss_softirq_done+37>: mov 0x15(%edx),%al 0xf88bce64 <cciss_softirq_done+40>: and $0xffffffc0,%eax 0xf88bce67 <cciss_softirq_done+43>: cmp $0x80,%al 0xf88bce69 <cciss_softirq_done+45>: sete %al 0xf88bce6c <cciss_softirq_done+48>: xor %edx,%edx 0xf88bce6e <cciss_softirq_done+50>: movzbl %al,%eax 0xf88bce71 <cciss_softirq_done+53>: inc %eax 0xf88bce72 <cciss_softirq_done+54>: jmp 0xf88bce82 <cciss_softirq_done+70> 0xf88bce74 <cciss_softirq_done+56>: cmp $0x3,%eax 0xf88bce77 <cciss_softirq_done+59>: jne 0xf88bce81 <cciss_softirq_done+69> 0xf88bce79 <cciss_softirq_done+61>: ud2a 0xf88bce81 <cciss_softirq_done+69>: inc %edx 0xf88bce82 <cciss_softirq_done+70>: cmp %ecx,%edx 0xf88bce84 <cciss_softirq_done+72>: jl 0xf88bce74 <cciss_softirq_done+56> 0xf88bce86 <cciss_softirq_done+74>: mov 0x4(%esp),%ecx 0xf88bce8a <cciss_softirq_done+78>: mov 0x48(%ecx),%esi 0xf88bce8d <cciss_softirq_done+81>: mov 0x34(%ecx),%eax 0xf88bce90 <cciss_softirq_done+84>: jmp 0xf88bceae <cciss_softirq_done+114> 0xf88bce92 <cciss_softirq_done+86>: mov 0x20(%eax),%edx 0xf88bce95 <cciss_softirq_done+89>: mov %esi,%ecx 0xf88bce97 <cciss_softirq_done+91>: mov 0x8(%eax),%ebx 0xf88bce9a <cciss_softirq_done+94>: movl $0x0,0x8(%eax) 0xf88bcea1 <cciss_softirq_done+101>: and $0xfffffe00,%edx 0xf88bcea7 <cciss_softirq_done+107>: call 0xc0475b21 <bio_endio> crash> bt -f ... #7 [c0732f9c] cciss_softirq_done at f88bcea7 [RA: c04dba32 SP: c0732fa0 FP: c0732fd4 SIZE: 56] c0732fa0: 00000001 f6145b38 cb48d894 cb700000 c0732fb0: 00200286 00000001 0000000f f7cd1180 c0732fc0: 00000000 f6145b40 00000001 c06e4b20 c0732fd0: 0000000a c04dba32 ... Subtract 0x8 from the list pointer to get the request pointer: crash> px 0xf6145b40-0x8 $2 = 0xf6145b38 crash> request 0xf6145b38 struct request { queuelist = { next = 0xf6145b38, prev = 0xf6145b38 }, donelist = { next = 0xf6145b40, prev = 0xf6145b40 }, flags = 2373, sector = 10003360, nr_sectors = 8, current_nr_sectors = 8, hard_sector = 10003360, hard_nr_sectors = 8, hard_cur_sectors = 8, bio = 0xf7fd4500, biotail = 0xf7fd4500, elevator_private = 0x0, completion_data = 0xcb700000, rq_status = -1, errors = 0, rq_disk = 0xf7cd1180, start_time = 2891021760, nr_phys_segments = 1, nr_hw_segments = 1, ioprio = 0, tag = 0, ref_count = 0, q = 0xf7cb53e8, rl = 0x0, waiting = 0x0, special = 0x0, buffer = 0xf1cdf000 "\250\215\t", cmd_len = 0, cmd = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", data_len = 0, sense_len = 0, data = 0x0, sense = 0x0, timeout = 0, retries = 0, end_io = 0, end_io_data = 0x0 } The bio pointer matches so this is the correct request. Check if it's allocated too: crash> kmem -s 0xf6145b38 CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE f7fea280 blkdev_requests 172 22 230 10 4k SLAB MEMORY TOTAL ALLOCATED FREE f6145000 f6145078 23 0 23 FREE / [ALLOCATED] f6145b38 (cpu 0 cache) Nope, it's been freed too. The command structure is in the completion_data field: crash> CommandList_struct 0xcb700000 struct CommandList_struct { ... busaddr = 191889408, err_info = 0xcb6e0000, ctlr = 0, cmd_type = 0, cmdindex = 0, prev = 0xeb700000, next = 0xcb700250, rq = 0xdddd8328, waiting = 0x0, retry_count = 0, scsi_cmd = 0x0, pad = 0xcb700250 "" } The request pointer does not match the request above - this command has been reused. The command appears to have been allocated from the command pool with index 0. Get the controller structure to check this: crash> px hba hba = $3 = {0xcb488000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0} crash> ctlr_info_t 0xcb488000 struct ctlr_info_t { ctlr = 0, devname = "cciss0\000", product_name = 0xf88c10ab "Smart Array 6i", firm_ver = "2.84", ... commands_outstanding = 0, ... reqQ = 0x0, cmpQ = 0xeb700000, Qdepth = 0, maxQsinceinit = 159, maxSG = 31, lock = { raw_lock = { slock = 1 } }, cmd_pool = 0xcb700000, cmd_pool_dhandle = 191889408, errinfo_pool = 0xcb6e0000, errinfo_pool_dhandle = 191758336, cmd_pool_bits = 0xf7cba3c0, nr_allocs = -1193686898, nr_frees = -1193686900, ... commands_outstanding is 0 but there are still commands on the completion queue. (nr_allocs - nr_frees) is 2 which means there should be 2 I/Os in the queue yet the first command in the completion queue is an ioctl. Take a look at that command: crash> CommandList_struct 0xeb700000 struct CommandList_struct { ... busaddr = 0x2b700000, err_info = 0xf50cc000, ctlr = 0x0, cmd_type = 0x1, cmdindex = 0xffffffff, prev = 0xeb700000, next = 0xeb700000, rq = 0x0, waiting = 0xe6ce5e2c, retry_count = 0x0, scsi_cmd = 0x0, pad = 0xeb700250 "\030\345\221", <incomplete sequence \357> } prev and next pointers point back to itself so that means there are no other commands on the completion queue. We've lost 2 I/Os? Check the command pool which is only used by I/Os: crash> rd 0xf7cba3c0 f7cba3c0: 00000003 .... There are two commands allocated in the first two slots of the pool but neither are found in the completion queue or the request queue. This may explain why the request and command structures above don't match up - the command is in use by another I/O. From what I can guess there are 3 valid commands in progress: The command on the head of the completion queue is: crash> CommandList_struct 0xeb700000 struct CommandList_struct { ... busaddr = 728760320, err_info = 0xf50cc000, ctlr = 0, cmd_type = 1, cmdindex = -1, prev = 0xeb700000, next = 0xeb700000, rq = 0x0, waiting = 0xe6ce5e2c, retry_count = 0, scsi_cmd = 0x0, pad = 0xeb700250 "\030\345\221", <incomplete sequence \357> } This is an ioctl from this process: crash> bt 3810 PID: 3810 TASK: e6ccb000 CPU: 0 COMMAND: "cmaidad" #0 [e6ce5d00] schedule at c060d94d #1 [e6ce5d6c] wait_for_completion at c060dabb #2 [e6ce5d94] cciss_ioctl at f88bddb6 #3 [e6ce5e50] blkdev_driver_ioctl at c04dca80 #4 [e6ce5e6c] blkdev_ioctl at c04dd0a5 #5 [e6ce5f64] block_ioctl at c04778c7 #6 [e6ce5f70] do_ioctl at c0480dfa #7 [e6ce5f84] vfs_ioctl at c0481082 #8 [e6ce5fa0] sys_ioctl at c04810dc #9 [e6ce5fb8] system_call at c0404f10 EAX: 00000036 EBX: 00000005 ECX: c054420b EDX: bf91b954 DS: 007b ESI: bf91b954 ES: 007b EDI: 00000005 SS: 007b ESP: bf91b914 EBP: bf91ba08 CS: 0073 EIP: 00ce8402 ERR: 00000036 EFLAGS: 00003283 The next two are I/Os: crash> CommandList_struct 0xcb700000 struct CommandList_struct { ... busaddr = 191889408, err_info = 0xcb6e0000, ctlr = 0, cmd_type = 0, cmdindex = 0, prev = 0xeb700000, next = 0xcb700250, rq = 0xdddd8328, waiting = 0x0, retry_count = 0, scsi_cmd = 0x0, pad = 0xcb700250 "" } crash> CommandList_struct 0xcb700250 struct CommandList_struct { ... busaddr = 191890000, err_info = 0xcb6e0030, ctlr = 0, cmd_type = 0, cmdindex = 1, prev = 0xeb700000, next = 0xeb700000, rq = 0xf6d469e0, waiting = 0x0, retry_count = 0, scsi_cmd = 0x0, pad = 0xcb7004a0 "" } The list pointers on these 3 commands should form a complete circular double-linked list but they don't. The list pointers for the ioctl both point back to itself. For the first I/O the previous pointer points to the ioctl and the next pointer points to the other I/O (this seems to be the only command that makes sense). For the second I/O the previous and next pointers both point back to the ioctl (this would make sense if there were two commands in the list). I reckon that adding/removing an invalid command from the completion queue has messed up the list pointers. I also think this has the exact same cause as the RHEL4 bug where the driver tries to complete an invalid command and it then inserts it into the block queue via block_complete_request() in complete_command(). Lachlan
This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.7 and Red Hat does not plan to fix this issue the currently developed update. Contact your manager or support representative in case you need to escalate this bug.
Could you please ensure that the customer is running the latest cciss firmware?
They are running the latest version: product_name = 0xf88c10ab "Smart Array 6i", firm_ver = "2.84",
(In reply to comment #2) > From what I can guess there are 3 valid commands in progress: > > The command on the head of the completion queue is: > > crash> CommandList_struct 0xeb700000 > struct CommandList_struct { > ... > busaddr = 728760320, > err_info = 0xf50cc000, > ctlr = 0, > cmd_type = 1, > cmdindex = -1, > prev = 0xeb700000, > next = 0xeb700000, > rq = 0x0, > waiting = 0xe6ce5e2c, > retry_count = 0, > scsi_cmd = 0x0, > pad = 0xeb700250 "\030\345\221", <incomplete sequence \357> > } > > This is an ioctl from this process: > > crash> bt 3810 > PID: 3810 TASK: e6ccb000 CPU: 0 COMMAND: "cmaidad" > #0 [e6ce5d00] schedule at c060d94d > #1 [e6ce5d6c] wait_for_completion at c060dabb > #2 [e6ce5d94] cciss_ioctl at f88bddb6 > #3 [e6ce5e50] blkdev_driver_ioctl at c04dca80 > #4 [e6ce5e6c] blkdev_ioctl at c04dd0a5 > #5 [e6ce5f64] block_ioctl at c04778c7 > #6 [e6ce5f70] do_ioctl at c0480dfa > #7 [e6ce5f84] vfs_ioctl at c0481082 > #8 [e6ce5fa0] sys_ioctl at c04810dc > #9 [e6ce5fb8] system_call at c0404f10 > EAX: 00000036 EBX: 00000005 ECX: c054420b EDX: bf91b954 > DS: 007b ESI: bf91b954 ES: 007b EDI: 00000005 > SS: 007b ESP: bf91b914 EBP: bf91ba08 > CS: 0073 EIP: 00ce8402 ERR: 00000036 EFLAGS: 00003283 > > The next two are I/Os: > > crash> CommandList_struct 0xcb700000 > struct CommandList_struct { > ... > busaddr = 191889408, > err_info = 0xcb6e0000, > ctlr = 0, > cmd_type = 0, > cmdindex = 0, > prev = 0xeb700000, > next = 0xcb700250, > rq = 0xdddd8328, > waiting = 0x0, > retry_count = 0, > scsi_cmd = 0x0, > pad = 0xcb700250 "" > } > > crash> CommandList_struct 0xcb700250 > struct CommandList_struct { > ... > busaddr = 191890000, > err_info = 0xcb6e0030, > ctlr = 0, > cmd_type = 0, > cmdindex = 1, > prev = 0xeb700000, > next = 0xeb700000, > rq = 0xf6d469e0, > waiting = 0x0, > retry_count = 0, > scsi_cmd = 0x0, > pad = 0xcb7004a0 "" > } > > The list pointers on these 3 commands should form a complete circular > double-linked list but they don't. The list pointers for the ioctl both point > back to itself. For the first I/O the previous pointer points to the ioctl and > the next pointer points to the other I/O (this seems to be the only command > that makes sense). For the second I/O the previous and next pointers both > point back to the ioctl (this would make sense if there were two commands in > the list). I reckon that adding/removing an invalid command from the > completion queue has messed up the list pointers. You can get to this state via a set of valid operations on valid data structures: State 0: ioctl->cmd1->cmd2 list_del cmd1 State 1: ioctl->cmd2 list_del cmd2 State 2: ioctl > I also think this has the exact same cause as the RHEL4 bug where the driver > tries to complete an invalid command and it then inserts it into the block > queue via block_complete_request() in complete_command(). Lachlan, what do you mean by "invalid command" in the comments above? What is cmaidad? How often can this problem be reproduced? Does disabling management/monitoring services (such as smartd or that cmaidad) prevent the issue from occurring?
> > You can get to this state via a set of valid operations on valid data > structures: > > State 0: ioctl->cmd1->cmd2 > list_del cmd1 > State 1: ioctl->cmd2 > list_del cmd2 > State 2: ioctl That's possibly what's happened here but there are a number of things that don't add up here like why the bio and request structures are being completed when they are already free or why the request points to a cciss command that refers to a different request. It's like the request is being completed a second time. > > > I also think this has the exact same cause as the RHEL4 bug where the driver > > tries to complete an invalid command and it then inserts it into the block > > queue via block_complete_request() in complete_command(). > > Lachlan, what do you mean by "invalid command" in the comments above? My theory is that the cciss driver received an interrupt in do_cciss_intr(), decoded it and indexed the command at slot 0 in the command pool. Ordinarily this would be correct but in this case I suspect there may have been an additional invalid interrupt or an interrupt with the wrong value causing the command at slot 0 to be completed twice. > > What is cmaidad? How often can this problem be reproduced? Does disabling > management/monitoring services (such as smartd or that cmaidad) prevent the > issue from occurring? cmaidad is part of HP's Insight Manager package and it monitors the health of the disks. I think there were ioctls in progress in the RHEL4 version of this problem too so this monitoring application might be triggering the issue. The problem occurs very rarely and so far has not been reproduced at will. Due to the infrequency of incidences it's not possible to confirm if disabling this service prevents the problem but it could be a workaround.
I've asked HP for a test case that would issue the ioctls that their monitoring utilities use in rapid succession. Hopefully that will help trigger the problem, if that's the cause.