Description of problem: Version-Release number of selected component (if applicable): kernel-2.6.18-53.1.6.el5 How reproducible: We are frequently seeing a kernel panic whenever our software RAID driver sends event notification to management applications using the kill_fasync() interface. The kernel trace is shown below, while the complete kernel log is attached with the defect report. NMI Watchdog detected LOCKUP on CPU 5 CPU 5 Modules linked in: autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) ipt_REJECT(U) ip6t_REJECT(U) xt_tcpudp(U) x_tables(U) ipv6(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) lp(U) shpchp(U) 3c59x(U) ide_cd(U) cdrom(U) mii(U) floppy(U) i2c_i801(U) parport_pc(U) parport(U) serio_raw(U) i2c_core(U) pcspkr(U) sg(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) ata_piix(U) libata(U) megasr(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U) Pid: 4609, comm: dd Tainted: P 2.6.18-prep #1 RIP: 0010:[<ffffffff8006221d>] [<ffffffff8006221d>] __read_lock_failed+0x5/0x14 RSP: 0018:ffff81003fe6bb10 EFLAGS: 00000097 RAX: 0000000000000005 RBX: ffff81002c24fa00 RCX: 0000000000000002 RDX: 0000000000000000 RSI: ffffffff802a59ba RDI: ffff81002c24fa00 RBP: ffff81002c24fa00 R08: 0000000000000002 R09: 0000000000000001 R10: ffff81003e5b45f0 R11: ffff81002c24fa18 R12: 0000000000020001 R13: 0000000000000005 R14: ffff81003e4f9de9 R15: 0000000000000067 FS: 0000000000000000(0000) GS:ffff81003ff1e2c0(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000003743614cdb CR3: 0000000038e70000 CR4: 00000000000006e0 Process dd (pid: 4609, threadinfo ffff81003cf82000, task ffff810039072100) Stack: ffffffff8000c945 ffff8100223973d8 ffffffff800e65d6 ffffffff803007b8 ffff8100223973d8 0000000000020001 000000000000001d ffff81003e5b4570 ffffffff8004dcd2 0000000000020001 ffffffff8812f290 000000000000001d Call Trace: <IRQ> [<ffffffff8000c945>] _raw_read_lock+0x27/0x29 [<ffffffff800e65d6>] send_sigio+0x18/0xb6 [<ffffffff8004dcd2>] __kill_fasync+0x46/0x54 [<ffffffff800445bf>] kill_fasync+0x2c/0x41 [<ffffffff880b6b3f>] :megasr:osl_post_aen+0x2a/0x2e [<ffffffff880ca97f>] :megasr:rcl_evtmgr_log_event+0x1564/0x16c7 [<ffffffff8004641b>] try_to_wake_up+0x407/0x418 [<ffffffff880c831e>] :megasr:rx_update_auto_resume_data+0x467/0x481 [<ffffffff880e0e83>] :megasr:r5_update_st_prog+0x1b/0x50 [<ffffffff880e1fa9>] :megasr:r5_do_next+0x1c8/0x41a [<ffffffff880ec6bd>] :megasr:hal_ahci_complete_commands+0x293/0x2bb [<ffffffff880f1461>] :megasr:hal_ahci_isr+0xc27/0xca5 [<ffffffff880b6d9b>] :megasr:lsraid_isr+0x20/0x49 [<ffffffff80010ca5>] handle_IRQ_event+0x22/0x56 [<ffffffff800bdea9>] __do_IRQ+0xa5/0x107 [<ffffffff80012204>] __do_softirq+0x95/0xf3 [<ffffffff8006d8fc>] do_IRQ+0xf6/0x104 [<ffffffff8005d652>] ret_from_intr+0x0/0xf <EOI> [<ffffffff8001600c>] __bitmap_empty+0x58/0x64 [<ffffffff800769fd>] flush_tlb_mm+0xa8/0xd5 [<ffffffff800211f3>] sys_mprotect+0x6b9/0x858 [<ffffffff8005d2a6>] tracesys+0xd5/0xdf Code: 83 3f 01 78 f9 f0 ff 0f 0f 88 ed ff ff ff c3 89 d1 f3 a5 c3 Kernel panic - not syncing: nmi watchdog <0>BUG: spinlock lockup on CPU#3, kjournald/615, ffffffff802efea0 (Tainted: P ) Call Trace: <NMI> [<ffffffff800077ac>] _raw_spin_lock+0xcd/0xeb [<ffffffff80066411>] oops_begin+0x3a/0x70 [<ffffffff80066654>] die_nmi+0xf/0xbc [<ffffffff80066d61>] nmi_watchdog_tick+0x107/0x1fb [<ffffffff800668fa>] default_do_nmi+0x86/0x214 [<ffffffff80066eb6>] do_nmi+0x61/0x94 [<ffffffff800661f3>] nmi+0x7f/0x80 [<ffffffff8001532c>] sync_buffer+0x0/0x42 [<ffffffff8000774c>] _raw_spin_lock+0x6d/0xeb <<EOE>> [<ffffffff8807cd90>] :scsi_mod:scsi_request_fn+0x197/0x395 [<ffffffff8005a789>] generic_unplug_device+0x18/0x25 [<ffffffff8816abd2>] :dm_mod:dm_table_unplug_all+0x33/0x41 [<ffffffff88168d57>] :dm_mod:dm_unplug_all+0x1d/0x28 [<ffffffff8005c2bd>] blk_backing_dev_unplug+0x54/0x5a [<ffffffff80015365>] sync_buffer+0x39/0x42 [<ffffffff80063aa1>] __wait_on_bit+0x40/0x6e [<ffffffff8001532c>] sync_buffer+0x0/0x42 [<ffffffff80063b3b>] out_of_line_wait_on_bit+0x6c/0x78 [<ffffffff8009f1df>] wake_bit_function+0x0/0x23 [<ffffffff88035a9f>] :jbd:journal_commit_transaction+0x5eb/0x1160 [<ffffffff88039d85>] :jbd:kjournald+0xc2/0x21d [<ffffffff8009f1b1>] autoremove_wake_function+0x0/0x2e [<ffffffff88039cc3>] :jbd:kjournald+0x0/0x21d [<ffffffff8009effd>] keventd_create_kthread+0x0/0x66 [<ffffffff80033491>] kthread+0xfe/0x132 [<ffffffff8005e079>] child_rip+0xa/0x11 [<ffffffff8005d6a8>] restore_args+0x0/0x30 [<ffffffff80033393>] kthread+0x0/0x132 [<ffffffff8005e06f>] child_rip+0x0/0x11
Created attachment 294370 [details] Kernel Log
Any update on this?
(In reply to comment #0) > Pid: 4609, comm: dd Tainted: P 2.6.18-prep #1 To recap earlier mail discussion: 1) Chip indicated that it is ok to call kill_fasync() from ISR. 2) The trace indicates that there is a deadlock involving your code. He suggested turning on lockdep, by building against the RHEL-5 kernel-debug package. The additional debug information might help you to identify the problem in your closed source driver code. This is about as far as we are prepared to go in helping you to solve this problem in your driver. If you can reproduce this problem on a kernel that we ship, or even with a driver whose source code is available to us, we maybe able to justify spending some time analyzing this. Otherwise we will consult with you on this as time permits, but we will not have the time to do any detailed analysis. I will leave this BZ open for a while, in case you have specific follow-up questions resulting from your analysis of the trace.
Hi, We already run the test with the RHEL-5 kernel debug package and the log is already attached here with. Could you please have a look at the log?
(In reply to comment #6) > Could you please have a look at the log? No. We expect you to debug your closed source driver.
Atul: Please run the megasr driver with the CL_IOCTL log flag, and try to generate a kdump when the deadlock has occured. In the stack it appears that the kernel is trying to do 472 void send_sigio(struct fown_struct *fown, int fd, int band) 473 { 474 struct task_struct *p; 475 int pid; 476 477 read_lock(&fown->lock); <====== deadlock here 478 pid = fown->pid; This is strange because that lock is hardly ever taken in write mode. We need a dump to see what's really going on. Please use a stock Red Hat kernel, so that we can analyze more easily. Martin