Bug 432052

Summary: [LSI] CPU lockup with kill_fasync interface
Product: Red Hat Enterprise Linux 5 Reporter: Atul Mukker <atul.mukker>
Component: kernelAssignee: Tom Coughlan <coughlan>
Status: CLOSED CANTFIX QA Contact: Martin Jenner <mjenner>
Severity: low Docs Contact:
Priority: low    
Version: 5.1CC: andriusb, bibhu.das, coldwell
Target Milestone: rcKeywords: OtherQA
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-04-03 17:29:19 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 217105    
Attachments:
Description Flags
Kernel Log none

Description Atul Mukker 2008-02-08 16:11:23 UTC
Description of problem:


Version-Release number of selected component (if applicable):
kernel-2.6.18-53.1.6.el5

How reproducible:


We are frequently seeing a kernel panic whenever our software RAID driver sends
event notification to management applications using the kill_fasync() interface.
The kernel trace is shown below, while the complete kernel log is attached with
the defect report.

NMI Watchdog detected LOCKUP on CPU 5
CPU 5 
Modules linked in: autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U)
ipt_REJECT(U) ip6t_REJECT(U) xt_tcpudp(U) x_tables(U) ipv6(U) dm_multipath(U)
video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U)
acpi_memhotplug(U) ac(U) lp(U) shpchp(U) 3c59x(U) ide_cd(U) cdrom(U) mii(U)
floppy(U) i2c_i801(U) parport_pc(U) parport(U) serio_raw(U) i2c_core(U)
pcspkr(U) sg(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) ata_piix(U)
libata(U) megasr(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U)
uhci_hcd(U)
Pid: 4609, comm: dd Tainted: P      2.6.18-prep #1
RIP: 0010:[<ffffffff8006221d>]  [<ffffffff8006221d>] __read_lock_failed+0x5/0x14
RSP: 0018:ffff81003fe6bb10  EFLAGS: 00000097
RAX: 0000000000000005 RBX: ffff81002c24fa00 RCX: 0000000000000002
RDX: 0000000000000000 RSI: ffffffff802a59ba RDI: ffff81002c24fa00
RBP: ffff81002c24fa00 R08: 0000000000000002 R09: 0000000000000001
R10: ffff81003e5b45f0 R11: ffff81002c24fa18 R12: 0000000000020001
R13: 0000000000000005 R14: ffff81003e4f9de9 R15: 0000000000000067
FS:  0000000000000000(0000) GS:ffff81003ff1e2c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003743614cdb CR3: 0000000038e70000 CR4: 00000000000006e0
Process dd (pid: 4609, threadinfo ffff81003cf82000, task ffff810039072100)
Stack:  ffffffff8000c945 ffff8100223973d8 ffffffff800e65d6 ffffffff803007b8
 ffff8100223973d8 0000000000020001 000000000000001d ffff81003e5b4570
 ffffffff8004dcd2 0000000000020001 ffffffff8812f290 000000000000001d
Call Trace:
 <IRQ>  [<ffffffff8000c945>] _raw_read_lock+0x27/0x29
 [<ffffffff800e65d6>] send_sigio+0x18/0xb6
 [<ffffffff8004dcd2>] __kill_fasync+0x46/0x54
 [<ffffffff800445bf>] kill_fasync+0x2c/0x41
 [<ffffffff880b6b3f>] :megasr:osl_post_aen+0x2a/0x2e
 [<ffffffff880ca97f>] :megasr:rcl_evtmgr_log_event+0x1564/0x16c7
 [<ffffffff8004641b>] try_to_wake_up+0x407/0x418
 [<ffffffff880c831e>] :megasr:rx_update_auto_resume_data+0x467/0x481
 [<ffffffff880e0e83>] :megasr:r5_update_st_prog+0x1b/0x50
 [<ffffffff880e1fa9>] :megasr:r5_do_next+0x1c8/0x41a
 [<ffffffff880ec6bd>] :megasr:hal_ahci_complete_commands+0x293/0x2bb
 [<ffffffff880f1461>] :megasr:hal_ahci_isr+0xc27/0xca5
 [<ffffffff880b6d9b>] :megasr:lsraid_isr+0x20/0x49
 [<ffffffff80010ca5>] handle_IRQ_event+0x22/0x56
 [<ffffffff800bdea9>] __do_IRQ+0xa5/0x107
 [<ffffffff80012204>] __do_softirq+0x95/0xf3
 [<ffffffff8006d8fc>] do_IRQ+0xf6/0x104
 [<ffffffff8005d652>] ret_from_intr+0x0/0xf
 <EOI>  [<ffffffff8001600c>] __bitmap_empty+0x58/0x64
 [<ffffffff800769fd>] flush_tlb_mm+0xa8/0xd5
 [<ffffffff800211f3>] sys_mprotect+0x6b9/0x858
 [<ffffffff8005d2a6>] tracesys+0xd5/0xdf


Code: 83 3f 01 78 f9 f0 ff 0f 0f 88 ed ff ff ff c3 89 d1 f3 a5 c3 
Kernel panic - not syncing: nmi watchdog
 <0>BUG: spinlock lockup on CPU#3, kjournald/615, ffffffff802efea0 (Tainted: P     )

Call Trace:
 <NMI>  [<ffffffff800077ac>] _raw_spin_lock+0xcd/0xeb
 [<ffffffff80066411>] oops_begin+0x3a/0x70
 [<ffffffff80066654>] die_nmi+0xf/0xbc
 [<ffffffff80066d61>] nmi_watchdog_tick+0x107/0x1fb
 [<ffffffff800668fa>] default_do_nmi+0x86/0x214
 [<ffffffff80066eb6>] do_nmi+0x61/0x94
 [<ffffffff800661f3>] nmi+0x7f/0x80
 [<ffffffff8001532c>] sync_buffer+0x0/0x42
 [<ffffffff8000774c>] _raw_spin_lock+0x6d/0xeb
 <<EOE>>  [<ffffffff8807cd90>] :scsi_mod:scsi_request_fn+0x197/0x395
 [<ffffffff8005a789>] generic_unplug_device+0x18/0x25
 [<ffffffff8816abd2>] :dm_mod:dm_table_unplug_all+0x33/0x41
 [<ffffffff88168d57>] :dm_mod:dm_unplug_all+0x1d/0x28
 [<ffffffff8005c2bd>] blk_backing_dev_unplug+0x54/0x5a
 [<ffffffff80015365>] sync_buffer+0x39/0x42
 [<ffffffff80063aa1>] __wait_on_bit+0x40/0x6e
 [<ffffffff8001532c>] sync_buffer+0x0/0x42
 [<ffffffff80063b3b>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff8009f1df>] wake_bit_function+0x0/0x23
 [<ffffffff88035a9f>] :jbd:journal_commit_transaction+0x5eb/0x1160
 [<ffffffff88039d85>] :jbd:kjournald+0xc2/0x21d
 [<ffffffff8009f1b1>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88039cc3>] :jbd:kjournald+0x0/0x21d
 [<ffffffff8009effd>] keventd_create_kthread+0x0/0x66
 [<ffffffff80033491>] kthread+0xfe/0x132
 [<ffffffff8005e079>] child_rip+0xa/0x11
 [<ffffffff8005d6a8>] restore_args+0x0/0x30
 [<ffffffff80033393>] kthread+0x0/0x132
 [<ffffffff8005e06f>] child_rip+0x0/0x11

Comment 1 Atul Mukker 2008-02-08 16:11:23 UTC
Created attachment 294370 [details]
Kernel Log

Comment 2 Bibhu Das 2008-02-21 16:20:48 UTC
Any update on this?

Comment 5 Tom Coughlan 2008-03-05 14:26:41 UTC
(In reply to comment #0)

> Pid: 4609, comm: dd Tainted: P      2.6.18-prep #1

To recap earlier mail discussion: 1) Chip indicated that it is ok to call
kill_fasync() from ISR. 2) The trace indicates that there is a deadlock
involving your code. He suggested turning on lockdep, by building against the
RHEL-5 kernel-debug package. The additional debug information might help you to
identify the problem in your closed source driver code. 

This is about as far as we are prepared to go in helping you to solve this
problem in your driver. If you can reproduce this problem on a kernel that we
ship, or even with a driver whose source code is available to us, we maybe able
to justify spending some time analyzing this. Otherwise we will consult with you
on this as time permits, but we will not have the time to do any detailed analysis. 

I will leave this BZ open for a while, in case you have specific follow-up
questions resulting from your analysis of the trace. 



Comment 6 Bibhu Das 2008-03-05 15:09:25 UTC
Hi,
  We already run the test with the RHEL-5 kernel debug package and the log is 
already attached here with. Could you please have a look at the log?

Comment 7 Tom Coughlan 2008-03-05 15:51:14 UTC
(In reply to comment #6)

> Could you please have a look at the log?

No. We expect you to debug your closed source driver.  



Comment 8 Martin Wilck 2008-03-05 16:59:28 UTC
Atul:

Please run the megasr driver with the CL_IOCTL log flag, and try to generate a
kdump when the deadlock has occured. In the stack it appears that the kernel is
trying to do 

472 void send_sigio(struct fown_struct *fown, int fd, int band)
473 {
474         struct task_struct *p;
475         int pid;
476         
477         read_lock(&fown->lock);           <====== deadlock here
478         pid = fown->pid;

This is strange because that lock is hardly ever taken in write mode. We need a
dump to see what's really going on. Please use a stock Red Hat kernel, so that
we can analyze more easily.

Martin