Description of problem: When the mptctl_fasync() function has been called to enable async notification, the fasync_struct which is allocated is never freed. It remains on mptctl's event list even after the file is closed and released. After the file is closed, the fasync_struct now has an invalid file pointer which is dereferenced when the mptctl module calls kill_fasync() to report any events. The use of the invalid file pointer resulted in a deadlock on one system. The function send_sigio() tries to acquire the rwlock in the file's f_owner field. Since the memory has been freed and reused, the memory location which was the lock containes a value which causes the read_lock() attempt to spin forever. (taken from CAS core 20101125041947) PID: 0 TASK: ffff8100091c0100 CPU: 1 COMMAND: "swapper" #0 [ffff8100091e2f20] crash_nmi_callback at ffffffff8007a3bf #1 [ffff8100091e2f40] do_nmi at ffffffff8006585a #2 [ffff8100091e2f50] nmi at ffffffff80064ebf [exception RIP: __read_lock_failed+5] RIP: ffffffff8006216d RSP: ffff8100091dfd18 RFLAGS: 00000297 RAX: 0000000000000000 RBX: ffff8100151c2d68 RCX: 0000000000000000 RDX: 0000000000020001 RSI: 0000000000000012 RDI: ffff81001423ebc0 RBP: ffff81001423ebc0 R8: 0000000032140ba8 R9: 0000000000000020 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000020001 R13: 0000000000000012 R14: ffff81001fb80140 R15: 0000000000000002 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 --- <NMI exception stack> --- #3 [ffff8100091dfd18] __read_lock_failed at ffffffff8006216d #4 [ffff8100091dfd18] _read_lock at ffffffff80064b45 #5 [ffff8100091dfd20] send_sigio at ffffffff800e6a96 #6 [ffff8100091dfd50] __kill_fasync at ffffffff8004e93c #7 [ffff8100091dfd70] kill_fasync at ffffffff800452df #8 [ffff8100091dfd90] mptctl_event_process at ffffffff884718e8 #9 [ffff8100091dfdb0] mpt_base_reply at ffffffff880c8f60 #10 [ffff8100091dfe70] mpt_interrupt at ffffffff880cf85c #11 [ffff8100091dff20] handle_IRQ_event at ffffffff80010b81 #12 [ffff8100091dff50] __do_IRQ at ffffffff800b9808 #13 [ffff8100091dff90] do_IRQ at ffffffff8006c997 --- <IRQ stack> --- #14 [ffff8100091d9e48] ret_from_intr at ffffffff8005d615 [exception RIP: mwait_idle+54] RIP: ffffffff800571f4 RSP: ffff8100091d9ef0 RFLAGS: 00000246 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff80301698 RBP: ffff8100091c02f0 R8: ffff8100091d8000 R9: ffff81001eba7380 R10: 0000000000000002 R11: 0000000000000008 R12: 00000000029f0b8d R13: 00000bdc90695279 R14: ffff8100091c0860 R15: ffff8100091c0100 ORIG_RAX: ffffffffffffff56 CS: 0010 SS: 0018 #15 [ffff8100091d9ef0] cpu_idle at ffffffff8004939e The cpu is now stuck trying to to acquire an invalid lock that will never be released. The mptctl module has no release() function in its file_operations. It thus never knows about the file being closed and doesn't make a call to fasync_helper() to free up the fasync_struct for the closing file. Version-Release number of selected component (if applicable): kernel-2.6.18-164 Steps to reproduce: To reproduce 1: the mptctl's device file would be opened 2: fcntl() to set FASYNC on the device file handle 3: close the file 4: An mpt event to cause an async notification must occur. Actual results: kernel deadlocks from mptctl using an invalid pointer. Expected results: mptctl would free up the fasync_struct and not cause a deadlock.
Created attachment 467583 [details] add a mptctl_release function to clean up fasync state I've attached an untested patch. The upstream kernel's mptctl driver also looks to fail to clean up the fasync_struct at file close.
Kashyap, the patch seem to correct the issue. Is the the solution good from your point of view?
(In reply to comment #1) > Created attachment 467583 [details] > add a mptctl_release function to clean up fasync state > > I've attached an untested patch. The upstream kernel's mptctl driver also > looks to fail to clean up the fasync_struct at file close. David, I couldn't find it posted, has it already been posted?
(In reply to comment #8) > David, > I couldn't find it posted, has it already been posted? Sorry, this is confusing - meant posted upstream for example in scsi-misc
(In reply to comment #4) > Kashyap, > the patch seem to correct the issue. Is the the solution good from your point > of view? Ping Kashyap, any thoughts?
(In reply to comment #10) > (In reply to comment #4) > > Kashyap, > > the patch seem to correct the issue. Is the the solution good from your point > > of view? > > Ping Kashyap, > any thoughts? Tomas, Sorry for delay...! Patch absolutely fine.! I have to post this to uptream as well..!
(In reply to comment #13) > Tomas, Sorry for delay...! Patch absolutely fine.! I have to post this to > uptream as well..! Kashyp, thanks and for posting this upstream as well. Let me know when you will have this posted.
in kernel-2.6.18-244.el5 You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Calling the mptctl_fasync() function to enable async notification caused the fasync_struct data structure, which was allocated, to never be freed. fasync_struct remained on the event list of the mptctl module even after a file was closed and released. After the file was closed, fasync_struct had an invalid file pointer which was dereferenced when the mptctl module called the kill_fasync() function to report any events. The use of the invalid file pointer could result in a deadlock on the system because the send_sigio() function tried to acquire the rwlock in the f_owner field of the previously closed file. With this update, a release callback function has been added for the file operations in the mptctl module. fasync_struct is now properly freed when a file is closed, no longer causing a deadlock.
Code reviewed, patch found in kernel-2.6.18-272.el5 Sanity Only
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-1065.html