Description of problem:
We intermittently get a problem with <defunct> processes. Further investigation showed that one of the threads calls exit_aio() and apparently never returns. There were no disk issues that might hang AIO operations. The process simply called _exit() and unfortunately never actually exited.

Version-Release number of selected component (if applicable):
2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:54:53 EST 2006 i686 athlon i386 GNU/Linux

How reproducible:
Hard to tell. It happens rarely, but once it does, the only solution is to reboot the system, especially since the defunct process holds resources such as ports.

Steps to Reproduce:
1. Call _exit() from a multithreaded application using AIO

Actual results:
The process never terminates and the only solution is to reboot the system.

Expected results:
The process should exit successfully.

Additional info:
Backtraces of all threads belonging to the <defunct> process:

Oct 11 04:15:15 hydra48 kernel: testLoader.ex X C0135214 1584 31576 1 31705 4014 (L-TLB)
Oct 11 04:15:15 hydra48 kernel: f0978f90 00000046 f743ac00 c0135214 7fffffff f69bd2f0 00000000 00000000
Oct 11 04:15:15 hydra48 kernel: f69bd2f0 00000000 c2fb6740 c2fb5de0 00000000 0021e0b8 eb1ca504 0000c628
Oct 11 04:15:15 hydra48 kernel: f69bd2f0 ddaa61f0 ddaa635c 00000000 f0978f7c ddaa6730 f7ffb600 ddaa61f0
Oct 11 04:15:15 hydra48 kernel: Call Trace:
Oct 11 04:15:15 hydra48 kernel: [<c0135214>] sys_futex+0x101/0x10c
Oct 11 04:15:15 hydra48 kernel: [<c01246e0>] do_exit+0x3fa/0x404
Oct 11 04:15:15 hydra48 kernel: [<c01247d5>] sys_exit_group+0x0/0xd
Oct 11 04:15:15 hydra48 kernel: [<c02d2657>] syscall_call+0x7/0xb
Oct 11 04:15:15 hydra48 kernel: testLoader.ex D 00000001 3324 31705 1 13369 31576 (L-TLB)
Oct 11 04:15:15 hydra48 kernel: e0465e64 00000046 00000001 00000001 00000246 00000000 00000000 f68a90cc
Oct 11 04:15:15 hydra48 kernel: c0151c73 00000000 f743ac00 c2fc5de0 00000002 00001104 eb225bbc 0000c628
Oct 11 04:15:15 hydra48 kernel: f7f10b30 ddaa6770 ddaa68dc e0465e6c 00000000 ddaa6770 eb0ffb80 e0465e80
Oct 11 04:15:15 hydra48 kernel: Call Trace:
Oct 11 04:15:15 hydra48 kernel: [<c0151c73>] anon_vma_unlink+0x40/0x55
Oct 11 04:15:15 hydra48 kernel: [<c01797b1>] wait_for_all_aios+0x4b/0x64
Oct 11 04:15:15 hydra48 kernel: [<c011e71b>] default_wake_function+0x0/0xc
Oct 11 04:15:15 hydra48 kernel: [<c0179832>] exit_aio+0x2e/0x7b
Oct 11 04:15:15 hydra48 kernel: [<c0120531>] mmput+0x47/0x72
Oct 11 04:15:15 hydra48 kernel: [<c01244f5>] do_exit+0x20f/0x404
Oct 11 04:15:15 hydra48 kernel: [<c01247d5>] sys_exit_group+0x0/0xd
Oct 11 04:15:15 hydra48 kernel: [<c012ca28>] get_signal_to_deliver+0x31e/0x346
Oct 11 04:15:15 hydra48 kernel: [<c0105bb8>] do_signal+0x55/0xd9
Oct 11 04:15:15 hydra48 kernel: [<c011e71b>] default_wake_function+0x0/0xc
Oct 11 04:15:15 hydra48 kernel: [<c01350e2>] do_futex+0x29/0x5a
Oct 11 04:15:15 hydra48 kernel: [<c0135214>] sys_futex+0x101/0x10c
Oct 11 04:15:15 hydra48 kernel: [<c0105c64>] do_notify_resume+0x28/0x38
Oct 11 04:15:15 hydra48 kernel: [<c02d26a2>] work_notifysig+0x13/0x15
The problem shows up pretty often. Here is another dump of a hung process:

Oct 12 06:06:06 hydra38 kernel: testLoader.ex D 00000001 1652 28154 1 7288 11479 (L-TLB)
Oct 12 06:06:06 hydra38 kernel: f48a5f4c 00000046 00000001 00000001 00000246 00000000 00000000 f7290f94
Oct 12 06:06:06 hydra38 kernel: c0151c73 00000000 f0736400 c2fbdde0 00000001 000015c3 fc3d3c5b 00038462
Oct 12 06:06:06 hydra38 kernel: f7f110b0 cc0759b0 cc075b1c f48a5f54 00000000 cc0759b0 cca10a40 f48a5f68
Oct 12 06:06:06 hydra38 kernel: Call Trace:
Oct 12 06:06:06 hydra38 kernel: [<c0151c73>] anon_vma_unlink+0x40/0x55
Oct 12 06:06:06 hydra38 kernel: [<c01797b1>] wait_for_all_aios+0x4b/0x64
Oct 12 06:06:06 hydra38 kernel: [<c011e71b>] default_wake_function+0x0/0xc
Oct 12 06:06:06 hydra38 kernel: [<c0179832>] exit_aio+0x2e/0x7b
Oct 12 06:06:06 hydra38 kernel: [<c0120531>] mmput+0x47/0x72
Oct 12 06:06:06 hydra38 kernel: [<c01244f5>] do_exit+0x20f/0x404
Oct 12 06:06:06 hydra38 kernel: [<c01247d5>] sys_exit_group+0x0/0xd
Oct 12 06:06:06 hydra38 kernel: [<c02d2657>] syscall_call+0x7/0xb
Cc'ing jmoyer since he's been doing a lot of AIO work lately. I'm wondering whether this problem is a duplicate of, or will be affected by, the fix in bz 202186. That patch is geared toward fixing a problem where an I/O can get "lost" when a device is unreachable. This case is a bit different, but the fact that it's hanging in wait_for_all_aios makes me wonder if the problem is similar, in that we're waiting indefinitely for I/Os that have already completed. Can you or the customer test a development build with the patch from that BZ (-42.5 or later), and let me know if the problem is still there? If so, then please see if you can force a crash and collect a vmcore when the box is in this state.
Can you attach the reproducer? Is it doing O_DIRECT I/O? It only makes sense to try the patch from 202186 if direct I/O is being used. -Jeff
(In reply to comment #3)
> Can you attach the reproducer? Is it doing O_DIRECT I/O?
>
> It only makes sense to try the patch from 202186 if direct I/O is being used.
>
> -Jeff

Yes, the O_DIRECT flag is being used. We have upgraded the kernel to 2.6.9-42.0.2.ELsmp to see if the problem persists. If the bug shows up in this release I will let you know.
Unfortunately, I cannot attach the reproducer code :(.
Thanks for the reply. I'm going to put the bug back into NEEDINFO state, pending your testing results. Thanks!
Just a note that kernel 2.6.9-42.0.2 does not contain the patch for BZ 202186. So if that kernel shows the problem, please retest with one of the development RHEL4 kernels from jbaron's people page: http://people.redhat.com/jbaron/rhel4/ ...they should have the patch for that BZ.
I just got this bug on kernel 2.6.9-42.0.2.ELsmp. Backtraces:

Nov 10 10:07:57 hydra38 kernel: testLoader.ex X C01355C0 1580 22888 1 23003 11289 (L-TLB)
Nov 10 10:07:57 hydra38 kernel: d967eea8 00000046 f7dfe600 c01355c0 7fffffff f49434b0 00000000 00000000
Nov 10 10:07:57 hydra38 kernel: 00000000 00000000 c2fbe7c0 c2fbdde0 00000001 00000000 726b2600 00153052
Nov 10 10:07:57 hydra38 kernel: f49434b0 d5f0adb0 d5f0af1c d967ee94 d967ee94 d5f0b2f0 f7ffb600 d5f0adb0
Nov 10 10:07:57 hydra38 kernel: Call Trace:
Nov 10 10:07:57 hydra38 kernel: [<c01355c0>] sys_futex+0x101/0x10c
Nov 10 10:07:57 hydra38 kernel: [<c0124924>] do_exit+0x3fa/0x404
Nov 10 10:07:57 hydra38 kernel: [<c0124a19>] sys_exit_group+0x0/0xd
Nov 10 10:07:57 hydra38 kernel: [<c012cd46>] get_signal_to_deliver+0x31e/0x346
Nov 10 10:07:57 hydra38 kernel: [<c0105bd4>] do_signal+0x55/0xd9
Nov 10 10:07:57 hydra38 kernel: [<c011e794>] default_wake_function+0x0/0xc
Nov 10 10:07:57 hydra38 kernel: [<c013548e>] do_futex+0x29/0x5a
Nov 10 10:07:57 hydra38 kernel: [<c01355c0>] sys_futex+0x101/0x10c
Nov 10 10:07:57 hydra38 kernel: [<c0105c80>] do_notify_resume+0x28/0x38
Nov 10 10:07:57 hydra38 kernel: [<c02d4822>] work_notifysig+0x13/0x15
Nov 10 10:07:57 hydra38 kernel: testLoader.ex D 0000013F 2172 23003 1 1142 22888 (L-TLB)
Nov 10 10:07:57 hydra38 kernel: d1faae64 00000046 00012080 0000013f c0415e2c c0415e20 d1faae44 00000001
Nov 10 10:07:57 hydra38 kernel: c2fbe7c0 f7f11630 d1faae54 c2fbdde0 00000001 00000000 727a6840 00153052
Nov 10 10:07:57 hydra38 kernel: f7f110b0 d5f0a2b0 d5f0a41c d1faae6c d6dd6ad4 d5f0a2b0 d6dd6ac0 d1faae80
Nov 10 10:07:57 hydra38 kernel: Call Trace:
Nov 10 10:07:57 hydra38 kernel: [<c017a615>] wait_for_all_aios+0x4b/0x64
Nov 10 10:07:57 hydra38 kernel: [<c011e794>] default_wake_function+0x0/0xc
Nov 10 10:07:57 hydra38 kernel: [<c017a696>] exit_aio+0x2e/0x7b
Nov 10 10:07:57 hydra38 kernel: [<c0120795>] mmput+0x47/0x72
Nov 10 10:07:57 hydra38 kernel: [<c0124739>] do_exit+0x20f/0x404
Nov 10 10:07:57 hydra38 kernel: [<c0124a19>] sys_exit_group+0x0/0xd
Nov 10 10:07:57 hydra38 kernel: [<c012cd46>] get_signal_to_deliver+0x31e/0x346
Nov 10 10:07:57 hydra38 kernel: [<c0105bd4>] do_signal+0x55/0xd9
Nov 10 10:07:57 hydra38 kernel: [<c011e794>] default_wake_function+0x0/0xc
Nov 10 10:07:57 hydra38 kernel: [<c013548e>] do_futex+0x29/0x5a
Nov 10 10:07:57 hydra38 kernel: [<c01355c0>] sys_futex+0x101/0x10c
Nov 10 10:07:57 hydra38 kernel: [<c0105c80>] do_notify_resume+0x28/0x38
Nov 10 10:07:57 hydra38 kernel: [<c02d4822>] work_notifysig+0x13/0x15

I will try to upgrade the OS to the debug kernel version and see if the problem persists.
If the problem persists, please remember to force a crashdump and get that to support. Thanks.
Any testing results on this?
It seems we have another customer running into this problem with the patch mentioned earlier here. As such, I'd like to get a vmcore if at all possible.
(In reply to comment #11)
> It seems we have another customer running into this problem with the patch
> mentioned earlier here. As such, I'd like to get a vmcore if at all possible.

We have recompiled the kernel with CONFIG_PROC_VMCORE. If the problem shows up again I will copy /proc/vmcore and attach it to this bug report.
(In reply to comment #12)
> We have recompiled kernel with flag CONFIG_PROC_VMCORE. If problem shows up
> again I will copy /proc/vmcore and attach to this bug report.

You should not need to recompile your kernel at all. Simply enable netdump or diskdump in your environment. I wasn't aware that CONFIG_PROC_VMCORE was even an option in a RHEL 4 environment.
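For reference, enabling netdump on a RHEL 4 client generally amounts to pointing it at a netdump server and turning the service on. This is a sketch only; the server address is a placeholder, and your environment's netdump documentation is authoritative.

```shell
# /etc/sysconfig/netdump -- client-side configuration
# (192.168.0.10 is a placeholder for your netdump server)
NETDUMPADDR=192.168.0.10

# enable and start the netdump client service
chkconfig netdump on
service netdump start
```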
Created attachment 145468 [details] Fix a race between aio completion and exit paths Have you been able to reproduce the issue at all? Would you be willing to try the attached patch to see if it helps? Thanks!
I have test kernels available at: http://people.redhat.com/jmoyer/dio/ They contain a potential fix for this problem. Any testing you can provide would be greatly appreciated. Thanks.
2.6.9-42.27.EL.dio.2smp still has the same problem.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
OK, I have seen this happen in the following circumstances:

1. An AIO read is issued.
2. The DIO layer returns less data than requested (a short read, common if you try to read beyond EOF).
3. The I/O is cancelled, either via io_destroy or due to process exit.

Where the problem lies may be the subject of debate. I think that aio_pread should not return -EIOCBRETRY in the case of a short read, as it does here:

	if (!S_ISFIFO(inode->i_mode) && !S_ISSOCK(inode->i_mode))
		ret = -EIOCBRETRY;

However, it may be that there are valid cases for this, of which I am simply not aware. A safer approach to fixing the problem may be to add an aio_queue_work call to aio_run_iocb, like so:

	if (-EIOCBRETRY == ret) {
		/*
		 * OK, now that we are done with this iteration
		 * and know that there is more left to go,
		 * this is where we let go so that a subsequent
		 * "kick" can start the next iteration
		 */

		/* will make __queue_kicked_iocb succeed from here on */
		INIT_LIST_HEAD(&iocb->ki_run_list);

		/* we must queue the next iteration ourselves, if it
		 * has already been kicked */
		if (kiocbIsKicked(iocb) && __queue_kicked_iocb(iocb))
			aio_queue_work(ctx);
	}

I tested this approach, and it does appear to make the problem go away in my test setup. I'll upload a patch and build a kernel with this change in the next day or so.
Created attachment 152070 [details] be sure to schedule the aio work queue after a short read
I built new kernel rpms with the upstream version of this patch. They are available at: http://people.redhat.com/jmoyer/dio/ The version is 2.6.9-52.EL.dio.5. Please give this new kernel a try when you get a chance. Thanks!
This request was evaluated by Red Hat Kernel Team for inclusion in a Red Hat Enterprise Linux maintenance release, and has moved to bugzilla status POST.
Committed in stream U6 build 55.10. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
A fix for this issue should have been included in the packages contained in the RHEL4.6 Beta released on RHN (also available at partners.redhat.com).

Requested action: Please verify that your issue is fixed to ensure that it is included in this update release.

After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add a *keyword* of PartnerVerified (leaving the existing keywords unmodified).

If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to FAILS_QA.

If you cannot access bugzilla, please reply with a message to Issue Tracker and I will change the status for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0791.html