Description of problem:
While under heavy load (significant numbers of asynchronous direct I/Os to multiple storage devices), I've seen this oops a handful of times (5?) over the past year (on various versions of RHEL4).

Version-Release number of selected component (if applicable):
2.6.9-34.EL

How reproducible:
Not very - as noted above, only seen it a handful of times.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Oops information:

Debug: sleeping function called from invalid context at kernel/workqueue.c:264
in_atomic():1[expected: 0], irqs_disabled():0

Call Trace:
 [<a000000100016ba0>] show_stack+0x80/0xa0 sp=e0000040fe80f9d0 bsp=e0000040fe809480
 [<a000000100016bf0>] dump_stack+0x30/0x60 sp=e0000040fe80fba0 bsp=e0000040fe809468
 [<a000000100068050>] __might_sleep+0x190/0x260 sp=e0000040fe80fba0 bsp=e0000040fe809440
 [<a0000001000a1770>] flush_workqueue+0x30/0x140 sp=e0000040fe80fbb0 bsp=e0000040fe809418
 [<a00000010017b800>] __put_ioctx+0xa0/0x1a0 sp=e0000040fe80fbb0 bsp=e0000040fe8093e0
 [<a00000010017cd40>] aio_complete+0x480/0x4a0 sp=e0000040fe80fbb0 bsp=e0000040fe809380
 [<a0000001001784f0>] finished_one_bio+0x190/0x220 sp=e0000040fe80fbb0 bsp=e0000040fe809348
 [<a000000100178a20>] dio_bio_complete+0x1c0/0x200 sp=e0000040fe80fbb0 bsp=e0000040fe8092f0
 [<a000000100178ac0>] dio_bio_end_aio+0x60/0x80 sp=e0000040fe80fbb0 bsp=e0000040fe8092d0
 [<a00000010012edf0>] bio_endio+0x110/0x1c0 sp=e0000040fe80fbb0 bsp=e0000040fe809298
 [<a000000100362560>] __end_that_request_first+0x2c0/0x4a0 sp=e0000040fe80fbb0 bsp=e0000040fe809230
 [<a0000001003627d0>] end_that_request_chunk+0x30/0x60 sp=e0000040fe80fbb0 bsp=e0000040fe809200
 [<a000000200078f10>] scsi_end_request+0x50/0x2e0 [scsi_mod] sp=e0000040fe80fbb0 bsp=e0000040fe8091b0
 [<a000000200079610>] scsi_io_completion+0x2b0/0xa00 [scsi_mod] sp=e0000040fe80fbb0 bsp=e0000040fe809130
 [<a000000200026710>] sd_rw_intr+0x110/0x700 [sd_mod] sp=e0000040fe80fbb0 bsp=e0000040fe8090e0
 [<a00000020006c230>] scsi_finish_command+0x2d0/0x300 [scsi_mod] sp=e0000040fe80fbb0 bsp=e0000040fe8090b0
 [<a00000020006c520>] scsi_softirq+0x2c0/0x300 [scsi_mod] sp=e0000040fe80fbb0 bsp=e0000040fe809070
 [<a000000100082510>] __do_softirq+0x1f0/0x240 sp=e0000040fe80fbc0 bsp=e0000040fe808fd8
 [<a0000001000825d0>] do_softirq+0x70/0xc0 sp=e0000040fe80fbc0 bsp=e0000040fe808f78
 [<a000000100015bd0>] ia64_handle_irq+0x1b0/0x1e0 sp=e0000040fe80fbc0 bsp=e0000040fe808f30
 [<a00000010000f5c0>] ia64_leave_kernel+0x0/0x260 sp=e0000040fe80fbc0 bsp=e0000040fe808f30
 [<a0000001000160c0>] ia64_pal_call_static+0xa0/0xc0 sp=e0000040fe80fd90 bsp=e0000040fe808ee0
 [<a000000100017740>] default_idle+0x140/0x1e0 sp=e0000040fe80fd90 bsp=e0000040fe808e90
 [<a000000100017900>] cpu_idle+0x120/0x2c0 sp=e0000040fe80fe30 bsp=e0000040fe808e48
 [<a00000010005b150>] start_secondary+0x2b0/0x2e0 sp=e0000040fe80fe30 bsp=e0000040fe808e10
 [<a000000100008180>] __end_ivt_text+0x260/0x290 sp=e0000040fe80fe30 bsp=e0000040fe808e10
Do you actually get a system crash or anything else going wrong, or just the messages?
Nope - the messages are just logged, and the system continues onwards. I _believe_ might_sleep is just a warning mechanism: meaning that one should _not_ be sleeping in this context, and the fact that we _might_ sleep means things aren't quite right. [Meaning: somebody above me in the call stack is doing something inherently wrong.]
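To illustrate the "warn and carry on" behaviour described in this comment, here is a minimal userspace sketch in C. It is only a model: atomic_depth, warnings_logged, and every model_* function are hypothetical stand-ins invented for illustration, not the kernel's actual implementation of might_sleep or flush_workqueue.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical userspace stand-ins for kernel state (assumptions). */
int atomic_depth;     /* > 0 models in_atomic() returning true */
int warnings_logged;  /* number of times the debug check has fired */

void enter_atomic(void) { atomic_depth++; }

void leave_atomic(void)
{
    assert(atomic_depth > 0);
    atomic_depth--;
}

/* Models the debug check: log a warning and continue, never abort. */
void model_might_sleep(const char *file, int line)
{
    if (atomic_depth > 0) {
        warnings_logged++;
        fprintf(stderr,
                "Debug: sleeping function called from invalid context at %s:%d\n",
                file, line);
    }
}

/* A function that may block, so it announces that before doing any work. */
void model_flush_workqueue(void)
{
    model_might_sleep(__FILE__, __LINE__);
    /* A real implementation would wait for queued work here. */
}

/* Models a completion handler that runs in softirq (atomic) context. */
int model_softirq_completion(void)
{
    enter_atomic();
    model_flush_workqueue();  /* the check fires, execution continues */
    leave_atomic();
    return warnings_logged;
}
```

Calling model_softirq_completion() logs one warning and returns normally, while calling model_flush_workqueue() outside the atomic section logs nothing. That mirrors the report: the message appears, but the system keeps running.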
This is indeed a corner case. The last user of the ioctx is the I/O path (meaning that the calling process either closed the context or went away before the I/O completed). I'll give this some thought.
It looks like someone ran into this on a 2.6.18.4 kernel. See the thread at: http://marc.theaimsgroup.com/?l=linux-ia64&m=116594483721437&w=2
Created attachment 144812
aio_complete should not drop the last reference to an ioctx

This is the fix Kenneth Chen posted for this problem. Please try it out if you get the chance.
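For readers who want to see the shape of the problem, the following userspace sketch in C models the corner case and one general way to honour the rule in the patch title. It is not the attached patch: struct model_ioctx, in_atomic_ctx, and the buggy_*/fixed_* functions are all hypothetical names, and the approach shown (teardown waits for in-flight requests so the completion path never holds the last reference) is an assumption about how such a rule could be met.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an AIO context (illustration only). */
struct model_ioctx {
    int users;             /* reference count */
    int reqs_active;       /* requests still in flight */
    bool freed;            /* set when the last reference is dropped */
    bool freed_in_atomic;  /* was the final put done in atomic context? */
};

bool in_atomic_ctx;  /* models in_atomic() */

/* Drop one reference; teardown happens on the final put. */
void model_put(struct model_ioctx *ctx)
{
    assert(ctx->users > 0);
    if (--ctx->users == 0) {
        ctx->freed = true;
        ctx->freed_in_atomic = in_atomic_ctx;  /* teardown may sleep */
    }
}

/* Buggy shape: each request pins the context, completion unpins it. */
void buggy_submit(struct model_ioctx *ctx) { ctx->users++; ctx->reqs_active++; }
void buggy_destroy(struct model_ioctx *ctx) { model_put(ctx); }
void buggy_complete(struct model_ioctx *ctx)
{
    in_atomic_ctx = true;   /* completion runs from softirq */
    ctx->reqs_active--;
    model_put(ctx);         /* may be the LAST reference */
    in_atomic_ctx = false;
}

/* Fixed shape: completion never touches the reference count. */
void fixed_submit(struct model_ioctx *ctx) { ctx->reqs_active++; }
void fixed_complete(struct model_ioctx *ctx)
{
    in_atomic_ctx = true;
    ctx->reqs_active--;
    in_atomic_ctx = false;
}

/* Returns false where real code would sleep until requests drain. */
bool fixed_destroy(struct model_ioctx *ctx)
{
    if (ctx->reqs_active > 0)
        return false;
    model_put(ctx);  /* final put, in process context */
    return true;
}

/* Owner destroys the context before the I/O completes. */
bool run_buggy(void)
{
    struct model_ioctx ctx = { .users = 1 };
    buggy_submit(&ctx);
    buggy_destroy(&ctx);
    buggy_complete(&ctx);
    assert(ctx.freed);
    return ctx.freed_in_atomic;
}

bool run_fixed(void)
{
    struct model_ioctx ctx = { .users = 1 };
    fixed_submit(&ctx);
    bool done = fixed_destroy(&ctx);  /* would block: one request pending */
    assert(!done);
    fixed_complete(&ctx);
    done = fixed_destroy(&ctx);
    assert(done && ctx.freed);
    return ctx.freed_in_atomic;
}
```

In this model run_buggy() reports that the final put happened in atomic context, which is the situation the call trace shows (__put_ioctx reached from aio_complete), while run_fixed() reports that it happened in process context.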
This request was evaluated by Red Hat Kernel Team for inclusion in a Red Hat Enterprise Linux maintenance release, and has moved to bugzilla status POST.
Committed in stream U6 build 55.10. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0791.html