Description of problem: Triggered a "possible circular locking dependency detected" warning when loading a usb mass storage driver under kernel-debug on amd-pence-01.lab.bos.redhat.com. Full console boot log: http://rhts.redhat.com/testlogs/2010/04/148476/383195/3098427/console.txt Version-Release number of selected component (if applicable): kernel-debug-2.6.18-196.el5 How reproducible: Just boot the system
======================================================= usb-storage: device scan complete [ INFO: possible circular locking dependency detected ] 2.6.18-196.el5debug #1 ------------------------------------------------------- usb-stor-scan/765 is trying to acquire lock: (cfq_exit_lock){--..}, at: [<ffffffff801599fd>] cfq_drop_dead_cic+0x19/0x69 but task is already holding lock: (&q->__queue_lock){.+..}, at: [<ffffffff801528f4>] blk_get_request+0x27/0x68 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&q->__queue_lock){.+..}: [<ffffffff800aad5d>] __lock_acquire+0x9bc/0xaee [<ffffffff800aaee4>] lock_acquire+0x55/0x6f [<ffffffff80034933>] cfq_exit_io_context+0x80/0x10d [<ffffffff800684d6>] _spin_lock+0x26/0x52 [<ffffffff80034933>] cfq_exit_io_context+0x80/0x10d [<ffffffff8005d565>] exit_io_context+0xa3/0xae [<ffffffff80016a2d>] do_exit+0x8f8/0x934 [<ffffffff8004b7b5>] debug_mutex_init+0x0/0x3b [<ffffffff80060116>] system_call+0x7e/0x83 [<ffffffffffffffff>] 0xffffffffffffffff -> #0 (cfq_exit_lock){--..}: [<ffffffff800aac73>] __lock_acquire+0x8d2/0xaee [<ffffffff800aaee4>] lock_acquire+0x55/0x6f [<ffffffff801599fd>] cfq_drop_dead_cic+0x19/0x69 [<ffffffff800684d6>] _spin_lock+0x26/0x52 [<ffffffff801599fd>] cfq_drop_dead_cic+0x19/0x69 [<ffffffff80159a75>] cfq_cic_rb_lookup+0x28/0x4a [<ffffffff8015b97b>] cfq_may_queue+0x20/0x112 [<ffffffff80013ddc>] get_request+0x29/0x34a [<ffffffff8002a166>] get_request_wait+0x21/0x11d [<ffffffff801528f4>] blk_get_request+0x27/0x68 [<ffffffff8807aca6>] scsi_execute+0x3a/0xeb [scsi_mod] [<ffffffff8807adfc>] scsi_execute_req+0xa5/0xce [scsi_mod] [<ffffffff8807c131>] scsi_probe_and_add_lun+0x207/0x9c9 [scsi_mod] [<ffffffff8807cc9a>] scsi_alloc_target+0x274/0x338 [scsi_mod] [<ffffffff8807cea6>] __scsi_scan_target+0xc3/0x5cd [scsi_mod] [<ffffffff800aa04c>] trace_hardirqs_on+0x11b/0x13f [<ffffffff8807d3f5>] scsi_scan_channel+0x45/0x70 [scsi_mod] [<ffffffff8807d4e5>] scsi_scan_host_selected+0xc5/0xff [scsi_mod] [<ffffffff800a5461>] keventd_create_kthread+0x0/0xc9 [<ffffffff88100853>] usb_stor_scan_thread+0x145/0x16d [usb_storage] [<ffffffff800a5678>] autoremove_wake_function+0x0/0x2e [<ffffffff8810070e>] usb_stor_scan_thread+0x0/0x16d [usb_storage] [<ffffffff8003487f>] kthread+0xfe/0x132 [<ffffffff80067fea>] trace_hardirqs_on_thunk+0x35/0x37 [<ffffffff80061079>] child_rip+0xa/0x11 [<ffffffff800688ed>] _spin_unlock_irq+0x24/0x27 [<ffffffff800606a8>] restore_args+0x0/0x30 [<ffffffff8004fdcc>] run_workqueue+0x16/0xf4 [<ffffffff80034781>] kthread+0x0/0x132 [<ffffffff8006106f>] child_rip+0x0/0x11 [<ffffffffffffffff>] 0xffffffffffffffff other info that might help us debug this: 2 locks held by usb-stor-scan/765: #0: (&shost->scan_mutex){--..}, at: [<ffffffff8807d4c0>] scsi_scan_host_selected+0xa0/0xff [scsi_mod] #1: (&q->__queue_lock){.+..}, at: [<ffffffff801528f4>] blk_get_request+0x27/0x68 stack backtrace: Call Trace: [<ffffffff800a962b>] print_circular_bug_tail+0x65/0x6e [<ffffffff800aac73>] __lock_acquire+0x8d2/0xaee [<ffffffff800aaee4>] lock_acquire+0x55/0x6f [<ffffffff801599fd>] cfq_drop_dead_cic+0x19/0x69 [<ffffffff800684d6>] _spin_lock+0x26/0x52 [<ffffffff801599fd>] cfq_drop_dead_cic+0x19/0x69 [<ffffffff80159a75>] cfq_cic_rb_lookup+0x28/0x4a [<ffffffff8015b97b>] cfq_may_queue+0x20/0x112 [<ffffffff80013ddc>] get_request+0x29/0x34a [<ffffffff8002a166>] get_request_wait+0x21/0x11d [<ffffffff801528f4>] blk_get_request+0x27/0x68 [<ffffffff8807aca6>] :scsi_mod:scsi_execute+0x3a/0xeb [<ffffffff8807adfc>] :scsi_mod:scsi_execute_req+0xa5/0xce [<ffffffff8807c131>] :scsi_mod:scsi_probe_and_add_lun+0x207/0x9c9 [<ffffffff8807cc9a>] :scsi_mod:scsi_alloc_target+0x274/0x338 [<ffffffff8807cea6>] :scsi_mod:__scsi_scan_target+0xc3/0x5cd [<ffffffff800aa04c>] trace_hardirqs_on+0x11b/0x13f [<ffffffff8807d3f5>] :scsi_mod:scsi_scan_channel+0x45/0x70 [<ffffffff8807d4e5>] :scsi_mod:scsi_scan_host_selected+0xc5/0xff [<ffffffff800a5461>] keventd_create_kthread+0x0/0xc9 [<ffffffff88100853>] :usb_storage:usb_stor_scan_thread+0x145/0x16d [<ffffffff800a5678>] autoremove_wake_function+0x0/0x2e [<ffffffff8810070e>] :usb_storage:usb_stor_scan_thread+0x0/0x16d [<ffffffff8003487f>] kthread+0xfe/0x132 [<ffffffff80067fea>] trace_hardirqs_on_thunk+0x35/0x37 [<ffffffff80061079>] child_rip+0xa/0x11 [<ffffffff800688ed>] _spin_unlock_irq+0x24/0x27 [<ffffffff800606a8>] restore_args+0x0/0x30 [<ffffffff8004fdcc>] run_workqueue+0x16/0xf4 [<ffffffff80034781>] kthread+0x0/0x132 [<ffffffff8006106f>] child_rip+0x0/0x11
Including Stratus to this bugzilla - they are hitting this via a difficult reproduction methodology, but still hitting it regardless. From Stratus: In two cases Stratus reproduced this with surprise disk removals, in another case the removal of a IO assembly. It looks like patch "linux-2.6-block-cfq-iosched-get-rid-of-cfqq-hash.patch" is what introduced this bug in RHEL 5.5. There are a bunch of other patches in the same area as this one, so the fix probably isn't as simple as just backing out this patch.
The following commit should take care of this, though there is a kabi issue to be overcome. commit fc46379daf90dce57bf765c81d3b39f55150aac2 Author: Jens Axboe <axboe> Date: Tue Aug 29 09:05:44 2006 +0200 [PATCH] cfq-iosched: kill cfq_exit_lock cfq_exit_lock is protecting two things now: - The per-ioc rbtree of cfq_io_contexts - The per-cfqd linked list of cfq_io_contexts The per-cfqd linked list can be protected by the queue lock, as it is (by definition) per cfqd as the queue lock is. The per-ioc rbtree is mainly used and updated by the process itself only. The only outside use is the io priority changing. If we move the priority changing to not browsing the rbtree, we can remove any locking from the rbtree updates and lookup completely. Let the sys_ioprio syscall just mark processes as having the iopriority changed and lazily update the private cfq io contexts the next time io is queued, and we can remove this locking as well.
We'll also need this commit: commit c1b707d253fe918b92882cff1dbd926b47e14fd2 Author: Jens Axboe <jens.axboe> Date: Mon Oct 30 19:54:23 2006 +0100 [PATCH] CFQ: bad locking in changed_ioprio() When the ioprio code recently got juggled a bit, a bug was introduced. changed_ioprio() is no longer called with interrupts disabled, so using plain spin_lock() on the queue_lock is a bug.
Created attachment 419377 [details] Get rid of the cfq_exit_lock completely.
Created attachment 419378 [details] Fix the poor irq assumptions in the last patch
A test kernel with the above two patches can be found at: http://people.redhat.com/jmoyer/bz582435/ If you need a kernel for an architecture other than x86_64, please let me know soon. I tested this internally on the system that produced the problem, and the log messages are gone. Andrius, please provide test results from Stratus when you have them. Thanks!
Chas, can you please test this test kernel?
Jeff - BTW, Stratus only offers x86_64 machines, so I don't think they know or care about anything else...
Stratus will test this and get back to you shortly (probably Monday to give us a weekend long run). Andrius is correct, Stratus only cares about x86_64. Thanks for the help and quick response!
Would it be possible to post the devel RPM that goes along with this test kernel? We need it to build our drivers against the kernel. (If it's not possible, I'll just test without our drivers).
(In reply to comment #12) > Would it be possible to post the devel RPM that goes along with this test > kernel? We need it to build our drivers against the kernel. (If it's not > possible, I'll just test without our drivers). Done.
This fix tested successfully at Stratus on three systems running over the weekend running various storage configurations with stress and Stratus' automated surprise removal and surprise insertion tests.
This is a regression from 5.4 - Stratus is requesting this be included in 5.5.z as well.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Requesting for 5.5.z since this is a regression.
in kernel-2.6.18-206.el5 You can download this test kernel from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.
*** Bug 618211 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0017.html