Bug 582435 - [Stratus 5.6 bug] Circular lock dep warning on cfq_exit_lock
Summary: [Stratus 5.6 bug] Circular lock dep warning on cfq_exit_lock
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.6
Hardware: All
OS: Linux
high
high
Target Milestone: rc
: 5.6
Assignee: Jeff Moyer
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
: 618211 (view as bug list)
Depends On:
Blocks: 557597 607483
TreeView+ depends on / blocked
 
Reported: 2010-04-14 21:37 UTC by Jarod Wilson
Modified: 2018-11-14 19:11 UTC (History)
12 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-01-13 21:27:35 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Get rid of the cfq_exit_lock completely. (8.73 KB, patch)
2010-06-03 13:49 UTC, Jeff Moyer
no flags Details | Diff
Fix the poor irq assumptions in the last patch (1.51 KB, patch)
2010-06-03 13:49 UTC, Jeff Moyer
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:0017 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update 2011-01-13 10:37:42 UTC

Description Jarod Wilson 2010-04-14 21:37:49 UTC
Description of problem:
Triggered a "possible circular locking dependency detected" warning when loading a usb mass storage driver under kernel-debug on amd-pence-01.lab.bos.redhat.com.

Full console boot log:

http://rhts.redhat.com/testlogs/2010/04/148476/383195/3098427/console.txt

Version-Release number of selected component (if applicable):
kernel-debug-2.6.18-196.el5

How reproducible:
Just boot the system

Comment 1 Jarod Wilson 2010-04-14 21:39:24 UTC
=======================================================
usb-storage: device scan complete
[ INFO: possible circular locking dependency detected ]
2.6.18-196.el5debug #1
-------------------------------------------------------
usb-stor-scan/765 is trying to acquire lock:
 (cfq_exit_lock){--..}, at: [<ffffffff801599fd>] cfq_drop_dead_cic+0x19/0x69

but task is already holding lock:
 (&q->__queue_lock){.+..}, at: [<ffffffff801528f4>] blk_get_request+0x27/0x68

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (&q->__queue_lock){.+..}:
       [<ffffffff800aad5d>] __lock_acquire+0x9bc/0xaee
       [<ffffffff800aaee4>] lock_acquire+0x55/0x6f
       [<ffffffff80034933>] cfq_exit_io_context+0x80/0x10d
       [<ffffffff800684d6>] _spin_lock+0x26/0x52
       [<ffffffff80034933>] cfq_exit_io_context+0x80/0x10d
       [<ffffffff8005d565>] exit_io_context+0xa3/0xae
       [<ffffffff80016a2d>] do_exit+0x8f8/0x934
       [<ffffffff8004b7b5>] debug_mutex_init+0x0/0x3b
       [<ffffffff80060116>] system_call+0x7e/0x83
       [<ffffffffffffffff>] 0xffffffffffffffff

-> #0 (cfq_exit_lock){--..}:
       [<ffffffff800aac73>] __lock_acquire+0x8d2/0xaee
       [<ffffffff800aaee4>] lock_acquire+0x55/0x6f
       [<ffffffff801599fd>] cfq_drop_dead_cic+0x19/0x69
       [<ffffffff800684d6>] _spin_lock+0x26/0x52
       [<ffffffff801599fd>] cfq_drop_dead_cic+0x19/0x69
       [<ffffffff80159a75>] cfq_cic_rb_lookup+0x28/0x4a
       [<ffffffff8015b97b>] cfq_may_queue+0x20/0x112
       [<ffffffff80013ddc>] get_request+0x29/0x34a
       [<ffffffff8002a166>] get_request_wait+0x21/0x11d
       [<ffffffff801528f4>] blk_get_request+0x27/0x68
       [<ffffffff8807aca6>] scsi_execute+0x3a/0xeb [scsi_mod]
       [<ffffffff8807adfc>] scsi_execute_req+0xa5/0xce [scsi_mod]
       [<ffffffff8807c131>] scsi_probe_and_add_lun+0x207/0x9c9 [scsi_mod]
       [<ffffffff8807cc9a>] scsi_alloc_target+0x274/0x338 [scsi_mod]
       [<ffffffff8807cea6>] __scsi_scan_target+0xc3/0x5cd [scsi_mod]
       [<ffffffff800aa04c>] trace_hardirqs_on+0x11b/0x13f
       [<ffffffff8807d3f5>] scsi_scan_channel+0x45/0x70 [scsi_mod]
       [<ffffffff8807d4e5>] scsi_scan_host_selected+0xc5/0xff [scsi_mod]
       [<ffffffff800a5461>] keventd_create_kthread+0x0/0xc9
       [<ffffffff88100853>] usb_stor_scan_thread+0x145/0x16d [usb_storage]
       [<ffffffff800a5678>] autoremove_wake_function+0x0/0x2e
       [<ffffffff8810070e>] usb_stor_scan_thread+0x0/0x16d [usb_storage]
       [<ffffffff8003487f>] kthread+0xfe/0x132
       [<ffffffff80067fea>] trace_hardirqs_on_thunk+0x35/0x37
       [<ffffffff80061079>] child_rip+0xa/0x11
       [<ffffffff800688ed>] _spin_unlock_irq+0x24/0x27
       [<ffffffff800606a8>] restore_args+0x0/0x30
       [<ffffffff8004fdcc>] run_workqueue+0x16/0xf4
       [<ffffffff80034781>] kthread+0x0/0x132
       [<ffffffff8006106f>] child_rip+0x0/0x11
       [<ffffffffffffffff>] 0xffffffffffffffff

other info that might help us debug this:

2 locks held by usb-stor-scan/765:
 #0:  (&shost->scan_mutex){--..}, at: [<ffffffff8807d4c0>] scsi_scan_host_selected+0xa0/0xff [scsi_mod]
 #1:  (&q->__queue_lock){.+..}, at: [<ffffffff801528f4>] blk_get_request+0x27/0x68

stack backtrace:

Call Trace:
 [<ffffffff800a962b>] print_circular_bug_tail+0x65/0x6e
 [<ffffffff800aac73>] __lock_acquire+0x8d2/0xaee
 [<ffffffff800aaee4>] lock_acquire+0x55/0x6f
 [<ffffffff801599fd>] cfq_drop_dead_cic+0x19/0x69
 [<ffffffff800684d6>] _spin_lock+0x26/0x52
 [<ffffffff801599fd>] cfq_drop_dead_cic+0x19/0x69
 [<ffffffff80159a75>] cfq_cic_rb_lookup+0x28/0x4a
 [<ffffffff8015b97b>] cfq_may_queue+0x20/0x112
 [<ffffffff80013ddc>] get_request+0x29/0x34a
 [<ffffffff8002a166>] get_request_wait+0x21/0x11d
 [<ffffffff801528f4>] blk_get_request+0x27/0x68
 [<ffffffff8807aca6>] :scsi_mod:scsi_execute+0x3a/0xeb
 [<ffffffff8807adfc>] :scsi_mod:scsi_execute_req+0xa5/0xce
 [<ffffffff8807c131>] :scsi_mod:scsi_probe_and_add_lun+0x207/0x9c9
 [<ffffffff8807cc9a>] :scsi_mod:scsi_alloc_target+0x274/0x338
 [<ffffffff8807cea6>] :scsi_mod:__scsi_scan_target+0xc3/0x5cd
 [<ffffffff800aa04c>] trace_hardirqs_on+0x11b/0x13f
 [<ffffffff8807d3f5>] :scsi_mod:scsi_scan_channel+0x45/0x70
 [<ffffffff8807d4e5>] :scsi_mod:scsi_scan_host_selected+0xc5/0xff
 [<ffffffff800a5461>] keventd_create_kthread+0x0/0xc9
 [<ffffffff88100853>] :usb_storage:usb_stor_scan_thread+0x145/0x16d
 [<ffffffff800a5678>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8810070e>] :usb_storage:usb_stor_scan_thread+0x0/0x16d
 [<ffffffff8003487f>] kthread+0xfe/0x132
 [<ffffffff80067fea>] trace_hardirqs_on_thunk+0x35/0x37
 [<ffffffff80061079>] child_rip+0xa/0x11
 [<ffffffff800688ed>] _spin_unlock_irq+0x24/0x27
 [<ffffffff800606a8>] restore_args+0x0/0x30
 [<ffffffff8004fdcc>] run_workqueue+0x16/0xf4
 [<ffffffff80034781>] kthread+0x0/0x132
 [<ffffffff8006106f>] child_rip+0x0/0x11

Comment 2 Andrius Benokraitis 2010-06-02 12:54:16 UTC
Including Stratus to this bugzilla - they are hitting this via a difficult reproduction methodology, but still hitting it regardless. 

From Stratus:

In two cases Stratus reproduced this with surprise disk removals, in another case the removal of a IO assembly.

It looks like patch "linux-2.6-block-cfq-iosched-get-rid-of-cfqq-hash.patch" is what introduced this bug in RHEL 5.5. There are a bunch of other patches in the same area as this one, so the fix probably isn't as simple as just backing out this patch.

Comment 3 Jeff Moyer 2010-06-02 14:51:57 UTC
The following commit should take care of this, though there is a kabi issue to be overcome.

commit fc46379daf90dce57bf765c81d3b39f55150aac2
Author: Jens Axboe <axboe>
Date:   Tue Aug 29 09:05:44 2006 +0200

    [PATCH] cfq-iosched: kill cfq_exit_lock
    
    cfq_exit_lock is protecting two things now:
    
    - The per-ioc rbtree of cfq_io_contexts
    
    - The per-cfqd linked list of cfq_io_contexts
    
    The per-cfqd linked list can be protected by the queue lock, as it is (by
    definition) per cfqd as the queue lock is.
    
    The per-ioc rbtree is mainly used and updated by the process itself only.
    The only outside use is the io priority changing. If we move the
    priority changing to not browsing the rbtree, we can remove any locking
    from the rbtree updates and lookup completely. Let the sys_ioprio syscall
    just mark processes as having the iopriority changed and lazily update
    the private cfq io contexts the next time io is queued, and we can
    remove this locking as well.

Comment 4 Jeff Moyer 2010-06-02 15:04:41 UTC
We'll also need this commit:

commit c1b707d253fe918b92882cff1dbd926b47e14fd2
Author: Jens Axboe <jens.axboe>
Date:   Mon Oct 30 19:54:23 2006 +0100

    [PATCH] CFQ: bad locking in changed_ioprio()
    
    When the ioprio code recently got juggled a bit, a bug was introduced.
    changed_ioprio() is no longer called with interrupts disabled, so using
    plain spin_lock() on the queue_lock is a bug.

Comment 5 Jeff Moyer 2010-06-03 13:49:14 UTC
Created attachment 419377 [details]
Get rid of the cfq_exit_lock completely.

Comment 6 Jeff Moyer 2010-06-03 13:49:51 UTC
Created attachment 419378 [details]
Fix the poor irq assumptions in the last patch

Comment 7 Jeff Moyer 2010-06-03 14:00:00 UTC
A test kernel with the above two patches can be found at:
  http://people.redhat.com/jmoyer/bz582435/

If you need a kernel for an architecture other than x86_64, please let me know soon.  I tested this internally on the system that produced the problem, and the log messages are gone.

Andrius, please provide test results from Stratus when you have them.

Thanks!

Comment 8 Andrius Benokraitis 2010-06-03 15:45:28 UTC
Chas, can you please test this test kernel?

Comment 9 Andrius Benokraitis 2010-06-03 15:48:30 UTC
Jeff - BTW, Stratus only offers x86_64 machines, so I don't think they know or care about anything else...

Comment 10 Chas Horvath 2010-06-03 15:53:30 UTC
Stratus will test this and get back to you shortly (probably Monday to give us a weekend long run).  Andrius is correct, Stratus only cares about x86_64.  Thanks for the help and quick response!

Comment 12 nate.dailey 2010-06-04 18:00:53 UTC
Would it be possible to post the devel RPM that goes along with this test kernel? We need it to build our drivers against the kernel. (If it's not possible, I'll just test without our drivers).

Comment 13 Jeff Moyer 2010-06-04 18:44:36 UTC
(In reply to comment #12)
> Would it be possible to post the devel RPM that goes along with this test
> kernel? We need it to build our drivers against the kernel. (If it's not
> possible, I'll just test without our drivers).    

Done.

Comment 14 Chas Horvath 2010-06-07 14:33:29 UTC
This fix tested successfully at Stratus on three systems running over the weekend running various storage configurations with stress and Stratus' automated surprise removal and surprise insertion tests.

Comment 15 Andrius Benokraitis 2010-06-08 18:04:43 UTC
This is a regression from 5.4 - Stratus is requesting this be included in 5.5.z as well.

Comment 18 RHEL Program Management 2010-06-14 18:09:16 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 19 Andrius Benokraitis 2010-06-14 18:37:18 UTC
Requesting for 5.5.z since this is a regression.

Comment 23 Jarod Wilson 2010-07-12 15:45:57 UTC
in kernel-2.6.18-206.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 25 Jeff Moyer 2010-07-26 20:33:16 UTC
*** Bug 618211 has been marked as a duplicate of this bug. ***

Comment 28 errata-xmlrpc 2011-01-13 21:27:35 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html


Note You need to log in before you can comment on or make changes to this bug.