Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

For bugs related to Red Hat Enterprise Linux 5 product line. The current stable release is 5.10. For Red Hat Enterprise Linux 6 and above, please visit Red Hat JIRA https://issues.redhat.com/secure/CreateIssue!default.jspa?pid=12332745 to report new issues.

Bug 582435

Summary:

[Stratus 5.6 bug] Circular lock dep warning on cfq_exit_lock

Product:

Red Hat Enterprise Linux 5

Reporter:

Jarod Wilson <jarod>

Component:

kernel

Assignee:

Jeff Moyer <jmoyer>

Status:

CLOSED ERRATA

QA Contact:

Red Hat Kernel QE team <kernel-qe>

Severity:

high

Docs Contact:

Priority:

high

Version:

5.6

CC:

andriusb, bandan.das, bmr, chas.horvath, cward, dhoward, jburke, jparadis, jpirko, nate.dailey, robert.evans, tao

Target Milestone:

Keywords:

OtherQA, Regression, ZStream

Target Release:

5.6

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2011-01-13 21:27:35 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

557597, 607483

Attachments:

Description	Flags
Get rid of the cfq_exit_lock completely.	none
Fix the poor irq assumptions in the last patch	none

Description Jarod Wilson 2010-04-14 21:37:49 UTC

Description of problem:
Triggered a "possible circular locking dependency detected" warning when loading a usb mass storage driver under kernel-debug on amd-pence-01.lab.bos.redhat.com.

Full console boot log:

http://rhts.redhat.com/testlogs/2010/04/148476/383195/3098427/console.txt

Version-Release number of selected component (if applicable):
kernel-debug-2.6.18-196.el5

How reproducible:
Just boot the system

Comment 1 Jarod Wilson 2010-04-14 21:39:24 UTC

=======================================================
usb-storage: device scan complete
[ INFO: possible circular locking dependency detected ]
2.6.18-196.el5debug #1
-------------------------------------------------------
usb-stor-scan/765 is trying to acquire lock:
 (cfq_exit_lock){--..}, at: [<ffffffff801599fd>] cfq_drop_dead_cic+0x19/0x69

but task is already holding lock:
 (&q->__queue_lock){.+..}, at: [<ffffffff801528f4>] blk_get_request+0x27/0x68

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (&q->__queue_lock){.+..}:
       [<ffffffff800aad5d>] __lock_acquire+0x9bc/0xaee
       [<ffffffff800aaee4>] lock_acquire+0x55/0x6f
       [<ffffffff80034933>] cfq_exit_io_context+0x80/0x10d
       [<ffffffff800684d6>] _spin_lock+0x26/0x52
       [<ffffffff80034933>] cfq_exit_io_context+0x80/0x10d
       [<ffffffff8005d565>] exit_io_context+0xa3/0xae
       [<ffffffff80016a2d>] do_exit+0x8f8/0x934
       [<ffffffff8004b7b5>] debug_mutex_init+0x0/0x3b
       [<ffffffff80060116>] system_call+0x7e/0x83
       [<ffffffffffffffff>] 0xffffffffffffffff

-> #0 (cfq_exit_lock){--..}:
       [<ffffffff800aac73>] __lock_acquire+0x8d2/0xaee
       [<ffffffff800aaee4>] lock_acquire+0x55/0x6f
       [<ffffffff801599fd>] cfq_drop_dead_cic+0x19/0x69
       [<ffffffff800684d6>] _spin_lock+0x26/0x52
       [<ffffffff801599fd>] cfq_drop_dead_cic+0x19/0x69
       [<ffffffff80159a75>] cfq_cic_rb_lookup+0x28/0x4a
       [<ffffffff8015b97b>] cfq_may_queue+0x20/0x112
       [<ffffffff80013ddc>] get_request+0x29/0x34a
       [<ffffffff8002a166>] get_request_wait+0x21/0x11d
       [<ffffffff801528f4>] blk_get_request+0x27/0x68
       [<ffffffff8807aca6>] scsi_execute+0x3a/0xeb [scsi_mod]
       [<ffffffff8807adfc>] scsi_execute_req+0xa5/0xce [scsi_mod]
       [<ffffffff8807c131>] scsi_probe_and_add_lun+0x207/0x9c9 [scsi_mod]
       [<ffffffff8807cc9a>] scsi_alloc_target+0x274/0x338 [scsi_mod]
       [<ffffffff8807cea6>] __scsi_scan_target+0xc3/0x5cd [scsi_mod]
       [<ffffffff800aa04c>] trace_hardirqs_on+0x11b/0x13f
       [<ffffffff8807d3f5>] scsi_scan_channel+0x45/0x70 [scsi_mod]
       [<ffffffff8807d4e5>] scsi_scan_host_selected+0xc5/0xff [scsi_mod]
       [<ffffffff800a5461>] keventd_create_kthread+0x0/0xc9
       [<ffffffff88100853>] usb_stor_scan_thread+0x145/0x16d [usb_storage]
       [<ffffffff800a5678>] autoremove_wake_function+0x0/0x2e
       [<ffffffff8810070e>] usb_stor_scan_thread+0x0/0x16d [usb_storage]
       [<ffffffff8003487f>] kthread+0xfe/0x132
       [<ffffffff80067fea>] trace_hardirqs_on_thunk+0x35/0x37
       [<ffffffff80061079>] child_rip+0xa/0x11
       [<ffffffff800688ed>] _spin_unlock_irq+0x24/0x27
       [<ffffffff800606a8>] restore_args+0x0/0x30
       [<ffffffff8004fdcc>] run_workqueue+0x16/0xf4
       [<ffffffff80034781>] kthread+0x0/0x132
       [<ffffffff8006106f>] child_rip+0x0/0x11
       [<ffffffffffffffff>] 0xffffffffffffffff

other info that might help us debug this:

2 locks held by usb-stor-scan/765:
 #0:  (&shost->scan_mutex){--..}, at: [<ffffffff8807d4c0>] scsi_scan_host_selected+0xa0/0xff [scsi_mod]
 #1:  (&q->__queue_lock){.+..}, at: [<ffffffff801528f4>] blk_get_request+0x27/0x68

stack backtrace:

Call Trace:
 [<ffffffff800a962b>] print_circular_bug_tail+0x65/0x6e
 [<ffffffff800aac73>] __lock_acquire+0x8d2/0xaee
 [<ffffffff800aaee4>] lock_acquire+0x55/0x6f
 [<ffffffff801599fd>] cfq_drop_dead_cic+0x19/0x69
 [<ffffffff800684d6>] _spin_lock+0x26/0x52
 [<ffffffff801599fd>] cfq_drop_dead_cic+0x19/0x69
 [<ffffffff80159a75>] cfq_cic_rb_lookup+0x28/0x4a
 [<ffffffff8015b97b>] cfq_may_queue+0x20/0x112
 [<ffffffff80013ddc>] get_request+0x29/0x34a
 [<ffffffff8002a166>] get_request_wait+0x21/0x11d
 [<ffffffff801528f4>] blk_get_request+0x27/0x68
 [<ffffffff8807aca6>] :scsi_mod:scsi_execute+0x3a/0xeb
 [<ffffffff8807adfc>] :scsi_mod:scsi_execute_req+0xa5/0xce
 [<ffffffff8807c131>] :scsi_mod:scsi_probe_and_add_lun+0x207/0x9c9
 [<ffffffff8807cc9a>] :scsi_mod:scsi_alloc_target+0x274/0x338
 [<ffffffff8807cea6>] :scsi_mod:__scsi_scan_target+0xc3/0x5cd
 [<ffffffff800aa04c>] trace_hardirqs_on+0x11b/0x13f
 [<ffffffff8807d3f5>] :scsi_mod:scsi_scan_channel+0x45/0x70
 [<ffffffff8807d4e5>] :scsi_mod:scsi_scan_host_selected+0xc5/0xff
 [<ffffffff800a5461>] keventd_create_kthread+0x0/0xc9
 [<ffffffff88100853>] :usb_storage:usb_stor_scan_thread+0x145/0x16d
 [<ffffffff800a5678>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8810070e>] :usb_storage:usb_stor_scan_thread+0x0/0x16d
 [<ffffffff8003487f>] kthread+0xfe/0x132
 [<ffffffff80067fea>] trace_hardirqs_on_thunk+0x35/0x37
 [<ffffffff80061079>] child_rip+0xa/0x11
 [<ffffffff800688ed>] _spin_unlock_irq+0x24/0x27
 [<ffffffff800606a8>] restore_args+0x0/0x30
 [<ffffffff8004fdcc>] run_workqueue+0x16/0xf4
 [<ffffffff80034781>] kthread+0x0/0x132
 [<ffffffff8006106f>] child_rip+0x0/0x11

Comment 2 Andrius Benokraitis 2010-06-02 12:54:16 UTC

Including Stratus to this bugzilla - they are hitting this via a difficult reproduction methodology, but still hitting it regardless. 

From Stratus:

In two cases Stratus reproduced this with surprise disk removals, in another case the removal of a IO assembly.

It looks like patch "linux-2.6-block-cfq-iosched-get-rid-of-cfqq-hash.patch" is what introduced this bug in RHEL 5.5. There are a bunch of other patches in the same area as this one, so the fix probably isn't as simple as just backing out this patch.

Comment 3 Jeff Moyer 2010-06-02 14:51:57 UTC

The following commit should take care of this, though there is a kabi issue to be overcome.

commit fc46379daf90dce57bf765c81d3b39f55150aac2
Author: Jens Axboe <axboe>
Date:   Tue Aug 29 09:05:44 2006 +0200

    [PATCH] cfq-iosched: kill cfq_exit_lock
    
    cfq_exit_lock is protecting two things now:
    
    - The per-ioc rbtree of cfq_io_contexts
    
    - The per-cfqd linked list of cfq_io_contexts
    
    The per-cfqd linked list can be protected by the queue lock, as it is (by
    definition) per cfqd as the queue lock is.
    
    The per-ioc rbtree is mainly used and updated by the process itself only.
    The only outside use is the io priority changing. If we move the
    priority changing to not browsing the rbtree, we can remove any locking
    from the rbtree updates and lookup completely. Let the sys_ioprio syscall
    just mark processes as having the iopriority changed and lazily update
    the private cfq io contexts the next time io is queued, and we can
    remove this locking as well.

Comment 4 Jeff Moyer 2010-06-02 15:04:41 UTC

We'll also need this commit:

commit c1b707d253fe918b92882cff1dbd926b47e14fd2
Author: Jens Axboe <jens.axboe>
Date:   Mon Oct 30 19:54:23 2006 +0100

    [PATCH] CFQ: bad locking in changed_ioprio()
    
    When the ioprio code recently got juggled a bit, a bug was introduced.
    changed_ioprio() is no longer called with interrupts disabled, so using
    plain spin_lock() on the queue_lock is a bug.

Comment 5 Jeff Moyer 2010-06-03 13:49:14 UTC

Created attachment 419377 [details]
Get rid of the cfq_exit_lock completely.

Comment 6 Jeff Moyer 2010-06-03 13:49:51 UTC

Created attachment 419378 [details]
Fix the poor irq assumptions in the last patch

Comment 7 Jeff Moyer 2010-06-03 14:00:00 UTC

A test kernel with the above two patches can be found at:
  http://people.redhat.com/jmoyer/bz582435/

If you need a kernel for an architecture other than x86_64, please let me know soon.  I tested this internally on the system that produced the problem, and the log messages are gone.

Andrius, please provide test results from Stratus when you have them.

Thanks!

Comment 8 Andrius Benokraitis 2010-06-03 15:45:28 UTC

Chas, can you please test this test kernel?

Comment 9 Andrius Benokraitis 2010-06-03 15:48:30 UTC

Jeff - BTW, Stratus only offers x86_64 machines, so I don't think they know or care about anything else...

Comment 10 Chas Horvath 2010-06-03 15:53:30 UTC

Stratus will test this and get back to you shortly (probably Monday to give us a weekend long run).  Andrius is correct, Stratus only cares about x86_64.  Thanks for the help and quick response!

Comment 12 nate.dailey 2010-06-04 18:00:53 UTC

Would it be possible to post the devel RPM that goes along with this test kernel? We need it to build our drivers against the kernel. (If it's not possible, I'll just test without our drivers).

Comment 13 Jeff Moyer 2010-06-04 18:44:36 UTC

(In reply to comment #12)
> Would it be possible to post the devel RPM that goes along with this test
> kernel? We need it to build our drivers against the kernel. (If it's not
> possible, I'll just test without our drivers).    

Done.

Comment 14 Chas Horvath 2010-06-07 14:33:29 UTC

This fix tested successfully at Stratus on three systems running over the weekend running various storage configurations with stress and Stratus' automated surprise removal and surprise insertion tests.

Comment 15 Andrius Benokraitis 2010-06-08 18:04:43 UTC

This is a regression from 5.4 - Stratus is requesting this be included in 5.5.z as well.

Comment 18 RHEL Program Management 2010-06-14 18:09:16 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 19 Andrius Benokraitis 2010-06-14 18:37:18 UTC

Requesting for 5.5.z since this is a regression.

Comment 23 Jarod Wilson 2010-07-12 15:45:57 UTC

in kernel-2.6.18-206.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 25 Jeff Moyer 2010-07-26 20:33:16 UTC

*** Bug 618211 has been marked as a duplicate of this bug. ***

Comment 28 errata-xmlrpc 2011-01-13 21:27:35 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html