Bug 238571 - HP5750: cpu_chain & cache_chain_mutex deadlock
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: x86_64 Linux
Priority: medium / Severity: high
Assigned To: Peter Zijlstra
QA Contact: Martin Jenner
Keywords: Regression
Reported: 2007-05-01 11:17 EDT by Prarit Bhargava
Modified: 2014-08-11 01:40 EDT
CC: 6 users

Fixed In Version: RHBA-2007-0959
Doc Type: Bug Fix
Last Closed: 2007-11-07 14:48:32 EST

Attachments: None
Comment 1 Prarit Bhargava 2007-05-02 09:42:10 EDT
Further info from 2.6.18-16.el5:

CPU 1: synchronized TSC with CPU 0 (last diff -1 cycles, maxerr 582 cycles)

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.18-16.el5 #1
-------------------------------------------------------
swapper/1 is trying to acquire lock:
 ((cpu_chain).rwsem){..--}, at: [<ffffffff8009ae12>]
blocking_notifier_call_chain+0x13/0x36

but task is already holding lock:
 (cache_chain_mutex){--..}, at: [<ffffffff800d85a7>] cpuup_callback+0x3f/0x408

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (cache_chain_mutex){--..}:
       [<ffffffff800a596a>] __lock_acquire+0x8a8/0x9d6
       [<ffffffff800a6000>] lock_acquire+0x44/0x5d
       [<ffffffff800d85a7>] cpuup_callback+0x3f/0x408
       [<ffffffff800d85a7>] cpuup_callback+0x3f/0x408
       [<ffffffff80064689>] __mutex_lock_slowpath+0xd9/0x247
       [<ffffffff800d85a7>] cpuup_callback+0x3f/0x408
       [<ffffffff800a3389>] lockdep_init_map+0x99/0x2d8
       [<ffffffff800a27c3>] hrtimer_cpu_notify+0x6a/0x152
       [<ffffffff80068586>] notifier_call_chain+0x20/0x32
       [<ffffffff8009ae21>] blocking_notifier_call_chain+0x22/0x36
       [<ffffffff800a8e6e>] _cpu_up+0x3e/0xd0
       [<ffffffff800a8f29>] cpu_up+0x29/0x3d
       [<ffffffff8042c92a>] init+0xb5/0x2fe
       [<ffffffff800660d2>] _spin_unlock_irq+0x24/0x27
       [<ffffffff800659a1>] trace_hardirqs_on_thunk+0x35/0x37
       [<ffffffff8005e9b5>] child_rip+0xa/0x11
       [<ffffffff800660d2>] _spin_unlock_irq+0x24/0x27
       [<ffffffff8005dfe4>] restore_args+0x0/0x30
       [<ffffffff8042c875>] init+0x0/0x2fe
       [<ffffffff8005e9ab>] child_rip+0x0/0x11
       [<ffffffffffffffff>] 0xffffffffffffffff

-> #0 ((cpu_chain).rwsem){..--}:
       [<ffffffff800a587e>] __lock_acquire+0x7bc/0x9d6
       [<ffffffff800a6000>] lock_acquire+0x44/0x5d
       [<ffffffff8009ae12>] blocking_notifier_call_chain+0x13/0x36
       [<ffffffff8009ae12>] blocking_notifier_call_chain+0x13/0x36
       [<ffffffff800a2a8d>] down_read+0x37/0x40
       [<ffffffff8009ae12>] blocking_notifier_call_chain+0x13/0x36
       [<ffffffff800a8ef0>] _cpu_up+0xc0/0xd0
       [<ffffffff800a8f29>] cpu_up+0x29/0x3d
       [<ffffffff8042c92a>] init+0xb5/0x2fe
       [<ffffffff800660d2>] _spin_unlock_irq+0x24/0x27
       [<ffffffff800659a1>] trace_hardirqs_on_thunk+0x35/0x37
       [<ffffffff8005e9b5>] child_rip+0xa/0x11
       [<ffffffff800660d2>] _spin_unlock_irq+0x24/0x27
       [<ffffffff8005dfe4>] restore_args+0x0/0x30
       [<ffffffff8042c875>] init+0x0/0x2fe
       [<ffffffff8005e9ab>] child_rip+0x0/0x11
       [<ffffffffffffffff>] 0xffffffffffffffff

other info that might help us debug this:
2 locks held by swapper/1:
 #0:  (cpu_add_remove_lock){--..}, at: [<ffffffff800a8f19>] cpu_up+0x19/0x3d
 #1:  (cache_chain_mutex){--..}, at: [<ffffffff800d85a7>] cpuup_callback+0x3f/0x408

stack backtrace:

Call Trace:
 [<ffffffff800a4359>] print_circular_bug_tail+0x65/0x6e
 [<ffffffff800a587e>] __lock_acquire+0x7bc/0x9d6
 [<ffffffff800a6000>] lock_acquire+0x44/0x5d
 [<ffffffff8009ae12>] blocking_notifier_call_chain+0x13/0x36
 [<ffffffff8009ae12>] blocking_notifier_call_chain+0x13/0x36
 [<ffffffff800a2a8d>] down_read+0x37/0x40
 [<ffffffff8009ae12>] blocking_notifier_call_chain+0x13/0x36
 [<ffffffff800a8ef0>] _cpu_up+0xc0/0xd0
 [<ffffffff800a8f29>] cpu_up+0x29/0x3d
 [<ffffffff8042c92a>] init+0xb5/0x2fe
 [<ffffffff800660d2>] _spin_unlock_irq+0x24/0x27
 [<ffffffff800659a1>] trace_hardirqs_on_thunk+0x35/0x37
 [<ffffffff8005e9b5>] child_rip+0xa/0x11
 [<ffffffff800660d2>] _spin_unlock_irq+0x24/0x27
 [<ffffffff8005dfe4>] restore_args+0x0/0x30
 [<ffffffff8042c875>] init+0x0/0x2fe
 [<ffffffff8005e9ab>] child_rip+0x0/0x11
Comment 2 Prarit Bhargava 2007-05-02 09:57:47 EDT
Hmmm ... this is odd.  I seem to be hitting two separate issues.  The first (in
comment #1) is when I boot a self-built kernel.  The second (comment #2) is
when I boot the "stock" 2.6.18-16.el5 kernel ...

P.
Comment 3 Prarit Bhargava 2007-05-02 11:09:47 EDT
I checked out a fresh 2.6.18-16.el5 tree and rebuilt.  I no longer hit the issue
in comment #1.  Comment #2's lock issue is still valid.
Comment 4 Prarit Bhargava 2007-05-02 11:22:20 EDT
Ignore comment #1 ... making it private.  Actual issue is in comment #2.
Comment 5 Prarit Bhargava 2007-05-02 11:27:22 EDT
Tracked issue down to

linux-2.6-cpu-hotplug-make-and-module-insertion-cause-panic.patch

Broken patch.  

Adding Regression flag, increasing severity, and adding konradr, dzickus, &
bnagendr to cc line.
Comment 6 RHEL Product and Program Management 2007-05-02 11:28:12 EDT
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.
Comment 7 Konrad Rzeszutek 2007-05-02 11:47:52 EDT
Prarit,

How is the upstream kernel working on that box?
Comment 8 Prarit Bhargava 2007-05-02 13:36:06 EDT
Konrad, 

Upstream (2.6.21) boots fine.  But, IIRC, there were large locking fixes that
went in during 2.6.20 ...

Adding pzijlstr -- Peter, dzickus mentioned you were looking for the patch that
was causing this warning.

P.
Comment 9 Peter Zijlstra 2007-05-03 03:13:27 EDT
Yeah, I have a series of patches for a bunch of interrelated BZs here:
 http://programming.kicks-ass.net/sekrit/rhel5/

0001-convert-cpu-hotplug-notifiers-to-use-raw_notifier-instead-of-blocking_notifier.patch

fixes this problem. I'm going to attempt to brew build a rhel5 kernel with all
these patches and test on the various machines that had trouble.
Comment 10 Prarit Bhargava 2007-05-03 06:27:18 EDT
(In reply to comment #9)
> Yeah, I have a series of patches for a bunch of interrelated BZs here:
>  http://programming.kicks-ass.net/sekrit/rhel5/
> 

Peter, could you ping me with a link to the brew build?  I'll test on the 5750 ...

P.
Comment 12 Don Zickus 2007-08-21 14:36:21 EDT
in 2.6.18-42.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 15 errata-xmlrpc 2007-11-07 14:48:32 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html
