Description of problem:
I rebooted a four-node ("smoke") cluster and started the cluster software.
When I started the clvmd service, I noticed a strange dmesg relating to a
DLM bad unlock balance on the only 64-bit node in the cluster ("kool").

Version-Release number of selected component (if applicable):
FC6t1

How reproducible:
Unknown (happened once)

Steps to Reproduce:
1. Reboot all four machines cleanly.
2. On all nodes, do service cman start.
3. Use group_tool -v to ensure all nodes have a good status in the cluster.
4. On all nodes, do service clvmd start.

Actual results:
No error messages were received on the terminal, but the following dmesgs
appeared on the console:

dlm: clvmd: recover 1
dlm: clvmd: add member 5
dlm: clvmd: total members 1
dlm: clvmd: dlm_recover_directory
dlm: clvmd: dlm_recover_directory 0 entries

=====================================
[ BUG: bad unlock balance detected! ]
-------------------------------------
dlm_recoverd/2632 is trying to release lock (&ls->ls_in_recovery) at:
[<ffffffff88481ef6>] dlm_recoverd+0x260/0x373 [dlm]
but there are no more locks to release!

other info that might help us debug this:
2 locks held by dlm_recoverd/2632:
 #0: (&ls->ls_recoverd_active){--..}, at: [<ffffffff8026632f>] mutex_lock+0x2a/0x2e
 #1: (&ls->ls_recover_lock){--..}, at: [<ffffffff88481ed7>] dlm_recoverd+0x241/0x373 [dlm]

stack backtrace:

Call Trace:
 [<ffffffff8026e7fd>] show_trace+0xae/0x30e
 [<ffffffff8026ea72>] dump_stack+0x15/0x17
 [<ffffffff802a6c51>] print_unlock_inbalance_bug+0x108/0x118
 [<ffffffff802a8936>] lock_release_non_nested+0x8c/0x13a
 [<ffffffff802a8b0b>] lock_release+0x127/0x14a
 [<ffffffff802a59e3>] up_write+0x1e/0x2a
 [<ffffffff88481ef6>] :dlm:dlm_recoverd+0x260/0x373
 [<ffffffff80235440>] kthread+0x100/0x136
 [<ffffffff802613de>] child_rip+0x8/0x12
DWARF2 unwinder stuck at child_rip+0x8/0x12

Leftover inexact backtrace:
 [<ffffffff80267ab2>] _spin_unlock_irq+0x2b/0x31
 [<ffffffff80260a1b>] restore_args+0x0/0x30
 [<ffffffff8024f916>] run_workqueue+0x19/0xfb
 [<ffffffff80235340>] kthread+0x0/0x136
 [<ffffffff802613d6>] child_rip+0x0/0x12

dlm: clvmd: recover 1 done: 124 ms
dlm: clvmd: recover 3
dlm: clvmd: add member 3
dlm: Initiating association with node 3
dlm: got new/restarted association 1 nodeid 3
dlm: clvmd: ignoring recovery message 5 from 3
dlm: clvmd: dlm_wait_function aborted
dlm: clvmd: total members 2
dlm: clvmd: recover_members failed -4
dlm: clvmd: recover 3 error -4
dlm: clvmd: recover 5
dlm: clvmd: add member 4
dlm: Initiating association with node 4
dlm: clvmd: total members 3
dlm: clvmd: recover_members failed -4
dlm: clvmd: recover 5 error -4
dlm: clvmd: recover 7
dlm: clvmd: add member 2
dlm: Initiating association with node 2
dlm: clvmd: total members 4
dlm: clvmd: dlm_recover_directory
dlm: clvmd: dlm_recover_directory 0 entries
dlm: clvmd: recover 7 done: 60 ms

Expected results:
These dmesgs should not appear.
I get this all the time; I don't think it's a bug or a problem. AFAICT, it's complaining because the lock is being taken and released by different threads, which is intentional. I don't know enough about the "unlock balance" checking to be certain that that's what it's worried about, or to know if there's a way to suppress the warning.
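For anyone unfamiliar with the pattern, here's a minimal sketch of what trips the check (the function names here are made up; the rwsem corresponds to the &ls->ls_in_recovery named in the report):

    #include <linux/rwsem.h>

    static DECLARE_RWSEM(ls_in_recovery);

    /* One thread blocks lock activity when recovery is triggered... */
    static void stop_lockspace(void)
    {
            down_write(&ls_in_recovery);
    }

    /* ...and the dlm_recoverd kthread releases the rwsem when recovery
     * completes.  With lock debugging enabled, lockdep records which
     * task acquired each lock, so an up_write() from a task that never
     * did the matching down_write() trips the "bad unlock balance"
     * check, even though the locking is logically sound. */
    static void finish_recovery(void)
    {
            up_write(&ls_in_recovery);
    }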
If it's not a bug, then the dmesgs should not appear (they will just alarm users), or else they should be toned down so they don't say "BUG:" and don't print a call trace.
I'm seeing these as well. Bob is right: if it's not a bug, then it shouldn't say "BUG" and shouldn't print a backtrace.
Are we going to do anything about this issue? I'm still seeing this blatant BUG message followed by a backtrace every time I start clvmd. Customers aren't going to like seeing this every time.
It should only happen if the kernel is built with lock debugging compiled in. Are we really shipping debug kernels?
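For reference, the check comes from lockdep; assuming a standard Fedora /boot layout, you can see whether a running kernel has it built in with:

    grep -E 'DEBUG_LOCK_ALLOC|PROVE_LOCKING' /boot/config-$(uname -r)

A kernel built without CONFIG_DEBUG_LOCK_ALLOC shouldn't print the "bad unlock balance" report at all.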
I emailed Ingo about this last week, asking if we could add down_write_non_owner() / up_write_non_owner() (to parallel the existing read_non_owner variants). He wanted to know why we were doing the down/up from different threads, which I explained. Haven't heard back yet.
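To make the proposal concrete, a sketch of what those might look like (hypothetical, not in any tree; modeled on down_read_non_owner()/up_read_non_owner() in kernel/rwsem.c, which skip the lockdep ownership annotations so there is no recorded owner to mismatch):

    /* Hypothetical write-side parallels to the existing non_owner
     * read variants. */
    void down_write_non_owner(struct rw_semaphore *sem)
    {
            might_sleep();
            __down_write(sem);      /* no rwsem_acquire() annotation */
    }

    void up_write_non_owner(struct rw_semaphore *sem)
    {
            __up_write(sem);        /* no rwsem_release() annotation */
    }

Since neither side tells lockdep anything, a down in one thread followed by an up in another wouldn't trigger the balance check.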
This message no longer appears with recent RHEL5 beta kernels. Suggest we close it as "CurrentRelease".
*** Bug 242045 has been marked as a duplicate of this bug. ***
Moving all RHCS v5 bugs to RHEL 5 so we can remove the RHCS v5 product, which never existed.