Description of problem:
I rebooted a four-node ("smoke") cluster and started the cluster software.
When I started the clvmd service, I noticed a strange dmesg relating to a
DLM bad unlock balance on the only 64-bit node in the cluster ("kool").

Version-Release number of selected component (if applicable):
FC6t1

How reproducible:
Unknown (happened once)

Steps to Reproduce:
1. Reboot all four machines cleanly.
2. On all nodes, do service cman start.
3. Use group_tool -v to ensure all nodes have a good status in the cluster.
4. On all nodes, do service clvmd start.

Actual results:
No error messages were received on the terminal, but the following dmesgs
appeared on the console:

dlm: clvmd: recover 1
dlm: clvmd: add member 5
dlm: clvmd: total members 1
dlm: clvmd: dlm_recover_directory
dlm: clvmd: dlm_recover_directory 0 entries

=====================================
[ BUG: bad unlock balance detected! ]
-------------------------------------
dlm_recoverd/2632 is trying to release lock (&ls->ls_in_recovery) at:
[<ffffffff88481ef6>] dlm_recoverd+0x260/0x373 [dlm]
but there are no more locks to release!

other info that might help us debug this:
2 locks held by dlm_recoverd/2632:
 #0: (&ls->ls_recoverd_active){--..}, at: [<ffffffff8026632f>] mutex_lock+0x2a/0x2e
 #1: (&ls->ls_recover_lock){--..}, at: [<ffffffff88481ed7>] dlm_recoverd+0x241/0x373 [dlm]

stack backtrace:

Call Trace:
 [<ffffffff8026e7fd>] show_trace+0xae/0x30e
 [<ffffffff8026ea72>] dump_stack+0x15/0x17
 [<ffffffff802a6c51>] print_unlock_inbalance_bug+0x108/0x118
 [<ffffffff802a8936>] lock_release_non_nested+0x8c/0x13a
 [<ffffffff802a8b0b>] lock_release+0x127/0x14a
 [<ffffffff802a59e3>] up_write+0x1e/0x2a
 [<ffffffff88481ef6>] :dlm:dlm_recoverd+0x260/0x373
 [<ffffffff80235440>] kthread+0x100/0x136
 [<ffffffff802613de>] child_rip+0x8/0x12
DWARF2 unwinder stuck at child_rip+0x8/0x12

Leftover inexact backtrace:
 [<ffffffff80267ab2>] _spin_unlock_irq+0x2b/0x31
 [<ffffffff80260a1b>] restore_args+0x0/0x30
 [<ffffffff8024f916>] run_workqueue+0x19/0xfb
 [<ffffffff80235340>] kthread+0x0/0x136
 [<ffffffff802613d6>] child_rip+0x0/0x12

dlm: clvmd: recover 1 done: 124 ms
dlm: clvmd: recover 3
dlm: clvmd: add member 3
dlm: Initiating association with node 3
dlm: got new/restarted association 1 nodeid 3
dlm: clvmd: ignoring recovery message 5 from 3
dlm: clvmd: dlm_wait_function aborted
dlm: clvmd: total members 2
dlm: clvmd: recover_members failed -4
dlm: clvmd: recover 3 error -4
dlm: clvmd: recover 5
dlm: clvmd: add member 4
dlm: Initiating association with node 4
dlm: clvmd: total members 3
dlm: clvmd: recover_members failed -4
dlm: clvmd: recover 5 error -4
dlm: clvmd: recover 7
dlm: clvmd: add member 2
dlm: Initiating association with node 2
dlm: clvmd: total members 4
dlm: clvmd: dlm_recover_directory
dlm: clvmd: dlm_recover_directory 0 entries
dlm: clvmd: recover 7 done: 60 ms

Expected results:
These dmesgs should not appear.
I get this all the time; I don't think it's a bug or a problem. AFAICT, it's complaining because the lock is being taken and released by different threads, which is intentional. I don't know enough about the "unlock balance" checking to be certain that that's what it's worried about, or to know if there's a way to suppress the warning.
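For anyone unfamiliar with the pattern, here's a minimal sketch of what trips the check (the function names here are made up; the rwsem corresponds to the &ls->ls_in_recovery named in the report):

    #include <linux/rwsem.h>

    static DECLARE_RWSEM(ls_in_recovery);

    /* One thread blocks lock activity when recovery is triggered... */
    static void stop_lockspace(void)
    {
            down_write(&ls_in_recovery);
    }

    /* ...and the dlm_recoverd kthread releases the rwsem when recovery
     * completes.  With lock debugging enabled, lockdep records which
     * task acquired each lock, so an up_write() from a task that never
     * did the matching down_write() trips the "bad unlock balance"
     * check, even though the locking is logically sound. */
    static void finish_recovery(void)
    {
            up_write(&ls_in_recovery);
    }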
If it's not a bug, then the dmesgs should not appear (they will just alarm users), or else they should be toned down so they don't say "BUG:" and don't print a call trace.
I'm seeing these as well. Bob is right: if it's not a bug, then it shouldn't say "BUG" and shouldn't print a backtrace.
Are we going to do anything about this issue? I'm still seeing this blatant BUG message followed by a backtrace every time I start clvmd. Customers aren't going to like seeing this every time.
It should only happen if the kernel is built with lock debugging compiled in. Are we really shipping debug kernels?
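For reference, the check comes from lockdep; assuming a standard Fedora /boot layout, you can see whether a running kernel has it built in with:

    grep -E 'DEBUG_LOCK_ALLOC|PROVE_LOCKING' /boot/config-$(uname -r)

A kernel built without CONFIG_DEBUG_LOCK_ALLOC shouldn't print the "bad unlock balance" report at all.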
I emailed Ingo about this last week, asking if we could add down_write_non_owner() / up_write_non_owner() (to parallel the existing read_non_owner variants). He wanted to know why we were doing the down/up from different threads, which I explained. Haven't heard back yet.
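To make the proposal concrete, a sketch of what those might look like (hypothetical, not in any tree; modeled on down_read_non_owner()/up_read_non_owner() in kernel/rwsem.c, which skip the lockdep ownership annotations so there is no recorded owner to mismatch):

    /* Hypothetical write-side parallels to the existing non_owner
     * read variants. */
    void down_write_non_owner(struct rw_semaphore *sem)
    {
            might_sleep();
            __down_write(sem);      /* no rwsem_acquire() annotation */
    }

    void up_write_non_owner(struct rw_semaphore *sem)
    {
            __up_write(sem);        /* no rwsem_release() annotation */
    }

Since neither side tells lockdep anything, a down in one thread followed by an up in another wouldn't trigger the balance check.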
This message no longer appears with recent RHEL5 beta kernels. Suggest we close it as "CurrentRelease".
*** Bug 242045 has been marked as a duplicate of this bug. ***
Moving all RHCS v5 bugs to RHEL 5 so we can remove the RHCS v5 product, which never existed.