Description of problem: Nate hit this during testing. It appears to be a deadlock between dlm_controld (attempting to remove a node), and dlm_send (attempting to connect to the node). dlm_send is trying to do its work. dlm_controld is waiting for all work to be done. dlm_send is waiting on a mutex in get_comm, it must be mutex_lock(&clusters_root.subsys.su_mutex); dlm_controld is holding that mutex?, I don't see how/where. May 23 14:10:36 smoke-02 fence_node[654]: fence smoke-01 success May 23 14:26:25 smoke-02 corosync[29818]: [TOTEM ] A processor failed, forming new configuration. May 23 14:26:27 smoke-02 corosync[29818]: [QUORUM] Members[1]: 2 May 23 14:26:27 smoke-02 corosync[29818]: [TOTEM ] A processor joined or left the membership and a new membership was formed. May 23 14:26:27 smoke-02 corosync[29818]: [CPG ] chosen downlist: sender r(0) ip(10.15.89.105) ; members(old:2 left:1) May 23 14:26:27 smoke-02 corosync[29818]: [MAIN ] Completed service synchronization, ready to provide service. May 23 14:26:27 smoke-02 rgmanager[31805]: State change: smoke-01 DOWN May 23 14:26:27 smoke-02 kernel: dlm: closing connection to node 1 May 23 14:29:00 smoke-02 kernel: INFO: task dlm_controld:29889 blocked for more than 120 seconds. May 23 14:29:00 smoke-02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. May 23 14:29:00 smoke-02 kernel: dlm_controld D 0000000000000000 0 29889 1 0x00000084 May 23 14:29:00 smoke-02 kernel: ffff88013da83bb8 0000000000000082 ffff88013da83b38 ffffffff8127d4ba May 23 14:29:00 smoke-02 kernel: ffffffff81ff6f00 00000000000026ee 0000000000000020 0000000000000020 May 23 14:29:00 smoke-02 kernel: ffff88013d995058 ffff88013da83fd8 000000000000fb88 ffff88013d995058 May 23 14:29:00 smoke-02 kernel: Call Trace: May 23 14:29:00 smoke-02 kernel: [<ffffffff8127d4ba>] ? delay_tsc+0x4a/0x80 May 23 14:29:00 smoke-02 kernel: [<ffffffff8127d4ba>] ? delay_tsc+0x4a/0x80 May 23 14:29:00 smoke-02 kernel: [<ffffffff814fe6b5>] schedule_timeout+0x215/0x2e0 May 23 14:29:00 smoke-02 kernel: [<ffffffff814fe333>] wait_for_common+0x123/0x180 May 23 14:29:00 smoke-02 kernel: [<ffffffff81060250>] ? default_wake_function+0x0/0x20 May 23 14:29:00 smoke-02 kernel: [<ffffffff814fe44d>] wait_for_completion+0x1d/0x20 May 23 14:29:00 smoke-02 kernel: [<ffffffff8108d093>] __cancel_work_timer+0x1b3/0x1e0 May 23 14:29:00 smoke-02 kernel: [<ffffffff8108cbe0>] ? wq_barrier_func+0x0/0x20 May 23 14:29:00 smoke-02 kernel: [<ffffffff8108d0f0>] cancel_work_sync+0x10/0x20 May 23 14:29:00 smoke-02 kernel: [<ffffffffa03236bc>] dlm_lowcomms_close+0x4c/0xb0 [dlm] May 23 14:29:00 smoke-02 kernel: [<ffffffffa03143c2>] drop_comm+0x22/0x60 [dlm] May 23 14:29:00 smoke-02 kernel: [<ffffffffa0305796>] client_drop_item+0x26/0x50 [configfs] May 23 14:29:00 smoke-02 kernel: [<ffffffffa03063a0>] configfs_rmdir+0x180/0x230 [configfs] May 23 14:29:00 smoke-02 kernel: [<ffffffff81188d0d>] vfs_rmdir+0xbd/0xf0 May 23 14:29:00 smoke-02 kernel: [<ffffffff8118795a>] ? lookup_hash+0x3a/0x50 May 23 14:29:00 smoke-02 kernel: [<ffffffff8118b2a3>] do_rmdir+0x103/0x120 May 23 14:29:00 smoke-02 kernel: [<ffffffff8117b92f>] ? vfs_read+0x12f/0x1a0 May 23 14:29:00 smoke-02 kernel: [<ffffffff810d69e2>] ? audit_syscall_entry+0x272/0x2a0 May 23 14:29:00 smoke-02 kernel: [<ffffffff8118b316>] sys_rmdir+0x16/0x20 May 23 14:29:00 smoke-02 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b May 23 14:29:00 smoke-02 kernel: INFO: task dlm_send:31816 blocked for more than 120 seconds. May 23 14:29:00 smoke-02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. May 23 14:29:00 smoke-02 kernel: dlm_send D 0000000000000001 0 31816 2 0x00000080 May 23 14:29:00 smoke-02 kernel: ffff88013d997b90 0000000000000046 ffff880037f47580 ffff88007fc00040 May 23 14:29:00 smoke-02 kernel: ffff88013d997b30 0000000000000000 ffff88007be6fba0 ffffffff8100bdee May 23 14:29:00 smoke-02 kernel: ffff88013cdf7ab8 ffff88013d997fd8 000000000000fb88 ffff88013cdf7ab8 May 23 14:29:00 smoke-02 kernel: Call Trace: May 23 14:29:00 smoke-02 kernel: [<ffffffff8100bdee>] ? reschedule_interrupt+0xe/0x20 May 23 14:29:00 smoke-02 kernel: [<ffffffff814fee9e>] __mutex_lock_slowpath+0x13e/0x180 May 23 14:29:00 smoke-02 kernel: [<ffffffff812275f2>] ? selinux_netlbl_socket_post_create+0x52/0x70 May 23 14:29:00 smoke-02 kernel: [<ffffffff814fed3b>] mutex_lock+0x2b/0x50 May 23 14:29:00 smoke-02 kernel: [<ffffffffa0313b34>] get_comm+0x34/0x140 [dlm] May 23 14:29:01 smoke-02 kernel: [<ffffffffa0313d58>] dlm_nodeid_to_addr+0x18/0xd0 [dlm] May 23 14:29:01 smoke-02 kernel: [<ffffffffa0321a64>] tcp_connect_to_sock+0xf4/0x2d0 [dlm] May 23 14:29:01 smoke-02 kernel: [<ffffffff8105b483>] ? perf_event_task_sched_out+0x33/0x80 May 23 14:29:01 smoke-02 kernel: [<ffffffff810533f3>] ? __wake_up+0x53/0x70 May 23 14:29:01 smoke-02 kernel: [<ffffffffa0324102>] process_send_sockets+0x1d2/0x260 [dlm] May 23 14:29:01 smoke-02 kernel: [<ffffffffa0323f30>] ? process_send_sockets+0x0/0x260 [dlm] May 23 14:29:01 smoke-02 kernel: [<ffffffff8108c760>] worker_thread+0x170/0x2a0 May 23 14:29:01 smoke-02 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40 May 23 14:29:01 smoke-02 kernel: [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0 May 23 14:29:01 smoke-02 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0 May 23 14:29:01 smoke-02 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20 May 23 14:29:01 smoke-02 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0 May 23 14:29:01 smoke-02 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20 Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info:
Damn bugzilla line wrapping. > May 23 14:26:27 smoke-02 kernel: dlm: closing connection to node 1 > May 23 14:29:00 smoke-02 kernel: INFO: task dlm_controld:29889 blocked for more than 120 seconds. > May 23 14:29:00 smoke-02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > May 23 14:29:00 smoke-02 kernel: dlm_controld D 0000000000000000 0 29889 1 0x00000084 > May 23 14:29:00 smoke-02 kernel: ffff88013da83bb8 0000000000000082 ffff88013da83b38 ffffffff8127d4ba > May 23 14:29:00 smoke-02 kernel: ffffffff81ff6f00 00000000000026ee 0000000000000020 0000000000000020 > May 23 14:29:00 smoke-02 kernel: ffff88013d995058 ffff88013da83fd8 000000000000fb88 ffff88013d995058 > May 23 14:29:00 smoke-02 kernel: Call Trace: > May 23 14:29:00 smoke-02 kernel: [<ffffffff8127d4ba>] ? delay_tsc+0x4a/0x80 > May 23 14:29:00 smoke-02 kernel: [<ffffffff8127d4ba>] ? delay_tsc+0x4a/0x80 > May 23 14:29:00 smoke-02 kernel: [<ffffffff814fe6b5>] schedule_timeout+0x215/0x2e0 > May 23 14:29:00 smoke-02 kernel: [<ffffffff814fe333>] wait_for_common+0x123/0x180 > May 23 14:29:00 smoke-02 kernel: [<ffffffff81060250>] ? default_wake_function+0x0/0x20 > May 23 14:29:00 smoke-02 kernel: [<ffffffff814fe44d>] wait_for_completion+0x1d/0x20 > May 23 14:29:00 smoke-02 kernel: [<ffffffff8108d093>] __cancel_work_timer+0x1b3/0x1e0 > May 23 14:29:00 smoke-02 kernel: [<ffffffff8108cbe0>] ? wq_barrier_func+0x0/0x20 > May 23 14:29:00 smoke-02 kernel: [<ffffffff8108d0f0>] cancel_work_sync+0x10/0x20 > May 23 14:29:00 smoke-02 kernel: [<ffffffffa03236bc>] dlm_lowcomms_close+0x4c/0xb0 [dlm] > May 23 14:29:00 smoke-02 kernel: [<ffffffffa03143c2>] drop_comm+0x22/0x60 [dlm] > May 23 14:29:00 smoke-02 kernel: [<ffffffffa0305796>] client_drop_item+0x26/0x50 [configfs] > May 23 14:29:00 smoke-02 kernel: [<ffffffffa03063a0>] configfs_rmdir+0x180/0x230 [configfs] > May 23 14:29:00 smoke-02 kernel: [<ffffffff81188d0d>] vfs_rmdir+0xbd/0xf0 > May 23 14:29:00 smoke-02 kernel: [<ffffffff8118795a>] ? lookup_hash+0x3a/0x50 > May 23 14:29:00 smoke-02 kernel: [<ffffffff8118b2a3>] do_rmdir+0x103/0x120 > May 23 14:29:00 smoke-02 kernel: [<ffffffff8117b92f>] ? vfs_read+0x12f/0x1a0 > May 23 14:29:00 smoke-02 kernel: [<ffffffff810d69e2>] ? audit_syscall_entry+0x272/0x2a0 > May 23 14:29:00 smoke-02 kernel: [<ffffffff8118b316>] sys_rmdir+0x16/0x20 > May 23 14:29:00 smoke-02 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b > May 23 14:29:00 smoke-02 kernel: INFO: task dlm_send:31816 blocked for more than 120 seconds. > May 23 14:29:00 smoke-02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > May 23 14:29:00 smoke-02 kernel: dlm_send D 0000000000000001 0 31816 2 0x00000080 > May 23 14:29:00 smoke-02 kernel: ffff88013d997b90 0000000000000046 ffff880037f47580 ffff88007fc00040 > May 23 14:29:00 smoke-02 kernel: ffff88013d997b30 0000000000000000 ffff88007be6fba0 ffffffff8100bdee > May 23 14:29:00 smoke-02 kernel: ffff88013cdf7ab8 ffff88013d997fd8 000000000000fb88 ffff88013cdf7ab8 > May 23 14:29:00 smoke-02 kernel: Call Trace: > May 23 14:29:00 smoke-02 kernel: [<ffffffff8100bdee>] ? reschedule_interrupt+0xe/0x20 > May 23 14:29:00 smoke-02 kernel: [<ffffffff814fee9e>] __mutex_lock_slowpath+0x13e/0x180 > May 23 14:29:00 smoke-02 kernel: [<ffffffff812275f2>] ? selinux_netlbl_socket_post_create+0x52/0x70 > May 23 14:29:00 smoke-02 kernel: [<ffffffff814fed3b>] mutex_lock+0x2b/0x50 > May 23 14:29:00 smoke-02 kernel: [<ffffffffa0313b34>] get_comm+0x34/0x140 [dlm] > May 23 14:29:01 smoke-02 kernel: [<ffffffffa0313d58>] dlm_nodeid_to_addr+0x18/0xd0 [dlm] > May 23 14:29:01 smoke-02 kernel: [<ffffffffa0321a64>] tcp_connect_to_sock+0xf4/0x2d0 [dlm] > May 23 14:29:01 smoke-02 kernel: [<ffffffff8105b483>] ? perf_event_task_sched_out+0x33/0x80 > May 23 14:29:01 smoke-02 kernel: [<ffffffff810533f3>] ? __wake_up+0x53/0x70 > May 23 14:29:01 smoke-02 kernel: [<ffffffffa0324102>] process_send_sockets+0x1d2/0x260 [dlm] > May 23 14:29:01 smoke-02 kernel: [<ffffffffa0323f30>] ? process_send_sockets+0x0/0x260 [dlm] > May 23 14:29:01 smoke-02 kernel: [<ffffffff8108c760>] worker_thread+0x170/0x2a0 > May 23 14:29:01 smoke-02 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40 > May 23 14:29:01 smoke-02 kernel: [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0 > May 23 14:29:01 smoke-02 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0 > May 23 14:29:01 smoke-02 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20 > May 23 14:29:01 smoke-02 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0 > May 23 14:29:01 smoke-02 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
I think the CLOSE flag is probably meant to prevent this sort of problem. We could check for the CLOSE flag earlier, e.g. process_send_sockets or tcp_connect_to_sock, but I'm not sure if that just shortens the race rather than solving it. I'm still hoping that the addition of some artificial delays will let me reproduce this.
This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.
Created attachment 591929 [details] possible patch, untested I still don't know what's ultimately causing this deadlock because I don't see what lock dlm_controld is holding that's blocking dlm_send... maybe it's burried somewhere in the configfs code. So, I'm not sure if this patch really fixes it or not, but it's what I'm planning to try for now. dlm_send: mutex_lock(&connections_lock) check CLOSE flag dlm_nodeid_to_addr mutex_unlock(&connections_lock) dlm_controld: mutex_lock(&connections_lock) find con set CLOSE flag mutex_unlock(&connections_lock) cancel_work_sync It would be more obviously correct to put the cancel_work_sync within the mutex, but cancel_work_sync may need to acquire the mutex itself.
They must be saving cluster/corosync messages in some other log file, maybe /var/log/cluster.log? Could we get a copy of that? Looking at all the dlm connect/disconnect messages, it appears that there is a larger cluster connectivity problem. If they solve that, the deadlock between dlm threads will most likely be a non-issue (although something we'll still work on).
Thanks for the update; this is definitely a problem, and I have the patch above that I'd like to test, but I need to reproduce the issue before the test will mean very much.
There is shared storage, but gfs2 is not being used, only rgmanager and qdisk. Are the rgmanager services using the shared storage, or is the shared storage only there for the sake of qdisk? When the storage is lost from one node, qdisk uses cman to kill corosync on the node without storage. The node exits the cluster, causing dlm recovery for the rgmanager lockspace, and gets into this deadlock. Is there a way we could try operating this cluster without qdisk? It appears to me that qdisk is making a mess of the membership. I need some sane sequence of events to try to replicate what's happening.
Yes, this is the same issue
cman_tool kill is not causing a problem for me, but rgmanager in my config is not holding any dlm locks which might be important.
Replacing rgmanager with a test program that takes locks, and cman_tool kill still does not cause any problem.
I've noticed in the two cases where I have the cluster.conf, the token value is being increased: <totem token="63000"/> <totem consensus="3000" join="60" token="54000" token_retransmits_before_loss_const="20"/> There was another bug similar to this one, bug 663397, where a large token timeout would lead to dlm_controld getting stuck. The problem there was nodes rejoining before the token timeout and before dlm_controld had closed the lowcomms connection. I've yet to figure out from the logs if something similar is happening here. But, can we check to verify that the fix for bug 663397 is included in the cases reported here?
cman_tool kill still does not cause any problems when tested with large token timeouts like those above.
config.c and lowcomms.c are interdependent, which leads to the deadlock. I've not been able to confirm which configfs lock is creating the deadlock, but in any case, it would be difficult to break this deadlock because the code paths involve configfs code in config.c and workqueue code in lowcomms.c. configfs and workqueue code are both things I can't easily control, so the best approach is probably to remove the dependence. This means having lowcomms quit calling into config for nodeid/addr mappings, and instead have config pass nodeid/addr mappings into lowcomms.
Created attachment 600809 [details] patch to test This is a patch that makes config.c push node addresses into lowcomms.c so that code paths in lowcomms do not need to call into configfs which can deadlock. It would be good if this could be tested by someone who can reliably replicate the problem (I've not been able to). A RHEL6 scratch build including this patch is https://brewweb.devel.redhat.com/taskinfo?taskID=4677553
posted to rhkernel http://post-office.corp.redhat.com/archives/rhkernel-list/2012-August/msg00051.html
Patch(es) available on kernel-2.6.32-304.el6
Verified patches are included in kernel-2.6.32-350.el6 and found no regressions during testing.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0496.html