Bug 824964
Summary: | dlm: deadlock between dlm_send and dlm_controld | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | David Teigland <teigland> | ||||||
Component: | kernel | Assignee: | David Teigland <teigland> | ||||||
Status: | CLOSED ERRATA | QA Contact: | Nate Straz <nstraz> | ||||||
Severity: | high | Docs Contact: | |||||||
Priority: | high | ||||||||
Version: | 6.3 | CC: | amoralej, ccaulfie, dhoward, fhrbata, jentrena, lhh, nstraz, pzimek, rdassen, sauchter, sbradley | ||||||
Target Milestone: | rc | Keywords: | ZStream | ||||||
Target Release: | --- | ||||||||
Hardware: | All | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | kernel-2.6.32-304.el6 | Doc Type: | Bug Fix | ||||||
Doc Text: |
A deadlock sometimes occurred between the dlm_controld daemon closing a lowcomms connection through the configfs file system and the dlm_send process looking up the address for a new connection in configfs. With this update, the node addresses are saved within the lowcomms code so that the lowcomms work queue does not need to use configfs to get a node address.
|
Story Points: | --- | ||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2013-02-21 06:14:53 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 782183, 840683, 849051 | ||||||||
Attachments: |
|
Description
David Teigland
2012-05-24 17:26:21 UTC
Damn bugzilla line wrapping.
> May 23 14:26:27 smoke-02 kernel: dlm: closing connection to node 1
> May 23 14:29:00 smoke-02 kernel: INFO: task dlm_controld:29889 blocked for more than 120 seconds.
> May 23 14:29:00 smoke-02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> May 23 14:29:00 smoke-02 kernel: dlm_controld D 0000000000000000 0 29889 1 0x00000084
> May 23 14:29:00 smoke-02 kernel: ffff88013da83bb8 0000000000000082 ffff88013da83b38 ffffffff8127d4ba
> May 23 14:29:00 smoke-02 kernel: ffffffff81ff6f00 00000000000026ee 0000000000000020 0000000000000020
> May 23 14:29:00 smoke-02 kernel: ffff88013d995058 ffff88013da83fd8 000000000000fb88 ffff88013d995058
> May 23 14:29:00 smoke-02 kernel: Call Trace:
> May 23 14:29:00 smoke-02 kernel: [<ffffffff8127d4ba>] ? delay_tsc+0x4a/0x80
> May 23 14:29:00 smoke-02 kernel: [<ffffffff8127d4ba>] ? delay_tsc+0x4a/0x80
> May 23 14:29:00 smoke-02 kernel: [<ffffffff814fe6b5>] schedule_timeout+0x215/0x2e0
> May 23 14:29:00 smoke-02 kernel: [<ffffffff814fe333>] wait_for_common+0x123/0x180
> May 23 14:29:00 smoke-02 kernel: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
> May 23 14:29:00 smoke-02 kernel: [<ffffffff814fe44d>] wait_for_completion+0x1d/0x20
> May 23 14:29:00 smoke-02 kernel: [<ffffffff8108d093>] __cancel_work_timer+0x1b3/0x1e0
> May 23 14:29:00 smoke-02 kernel: [<ffffffff8108cbe0>] ? wq_barrier_func+0x0/0x20
> May 23 14:29:00 smoke-02 kernel: [<ffffffff8108d0f0>] cancel_work_sync+0x10/0x20
> May 23 14:29:00 smoke-02 kernel: [<ffffffffa03236bc>] dlm_lowcomms_close+0x4c/0xb0 [dlm]
> May 23 14:29:00 smoke-02 kernel: [<ffffffffa03143c2>] drop_comm+0x22/0x60 [dlm]
> May 23 14:29:00 smoke-02 kernel: [<ffffffffa0305796>] client_drop_item+0x26/0x50 [configfs]
> May 23 14:29:00 smoke-02 kernel: [<ffffffffa03063a0>] configfs_rmdir+0x180/0x230 [configfs]
> May 23 14:29:00 smoke-02 kernel: [<ffffffff81188d0d>] vfs_rmdir+0xbd/0xf0
> May 23 14:29:00 smoke-02 kernel: [<ffffffff8118795a>] ? lookup_hash+0x3a/0x50
> May 23 14:29:00 smoke-02 kernel: [<ffffffff8118b2a3>] do_rmdir+0x103/0x120
> May 23 14:29:00 smoke-02 kernel: [<ffffffff8117b92f>] ? vfs_read+0x12f/0x1a0
> May 23 14:29:00 smoke-02 kernel: [<ffffffff810d69e2>] ? audit_syscall_entry+0x272/0x2a0
> May 23 14:29:00 smoke-02 kernel: [<ffffffff8118b316>] sys_rmdir+0x16/0x20
> May 23 14:29:00 smoke-02 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> May 23 14:29:00 smoke-02 kernel: INFO: task dlm_send:31816 blocked for more than 120 seconds.
> May 23 14:29:00 smoke-02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> May 23 14:29:00 smoke-02 kernel: dlm_send D 0000000000000001 0 31816 2 0x00000080
> May 23 14:29:00 smoke-02 kernel: ffff88013d997b90 0000000000000046 ffff880037f47580 ffff88007fc00040
> May 23 14:29:00 smoke-02 kernel: ffff88013d997b30 0000000000000000 ffff88007be6fba0 ffffffff8100bdee
> May 23 14:29:00 smoke-02 kernel: ffff88013cdf7ab8 ffff88013d997fd8 000000000000fb88 ffff88013cdf7ab8
> May 23 14:29:00 smoke-02 kernel: Call Trace:
> May 23 14:29:00 smoke-02 kernel: [<ffffffff8100bdee>] ? reschedule_interrupt+0xe/0x20
> May 23 14:29:00 smoke-02 kernel: [<ffffffff814fee9e>] __mutex_lock_slowpath+0x13e/0x180
> May 23 14:29:00 smoke-02 kernel: [<ffffffff812275f2>] ? selinux_netlbl_socket_post_create+0x52/0x70
> May 23 14:29:00 smoke-02 kernel: [<ffffffff814fed3b>] mutex_lock+0x2b/0x50
> May 23 14:29:00 smoke-02 kernel: [<ffffffffa0313b34>] get_comm+0x34/0x140 [dlm]
> May 23 14:29:01 smoke-02 kernel: [<ffffffffa0313d58>] dlm_nodeid_to_addr+0x18/0xd0 [dlm]
> May 23 14:29:01 smoke-02 kernel: [<ffffffffa0321a64>] tcp_connect_to_sock+0xf4/0x2d0 [dlm]
> May 23 14:29:01 smoke-02 kernel: [<ffffffff8105b483>] ? perf_event_task_sched_out+0x33/0x80
> May 23 14:29:01 smoke-02 kernel: [<ffffffff810533f3>] ? __wake_up+0x53/0x70
> May 23 14:29:01 smoke-02 kernel: [<ffffffffa0324102>] process_send_sockets+0x1d2/0x260 [dlm]
> May 23 14:29:01 smoke-02 kernel: [<ffffffffa0323f30>] ? process_send_sockets+0x0/0x260 [dlm]
> May 23 14:29:01 smoke-02 kernel: [<ffffffff8108c760>] worker_thread+0x170/0x2a0
> May 23 14:29:01 smoke-02 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
> May 23 14:29:01 smoke-02 kernel: [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
> May 23 14:29:01 smoke-02 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0
> May 23 14:29:01 smoke-02 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
> May 23 14:29:01 smoke-02 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
> May 23 14:29:01 smoke-02 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
I think the CLOSE flag is probably meant to prevent this sort of problem. We could check for the CLOSE flag earlier, e.g. process_send_sockets or tcp_connect_to_sock, but I'm not sure if that just shortens the race rather than solving it. I'm still hoping that the addition of some artificial delays will let me reproduce this. This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux. Created attachment 591929 [details]
possible patch, untested
I still don't know what's ultimately causing this deadlock because I don't see what lock dlm_controld is holding that's blocking dlm_send... maybe it's burried somewhere in the configfs code.
So, I'm not sure if this patch really fixes it or not, but it's what I'm planning to try for now.
dlm_send:
mutex_lock(&connections_lock)
check CLOSE flag
dlm_nodeid_to_addr
mutex_unlock(&connections_lock)
dlm_controld:
mutex_lock(&connections_lock)
find con
set CLOSE flag
mutex_unlock(&connections_lock)
cancel_work_sync
It would be more obviously correct to put the cancel_work_sync within the mutex, but cancel_work_sync may need to acquire the mutex itself.
They must be saving cluster/corosync messages in some other log file, maybe /var/log/cluster.log? Could we get a copy of that? Looking at all the dlm connect/disconnect messages, it appears that there is a larger cluster connectivity problem. If they solve that, the deadlock between dlm threads will most likely be a non-issue (although something we'll still work on). Thanks for the update; this is definitely a problem, and I have the patch above that I'd like to test, but I need to reproduce the issue before the test will mean very much. There is shared storage, but gfs2 is not being used, only rgmanager and qdisk. Are the rgmanager services using the shared storage, or is the shared storage only there for the sake of qdisk? When the storage is lost from one node, qdisk uses cman to kill corosync on the node without storage. The node exits the cluster, causing dlm recovery for the rgmanager lockspace, and gets into this deadlock. Is there a way we could try operating this cluster without qdisk? It appears to me that qdisk is making a mess of the membership. I need some sane sequence of events to try to replicate what's happening. Yes, this is the same issue cman_tool kill is not causing a problem for me, but rgmanager in my config is not holding any dlm locks which might be important. Replacing rgmanager with a test program that takes locks, and cman_tool kill still does not cause any problem. I've noticed in the two cases where I have the cluster.conf, the token value is being increased: <totem token="63000"/> <totem consensus="3000" join="60" token="54000" token_retransmits_before_loss_const="20"/> There was another bug similar to this one, bug 663397, where a large token timeout would lead to dlm_controld getting stuck. The problem there was nodes rejoining before the token timeout and before dlm_controld had closed the lowcomms connection. I've yet to figure out from the logs if something similar is happening here. But, can we check to verify that the fix for bug 663397 is included in the cases reported here? cman_tool kill still does not cause any problems when tested with large token timeouts like those above. config.c and lowcomms.c are interdependent, which leads to the deadlock. I've not been able to confirm which configfs lock is creating the deadlock, but in any case, it would be difficult to break this deadlock because the code paths involve configfs code in config.c and workqueue code in lowcomms.c. configfs and workqueue code are both things I can't easily control, so the best approach is probably to remove the dependence. This means having lowcomms quit calling into config for nodeid/addr mappings, and instead have config pass nodeid/addr mappings into lowcomms. Created attachment 600809 [details] patch to test This is a patch that makes config.c push node addresses into lowcomms.c so that code paths in lowcomms do not need to call into configfs which can deadlock. It would be good if this could be tested by someone who can reliably replicate the problem (I've not been able to). A RHEL6 scratch build including this patch is https://brewweb.devel.redhat.com/taskinfo?taskID=4677553 posted to rhkernel http://post-office.corp.redhat.com/archives/rhkernel-list/2012-August/msg00051.html Patch(es) available on kernel-2.6.32-304.el6 Verified patches are included in kernel-2.6.32-350.el6 and found no regressions during testing. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0496.html |