Bug 824964

Summary:

dlm: deadlock between dlm_send and dlm_controld

Product:

Red Hat Enterprise Linux 6

Reporter:

David Teigland <teigland>

Component:

kernel

Assignee:

David Teigland <teigland>

Status:

CLOSED ERRATA

QA Contact:

Nate Straz <nstraz>

Severity:

high

Docs Contact:

Priority:

high

Version:

6.3

CC:

amoralej, ccaulfie, dhoward, fhrbata, jentrena, lhh, nstraz, pzimek, rdassen, sauchter, sbradley

Target Milestone:

Keywords:

ZStream

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

kernel-2.6.32-304.el6

Doc Type:

Bug Fix

Doc Text:

A deadlock sometimes occurred between the dlm_controld daemon closing a lowcomms connection through the configfs file system and the dlm_send process looking up the address for a new connection in configfs. With this update, the node addresses are saved within the lowcomms code so that the lowcomms work queue does not need to use configfs to get a node address.

Story Points:

---

Clone Of:

Environment:

Last Closed:

2013-02-21 06:14:53 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

782183, 840683, 849051

Attachments:

Description	Flags
possible patch, untested	none
patch to test	none

Description David Teigland 2012-05-24 17:26:21 UTC

Description of problem:

Nate hit this during testing.  It appears to be a deadlock between dlm_controld (attempting to remove a node), and dlm_send (attempting to connect to the node).

dlm_send is trying to do its work.

dlm_controld is waiting for all work to be done.

dlm_send is waiting on a mutex in get_comm,
it must be mutex_lock(&clusters_root.subsys.su_mutex);

dlm_controld is holding that mutex?, I don't see how/where.


May 23 14:10:36 smoke-02 fence_node[654]: fence smoke-01 success
May 23 14:26:25 smoke-02 corosync[29818]:   [TOTEM ] A processor failed, forming new configuration.
May 23 14:26:27 smoke-02 corosync[29818]:   [QUORUM] Members[1]: 2
May 23 14:26:27 smoke-02 corosync[29818]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 23 14:26:27 smoke-02 corosync[29818]:   [CPG   ] chosen downlist: sender r(0) ip(10.15.89.105) ; members(old:2 left:1)
May 23 14:26:27 smoke-02 corosync[29818]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 23 14:26:27 smoke-02 rgmanager[31805]: State change: smoke-01 DOWN
May 23 14:26:27 smoke-02 kernel: dlm: closing connection to node 1

May 23 14:29:00 smoke-02 kernel: INFO: task dlm_controld:29889 blocked for more than 120 seconds.
May 23 14:29:00 smoke-02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 23 14:29:00 smoke-02 kernel: dlm_controld  D 0000000000000000     0 29889      1 0x00000084
May 23 14:29:00 smoke-02 kernel: ffff88013da83bb8 0000000000000082 ffff88013da83b38 ffffffff8127d4ba
May 23 14:29:00 smoke-02 kernel: ffffffff81ff6f00 00000000000026ee 0000000000000020 0000000000000020
May 23 14:29:00 smoke-02 kernel: ffff88013d995058 ffff88013da83fd8 000000000000fb88 ffff88013d995058
May 23 14:29:00 smoke-02 kernel: Call Trace:
May 23 14:29:00 smoke-02 kernel: [<ffffffff8127d4ba>] ? delay_tsc+0x4a/0x80
May 23 14:29:00 smoke-02 kernel: [<ffffffff8127d4ba>] ? delay_tsc+0x4a/0x80
May 23 14:29:00 smoke-02 kernel: [<ffffffff814fe6b5>] schedule_timeout+0x215/0x2e0
May 23 14:29:00 smoke-02 kernel: [<ffffffff814fe333>] wait_for_common+0x123/0x180
May 23 14:29:00 smoke-02 kernel: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
May 23 14:29:00 smoke-02 kernel: [<ffffffff814fe44d>] wait_for_completion+0x1d/0x20
May 23 14:29:00 smoke-02 kernel: [<ffffffff8108d093>] __cancel_work_timer+0x1b3/0x1e0
May 23 14:29:00 smoke-02 kernel: [<ffffffff8108cbe0>] ? wq_barrier_func+0x0/0x20
May 23 14:29:00 smoke-02 kernel: [<ffffffff8108d0f0>] cancel_work_sync+0x10/0x20
May 23 14:29:00 smoke-02 kernel: [<ffffffffa03236bc>] dlm_lowcomms_close+0x4c/0xb0 [dlm]
May 23 14:29:00 smoke-02 kernel: [<ffffffffa03143c2>] drop_comm+0x22/0x60 [dlm]
May 23 14:29:00 smoke-02 kernel: [<ffffffffa0305796>] client_drop_item+0x26/0x50 [configfs]
May 23 14:29:00 smoke-02 kernel: [<ffffffffa03063a0>] configfs_rmdir+0x180/0x230 [configfs]
May 23 14:29:00 smoke-02 kernel: [<ffffffff81188d0d>] vfs_rmdir+0xbd/0xf0
May 23 14:29:00 smoke-02 kernel: [<ffffffff8118795a>] ? lookup_hash+0x3a/0x50
May 23 14:29:00 smoke-02 kernel: [<ffffffff8118b2a3>] do_rmdir+0x103/0x120
May 23 14:29:00 smoke-02 kernel: [<ffffffff8117b92f>] ? vfs_read+0x12f/0x1a0
May 23 14:29:00 smoke-02 kernel: [<ffffffff810d69e2>] ? audit_syscall_entry+0x272/0x2a0
May 23 14:29:00 smoke-02 kernel: [<ffffffff8118b316>] sys_rmdir+0x16/0x20
May 23 14:29:00 smoke-02 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

May 23 14:29:00 smoke-02 kernel: INFO: task dlm_send:31816 blocked for more than 120 seconds.
May 23 14:29:00 smoke-02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 23 14:29:00 smoke-02 kernel: dlm_send      D 0000000000000001     0 31816      2 0x00000080
May 23 14:29:00 smoke-02 kernel: ffff88013d997b90 0000000000000046 ffff880037f47580 ffff88007fc00040
May 23 14:29:00 smoke-02 kernel: ffff88013d997b30 0000000000000000 ffff88007be6fba0 ffffffff8100bdee
May 23 14:29:00 smoke-02 kernel: ffff88013cdf7ab8 ffff88013d997fd8 000000000000fb88 ffff88013cdf7ab8
May 23 14:29:00 smoke-02 kernel: Call Trace:
May 23 14:29:00 smoke-02 kernel: [<ffffffff8100bdee>] ? reschedule_interrupt+0xe/0x20
May 23 14:29:00 smoke-02 kernel: [<ffffffff814fee9e>] __mutex_lock_slowpath+0x13e/0x180
May 23 14:29:00 smoke-02 kernel: [<ffffffff812275f2>] ? selinux_netlbl_socket_post_create+0x52/0x70
May 23 14:29:00 smoke-02 kernel: [<ffffffff814fed3b>] mutex_lock+0x2b/0x50
May 23 14:29:00 smoke-02 kernel: [<ffffffffa0313b34>] get_comm+0x34/0x140 [dlm]
May 23 14:29:01 smoke-02 kernel: [<ffffffffa0313d58>] dlm_nodeid_to_addr+0x18/0xd0 [dlm]
May 23 14:29:01 smoke-02 kernel: [<ffffffffa0321a64>] tcp_connect_to_sock+0xf4/0x2d0 [dlm]
May 23 14:29:01 smoke-02 kernel: [<ffffffff8105b483>] ? perf_event_task_sched_out+0x33/0x80
May 23 14:29:01 smoke-02 kernel: [<ffffffff810533f3>] ? __wake_up+0x53/0x70
May 23 14:29:01 smoke-02 kernel: [<ffffffffa0324102>] process_send_sockets+0x1d2/0x260 [dlm]
May 23 14:29:01 smoke-02 kernel: [<ffffffffa0323f30>] ? process_send_sockets+0x0/0x260 [dlm]
May 23 14:29:01 smoke-02 kernel: [<ffffffff8108c760>] worker_thread+0x170/0x2a0
May 23 14:29:01 smoke-02 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
May 23 14:29:01 smoke-02 kernel: [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
May 23 14:29:01 smoke-02 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0
May 23 14:29:01 smoke-02 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
May 23 14:29:01 smoke-02 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
May 23 14:29:01 smoke-02 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 David Teigland 2012-05-24 17:30:11 UTC

Damn bugzilla line wrapping.


> May 23 14:26:27 smoke-02 kernel: dlm: closing connection to node 1
> May 23 14:29:00 smoke-02 kernel: INFO: task dlm_controld:29889 blocked for more than 120 seconds.
> May 23 14:29:00 smoke-02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> May 23 14:29:00 smoke-02 kernel: dlm_controld  D 0000000000000000     0 29889      1 0x00000084
> May 23 14:29:00 smoke-02 kernel: ffff88013da83bb8 0000000000000082 ffff88013da83b38 ffffffff8127d4ba
> May 23 14:29:00 smoke-02 kernel: ffffffff81ff6f00 00000000000026ee 0000000000000020 0000000000000020
> May 23 14:29:00 smoke-02 kernel: ffff88013d995058 ffff88013da83fd8 000000000000fb88 ffff88013d995058
> May 23 14:29:00 smoke-02 kernel: Call Trace:
> May 23 14:29:00 smoke-02 kernel: [<ffffffff8127d4ba>] ? delay_tsc+0x4a/0x80
> May 23 14:29:00 smoke-02 kernel: [<ffffffff8127d4ba>] ? delay_tsc+0x4a/0x80
> May 23 14:29:00 smoke-02 kernel: [<ffffffff814fe6b5>] schedule_timeout+0x215/0x2e0
> May 23 14:29:00 smoke-02 kernel: [<ffffffff814fe333>] wait_for_common+0x123/0x180
> May 23 14:29:00 smoke-02 kernel: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
> May 23 14:29:00 smoke-02 kernel: [<ffffffff814fe44d>] wait_for_completion+0x1d/0x20
> May 23 14:29:00 smoke-02 kernel: [<ffffffff8108d093>] __cancel_work_timer+0x1b3/0x1e0
> May 23 14:29:00 smoke-02 kernel: [<ffffffff8108cbe0>] ? wq_barrier_func+0x0/0x20
> May 23 14:29:00 smoke-02 kernel: [<ffffffff8108d0f0>] cancel_work_sync+0x10/0x20
> May 23 14:29:00 smoke-02 kernel: [<ffffffffa03236bc>] dlm_lowcomms_close+0x4c/0xb0 [dlm]
> May 23 14:29:00 smoke-02 kernel: [<ffffffffa03143c2>] drop_comm+0x22/0x60 [dlm]
> May 23 14:29:00 smoke-02 kernel: [<ffffffffa0305796>] client_drop_item+0x26/0x50 [configfs]
> May 23 14:29:00 smoke-02 kernel: [<ffffffffa03063a0>] configfs_rmdir+0x180/0x230 [configfs]
> May 23 14:29:00 smoke-02 kernel: [<ffffffff81188d0d>] vfs_rmdir+0xbd/0xf0
> May 23 14:29:00 smoke-02 kernel: [<ffffffff8118795a>] ? lookup_hash+0x3a/0x50
> May 23 14:29:00 smoke-02 kernel: [<ffffffff8118b2a3>] do_rmdir+0x103/0x120
> May 23 14:29:00 smoke-02 kernel: [<ffffffff8117b92f>] ? vfs_read+0x12f/0x1a0
> May 23 14:29:00 smoke-02 kernel: [<ffffffff810d69e2>] ? audit_syscall_entry+0x272/0x2a0
> May 23 14:29:00 smoke-02 kernel: [<ffffffff8118b316>] sys_rmdir+0x16/0x20
> May 23 14:29:00 smoke-02 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> May 23 14:29:00 smoke-02 kernel: INFO: task dlm_send:31816 blocked for more than 120 seconds.
> May 23 14:29:00 smoke-02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> May 23 14:29:00 smoke-02 kernel: dlm_send      D 0000000000000001     0 31816      2 0x00000080
> May 23 14:29:00 smoke-02 kernel: ffff88013d997b90 0000000000000046 ffff880037f47580 ffff88007fc00040
> May 23 14:29:00 smoke-02 kernel: ffff88013d997b30 0000000000000000 ffff88007be6fba0 ffffffff8100bdee
> May 23 14:29:00 smoke-02 kernel: ffff88013cdf7ab8 ffff88013d997fd8 000000000000fb88 ffff88013cdf7ab8
> May 23 14:29:00 smoke-02 kernel: Call Trace:
> May 23 14:29:00 smoke-02 kernel: [<ffffffff8100bdee>] ? reschedule_interrupt+0xe/0x20
> May 23 14:29:00 smoke-02 kernel: [<ffffffff814fee9e>] __mutex_lock_slowpath+0x13e/0x180
> May 23 14:29:00 smoke-02 kernel: [<ffffffff812275f2>] ? selinux_netlbl_socket_post_create+0x52/0x70
> May 23 14:29:00 smoke-02 kernel: [<ffffffff814fed3b>] mutex_lock+0x2b/0x50
> May 23 14:29:00 smoke-02 kernel: [<ffffffffa0313b34>] get_comm+0x34/0x140 [dlm]
> May 23 14:29:01 smoke-02 kernel: [<ffffffffa0313d58>] dlm_nodeid_to_addr+0x18/0xd0 [dlm]
> May 23 14:29:01 smoke-02 kernel: [<ffffffffa0321a64>] tcp_connect_to_sock+0xf4/0x2d0 [dlm]
> May 23 14:29:01 smoke-02 kernel: [<ffffffff8105b483>] ? perf_event_task_sched_out+0x33/0x80
> May 23 14:29:01 smoke-02 kernel: [<ffffffff810533f3>] ? __wake_up+0x53/0x70
> May 23 14:29:01 smoke-02 kernel: [<ffffffffa0324102>] process_send_sockets+0x1d2/0x260 [dlm]
> May 23 14:29:01 smoke-02 kernel: [<ffffffffa0323f30>] ? process_send_sockets+0x0/0x260 [dlm]
> May 23 14:29:01 smoke-02 kernel: [<ffffffff8108c760>] worker_thread+0x170/0x2a0
> May 23 14:29:01 smoke-02 kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
> May 23 14:29:01 smoke-02 kernel: [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0
> May 23 14:29:01 smoke-02 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0
> May 23 14:29:01 smoke-02 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
> May 23 14:29:01 smoke-02 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
> May 23 14:29:01 smoke-02 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20

Comment 3 David Teigland 2012-05-24 22:34:46 UTC

I think the CLOSE flag is probably meant to prevent this sort of problem.  We could check for the CLOSE flag earlier, e.g. process_send_sockets or tcp_connect_to_sock, but I'm not sure if that just shortens the race rather than solving it.

I'm still hoping that the addition of some artificial delays will let me reproduce this.

Comment 4 RHEL Program Management 2012-05-30 05:47:14 UTC

This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 6 David Teigland 2012-06-14 21:17:24 UTC

Created attachment 591929 [details]
possible patch, untested

I still don't know what's ultimately causing this deadlock because I don't see what lock dlm_controld is holding that's blocking dlm_send... maybe it's burried somewhere in the configfs code.

So, I'm not sure if this patch really fixes it or not, but it's what I'm planning to try for now.

dlm_send:
mutex_lock(&connections_lock)
check CLOSE flag
dlm_nodeid_to_addr
mutex_unlock(&connections_lock)

dlm_controld:
mutex_lock(&connections_lock)
find con
set CLOSE flag
mutex_unlock(&connections_lock)
cancel_work_sync

It would be more obviously correct to put the cancel_work_sync within the mutex,  but cancel_work_sync may need to acquire the mutex itself.

Comment 9 David Teigland 2012-06-15 20:25:33 UTC

They must be saving cluster/corosync messages in some other log file, maybe /var/log/cluster.log?  Could we get a copy of that?  Looking at all the dlm connect/disconnect messages, it appears that there is a larger cluster connectivity problem.  If they solve that, the deadlock between dlm threads will most likely be a non-issue (although something we'll still work on).

Comment 17 David Teigland 2012-07-19 14:33:12 UTC

Thanks for the update; this is definitely a problem, and I have the patch above that I'd like to test, but I need to reproduce the issue before the test will mean very much.

Comment 18 David Teigland 2012-07-19 16:07:19 UTC

There is shared storage, but gfs2 is not being used, only rgmanager and qdisk.  Are the rgmanager services using the shared storage, or is the shared storage only there for the sake of qdisk?

When the storage is lost from one node, qdisk uses cman to kill corosync on the node without storage.  The node exits the cluster, causing dlm recovery for the rgmanager lockspace, and gets into this deadlock.

Is there a way we could try operating this cluster without qdisk?  It appears to me that qdisk is making a mess of the membership.  I need some sane sequence of events to try to replicate what's happening.

Comment 20 David Teigland 2012-07-23 16:05:16 UTC

Yes, this is the same issue

Comment 23 David Teigland 2012-07-24 20:42:42 UTC

cman_tool kill is not causing a problem for me, but rgmanager in my config is not holding any dlm locks which might be important.

Comment 24 David Teigland 2012-07-24 20:49:26 UTC

Replacing rgmanager with a test program that takes locks, and cman_tool kill still does not cause any problem.

Comment 25 David Teigland 2012-07-24 21:55:41 UTC

I've noticed in the two cases where I have the cluster.conf, the token value is being increased:

<totem token="63000"/>

<totem consensus="3000" join="60" token="54000" token_retransmits_before_loss_const="20"/>

There was another bug similar to this one, bug 663397, where a large token timeout would lead to dlm_controld getting stuck.  The problem there was nodes rejoining before the token timeout and before dlm_controld had closed the lowcomms connection.  I've yet to figure out from the logs if something similar is happening here.  But, can we check to verify that the fix for bug 663397 is included in the cases reported here?

Comment 27 David Teigland 2012-07-24 22:17:59 UTC

cman_tool kill still does not cause any problems when tested with large token timeouts like those above.

Comment 28 David Teigland 2012-07-25 19:46:37 UTC

config.c and lowcomms.c are interdependent, which leads to the deadlock.  I've not been able to confirm which configfs lock is creating the deadlock, but in any case, it would be difficult to break this deadlock because the code paths involve configfs code in config.c and workqueue code in lowcomms.c.  configfs and workqueue code are both things I can't easily control, so the best approach is probably to remove the dependence.  This means having lowcomms quit calling into config for nodeid/addr mappings, and instead have config pass nodeid/addr mappings into lowcomms.

Comment 29 David Teigland 2012-07-27 15:14:19 UTC

Created attachment 600809 [details]
patch to test

This is a patch that makes config.c push node addresses into lowcomms.c so that code paths in lowcomms do not need to call into configfs which can deadlock.

It would be good if this could be tested by someone who can reliably replicate the problem (I've not been able to).  A RHEL6 scratch build including this patch is https://brewweb.devel.redhat.com/taskinfo?taskID=4677553

Comment 31 David Teigland 2012-08-01 16:56:38 UTC

posted to rhkernel
http://post-office.corp.redhat.com/archives/rhkernel-list/2012-August/msg00051.html

Comment 35 Jarod Wilson 2012-08-31 18:43:56 UTC

Patch(es) available on kernel-2.6.32-304.el6

Comment 38 Nate Straz 2013-01-21 20:47:11 UTC

Verified patches are included in kernel-2.6.32-350.el6 and found no regressions during testing.

Comment 40 errata-xmlrpc 2013-02-21 06:14:53 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0496.html