Bug 239614 - clvmd operations deadlock while waiting for a dlm lock
Summary: clvmd operations deadlock while waiting for a dlm lock
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: lvm2-cluster
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Jonathan Earl Brassow
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2007-05-09 21:36 UTC by Corey Marthaler
Modified: 2010-05-17 20:34 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-05-17 20:34:35 UTC
Embargoed:


Attachments
kern dump from link-02 (86.49 KB, text/plain)
2007-05-09 21:53 UTC, Corey Marthaler
kern dump from link-04 (70.36 KB, text/plain)
2007-05-09 21:58 UTC, Corey Marthaler
kern dump from link-07 (89.95 KB, text/plain)
2007-05-09 21:58 UTC, Corey Marthaler
kern dump from link-08 (94.06 KB, application/octet-stream)
2007-05-09 22:23 UTC, Corey Marthaler

Description Corey Marthaler 2007-05-09 21:36:06 UTC
Description of problem:
I was running many clvmd operations (create/convert/deactivate/delete) from all
nodes in the link cluster (link-0[2478]) when this deadlock happened.


[root@link-02 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    4   M   link-04
   2    1    4   M   link-02
   3    1    4   M   link-07
   4    1    4   M   link-08


[root@link-02 ~]# cat /proc/cluster/dlm_locks
DLM lockspace 'clvmd'

Resource 0000010024ebde18 (parent 0000000000000000). Name (len=13) "V_lock_stress"
Local Copy, Master is node 3
Granted Queue
Conversion Queue
Waiting Queue
000100c0 -- (PW) Master:     000100ef  LQ: 0,0x0
00010027 -- (PW) Master:     0001029d  LQ: 0,0x0
000103a1 -- (PR) Master:     00010331  LQ: 0,0x0
000203bd -- (PR) Master:     00010164  LQ: 0,0x0


[root@link-04 ~]# cat /proc/cluster/dlm_locks
DLM lockspace 'clvmd'

Resource 000001003964fe58 (parent 0000000000000000). Name (len=13) "V_lock_stress"
Local Copy, Master is node 3
Granted Queue
Conversion Queue
Waiting Queue
00010381 -- (PW) Master:     000100ae  LQ: 0,0x0
00010182 -- (PW) Master:     0001017b  LQ: 0,0x0
0001022c -- (PR) Master:     0001013a  LQ: 0,0x0


[root@link-07 ~]# cat /proc/cluster/dlm_locks
DLM lockspace 'clvmd'

Resource 000001001dbb3e18 (parent 0000000000000000). Name (len=13) "V_lock_stress"
Master Copy
Granted Queue
0001017f PR Remote:   4 000100db
Conversion Queue
Waiting Queue
000100ef -- (PW) Remote:   2 000100c0  LQ: 0,0x0
0001029d -- (PW) Remote:   2 00010027  LQ: 0,0x0
000100af -- (PW)  LQ: 0,0x0
000101bd -- (PW)  LQ: 0,0x0
00010160 -- (PW) Remote:   4 00010087  LQ: 0,0x0
000100ae -- (PW) Remote:   1 00010381  LQ: 0,0x0
0001017b -- (PW) Remote:   1 00010182  LQ: 0,0x0
0001003d -- (PR)  LQ: 0,0x0
0001004f -- (PR) Remote:   4 000101bf  LQ: 0,0x0
000102f8 -- (PR) Remote:   4 00010308  LQ: 0,0x0
0001021f -- (PR)  LQ: 0,0x0
00010331 -- (PR) Remote:   2 000103a1  LQ: 0,0x0
0001013a -- (PR) Remote:   1 0001022c  LQ: 0,0x0
00010164 -- (PR) Remote:   2 000203bd  LQ: 0,0x0


[root@link-08 ~]# cat /proc/cluster/dlm_locks
DLM lockspace 'clvmd'

Resource 0000010013c97e98 (parent 0000000000000000). Name (len=13) "V_lock_stress"
Local Copy, Master is node 3
Granted Queue
000100db PR Master:     0001017f
Conversion Queue
Waiting Queue
00010087 -- (PW) Master:     00010160  LQ: 0,0x0
000101bf -- (PR) Master:     0001004f  LQ: 0,0x0
00010308 -- (PR) Master:     000102f8  LQ: 0,0x0
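
For anyone comparing the four dumps above by hand, here is a minimal Python sketch that counts locks per queue and mode in one node's /proc/cluster/dlm_locks output. The function name summarize_dlm_locks and the constant LINK08_DUMP are made up for illustration, and the line format is inferred only from the dumps in this report, so treat it as an assumption rather than a description of the real file format.

```python
import re
from collections import Counter

# Queue header lines as they appear in the dumps in this report.
QUEUE_HEADERS = {
    "Granted Queue": "granted",
    "Conversion Queue": "conversion",
    "Waiting Queue": "waiting",
}

# Assumed line shape: 8 hex digits (lock id), an optional "--" meaning no
# mode is held yet, then the mode, either bare ("PR") or requested ("(PW)").
LOCK_LINE = re.compile(r"^([0-9a-f]{8})\s+(?:--\s+)?\(?([A-Z]{2})\)?")


def summarize_dlm_locks(text):
    """Return {queue name: Counter of lock modes} for one dump."""
    summary = {name: Counter() for name in QUEUE_HEADERS.values()}
    queue = None
    for raw in text.splitlines():
        line = raw.strip()
        if line in QUEUE_HEADERS:
            queue = QUEUE_HEADERS[line]
            continue
        if line.startswith("Resource "):
            # A new resource starts; wait for its first queue header.
            queue = None
            continue
        match = LOCK_LINE.match(line)
        if match and queue:
            summary[queue][match.group(2)] += 1
    return summary


# The link-08 dump from this report, copied verbatim.
LINK08_DUMP = """
DLM lockspace 'clvmd'

Resource 0000010013c97e98 (parent 0000000000000000). Name (len=13) "V_lock_stress"
Local Copy, Master is node 3
Granted Queue
000100db PR Master:     0001017f
Conversion Queue
Waiting Queue
00010087 -- (PW) Master:     00010160  LQ: 0,0x0
000101bf -- (PR) Master:     0001004f  LQ: 0,0x0
00010308 -- (PR) Master:     000102f8  LQ: 0,0x0
"""

result = summarize_dlm_locks(LINK08_DUMP)
for queue_name, modes in result.items():
    print(queue_name, dict(modes))
```

Run against the link-08 dump, this reports one granted PR lock and three waiters (one PW, two PR), which matches the "3 waiting, one granted" remark in the discussion below.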


Posting the irc discussion on this bug:

<visegrips> pjc: would looking at the clvmd threads help?
<visegrips> pjc: 3 threads are stuck in dlm:send_cluster_request
<visegrips> pjc: one of them must have got through... but all the processes
(vgs, etc) are stuck waiting for locks
<visegrips> pjc: the one that seems of interest is:
--- visegrips is now known as |
<|> May  9 06:22:34 link-08 kernel: clvmd         S 000001001d2b5e18     0  9075
     1          9079  9064 (NOTLB)
<|> May  9 06:22:34 link-08 kernel: 000001001d2b5d98 0000000000000006
0000010036ca67f0 0000000000000077
<|> May  9 06:22:34 link-08 kernel:        000001001d2b5d58 000000008013237d
0000010001009f80 00000000a02ad650
<|> May  9 06:22:34 link-08 kernel:        000001000eabc030 0000000000000fa4
<|> May  9 06:22:34 link-08 kernel: Call
Trace:<ffffffff8013240b>{activate_task+124} <ffffffff8030bd18>{schedule_timeout+224}
<|> May  9 06:22:34 link-08 kernel:       
<ffffffff8016bfd2>{find_extend_vma+22} <ffffffff801357e0>{add_wait_queue+18}
<|> May  9 06:22:34 link-08 kernel:        <ffffffff8014cb6e>{do_futex+413}
<ffffffff80133fec>{default_wake_function+0}
--- | is now known as visegrips
<visegrips> May  9 06:22:34 link-08 kernel:       
<ffffffff80133fec>{default_wake_function+0} <ffffffff8014cf57>{sys_futex+203}
<visegrips> May  9 06:22:34 link-08 kernel:       
<ffffffff801795f1>{sys_write+96} <ffffffff8011026a>{system_call+126}
<pjc> I can't see the dlm in there 
<visegrips> pjc: how do I switch threads in gdb?
<pjc> 'info thr' shows the numbers
<pjc> then 'thr <num>'
<visegrips> pjc: all threads blocking on pthread_cond_wait except 2... I'm
guessing the main and ....
<visegrips> (gdb) bt
<visegrips> #0  0x000000344df0b19f in __read_nocancel () from
/lib64/tls/libpthread.so.0
<visegrips> #1  0x0000000000446814 in do_dlm_dispatch ()
<visegrips> #2  0x00000000004472c2 in dlm_recv_thread ()
<visegrips> #3  0x000000344df06137 in start_thread () from
/lib64/tls/libpthread.so.0
<visegrips> #4  0x000000344d6c7113 in clone () from /lib64/tls/libc.so.6
<pjc> yeah, that's normal
<visegrips> pjc: I've done a bt on all the threads...
<visegrips> pjc: 3 are stuck after _sync_lock
<visegrips> pjc: the other at pre_and_post_thread
<pjc> OK, that's consistent with the DLM lock state 
<pjc> 3 waiting, one granted
<visegrips> pjc: yeah, not really sure what I'm looking for here...
<pjc> so there are 5 lvm threads in clvmd, but only 4 commands showing in PS
<pjc> (that I can see)
<visegrips> pjc: 7 threads: 1 main, 1 ?, 5 commands?
<pjc> yes
<visegrips> what is ?
<pjc> sorry, brain fade - there are only 4
<pjc> but all the lvm commands look like they are waiting for locks ??
<pjc> my guess is that it's the vgs command
<pjc> that's the oldest reader there
<visegrips> pjc: how can you tell that?
<pjc> ps -ef shows rough start dates
<pjc> and lvs and lvscan are dated today
<pjc> lvcreate & vgs yesterday
<pjc> it's not lvcreate; that needs a write lock
<visegrips> pjc: nice, ok
<pjc> if I gdb the vgs command, it just bt's as '__pthread_do_exit'
<pjc> can we do a sysrq-t ?
<visegrips> pjc: /root/messages.txt
<visegrips> pjc: I already dumped that to ^
<pjc> great :)
<pjc> hmm, it's waiting for the socket
<pjc> probably the clvmd one
<visegrips> pjc: could it be possible to have two entries with the same name in
the vglist, then trying to grab both of them?
<pjc> I don't think so. it should do them in turn rather than in parallel
<visegrips> you are right
<pjc> my guess is that vgs is still waiting for a reply from clvmd
<pjc> but all clvmd should have done is to get the DLM lock then return
<pjc> actually, I don't believe half of these backtraces. they're rubbish on x86_64
<pjc> but one clvmd thread seems to be stuck in sys_write ...
<pjc> ... which is consistent with the other end being stuck in read
<pjc> not sure. I don't know why they should be stuck like that (if they are),
if one is reading and one writing then they should be waiting for each other
<pjc> /should/should not/
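
To make the "3 waiting, one granted" reading of the lock state concrete, here is a rough Python sketch of the usual DLM lock-mode compatibility rules applied to V_lock_stress. The table follows the standard VMS-style matrix. The helper name can_grant and its queue_ahead parameter are hypothetical, and the rule that a new request waits behind anything already queued is an assumption about default ordering, not something confirmed by the dumps.

```python
# Standard DLM mode compatibility: for each requested mode, the set of
# already-granted modes it can coexist with.
COMPATIBLE = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}


def can_grant(requested, granted_modes, queue_ahead=0):
    """Hypothetical check: can a new request be granted right away?

    Assumes default ordering, where a request must wait if any other
    request is already queued ahead of it on the resource.
    """
    if queue_ahead > 0:
        return False
    return all(held in COMPATIBLE[requested] for held in granted_modes)


# State seen on the master (link-07): one PR granted to node 4,
# with PW requests at the head of the waiting queue.
head_pw_ok = can_grant("PW", ["PR"])
later_pr_ok = can_grant("PR", ["PR"], queue_ahead=7)
pr_alone_ok = can_grant("PR", ["PR"])
```

Under these assumptions the PW requests cannot be granted while node 4 holds PR, and the PR requests queued behind them wait too, even though PR is compatible with PR. Nothing moves until the single PR holder on link-08 releases its lock, which is why the discussion focuses on the oldest reader.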


Version-Release number of selected component (if applicable):
[root@link-07 ~]# uname -ar
Linux link-07 2.6.9-55.ELlargesmp #1 SMP Fri Apr 20 16:46:56 EDT 2007 x86_64
x86_64 x86_64 GNU/Linux
[root@link-07 ~]# rpm -q lvm2-cluster
lvm2-cluster-2.02.21-7.el4

Comment 1 Corey Marthaler 2007-05-09 21:53:19 UTC
Created attachment 154430 [details]
kern dump from link-02

Comment 2 Corey Marthaler 2007-05-09 21:58:03 UTC
Created attachment 154431 [details]
kern dump from link-04

Comment 3 Corey Marthaler 2007-05-09 21:58:58 UTC
Created attachment 154432 [details]
kern dump from link-07

Comment 4 Corey Marthaler 2007-05-09 22:23:58 UTC
Created attachment 154437 [details]
kern dump from link-08

Comment 5 Jonathan Earl Brassow 2007-05-22 13:48:43 UTC
Were the converts up or down?


Comment 6 Jonathan Earl Brassow 2007-05-22 13:51:22 UTC
I don't remember getting /var/log/messages for this bug.  (If *.debug was turned
off, you wouldn't see anything though.)

Could be the same as bug 239856.


