Bug 483685

Summary: lvm commands hang on connect() to clvmd
Product: [Retired] Red Hat Cluster Suite
Component: dlm-kernel
Version: 4
Hardware: All
OS: Linux
Status: CLOSED DUPLICATE
Severity: urgent
Priority: urgent
Reporter: Eduardo Damato <edamato>
Assignee: David Teigland <teigland>
QA Contact: Cluster QE <mspqa-list>
CC: agk, bmr, casmith, ccaulfie, cfeist, cluster-maint, cmaiolin, cmarthal, curtis, daniel.oberdick, dejohnso, djuran, dwysocha, edamato, grimme, jbrassow, jfenal, jkortus, jwilleford, mbroz, pdemauro, prockai, sbradley, schlegel, tao, teigland
Target Milestone: rc
Keywords: ZStream
Doc Type: Bug Fix
Last Closed: 2010-03-17 02:59:26 UTC
Bug Blocks: 546022, 564454
Attachments:
  locks for node psu250 27th Feb 09
  locks for node psu252 27th Feb 09
  dlm debugging patch
  debugging patch
  patch that doesn't work yet...
  patch that works

Description Eduardo Damato 2009-02-03 00:01:20 UTC
Description of problem:

On a RHEL 4.7 cluster (latest components), clvmd randomly stops responding to LVM commands.
When this happens, all LVM commands block on connect() to clvmd and never return.
No reproducer has been identified so far.

Version-Release number of selected component (if applicable):

RHEL4:
device-mapper-1.02.25-2.el4-i386
lvm2-2.02.37-3.el4-i386
lvm2-cluster-2.02.37-3.el4-i386

note: 

the packages above contain the fixes for the bugs below, which had very similar behaviour to this:

https://bugzilla.redhat.com/show_bug.cgi?id=435491
https://bugzilla.redhat.com/show_bug.cgi?id=435341

How reproducible:

Not reproducible, happens randomly.
  
Actual results:

When the bug is triggered and locking_type=3 is in use, all LVM commands block on connect() to clvmd and stay in that state forever.

Expected results:

LVM commands connect to clvmd, perform the necessary actions, and return as expected.

Additional info:

As a workaround it is generally possible to kill clvmd on all nodes and then start it again on all nodes. In some cases, however, this workaround does not work, namely when the following messages appear after launching clvmd with:

# /usr/sbin/clvmd

Jan 15 11:48:33 node2 kernel: dlm: clvmd: lockspace already in use
Jan 15 11:48:33 node2 clvmd: Cluster LVM daemon started - connected to CMAN
Jan 15 11:48:33 node2 kernel: CMAN: WARNING no listener for port 11 on node node1
Jan 15 11:49:29 node2 clvmd: Cluster LVM daemon started - connected to CMAN
Jan 15 11:50:08 node2 clvmd[30460]: Refreshing context
Jan 15 11:50:08 node2 clvmd[30460]: Reloading config files
Jan 15 11:50:08 node2 clvmd[30460]: Wiping internal VG cache


Jan 15 11:49:24 node1 kernel: dlm: clvmd: lockspace already in use
Jan 15 11:49:24 node1 clvmd: Cluster LVM daemon started - connected to CMAN
Jan 15 11:49:24 node1 kernel: CMAN: WARNING no listener for port 11 on node node2

Comment 1 Eduardo Damato 2009-02-03 00:03:16 UTC
As expected, vgscan blocks on the unfinished connect() syscall:

# vgscan -vvvvvv
    ...
    0.000072 write(2, "#config/config.c:950   ", 23#config/config.c:950   ) = 23 <0.000014>
    0.000069 write(2, "Setting global/locking_type to 3", 32Setting global/locking_type to 3) = 32 <0.000014>
    0.000072 write(2, "\n", 1)         = 1 <0.000013>
    0.000058 write(2, "#locking/locking.c:245   ", 25#locking/locking.c:245   ) = 25 <0.000014>
    0.000071 write(2, "Cluster locking selected.", 25Cluster locking selected.) = 25 <0.000014>
    0.000072 write(2, "\n", 1)         = 1 <0.000014>
    0.000066 socket(PF_FILE, SOCK_STREAM, 0) = 4 <0.000018>
    0.000069 connect(4, {sa_family=AF_FILE, path=@clvmd}, 110 <unfinished ...>

On the clvmd side, we do not see any thread waiting in the accept() call:

# strace -f -T -r -p <pid of clvmd>

Process 1197 attached with 11 threads - interrupt to quit
[pid 29675]      0.000000 futex(0xedffabe8, FUTEX_WAIT, 1197, NULL <unfinished ...>
[pid 29706]      0.000077 read(8,  <unfinished ...>
[pid 29707]      0.000052 futex(0x80b13c4, FUTEX_WAIT, 3402281, NULL <unfinished ...>
[pid  2407]      0.000049 futex(0xf4dff1b4, FUTEX_WAIT, 1, NULL <unfinished ...>
[pid  2413]      0.000050 futex(0xf07fe1b4, FUTEX_WAIT, 1, NULL <unfinished ...>
[pid  4560]      0.000050 futex(0xf2fff1b4, FUTEX_WAIT, 1, NULL <unfinished ...>
[pid 20696]      0.000046 futex(0xf11ff1b4, FUTEX_WAIT, 1, NULL <unfinished ...>
[pid 17842]      0.000051 futex(0xefdfd1b4, FUTEX_WAIT, 1, NULL <unfinished ...>
[pid 19984]      0.000049 futex(0xee9fb1b4, FUTEX_WAIT, 1, NULL <unfinished ...>
[pid 25536]      0.000048 futex(0xef3fc1b4, FUTEX_WAIT, 1, NULL <unfinished ...>
[pid  1197]      0.000049 futex(0xedffa1b4, FUTEX_WAIT, 1, NULL <unfinished ...>
Process 29675 detached
Process 29706 detached
Process 29707 detached
Process 2407 detached
Process 2413 detached
Process 4560 detached
Process 20696 detached
Process 17842 detached
Process 19984 detached
Process 25536 detached
Process 1197 detached

Comment 2 Eduardo Damato 2009-02-03 00:06:03 UTC
FYI: this gdb trace is not from the same occurrence as the strace above.

(gdb) thread apply all bt

Thread 10 (process 23461):
#0  0x006907a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00471297 in pthread_join () from /lib/tls/libpthread.so.0
#2  0x080546ab in local_sock_callback (thisfd=0xf459f100, buf=0xfee743f0 "\021", len=1500, csid=0xfee74a60 "8\004",  new_client=0xfee74a44) at clvmd.c:921
#3  0x08055f07 in main (argc=1, argv=0xfee74f04) at clvmd.c:692

Thread 9 (process 14995):
#0  0x006907a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00472cf6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x0805b4b4 in _sync_lock (resource=0xf12fcee3 "V_vg_log", mode=3, flags=0, lockid=0xea0fa2c8) at clvmd-cman.c:437
#3  0x08056750 in sync_lock (resource=0xfffffffc <Address 0xfffffffc out of bounds>, mode=-4, flags=-4, lockid=0xfffffffc) at clvmd.c:1987
#4  0x08052fbb in do_pre_command (client=0xf459f100) at clvmd-command.c:224
#5  0x080537cd in pre_and_post_thread (arg=0xf459f100) at clvmd.c:1418
#6  0x004703cc in start_thread () from /lib/tls/libpthread.so.0
#7  0x003061ae in mremap () from /lib/tls/libc.so.6

Thread 8 (process 812):
#0  0x006907a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00472cf6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x0805b4b4 in _sync_lock (resource=0xf12fcfe3 "V_vg_log", mode=3, flags=0, lockid=0xeaafb2c8) at clvmd-cman.c:437
#3  0x08056750 in sync_lock (resource=0xfffffffc <Address 0xfffffffc out of bounds>, mode=-4, flags=-4, lockid=0xfffffffc) at clvmd.c:1987
#4  0x08052fbb in do_pre_command (client=0xf459efb8) at clvmd-command.c:224
#5  0x080537cd in pre_and_post_thread (arg=0xf459efb8) at clvmd.c:1418
#6  0x004703cc in start_thread () from /lib/tls/libpthread.so.0
#7  0x003061ae in mremap () from /lib/tls/libc.so.6

Thread 7 (process 8337):
#0  0x006907a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00472cf6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x0805b4b4 in _sync_lock (resource=0xf12fcfbb "V_vg_log", mode=3, flags=0, lockid=0xeb4fc2c8) at clvmd-cman.c:437
#3  0x08056750 in sync_lock (resource=0xfffffffc <Address 0xfffffffc out of bounds>, mode=-4, flags=-4, lockid=0xfffffffc) at clvmd.c:1987
#4  0x08052fbb in do_pre_command (client=0xf45b0760) at clvmd-command.c:224
#5  0x080537cd in pre_and_post_thread (arg=0xf45b0760) at clvmd.c:1418
#6  0x004703cc in start_thread () from /lib/tls/libpthread.so.0
#7  0x003061ae in mremap () from /lib/tls/libc.so.6

Thread 6 (process 15257):
#0  0x006907a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00472cf6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x0805b4b4 in _sync_lock (resource=0xf12fcce3 "V_vg_log", mode=3, flags=0, lockid=0xebefd2c8) at clvmd-cman.c:437
#3  0x08056750 in sync_lock (resource=0xfffffffc <Address 0xfffffffc out of bounds>, mode=-4, flags=-4, lockid=0xfffffffc) at clvmd.c:1987
#4  0x08052fbb in do_pre_command (client=0xf45b05f0) at clvmd-command.c:224
#5  0x080537cd in pre_and_post_thread (arg=0xf45b05f0) at clvmd.c:1418
#6  0x004703cc in start_thread () from /lib/tls/libpthread.so.0
#7  0x003061ae in mremap () from /lib/tls/libc.so.6

Thread 5 (process 18581):
#0  0x006907a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00472cf6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x0805b4b4 in _sync_lock (resource=0xf12fce2b "V_vg_log", mode=3, flags=0, lockid=0xec8fe2c8) at clvmd-cman.c:437
#3  0x08056750 in sync_lock (resource=0xfffffffc <Address 0xfffffffc out of bounds>, mode=-4, flags=-4, lockid=0xfffffffc)
    at clvmd.c:1987
#4  0x08052fbb in do_pre_command (client=0xf45c4678) at clvmd-command.c:224
#5  0x080537cd in pre_and_post_thread (arg=0xf45c4678) at clvmd.c:1418
#6  0x004703cc in start_thread () from /lib/tls/libpthread.so.0
#7  0x003061ae in mremap () from /lib/tls/libc.so.6

Thread 4 (process 2716):
#0  0x006907a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00472cf6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x0805b4b4 in _sync_lock (resource=0xf45c4663 "V_vg_log", mode=3, flags=0, lockid=0xefaff2c8) at clvmd-cman.c:437
#3  0x08056750 in sync_lock (resource=0xfffffffc <Address 0xfffffffc out of bounds>, mode=-4, flags=-4, lockid=0xfffffffc) at clvmd.c:1987
#4  0x08052fbb in do_pre_command (client=0xf45c4598) at clvmd-command.c:224
#5  0x080537cd in pre_and_post_thread (arg=0xf45c4598) at clvmd.c:1418
#6  0x004703cc in start_thread () from /lib/tls/libpthread.so.0
#7  0x003061ae in mremap () from /lib/tls/libc.so.6

Thread 3 (process 2703):
#0  0x006907a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00472cf6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x0805b4b4 in _sync_lock (resource=0xf12fcf93 "V_vg_log", mode=3, flags=0, lockid=0xed2ff2c8) at clvmd-cman.c:437
#3  0x08056750 in sync_lock (resource=0xfffffffc <Address 0xfffffffc out of bounds>, mode=-4, flags=-4, lockid=0xfffffffc) at clvmd.c:1987
#4  0x08052fbb in do_pre_command (client=0xf12fcd60) at clvmd-command.c:224
#5  0x080537cd in pre_and_post_thread (arg=0xf12fcd60) at clvmd.c:1418
#6  0x004703cc in start_thread () from /lib/tls/libpthread.so.0
#7  0x003061ae in mremap () from /lib/tls/libc.so.6

Thread 2 (process 23463):
#0  0x006907a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00472cf6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x08054edc in lvm_thread_fn (arg=0x0) at clvmd.c:1777
#3  0x004703cc in start_thread () from /lib/tls/libpthread.so.0
#4  0x003061ae in mremap () from /lib/tls/libc.so.6

Thread 1 (process 23462):
#0  0x006907a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x0047569b in __read_nocancel () from /lib/tls/libpthread.so.0
#2  0x0809d722 in do_dlm_dispatch ()
#3  0x0809e037 in dlm_recv_thread ()
#4  0x004703cc in start_thread () from /lib/tls/libpthread.so.0
#5  0x003061ae in mremap () from /lib/tls/libc.so.6
(gdb) 
(gdb)

Comment 4 Christine Caulfield 2009-02-03 10:06:23 UTC
What seems to be happening is that an LVM command quits while its clvmd thread is waiting for a lock.

This means that the main loop gets stuck in pthread_join trying to tidy up, and cannot service any more connect requests.

That brings up two more questions, though:

1. Why does the lock never seem to get granted (assuming that clvmd waits forever, which I'm inferring here)? This is a VG lock, so it's a pretty seriously large lock for it never to be released! If we can solve this one I expect the rest will go away.

2. Why does lvm allow the thread to exit while waiting for a lock when all signals are blocked? The only way I can see this happening is via "kill -9". Is this happening?


So I suppose the next thing to do is to get a DLM lock dump of the clvmd lockspace. Then try to ascertain what the process holding the lock is doing.

By "process holding the lock" I don't really mean clvmd ... I know that's the actual PID holding the lock. I mean the lvm command that is supposedly doing something useful under protection of that lock.

Comment 6 Christine Caulfield 2009-02-04 14:04:30 UTC
Thanks. The lockdump on its own isn't information as such, though; it's just a way of locating the node that has the hung command.

From there we need to know which command is hanging and why. Most likely it'll be the longest running LVM command on the system ... unless it's been killed with -9 too :S

Comment 10 Christine Caulfield 2009-02-23 10:19:32 UTC
Do we have any more information about this? I can't reproduce it at all. What commands are they running on these LVs? Ideally I need to know the first command that seems to hang, as that will be the one that's holding the VG lock.

THEN we need to know why that particular command is hanging. But without knowing what it is, that's almost impossible.

Comment 11 Gerald Marie-Nelly 2009-03-03 16:45:32 UTC
Created attachment 333895 [details]
 locks for node psu250 27th Feb 09

Comment 12 Gerald Marie-Nelly 2009-03-03 16:46:13 UTC
Created attachment 333896 [details]
 locks for node psu252 27th Feb 09

Comment 19 David Teigland 2009-03-11 17:14:53 UTC
If the lockspace is stuck in recovery, that could possibly explain the locks
on the waiting queue.  Otherwise, I don't have any great suggestions, I'd be
suspicious of the kill -9 causing it somehow.  The way that dlm_close() clears
locks after kill may lead to some problems, especially if it tries to cancel.
So, collecting info before killing clvmd seems like an important step to avoid
mixing the effects of the original problem with the possible effects of the kill.

Comment 20 tubesenf 2009-03-13 17:17:27 UTC
Hi,
it seems that we are facing the same issue.
We are running RHEL 5.2 / 5.3, and sometimes our cluster LVM gets stuck and we are not able to modify any LV.

I opened SR 1883878, but the workaround will always cause downtime, because we have to stop all applications before killing clvmd and removing the LVM cache.

So I'm really interested in what is causing the problem and how we can avoid/fix it.

Greetings,
Daniel

Comment 49 Curtis 2009-08-05 16:37:30 UTC
I am also having this same issue on Red Hat 5.3.

I sent this email to the linux-cluster list. We do not have an entitlement to lvm2-cluster, so I am using the one from CentOS, but everything else is from RHEL 5.

Thanks,
Curtis.

=====

Hi,

I've been experiencing hangs with clvmd (and all associated commands: lvs,
vgdisplay, etc.) on a 3-node cluster.

I am using cman and clvmd to create clustered LVs from a SAN backend.

I recently upgraded from 5.2 to 5.3 because occasionally commands began to
hang, and would not work until I rebooted either one or two of the nodes, and
there seemed to be several fixes in 5.3.

These are the versions of cman and lvm2-cluster I am using. I'm using the
lvm2-cluster from CentOS, rather than Red Hat.

[r...@cluster6 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 5.3 (Tikanga)
[r...@cluster6 ~]# rpm -qa cman
cman-2.0.98-1.el5_3.4
[r...@cluster6 ~]# rpm -qa lvm2-cluster
lvm2-cluster-2.02.40-7.el5 <- from centos packages

and these are the nodes in the cluster:

[r...@cluster6 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M   2336   2009-08-04 14:39:42  cluster7
   2   M   2336   2009-08-04 14:39:42  cluster5
   3   M   2332   2009-08-04 14:39:19  cluster6
[r...@cluster6 ~]# cman_tool services
type             level name     id       state       
fence            0     default  00010003 none        
[1 2 3]
dlm              1     clvmd    00020003 none        
[1 2 3]

and this is what's hanging after "service clvmd start"

 7056 pts/0    S+     0:00 /usr/sbin/vgscan

it just sits there forever, and I can't run any "lvm"-type commands. If I kill
it, the clvmd init script moves on to vgchange -ayl, which also hangs.

Any thoughts? Is there some kind of human-error misconfiguration that could be
going on here?

Thanks,
Curtis.

Comment 52 Christine Caulfield 2009-10-16 10:02:20 UTC
I've managed to reproduce something that looks very similar to this on RHEL4, and it points to a DLM problem.

I started a 4 node cluster with 5 PVs containing 2 VGs.

on all nodes run
# while [ true ]; do lvs; done > /dev/null &

then repeatedly run 
# vgscan

on all nodes until it hangs on one or more nodes.

Looking at /proc/cluster/dlm_locks for clvmd I see this:

Resource 000001003add6de8 (parent 0000000000000000). Name (len=7) "V_guest"  
Master Copy
Granted Queue
Conversion Queue
Waiting Queue
00070050 -- (PR) 7930  LQ: 0,0x0
00020392 -- (PR) 7704 Remote:   3 00040302  LQ: 0,0x0
000503dc -- (PR) 8018  LQ: 0,0x0
0003016c -- (PR) 7908 Remote:   4 000502f1  LQ: 0,0x0
0004027a -- (PR) 7288 Remote:   1 0003027e  LQ: 0,0x0

i.e. all locks are waiting and none are granted.

/proc/cluster/dlm_debug has this (on the master node):

clvmd send einval to 3
clvmd (7860) req reply einval 7001f fr 4 r 4 V_VolGroup00
clvmd send einval to 4
clvmd send einval to 3
clvmd (7888) req reply einval 70275 fr 3 r 3 V_guest
clvmd (7900) req reply einval 500cf fr 3 r 3 V_guest
clvmd send einval to 3
clvmd (7902) req reply einval 303ac fr 3 r 3 V_guest
clvmd send einval to 4
clvmd send einval to 3
clvmd (7912) req reply einval 602a3 fr 3 r 3 V_test2
clvmd send einval to 3
clvmd send einval to 1
clvmd (7922) req reply einval 600dc fr 4 r 4 V_guest
clvmd (7924) req reply einval 40084 fr 1 r 1 V_VolGroup00
clvmd send einval to 3
clvmd send einval to 3
clvmd send einval to 4
clvmd (7954) req reply einval 600ca fr 4 r 4 V_guest
clvmd send einval to 3
clvmd send einval to 4
clvmd send einval to 3
clvmd (7992) req reply einval 500e3 fr 4 r 4 V_test2
clvmd send einval to 3
clvmd (7996) req reply einval 602f1 fr 3 r 3 V_test2
clvmd send einval to 4
clvmd (7930) req reply einval 70050 fr 4 r 4 V_guest
clvmd (8018) req reply einval 503dc fr 4 r 0 V_guest

So all LVM commands are hung waiting for the volume group lock V_guest.

Comment 53 David Teigland 2009-10-16 18:59:06 UTC
Created attachment 365083 [details]
dlm debugging patch

I'm hoping this will tell us whether the locks are waiting because:

1. there was a granted lock that has since been removed, but we somehow
missed granting the waiting locks afterward.

2. all these locks were on the waiting list not because there was a
granted lock, but because the dlm never established valid master nodeid,
and once the master became the local node, those waiting locks were never
processed.  (i.e. related to the einval nonsense.)

Comment 54 David Teigland 2009-10-20 21:07:29 UTC
I reproduced this with the looping lvs plus some repeated parallel vgscans.  The fast looping of the first command probably allows the second command to overlap and race in just the right way.

The looping is enough to stress the einval/mastering problems by itself, and the second command combines this with a second area of weakness: multiple outstanding requests on one resource from the same node (gfs never does this).
The two local requests below are from 11404 and 11406.

The workaround for the looping problems is conversions to/from NL, which sounds too invasive for clvmd.  But what about serializing the requests in clvmd to work around the problems with handling parallel requests?  I will continue looking for the specific dlm problem, of course.

Resource 0000010016a53ca0 (parent 0000000000000000). Name (len=9) "V_cluster"
Master Copy 4295570810
Granted Queue last add 0 0 last rem 0 0
Conversion Queue
Waiting Queue
00140162 -- (PR) 11404 4295570810  LQ: 0,0x0
001702c2 -- (PR) 11406 4295570810  LQ: 0,0x0
000c004e -- (PR) 7850 4295570811 Remote:   1 000c03b2  LQ: 0,0x0
0017001d -- (PR) 7236 4295570854 Remote:   4 000e0144  LQ: 0,0x0
00110378 -- (PR) 8364 4295570941 Remote:   2 000e01a4  LQ: 0,0x0

clvmd (11404) req reply einval 140162 fr 2 r 2 V_cluster
clvmd (11406) req reply einval 1702c2 fr 2 r 0 V_cluster

Comment 55 Christine Caulfield 2009-10-21 12:35:53 UTC
Is it just me, or does it sound a bit odd being asked to serialise access to a DLM?!

Comment 56 Christine Caulfield 2009-10-21 13:28:17 UTC
The only thing I can think of for clvmd that might work around this is to keep an NL lock for each VG as it's accessed. That way we wouldn't get the ping-ponging of locks across nodes.

It's a grotesque thing to do and we can't let that code into RHEL5 or later ... it would have to be a separate patch, outside of CVS.
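
Purely as an illustration of that idea, and not clvmd code: the cache structure, the hold_nl_lock_for_vg() helper and the MODE_NL value below are hypothetical, and sync_lock() is only a stub echoing the name of the clvmd helper visible in the backtraces above.

/* Hypothetical sketch of comment 56: take and keep an NL lock per VG
 * so the resource master stays put instead of ping-ponging. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MODE_NL 0   /* assumed null-lock mode value */

/* Stub standing in for clvmd's sync_lock(); it just pretends the DLM
 * granted the request. */
static int sync_lock(const char *resource, int mode, int flags, int *lockid)
{
    (void)flags;
    printf("lock %s mode=%d\n", resource, mode);
    *lockid = 42;
    return 0;
}

struct vg_nl_entry {
    char name[64];
    int lockid;
    struct vg_nl_entry *next;
};

static struct vg_nl_entry *nl_cache;

/* Take an NL lock the first time a VG is touched and keep it for the
 * life of the daemon. */
static void hold_nl_lock_for_vg(const char *vg_name)
{
    char resource[64];
    struct vg_nl_entry *e;

    snprintf(resource, sizeof(resource), "V_%s", vg_name);

    for (e = nl_cache; e; e = e->next)
        if (strcmp(e->name, resource) == 0)
            return;                       /* NL lock already held */

    e = calloc(1, sizeof(*e));
    strcpy(e->name, resource);
    sync_lock(e->name, MODE_NL, 0, &e->lockid);
    e->next = nl_cache;
    nl_cache = e;
}

int main(void)
{
    hold_nl_lock_for_vg("guest");
    hold_nl_lock_for_vg("guest");         /* no-op: already cached */
    hold_nl_lock_for_vg("VolGroup00");
    return 0;
}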

Comment 58 David Teigland 2009-10-21 20:24:22 UTC
We've narrowed down what's happening in the dlm most of the way, but not all the way yet.  We know what's happening, just not the specific lines of code that are racing.  Once we've found that, we should be able to fix or work around the issue in the dlm.  We're also looking at a specific workaround in clvmd in case the dlm workaround isn't sufficient.

Comment 60 David Teigland 2009-10-21 23:37:55 UTC
Resource 00000100194e6e00 (parent 0000000000000000). Name (len=12) "V_VolGroup00
Master Copy 4296208138
Granted Queue last add 0 0 last rem 0 0
Conversion Queue
Waiting Queue
002b0249 -- (PR) 23450 4296208138  LQ: 0,0x0
002d00b7 -- (PR) 23453 4296208138  LQ: 0,0x0
00340170 -- (PR) 12518 4296208153 Remote:   1 001e03fd  LQ: 0,0x0
003000d4 -- (PR) 12280 4296208200 Remote:   4 001902ce  LQ: 0,0x0
00240021 -- (PR) 15551 4296208265 Remote:   2 00270221  LQ: 0,0x0

clvmd (23450) 4296208138 req reply einval 2b0249 fr 2 r 2 V_Vol
clvmd (23453) 4296208138 req reply einval 2d00b7 fr 2 r 0 V_Vol

All at 4296208138: both local locks get einval replies, get put on waiting queue and resource nodeid gets set to 0.  Now adding prints through the code around the einval replies to get a trace of exactly what path we're going through.


Resource 000001001bfa8a20 (parent 0000000000000000). Name (len=9) "V_cluster"  
Master Copy 4295468866
Granted Queue last add 0 0 last rem 0 0
Conversion Queue
Waiting Queue
0003005a -- (PR) 5019 4295468866  LQ: 0,0x0
00030240 -- (PR) 5113 4295468895 Remote:   1 000402fe  LQ: 0,0x0
000103f2 -- (PR) 5022 4295468906  LQ: 0,0x0
00040003 -- (PR) 4531 4295468941 Remote:   4 000301bf  LQ: 0,0x0
0005029e -- (PR) 4716 4295468994 Remote:   2 000103fa  LQ: 0,0x0

clvmd (5020) 4295468810 rrep einval 203d9 4 fr 4 r 4 V_cluster
clvmd rr 203d9 b
clvmd rr 203d9 m 0
clvmd (5019) 4295468849 rrep einval 30086 2 fr 2 r 2 V_VolGroup
clvmd rr 30086 b
clvmd rr 30086 m 0
clvmd (5019) 4295468866 rrep einval 3005a 2 fr 2 r 2 V_cluster
clvmd rr 3005a b
clvmd rr 3005a m 0
clvmd send einval to 4
clvmd (5022) 4295468906 rrep einval 103f2 2 fr 2 r 0 V_cluster
clvmd rr 103f2 c

Comment 61 David Teigland 2009-10-21 23:41:38 UTC
Created attachment 365621 [details]
debugging patch

patch with debug printks for prev comment
(and a bunch of other locking junk that I tried but haven't removed)

Comment 62 David Teigland 2009-10-22 22:25:28 UTC
I believe I've identified the dlm problem and have a patch that fixes the test above, but the patch will require a lot of testing because it's at the heart of all lock operations and could have unknown negative effects.  We may want to consider an option to enable/disable the new code, or limit it to the clvmd lockspace.

The problem is in the _can_be_granted() logic regarding the waitqueue.
Normally, a lock request cannot be granted if there's another lock already on the waitqueue.  This gives FIFO fairness.  One of the big problems in the RHEL4 dlm is that the waitqueue is misused, and locks are kept there while waiting on a remote lock request.  This can mess up the normal lock granting logic.

Here is my current understanding of specifically what's going wrong:

Requests for lkb1 and lkb2 are both sent to another node we think is master.
Both are kept on the rsb waitqueue while waiting for a reply.
The remote node ends up not being the master and returns EINVAL for lkb1.

Local processing of lkb1 after getting the EINVAL reply:
- lkb1 is removed from the waitqueue
- we look up the master again (resdir is local for this rsb)
- the master ends up being us
- rsb nodeid set to 0
- lkb1 is passed into dlm_lock_stage2 and dlm_lock_stage3
- lkb1 is passed into _can_be_granted
- _can_be_granted sees lkb2 on the waitqueue so says lkb1 cannot be granted
- so lkb1 is added to the waitqueue (in the proper sense, i.e. not because it's waiting for a remote reply but because of the master granting logic)

Local processing of lkb2 after getting its EINVAL reply:
- lkb2 removed from the waitqueue
- we see we are now the master, rsb nodid is 0
- lkb2 is passed into dlm_lock_stage2 and dlm_lock_stage3
- lkb2 is passed into _can_be_granted
- _can_be_granted sees lkb1 on the waitqueue so says lkb2 cannot be granted
- so lkb2 is added to the waitqueue (again in the proper sense)

Other lock requests then arrive for this rsb, and all continue to be added to the waitqueue because it's not empty.

The patch I've tried, and that seems to fix this one specific test, is to make _can_be_granted() return TRUE for locks, like lkb1, that are being tested by the grant logic for the first time since being requested, i.e. that are not already on the waitqueue.


@@ -328,6 +333,33 @@ static int _can_be_granted(struct dlm_rsb *r, struct dlm_lk
            list_empty(&r->res_waitqueue))
                return TRUE;
 
+       if (now && !conv && list_empty(&r->res_convertqueue) &&
+           list_empty(&r->res_grantqueue))
+               return TRUE;
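
For readability, here is a simplified, self-contained rendering of the grant decision described above. The queue field names mirror the diff, but everything else is a sketch under assumed simplifications rather than the actual RHEL4 dlm code.

/* Simplified sketch of the _can_be_granted() change from comment 62;
 * not the real dlm code.  'now' means the lock is being tested by the
 * grant logic for the first time since being requested, i.e. it is not
 * already parked on the waitqueue. */
#include <stdbool.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h)        { h->next = h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

struct dlm_rsb {
    struct list_head res_grantqueue;
    struct list_head res_convertqueue;
    struct list_head res_waitqueue;
};

static bool can_be_granted(struct dlm_rsb *r, bool now, bool conv,
                           bool modes_compatible)
{
    /* Original FIFO rule: grant only if nothing else is queued ahead. */
    if (modes_compatible &&
        list_empty(&r->res_convertqueue) &&
        list_empty(&r->res_waitqueue))
        return true;

    /* New rule added by the patch: a first-time, non-conversion request
     * may be granted even though the waitqueue is non-empty, because on
     * RHEL4 that queue may only be holding locks parked while awaiting
     * remote replies (the misuse described above), not locks genuinely
     * blocked behind a granted or converting lock. */
    if (now && !conv &&
        list_empty(&r->res_convertqueue) &&
        list_empty(&r->res_grantqueue))
        return true;

    return false;
}

int main(void)
{
    struct dlm_rsb r;
    struct list_head parked;   /* lkb2: parked awaiting a remote reply */

    list_init(&r.res_grantqueue);
    list_init(&r.res_convertqueue);
    list_init(&r.res_waitqueue);

    /* Put one parked lock on the waitqueue, as in the lkb1/lkb2 race:
     * the old rule refuses lkb1, the new rule grants it because nothing
     * is actually granted or converting. */
    parked.next = parked.prev = &r.res_waitqueue;
    r.res_waitqueue.next = r.res_waitqueue.prev = &parked;

    return can_be_granted(&r, true, false, true) ? 0 : 1;
}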

Comment 64 David Teigland 2009-10-23 21:41:13 UTC
Created attachment 365897 [details]
patch that doesn't work yet...

This patch applies the new _can_be_granted() code to all user lockspaces.
In case it actually breaks some userspace app, it can be disabled with:
echo 0 > /proc/cluster/config/dlm/user_grant_now

(patch actually oopses in dlm_write and I don't see any obvious reason; will have to fix that up next week)

Comment 65 Debbie Johnson 2009-10-27 14:46:23 UTC
Dave and Chrissie,

Any update on this one?

Debbie

Comment 69 David Teigland 2009-10-29 17:37:13 UTC
Created attachment 366668 [details]
patch that works

This patch works and seems to fix the problem with the following test case, based on what Chrissie found to reproduce the bug.

4 nodes all running while [ true ]; do lvs; done > /dev/null &
and then also running vgscan every 1 to 6 seconds (with random
delay between each iteration).

Without the patch this test ran for around 6 iterations before hitting
the bug.

Comment 70 David Teigland 2009-10-29 17:41:05 UTC
Patch from comment 69 pushed to the cluster.git RHEL4 branch:

http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commit;h=18411e7ca35ad8b970bb57192c89ef2b067c8432

We will want to pay close attention for regressions in the next QE cycle that includes this patch.

Comment 74 Carlos Maiolino 2009-11-26 13:57:54 UTC
Hi Dave,

I'm experiencing this problem with a RHEL 5.4 cluster.

Do you know if this patch can solve the same problem on RHEL 5.4 too?

Thanks

Comment 75 David Teigland 2009-11-30 15:26:25 UTC
The code and bug in question only exist in RHEL4.  Similar symptoms on RHEL5 would be due to something else.

Comment 79 Jaroslav Kortus 2010-01-06 17:10:19 UTC
I'd say the fix is not complete. 

With /proc/cluster/config/dlm/user_grant_now == 0 and the "hangscript" described in the bug (lvs in a while loop + vgscan) running, it freezes as expected.

With /proc/cluster/config/dlm/user_grant_now set to 1 I'm also able to reproduce the bug (much more easily with 3 script instances on each of 3 nodes).

Error printed to terminal: 
Error reading data from clvmd: Connection reset by peer 
Can't lock vedder: skipping 
Error writing data to clvmd: Broken pipe 
Reading all physical volumes. This may take a while... 
After this there is a high probability (70%+) of a cluster-wide freeze.
Tested on: Linux a1 2.6.9-89.0.16.EL #1 SMP Tue Oct 27 03:38:53 EDT 2009 ia64 ia64 ia64 GNU/Linux 
dlm-kernel-2.6.9-58.6.el4_8.9

After keeping it running for a while in the "frozen" state, one of the nodes (a1) got a kernel panic and was fenced; it rebooted and did not join the cluster (correctly configured, expected behaviour). So there were 2 nodes left and the operation continued for another ~25 minutes. Then a2 got a kernel panic (the same problem as a1).

Relevant logging (a1 console log):
reply einval 50c03d3 fr 3 r -1 V_vedder
clvmd (2868) req reply einval 52b017d fr 3 r 3 V_VolGroup00
clvmd (2869) req reply einval 4eb0326 fr 3 r 0 V_VolGroup00
clvmd (2875) req reply einval 51301b6 fr 3 r 3 V_VolGroup00
clvmd (2882) req reply einval 4e203b3 fr 2 r 2 V_VolGroup00
clvmd (2878) req reply einval 4c00214 fr 3 r 0 V_VolGroup00
clvmd send einval to 3
clvmd send einval to 3
clvmd (2907) req reply einval 50301c2 fr 3 r 3 V_VolGroup00
clvmd send einval to 2
clvmd (2951) req reply einval 52f0245 fr 3 r 3 V_VolGroup00
clvmd (3011) req reply einval 4bb0183 fr 2 r 2 V_vedder
clvmd (3097) req reply einval 5500319 fr 3 r 3 V_VolGroup00
clvmd send einval to 2
clvmd (3182) req reply einval 4fb00db fr 2 r 2 V_VolGroup00
clvmd (3206) req reply einval 52f01e9 fr 2 r 2 V_VolGroup00
clvmd (3207) req reply einval 4ec013f fr 2 r 0 V_VolGroup00
clvmd (3214) req reply einval 4fe03bf fr 3 r 3 V_VolGroup00
clvmd (3215) req reply einval 4e200f8 fr 3 r 0 V_VolGroup00
clvmd (3234) req reply einval 53703c4 fr 3 r 3 V_vedder

DLM:  Assertion failed on line 659 of file /builddir/build/BUILD/dlm-kernel-2.6.9-58/up/src/locking.c
DLM:  assertion:  "rsb->res_nodeid != -1"
DLM:  time = 4298100099
dlm: lkb
id 4d503b0
remid 0
flags 0
status 0
rqmode 3
grmode -1
nodeid 3
lqstate 0
lqflags 0
dlm: rsb
name "V_vedder"
nodeid 0
flags 2
ref 1

kernel BUG at /builddir/build/BUILD/dlm-kernel-2.6.9-58/up/src/locking.c:659!
clvmd[3235]: bugcheck! 0 [1]
Modules linked in: lock_dlm(U) dm_cmirror(U) gnbd(U) lock_nolock(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc ds yenta_socket pcmcia_core vfat fat button ohci_hcd ehci_hcd e1000 dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod mptscsih mptsas lpfc scsi_transport_fc mptspi mptscsi mptbase sd_mod scsi_mod

Pid: 3235, CPU 1, comm:                clvmd
psr : 0000101008126010 ifs : 800000000000038c ip  : [<a000000200812da0>]    Not tainted
ip is at dlm_lock_stage2+0x1a0/0x1c0 [dlm]
unat: 0000000000000000 pfs : 000000000000038c rsc : 0000000000000003
rnat: a000000100804cec bsps: a00000010068f838 pr  : 00209a0005a99955
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a000000200812da0 b6  : a00000010026c7a0 b7  : a00000010026c7a0
f6  : 1003e00000000000000a0 f7  : 1003e0000000000000001
f8  : 1003e00000000000000a0 f9  : 10002a000000000000000
f10 : 0fffe9999999996900000 f11 : 1003e0000000000000000
r1  : a0000001009f13f0 r2  : 0000000000005094 r3  : a0000001007f1a70
r8  : 000000000000004e r9  : a000000100781bc0 r10 : a000000100781bb8
r11 : a0000001007f1648 r12 : e000000104277db0 r13 : e000000104270000
r14 : 0000000000004000 r15 : a00000010079a8c8 r16 : 0000000000000001
r17 : a0000001008058b8 r18 : 00000000000000ff r19 : e0000000010174e0
r20 : a000000100a72eb8 r21 : a000000100a6e610 r22 : 0000000000000000
r23 : a000000100a6e610 r24 : a000000100a6e610 r25 : 0000000000000000
r26 : a0000001008058c0 r27 : 0000001008126010 r28 : e000000003450504
r29 : 0000000000000000 r30 : 0000000000000000 r31 : a000000100804cc8

Call Trace:
 [<a000000100016e40>] show_stack+0x80/0xa0
                                sp=e000000104277920 bsp=e0000001042713d0
 [<a000000100017750>] show_regs+0x890/0x8c0
                                sp=e000000104277af0 bsp=e000000104271388
 [<a00000010003ecf0>] die+0x150/0x240
                                sp=e000000104277b10 bsp=e000000104271348
 [<a00000010003ee20>] die_if_kernel+0x40/0x60
                                sp=e000000104277b10 bsp=e000000104271318
 [<a00000010003efc0>] ia64_bad_break+0x180/0x600
                                sp=e000000104277b10 bsp=e0000001042712f0
 [<a00000010000f600>] ia64_leave_kernel+0x0/0x260
                                sp=e000000104277be0 bsp=e0000001042712f0
 [<a000000200812da0>] dlm_lock_stage2+0x1a0/0x1c0 [dlm]
                                sp=e000000104277db0 bsp=e000000104271290
 [<a0000002008131c0>] dlm_lock_stage1+0x400/0x6e0 [dlm]
                                sp=e000000104277db0 bsp=e000000104271240
 [<a000000200814310>] dlm_lock+0x5d0/0x1520 [dlm]
                                sp=e000000104277dc0 bsp=e0000001042711a8
 [<a00000020080c0a0>] do_user_lock+0x3a0/0x880 [dlm]
                                sp=e000000104277dc0 bsp=e000000104271150
 [<a00000020080d650>] dlm_write+0xa10/0x1260 [dlm]
                                sp=e000000104277e00 bsp=e0000001042710e0
 [<a00000010012a7f0>] vfs_write+0x290/0x360
                                sp=e000000104277e20 bsp=e000000104271090
 [<a00000010012aa10>] sys_write+0x70/0xe0
                                sp=e000000104277e20 bsp=e000000104271018
 [<a00000010000f4a0>] ia64_ret_from_syscall+0x0/0x20


[root@a1 ~]# rpm -qva | grep cman
cman-kernel-debuginfo-2.6.9-56.7.el4_8.9
cman-1.0.27-1.el4
cman-kernel-2.6.9-56.7.el4_8.9
cman-debuginfo-1.0.27-1.el4
cman-kernheaders-2.6.9-56.7.el4_8.9
cman-devel-1.0.27-1.el4
cman-kernel-largesmp-2.6.9-56.7.el4_8.9
[root@a1 ~]# rpm -qva | grep dlm
dlm-kernel-largesmp-2.6.9-58.6.el4_8.9
dlm-kernel-2.6.9-58.6.el4_8.9
dlm-devel-1.0.7-1
dlm-kernel-debuginfo-2.6.9-58.6.el4_8.9
dlm-1.0.7-1
dlm-debuginfo-1.0.3-1
dlm-kernheaders-2.6.9-58.6.el4_8.9
[root@a1 ~]# uname -a
Linux a1 2.6.9-89.0.16.EL #1 SMP Tue Oct 27 03:38:53 EDT 2009 ia64 ia64 ia64 GNU/Linux

Comment 80 David Teigland 2010-01-06 22:54:31 UTC
I believe the assertion failure is a different problem from the bug being fixed by this patch.  What I need to find out is whether you managed to get locks on the waiting queue with no locks on the granted queue, as shown in comment 54.

Comment 81 Jaroslav Kortus 2010-01-29 15:52:40 UTC
Dave, you are right.

The lock and queue problem is solved by the patch; there are no more queues with waiting but never-granted locks.

We have traced the problem down to clvmd, and it is being addressed as bug 559999.