Bug 483685
| Field | Value |
|---|---|
| Summary | lvm commands hang on connect() to clvmd |
| Product | [Retired] Red Hat Cluster Suite |
| Component | dlm-kernel |
| Version | 4 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED DUPLICATE |
| Severity | urgent |
| Priority | urgent |
| Reporter | Eduardo Damato (edamato) |
| Assignee | David Teigland (teigland) |
| QA Contact | Cluster QE (mspqa-list) |
| CC | agk, bmr, casmith, ccaulfie, cfeist, cluster-maint, cmaiolin, cmarthal, curtis, daniel.oberdick, dejohnso, djuran, dwysocha, edamato, grimme, jbrassow, jfenal, jkortus, jwilleford, mbroz, pdemauro, prockai, sbradley, schlegel, tao, teigland |
| Target Milestone | rc |
| Keywords | ZStream |
| Doc Type | Bug Fix |
| Last Closed | 2010-03-17 02:59:26 UTC |
| Bug Blocks | 546022, 564454 |
Description (Eduardo Damato, 2009-02-03 00:01:20 UTC)
As expected, vgscan blocks on the unfinished connect() syscall:

```
# vgscan -vvvvvv
...
0.000072 write(2, "#config/config.c:950 ", 23#config/config.c:950 ) = 23 <0.000014>
0.000069 write(2, "Setting global/locking_type to 3", 32Setting global/locking_type to 3) = 32 <0.000014>
0.000072 write(2, "\n", 1) = 1 <0.000013>
0.000058 write(2, "#locking/locking.c:245 ", 25#locking/locking.c:245 ) = 25 <0.000014>
0.000071 write(2, "Cluster locking selected.", 25Cluster locking selected.) = 25 <0.000014>
0.000072 write(2, "\n", 1) = 1 <0.000014>
0.000066 socket(PF_FILE, SOCK_STREAM, 0) = 4 <0.000018>
0.000069 connect(4, {sa_family=AF_FILE, path=@clvmd}, 110 <unfinished ...>
```

On the clvmd side, we do not see any process waiting in the accept() call:

```
# strace -f -T -r -p <pid of clvmd>
Process 1197 attached with 11 threads - interrupt to quit
[pid 29675] 0.000000 futex(0xedffabe8, FUTEX_WAIT, 1197, NULL <unfinished ...>
[pid 29706] 0.000077 read(8,  <unfinished ...>
[pid 29707] 0.000052 futex(0x80b13c4, FUTEX_WAIT, 3402281, NULL <unfinished ...>
[pid 2407] 0.000049 futex(0xf4dff1b4, FUTEX_WAIT, 1, NULL <unfinished ...>
[pid 2413] 0.000050 futex(0xf07fe1b4, FUTEX_WAIT, 1, NULL <unfinished ...>
[pid 4560] 0.000050 futex(0xf2fff1b4, FUTEX_WAIT, 1, NULL <unfinished ...>
[pid 20696] 0.000046 futex(0xf11ff1b4, FUTEX_WAIT, 1, NULL <unfinished ...>
[pid 17842] 0.000051 futex(0xefdfd1b4, FUTEX_WAIT, 1, NULL <unfinished ...>
[pid 19984] 0.000049 futex(0xee9fb1b4, FUTEX_WAIT, 1, NULL <unfinished ...>
[pid 25536] 0.000048 futex(0xef3fc1b4, FUTEX_WAIT, 1, NULL <unfinished ...>
[pid 1197] 0.000049 futex(0xedffa1b4, FUTEX_WAIT, 1, NULL <unfinished ...>
Process 29675 detached
Process 29706 detached
Process 29707 detached
Process 2407 detached
Process 2413 detached
Process 4560 detached
Process 20696 detached
Process 17842 detached
Process 19984 detached
Process 25536 detached
Process 1197 detached
```

FYI: this gdb trace is not from the same occurrence as the strace.

```
(gdb) thread apply all bt

Thread 10 (process 23461):
#0  0x006907a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00471297 in pthread_join () from /lib/tls/libpthread.so.0
#2  0x080546ab in local_sock_callback (thisfd=0xf459f100, buf=0xfee743f0 "\021", len=1500, csid=0xfee74a60 "8\004", new_client=0xfee74a44) at clvmd.c:921
#3  0x08055f07 in main (argc=1, argv=0xfee74f04) at clvmd.c:692

Thread 9 (process 14995):
#0  0x006907a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00472cf6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x0805b4b4 in _sync_lock (resource=0xf12fcee3 "V_vg_log", mode=3, flags=0, lockid=0xea0fa2c8) at clvmd-cman.c:437
#3  0x08056750 in sync_lock (resource=0xfffffffc <Address 0xfffffffc out of bounds>, mode=-4, flags=-4, lockid=0xfffffffc) at clvmd.c:1987
#4  0x08052fbb in do_pre_command (client=0xf459f100) at clvmd-command.c:224
#5  0x080537cd in pre_and_post_thread (arg=0xf459f100) at clvmd.c:1418
#6  0x004703cc in start_thread () from /lib/tls/libpthread.so.0
#7  0x003061ae in mremap () from /lib/tls/libc.so.6

Thread 8 (process 812):
#0  0x006907a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00472cf6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x0805b4b4 in _sync_lock (resource=0xf12fcfe3 "V_vg_log", mode=3, flags=0, lockid=0xeaafb2c8) at clvmd-cman.c:437
#3  0x08056750 in sync_lock (resource=0xfffffffc <Address 0xfffffffc out of bounds>, mode=-4, flags=-4, lockid=0xfffffffc) at clvmd.c:1987
#4  0x08052fbb in do_pre_command (client=0xf459efb8) at clvmd-command.c:224
#5  0x080537cd in pre_and_post_thread (arg=0xf459efb8) at clvmd.c:1418
#6  0x004703cc in start_thread () from /lib/tls/libpthread.so.0
#7  0x003061ae in mremap () from /lib/tls/libc.so.6

Thread 7 (process 8337):
#0  0x006907a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00472cf6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x0805b4b4 in _sync_lock (resource=0xf12fcfbb "V_vg_log", mode=3, flags=0, lockid=0xeb4fc2c8) at clvmd-cman.c:437
#3  0x08056750 in sync_lock (resource=0xfffffffc <Address 0xfffffffc out of bounds>, mode=-4, flags=-4, lockid=0xfffffffc) at clvmd.c:1987
#4  0x08052fbb in do_pre_command (client=0xf45b0760) at clvmd-command.c:224
#5  0x080537cd in pre_and_post_thread (arg=0xf45b0760) at clvmd.c:1418
#6  0x004703cc in start_thread () from /lib/tls/libpthread.so.0
#7  0x003061ae in mremap () from /lib/tls/libc.so.6

Thread 6 (process 15257):
#0  0x006907a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00472cf6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x0805b4b4 in _sync_lock (resource=0xf12fcce3 "V_vg_log", mode=3, flags=0, lockid=0xebefd2c8) at clvmd-cman.c:437
#3  0x08056750 in sync_lock (resource=0xfffffffc <Address 0xfffffffc out of bounds>, mode=-4, flags=-4, lockid=0xfffffffc) at clvmd.c:1987
#4  0x08052fbb in do_pre_command (client=0xf45b05f0) at clvmd-command.c:224
#5  0x080537cd in pre_and_post_thread (arg=0xf45b05f0) at clvmd.c:1418
#6  0x004703cc in start_thread () from /lib/tls/libpthread.so.0
#7  0x003061ae in mremap () from /lib/tls/libc.so.6

Thread 5 (process 18581):
#0  0x006907a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00472cf6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x0805b4b4 in _sync_lock (resource=0xf12fce2b "V_vg_log", mode=3, flags=0, lockid=0xec8fe2c8) at clvmd-cman.c:437
#3  0x08056750 in sync_lock (resource=0xfffffffc <Address 0xfffffffc out of bounds>, mode=-4, flags=-4, lockid=0xfffffffc) at clvmd.c:1987
#4  0x08052fbb in do_pre_command (client=0xf45c4678) at clvmd-command.c:224
#5  0x080537cd in pre_and_post_thread (arg=0xf45c4678) at clvmd.c:1418
#6  0x004703cc in start_thread () from /lib/tls/libpthread.so.0
#7  0x003061ae in mremap () from /lib/tls/libc.so.6

Thread 4 (process 2716):
#0  0x006907a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00472cf6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x0805b4b4 in _sync_lock (resource=0xf45c4663 "V_vg_log", mode=3, flags=0, lockid=0xefaff2c8) at clvmd-cman.c:437
#3  0x08056750 in sync_lock (resource=0xfffffffc <Address 0xfffffffc out of bounds>, mode=-4, flags=-4, lockid=0xfffffffc) at clvmd.c:1987
#4  0x08052fbb in do_pre_command (client=0xf45c4598) at clvmd-command.c:224
#5  0x080537cd in pre_and_post_thread (arg=0xf45c4598) at clvmd.c:1418
#6  0x004703cc in start_thread () from /lib/tls/libpthread.so.0
#7  0x003061ae in mremap () from /lib/tls/libc.so.6

Thread 3 (process 2703):
#0  0x006907a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00472cf6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x0805b4b4 in _sync_lock (resource=0xf12fcf93 "V_vg_log", mode=3, flags=0, lockid=0xed2ff2c8) at clvmd-cman.c:437
#3  0x08056750 in sync_lock (resource=0xfffffffc <Address 0xfffffffc out of bounds>, mode=-4, flags=-4, lockid=0xfffffffc) at clvmd.c:1987
#4  0x08052fbb in do_pre_command (client=0xf12fcd60) at clvmd-command.c:224
#5  0x080537cd in pre_and_post_thread (arg=0xf12fcd60) at clvmd.c:1418
#6  0x004703cc in start_thread () from /lib/tls/libpthread.so.0
#7  0x003061ae in mremap () from /lib/tls/libc.so.6

Thread 2 (process 23463):
#0  0x006907a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00472cf6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x08054edc in lvm_thread_fn (arg=0x0) at clvmd.c:1777
#3  0x004703cc in start_thread () from /lib/tls/libpthread.so.0
#4  0x003061ae in mremap () from /lib/tls/libc.so.6

Thread 1 (process 23462):
#0  0x006907a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x0047569b in __read_nocancel () from /lib/tls/libpthread.so.0
#2  0x0809d722 in do_dlm_dispatch ()
#3  0x0809e037 in dlm_recv_thread ()
#4  0x004703cc in start_thread () from /lib/tls/libpthread.so.0
#5  0x003061ae in mremap () from /lib/tls/libc.so.6
(gdb)
```

What seems to be happening is that an LVM command quits while its clvmd thread is waiting for a lock. This means that the main loop gets stuck in pthread_join trying to tidy up, and cannot service any more connect requests. That brings up two more questions, though:

1. Why does the lock never seem to get granted (assuming that clvmd waits forever, which I'm inferring here)? This is a VG lock, so it's a pretty seriously large lock never to be released! If we can solve this one, I expect the rest will go away.
2. Why does lvm allow the thread to exit while waiting for a lock when all signals are blocked? The only way I can see this happening is via "kill -9". Is this happening?

So I suppose the next thing to do is to get a DLM lock dump of the clvmd lockspace, then try to ascertain what the process holding the lock is doing. By "process holding the lock" I don't really mean clvmd (I know that's the actual PID holding the lock); I mean the lvm command that is supposedly doing something useful under the protection of that lock.

Thanks. The lockdump on its own isn't information as such, though; it's just a way of locating the node that has the hung command. From there we need to know which command is hanging and why. Most likely it'll be the longest-running LVM command on the system... unless it's been killed with -9 too :S

Do we have any more information about this? I can't reproduce it at all. What commands are they running on these LVs? Ideally I need to know the first command that seems to hang, as that will be the one that's holding the VG lock. THEN we need to know why that particular command is hanging. But without knowing what it is, it's almost impossible.

Created attachment 333895 [details]
locks for node psu250 27th Feb 09
Created attachment 333896 [details]
locks for node psu252 27th Feb 09
If the lockspace is stuck in recovery, that could possibly explain the locks on the waiting queue. Otherwise, I don't have any great suggestions; I'd be suspicious of the kill -9 causing it somehow. The way that dlm_close() clears locks after a kill may lead to some problems, especially if it tries to cancel. So, collecting info before killing clvmd seems like an important step, to avoid mixing the effects of the original problem with the possible effects of the kill.

Hi, it seems that we are facing the same issue. We are running RHEL 5.2 / 5.3, and sometimes our cluster LVM gets stuck and we are not able to modify any LV. I opened SR 1883878, but the workaround will always cause downtime, because we have to stop all applications before killing clvmd and removing the LVM cache. So I'm really interested in what is causing the problem and how we can avoid/fix it. Greetings, Daniel

I am also having this same issue on Red Hat 5.3. I sent this email to the linux-cluster list. We do not have an entitlement to lvm2-cluster, so I am using the one from CentOS, but everything else is from RHEL 5. Thanks, Curtis.

=====

Hi, I've been experiencing hangs with clvmd (and all associated commands: lvs, vgdisplay, etc.) on a 3 node cluster. I am using cman and clvmd to create clustered LVs from a SAN backend. I recently upgraded from 5.2 to 5.3 because occasionally commands began to hang and would not work until I rebooted 1 or 2 of the nodes, and there seemed to be several fixes in 5.3. These are the versions of cman and lvm2-cluster I am using (the lvm2-cluster is from CentOS, rather than Red Hat):

```
[r...@cluster6 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.3 (Tikanga)
[r...@cluster6 ~]# rpm -qa cman
cman-2.0.98-1.el5_3.4
[r...@cluster6 ~]# rpm -qa lvm2-cluster
lvm2-cluster-2.02.40-7.el5    <- from centos packages
```

and this is what nodes there are:

```
[r...@cluster6 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M   2336   2009-08-04 14:39:42  cluster7
   2   M   2336   2009-08-04 14:39:42  cluster5
   3   M   2332   2009-08-04 14:39:19  cluster6
[r...@cluster6 ~]# cman_tool services
type   level  name     id        state
fence  0      default  00010003  none
[1 2 3]
dlm    1      clvmd    00020003  none
[1 2 3]
```

and this is what's hanging after "service clvmd start":

```
7056 pts/0 S+ 0:00 /usr/sbin/vgscan
```

It just sits there forever, and I can't run any "lvm" type commands. If I kill it, the clvmd init script moves on to vgchange -ayl, which also hangs. Any thoughts? Is there some kind of human-error misconfiguration that could be going on here? Thanks, Curtis.

I've managed to reproduce something that looks very similar to this on RHEL4, and it points to a DLM problem. I started a 4 node cluster with 5 PVs containing 2 VGs. On all nodes run

```
# while [ true ]; do lvs; done > /dev/null &
```

then repeatedly run

```
# vgscan
```

on all nodes until it hangs on one or more nodes. Looking at /proc/cluster/dlm_locks for clvmd I see this:

```
Resource 000001003add6de8 (parent 0000000000000000). Name (len=7) "V_guest"
Master Copy
Granted Queue
Conversion Queue
Waiting Queue
00070050 -- (PR) 7930 LQ: 0,0x0
00020392 -- (PR) 7704 Remote: 3 00040302 LQ: 0,0x0
000503dc -- (PR) 8018 LQ: 0,0x0
0003016c -- (PR) 7908 Remote: 4 000502f1 LQ: 0,0x0
0004027a -- (PR) 7288 Remote: 1 0003027e LQ: 0,0x0
```

i.e. all locks are waiting and none are granted. /proc/cluster/dlm_debug has this (on the master node):

```
clvmd send einval to 3
clvmd (7860) req reply einval 7001f fr 4 r 4 V_VolGroup00
clvmd send einval to 4
clvmd send einval to 3
clvmd (7888) req reply einval 70275 fr 3 r 3 V_guest
clvmd (7900) req reply einval 500cf fr 3 r 3 V_guest
clvmd send einval to 3
clvmd (7902) req reply einval 303ac fr 3 r 3 V_guest
clvmd send einval to 4
clvmd send einval to 3
clvmd (7912) req reply einval 602a3 fr 3 r 3 V_test2
clvmd send einval to 3
clvmd send einval to 1
clvmd (7922) req reply einval 600dc fr 4 r 4 V_guest
clvmd (7924) req reply einval 40084 fr 1 r 1 V_VolGroup00
clvmd send einval to 3
clvmd send einval to 3
clvmd send einval to 4
clvmd (7954) req reply einval 600ca fr 4 r 4 V_guest
clvmd send einval to 3
clvmd send einval to 4
clvmd send einval to 3
clvmd (7992) req reply einval 500e3 fr 4 r 4 V_test2
clvmd send einval to 3
clvmd (7996) req reply einval 602f1 fr 3 r 3 V_test2
clvmd send einval to 4
clvmd (7930) req reply einval 70050 fr 4 r 4 V_guest
clvmd (8018) req reply einval 503dc fr 4 r 0 V_guest
```

So all LVM commands are hung waiting for the volume group lock V_guest.

Created attachment 365083 [details]
dlm debugging patch
I'm hoping this will tell us whether the locks are waiting because:
1. there was a granted lock that has since been removed, but we somehow
missed granting the waiting locks afterward.
2. all these locks were on the waiting list not because there was a
granted lock, but because the dlm never established a valid master nodeid,
and once the master became the local node, those waiting locks were never
processed (i.e. related to the einval nonsense).
I reproduced with the looping lvs and repeating some parallel vgscans. The fast looping of the first command probably allows the second command to overlap and race in just the right way. The looping is enough to stress the einval/mastering problems by itself, and the second command combines this with a second area of weakness: multiple outstanding requests on one resource from the same node (gfs never does this). The two local requests below are from 11404 and 11406. The workaround for the looping problems is conversions to/from NL, which sounds too invasive for clvmd. But what about serializing the requests in clvmd to work around the problems handling parallel requests? Will continue looking for the specific dlm problem, of course.

```
Resource 0000010016a53ca0 (parent 0000000000000000). Name (len=9) "V_cluster"
Master Copy 4295570810
Granted Queue last add 0 0 last rem 0 0
Conversion Queue
Waiting Queue
00140162 -- (PR) 11404 4295570810 LQ: 0,0x0
001702c2 -- (PR) 11406 4295570810 LQ: 0,0x0
000c004e -- (PR) 7850 4295570811 Remote: 1 000c03b2 LQ: 0,0x0
0017001d -- (PR) 7236 4295570854 Remote: 4 000e0144 LQ: 0,0x0
00110378 -- (PR) 8364 4295570941 Remote: 2 000e01a4 LQ: 0,0x0

clvmd (11404) req reply einval 140162 fr 2 r 2 V_cluster
clvmd (11406) req reply einval 1702c2 fr 2 r 0 V_cluster
```

Is it just me, or does it sound a bit odd being asked to serialise access to a DLM?! The only thing I can think of for clvmd that might work around this is to keep a NL lock for each VG as it's accessed; that way we wouldn't get the ping-ponging of locks across nodes. It's a grotesque thing to do, and we can't let that code into RHEL5 or later... it would have to be a separate patch, outside of CVS.

We've narrowed down what's happening in the dlm most of the way, but not all the way yet. We know what's happening, just not the specific lines of code that are racing. Once we've found that, we should be able to fix or work around the issue in the dlm.
We're also looking at a specific workaround in clvmd in case the dlm workaround isn't sufficient.

```
Resource 00000100194e6e00 (parent 0000000000000000). Name (len=12) "V_VolGroup00
Master Copy 4296208138
Granted Queue last add 0 0 last rem 0 0
Conversion Queue
Waiting Queue
002b0249 -- (PR) 23450 4296208138 LQ: 0,0x0
002d00b7 -- (PR) 23453 4296208138 LQ: 0,0x0
00340170 -- (PR) 12518 4296208153 Remote: 1 001e03fd LQ: 0,0x0
003000d4 -- (PR) 12280 4296208200 Remote: 4 001902ce LQ: 0,0x0
00240021 -- (PR) 15551 4296208265 Remote: 2 00270221 LQ: 0,0x0

clvmd (23450) 4296208138 req reply einval 2b0249 fr 2 r 2 V_Vol
clvmd (23453) 4296208138 req reply einval 2d00b7 fr 2 r 0 V_Vol
```

All at 4296208138: both local locks get einval replies, get put on the waiting queue, and the resource nodeid gets set to 0. Now adding prints through the code around the einval replies to get a trace of exactly which path we're going through.

```
Resource 000001001bfa8a20 (parent 0000000000000000). Name (len=9) "V_cluster"
Master Copy 4295468866
Granted Queue last add 0 0 last rem 0 0
Conversion Queue
Waiting Queue
0003005a -- (PR) 5019 4295468866 LQ: 0,0x0
00030240 -- (PR) 5113 4295468895 Remote: 1 000402fe LQ: 0,0x0
000103f2 -- (PR) 5022 4295468906 LQ: 0,0x0
00040003 -- (PR) 4531 4295468941 Remote: 4 000301bf LQ: 0,0x0
0005029e -- (PR) 4716 4295468994 Remote: 2 000103fa LQ: 0,0x0

clvmd (5020) 4295468810 rrep einval 203d9 4 fr 4 r 4 V_cluster
clvmd rr 203d9 b
clvmd rr 203d9 m 0
clvmd (5019) 4295468849 rrep einval 30086 2 fr 2 r 2 V_VolGroup
clvmd rr 30086 b
clvmd rr 30086 m 0
clvmd (5019) 4295468866 rrep einval 3005a 2 fr 2 r 2 V_cluster
clvmd rr 3005a b
clvmd rr 3005a m 0
clvmd send einval to 4
clvmd (5022) 4295468906 rrep einval 103f2 2 fr 2 r 0 V_cluster
clvmd rr 103f2 c
```

Created attachment 365621 [details]
debugging patch
patch with debug printks for prev comment
(and a bunch of other locking junk that I tried but haven't removed)
I believe I've identified the dlm problem and have a patch that fixes the test above, but the patch will require a lot of testing because it's at the heart of all lock operations and could have unknown negative effects. We may want to consider an option to enable/disable the new code, or to limit it to the clvmd lockspace.

The problem is in the _can_be_granted() logic regarding the waitqueue. Normally, a lock request cannot be granted if there's another lock already on the waitqueue; this gives FIFO fairness. One of the big problems in the RHEL4 dlm is that the waitqueue is misused: locks are kept there while waiting on a remote lock request, which can mess up the normal lock granting logic.

Here is my current understanding of specifically what's going wrong. Requests for lkb1 and lkb2 are both sent to another node we think is master. Both are kept on the rsb waitqueue while waiting for a reply. The remote node ends up not being the master and returns EINVAL for lkb1.

Local processing of lkb1 after getting the EINVAL reply:
- lkb1 is removed from the waitqueue
- we look up the master again (the resdir is local for this rsb)
- the master ends up being us; rsb nodeid is set to 0
- lkb1 is passed into dlm_lock_stage2 and dlm_lock_stage3
- lkb1 is passed into _can_be_granted
- _can_be_granted sees lkb2 on the waitqueue, so says lkb1 cannot be granted
- so lkb1 is added to the waitqueue (in the proper sense this time, i.e. not because it's waiting for a remote reply but because of the master granting logic)

Local processing of lkb2 after getting its EINVAL reply:
- lkb2 is removed from the waitqueue
- we see we are now the master; rsb nodeid is 0
- lkb2 is passed into dlm_lock_stage2 and dlm_lock_stage3
- lkb2 is passed into _can_be_granted
- _can_be_granted sees lkb1 on the waitqueue, so says lkb2 cannot be granted
- so lkb2 is added to the waitqueue (again in the proper sense)

Other lock requests then arrive for this rsb, and all continue to be added to the waitqueue because it's not empty.

The patch I've tried, and that seems to fix this one specific test, is to make _can_be_granted() return TRUE for locks, like lkb1, that are being requested/tested by the grant logic for the first time since being requested, i.e. they are not already on the waitqueue.

```
@@ -328,6 +333,33 @@ static int _can_be_granted(struct dlm_rsb *r, struct dlm_lk
 	    list_empty(&r->res_waitqueue))
 		return TRUE;
 
+	if (now && !conv && list_empty(&r->res_convertqueue) &&
+	    list_empty(&r->res_grantqueue))
+		return TRUE;
```

Created attachment 365897 [details]
patch that doesn't work yet...
This patch applies the new _can_be_granted() code to all user lockspaces.
In case it actually breaks some userspace app, it can be disabled with:

```
echo 0 > /proc/cluster/config/dlm/user_grant_now
```
(patch actually oopses in dlm_write and I don't see any obvious reason; will have to fix that up next week)
Dave and Chrissie, any update on this one? Debbie

Created attachment 366668 [details]
patch that works
This patch works and seems to fix the problem with the following test case, based on what Chrissie found to reproduce the bug: 4 nodes all running

```
while [ true ]; do lvs; done > /dev/null &
```

and then also running vgscan every 1 to 6 seconds (with a random delay between each iteration). Without the patch, this test ran for around 6 iterations before hitting the bug.
Patch from comment 69 pushed to the cluster.git RHEL4 branch: http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commit;h=18411e7ca35ad8b970bb57192c89ef2b067c8432 We will want to pay close attention for regressions in the next QE cycle that includes this patch.

Hi Dave, I'm experiencing this problem with a RHEL 5.4 cluster. Do you know if this patch can solve the same problem on RHEL 5.4 too? Thanks.

The code and bug in question only exist in RHEL4. Similar symptoms on RHEL5 would be due to something else.

I'd say the fix is not complete. With /proc/cluster/config/dlm/user_grant_now == 0 and running the "hangscript" described in the bug (lvs in a while loop + vgscan), it freezes as expected. With /proc/cluster/config/dlm/user_grant_now set to 1, I'm also able to reproduce the bug (much more easily with 3 script instances on each of 3 nodes). Error printed to the terminal:

```
Error reading data from clvmd: Connection reset by peer
Can't lock vedder: skipping
Error writing data to clvmd: Broken pipe
Reading all physical volumes. This may take a while...
```

After this there is a high probability (70%+) of a cluster-wide freeze. Tested on:

```
Linux a1 2.6.9-89.0.16.EL #1 SMP Tue Oct 27 03:38:53 EDT 2009 ia64 ia64 ia64 GNU/Linux
dlm-kernel-2.6.9-58.6.el4_8.9
```

After keeping it running for a while in the "frozen" state, one of the nodes (a1) got a kernel panic and was fenced, rebooted, and did not join the cluster (correctly configured, expected behaviour). So there were 2 nodes left and the operation continued for another ~25 minutes. Then a2 got a kernel panic (the same problem as a1).
Relevant logging (a1 console log):

```
reply einval 50c03d3 fr 3 r -1 V_vedder
clvmd (2868) req reply einval 52b017d fr 3 r 3 V_VolGroup00
clvmd (2869) req reply einval 4eb0326 fr 3 r 0 V_VolGroup00
clvmd (2875) req reply einval 51301b6 fr 3 r 3 V_VolGroup00
clvmd (2882) req reply einval 4e203b3 fr 2 r 2 V_VolGroup00
clvmd (2878) req reply einval 4c00214 fr 3 r 0 V_VolGroup00
clvmd send einval to 3
clvmd send einval to 3
clvmd (2907) req reply einval 50301c2 fr 3 r 3 V_VolGroup00
clvmd send einval to 2
clvmd (2951) req reply einval 52f0245 fr 3 r 3 V_VolGroup00
clvmd (3011) req reply einval 4bb0183 fr 2 r 2 V_vedder
clvmd (3097) req reply einval 5500319 fr 3 r 3 V_VolGroup00
clvmd send einval to 2
clvmd (3182) req reply einval 4fb00db fr 2 r 2 V_VolGroup00
clvmd (3206) req reply einval 52f01e9 fr 2 r 2 V_VolGroup00
clvmd (3207) req reply einval 4ec013f fr 2 r 0 V_VolGroup00
clvmd (3214) req reply einval 4fe03bf fr 3 r 3 V_VolGroup00
clvmd (3215) req reply einval 4e200f8 fr 3 r 0 V_VolGroup00
clvmd (3234) req reply einval 53703c4 fr 3 r 3 V_vedder

DLM: Assertion failed on line 659 of file /builddir/build/BUILD/dlm-kernel-2.6.9-58/up/src/locking.c
DLM: assertion: "rsb->res_nodeid != -1"
DLM: time = 4298100099
dlm: lkb id 4d503b0 remid 0 flags 0 status 0 rqmode 3 grmode -1 nodeid 3 lqstate 0 lqflags 0
dlm: rsb name "V_vedder" nodeid 0 flags 2 ref 1
kernel BUG at /builddir/build/BUILD/dlm-kernel-2.6.9-58/up/src/locking.c:659!
clvmd[3235]: bugcheck! 0 [1]
Modules linked in: lock_dlm(U) dm_cmirror(U) gnbd(U) lock_nolock(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc ds yenta_socket pcmcia_core vfat fat button ohci_hcd ehci_hcd e1000 dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod mptscsih mptsas lpfc scsi_transport_fc mptspi mptscsi mptbase sd_mod scsi_mod
Pid: 3235, CPU 1, comm: clvmd
psr : 0000101008126010 ifs : 800000000000038c ip : [<a000000200812da0>] Not tainted
ip is at dlm_lock_stage2+0x1a0/0x1c0 [dlm]
unat: 0000000000000000 pfs : 000000000000038c rsc : 0000000000000003
rnat: a000000100804cec bsps: a00000010068f838 pr : 00209a0005a99955
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a000000200812da0 b6 : a00000010026c7a0 b7 : a00000010026c7a0
f6 : 1003e00000000000000a0 f7 : 1003e0000000000000001
f8 : 1003e00000000000000a0 f9 : 10002a000000000000000
f10 : 0fffe9999999996900000 f11 : 1003e0000000000000000
r1 : a0000001009f13f0 r2 : 0000000000005094 r3 : a0000001007f1a70
r8 : 000000000000004e r9 : a000000100781bc0 r10 : a000000100781bb8
r11 : a0000001007f1648 r12 : e000000104277db0 r13 : e000000104270000
r14 : 0000000000004000 r15 : a00000010079a8c8 r16 : 0000000000000001
r17 : a0000001008058b8 r18 : 00000000000000ff r19 : e0000000010174e0
r20 : a000000100a72eb8 r21 : a000000100a6e610 r22 : 0000000000000000
r23 : a000000100a6e610 r24 : a000000100a6e610 r25 : 0000000000000000
r26 : a0000001008058c0 r27 : 0000001008126010 r28 : e000000003450504
r29 : 0000000000000000 r30 : 0000000000000000 r31 : a000000100804cc8

Call Trace:
[<a000000100016e40>] show_stack+0x80/0xa0
    sp=e000000104277920 bsp=e0000001042713d0
[<a000000100017750>] show_regs+0x890/0x8c0
    sp=e000000104277af0 bsp=e000000104271388
[<a00000010003ecf0>] die+0x150/0x240
    sp=e000000104277b10 bsp=e000000104271348
[<a00000010003ee20>] die_if_kernel+0x40/0x60
    sp=e000000104277b10 bsp=e000000104271318
[<a00000010003efc0>] ia64_bad_break+0x180/0x600
    sp=e000000104277b10 bsp=e0000001042712f0
[<a00000010000f600>] ia64_leave_kernel+0x0/0x260
    sp=e000000104277be0 bsp=e0000001042712f0
[<a000000200812da0>] dlm_lock_stage2+0x1a0/0x1c0 [dlm]
    sp=e000000104277db0 bsp=e000000104271290
[<a0000002008131c0>] dlm_lock_stage1+0x400/0x6e0 [dlm]
    sp=e000000104277db0 bsp=e000000104271240
[<a000000200814310>] dlm_lock+0x5d0/0x1520 [dlm]
    sp=e000000104277dc0 bsp=e0000001042711a8
[<a00000020080c0a0>] do_user_lock+0x3a0/0x880 [dlm]
    sp=e000000104277dc0 bsp=e000000104271150
[<a00000020080d650>] dlm_write+0xa10/0x1260 [dlm]
    sp=e000000104277e00 bsp=e0000001042710e0
[<a00000010012a7f0>] vfs_write+0x290/0x360
    sp=e000000104277e20 bsp=e000000104271090
[<a00000010012aa10>] sys_write+0x70/0xe0
    sp=e000000104277e20 bsp=e000000104271018
[<a00000010000f4a0>] ia64_ret_from_syscall+0x0/0x20
```

Package versions:

```
[root@a1 ~]# rpm -qva | grep cman
cman-kernel-debuginfo-2.6.9-56.7.el4_8.9
cman-1.0.27-1.el4
cman-kernel-2.6.9-56.7.el4_8.9
cman-debuginfo-1.0.27-1.el4
cman-kernheaders-2.6.9-56.7.el4_8.9
cman-devel-1.0.27-1.el4
cman-kernel-largesmp-2.6.9-56.7.el4_8.9
[root@a1 ~]# rpm -qva | grep dlm
dlm-kernel-largesmp-2.6.9-58.6.el4_8.9
dlm-kernel-2.6.9-58.6.el4_8.9
dlm-devel-1.0.7-1
dlm-kernel-debuginfo-2.6.9-58.6.el4_8.9
dlm-1.0.7-1
dlm-debuginfo-1.0.3-1
dlm-kernheaders-2.6.9-58.6.el4_8.9
[root@a1 ~]# uname -a
Linux a1 2.6.9-89.0.16.EL #1 SMP Tue Oct 27 03:38:53 EDT 2009 ia64 ia64 ia64 GNU/Linux
```

I believe the assertion failure is a different problem than the bug being fixed by this patch. What I need to find out is whether you managed to get locks on the waiting queue with no locks on the granted queue, as shown in comment 54.

Dave, you are right. The locks and queues are solved by the patch; there are no more queues with waiting but not granted locks. We have traced the problem down to clvmd, and it's being solved as bug 559999.