Description of problem:
Running multiple cmirror config operations results in a deadlock.

[root@hayes-01 tmp]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    920   2008-01-21 13:55:51  hayes-01
   2   M    928   2008-01-21 13:55:53  hayes-02
   3   M    924   2008-01-21 13:55:53  hayes-03

[root@hayes-01 tmp]# ./lvm_backtraces.pl
Backtrace for lvcreate-m1-nhayes-01.3664-L500Mlock_stress (28958):
#1  0x0000000000452ce5 in init_cluster_locking ()
#2  0x0000000000453125 in init_cluster_locking ()
#3  0x000000000045336c in init_cluster_locking ()
#4  0x000000000043ed9f in reset_locking ()
#5  0x000000000043f11b in lock_vol ()
#6  0x0000000000411a68 in lvcreate ()
#7  0x0000000000413444 in lvm_run_command ()
#8  0x0000000000415657 in lvm2_main ()
#9  0x0000003946a1d8a4 in __libc_start_main () from /lib64/libc.so.6

[root@hayes-02 tmp]# ./lvm_backtraces.pl
Backtrace for lvslock_stress/hayes-02.3666--noheadings-ocopy_percent (29701):
#1  0x0000000000452ce5 in init_cluster_locking ()
#2  0x0000000000453125 in init_cluster_locking ()
#3  0x000000000045336c in init_cluster_locking ()
#4  0x000000000043ed9f in reset_locking ()
#5  0x000000000043f11b in lock_vol ()
#6  0x0000000000420933 in process_each_lv ()
#7  0x000000000041d640 in pvscan ()
#8  0x0000000000413444 in lvm_run_command ()
#9  0x0000000000415657 in lvm2_main ()
#10 0x0000003473e1d8a4 in __libc_start_main () from /lib64/libc.so.6

[root@hayes-03 tmp]# ./lvm_backtraces.pl
Backtrace for lvcreate-m1-nhayes-03.3662-L500Mlock_stress (30653):
#1  0x0000000000452ce5 in init_cluster_locking ()
#2  0x0000000000453125 in init_cluster_locking ()
#3  0x000000000045336c in init_cluster_locking ()
#4  0x000000000043ed9f in reset_locking ()
#5  0x000000000043f11b in lock_vol ()
#6  0x0000000000411a68 in lvcreate ()
#7  0x0000000000413444 in lvm_run_command ()
#8  0x0000000000415657 in lvm2_main ()
#9  0x000000363a61d8a4 in __libc_start_main () from /lib64/libc.so.6

Backtrace for lvs-a-o+devices (30790):
#1  0x0000000000452ce5 in init_cluster_locking ()
#2  0x0000000000453125 in init_cluster_locking ()
#3  0x000000000045336c in init_cluster_locking ()
#4  0x000000000043ed9f in reset_locking ()
#5  0x000000000043f11b in lock_vol ()
#6  0x0000000000420933 in process_each_lv ()
#7  0x000000000041d640 in pvscan ()
#8  0x0000000000413444 in lvm_run_command ()
#9  0x0000000000415657 in lvm2_main ()
#10 0x000000363a61d8a4 in __libc_start_main () from /lib64/libc.so.6

Version-Release number of selected component (if applicable):
2.6.18-62.el5
cmirror-1.1.5-4.el5
kmod-cmirror-0.1.2-1.el5.scratch.1
Created attachment 292527 [details] backtraces from hayes-01
Created attachment 292528 [details] backtraces from hayes-02
Created attachment 292529 [details] backtraces from hayes-03
What are "cmirror config operations"? Converting/creating/removing? I haven't looked at the attachments yet, but it doesn't look like it has hit cmirror code yet... LVM/device-mapper has changed quite a bit in this release - it could be there too. Is there something that convinces you it's cmirror?
Need -vvvv output for all three commands to check for new userspace deadlocks - or lock manager state showing which locks are held where in the cluster.
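One way to capture the requested debug output would be a small wrapper run on each node (a sketch only - the function name and log paths are assumptions; -vvvv is LVM's maximum-verbosity flag):

```shell
# Hedged sketch: wrap each LVM command so its -vvvv debug output lands
# in a per-node, per-command log file for later comparison across nodes.
run_debug() {
    cmd=$1; shift
    log="/tmp/$(hostname)-${cmd}-vvvv.log"
    "$cmd" -vvvv "$@" > "$log" 2>&1
}
# On each node, for example:
#   run_debug lvcreate -m1 -n mirror1 -L 500M lock_stress
#   run_debug lvs --noheadings -o copy_percent lock_stress
```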
The operations, run from multiple nodes in the cluster against multiple mirrors, are as follows: create, down convert, up convert, core log convert, disk log convert, deactivate, reactivate, deactivate, delete.
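The sequence above likely corresponds to LVM commands along these lines (a sketch only - the VG/LV names and size are assumptions, and the script is written out rather than run since it needs a live cluster):

```shell
# Hedged sketch of the stress sequence; saved as a script for use on
# the cluster nodes rather than executed here.
cat > /tmp/cmirror_ops.sh <<'EOF'
#!/bin/sh
lvcreate -m1 -n mirror1 -L 500M lock_stress       # create
lvconvert -m0 lock_stress/mirror1                 # down convert
lvconvert -m1 lock_stress/mirror1                 # up convert
lvconvert --mirrorlog core lock_stress/mirror1    # core log convert
lvconvert --mirrorlog disk lock_stress/mirror1    # disk log convert
lvchange -an lock_stress/mirror1                  # deactivate
lvchange -ay lock_stress/mirror1                  # reactivate
lvchange -an lock_stress/mirror1                  # deactivate again
lvremove -f lock_stress/mirror1                   # delete
EOF
chmod +x /tmp/cmirror_ops.sh
```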
This deadlock can be triggered by just having many mirrors to activate during clvmd start up.
many == ?
There were 28 mirrors in that case. However, I spoke too soon: it didn't deadlock, it eventually timed out and then the stream of 'dm-log-clustered:' errors followed, so comment #8 is most likely one of the other open BZs.
Corey, is this still reproducible?
I'm unable to reproduce this, so this may be fixed. However, I do eventually hit another bz, so verifying this may be blocked behind 429599.
I think 429599 has been cleared... ok to have another go at this?
I appear to have hit this last night, again while running cmirror_lock_stress.

./cmirror_lock_stress -l /home/msp/cmarthal/work/rhel5/sts-root -r /usr/tests/sts-rhel5.2 -R ../../var/share/resource_files/hayes.xml -m 2

I saw a lot of the following messages on the console:
"dm-log-clustered: Stray request returned:"
"dm-log-clustered: Excessive delay in request processing, 1 sec"

There are no mirrors or mirror components currently listed with dmsetup.

I'll attach a dump of the stuck processes...

[root@hayes-01 ~]# rpm -qi lvm2
Name        : lvm2                 Relocations: (not relocatable)
Version     : 2.02.32              Vendor: Red Hat, Inc.
Release     : 4.el5                Build Date: Fri 04 Apr 2008 06:15:19 AM CDT
Install Date: Mon 05 May 2008 11:26:24 AM CDT
Build Host  : hs20-bc2-3.build.redhat.com

[root@hayes-01 ~]# rpm -qi lvm2-cluster
Name        : lvm2-cluster         Relocations: (not relocatable)
Version     : 2.02.32              Vendor: Red Hat, Inc.
Release     : 4.el5                Build Date: Wed 02 Apr 2008 03:56:50 AM CDT
Install Date: Mon 05 May 2008 11:27:46 AM CDT
Build Host  : hs20-bc2-3.build.redhat.com

[root@hayes-01 ~]# rpm -qi cmirror
Name        : cmirror              Relocations: (not relocatable)
Version     : 1.1.17               Vendor: Red Hat, Inc.
Release     : 1.el5                Build Date: Fri 09 May 2008 11:33:43 AM CDT
Install Date: Fri 09 May 2008 03:57:23 PM CDT
Build Host  : hs20-bc1-7.build.redhat.com

[root@hayes-01 ~]# rpm -qi kmod-cmirror
Name        : kmod-cmirror         Relocations: (not relocatable)
Version     : 0.1.9                Vendor: Red Hat, Inc.
Release     : 1.el5                Build Date: Thu 08 May 2008 02:28:27 PM CDT
Install Date: Fri 09 May 2008 09:17:33 AM CDT
Build Host  : hs20-bc2-4.build.redhat.com

[root@hayes-01 ~]# rpm -qi openais
Name        : openais              Relocations: (not relocatable)
Version     : 0.80.3               Vendor: Red Hat, Inc.
Release     : 15.el5               Build Date: Wed 02 Apr 2008 03:42:29 AM CDT
Install Date: Mon 05 May 2008 11:26:27 AM CDT
Build Host  : ls20-bc2-13.build.redhat.com
Created attachment 305252 [details] log from hayes-01
Created attachment 305253 [details] log from hayes-02
Created attachment 305254 [details] log from hayes-03
This bz has not been seen in 5 months, and is possibly the same as bz 460222. Will reopen if needed.