Bug 362691
Summary: | deadlock during core log mirror creation | ||
---|---|---|---|
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Corey Marthaler <cmarthal> |
Component: | cmirror-kernel | Assignee: | Jonathan Earl Brassow <jbrassow> |
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
Severity: | low | Docs Contact: | |
Priority: | low | ||
Version: | 4 | CC: | agk, ccaulfie, cfeist, mbroz |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2008-07-24 16:30:38 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 217895, 364081 | ||
Bug Blocks: |
Description
Corey Marthaler
2007-11-01 21:39:25 UTC
[root@link-07 ~]# cman_tool nodes Node Votes Exp Sts Name 1 1 6 M grant-02 2 1 6 M link-07 3 1 6 M link-02 4 1 6 M grant-01 5 1 6 M link-08 6 1 6 M grant-03 [root@link-07 ~]# cman_tool services Service Name GID LID State Code Fence Domain: "default" 3 2 run - [2 6 4 5 3 1] DLM Lock Space: "clvmd" 5 3 run - [2 1 4 5 6 3] Nothing is in device-mapper yet: [root@link-07 ~]# dmsetup ls --tree VolGroup00-LogVol01 (253:1) └─ (3:2) VolGroup00-LogVol00 (253:0) └─ (3:2) [root@link-07 ~]# dmsetup table VolGroup00-LogVol01: 0 4063232 linear 3:2 151781760 VolGroup00-LogVol00: 0 151781376 linear 3:2 384 [root@link-02 tmp]# ./lvm_backtraces.pl Nov 1 15:39:52 link-02 kernel: dm-cmirror: co-ordinator: 57005 Nov 1 15:39:52 link-02 kernel: dm-cmirror: node_count : 1 Nov 1 15:40:02 link-02 kernel: dm-cmirror: stop_server called Nov 1 15:40:07 link-02 kernel: dm-cmirror: Closing socket on server side Nov 1 15:40:14 link-02 qarshd[29681]: Talking to peer 10.15.80.47:42867 Nov 1 15:40:14 link-02 qarshd[29681]: Running cmdline: dmsetup ls Nov 1 15:40:14 link-02 qarshd[29681]: That's enough Nov 1 15:40:14 link-02 qarshd[29683]: Talking to peer 10.15.80.47:42868 Nov 1 15:40:14 link-02 qarshd[29683]: Running cmdline: dmsetup ls Nov 1 15:40:14 link-02 qarshd[29683]: That's enough DLM lockspace 'clvmd' Resource 00000100393f9178 (parent 0000000000000000). Name (len=64) "DB3qjSNTrSuoZjKO57YDQqSuKqkOCEpzslEr8sw0fg2cVXlRYYNuxOBGE7KADNpi" Local Copy, Master is node [[grant-03]] Granted Queue 00010238 CR 4697 Master: 000102ff Conversion Queue Waiting Queue Resource 00000100393f9508 (parent 0000000000000000). Name (len=64) "DB3qjSNTrSuoZjKO57YDQqSuKqkOCEpz3rCt2OEKPraYcmx8UjSLMCZXU245t4lt" Local Copy, Master is node [[grant-03]] Granted Queue 0001021d CR 4697 Master: 000103ea Conversion Queue Waiting Queue [root@link-07 tmp]# ./lvm_backtraces.pl Backtrace for lvcreate-m1-nmirror_log_convert--corelog-L236Gmirror_sanity/dev/sda1/dev/sdb1/dev/sdc1/dev/sdd1 (9881): #1 0x000000000044fde3 in _lock_for_cluster (cmd=51 '3', flags=Variable "flags" is not available.) at locking/cluster_locking.c:115 #2 0x00000000004500e3 in _lock_resource (cmd=Variable "cmd" is not available.) at locking/cluster_locking.c:415 #3 0x000000000043cea6 in _lock_vol (cmd=0x587240, resource=0x7fbfffd970 "mirror_sanity", flags=36) at locking/locking.c:321 ####################################### # _lock_vol flags = LCK_VG | LCK_HOLD | LCK_WRITE ####################################### #4 0x000000000043d0b9 in lock_vol (cmd=0x587240, vol=0x7fbffffe5d "mirror_sanity", flags=36) at locking/locking.c:354 #5 0x0000000000411292 in lvcreate (cmd=0x587240, argc=Variable "argc" is not available.) at lvcreate.c:864 #6 0x0000000000414c42 in lvm_run_command (cmd=0x587240, argc=5, argv=0x7fbffffc48) at lvmcmdline.c:927 #7 0x00000000004155f2 in lvm2_main (argc=13, argv=0x7fbffffc08, is_static=0) at lvmcmdline.c:1205 #8 0x00000033efb1c3fb in __libc_start_main () from /lib64/tls/libc.so.6 #9 0x000000000040c84a in _start () #10 0x0000007fbffffbf8 in ?? () #11 0x000000000000001c in ?? () #12 0x000000000000000d in ?? () #13 0x0000007fbffffe27 in ?? () #14 0x0000007fbffffe30 in ?? () #15 0x0000007fbffffe33 in ?? () #16 0x0000007fbffffe35 in ?? () #17 0x0000007fbffffe38 in ?? () #18 0x0000007fbffffe4b in ?? () #19 0x0000007fbffffe55 in ?? () #20 0x0000007fbffffe58 in ?? () #21 0x0000007fbffffe5d in ?? () #22 0x0000007fbffffe6b in ?? () #23 0x0000007fbffffe75 in ?? () #24 0x0000007fbffffe7f in ?? () #25 0x0000007fbffffe89 in ?? () Nov 1 16:49:44 link-07 last message repeated 21 times Nov 1 16:50:47 link-07 last message repeated 21 times Nov 1 16:51:50 link-07 last message repeated 21 times Nov 1 16:52:53 link-07 last message repeated 21 times Nov 1 16:53:56 link-07 last message repeated 21 times Nov 1 16:54:59 link-07 last message repeated 21 times Nov 1 16:56:02 link-07 last message repeated 21 times Nov 1 16:57:05 link-07 last message repeated 21 times Nov 1 16:58:08 link-07 last message repeated 21 times Nov 1 16:59:11 link-07 last message repeated 21 times DLM lockspace 'clvmd' Resource 000001001e3c2178 (parent 0000000000000000). Name (len=64) "DB3qjSNTrSuoZjKO57YDQqSuKqkOCEpzslEr8sw0fg2cVXlRYYNuxOBGE7KADNpi" Local Copy, Master is node [[grant-03]] Granted Queue 00010019 CR 4682 Master: 00010110 Conversion Queue Waiting Queue Resource 000001001e3c2508 (parent 0000000000000000). Name (len=64) "DB3qjSNTrSuoZjKO57YDQqSuKqkOCEpz3rCt2OEKPraYcmx8UjSLMCZXU245t4lt" Local Copy, Master is node [[grant-03]] Granted Queue 00010251 CR 4682 Master: 00010163 Conversion Queue Waiting Queue Resource 0000010019ca9e18 (parent 0000000000000000). Name (len=15) "V_mirror_sanity" Master Copy Granted Queue 000200ae PR 9876 Conversion Queue Waiting Queue 000203ea -- (PW) 9882 LQ: 0,0x0 000100bf -- (PR) 8250 Remote: [[grant-02]] 00030309 LQ: 0,0x0 [root@link-08 tmp]# ./lvm_backtraces.pl Nov 1 16:02:25 link-08 kernel: dm-cmirror: co-ordinator: 57005 Nov 1 16:02:25 link-08 kernel: dm-cmirror: node_count : 3 Nov 1 16:02:41 link-08 kernel: dm-cmirror: stop_server called Nov 1 16:02:44 link-08 kernel: dm-cmirror: Closing socket on server side Nov 1 16:02:46 link-08 qarshd[8893]: Talking to peer 10.15.80.47:46160 Nov 1 16:02:46 link-08 qarshd[8893]: Running cmdline: dmsetup ls Nov 1 16:02:46 link-08 qarshd[8893]: That's enough Nov 1 16:02:46 link-08 qarshd[8895]: Talking to peer 10.15.80.47:46161 Nov 1 16:02:46 link-08 qarshd[8895]: Running cmdline: dmsetup ls Nov 1 16:02:46 link-08 qarshd[8895]: That's enough DLM lockspace 'clvmd' Resource 000001001dbf7178 (parent 0000000000000000). Name (len=64) "DB3qjSNTrSuoZjKO57YDQqSuKqkOCEpzslEr8sw0fg2cVXlRYYNuxOBGE7KADNpi" Local Copy, Master is node [[grant-03]] Granted Queue 00010191 CR 4672 Master: 0001001c Conversion Queue Waiting Queue Resource 000001001dbf7508 (parent 0000000000000000). Name (len=64) "DB3qjSNTrSuoZjKO57YDQqSuKqkOCEpz3rCt2OEKPraYcmx8UjSLMCZXU245t4lt" Local Copy, Master is node [[grant-03]] Granted Queue 0001007d CR 4672 Master: 00010202 Conversion Queue Waiting Queue [root@grant-01 tmp]# ./lvm_backtraces.pl Nov 1 10:46:02 grant-01 lvm[4871]: No longer monitoring mirror device mirror_sanity-display_mirror for events Nov 1 10:46:27 grant-01 qarshd[10536]: Talking to peer 10.15.80.47:58597 Nov 1 10:46:27 grant-01 qarshd[10536]: Running cmdline: lvremove -f /dev/mirror_sanity/display_mirror Nov 1 10:46:27 grant-01 [4871]: Monitoring mirror device mirror_sanity-display_mirror for events Nov 1 10:46:27 grant-01 lvm[4871]: mirror_sanity-display_mirror is now in-sync Nov 1 10:46:27 grant-01 lvm[4871]: No longer monitoring mirror device mirror_sanity-display_mirror for events Nov 1 10:46:28 grant-01 qarshd[10630]: Talking to peer 10.15.80.47:58604 Nov 1 10:46:28 grant-01 qarshd[10630]: Running cmdline: dmsetup ls Nov 1 10:46:28 grant-01 qarshd[10632]: Talking to peer 10.15.80.47:58605 Nov 1 10:46:28 grant-01 qarshd[10632]: Running cmdline: dmsetup ls DLM lockspace 'clvmd' Resource 00000102187ae178 (parent 0000000000000000). Name (len=64) "DB3qjSNTrSuoZjKO57YDQqSuKqkOCEpzslEr8sw0fg2cVXlRYYNuxOBGE7KADNpi" Local Copy, Master is node [[grant-03]] Granted Queue 00010310 CR 4820 Master: 000100ec Conversion Queue Waiting Queue Resource 00000102187ae508 (parent 0000000000000000). Name (len=64) "DB3qjSNTrSuoZjKO57YDQqSuKqkOCEpz3rCt2OEKPraYcmx8UjSLMCZXU245t4lt" Local Copy, Master is node [[grant-03]] Granted Queue 00010094 CR 4820 Master: 000103bb Conversion Queue Waiting Queue [root@grant-02 tmp]# ./lvm_backtraces.pl Backtrace for lvs-a-o+devices (8249): #1 0x000000000044fde3 in _lock_for_cluster (cmd=51 '3', flags=Variable "flags" is not available.) at locking/cluster_locking.c:115 #2 0x00000000004500e3 in _lock_resource (cmd=Variable "cmd" is not available.) at locking/cluster_locking.c:415 #3 0x000000000043cea6 in _lock_vol (cmd=0x587240, resource=0x7fbfffd730 "mirror_sanity", flags=33) at locking/locking.c:321 ####################################### # _lock_vol flags = LCK_VG | LCK_HOLD | LCK_READ ####################################### #4 0x000000000043d0b9 in lock_vol (cmd=0x587240, vol=0x5a0fb8 "mirror_sanity", flags=33) at locking/locking.c:354 #5 0x000000000041f2ef in process_each_lv (cmd=0x587240, argc=0, argv=0x7fbffffa68, lock_type=33, handle=0x5a1df0, process_single=0x41c840 <_lvsegs_single>) at toollib.c:346 #6 0x000000000041ceaa in _report (cmd=0x587240, argc=0, argv=0x7fbffffa68, report_type=Variable "report_type" is not available.) at reporter.c:325 #7 0x0000000000414c42 in lvm_run_command (cmd=0x587240, argc=0, argv=0x7fbffffa68) at lvmcmdline.c:927 #8 0x00000000004155f2 in lvm2_main (argc=4, argv=0x7fbffffa48, is_static=0) at lvmcmdline.c:1205 #9 0x000000367251c3fb in __libc_start_main () from /lib64/tls/libc.so.6 #10 0x000000000040c84a in _start () #11 0x0000007fbffffa38 in ?? () #12 0x000000000000001c in ?? () #13 0x0000000000000004 in ?? () #14 0x0000007fbffffc25 in ?? () #15 0x0000007fbffffc29 in ?? () #16 0x0000007fbffffc2c in ?? () #17 0x0000007fbffffc2f in ?? () Nov 1 10:45:51 grant-02 qarshd[8199]: Running cmdline: dmsetup ls Nov 1 10:45:51 grant-02 lvm[4861]: No longer monitoring mirror device mirror_sanity-display_mirror for events Nov 1 10:46:17 grant-02 qarshd[8245]: Talking to peer 10.15.80.47:39693 Nov 1 10:46:17 grant-02 qarshd[8245]: Running cmdline: dmsetup ls Nov 1 10:46:17 grant-02 qarshd[8247]: Talking to peer 10.15.80.47:39694 Nov 1 10:46:17 grant-02 qarshd[8247]: Running cmdline: dmsetup ls Nov 1 11:24:47 grant-02 dhclient: DHCPREQUEST on eth0 to 10.15.89.100 port 67 Nov 1 11:24:48 grant-02 dhclient: DHCPACK from 10.15.89.100 Nov 1 11:24:48 grant-02 dhclient: bound to 10.15.89.154 -- renewal in 8611 seconds. Nov 1 11:40:39 grant-02 sshd(pam_unix)[8295]: session opened for user root by (uid=0) DLM lockspace 'clvmd' Resource 000001011a337178 (parent 0000000000000000). Name (len=64) "DB3qjSNTrSuoZjKO57YDQqSuKqkOCEpzslEr8sw0fg2cVXlRYYNuxOBGE7KADNpi" Local Copy, Master is node [[grant-03]] Granted Queue 00010123 CR 4809 Master: 00010062 Conversion Queue Waiting Queue Resource 000001011a337508 (parent 0000000000000000). Name (len=64) "DB3qjSNTrSuoZjKO57YDQqSuKqkOCEpz3rCt2OEKPraYcmx8UjSLMCZXU245t4lt" Local Copy, Master is node [[grant-03]] Granted Queue 000200dc CR 4809 Master: 0001035c Conversion Queue Waiting Queue Resource 0000010115bd1e18 (parent 0000000000000000). Name (len=15) "V_mirror_sanity" Local Copy, Master is node [[link-07]] Granted Queue Conversion Queue Waiting Queue 00030309 -- (PR) 8250 Master: 000100bf LQ: 0,0x0 [root@grant-03 tmp]# ./lvm_backtraces.pl Nov 1 10:45:52 grant-03 qarshd[8413]: Running cmdline: dmsetup ls Nov 1 10:46:19 grant-03 [4873]: Monitoring mirror device mirror_sanity-display_mirror for events Nov 1 10:46:19 grant-03 lvm[4873]: mirror_sanity-display_mirror is now in-sync Nov 1 10:46:19 grant-03 qarshd[8485]: Talking to peer 10.15.80.47:59434 Nov 1 10:46:19 grant-03 qarshd[8485]: Running cmdline: lvdisplay /dev/mirror_sanity/display_mirror Nov 1 10:46:19 grant-03 lvm[4873]: No longer monitoring mirror device mirror_sanity-display_mirror for events Nov 1 10:46:46 grant-03 qarshd[8532]: Talking to peer 10.15.80.47:59459 Nov 1 10:46:46 grant-03 qarshd[8532]: Running cmdline: dmsetup ls Nov 1 10:46:46 grant-03 qarshd[8534]: Talking to peer 10.15.80.47:59460 Nov 1 10:46:46 grant-03 qarshd[8534]: Running cmdline: dmsetup ls DLM lockspace 'clvmd' Resource 0000010218e5d178 (parent 0000000000000000). Name (len=64) "DB3qjSNTrSuoZjKO57YDQqSuKqkOCEpzslEr8sw0fg2cVXlRYYNuxOBGE7KADNpi" Master Copy Granted Queue 000102ff CR 4697 Remote: [[link-02]] 00010238 00010110 CR 4682 Remote: [[link-07]] 00010019 0001001c CR 4672 Remote: [[link-08]] 00010191 00010062 CR 4809 Remote: [[grant-02]] 00010123 000100ec CR 4820 Remote: [[grant-01]] 00010310 000100eb CR 4822 Conversion Queue Waiting Queue Resource 0000010218e5d508 (parent 0000000000000000). Name (len=64) "DB3qjSNTrSuoZjKO57YDQqSuKqkOCEpz3rCt2OEKPraYcmx8UjSLMCZXU245t4lt" Master Copy Granted Queue 00010202 CR 4672 Remote: [[link-08]] 0001007d 00010163 CR 4682 Remote: [[link-07]] 00010251 000103ea CR 4697 Remote: [[link-02]] 0001021d 0001035c CR 4809 Remote: [[grant-02]] 000200dc 000103bb CR 4820 Remote: [[grant-01]] 00010094 000102b5 CR 4822 Conversion Queue Waiting Queue This is easily reproducable, smells like a regression. lvcreate -m 1 -n mirror_log_convert --corelog -L 236G mirror_sanity /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 I can not seem to reproduce this bug by hand though... ...what is my test doing that I'm not doing? Odd, now I'm seeing issues with just simple cmds liks 'lvs'. Similar trace even. During a vgs cmd, the clvmd -d output looked like this over and over for a long time: [...] CLVMD[41e02960]: Nov 2 10:52:03 doing PRE command LOCK_VG 'V_mirror_sanity' at 6 (client=0x581740) CLVMD[41e02960]: Nov 2 10:52:03 Writing status 22 down pipe 9 CLVMD[41e02960]: Nov 2 10:52:03 doing PRE command LOCK_VG 'V_mirror_sanity' at 6 (client=0x581740) CLVMD[41e02960]: Nov 2 10:52:03 Writing status 22 down pipe 9 CLVMD[41e02960]: Nov 2 10:52:03 doing PRE command LOCK_VG 'V_mirror_sanity' at 6 (client=0x581740) CLVMD[41e02960]: Nov 2 10:52:03 Writing status 22 down pipe 9 [...] I killed clvmd, restarted it and it seems to be working (at least for now). Seems to have gotten into some weird state. Here's the requested -vvvv output for the deadlocked mirror creation. [root@grant-03 ~]# lvcreate -m 1 -n mirror_log_convert --corelog -L 20G mirror_sanity -vvvv #lvmcmdline.c:905 Processing: lvcreate -m 1 -n mirror_log_convert --corelog -L 20G mirror_sanity -vvvv #lvmcmdline.c:908 O_DIRECT will be used #config/config.c:952 Setting global/locking_type to 3 #locking/locking.c:244 Cluster locking selected. #activate/activate.c:362 Getting target version for mirror #ioctl/libdm-iface.c:1572 dm version OF [16384] #ioctl/libdm-iface.c:1572 dm versions OF [16384] #activate/activate.c:362 Getting target version for mirror #ioctl/libdm-iface.c:1572 dm versions OF [16384] #config/config.c:952 Setting activation/mirror_region_size to 512 #locking/cluster_locking.c:413 Locking V_mirror_sanity at 0x4 [HANG] Reproduced this bug last night during final 4.6 testing. This may be a blocker after all. This bz is being caused by 364081. There are two bugs here I think. There is the double-unlock bug which is causing the unlocks to fail in clvmd. But there is also a problem in that clvmd doesn't handle errors in the PRE command stage very well. I'll look into that. Do we have enough BZs open for this problem to cover that or shall I open a new one ? Unable to verify any progress on this bz due to bz 217895. Adding that this bug depends on 217895 in order to be verified. 217895 is in modified state. I also believe this is a dup of 437446. Similar bugs are: 239614, 362691, 437446*, 435341, 435491, 217895 Have not been able to reproduce this bug lately, marking this verified. Closign this bug as the code has been released in 4.7. |