Description of problem:
I was running our cmirror lock stress test on the hayes cluster (hayes-0[123]) with 3 mirrors per machine and saw this deadlock.

./cmirror_lock_stress -R ../../var/share/resource_files/hayes.xml -l /home/msp/cmarthal/work/rhel5/sts-root -r /usr/tests/sts-rhel5.3 -m 3

I'll attach the kernel dumps from all three machines.

Version-Release number of selected component (if applicable):
[root@hayes-03 tmp]# /usr/tests/sts-rhel5.3/lvm2/bin/lvm_rpms
2.6.18-92.el5
lvm2-2.02.32-4.el5          BUILT: Fri Apr  4 06:15:19 CDT 2008
lvm2-cluster-2.02.32-4.el5  BUILT: Wed Apr  2 03:56:50 CDT 2008
cmirror-1.1.22-1.el5        BUILT: Thu Jul 24 15:59:03 CDT 2008
kmod-cmirror-0.1.13-2.el5   BUILT: Thu Jul 24 16:00:48 CDT 2008
Created attachment 315047 [details] log from hayes-01
Created attachment 315048 [details] log from hayes-02
Created attachment 315049 [details] log from hayes-03
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Reproduced this issue last night running the exact same load. Will attach the kernel dumps from the 3 hayes nodes.
Created attachment 315107 [details] 2nd log from hayes-01
Created attachment 315108 [details] 2nd log from hayes-02
Created attachment 315109 [details] 2nd log from hayes-03
Hmmm, up-converting:

Aug 25 17:56:31 hayes-01 qarshd[17129]: Running cmdline: lvconvert -m 3 lock_stress/hayes-01.27240

If this is up-converting from linear, then this is a new issue. If this is up-converting from one mirror to another, then this is a known issue. Do you know which was happening?
There are no up-converts from existing mirrors in this test, only up-converts from linears. However, there are potential down-converts to mirrors with fewer legs in this test. It's possible that mirror had been a '-m 4' mirror and was down-converting to a '-m 3' mirror. Though without those logs, I can't be certain.
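For reference, the two scenarios look something like this (the VG name lock_stress is from the log above, but the LV names are made up for illustration):

  # up-convert from linear to a 4-image mirror (the possibly-new-issue path)
  lvcreate -n lv_linear -L 500M lock_stress
  lvconvert -m 3 lock_stress/lv_linear

  # down-convert a '-m 4' (5-image) mirror to '-m 3' (4 images)
  lvcreate -m 4 -n lv_mirror -L 500M lock_stress
  lvconvert -m 3 lock_stress/lv_mirror

Note that with LVM, '-m N' means N mirror legs in addition to the original, so '-m 3' keeps 4 copies of the data.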
Are we sure that it doesn't just appear deadlocked, due to the fact that an upconvert now waits for the sync to complete... which can take a long time?
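(If it were just a slow resync rather than a deadlock, the copy percentage should still be advancing; for example, against the VG from the log above:

  lvs -a -o name,copy_percent lock_stress

That's just a suggested way to distinguish the two cases, not output captured from the test run.)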
The test doesn't fail due to a timeout; the test is happy thinking that the cmds are running fine. However, on the cluster the cmds (and lvm) were stuck. That said, I haven't been able to reproduce this with the latest builds. I'll bump up the number of mirrors being used in this test and see what happens overnight.
I appear to be unable to reproduce this issue with the latest code. Will take it off the blocker list but leave it open in case it's seen again.

[root@hayes-02 ~]# /usr/tests/sts-rhel5.3/lvm2/bin/lvm_rpms
2.6.18-115.gfs2abhi.001
lvm2-2.02.40-3.el5           BUILT: Thu Sep 25 14:59:07 CDT 2008
lvm2-cluster-2.02.40-3.el5   BUILT: Thu Sep 25 15:00:54 CDT 2008
device-mapper-1.02.28-2.el5  BUILT: Fri Sep 19 02:50:32 CDT 2008
cmirror-1.1.28-1.el5         BUILT: Tue Sep 30 15:48:54 CDT 2008
kmod-cmirror-0.1.18-1.el5    BUILT: Mon Sep 29 16:20:21 CDT 2008
I've been unable to reproduce the hang in this bug. I have, however, seen these operations take much longer than "normal", and for that I'll likely open a separate bug, but this one can be marked verified in the meantime.

2.6.18-150.el5
lvm2-2.02.46-5.el5           BUILT: Sat Jun  6 16:29:23 CDT 2009
lvm2-cluster-2.02.46-5.el5   BUILT: Sat Jun  6 16:28:13 CDT 2009
device-mapper-1.02.32-1.el5  BUILT: Thu May 21 02:18:23 CDT 2009
cmirror-1.1.37-1.el5         BUILT: Tue May  5 11:46:05 CDT 2009
kmod-cmirror-0.1.21-14.el5   BUILT: Thu May 21 08:28:17 CDT 2009
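For the follow-up bug, one simple way to quantify "much longer than normal" would be to time the conversion directly (hypothetical LV name):

  time lvconvert -m 3 lock_stress/lv_linear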
Just to add to the above comment, I am also unable to reproduce this bug despite numerous tries. Seems like an intermittent issue.
Assuming this VERIFIED fix got released. Closing. Reopen if it's not yet resolved.