Bug 460222 - RHEL5 cmirror tracker: multiple cmirror operations cause deadlock
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cmirror
Hardware: All Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Jonathan Earl Brassow
QA Contact: Cluster QE
Reported: 2008-08-26 16:50 EDT by Corey Marthaler
Modified: 2010-04-27 11:04 EDT (History)

Doc Type: Bug Fix
Last Closed: 2010-04-27 11:04:26 EDT

Attachments
log from hayes-01 (212.12 KB, text/plain), 2008-08-26 16:51 EDT, Corey Marthaler
log from hayes-02 (433.12 KB, text/plain), 2008-08-26 16:52 EDT, Corey Marthaler
log from hayes-03 (429.56 KB, text/plain), 2008-08-26 16:53 EDT, Corey Marthaler
2nd log from hayes-01 (201.42 KB, text/plain), 2008-08-27 10:54 EDT, Corey Marthaler
2nd log from hayes-02 (202.13 KB, text/plain), 2008-08-27 10:54 EDT, Corey Marthaler
2nd log from hayes-03 (202.16 KB, text/plain), 2008-08-27 10:58 EDT, Corey Marthaler

Description Corey Marthaler 2008-08-26 16:50:04 EDT
Description of problem:
I was running our cmirror lock stress test on the hayes cluster (hayes-0[123]) with 3 mirrors per machine and saw this deadlock.

./cmirror_lock_stress -R ../../var/share/resource_files/hayes.xml -l /home/msp/cmarthal/work/rhel5/sts-root -r /usr/tests/sts-rhel5.3 -m 3

I'll attach the kernel dumps from all three machines.

Version-Release number of selected component (if applicable):
[root@hayes-03 tmp]# /usr/tests/sts-rhel5.3/lvm2/bin/lvm_rpms 

lvm2-2.02.32-4.el5    BUILT: Fri Apr  4 06:15:19 CDT 2008
lvm2-cluster-2.02.32-4.el5    BUILT: Wed Apr  2 03:56:50 CDT 2008
cmirror-1.1.22-1.el5    BUILT: Thu Jul 24 15:59:03 CDT 2008
kmod-cmirror-0.1.13-2.el5    BUILT: Thu Jul 24 16:00:48 CDT 2008
Comment 1 Corey Marthaler 2008-08-26 16:51:53 EDT
Created attachment 315047
log from hayes-01
Comment 2 Corey Marthaler 2008-08-26 16:52:24 EDT
Created attachment 315048
log from hayes-02
Comment 3 Corey Marthaler 2008-08-26 16:53:45 EDT
Created attachment 315049
log from hayes-03
Comment 4 RHEL Product and Program Management 2008-08-26 17:02:36 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 5 Corey Marthaler 2008-08-27 10:53:05 EDT
Reproduced this issue last night running the exact same load. Will attach kernel dumps from the 3 hayes nodes.
Comment 6 Corey Marthaler 2008-08-27 10:54:20 EDT
Created attachment 315107
2nd log from hayes-01
Comment 7 Corey Marthaler 2008-08-27 10:54:51 EDT
Created attachment 315108
2nd log from hayes-02
Comment 8 Corey Marthaler 2008-08-27 10:58:56 EDT
Created attachment 315109
2nd log from hayes-03
Comment 9 Jonathan Earl Brassow 2008-09-09 11:27:38 EDT
Hmmm, up-converting:

Aug 25 17:56:31 hayes-01 qarshd[17129]: Running cmdline: lvconvert -m 3 lock_stress/hayes-01.27240 

If this is up-converting from linear, then this is a new issue.  If this is up-converting from one mirror to another, then this is a known issue.  Do you know which was happening?
Comment 10 Corey Marthaler 2008-09-09 15:51:12 EDT
There are no up-converts from existing mirrors in this test, only up-converts from linears.

However, there are potential down-converts to mirrors with fewer legs in this test. It's possible that the mirror had been a '-m 4' mirror and was down-converting to a '-m 3' mirror. Without those logs, though, I can't be certain.
Comment 11 Jonathan Earl Brassow 2008-09-23 14:38:40 EDT
Are we sure that it doesn't just appear deadlocked, due to the fact that an upconvert now waits for the sync to complete... which can take a long time?
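One way to tell a slow resync from a real hang is to sample the mirror's sync percentage a few times at a fixed interval and check whether it advances. The following is a minimal sketch of that check, not something the test does: the function name `classify_sync` and the `min_delta` threshold are hypothetical, and it assumes the readings have already been collected by some other means (for example, from the Copy% column reported by `lvs`).

```python
from typing import Sequence


def classify_sync(samples: Sequence[float], min_delta: float = 0.01) -> str:
    """Classify sync-percent readings taken at fixed intervals.

    Hypothetical helper: returns "complete" if the last reading is 100,
    "progressing" if the readings advanced by at least min_delta between
    the first and last sample, and "stalled" otherwise.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to judge progress")
    if samples[-1] >= 100.0:
        return "complete"
    if samples[-1] - samples[0] >= min_delta:
        return "progressing"
    return "stalled"
```

A "stalled" result over a long enough window would point toward a hang rather than a slow sync; how long the window must be depends on mirror size and load, so that is left to the caller.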
Comment 12 Corey Marthaler 2008-09-30 18:19:08 EDT
The test doesn't fail due to a timeout; the test is happy thinking that the cmds are running fine. However, on the cluster the cmds (and lvm) were stuck.

That said, I haven't been able to reproduce this with the latest builds. I'll bump up the number of mirrors being used in this test and see what happens overnight.
Comment 13 Corey Marthaler 2008-10-01 10:11:39 EDT
I appear to be unable to reproduce this issue with the latest code. Will take off the blocker list but leave open in case it's seen again.

[root@hayes-02 ~]# /usr/tests/sts-rhel5.3/lvm2/bin/lvm_rpms 

lvm2-2.02.40-3.el5    BUILT: Thu Sep 25 14:59:07 CDT 2008
lvm2-cluster-2.02.40-3.el5    BUILT: Thu Sep 25 15:00:54 CDT 2008
device-mapper-1.02.28-2.el5    BUILT: Fri Sep 19 02:50:32 CDT 2008
cmirror-1.1.28-1.el5    BUILT: Tue Sep 30 15:48:54 CDT 2008
kmod-cmirror-0.1.18-1.el5    BUILT: Mon Sep 29 16:20:21 CDT 2008
Comment 14 Corey Marthaler 2009-06-11 12:02:32 EDT
I've been unable to reproduce the hang in this bug. I have, however, seen these operations take much longer than "normal", and for that I'll likely open a bug, but this one can be marked verified in the meantime.


lvm2-2.02.46-5.el5    BUILT: Sat Jun  6 16:29:23 CDT 2009
lvm2-cluster-2.02.46-5.el5    BUILT: Sat Jun  6 16:28:13 CDT 2009
device-mapper-1.02.32-1.el5    BUILT: Thu May 21 02:18:23 CDT 2009
cmirror-1.1.37-1.el5    BUILT: Tue May  5 11:46:05 CDT 2009
kmod-cmirror-0.1.21-14.el5    BUILT: Thu May 21 08:28:17 CDT 2009
Comment 15 Tom Parker 2009-12-28 19:22:43 EST
Just to add to the above comment, I am also unable to reproduce this bug despite numerous tries. Seems like an intermittent issue.
Comment 17 Alasdair Kergon 2010-04-27 11:04:26 EDT
Assuming this VERIFIED fix got released.  Closing.
Reopen if it's not yet resolved.
