Bug 460222 - RHEL5 cmirror tracker: multiple cmirror operations cause deadlock
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cmirror
Version: 5.3
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
: ---
Assignee: Jonathan Earl Brassow
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-08-26 20:50 UTC by Corey Marthaler
Modified: 2010-04-27 15:04 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-04-27 15:04:26 UTC
Target Upstream Version:
Embargoed:


Attachments
log from hayes-01 (212.12 KB, text/plain)
2008-08-26 20:51 UTC, Corey Marthaler
log from hayes-02 (433.12 KB, text/plain)
2008-08-26 20:52 UTC, Corey Marthaler
log from hayes-03 (429.56 KB, text/plain)
2008-08-26 20:53 UTC, Corey Marthaler
2nd log from hayes-01 (201.42 KB, text/plain)
2008-08-27 14:54 UTC, Corey Marthaler
2nd log from hayes-02 (202.13 KB, text/plain)
2008-08-27 14:54 UTC, Corey Marthaler
2nd log from hayes-03 (202.16 KB, text/plain)
2008-08-27 14:58 UTC, Corey Marthaler

Description Corey Marthaler 2008-08-26 20:50:04 UTC
Description of problem:
I was running our cmirror lock stress test on the hayes cluster (hayes-0[123]) to 3 mirrors per machine and saw this deadlock.

./cmirror_lock_stress -R ../../var/share/resource_files/hayes.xml -l /home/msp/cmarthal/work/rhel5/sts-root -r /usr/tests/sts-rhel5.3 -m 3

I'll attach the kernel dumps from all three machines.

Version-Release number of selected component (if applicable):
[root@hayes-03 tmp]# /usr/tests/sts-rhel5.3/lvm2/bin/lvm_rpms 
2.6.18-92.el5

lvm2-2.02.32-4.el5    BUILT: Fri Apr  4 06:15:19 CDT 2008
lvm2-cluster-2.02.32-4.el5    BUILT: Wed Apr  2 03:56:50 CDT 2008
cmirror-1.1.22-1.el5    BUILT: Thu Jul 24 15:59:03 CDT 2008
kmod-cmirror-0.1.13-2.el5    BUILT: Thu Jul 24 16:00:48 CDT 2008

Comment 1 Corey Marthaler 2008-08-26 20:51:53 UTC
Created attachment 315047
log from hayes-01

Comment 2 Corey Marthaler 2008-08-26 20:52:24 UTC
Created attachment 315048
log from hayes-02

Comment 3 Corey Marthaler 2008-08-26 20:53:45 UTC
Created attachment 315049
log from hayes-03

Comment 4 RHEL Program Management 2008-08-26 21:02:36 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 5 Corey Marthaler 2008-08-27 14:53:05 UTC
Reproduced this issue last night running the exact same load. Will attach kernel dumps from the 3 hayes nodes.

Comment 6 Corey Marthaler 2008-08-27 14:54:20 UTC
Created attachment 315107
2nd log from hayes-01

Comment 7 Corey Marthaler 2008-08-27 14:54:51 UTC
Created attachment 315108
2nd log from hayes-02

Comment 8 Corey Marthaler 2008-08-27 14:58:56 UTC
Created attachment 315109
2nd log from hayes-03

Comment 9 Jonathan Earl Brassow 2008-09-09 15:27:38 UTC
Hmmm, up-converting:

Aug 25 17:56:31 hayes-01 qarshd[17129]: Running cmdline: lvconvert -m 3 lock_stress/hayes-01.27240 

If this is up-converting from linear, then this is a new issue.  If this is up-converting from one mirror to another, then this is a known issue.  Do you know which was happening?

Comment 10 Corey Marthaler 2008-09-09 19:51:12 UTC
There are no up-converts from existing mirrors in this test, only up-converts from linears.

However, there are potential down-converts to mirrors with fewer legs in this test. It's possible that the mirror had been a '-m 4' mirror and was down-converting to a '-m 3' mirror. Without those logs, though, I can't be certain.
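For anyone retracing this from the attached logs, here is a minimal sketch of how the up-convert vs. down-convert question could be answered mechanically. It assumes the qarshd line format quoted in comment 9 and assumes every logged lvconvert actually took effect; the function name and the "kind" labels are made up for illustration. When the log holds no earlier lvconvert for a volume, the prior state is reported as unknown rather than guessed.

```python
import re

# Matches the qarshd line format quoted in comment 9; other lines are ignored.
CONVERT_RE = re.compile(r"Running cmdline: lvconvert -m (\d+) (\S+)")


def classify_converts(lines):
    """Return (lv, previous, requested, kind) for each lvconvert line, in order.

    'previous' is the -m value of the most recent earlier lvconvert seen for
    the same LV, or None if the log has none (prior state unknown).
    Assumption: each logged lvconvert succeeded.
    """
    last_seen = {}  # LV name -> -m value from the most recent lvconvert
    results = []
    for line in lines:
        match = CONVERT_RE.search(line)
        if match is None:
            continue
        requested = int(match.group(1))
        lv = match.group(2)
        previous = last_seen.get(lv)
        if previous is None:
            kind = "unknown"
        elif requested > previous:
            kind = "upconvert"
        elif requested < previous:
            kind = "downconvert"
        else:
            kind = "unchanged"
        results.append((lv, previous, requested, kind))
        last_seen[lv] = requested
    return results
```

Feeding it the lines of /var/log/messages from each node would list every conversion per volume; an "unknown" entry means the log does not go back far enough, which is the situation described above.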

Comment 11 Jonathan Earl Brassow 2008-09-23 18:38:40 UTC
Are we sure that it doesn't just appear deadlocked, due to the fact that an upconvert now waits for the sync to complete... which can take a long time?
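One way to tell a long sync from a real hang is to sample the mirror's copy percentage over time and check whether it is still moving. The sketch below only does the comparison; it assumes the samples were collected some other way (for example from the copy percentage that lvs reports), and the function name and window parameter are hypothetical. Note that if lvs itself blocks, no samples arrive at all, which is already a strong hint of a hang.

```python
def is_stalled(samples, window_seconds):
    """Report whether a mirror sync has made no progress for window_seconds.

    samples: list of (unix_time, copy_percent) tuples in time order.
    Returns False when there are no samples or the sync has reached 100%.
    """
    if not samples:
        return False
    latest_time, latest_percent = samples[-1]
    if latest_percent >= 100.0:
        return False  # sync finished, nothing left to wait for

    # Walk backwards to find how long the percentage has sat at its latest value.
    flat_since = latest_time
    for sample_time, percent in reversed(samples):
        if percent != latest_percent:
            break
        flat_since = sample_time
    return latest_time - flat_since >= window_seconds
```

A slow but healthy up-convert keeps returning False because the percentage keeps changing; a flat percentage for longer than the chosen window is worth a closer look at the kernel dumps.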

Comment 12 Corey Marthaler 2008-09-30 22:19:08 UTC
The test doesn't fail due to a timeout; the test is happy, thinking that the cmds are running fine. However, on the cluster the cmds (and lvm) were stuck.

That said, I haven't been able to reproduce this with the latest builds. I'll bump up the number of mirrors being used in this test and see what happens over night.

Comment 13 Corey Marthaler 2008-10-01 14:11:39 UTC
I appear to be unable to reproduce this issue with the latest code. Will take off the blocker list but leave open in case it's seen again.

[root@hayes-02 ~]# /usr/tests/sts-rhel5.3/lvm2/bin/lvm_rpms 
2.6.18-115.gfs2abhi.001

lvm2-2.02.40-3.el5    BUILT: Thu Sep 25 14:59:07 CDT 2008
lvm2-cluster-2.02.40-3.el5    BUILT: Thu Sep 25 15:00:54 CDT 2008
device-mapper-1.02.28-2.el5    BUILT: Fri Sep 19 02:50:32 CDT 2008
cmirror-1.1.28-1.el5    BUILT: Tue Sep 30 15:48:54 CDT 2008
kmod-cmirror-0.1.18-1.el5    BUILT: Mon Sep 29 16:20:21 CDT 2008

Comment 14 Corey Marthaler 2009-06-11 16:02:32 UTC
I've been unable to reproduce the hang in this bug. I have, however, seen these operations take much longer than "normal", and for that I'll likely open a bug, but this one can be marked verified in the meantime.

2.6.18-150.el5

lvm2-2.02.46-5.el5    BUILT: Sat Jun  6 16:29:23 CDT 2009
lvm2-cluster-2.02.46-5.el5    BUILT: Sat Jun  6 16:28:13 CDT 2009
device-mapper-1.02.32-1.el5    BUILT: Thu May 21 02:18:23 CDT 2009
cmirror-1.1.37-1.el5    BUILT: Tue May  5 11:46:05 CDT 2009
kmod-cmirror-0.1.21-14.el5    BUILT: Thu May 21 08:28:17 CDT 2009

Comment 15 Tom Parker 2009-12-29 00:22:43 UTC
Just to add to the above comment, I am also unable to reproduce this bug despite numerous tries. Seems like an intermittent issue.

Comment 17 Alasdair Kergon 2010-04-27 15:04:26 UTC
Assuming this VERIFIED fix got released.  Closing.
Reopen if it's not yet resolved.

