Bug 429708 - RHEL5 cmirror tracker: cluster locking deadlock
Summary: RHEL5 cmirror tracker: cluster locking deadlock
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cmirror
Version: 5.2
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Target Release: ---
Assignee: Jonathan Earl Brassow
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 430797
Reported: 2008-01-22 15:59 UTC by Corey Marthaler
Modified: 2010-01-12 02:06 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-10-15 21:56:05 UTC
Target Upstream Version:
Embargoed:


Attachments
backtraces from hayes-01 (199.45 KB, text/plain)
2008-01-22 15:59 UTC, Corey Marthaler
backtraces from hayes-02 (199.25 KB, text/plain)
2008-01-22 16:00 UTC, Corey Marthaler
backtraces from hayes-03 (199.06 KB, text/plain)
2008-01-22 16:01 UTC, Corey Marthaler
log from hayes-01 (203.40 KB, text/plain)
2008-05-13 16:11 UTC, Corey Marthaler
log from hayes-02 (204.91 KB, text/plain)
2008-05-13 16:11 UTC, Corey Marthaler
log from hayes-03 (206.50 KB, text/plain)
2008-05-13 16:12 UTC, Corey Marthaler


Links
System ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata RHEA-2009:0158 | 0 | normal | SHIPPED_LIVE | new package: cmirror | 2009-01-20 16:05:16 UTC

Description Corey Marthaler 2008-01-22 15:59:04 UTC
Description of problem:
Running multiple cmirror config operations results in a deadlock.

[root@hayes-01 tmp]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    920   2008-01-21 13:55:51  hayes-01
   2   M    928   2008-01-21 13:55:53  hayes-02
   3   M    924   2008-01-21 13:55:53  hayes-03

[root@hayes-01 tmp]# ./lvm_backtraces.pl
Backtrace for lvcreate-m1-nhayes-01.3664-L500Mlock_stress (28958):
#1  0x0000000000452ce5 in init_cluster_locking ()
#2  0x0000000000453125 in init_cluster_locking ()
#3  0x000000000045336c in init_cluster_locking ()
#4  0x000000000043ed9f in reset_locking ()
#5  0x000000000043f11b in lock_vol ()
#6  0x0000000000411a68 in lvcreate ()
#7  0x0000000000413444 in lvm_run_command ()
#8  0x0000000000415657 in lvm2_main ()
#9  0x0000003946a1d8a4 in __libc_start_main () from /lib64/libc.so.6


[root@hayes-02 tmp]# ./lvm_backtraces.pl
Backtrace for lvslock_stress/hayes-02.3666--noheadings-ocopy_percent (29701):
#1  0x0000000000452ce5 in init_cluster_locking ()
#2  0x0000000000453125 in init_cluster_locking ()
#3  0x000000000045336c in init_cluster_locking ()
#4  0x000000000043ed9f in reset_locking ()
#5  0x000000000043f11b in lock_vol ()
#6  0x0000000000420933 in process_each_lv ()
#7  0x000000000041d640 in pvscan ()
#8  0x0000000000413444 in lvm_run_command ()
#9  0x0000000000415657 in lvm2_main ()
#10 0x0000003473e1d8a4 in __libc_start_main () from /lib64/libc.so.6


[root@hayes-03 tmp]# ./lvm_backtraces.pl
Backtrace for lvcreate-m1-nhayes-03.3662-L500Mlock_stress (30653):
#1  0x0000000000452ce5 in init_cluster_locking ()
#2  0x0000000000453125 in init_cluster_locking ()
#3  0x000000000045336c in init_cluster_locking ()
#4  0x000000000043ed9f in reset_locking ()
#5  0x000000000043f11b in lock_vol ()
#6  0x0000000000411a68 in lvcreate ()
#7  0x0000000000413444 in lvm_run_command ()
#8  0x0000000000415657 in lvm2_main ()
#9  0x000000363a61d8a4 in __libc_start_main () from /lib64/libc.so.6

Backtrace for lvs-a-o+devices (30790):
#1  0x0000000000452ce5 in init_cluster_locking ()
#2  0x0000000000453125 in init_cluster_locking ()
#3  0x000000000045336c in init_cluster_locking ()
#4  0x000000000043ed9f in reset_locking ()
#5  0x000000000043f11b in lock_vol ()
#6  0x0000000000420933 in process_each_lv ()
#7  0x000000000041d640 in pvscan ()
#8  0x0000000000413444 in lvm_run_command ()
#9  0x0000000000415657 in lvm2_main ()
#10 0x000000363a61d8a4 in __libc_start_main () from /lib64/libc.so.6
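
To compare the traces across nodes, the listings above can be summarised by their innermost frame. The following is a minimal Python sketch, not part of the original report: the function names `parse_backtraces` and `group_by_innermost` are hypothetical, the parser assumes the exact line layout printed by `./lvm_backtraces.pl` above, and `SAMPLE` is an abridged copy of the hayes-03 output.

```python
import re
from collections import defaultdict

# Line layouts assumed from the lvm_backtraces.pl output quoted in this report.
HEADER = re.compile(r"^Backtrace for (?P<name>.+) \((?P<pid>\d+)\):$")
FRAME = re.compile(r"^#(?P<idx>\d+)\s+0x[0-9a-f]+ in (?P<func>\w+) \(\)")


def parse_backtraces(text):
    """Return {(process name, pid): [function names, innermost first]}."""
    traces = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            current = (header.group("name"), int(header.group("pid")))
            traces[current] = []
            continue
        frame = FRAME.match(line)
        if frame and current is not None:
            traces[current].append(frame.group("func"))
    return traces


def group_by_innermost(traces):
    """Map each innermost function name to the pids stopped inside it."""
    groups = defaultdict(list)
    for (_name, pid), funcs in traces.items():
        if funcs:
            groups[funcs[0]].append(pid)
    return dict(groups)


# Abridged from the hayes-03 listing above (some frames omitted).
SAMPLE = """\
Backtrace for lvcreate-m1-nhayes-03.3662-L500Mlock_stress (30653):
#1  0x0000000000452ce5 in init_cluster_locking ()
#4  0x000000000043ed9f in reset_locking ()
#5  0x000000000043f11b in lock_vol ()
#6  0x0000000000411a68 in lvcreate ()

Backtrace for lvs-a-o+devices (30790):
#1  0x0000000000452ce5 in init_cluster_locking ()
#5  0x000000000043f11b in lock_vol ()
#6  0x0000000000420933 in process_each_lv ()
"""

if __name__ == "__main__":
    for func, pids in group_by_innermost(parse_backtraces(SAMPLE)).items():
        print(func, pids)
```

The symbol names are taken as printed. If the binary lacks full debug information they may only be the nearest exported symbol (note `pvscan` appearing under an `lvs` process), so the grouping is a hint about where the processes wait rather than proof.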


Version-Release number of selected component (if applicable):
2.6.18-62.el5
cmirror-1.1.5-4.el5
kmod-cmirror-0.1.2-1.el5.scratch.1

Comment 1 Corey Marthaler 2008-01-22 15:59:58 UTC
Created attachment 292527
backtraces from hayes-01

Comment 2 Corey Marthaler 2008-01-22 16:00:55 UTC
Created attachment 292528
backtraces from hayes-02

Comment 3 Corey Marthaler 2008-01-22 16:01:32 UTC
Created attachment 292529
backtraces from hayes-03

Comment 4 Jonathan Earl Brassow 2008-01-24 15:30:31 UTC
What are "cmirror config operations"?  converting/creating/removing?

I haven't looked at the attachments yet, but it doesn't look like it's hit
cmirror code yet...  LVM/device-mapper has changed quite a bit in this release -
it could be there too.

Something that convinces you that it's cmirror?


Comment 5 Alasdair Kergon 2008-01-24 16:23:06 UTC
need -vvvv for all three commands to check for new userspace deadlocks - or lock
manager state showing which locks are held where in the cluster
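
One way to collect the requested `-vvvv` output is sketched below in Python. This is an illustration only: the helper names `with_debug` and `run_with_debug` and the log path are hypothetical, and the sketch assumes the LVM command writes its verbose output to stderr.

```python
import subprocess


def with_debug(argv, level=4):
    """Return a copy of an LVM argv with -vvvv inserted after the command name."""
    return [argv[0], "-" + "v" * level] + list(argv[1:])


def run_with_debug(argv, log_path):
    """Run the command and send stderr (assumed to carry the verbose output) to log_path."""
    with open(log_path, "w") as log:
        return subprocess.run(with_debug(argv), stderr=log).returncode


if __name__ == "__main__":
    # Only prints the command line; call run_with_debug() on a node with LVM installed.
    print(" ".join(with_debug(["lvs", "-a", "-o", "+devices"])))
```

Running each of the three stuck commands this way, one log per node, would give the traces asked for here.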

Comment 6 Corey Marthaler 2008-01-24 16:26:50 UTC
The operations, run from multiple nodes in the cluster to multiple mirrors, are
as follows: create, down convert, up convert, core log convert, disk log
convert, deactivate, reactivate, deactivate, delete.
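
For readers unfamiliar with these operations, the sequence listed above might map onto LVM commands roughly as in the Python sketch below. The exact commands used by the test are not given in this report, so everything here is an assumption: the function name `mirror_lifecycle` is hypothetical, the volume group and LV names are inferred from the process name `lvcreate-m1-nhayes-01.3664-L500Mlock_stress`, and the option spellings follow current LVM man pages and may differ from the lvm2 build under test.

```python
def mirror_lifecycle(vg, lv, size="500M"):
    """Return the nine operations as (label, argv) pairs; commands are an assumed mapping."""
    target = f"{vg}/{lv}"
    return [
        ("create", ["lvcreate", "-m", "1", "-n", lv, "-L", size, vg]),
        ("down convert", ["lvconvert", "-m", "0", target]),
        ("up convert", ["lvconvert", "-m", "1", target]),
        ("core log convert", ["lvconvert", "--mirrorlog", "core", target]),
        ("disk log convert", ["lvconvert", "--mirrorlog", "disk", target]),
        ("deactivate", ["lvchange", "-an", target]),
        ("reactivate", ["lvchange", "-ay", target]),
        ("deactivate", ["lvchange", "-an", target]),
        ("delete", ["lvremove", "-f", target]),
    ]


if __name__ == "__main__":
    # Names inferred from the process title in the description, not confirmed by the test.
    for label, argv in mirror_lifecycle("lock_stress", "hayes-01.3664"):
        print(f"{label:18} {' '.join(argv)}")
```

The sketch only builds the command lines; the stress comes from each node running such a sequence concurrently against its own mirrors.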

Comment 8 Corey Marthaler 2008-02-04 16:17:02 UTC
This deadlock can be triggered by just having many mirrors to activate during
clvmd start up.

Comment 9 Jonathan Earl Brassow 2008-02-04 17:13:36 UTC
many == ?


Comment 10 Corey Marthaler 2008-02-04 21:51:35 UTC
There were 28 mirrors in that case; however, I spoke too soon. It didn't deadlock:
it eventually timed out, and then the stream of 'dm-log-clustered:' errors
followed, so comment #8 is most likely one of the other open BZs.

Comment 12 Jonathan Earl Brassow 2008-02-15 16:00:58 UTC
corey, is this still reproducible?

Comment 13 Corey Marthaler 2008-02-18 20:39:10 UTC
I'm unable to reproduce this, so this may be fixed. However, I do eventually
hit another bz, so verifying this may be blocked behind 429599.

Comment 14 Jonathan Earl Brassow 2008-02-28 21:56:40 UTC
I think 429599 has been cleared... ok to have another go at this?

Comment 15 Corey Marthaler 2008-05-13 14:41:20 UTC
I appear to have hit this last night, again while running cmirror_lock_stress.

./cmirror_lock_stress -l /home/msp/cmarthal/work/rhel5/sts-root -r /usr/tests/sts-rhel5.2 -R ../../var/share/resource_files/hayes.xml -m 2

I saw a lot of the following messages on the console:
"dm-log-clustered: Stray request returned:"
"dm-log-clustered: Excessive delay in request processing, 1 sec"

There are no mirror or mirror components currently listed with dmsetup

I'll attach a dump of the stuck processes...
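
To put a number on "a lot" for the console messages quoted above, the attached logs could be tallied with a small script along these lines. This is a Python sketch only: the function name `count_messages` is hypothetical, and the two patterns are exactly the strings quoted in this comment.

```python
from collections import Counter

# The two message prefixes quoted in this comment.
PATTERNS = (
    "dm-log-clustered: Stray request returned",
    "dm-log-clustered: Excessive delay in request processing",
)


def count_messages(lines):
    """Count how many lines contain each of the quoted dm-log-clustered messages."""
    counts = Counter()
    for line in lines:
        for pattern in PATTERNS:
            if pattern in line:
                counts[pattern] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_messages.py < saved-log-file
    for pattern, n in count_messages(sys.stdin).items():
        print(f"{n:6d}  {pattern}")
```

Comparing the counts per node would show whether one machine is producing most of the delays.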

[root@hayes-01 ~]# rpm -qi lvm2
Name        : lvm2                         Relocations: (not relocatable)
Version     : 2.02.32                           Vendor: Red Hat, Inc.
Release     : 4.el5                         Build Date: Fri 04 Apr 2008 06:15:19 AM CDT
Install Date: Mon 05 May 2008 11:26:24 AM CDT      Build Host: hs20-bc2-3.build.redhat.com

[root@hayes-01 ~]# rpm -qi lvm2-cluster
Name        : lvm2-cluster                 Relocations: (not relocatable)
Version     : 2.02.32                           Vendor: Red Hat, Inc.
Release     : 4.el5                         Build Date: Wed 02 Apr 2008 03:56:50 AM CDT
Install Date: Mon 05 May 2008 11:27:46 AM CDT      Build Host: hs20-bc2-3.build.redhat.com

[root@hayes-01 ~]# rpm -qi cmirror
Name        : cmirror                      Relocations: (not relocatable)
Version     : 1.1.17                            Vendor: Red Hat, Inc.
Release     : 1.el5                         Build Date: Fri 09 May 2008 11:33:43 AM CDT
Install Date: Fri 09 May 2008 03:57:23 PM CDT      Build Host: hs20-bc1-7.build.redhat.com

[root@hayes-01 ~]# rpm -qi kmod-cmirror
Name        : kmod-cmirror                 Relocations: (not relocatable)
Version     : 0.1.9                             Vendor: Red Hat, Inc.
Release     : 1.el5                         Build Date: Thu 08 May 2008 02:28:27 PM CDT
Install Date: Fri 09 May 2008 09:17:33 AM CDT      Build Host: hs20-bc2-4.build.redhat.com

[root@hayes-01 ~]# rpm -qi openais
Name        : openais                      Relocations: (not relocatable)
Version     : 0.80.3                            Vendor: Red Hat, Inc.
Release     : 15.el5                        Build Date: Wed 02 Apr 2008 03:42:29 AM CDT
Install Date: Mon 05 May 2008 11:26:27 AM CDT      Build Host: ls20-bc2-13.build.redhat.com


Comment 16 Corey Marthaler 2008-05-13 16:11:13 UTC
Created attachment 305252
log from hayes-01

Comment 17 Corey Marthaler 2008-05-13 16:11:49 UTC
Created attachment 305253
log from hayes-02

Comment 18 Corey Marthaler 2008-05-13 16:12:46 UTC
Created attachment 305254
log from hayes-03

Comment 22 Corey Marthaler 2008-10-15 21:56:05 UTC
This bz has not been seen in 5 months, or is possibly the same as bz 460222. Will reopen if needed.

