Bug 429708 - RHEL5 cmirror tracker: cluster locking deadlock
Status: CLOSED WORKSFORME
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cmirror
Version: 5.2
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Target Release: ---
Assigned To: Jonathan Earl Brassow
QA Contact: Cluster QE
Keywords: TestBlocker
Depends On:
Blocks: 430797
Reported: 2008-01-22 10:59 EST by Corey Marthaler
Modified: 2010-01-11 21:06 EST

Doc Type: Bug Fix
Last Closed: 2008-10-15 17:56:05 EDT

Attachments
backtraces from hayes-01 (199.45 KB, text/plain)
2008-01-22 10:59 EST, Corey Marthaler
backtraces from hayes-02 (199.25 KB, text/plain)
2008-01-22 11:00 EST, Corey Marthaler
backtraces from hayes-03 (199.06 KB, text/plain)
2008-01-22 11:01 EST, Corey Marthaler
log from hayes-01 (203.40 KB, text/plain)
2008-05-13 12:11 EDT, Corey Marthaler
log from hayes-02 (204.91 KB, text/plain)
2008-05-13 12:11 EDT, Corey Marthaler
log from hayes-03 (206.50 KB, text/plain)
2008-05-13 12:12 EDT, Corey Marthaler

Description Corey Marthaler 2008-01-22 10:59:04 EST
Description of problem:
Running multiple cmirror config operations results in a deadlock.

[root@hayes-01 tmp]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    920   2008-01-21 13:55:51  hayes-01
   2   M    928   2008-01-21 13:55:53  hayes-02
   3   M    924   2008-01-21 13:55:53  hayes-03

[root@hayes-01 tmp]# ./lvm_backtraces.pl
Backtrace for lvcreate-m1-nhayes-01.3664-L500Mlock_stress (28958):
#1  0x0000000000452ce5 in init_cluster_locking ()
#2  0x0000000000453125 in init_cluster_locking ()
#3  0x000000000045336c in init_cluster_locking ()
#4  0x000000000043ed9f in reset_locking ()
#5  0x000000000043f11b in lock_vol ()
#6  0x0000000000411a68 in lvcreate ()
#7  0x0000000000413444 in lvm_run_command ()
#8  0x0000000000415657 in lvm2_main ()
#9  0x0000003946a1d8a4 in __libc_start_main () from /lib64/libc.so.6


[root@hayes-02 tmp]# ./lvm_backtraces.pl
Backtrace for lvslock_stress/hayes-02.3666--noheadings-ocopy_percent (29701):
#1  0x0000000000452ce5 in init_cluster_locking ()
#2  0x0000000000453125 in init_cluster_locking ()
#3  0x000000000045336c in init_cluster_locking ()
#4  0x000000000043ed9f in reset_locking ()
#5  0x000000000043f11b in lock_vol ()
#6  0x0000000000420933 in process_each_lv ()
#7  0x000000000041d640 in pvscan ()
#8  0x0000000000413444 in lvm_run_command ()
#9  0x0000000000415657 in lvm2_main ()
#10 0x0000003473e1d8a4 in __libc_start_main () from /lib64/libc.so.6


[root@hayes-03 tmp]# ./lvm_backtraces.pl
Backtrace for lvcreate-m1-nhayes-03.3662-L500Mlock_stress (30653):
#1  0x0000000000452ce5 in init_cluster_locking ()
#2  0x0000000000453125 in init_cluster_locking ()
#3  0x000000000045336c in init_cluster_locking ()
#4  0x000000000043ed9f in reset_locking ()
#5  0x000000000043f11b in lock_vol ()
#6  0x0000000000411a68 in lvcreate ()
#7  0x0000000000413444 in lvm_run_command ()
#8  0x0000000000415657 in lvm2_main ()
#9  0x000000363a61d8a4 in __libc_start_main () from /lib64/libc.so.6

Backtrace for lvs-a-o+devices (30790):
#1  0x0000000000452ce5 in init_cluster_locking ()
#2  0x0000000000453125 in init_cluster_locking ()
#3  0x000000000045336c in init_cluster_locking ()
#4  0x000000000043ed9f in reset_locking ()
#5  0x000000000043f11b in lock_vol ()
#6  0x0000000000420933 in process_each_lv ()
#7  0x000000000041d640 in pvscan ()
#8  0x0000000000413444 in lvm_run_command ()
#9  0x0000000000415657 in lvm2_main ()
#10 0x000000363a61d8a4 in __libc_start_main () from /lib64/libc.so.6


Version-Release number of selected component (if applicable):
2.6.18-62.el5
cmirror-1.1.5-4.el5
kmod-cmirror-0.1.2-1.el5.scratch.1
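Every backtrace in the description shares the same innermost frames, from lock_vol() down into init_cluster_locking(). Below is a minimal sketch of checking that mechanically, assuming the text format printed by lvm_backtraces.pl above. The helper names parse_backtraces and common_inner_frames are hypothetical and not part of any LVM tooling; the sample text is the hayes-03 output copied from the description.

```python
import re

# Line formats as they appear in the lvm_backtraces.pl output above.
HEADER = re.compile(r"^Backtrace for (?P<name>.+) \((?P<pid>\d+)\):$")
FRAME = re.compile(r"^#\d+\s+0x[0-9a-f]+ in (?P<func>\w+) \(\)")

SAMPLE = """\
Backtrace for lvcreate-m1-nhayes-03.3662-L500Mlock_stress (30653):
#1  0x0000000000452ce5 in init_cluster_locking ()
#2  0x0000000000453125 in init_cluster_locking ()
#3  0x000000000045336c in init_cluster_locking ()
#4  0x000000000043ed9f in reset_locking ()
#5  0x000000000043f11b in lock_vol ()
#6  0x0000000000411a68 in lvcreate ()
#7  0x0000000000413444 in lvm_run_command ()
#8  0x0000000000415657 in lvm2_main ()
#9  0x000000363a61d8a4 in __libc_start_main () from /lib64/libc.so.6

Backtrace for lvs-a-o+devices (30790):
#1  0x0000000000452ce5 in init_cluster_locking ()
#2  0x0000000000453125 in init_cluster_locking ()
#3  0x000000000045336c in init_cluster_locking ()
#4  0x000000000043ed9f in reset_locking ()
#5  0x000000000043f11b in lock_vol ()
#6  0x0000000000420933 in process_each_lv ()
#7  0x000000000041d640 in pvscan ()
#8  0x0000000000413444 in lvm_run_command ()
#9  0x0000000000415657 in lvm2_main ()
#10 0x000000363a61d8a4 in __libc_start_main () from /lib64/libc.so.6
"""


def parse_backtraces(text):
    """Return {pid: [function names, innermost frame first]}."""
    stacks = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        header = HEADER.match(line)
        if header:
            current = int(header.group("pid"))
            stacks[current] = []
            continue
        frame = FRAME.match(line)
        if frame and current is not None:
            stacks[current].append(frame.group("func"))
    return stacks


def common_inner_frames(stacks):
    """Return the run of innermost frames shared by every process."""
    shared = []
    for frames in zip(*stacks.values()):
        if len(set(frames)) != 1:
            break
        shared.append(frames[0])
    return shared


if __name__ == "__main__":
    print(common_inner_frames(parse_backtraces(SAMPLE)))
```

The comparison uses the symbol names exactly as printed. If the binary is stripped, those names may be approximate, so treat a match as a hint rather than proof that the processes wait on the same lock.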
Comment 1 Corey Marthaler 2008-01-22 10:59:58 EST
Created attachment 292527 [details]
backtraces from hayes-01
Comment 2 Corey Marthaler 2008-01-22 11:00:55 EST
Created attachment 292528 [details]
backtraces from hayes-02
Comment 3 Corey Marthaler 2008-01-22 11:01:32 EST
Created attachment 292529 [details]
backtraces from hayes-03
Comment 4 Jonathan Earl Brassow 2008-01-24 10:30:31 EST
What are "cmirror config operations"?  converting/creating/removing?

I haven't looked at the attachments yet, but it doesn't look like it's hit
cmirror code yet...  LVM/device-mapper has changed quite a bit in this release -
it could be there too.

Something that convinces you that it's cmirror?
Comment 5 Alasdair Kergon 2008-01-24 11:23:06 EST
Need -vvvv output for all three commands to check for new userspace deadlocks,
or the lock manager state showing which locks are held where in the cluster.
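One way to collect the -vvvv output requested above without leaving a hung command blocking the shell is to run each command under a timeout and log everything it prints. This is a rough sketch only: run_verbose is a hypothetical helper, the 300-second default is an arbitrary assumption, and a process stuck in uninterruptible sleep may not respond to the kill that follows the timeout.

```python
import subprocess


def run_verbose(cmd, log_path, timeout_s=300):
    """Run cmd with -vvvv appended, logging stdout and stderr to log_path.

    Returns the exit code, or None if the command did not finish within
    timeout_s seconds (a possible sign of the hang described in this bug).
    """
    with open(log_path, "wb") as log:
        try:
            done = subprocess.run(
                list(cmd) + ["-vvvv"],
                stdout=log,
                stderr=subprocess.STDOUT,  # keep debug and normal output together
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return None
    return done.returncode
```

For example, run_verbose(["lvs", "-a", "-o", "+devices"], "lvs-vvvv.log") would cover one of the commands seen in the backtraces. That argument list is reconstructed from the process title in the description, so adjust it to the command actually run.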
Comment 6 Corey Marthaler 2008-01-24 11:26:50 EST
The operations, run from multiple nodes in the cluster to multiple mirrors, are
as follows: create, down convert, up convert, core log convert, disk log
convert, deactivate, reactivate, deactivate, delete.
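The sequence listed above can be written out as LVM command lines. This is a sketch only. The lvcreate form is taken from the process titles in the description (lvcreate -m1 -n ... -L500M lock_stress). The remaining command lines are assumptions about what the test runs, and option spellings may differ between LVM2 versions (for example --corelog versus --mirrorlog core), so check lvconvert(8) for the installed release. The function mirror_lifecycle is a hypothetical helper and only builds the commands; it executes nothing.

```python
def mirror_lifecycle(vg, lv, size="500M"):
    """Build the per-mirror command sequence as argument lists."""
    target = f"{vg}/{lv}"
    return [
        ["lvcreate", "-m", "1", "-n", lv, "-L", size, vg],  # create
        ["lvconvert", "-m", "0", target],                   # down convert
        ["lvconvert", "-m", "1", target],                   # up convert
        ["lvconvert", "--mirrorlog", "core", target],       # core log convert
        ["lvconvert", "--mirrorlog", "disk", target],       # disk log convert
        ["lvchange", "-an", target],                        # deactivate
        ["lvchange", "-ay", target],                        # reactivate
        ["lvchange", "-an", target],                        # deactivate
        ["lvremove", "-f", target],                         # delete
    ]


if __name__ == "__main__":
    # Volume group and LV names as seen in the backtrace process titles.
    for argv in mirror_lifecycle("lock_stress", "hayes-01.3664"):
        print(" ".join(argv))
```

Running this sequence concurrently from each node against different mirrors would approximate the load described, though the real test harness may order or parameterize the steps differently.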
Comment 8 Corey Marthaler 2008-02-04 11:17:02 EST
This deadlock can be triggered by just having many mirrors to activate during
clvmd start up.
Comment 9 Jonathan Earl Brassow 2008-02-04 12:13:36 EST
many == ?
Comment 10 Corey Marthaler 2008-02-04 16:51:35 EST
There were 28 mirrors in that case. However, I spoke too soon: it didn't deadlock,
it eventually timed out and then the stream of 'dm-log-clustered:' errors
followed, so comment #8 is most likely one of the other open BZs.
Comment 12 Jonathan Earl Brassow 2008-02-15 11:00:58 EST
Corey, is this still reproducible?
Comment 13 Corey Marthaler 2008-02-18 15:39:10 EST
I'm unable to reproduce this, so this may be fixed. However, I do eventually
hit another bz, so verifying this may be blocked behind 429599.
Comment 14 Jonathan Earl Brassow 2008-02-28 16:56:40 EST
I think 429599 has been cleared... ok to have another go at this?
Comment 15 Corey Marthaler 2008-05-13 10:41:20 EDT
I appear to have hit this last night, again while running cmirror_lock_stress.

./cmirror_lock_stress -l /home/msp/cmarthal/work/rhel5/sts-root -r /usr/tests/sts-rhel5.2 -R ../../var/share/resource_files/hayes.xml -m 2

I saw a lot of the following messages on the console:
"dm-log-clustered: Stray request returned:"
"dm-log-clustered: Excessive delay in request processing, 1 sec"

There are no mirrors or mirror components currently listed by dmsetup.

I'll attach a dump of the stuck processes...

[root@hayes-01 ~]# rpm -qi lvm2
Name        : lvm2                         Relocations: (not relocatable)
Version     : 2.02.32                           Vendor: Red Hat, Inc.
Release     : 4.el5                         Build Date: Fri 04 Apr 2008 06:15:19 AM CDT
Install Date: Mon 05 May 2008 11:26:24 AM CDT      Build Host: hs20-bc2-3.build.redhat.com

[root@hayes-01 ~]# rpm -qi lvm2-cluster
Name        : lvm2-cluster                 Relocations: (not relocatable)
Version     : 2.02.32                           Vendor: Red Hat, Inc.
Release     : 4.el5                         Build Date: Wed 02 Apr 2008 03:56:50 AM CDT
Install Date: Mon 05 May 2008 11:27:46 AM CDT      Build Host: hs20-bc2-3.build.redhat.com

[root@hayes-01 ~]# rpm -qi cmirror
Name        : cmirror                      Relocations: (not relocatable)
Version     : 1.1.17                            Vendor: Red Hat, Inc.
Release     : 1.el5                         Build Date: Fri 09 May 2008 11:33:43 AM CDT
Install Date: Fri 09 May 2008 03:57:23 PM CDT      Build Host: hs20-bc1-7.build.redhat.com

[root@hayes-01 ~]# rpm -qi kmod-cmirror
Name        : kmod-cmirror                 Relocations: (not relocatable)
Version     : 0.1.9                             Vendor: Red Hat, Inc.
Release     : 1.el5                         Build Date: Thu 08 May 2008 02:28:27 PM CDT
Install Date: Fri 09 May 2008 09:17:33 AM CDT      Build Host: hs20-bc2-4.build.redhat.com

[root@hayes-01 ~]# rpm -qi openais
Name        : openais                      Relocations: (not relocatable)
Version     : 0.80.3                            Vendor: Red Hat, Inc.
Release     : 15.el5                        Build Date: Wed 02 Apr 2008 03:42:29 AM CDT
Install Date: Mon 05 May 2008 11:26:27 AM CDT      Build Host: ls20-bc2-13.build.redhat.com
Comment 16 Corey Marthaler 2008-05-13 12:11:13 EDT
Created attachment 305252 [details]
log from hayes-01
Comment 17 Corey Marthaler 2008-05-13 12:11:49 EDT
Created attachment 305253 [details]
log from hayes-02
Comment 18 Corey Marthaler 2008-05-13 12:12:46 EDT
Created attachment 305254 [details]
log from hayes-03
Comment 22 Corey Marthaler 2008-10-15 17:56:05 EDT
This bz has not been seen in 5 months, or is possibly the same as bz 460222. Will reopen if needed.
