Bug 245799 - cmirror/clvmd deadlock during simultaneous cmirror operations
Summary: cmirror/clvmd deadlock during simultaneous cmirror operations
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: cmirror-kernel
Version: 4
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Assignee: Jonathan Earl Brassow
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2007-06-26 19:11 UTC by Corey Marthaler
Modified: 2013-09-23 15:32 UTC (History)
CC List: 0 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-09-23 15:32:02 UTC
Embargoed:


Attachments
lvm backtraces (4.71 KB, text/plain), 2007-06-26 19:14 UTC, Corey Marthaler
lvm backtraces (4.44 KB, text/plain), 2007-06-26 19:15 UTC, Corey Marthaler
lvm backtraces (5.06 KB, text/plain), 2007-06-26 19:16 UTC, Corey Marthaler

Description Corey Marthaler 2007-06-26 19:11:57 UTC
Description of problem:
I was looping cmirror creates and deletes on each node of a 3-node cluster in
order to verify bz 217895, and after a few iterations clvmd hung.
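
For context, a minimal sketch of the kind of per-node create/delete loop
involved (an assumption for illustration; the VG/LV names and mirror size are
not taken from this bug):

while true ; do
    lvcreate -m1 -L 500M -n mirror1 vg
    lvremove -f vg/mirror1
done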

I'll attach the backtraces from the 3 nodes; here are the messages:


[root@link-02 tmp]# cat messages.txt
Jun 26 13:57:18 link-02 kernel: dm-cmirror: LOG INFO:
Jun 26 13:57:18 link-02 kernel: dm-cmirror:   uuid:
LVM-0TTcBDxtFhUgecwRivnHcce7rYD55adOyEpgsNT3ucbtIJEFk5pY4lfrwRsNvELD
Jun 26 13:57:18 link-02 kernel: dm-cmirror:   uuid_ref    : 1
Jun 26 13:57:18 link-02 kernel: dm-cmirror:  ?region_count: 4096
Jun 26 13:57:18 link-02 kernel: dm-cmirror:  ?sync_count  : 0
Jun 26 13:57:18 link-02 kernel: dm-cmirror:  ?sync_search : 0
Jun 26 13:57:18 link-02 kernel: dm-cmirror:   in_sync     : YES
Jun 26 13:57:18 link-02 kernel: dm-cmirror:   suspended   : NO
Jun 26 13:57:18 link-02 kernel: dm-cmirror:   server_id   : 3
Jun 26 13:57:18 link-02 kernel: dm-cmirror:   server_valid: YES
Jun 26 13:58:48 link-02 kernel: dm-cmirror: LOG INFO:
Jun 26 13:58:48 link-02 kernel: dm-cmirror:   uuid:
LVM-0TTcBDxtFhUgecwRivnHcce7rYD55adOyEpgsNT3ucbtIJEFk5pY4lfrwRsNvELD
Jun 26 13:58:48 link-02 kernel: dm-cmirror:   uuid_ref    : 1
Jun 26 13:58:48 link-02 kernel: dm-cmirror:  ?region_count: 4096
Jun 26 13:58:48 link-02 kernel: dm-cmirror:  ?sync_count  : 0
Jun 26 13:58:48 link-02 kernel: dm-cmirror:  ?sync_search : 0
Jun 26 13:58:48 link-02 kernel: dm-cmirror:   in_sync     : YES
Jun 26 13:58:48 link-02 kernel: dm-cmirror:   suspended   : NO
Jun 26 13:58:48 link-02 kernel: dm-cmirror:   server_id   : 3
Jun 26 13:58:48 link-02 kernel: dm-cmirror:   server_valid: YES


[root@link-04 tmp]# cat messages.txt
Jun 26 13:59:45 link-04 kernel: dm-cmirror: LOG INFO:
Jun 26 13:59:45 link-04 kernel: dm-cmirror:   uuid:
LVM-0TTcBDxtFhUgecwRivnHcce7rYD55adOyEpgsNT3ucbtIJEFk5pY4lfrwRsNvELD
Jun 26 13:59:45 link-04 kernel: dm-cmirror:   uuid_ref    : 1
Jun 26 13:59:45 link-04 kernel: dm-cmirror:  ?region_count: 4096
Jun 26 13:59:45 link-04 kernel: dm-cmirror:  ?sync_count  : 0
Jun 26 13:59:45 link-04 kernel: dm-cmirror:  ?sync_search : 0
Jun 26 13:59:45 link-04 kernel: dm-cmirror:   in_sync     : YES
Jun 26 13:59:45 link-04 kernel: dm-cmirror:   suspended   : NO
Jun 26 13:59:45 link-04 kernel: dm-cmirror:   server_id   : 3
Jun 26 13:59:45 link-04 kernel: dm-cmirror:   server_valid: YES



[root@link-08 tmp]# cat messages.txt
Jun 26 14:03:24 link-08 kernel: dm-cmirror: LOG INFO:
Jun 26 14:03:24 link-08 kernel: dm-cmirror:   uuid:
LVM-0TTcBDxtFhUgecwRivnHcce7rYD55adOyEpgsNT3ucbtIJEFk5pY4lfrwRsNvELD
Jun 26 14:03:24 link-08 kernel: dm-cmirror:   uuid_ref    : 1
Jun 26 14:03:24 link-08 kernel: dm-cmirror:  ?region_count: 4096
Jun 26 14:03:24 link-08 kernel: dm-cmirror:  ?sync_count  : 4096
Jun 26 14:03:24 link-08 kernel: dm-cmirror:  ?sync_search : 4096
Jun 26 14:03:24 link-08 kernel: dm-cmirror:   in_sync     : YES
Jun 26 14:03:24 link-08 kernel: dm-cmirror:   suspended   : NO
Jun 26 14:03:24 link-08 kernel: dm-cmirror:   server_id   : 3
Jun 26 14:03:24 link-08 kernel: dm-cmirror:   server_valid: YES
Jun 26 14:04:54 link-08 kernel: dm-cmirror: LOG INFO:
Jun 26 14:04:54 link-08 kernel: dm-cmirror:   uuid:
LVM-0TTcBDxtFhUgecwRivnHcce7rYD55adOyEpgsNT3ucbtIJEFk5pY4lfrwRsNvELD
Jun 26 14:04:54 link-08 kernel: dm-cmirror:   uuid_ref    : 1
Jun 26 14:04:54 link-08 kernel: dm-cmirror:  ?region_count: 4096
Jun 26 14:04:54 link-08 kernel: dm-cmirror:  ?sync_count  : 4096
Jun 26 14:04:54 link-08 kernel: dm-cmirror:  ?sync_search : 4096
Jun 26 14:04:54 link-08 kernel: dm-cmirror:   in_sync     : YES
Jun 26 14:04:54 link-08 kernel: dm-cmirror:   suspended   : NO
Jun 26 14:04:54 link-08 kernel: dm-cmirror:   server_id   : 3
Jun 26 14:04:54 link-08 kernel: dm-cmirror:   server_valid: YES



Version-Release number of selected component (if applicable):
2.6.9-55.8.ELsmp
cmirror-kernel-2.6.9-32.0
lvm2-cluster-2.02.21-7.el4

Comment 1 Corey Marthaler 2007-06-26 19:14:48 UTC
Created attachment 157941 [details]
lvm backtraces

Comment 2 Corey Marthaler 2007-06-26 19:15:47 UTC
Created attachment 157942 [details]
lvm backtraces

Comment 3 Corey Marthaler 2007-06-26 19:16:11 UTC
Created attachment 157943 [details]
lvm backtraces

Comment 4 Corey Marthaler 2007-06-26 21:13:42 UTC
This was fairly easy to reproduce... I smell a regression.

Comment 5 Corey Marthaler 2007-06-26 21:21:25 UTC
Just a note that a write to the mirror with dd during the clvmd deadlock did
succeed.
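
The dd write was along these lines (the device path and size are assumptions
for illustration, not the exact command used):

dd if=/dev/zero of=/dev/vg/mirror1 bs=4k count=100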

Comment 6 Nate Straz 2007-06-28 16:53:16 UTC
I think I hit this bug while running activator on a cluster using UP kernels.
The first time I hit it on the 6th iteration; the second time, on the 20th.

Backtrace for vgchange -an activator4 (13510):
#1  0x008f55a3 in __read_nocancel () from /lib/tls/libpthread.so.0
#2  0x080a00b0 in _lock_for_cluster (cmd=51 '3', flags=Variable "flags" is not available.)
    at locking/cluster_locking.c:115
#3  0x080a04a7 in _lock_resource (cmd=0x8e82a88, resource=Variable "resource" is not available.)
    at locking/cluster_locking.c:410
#4  0x0808a1f5 in _lock_vol (cmd=0x8e82a88, resource=0xbff6d920 "activator4", flags=6)
    at locking/locking.c:237
        #######################################
        # _lock_vol flags = LCK_VG | LCK_UNLOCK
        #######################################
#5  0x0808a414 in lock_vol (cmd=0x8e82a88, vol=0x8e99e00 "activator4", flags=6)
    at locking/locking.c:270
#6  0x08066599 in _process_one_vg (cmd=0x8e82a88, vg_name=0x8e99e00 "activator4", vgid=0x0,
    tags=0xbff6dad0, arg_vgnames=0xbff6dac8, lock_type=33, consistent=1, handle=0x0,
    ret_max=1, process_single=0x80697e8 <vgchange_single>) at toollib.c:487
#7  0x0806698c in process_each_vg (cmd=0x8e82a88, argc=1, argv=0xbff716cc, lock_type=33,
    consistent=0, handle=0x0, process_single=0x80697e8 <vgchange_single>) at toollib.c:568
#8  0x0806a8fe in vgchange (cmd=0x8e82a88, argc=-512, argv=0xfffffe00) at vgchange.c:617
#9  0x0805b148 in lvm_run_command (cmd=0x8e82a88, argc=1, argv=0xbff716cc) at lvmcmdline.c:935
#10 0x0805c147 in lvm2_main (argc=3, argv=0xbff716c4, is_static=0) at lvmcmdline.c:1423

Backtrace for lvs (12930):
#1  0x008f55a3 in __read_nocancel () from /lib/tls/libpthread.so.0
#2  0x080a00b0 in _lock_for_cluster (cmd=51 '3', flags=Variable "flags" is not available.)
    at locking/cluster_locking.c:115
#3  0x080a04a7 in _lock_resource (cmd=0x9b6da40, resource=Variable "resource" is not available.)
    at locking/cluster_locking.c:410
#4  0x0808a1f5 in _lock_vol (cmd=0x9b6da40, resource=0xbfdff1b0 "activator4", flags=33)
    at locking/locking.c:237
        #######################################
        # _lock_vol flags = LCK_VG | LCK_HOLD | LCK_READ
        #######################################
#5  0x0808a414 in lock_vol (cmd=0x9b6da40, vol=0x9b86ff8 "activator4", flags=33)
    at locking/locking.c:270
#6  0x08067404 in process_each_lv (cmd=0x9b6da40, argc=0, argv=0xbfe02fa8, lock_type=33,
    handle=0x9b873d8, process_single=0x806500e <_lvs_single>) at toollib.c:324
#7  0x080659d5 in _report (cmd=0x9b6da40, argc=0, argv=0xbfe02fa8, report_type=LVS)
    at reporter.c:329
#8  0x0805b148 in lvm_run_command (cmd=0x9b6da40, argc=0, argv=0xbfe02fa8) at lvmcmdline.c:935
#9  0x0805c147 in lvm2_main (argc=1, argv=0xbfe02fa4, is_static=0) at lvmcmdline.c:1423
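
The flag values in frame #4 of each backtrace decode as the inline annotations
say. A minimal sketch of the arithmetic, assuming the LVM2 2.02.x lock-flag
values (LCK_VG = 0x00, LCK_READ = 0x01, LCK_UNLOCK = 0x06, LCK_HOLD = 0x20),
which are not quoted in this bug:

echo $(( 0x00 | 0x06 ))          # 6  -> LCK_VG | LCK_UNLOCK (vgchange dropping the VG lock)
echo $(( 0x00 | 0x20 | 0x01 ))   # 33 -> LCK_VG | LCK_HOLD | LCK_READ (lvs taking a shared VG lock)

Both processes are blocked in read() inside _lock_for_cluster(), i.e. waiting
for a reply from clvmd, which is consistent with clvmd itself being deadlocked.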


Comment 7 Jonathan Earl Brassow 2007-08-01 19:14:53 UTC
I think I've reproduced this using:

# From each node: repeatedly create a 500M two-way mirror named after the
# local host, deactivate it, and remove it.
while true ; do
    lvcreate -m1 -L 500M -n `hostname -s` vg
    lvchange -an vg/`hostname -s`
    lvremove -f vg/`hostname -s`
done
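
A hypothetical way to kick that loop off on all three nodes at once (the node
names come from the logs above; the ssh invocation and the /tmp/mirror-loop.sh
path are assumptions, not what was actually run):

for node in link-02 link-04 link-08; do
    # start the create/deactivate/remove loop in the background on each node
    ssh $node 'nohup sh /tmp/mirror-loop.sh > /tmp/mirror-loop.log 2>&1 &'
done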

kernel: 2.6.9-55.16.ELsmp


Comment 8 Corey Marthaler 2007-08-24 13:13:53 UTC
I reproduced this bug while running cmirror_lock_stress on the latest code.

2.6.9-56.ELsmp
cmirror-kernel-2.6.9-33.2
lvm2-cluster-2.02.27-1.el4

Comment 9 Corey Marthaler 2007-09-06 13:39:06 UTC
Just a note that with the latest code I wasn't able to reproduce this deadlock
while running cmirror lock stress tests all night. I'll continue testing, however.

2.6.9-56.ELsmp
lvm2-cluster-2.02.27-2.el4
lvm2-2.02.27-2.el4
cmirror-kernel-2.6.9-34.1

Comment 10 Jonathan Earl Brassow 2007-09-28 15:34:59 UTC
assigned -> modified.

Comment 11 Corey Marthaler 2007-11-08 17:10:46 UTC
Marking this verified as it hasn't been seen with any of the latest
cmirror-kernel versions.

Comment 13 Lon Hohberger 2013-09-23 15:32:02 UTC
The Red Hat Cluster Suite product is past end-of-life; closing.

