Description of problem:
I was doing looping cmirror creates and deletes on each node in a 3-node cluster, in order to verify bz 217895, and after a few iterations clvmd hung. I'll attach the backtraces from the 3 nodes; here are the messages:

[root@link-02 tmp]# cat messages.txt
Jun 26 13:57:18 link-02 kernel: dm-cmirror: LOG INFO:
Jun 26 13:57:18 link-02 kernel: dm-cmirror:   uuid: LVM-0TTcBDxtFhUgecwRivnHcce7rYD55adOyEpgsNT3ucbtIJEFk5pY4lfrwRsNvELD
Jun 26 13:57:18 link-02 kernel: dm-cmirror:   uuid_ref    : 1
Jun 26 13:57:18 link-02 kernel: dm-cmirror:   region_count: 4096
Jun 26 13:57:18 link-02 kernel: dm-cmirror:   sync_count  : 0
Jun 26 13:57:18 link-02 kernel: dm-cmirror:   sync_search : 0
Jun 26 13:57:18 link-02 kernel: dm-cmirror:   in_sync     : YES
Jun 26 13:57:18 link-02 kernel: dm-cmirror:   suspended   : NO
Jun 26 13:57:18 link-02 kernel: dm-cmirror:   server_id   : 3
Jun 26 13:57:18 link-02 kernel: dm-cmirror:   server_valid: YES
Jun 26 13:58:48 link-02 kernel: dm-cmirror: LOG INFO:
Jun 26 13:58:48 link-02 kernel: dm-cmirror:   uuid: LVM-0TTcBDxtFhUgecwRivnHcce7rYD55adOyEpgsNT3ucbtIJEFk5pY4lfrwRsNvELD
Jun 26 13:58:48 link-02 kernel: dm-cmirror:   uuid_ref    : 1
Jun 26 13:58:48 link-02 kernel: dm-cmirror:   region_count: 4096
Jun 26 13:58:48 link-02 kernel: dm-cmirror:   sync_count  : 0
Jun 26 13:58:48 link-02 kernel: dm-cmirror:   sync_search : 0
Jun 26 13:58:48 link-02 kernel: dm-cmirror:   in_sync     : YES
Jun 26 13:58:48 link-02 kernel: dm-cmirror:   suspended   : NO
Jun 26 13:58:48 link-02 kernel: dm-cmirror:   server_id   : 3
Jun 26 13:58:48 link-02 kernel: dm-cmirror:   server_valid: YES

[root@link-04 tmp]# cat messages.txt
Jun 26 13:59:45 link-04 kernel: dm-cmirror: LOG INFO:
Jun 26 13:59:45 link-04 kernel: dm-cmirror:   uuid: LVM-0TTcBDxtFhUgecwRivnHcce7rYD55adOyEpgsNT3ucbtIJEFk5pY4lfrwRsNvELD
Jun 26 13:59:45 link-04 kernel: dm-cmirror:   uuid_ref    : 1
Jun 26 13:59:45 link-04 kernel: dm-cmirror:   region_count: 4096
Jun 26 13:59:45 link-04 kernel: dm-cmirror:   sync_count  : 0
Jun 26 13:59:45 link-04 kernel: dm-cmirror:   sync_search : 0
Jun 26 13:59:45 link-04 kernel: dm-cmirror:   in_sync     : YES
Jun 26 13:59:45 link-04 kernel: dm-cmirror:   suspended   : NO
Jun 26 13:59:45 link-04 kernel: dm-cmirror:   server_id   : 3
Jun 26 13:59:45 link-04 kernel: dm-cmirror:   server_valid: YES

[root@link-08 tmp]# cat messages.txt
Jun 26 14:03:24 link-08 kernel: dm-cmirror: LOG INFO:
Jun 26 14:03:24 link-08 kernel: dm-cmirror:   uuid: LVM-0TTcBDxtFhUgecwRivnHcce7rYD55adOyEpgsNT3ucbtIJEFk5pY4lfrwRsNvELD
Jun 26 14:03:24 link-08 kernel: dm-cmirror:   uuid_ref    : 1
Jun 26 14:03:24 link-08 kernel: dm-cmirror:   region_count: 4096
Jun 26 14:03:24 link-08 kernel: dm-cmirror:   sync_count  : 4096
Jun 26 14:03:24 link-08 kernel: dm-cmirror:   sync_search : 4096
Jun 26 14:03:24 link-08 kernel: dm-cmirror:   in_sync     : YES
Jun 26 14:03:24 link-08 kernel: dm-cmirror:   suspended   : NO
Jun 26 14:03:24 link-08 kernel: dm-cmirror:   server_id   : 3
Jun 26 14:03:24 link-08 kernel: dm-cmirror:   server_valid: YES
Jun 26 14:04:54 link-08 kernel: dm-cmirror: LOG INFO:
Jun 26 14:04:54 link-08 kernel: dm-cmirror:   uuid: LVM-0TTcBDxtFhUgecwRivnHcce7rYD55adOyEpgsNT3ucbtIJEFk5pY4lfrwRsNvELD
Jun 26 14:04:54 link-08 kernel: dm-cmirror:   uuid_ref    : 1
Jun 26 14:04:54 link-08 kernel: dm-cmirror:   region_count: 4096
Jun 26 14:04:54 link-08 kernel: dm-cmirror:   sync_count  : 4096
Jun 26 14:04:54 link-08 kernel: dm-cmirror:   sync_search : 4096
Jun 26 14:04:54 link-08 kernel: dm-cmirror:   in_sync     : YES
Jun 26 14:04:54 link-08 kernel: dm-cmirror:   suspended   : NO
Jun 26 14:04:54 link-08 kernel: dm-cmirror:   server_id   : 3
Jun 26 14:04:54 link-08 kernel: dm-cmirror:   server_valid: YES

Version-Release number of selected component (if applicable):
2.6.9-55.8.ELsmp
cmirror-kernel-2.6.9-32.0
lvm2-cluster-2.02.21-7.el4
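In case it helps anyone compare the per-node state later, here is a minimal sketch for pulling the dm-cmirror LOG INFO blocks out of each node's syslog. The node names are the ones above; the log path and the block length are assumptions based on the output shown:

# Collect the dm-cmirror state dump from each node for a side-by-side look.
# -A 9 grabs the nine detail lines that follow each "LOG INFO:" header above.
for node in link-02 link-04 link-08; do
    echo "=== $node ==="
    ssh root@$node "grep -A 9 'dm-cmirror: LOG INFO' /var/log/messages | tail -n 20"
done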
Created attachment 157941 [details] lvm backtraces
Created attachment 157942 [details] lvm backtraces
Created attachment 157943 [details] lvm backtraces
This was fairly easy to reproduce... I smell a regression.
Just a note that a write to the mirror with dd during the clvmd deadlock did succeed.
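In case anyone wants to repeat that check, a rough sketch follows; the LV path is an assumption and should be replaced with the actual clustered mirror volume:

# Quick write to the mirror to see whether I/O still completes during the hang.
# /dev/vg/mirror1 is a placeholder for the real mirror LV.
dd if=/dev/zero of=/dev/vg/mirror1 bs=512 count=1
sync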
I think I hit this bug while running activator on a cluster using UP kernels. The first time I hit it on the 6th iteration, the second time on the 20th.

Backtrace for vgchange -an activator4 (13510):
#1  0x008f55a3 in __read_nocancel () from /lib/tls/libpthread.so.0
#2  0x080a00b0 in _lock_for_cluster (cmd=51 '3', flags=Variable "flags" is not available.) at locking/cluster_locking.c:115
#3  0x080a04a7 in _lock_resource (cmd=0x8e82a88, resource=Variable "resource" is not available.) at locking/cluster_locking.c:410
#4  0x0808a1f5 in _lock_vol (cmd=0x8e82a88, resource=0xbff6d920 "activator4", flags=6) at locking/locking.c:237
#######################################
# _lock_vol flags = LCK_VG | LCK_UNLOCK
#######################################
#5  0x0808a414 in lock_vol (cmd=0x8e82a88, vol=0x8e99e00 "activator4", flags=6) at locking/locking.c:270
#6  0x08066599 in _process_one_vg (cmd=0x8e82a88, vg_name=0x8e99e00 "activator4", vgid=0x0, tags=0xbff6dad0, arg_vgnames=0xbff6dac8, lock_type=33, consistent=1, handle=0x0, ret_max=1, process_single=0x80697e8 <vgchange_single>) at toollib.c:487
#7  0x0806698c in process_each_vg (cmd=0x8e82a88, argc=1, argv=0xbff716cc, lock_type=33, consistent=0, handle=0x0, process_single=0x80697e8 <vgchange_single>) at toollib.c:568
#8  0x0806a8fe in vgchange (cmd=0x8e82a88, argc=-512, argv=0xfffffe00) at vgchange.c:617
#9  0x0805b148 in lvm_run_command (cmd=0x8e82a88, argc=1, argv=0xbff716cc) at lvmcmdline.c:935
#10 0x0805c147 in lvm2_main (argc=3, argv=0xbff716c4, is_static=0) at lvmcmdline.c:1423

Backtrace for lvs (12930):
#1  0x008f55a3 in __read_nocancel () from /lib/tls/libpthread.so.0
#2  0x080a00b0 in _lock_for_cluster (cmd=51 '3', flags=Variable "flags" is not available.) at locking/cluster_locking.c:115
#3  0x080a04a7 in _lock_resource (cmd=0x9b6da40, resource=Variable "resource" is not available.) at locking/cluster_locking.c:410
#4  0x0808a1f5 in _lock_vol (cmd=0x9b6da40, resource=0xbfdff1b0 "activator4", flags=33) at locking/locking.c:237
#######################################
# _lock_vol flags = LCK_VG | LCK_HOLD | LCK_READ
#######################################
#5  0x0808a414 in lock_vol (cmd=0x9b6da40, vol=0x9b86ff8 "activator4", flags=33) at locking/locking.c:270
#6  0x08067404 in process_each_lv (cmd=0x9b6da40, argc=0, argv=0xbfe02fa8, lock_type=33, handle=0x9b873d8, process_single=0x806500e <_lvs_single>) at toollib.c:324
#7  0x080659d5 in _report (cmd=0x9b6da40, argc=0, argv=0xbfe02fa8, report_type=LVS) at reporter.c:329
#8  0x0805b148 in lvm_run_command (cmd=0x9b6da40, argc=0, argv=0xbfe02fa8) at lvmcmdline.c:935
#9  0x0805c147 in lvm2_main (argc=1, argv=0xbfe02fa4, is_static=0) at lvmcmdline.c:1423
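For anyone who needs to grab the same kind of backtraces from hung LVM commands, something along these lines should work; the PIDs below are just the examples from above, and the lvm2 debuginfo package is needed for useful symbol names:

# Dump a backtrace from each stuck LVM process without killing it.
for pid in 13510 12930; do
    gdb -batch -ex bt -p $pid > /tmp/lvm-bt.$pid 2>&1
done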
I think I've reproduced this using:

# from each node
while true; do
    lvcreate -m1 -L 500M -n `hostname -s` vg
    lvchange -an vg/`hostname -s`
    lvremove -f vg/`hostname -s`
done

kernel: 2.6.9-55.16.ELsmp
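Since the hang only shows up after some number of passes (6th and 20th in the earlier comment), a counter makes it easier to see which iteration got stuck. This is just my own instrumented variant of the same loop, not what was originally run:

# Same create/deactivate/remove loop, with an iteration counter per node.
i=0
while true; do
    i=$((i + 1))
    echo "`hostname -s`: iteration $i"
    lvcreate -m1 -L 500M -n `hostname -s` vg
    lvchange -an vg/`hostname -s`
    lvremove -f vg/`hostname -s`
done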
I reproduced this bug while running cmirror_lock_stress on the latest code.

2.6.9-56.ELsmp
cmirror-kernel-2.6.9-33.2
lvm2-cluster-2.02.27-1.el4
Just a note that with the latest code, I wasn't able to reproduce this deadlock while running cmirror lock stress tests all night. I'll continue testing, however.

2.6.9-56.ELsmp
lvm2-cluster-2.02.27-2.el4
lvm2-2.02.27-2.el4
cmirror-kernel-2.6.9-34.1
assigned -> modified.
Marking this verified as it hasn't been seen with any of the latest cmirror-kernel versions.
The Red Hat Cluster Suite product is past end-of-life; closing.