Bug 676909

Summary: cmirror log server communication issues during mirror testing
Product: Red Hat Enterprise Linux 6 Reporter: Corey Marthaler <cmarthal>
Component: lvm2Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED WORKSFORME QA Contact: Corey Marthaler <cmarthal>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 6.1CC: agk, coughlan, djansa, dwysocha, heinzm, jbrassow, mcsontos, prajnoha, prockai, slevine, thornber, zkabelac
Target Milestone: rcKeywords: Regression
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-11-14 19:25:48 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 756082, 840699    
Attachments:
Description Flags
log from taft-01
none
log from taft-02
none
log from taft-03
none
log from taft-04
none
clvmd -d log from taft-01
none
clvmd -d log from taft-02
none
clvmd -d log from taft-03
none
clvmd -d log from taft-04 none

Description Corey Marthaler 2011-02-11 19:33:57 UTC
Description of problem:
It appears that the log server lost communication during testing. Nothing appears to have died however (cmirrord, dmeventd, udev, corosync are all still running). I'll attach the logs from each of the four machines in this taft cluster. 


================================================================================
Iteration 1.19 started at Thu Feb 10 18:57:50 CST 2011
================================================================================
Scenario: Kill secondary leg of non synced 3 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              nonsyncd_secondary_3legs_1 nonsyncd_secondary_3legs_2
* sync:               0
* leg devices:        /dev/sde1 /dev/sdf1 /dev/sdc1
* log devices:        /dev/sdh1
* no MDA devices:     
* failpv(s):          /dev/sdf1
* failnode(s):        taft-01 taft-02 taft-03 taft-04
* leg fault policy:   remove
* log fault policy:   allocate
******************************************************

Creating mirror(s) on taft-02...
taft-02: lvcreate -m 2 -n nonsyncd_secondary_3legs_1 -L 600M helter_skelter /dev/sde1:0-1000 /dev/sdf1:0-1000 /dev/sdc1:0-1000 /dev/sdh1:0-150
taft-02: lvcreate -m 2 -n nonsyncd_secondary_3legs_2 -L 600M helter_skelter /dev/sde1:0-1000 /dev/sdf1:0-1000 /dev/sdc1:0-1000 /dev/sdh1:0-150

PV=/dev/sdf1
        nonsyncd_secondary_3legs_1_mimage_1: 6
        nonsyncd_secondary_3legs_2_mimage_1: 6
PV=/dev/sdf1
        nonsyncd_secondary_3legs_1_mimage_1: 6
        nonsyncd_secondary_3legs_2_mimage_1: 6

Continuing on without fully syncd mirrors, currently at...
        ( 24.92% 14.83% )

Creating gfs2 on top of mirror(s) on taft-01...
Mounting mirrored gfs2 filesystems on taft-01...
Mounting mirrored gfs2 filesystems on taft-02...
Mounting mirrored gfs2 filesystems on taft-03...
Mounting mirrored gfs2 filesystems on taft-04...

Writing verification files (checkit) to mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Sleeping 10 seconds to get some outsanding GFS I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Disabling device sdf on taft-01
Disabling device sdf on taft-02
Disabling device sdf on taft-03
Disabling device sdf on taft-04

Attempting I/O to cause mirror down conversion(s) on taft-01
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.517681 s, 81.0 MB/s
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.54711 s, 76.7 MB/s
Verifying current sanity of lvm after the failure
  /dev/sdf1: open failed: No such device or address
Verifying FAILED device /dev/sdf1 is *NOT* in the volume(s)
  /dev/sdf1: open failed: No such device or address
olog: 2
Verifying LOG device(s) /dev/sdh1 *ARE* in the mirror(s)
  /dev/sdf1: open failed: No such device or address
Verifying LEG device /dev/sde1 *IS* in the volume(s)
  /dev/sdf1: open failed: No such device or address
Verifying LEG device /dev/sdc1 *IS* in the volume(s)
  /dev/sdf1: open failed: No such device or address
verify the dm devices associated with /dev/sdf1 have been removed as expected
Checking REMOVAL of nonsyncd_secondary_3legs_1_mimage_1 on:  taft-01 taft-02 taft-03 taft-04
Checking REMOVAL of nonsyncd_secondary_3legs_2_mimage_1 on:  taft-01 taft-02 taft-03 taft-04

Verify that the mirror image order remains the same after the down conversion
Verify that each of the mirror repairs finished successfully

Verifying files (checkit) on mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Enabling device sdf on taft-01
Enabling device sdf on taft-02
Enabling device sdf on taft-03
Enabling device sdf on taft-04

Recreating PVs /dev/sdf1
  WARNING: Volume group helter_skelter is not consistent
Extending the recreated PVs back into VG helter_skelter

Attempting to reconvert volume(s) back to their original mirrored state on taft-02...
(it's possible this may be a redundant operation in some allocate scenarios)
taft-02: lvconvert --mirrorlog disk -m 2 -b helter_skelter/nonsyncd_secondary_3legs_1 /dev/sde1:0-1000 /dev/sdf1:0-1000 /dev/sdc1:0-1000 /dev/sdh1:0-150
taft-02: lvconvert --mirrorlog disk -m 2 -b helter_skelter/nonsyncd_secondary_3legs_2 /dev/sde1:0-1000 /dev/sdf1:0-1000 /dev/sdc1:0-1000 /dev/sdh1:0-150
Verifying the up conversions by checking if all original leg devices are back in the mirror(s)
Verifying device /dev/sde1 *IS* one of the legs in the mirror(s)
Verifying device /dev/sdf1 *IS* one of the legs in the mirror(s)
Verifying device /dev/sdc1 *IS* one of the legs in the mirror(s)
Checking for leftover '-missing_0_0' or 'unknown devices'

Waiting until all mirrors become fully syncd...
   0/2 mirror(s) are fully synced: ( 47.08% 42.92% )
   0/2 mirror(s) are fully synced: ( 74.17% 68.50% )
   0/2 mirror(s) are fully synced: ( 98.75% 91.50% )
   2/2 mirror(s) are fully synced: ( 100.00% 100.00% )

[DEADLOCK]



[root@taft-01 tmp]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M     40   2011-02-10 15:40:21  taft-01
   2   M     44   2011-02-10 15:40:21  taft-02
   3   M     48   2011-02-10 15:40:21  taft-03
   4   M     52   2011-02-10 15:40:21  taft-04
[root@taft-01 tmp]# cman_tool services
fence domain
member count  4
victim count  0
victim now    0
master nodeid 2
wait state    none
members       1 2 3 4

dlm lockspaces
name          gfs2
id            0xe4f48a18
flags         0x00000008 fs_reg
change        member 4 joined 1 remove 0 failed 0 seq 4,4
members       1 2 3 4

name          gfs1
id            0xcfd9d9db
flags         0x00000008 fs_reg
change        member 4 joined 1 remove 0 failed 0 seq 4,4
members       1 2 3 4

name          clvmd
id            0x4104eefa
flags         0x00000000
change        member 4 joined 1 remove 0 failed 0 seq 1,1
members       1 2 3 4

gfs mountgroups
name          gfs2
id            0x7474a276
flags         0x00000048 mounted
change        member 4 joined 1 remove 0 failed 0 seq 4,4
members       1 2 3 4

name          gfs1
id            0x5f59f1b5
flags         0x00000048 mounted
change        member 4 joined 1 remove 0 failed 0 seq 4,4
members       1 2 3 4



TAFT-01:
Feb 10 19:06:08 taft-01 kernel: INFO: task glock_workqueue:3643 blocked for more than 120 seconds.
Feb 10 19:06:08 taft-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 10 19:06:08 taft-01 kernel: glock_workque D 0000000000000002     0  3643      2 0x00000080
Feb 10 19:06:08 taft-01 kernel: ffff8801f48a3b50 0000000000000046 ffff8801f48a3b10 ffffffffa000409c
Feb 10 19:06:08 taft-01 kernel: ffff880210f87a80 ffff8801e98f6808 ffff8801f48a3af0 ffffffffa0003389
Feb 10 19:06:08 taft-01 kernel: ffff880213317a98 ffff8801f48a3fd8 0000000000010558 ffff880213317a98
Feb 10 19:06:08 taft-01 kernel: Call Trace:
Feb 10 19:06:08 taft-01 kernel: [<ffffffffa000409c>] ? dm_table_unplug_all+0x5c/0xd0 [dm_mod]
Feb 10 19:06:08 taft-01 kernel: [<ffffffffa0003389>] ? dm_request+0x179/0x1a0 [dm_mod]
Feb 10 19:06:08 taft-01 kernel: [<ffffffff81093df9>] ? ktime_get_ts+0xa9/0xe0
Feb 10 19:06:08 taft-01 kernel: [<ffffffff811974a0>] ? sync_buffer+0x0/0x50
Feb 10 19:06:08 taft-01 kernel: [<ffffffff814c4a53>] io_schedule+0x73/0xc0
Feb 10 19:06:08 taft-01 kernel: [<ffffffff811974e0>] sync_buffer+0x40/0x50
Feb 10 19:06:08 taft-01 kernel: [<ffffffff814c52cf>] __wait_on_bit+0x5f/0x90
Feb 10 19:06:08 taft-01 kernel: [<ffffffff811974a0>] ? sync_buffer+0x0/0x50
Feb 10 19:06:08 taft-01 kernel: [<ffffffff814c5378>] out_of_line_wait_on_bit+0x78/0x90
Feb 10 19:06:08 taft-01 kernel: [<ffffffff8108a130>] ? wake_bit_function+0x0/0x50
Feb 10 19:06:08 taft-01 kernel: [<ffffffff81197496>] __wait_on_buffer+0x26/0x30
Feb 10 19:06:08 taft-01 kernel: [<ffffffffa0442ddb>] gfs2_meta_read+0xbb/0x140 [gfs2]
Feb 10 19:06:08 taft-01 kernel: [<ffffffffa0442f3f>] gfs2_meta_indirect_buffer+0xdf/0x190 [gfs2]
Feb 10 19:06:08 taft-01 kernel: [<ffffffffa043d7a9>] gfs2_inode_refresh+0x29/0x330 [gfs2]
Feb 10 19:06:08 taft-01 kernel: [<ffffffffa043cbe0>] inode_go_lock+0x80/0xf0 [gfs2]
Feb 10 19:06:08 taft-01 kernel: [<ffffffffa043afb7>] do_promote+0x1c7/0x340 [gfs2]
Feb 10 19:06:08 taft-01 kernel: [<ffffffffa043b77d>] finish_xmote+0x13d/0x3a0 [gfs2]
Feb 10 19:06:08 taft-01 kernel: [<ffffffffa043c4f2>] glock_work_func+0x142/0x1d0 [gfs2]
Feb 10 19:06:08 taft-01 kernel: [<ffffffffa043c3b0>] ? glock_work_func+0x0/0x1d0 [gfs2]
Feb 10 19:06:08 taft-01 kernel: [<ffffffff810849f0>] worker_thread+0x170/0x2a0
Feb 10 19:06:08 taft-01 kernel: [<ffffffff8108a0f0>] ? autoremove_wake_function+0x0/0x40
Feb 10 19:06:08 taft-01 kernel: [<ffffffff81084880>] ? worker_thread+0x0/0x2a0
Feb 10 19:06:08 taft-01 kernel: [<ffffffff81089d86>] kthread+0x96/0xa0
Feb 10 19:06:08 taft-01 kernel: [<ffffffff8100c1ca>] child_rip+0xa/0x20
Feb 10 19:06:08 taft-01 kernel: [<ffffffff81089cf0>] ? kthread+0x0/0xa0
Feb 10 19:06:08 taft-01 kernel: [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
Feb 10 19:06:08 taft-01 kernel: INFO: task flush-253:12:10238 blocked for more than 120 seconds.



TAFT-03:
Feb 10 19:02:04 taft-03 lvm[2901]: helter_skelter-nonsyncd_secondary_3legs_1 is now in-sync.
Feb 10 19:02:04 taft-03 lvm[2901]: No longer monitoring mirror device helter_skelter-nonsyncd_secondary_3legs_2_mimagetmp_3 for events.
Feb 10 19:02:04 taft-03 lvm[2901]: No longer monitoring mirror device helter_skelter-nonsyncd_secondary_3legs_2 for events.
Feb 10 19:02:09 taft-03 qarshd[2304]: Running cmdline: dmsetup status helter_skelter-nonsyncd_secondary_3legs_1-real
Feb 10 19:02:09 taft-03 qarshd[2306]: Running cmdline: dmsetup status helter_skelter-nonsyncd_secondary_3legs_1
Feb 10 19:02:19 taft-03 kernel: device-mapper: dm-log-userspace: [3496Okgc] Request timed out: [9/2017494] - retrying
Feb 10 19:02:24 taft-03 kernel: device-mapper: dm-log-userspace: [CpoIEtjY] Request timed out: [14/2017496] - retrying
[...]
Feb 10 19:02:34 taft-03 kernel: device-mapper: dm-log-userspace: [3496Okgc] Request timed out: [9/2017497] - retrying
Feb 10 19:05:54 taft-03 kernel: device-mapper: dm-log-userspace: [CpoIEtjY] Request timed out: [14/2017537] - retrying
Feb 10 19:05:57 taft-03 kernel: INFO: task cmirrord:2246 blocked for more than 120 seconds.
Feb 10 19:05:57 taft-03 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 10 19:05:57 taft-03 kernel: cmirrord      D ffff88021fc24300     0  2246      1 0x00000080
Feb 10 19:05:57 taft-03 kernel: ffff8802153afb08 0000000000000086 0000000000000000 ffffffffa000409c
Feb 10 19:05:57 taft-03 kernel: ffff8802153afad8 00000000ed5e1a1c 0000000000000000 0000000100b65466
Feb 10 19:05:57 taft-03 kernel: ffff880211bb6638 ffff8802153affd8 0000000000010558 ffff880211bb6638
Feb 10 19:05:57 taft-03 kernel: Call Trace:
Feb 10 19:05:57 taft-03 kernel: [<ffffffffa000409c>] ? dm_table_unplug_all+0x5c/0xd0 [dm_mod]
Feb 10 19:05:57 taft-03 kernel: [<ffffffff814c4a53>] io_schedule+0x73/0xc0
Feb 10 19:05:57 taft-03 kernel: [<ffffffff8119f84e>] __blockdev_direct_IO+0x70e/0xc40
Feb 10 19:05:57 taft-03 kernel: [<ffffffff8117afd0>] ? __pollwait+0x0/0xf0
Feb 10 19:05:57 taft-03 kernel: [<ffffffff811f97a2>] ? security_inode_getsecurity+0x22/0x30
Feb 10 19:05:57 taft-03 kernel: [<ffffffff8119d497>] blkdev_direct_IO+0x57/0x60
Feb 10 19:05:57 taft-03 kernel: [<ffffffff8119c680>] ? blkdev_get_blocks+0x0/0xc0
Feb 10 19:05:57 taft-03 kernel: [<ffffffff81105572>] generic_file_direct_write+0xc2/0x190
Feb 10 19:05:57 taft-03 kernel: [<ffffffff81106d55>] __generic_file_aio_write+0x345/0x480
Feb 10 19:05:57 taft-03 kernel: [<ffffffff8109a909>] ? drop_futex_key_refs+0x49/0x80
Feb 10 19:05:57 taft-03 kernel: [<ffffffff8119cbec>] blkdev_aio_write+0x3c/0xa0
Feb 10 19:05:57 taft-03 kernel: [<ffffffff81165b4a>] do_sync_write+0xfa/0x140
Feb 10 19:05:57 taft-03 kernel: [<ffffffff8108a0f0>] ? autoremove_wake_function+0x0/0x40
Feb 10 19:05:57 taft-03 kernel: [<ffffffff8120537b>] ? selinux_file_permission+0xfb/0x150
Feb 10 19:05:57 taft-03 kernel: [<ffffffff811f87e6>] ? security_file_permission+0x16/0x20
Feb 10 19:05:57 taft-03 kernel: [<ffffffff81165e48>] vfs_write+0xb8/0x1a0
Feb 10 19:05:57 taft-03 kernel: [<ffffffff810cca12>] ? audit_syscall_entry+0x272/0x2a0
Feb 10 19:05:57 taft-03 kernel: [<ffffffff81166881>] sys_write+0x51/0x90
Feb 10 19:05:57 taft-03 kernel: [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
Feb 10 19:05:57 taft-03 kernel: INFO: task clvmd:2281 blocked for more than 120 seconds.
[...]



Version-Release number of selected component (if applicable):
2.6.32-94.el6.x86_64

lvm2-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
lvm2-libs-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
lvm2-cluster-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
udev-147-2.31.el6    BUILT: Wed Jan 26 05:39:15 CST 2011
device-mapper-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
device-mapper-libs-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
device-mapper-event-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
device-mapper-event-libs-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
cmirror-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011

Comment 1 Corey Marthaler 2011-02-11 19:56:48 UTC
Created attachment 478299 [details]
log from taft-01

Comment 2 Corey Marthaler 2011-02-11 19:57:25 UTC
Created attachment 478300 [details]
log from taft-02

Comment 3 Corey Marthaler 2011-02-11 19:57:48 UTC
Created attachment 478301 [details]
log from taft-03

Comment 4 Corey Marthaler 2011-02-11 19:58:14 UTC
Created attachment 478302 [details]
log from taft-04

Comment 5 Corey Marthaler 2011-02-11 23:31:23 UTC
I appear to have hit this same issue on a different cluster while attempting snapshot of exclusively activated mirrors.

SCENARIO - [snaphot_exclusive_mirror]
Snapshot an exclusively activated mirror
grant-01: lvcreate -m 1 -n exclusive_origin -L 100M mirror_sanity
Deactivate and then exclusively activate mirror
Taking multiple snapshots of exclusive mirror
        1 2 3 4 5 6 7 8 9 10 
Removing snapshots of exclusive mirror
        1 2 3 4 5 6 7 8 9 10 
[HANG]

[root@grant-03 ~]# lvs -a -o +devices
  Error locking on node grant-01: Command timed out
[HANG]


TAFT-01:

Feb 11 17:15:52 grant-01 qarshd[4983]: Running cmdline: lvremove -f /dev/mirror_sanity/msnap_10
Feb 11 17:15:53 grant-01 lvm[2250]: No longer monitoring snapshot mirror_sanity-msnap_10
Feb 11 17:16:15 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5660] - retrying
Feb 11 17:16:30 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5661] - retrying
Feb 11 17:16:45 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5662] - retrying
Feb 11 17:17:00 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5663] - retrying
Feb 11 17:17:15 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5664] - retrying
Feb 11 17:17:30 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5665] - retrying
Feb 11 17:17:45 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5666] - retrying
Feb 11 17:18:00 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5667] - retrying
Feb 11 17:18:15 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5668] - retrying
Feb 11 17:18:30 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5669] - retrying
Feb 11 17:18:45 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5670] - retrying
Feb 11 17:18:49 grant-01 kernel: INFO: task cmirrord:2175 blocked for more than 120 seconds.
Feb 11 17:18:49 grant-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 11 17:18:49 grant-01 kernel: cmirrord      D ffff88011fc24300     0  2175      1 0x00000000
Feb 11 17:18:49 grant-01 kernel: ffff88011a4a7b88 0000000000000086 0000000000000000 ffffffffa000409c
Feb 11 17:18:49 grant-01 kernel: ffff88011a4a7b58 000000004bdac9df 0000000000000000 000000010009683b
Feb 11 17:18:49 grant-01 kernel: ffff8801199e3028 ffff88011a4a7fd8 0000000000010558 ffff8801199e3028
Feb 11 17:18:49 grant-01 kernel: Call Trace:
Feb 11 17:18:49 grant-01 kernel: [<ffffffffa000409c>] ? dm_table_unplug_all+0x5c/0xd0 [dm_mod]
Feb 11 17:18:49 grant-01 kernel: [<ffffffff814c4a53>] io_schedule+0x73/0xc0
Feb 11 17:18:49 grant-01 kernel: [<ffffffff8119f84e>] __blockdev_direct_IO+0x70e/0xc40
Feb 11 17:18:49 grant-01 kernel: [<ffffffff8119d497>] blkdev_direct_IO+0x57/0x60
Feb 11 17:18:49 grant-01 kernel: [<ffffffff8119c680>] ? blkdev_get_blocks+0x0/0xc0
Feb 11 17:18:49 grant-01 kernel: [<ffffffff811060c6>] generic_file_aio_read+0x696/0x6f0
Feb 11 17:18:49 grant-01 kernel: [<ffffffff8117fad0>] ? iput+0x30/0x70
Feb 11 17:18:49 grant-01 kernel: [<ffffffff8109a909>] ? drop_futex_key_refs+0x49/0x80
Feb 11 17:18:49 grant-01 kernel: [<ffffffff81165c8a>] do_sync_read+0xfa/0x140
Feb 11 17:18:49 grant-01 kernel: [<ffffffff8108a0f0>] ? autoremove_wake_function+0x0/0x40
Feb 11 17:18:49 grant-01 kernel: [<ffffffff8120537b>] ? selinux_file_permission+0xfb/0x150
Feb 11 17:18:49 grant-01 kernel: [<ffffffff811f87e6>] ? security_file_permission+0x16/0x20
Feb 11 17:18:49 grant-01 kernel: [<ffffffff811666b5>] vfs_read+0xb5/0x1a0
Feb 11 17:18:49 grant-01 kernel: [<ffffffff811667f1>] sys_read+0x51/0x90
Feb 11 17:18:49 grant-01 kernel: [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
Feb 11 17:19:00 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5671] - retrying
Feb 11 17:19:15 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5672] - retrying
Feb 11 17:19:30 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5673] - retrying



2.6.32-94.el6.x86_64

lvm2-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
lvm2-libs-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
lvm2-cluster-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
udev-147-2.31.el6    BUILT: Wed Jan 26 05:39:15 CST 2011
device-mapper-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
device-mapper-libs-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
device-mapper-event-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
device-mapper-event-libs-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
cmirror-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011

Comment 6 Corey Marthaler 2011-02-14 19:42:21 UTC
Reproduced again with helter_skelter, bumping pri/sev and adding the 'Regression' keyword.

[...]
Enabling device sdg on taft-01
Enabling device sdg on taft-02
Enabling device sdg on taft-03
Enabling device sdg on taft-04

Recreating PVs /dev/sdg1
  WARNING: Volume group helter_skelter is not consistent
Extending the recreated PVs back into VG helter_skelter

Attempting to reconvert volume(s) back to their original mirrored state on taft-04...
(it's possible this may be a redundant operation in some allocate scenarios)
taft-04: lvconvert --mirrorlog disk -m 3 -b helter_skelter/nonsyncd_secondary_4legs_1 /dev/sdf1:0-1000 /dev/sdg1:0-1000 /dev/sde1:0-1000 /dev/sdb1:0-1000 /dev/sdh1:0-150
taft-04: lvconvert --mirrorlog disk -m 3 -b helter_skelter/nonsyncd_secondary_4legs_2 /dev/sdf1:0-1000 /dev/sdg1:0-1000 /dev/sde1:0-1000 /dev/sdb1:0-1000 /dev/sdh1:0-150
Verifying the up conversions by checking if all original leg devices are back in the mirror(s)
Verifying device /dev/sdf1 *IS* one of the legs in the mirror(s)
Verifying device /dev/sdg1 *IS* one of the legs in the mirror(s)
Verifying device /dev/sde1 *IS* one of the legs in the mirror(s)
Verifying device /dev/sdb1 *IS* one of the legs in the mirror(s)
Checking for leftover '-missing_0_0' or 'unknown devices'

Waiting until all mirrors become fully syncd...
   0/2 mirror(s) are fully synced: ( 51.92% 44.25% )
   0/2 mirror(s) are fully synced: ( 79.58% 70.17% )
   1/2 mirror(s) are fully synced: ( 100.00% 98.33% )
   2/2 mirror(s) are fully synced: ( 100.00% 100.00% )

[HANG]

Comment 7 Corey Marthaler 2011-03-01 20:30:52 UTC
FWIW - comment #5 ended up being bug 677739.

Comment 8 Jonathan Earl Brassow 2011-03-02 16:48:42 UTC
The log attached to comment #3 seems to indicate the log daemon died there.  Can I have access to that core?

Comment 12 Corey Marthaler 2011-07-27 16:04:17 UTC
Appeared to have hit this again, however still not much to go on.

Jul 26 18:47:48 taft-01 qarshd[7962]: Running cmdline: lvconvert --mirrorlog disk -m 2 -b helter_skelter/syncd_secondary_3legs_12 /dev/sde1:0-1000 /dev/sdb1:0-1000 /dev/sdg1:0-1000 /dev/sdd1:0-150
Jul 26 18:47:49 taft-01 lvm[13149]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_2_mimagetmp_3 for events.
Jul 26 18:47:49 taft-01 lvm[13149]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_2 for events.
Jul 26 18:47:50 taft-01 lvm[13149]: Monitoring mirror device helter_skelter-syncd_secondary_3legs_2 for events.
Jul 26 18:47:50 taft-01 lvm[13149]: helter_skelter-syncd_secondary_3legs_2 is now in-sync.
Jul 26 18:47:50 taft-01 lvm[13149]: helter_skelter-syncd_secondary_3legs_2 is now in-sync.
Jul 26 18:47:51 taft-01 lvm[13149]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_2 for events.
Jul 26 18:47:54 taft-01 lvm[13149]: Monitoring mirror device helter_skelter-syncd_secondary_3legs_2 for events.
Jul 26 18:47:54 taft-01 lvm[13149]: helter_skelter-syncd_secondary_3legs_2 is now in-sync.
Jul 26 18:47:55 taft-01 lvm[13149]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_3_mimagetmp_3 for events.
Jul 26 18:47:55 taft-01 lvm[13149]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_3 for events.
Jul 26 18:48:12 taft-01 kernel: device-mapper: dm-log-userspace: [ECqp7CPp] Request timed out: [9/12144258] - retrying
Jul 26 18:48:12 taft-01 kernel: device-mapper: dm-log-userspace: [ECqp7CPp] Request timed out: [3/12144263] - retrying
Jul 26 18:48:12 taft-01 kernel: device-mapper: dm-log-userspace: [Qtd2Lc63] Request timed out: [12/12144260] - retrying
Jul 26 18:48:12 taft-01 kernel: device-mapper: dm-log-userspace: [VDf6EWxB] Request timed out: [10/12144261] - retrying
Jul 26 18:48:12 taft-01 kernel: device-mapper: dm-log-userspace: [MALeSaAm] Request timed out: [8/12144262] - retrying



2.6.32-170.el6.x86_64

lvm2-2.02.83-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
lvm2-libs-2.02.83-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
lvm2-cluster-2.02.83-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
udev-147-2.35.el6    BUILT: Wed Mar 30 07:32:05 CDT 2011
device-mapper-1.02.62-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-libs-1.02.62-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-event-1.02.62-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-event-libs-1.02.62-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
cmirror-2.02.83-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011

Comment 13 Corey Marthaler 2011-07-27 20:53:48 UTC
This is fairly east to repo when testing with multiple mirrors. Something does seem to be dying here, but I can't figure out what it is. I'll attempt next with clvmd debugging

[root@taft-01 core_files]# ps -ef | grep corosync
root      2123     1 10 14:03 ?        00:10:43 corosync -f
root     20145  2064  0 15:48 pts/0    00:00:00 grep corosync

[root@taft-01 core_files]# ps -ef | grep cmirror
root      2339     1  4 14:05 ?        00:04:34 cmirrord
root     20147  2064  0 15:49 pts/0    00:00:00 grep cmirror

[root@taft-01 core_files]# ps -ef | grep dmeventd
root      2471     1  0 14:08 ?        00:00:12 /sbin/dmeventd
root     20149  2064  0 15:49 pts/0    00:00:00 grep dmeventd

Comment 14 Corey Marthaler 2011-07-28 15:52:43 UTC
Created attachment 515755 [details]
clvmd -d log from taft-01

Comment 15 Corey Marthaler 2011-07-28 15:53:29 UTC
Created attachment 515756 [details]
clvmd -d log from taft-02

Comment 16 Corey Marthaler 2011-07-28 15:54:06 UTC
Created attachment 515757 [details]
clvmd -d log from taft-03

Comment 17 Corey Marthaler 2011-07-28 15:55:18 UTC
Created attachment 515758 [details]
clvmd -d log from taft-04

Comment 18 Jonathan Earl Brassow 2011-09-22 18:57:08 UTC
It could be that 'cmirrord' is somehow not responding but alive.  I haven't repo'd yet.

Comment 26 Corey Marthaler 2012-09-14 18:26:40 UTC
FWIW - saw this today while doing pvmove regression testing for BZ 852438.

=== Iteration 2 of 10 started on taft-01 at Fri Sep 14 11:39:59 CDT 2012 ===
FREE PVS: /dev/sdb1 /dev/sdb2 /dev/sdd1 /dev/sdg1
INUSE PVS:
      pv_shuffle_A=/dev/sdc2 /dev/sdg2 /dev/sdh2
      pv_shuffle_B=/dev/sdc1 /dev/sde1 /dev/sdf1
NOT INUSE PVS:
      pv_shuffle_A=/dev/sdf2 /dev/sdh1
      pv_shuffle_B=/dev/sdd2 /dev/sde2

Adding /dev/sdd1 to pv_shuffle_A
Adding /dev/sdb1 to pv_shuffle_B

Moving data from /dev/sdc2 to /dev/sdd1 on taft-01
Moving data from /dev/sdf1 to /dev/sdb1 on taft-01
taft-01: pvmove -v /dev/sdc2 /dev/sdd1
taft-01: pvmove -v /dev/sdf1 /dev/sdb1
    Finding volume group "pv_shuffle_B"
    Executing: /sbin/modprobe dm-log-userspace
    Finding volume group "pv_shuffle_A"
    Executing: /sbin/modprobe dm-log-userspace
    Archiving volume group "pv_shuffle_B" metadata (seqno 9).
    Archiving volume group "pv_shuffle_A" metadata (seqno 9).
    Creating logical volume pvmove0
    Creating logical volume pvmove0
    Moving 896 extents of logical volume pv_shuffle_B/linear
    Moving 299 extents of logical volume pv_shuffle_B/stripe
    Moving 299 extents of logical volume pv_shuffle_A/stripe
    Updating volume group metadata
    Updating volume group metadata
    Creating volume group backup "/etc/lvm/backup/pv_shuffle_A" (seqno 10).
    Checking progress before waiting every 15 seconds
  /dev/sdc2: Moved: 4.0%
    Creating volume group backup "/etc/lvm/backup/pv_shuffle_B" (seqno 10).
    Checking progress before waiting every 15 seconds
  /dev/sdf1: Moved: 2.1%
  /dev/sdc2: Moved: 27.8%
  /dev/sdf1: Moved: 8.3%
  /dev/sdc2: Moved: 49.2%
  /dev/sdf1: Moved: 13.1%
  /dev/sdc2: Moved: 77.9%
  /dev/sdf1: Moved: 20.8%
  /dev/sdc2: Moved: 100.0%
    Removing temporary pvmove LV
    Writing out final volume group after pvmove
    Creating volume group backup "/etc/lvm/backup/pv_shuffle_A" (seqno 12).
  /dev/sdf1: Moved: 29.6%
  /dev/sdf1: Moved: 37.7%
  /dev/sdf1: Moved: 44.8%
  /dev/sdf1: Moved: 54.2%
  /dev/sdf1: Moved: 62.1%
  /dev/sdf1: Moved: 68.0%
  /dev/sdf1: Moved: 73.8%
  /dev/sdf1: Moved: 75.0%
    Updating volume group metadata
pvmove appears stuck, copy percent hasn't changed in 1 minutes
  Error locking on node taft-04: Command timed out
  Error locking on node taft-03: Command timed out
  Error locking on node taft-02: Command timed out
  Error locking on node taft-01: Command timed out


Version:
2.6.32-279.el6.x86_64
lvm2-2.02.95-10.el6_3.1    BUILT: Tue Aug 28 09:31:04 CDT 2012
lvm2-libs-2.02.95-10.el6_3.1    BUILT: Tue Aug 28 09:31:04 CDT 2012
lvm2-cluster-2.02.95-10.el6_3.1    BUILT: Tue Aug 28 09:31:04 CDT 2012
udev-147-2.41.el6    BUILT: Thu Mar  1 13:01:08 CST 2012
device-mapper-1.02.74-10.el6_3.1    BUILT: Tue Aug 28 09:31:04 CDT 2012
device-mapper-libs-1.02.74-10.el6_3.1    BUILT: Tue Aug 28 09:31:04 CDT 2012
device-mapper-event-1.02.74-10.el6_3.1    BUILT: Tue Aug 28 09:31:04 CDT 2012
device-mapper-event-libs-1.02.74-10.el6_3.1    BUILT: Tue Aug 28 09:31:04 CDT 2012
cmirror-2.02.95-10.el6_3.1    BUILT: Tue Aug 28 09:31:04 CDT 2012

Comment 27 Corey Marthaler 2012-10-22 17:00:33 UTC
Hit this during 6.3.z errata regression testing.

Comment 28 Corey Marthaler 2012-10-22 19:50:30 UTC
Is this the same issue as 803271 and 816724, just another /lib/udev/rules.d/69-dm-lvmetad.rules issue?

Comment 29 Peter Rajnoha 2012-11-07 19:55:34 UTC
(In reply to comment #28)
> Is this the same issue as 803271 and 816724, just another
> /lib/udev/rules.d/69-dm-lvmetad.rules issue?

Corey, do you run into these problems only with use_lvmetad=1? I don't see lvmetad mentioned above. If it's only with lvmetad, please, try the latest build lvm2-2.02.98-3.el6 then.

Comment 30 Marian Csontos 2012-11-13 12:40:41 UTC
(In reply to comment #28)
> Is this the same issue as 803271 and 816724, just another
> /lib/udev/rules.d/69-dm-lvmetad.rules issue?

Hi Corey, to answer the question: the pvmove deadlock in comment #26 is most likely related to lvmetad udev rules - this used to be fairly frequent issue, while I were not able to reproduce with:

a) removed lvmetad rules or
b) with latest 6.4 build addressing the issue.

(In reply to comment #29)
> Corey, do you run into these problems only with use_lvmetad=1? I don't see
> lvmetad mentioned above. If it's only with lvmetad, please, try the latest
> build lvm2-2.02.98-3.el6 then.

Peter, I can not speak for Corey, but I have seen this to happen with use_lvmetad=0 and as I said the issue is gone with latest build.

Comment 31 Corey Marthaler 2012-11-14 19:25:48 UTC
I can no longer reproduce this deadlock with lvmetad on or off running the latest lvm rpms.

2.6.32-338.el6.x86_64
lvm2-2.02.98-3.el6    BUILT: Mon Nov  5 06:45:48 CST 2012
lvm2-cluster-2.02.98-3.el6    BUILT: Mon Nov  5 06:45:48 CST 2012

Comment 32 Corey Marthaler 2012-11-14 19:32:42 UTC
FWIW, the tests now see a ton of the following messages:

  WARNING: configuration setting use_lvmetad overriden to 0 due to locking_type 3. Clustered environment not supported by lvmetad yet.
  WARNING: configuration setting use_lvmetad overriden to 0 due to locking_type 3. Clustered environment not supported by lvmetad yet.
  WARNING: configuration setting use_lvmetad overriden to 0 due to locking_type 3. Clustered environment not supported by lvmetad yet.