Description of problem:
It appears that the log server lost communication during testing. Nothing appears to have died however (cmirrord, dmeventd, udev, corosync are all still running). I'll attach the logs from each of the four machines in this taft cluster.

================================================================================
Iteration 1.19 started at Thu Feb 10 18:57:50 CST 2011
================================================================================
Scenario: Kill secondary leg of non synced 3 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              nonsyncd_secondary_3legs_1 nonsyncd_secondary_3legs_2
* sync:               0
* leg devices:        /dev/sde1 /dev/sdf1 /dev/sdc1
* log devices:        /dev/sdh1
* no MDA devices:
* failpv(s):          /dev/sdf1
* failnode(s):        taft-01 taft-02 taft-03 taft-04
* leg fault policy:   remove
* log fault policy:   allocate
******************************************************

Creating mirror(s) on taft-02...
taft-02: lvcreate -m 2 -n nonsyncd_secondary_3legs_1 -L 600M helter_skelter /dev/sde1:0-1000 /dev/sdf1:0-1000 /dev/sdc1:0-1000 /dev/sdh1:0-150
taft-02: lvcreate -m 2 -n nonsyncd_secondary_3legs_2 -L 600M helter_skelter /dev/sde1:0-1000 /dev/sdf1:0-1000 /dev/sdc1:0-1000 /dev/sdh1:0-150

PV=/dev/sdf1
        nonsyncd_secondary_3legs_1_mimage_1: 6
        nonsyncd_secondary_3legs_2_mimage_1: 6
PV=/dev/sdf1
        nonsyncd_secondary_3legs_1_mimage_1: 6
        nonsyncd_secondary_3legs_2_mimage_1: 6

Continuing on without fully syncd mirrors, currently at... ( 24.92% 14.83% )

Creating gfs2 on top of mirror(s) on taft-01...
Mounting mirrored gfs2 filesystems on taft-01...
Mounting mirrored gfs2 filesystems on taft-02...
Mounting mirrored gfs2 filesystems on taft-03...
Mounting mirrored gfs2 filesystems on taft-04...

Writing verification files (checkit) to mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Sleeping 10 seconds to get some outsanding GFS I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Disabling device sdf on taft-01
Disabling device sdf on taft-02
Disabling device sdf on taft-03
Disabling device sdf on taft-04

Attempting I/O to cause mirror down conversion(s) on taft-01
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.517681 s, 81.0 MB/s
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.54711 s, 76.7 MB/s

Verifying current sanity of lvm after the failure
  /dev/sdf1: open failed: No such device or address
Verifying FAILED device /dev/sdf1 is *NOT* in the volume(s)
  /dev/sdf1: open failed: No such device or address
olog: 2
Verifying LOG device(s) /dev/sdh1 *ARE* in the mirror(s)
  /dev/sdf1: open failed: No such device or address
Verifying LEG device /dev/sde1 *IS* in the volume(s)
  /dev/sdf1: open failed: No such device or address
Verifying LEG device /dev/sdc1 *IS* in the volume(s)
  /dev/sdf1: open failed: No such device or address

verify the dm devices associated with /dev/sdf1 have been removed as expected
Checking REMOVAL of nonsyncd_secondary_3legs_1_mimage_1 on: taft-01 taft-02 taft-03 taft-04
Checking REMOVAL of nonsyncd_secondary_3legs_2_mimage_1 on: taft-01 taft-02 taft-03 taft-04

Verify that the mirror image order remains the same after the down conversion
Verify that each of the mirror repairs finished successfully

Verifying files (checkit) on mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Enabling device sdf on taft-01
Enabling device sdf on taft-02
Enabling device sdf on taft-03
Enabling device sdf on taft-04

Recreating PVs /dev/sdf1
  WARNING: Volume group helter_skelter is not consistent
Extending the recreated PVs back into VG helter_skelter

Attempting to reconvert volume(s) back to their original mirrored state on taft-02...
(it's possible this may be a redundant operation in some allocate scenarios)
taft-02: lvconvert --mirrorlog disk -m 2 -b helter_skelter/nonsyncd_secondary_3legs_1 /dev/sde1:0-1000 /dev/sdf1:0-1000 /dev/sdc1:0-1000 /dev/sdh1:0-150
taft-02: lvconvert --mirrorlog disk -m 2 -b helter_skelter/nonsyncd_secondary_3legs_2 /dev/sde1:0-1000 /dev/sdf1:0-1000 /dev/sdc1:0-1000 /dev/sdh1:0-150

Verifying the up conversions by checking if all original leg devices are back in the mirror(s)
Verifying device /dev/sde1 *IS* one of the legs in the mirror(s)
Verifying device /dev/sdf1 *IS* one of the legs in the mirror(s)
Verifying device /dev/sdc1 *IS* one of the legs in the mirror(s)
Checking for leftover '-missing_0_0' or 'unknown devices'

Waiting until all mirrors become fully syncd...
   0/2 mirror(s) are fully synced: ( 47.08% 42.92% )
   0/2 mirror(s) are fully synced: ( 74.17% 68.50% )
   0/2 mirror(s) are fully synced: ( 98.75% 91.50% )
   2/2 mirror(s) are fully synced: ( 100.00% 100.00% )

[DEADLOCK]

[root@taft-01 tmp]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M     40   2011-02-10 15:40:21  taft-01
   2   M     44   2011-02-10 15:40:21  taft-02
   3   M     48   2011-02-10 15:40:21  taft-03
   4   M     52   2011-02-10 15:40:21  taft-04

[root@taft-01 tmp]# cman_tool services
fence domain
member count  4
victim count  0
victim now    0
master nodeid 2
wait state    none
members       1 2 3 4

dlm lockspaces
name          gfs2
id            0xe4f48a18
flags         0x00000008 fs_reg
change        member 4 joined 1 remove 0 failed 0 seq 4,4
members       1 2 3 4

name          gfs1
id            0xcfd9d9db
flags         0x00000008 fs_reg
change        member 4 joined 1 remove 0 failed 0 seq 4,4
members       1 2 3 4

name          clvmd
id            0x4104eefa
flags         0x00000000
change        member 4 joined 1 remove 0 failed 0 seq 1,1
members       1 2 3 4

gfs mountgroups
name          gfs2
id            0x7474a276
flags         0x00000048 mounted
change        member 4 joined 1 remove 0 failed 0 seq 4,4
members       1 2 3 4

name          gfs1
id            0x5f59f1b5
flags         0x00000048 mounted
change        member 4 joined 1 remove 0 failed 0 seq 4,4
members       1 2 3 4

TAFT-01:
Feb 10 19:06:08 taft-01 kernel: INFO: task glock_workqueue:3643 blocked for more than 120 seconds.
Feb 10 19:06:08 taft-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 10 19:06:08 taft-01 kernel: glock_workque D 0000000000000002     0  3643      2 0x00000080
Feb 10 19:06:08 taft-01 kernel: ffff8801f48a3b50 0000000000000046 ffff8801f48a3b10 ffffffffa000409c
Feb 10 19:06:08 taft-01 kernel: ffff880210f87a80 ffff8801e98f6808 ffff8801f48a3af0 ffffffffa0003389
Feb 10 19:06:08 taft-01 kernel: ffff880213317a98 ffff8801f48a3fd8 0000000000010558 ffff880213317a98
Feb 10 19:06:08 taft-01 kernel: Call Trace:
Feb 10 19:06:08 taft-01 kernel: [<ffffffffa000409c>] ? dm_table_unplug_all+0x5c/0xd0 [dm_mod]
Feb 10 19:06:08 taft-01 kernel: [<ffffffffa0003389>] ? dm_request+0x179/0x1a0 [dm_mod]
Feb 10 19:06:08 taft-01 kernel: [<ffffffff81093df9>] ? ktime_get_ts+0xa9/0xe0
Feb 10 19:06:08 taft-01 kernel: [<ffffffff811974a0>] ? sync_buffer+0x0/0x50
Feb 10 19:06:08 taft-01 kernel: [<ffffffff814c4a53>] io_schedule+0x73/0xc0
Feb 10 19:06:08 taft-01 kernel: [<ffffffff811974e0>] sync_buffer+0x40/0x50
Feb 10 19:06:08 taft-01 kernel: [<ffffffff814c52cf>] __wait_on_bit+0x5f/0x90
Feb 10 19:06:08 taft-01 kernel: [<ffffffff811974a0>] ? sync_buffer+0x0/0x50
Feb 10 19:06:08 taft-01 kernel: [<ffffffff814c5378>] out_of_line_wait_on_bit+0x78/0x90
Feb 10 19:06:08 taft-01 kernel: [<ffffffff8108a130>] ? wake_bit_function+0x0/0x50
Feb 10 19:06:08 taft-01 kernel: [<ffffffff81197496>] __wait_on_buffer+0x26/0x30
Feb 10 19:06:08 taft-01 kernel: [<ffffffffa0442ddb>] gfs2_meta_read+0xbb/0x140 [gfs2]
Feb 10 19:06:08 taft-01 kernel: [<ffffffffa0442f3f>] gfs2_meta_indirect_buffer+0xdf/0x190 [gfs2]
Feb 10 19:06:08 taft-01 kernel: [<ffffffffa043d7a9>] gfs2_inode_refresh+0x29/0x330 [gfs2]
Feb 10 19:06:08 taft-01 kernel: [<ffffffffa043cbe0>] inode_go_lock+0x80/0xf0 [gfs2]
Feb 10 19:06:08 taft-01 kernel: [<ffffffffa043afb7>] do_promote+0x1c7/0x340 [gfs2]
Feb 10 19:06:08 taft-01 kernel: [<ffffffffa043b77d>] finish_xmote+0x13d/0x3a0 [gfs2]
Feb 10 19:06:08 taft-01 kernel: [<ffffffffa043c4f2>] glock_work_func+0x142/0x1d0 [gfs2]
Feb 10 19:06:08 taft-01 kernel: [<ffffffffa043c3b0>] ? glock_work_func+0x0/0x1d0 [gfs2]
Feb 10 19:06:08 taft-01 kernel: [<ffffffff810849f0>] worker_thread+0x170/0x2a0
Feb 10 19:06:08 taft-01 kernel: [<ffffffff8108a0f0>] ? autoremove_wake_function+0x0/0x40
Feb 10 19:06:08 taft-01 kernel: [<ffffffff81084880>] ? worker_thread+0x0/0x2a0
Feb 10 19:06:08 taft-01 kernel: [<ffffffff81089d86>] kthread+0x96/0xa0
Feb 10 19:06:08 taft-01 kernel: [<ffffffff8100c1ca>] child_rip+0xa/0x20
Feb 10 19:06:08 taft-01 kernel: [<ffffffff81089cf0>] ? kthread+0x0/0xa0
Feb 10 19:06:08 taft-01 kernel: [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
Feb 10 19:06:08 taft-01 kernel: INFO: task flush-253:12:10238 blocked for more than 120 seconds.

TAFT-03:
Feb 10 19:02:04 taft-03 lvm[2901]: helter_skelter-nonsyncd_secondary_3legs_1 is now in-sync.
Feb 10 19:02:04 taft-03 lvm[2901]: No longer monitoring mirror device helter_skelter-nonsyncd_secondary_3legs_2_mimagetmp_3 for events.
Feb 10 19:02:04 taft-03 lvm[2901]: No longer monitoring mirror device helter_skelter-nonsyncd_secondary_3legs_2 for events.
Feb 10 19:02:09 taft-03 qarshd[2304]: Running cmdline: dmsetup status helter_skelter-nonsyncd_secondary_3legs_1-real
Feb 10 19:02:09 taft-03 qarshd[2306]: Running cmdline: dmsetup status helter_skelter-nonsyncd_secondary_3legs_1
Feb 10 19:02:19 taft-03 kernel: device-mapper: dm-log-userspace: [3496Okgc] Request timed out: [9/2017494] - retrying
Feb 10 19:02:24 taft-03 kernel: device-mapper: dm-log-userspace: [CpoIEtjY] Request timed out: [14/2017496] - retrying
[...]
Feb 10 19:02:34 taft-03 kernel: device-mapper: dm-log-userspace: [3496Okgc] Request timed out: [9/2017497] - retrying
Feb 10 19:05:54 taft-03 kernel: device-mapper: dm-log-userspace: [CpoIEtjY] Request timed out: [14/2017537] - retrying
Feb 10 19:05:57 taft-03 kernel: INFO: task cmirrord:2246 blocked for more than 120 seconds.
Feb 10 19:05:57 taft-03 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 10 19:05:57 taft-03 kernel: cmirrord      D ffff88021fc24300     0  2246      1 0x00000080
Feb 10 19:05:57 taft-03 kernel: ffff8802153afb08 0000000000000086 0000000000000000 ffffffffa000409c
Feb 10 19:05:57 taft-03 kernel: ffff8802153afad8 00000000ed5e1a1c 0000000000000000 0000000100b65466
Feb 10 19:05:57 taft-03 kernel: ffff880211bb6638 ffff8802153affd8 0000000000010558 ffff880211bb6638
Feb 10 19:05:57 taft-03 kernel: Call Trace:
Feb 10 19:05:57 taft-03 kernel: [<ffffffffa000409c>] ? dm_table_unplug_all+0x5c/0xd0 [dm_mod]
Feb 10 19:05:57 taft-03 kernel: [<ffffffff814c4a53>] io_schedule+0x73/0xc0
Feb 10 19:05:57 taft-03 kernel: [<ffffffff8119f84e>] __blockdev_direct_IO+0x70e/0xc40
Feb 10 19:05:57 taft-03 kernel: [<ffffffff8117afd0>] ? __pollwait+0x0/0xf0
Feb 10 19:05:57 taft-03 kernel: [<ffffffff811f97a2>] ? security_inode_getsecurity+0x22/0x30
Feb 10 19:05:57 taft-03 kernel: [<ffffffff8119d497>] blkdev_direct_IO+0x57/0x60
Feb 10 19:05:57 taft-03 kernel: [<ffffffff8119c680>] ? blkdev_get_blocks+0x0/0xc0
Feb 10 19:05:57 taft-03 kernel: [<ffffffff81105572>] generic_file_direct_write+0xc2/0x190
Feb 10 19:05:57 taft-03 kernel: [<ffffffff81106d55>] __generic_file_aio_write+0x345/0x480
Feb 10 19:05:57 taft-03 kernel: [<ffffffff8109a909>] ? drop_futex_key_refs+0x49/0x80
Feb 10 19:05:57 taft-03 kernel: [<ffffffff8119cbec>] blkdev_aio_write+0x3c/0xa0
Feb 10 19:05:57 taft-03 kernel: [<ffffffff81165b4a>] do_sync_write+0xfa/0x140
Feb 10 19:05:57 taft-03 kernel: [<ffffffff8108a0f0>] ? autoremove_wake_function+0x0/0x40
Feb 10 19:05:57 taft-03 kernel: [<ffffffff8120537b>] ? selinux_file_permission+0xfb/0x150
Feb 10 19:05:57 taft-03 kernel: [<ffffffff811f87e6>] ? security_file_permission+0x16/0x20
Feb 10 19:05:57 taft-03 kernel: [<ffffffff81165e48>] vfs_write+0xb8/0x1a0
Feb 10 19:05:57 taft-03 kernel: [<ffffffff810cca12>] ? audit_syscall_entry+0x272/0x2a0
Feb 10 19:05:57 taft-03 kernel: [<ffffffff81166881>] sys_write+0x51/0x90
Feb 10 19:05:57 taft-03 kernel: [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
Feb 10 19:05:57 taft-03 kernel: INFO: task clvmd:2281 blocked for more than 120 seconds.
[...]

Version-Release number of selected component (if applicable):
2.6.32-94.el6.x86_64
lvm2-2.02.83-2.el6 BUILT: Tue Feb 8 10:10:57 CST 2011
lvm2-libs-2.02.83-2.el6 BUILT: Tue Feb 8 10:10:57 CST 2011
lvm2-cluster-2.02.83-2.el6 BUILT: Tue Feb 8 10:10:57 CST 2011
udev-147-2.31.el6 BUILT: Wed Jan 26 05:39:15 CST 2011
device-mapper-1.02.62-2.el6 BUILT: Tue Feb 8 10:10:57 CST 2011
device-mapper-libs-1.02.62-2.el6 BUILT: Tue Feb 8 10:10:57 CST 2011
device-mapper-event-1.02.62-2.el6 BUILT: Tue Feb 8 10:10:57 CST 2011
device-mapper-event-libs-1.02.62-2.el6 BUILT: Tue Feb 8 10:10:57 CST 2011
cmirror-2.02.83-2.el6 BUILT: Tue Feb 8 10:10:57 CST 2011
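For reference, the scenario above boils down to roughly the following sequence (a sketch only, using the device and VG names from the output; the gfs2/checkit steps are omitted, the SCSI offline/online trick and pvcreate flags are my assumptions about what the harness does, not its actual internals):

# 1. Create unsynced 3-leg cluster mirrors with a disk log (from the log above).
lvcreate -m 2 -n nonsyncd_secondary_3legs_1 -L 600M helter_skelter \
    /dev/sde1:0-1000 /dev/sdf1:0-1000 /dev/sdc1:0-1000 /dev/sdh1:0-150

# 2. Fail the secondary leg on every node (one way to "disable" a device).
echo offline > /sys/block/sdf/device/state

# 3. Drive I/O so dmeventd notices the failure and down-converts the mirror.
dd if=/dev/zero of=/dev/helter_skelter/nonsyncd_secondary_3legs_1 bs=4M count=10 oflag=direct

# 4. Re-enable the device, recreate the PV and extend it back into the VG,
#    then up-convert back to the original 3-leg layout.
echo running > /sys/block/sdf/device/state
pvcreate -ff -y /dev/sdf1
vgextend helter_skelter /dev/sdf1
lvconvert --mirrorlog disk -m 2 -b helter_skelter/nonsyncd_secondary_3legs_1 \
    /dev/sde1:0-1000 /dev/sdf1:0-1000 /dev/sdc1:0-1000 /dev/sdh1:0-150

# 5. Poll sync progress; the deadlock shows up around the time the mirrors report 100%.
lvs -a -o name,copy_percent helter_skelter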
Created attachment 478299 [details] log from taft-01
Created attachment 478300 [details] log from taft-02
Created attachment 478301 [details] log from taft-03
Created attachment 478302 [details] log from taft-04
I appear to have hit this same issue on a different cluster while attempting snapshots of exclusively activated mirrors.

SCENARIO - [snaphot_exclusive_mirror]
Snapshot an exclusively activated mirror
grant-01: lvcreate -m 1 -n exclusive_origin -L 100M mirror_sanity
Deactivate and then exclusively activate mirror
Taking multiple snapshots of exclusive mirror
1 2 3 4 5 6 7 8 9 10
Removing snapshots of exclusive mirror
1 2 3 4 5 6 7 8 9 10
[HANG]

[root@grant-03 ~]# lvs -a -o +devices
  Error locking on node grant-01: Command timed out
[HANG]

GRANT-01:
Feb 11 17:15:52 grant-01 qarshd[4983]: Running cmdline: lvremove -f /dev/mirror_sanity/msnap_10
Feb 11 17:15:53 grant-01 lvm[2250]: No longer monitoring snapshot mirror_sanity-msnap_10
Feb 11 17:16:15 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5660] - retrying
Feb 11 17:16:30 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5661] - retrying
Feb 11 17:16:45 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5662] - retrying
Feb 11 17:17:00 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5663] - retrying
Feb 11 17:17:15 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5664] - retrying
Feb 11 17:17:30 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5665] - retrying
Feb 11 17:17:45 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5666] - retrying
Feb 11 17:18:00 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5667] - retrying
Feb 11 17:18:15 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5668] - retrying
Feb 11 17:18:30 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5669] - retrying
Feb 11 17:18:45 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5670] - retrying
Feb 11 17:18:49 grant-01 kernel: INFO: task cmirrord:2175 blocked for more than 120 seconds.
Feb 11 17:18:49 grant-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 11 17:18:49 grant-01 kernel: cmirrord      D ffff88011fc24300     0  2175      1 0x00000000
Feb 11 17:18:49 grant-01 kernel: ffff88011a4a7b88 0000000000000086 0000000000000000 ffffffffa000409c
Feb 11 17:18:49 grant-01 kernel: ffff88011a4a7b58 000000004bdac9df 0000000000000000 000000010009683b
Feb 11 17:18:49 grant-01 kernel: ffff8801199e3028 ffff88011a4a7fd8 0000000000010558 ffff8801199e3028
Feb 11 17:18:49 grant-01 kernel: Call Trace:
Feb 11 17:18:49 grant-01 kernel: [<ffffffffa000409c>] ? dm_table_unplug_all+0x5c/0xd0 [dm_mod]
Feb 11 17:18:49 grant-01 kernel: [<ffffffff814c4a53>] io_schedule+0x73/0xc0
Feb 11 17:18:49 grant-01 kernel: [<ffffffff8119f84e>] __blockdev_direct_IO+0x70e/0xc40
Feb 11 17:18:49 grant-01 kernel: [<ffffffff8119d497>] blkdev_direct_IO+0x57/0x60
Feb 11 17:18:49 grant-01 kernel: [<ffffffff8119c680>] ? blkdev_get_blocks+0x0/0xc0
Feb 11 17:18:49 grant-01 kernel: [<ffffffff811060c6>] generic_file_aio_read+0x696/0x6f0
Feb 11 17:18:49 grant-01 kernel: [<ffffffff8117fad0>] ? iput+0x30/0x70
Feb 11 17:18:49 grant-01 kernel: [<ffffffff8109a909>] ? drop_futex_key_refs+0x49/0x80
Feb 11 17:18:49 grant-01 kernel: [<ffffffff81165c8a>] do_sync_read+0xfa/0x140
Feb 11 17:18:49 grant-01 kernel: [<ffffffff8108a0f0>] ? autoremove_wake_function+0x0/0x40
Feb 11 17:18:49 grant-01 kernel: [<ffffffff8120537b>] ? selinux_file_permission+0xfb/0x150
Feb 11 17:18:49 grant-01 kernel: [<ffffffff811f87e6>] ? security_file_permission+0x16/0x20
Feb 11 17:18:49 grant-01 kernel: [<ffffffff811666b5>] vfs_read+0xb5/0x1a0
Feb 11 17:18:49 grant-01 kernel: [<ffffffff811667f1>] sys_read+0x51/0x90
Feb 11 17:18:49 grant-01 kernel: [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
Feb 11 17:19:00 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5671] - retrying
Feb 11 17:19:15 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5672] - retrying
Feb 11 17:19:30 grant-01 kernel: device-mapper: dm-log-userspace: [hnlU73yL] Request timed out: [5/5673] - retrying

2.6.32-94.el6.x86_64
lvm2-2.02.83-2.el6 BUILT: Tue Feb 8 10:10:57 CST 2011
lvm2-libs-2.02.83-2.el6 BUILT: Tue Feb 8 10:10:57 CST 2011
lvm2-cluster-2.02.83-2.el6 BUILT: Tue Feb 8 10:10:57 CST 2011
udev-147-2.31.el6 BUILT: Wed Jan 26 05:39:15 CST 2011
device-mapper-1.02.62-2.el6 BUILT: Tue Feb 8 10:10:57 CST 2011
device-mapper-libs-1.02.62-2.el6 BUILT: Tue Feb 8 10:10:57 CST 2011
device-mapper-event-1.02.62-2.el6 BUILT: Tue Feb 8 10:10:57 CST 2011
device-mapper-event-libs-1.02.62-2.el6 BUILT: Tue Feb 8 10:10:57 CST 2011
cmirror-2.02.83-2.el6 BUILT: Tue Feb 8 10:10:57 CST 2011
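That scenario amounts to roughly the following (a sketch assuming the clustered VG mirror_sanity from the output above; the snapshot size and loop count are illustrative, not taken from the harness):

# Create a small cluster mirror, then activate it exclusively on one node.
lvcreate -m 1 -n exclusive_origin -L 100M mirror_sanity
lvchange -an  mirror_sanity/exclusive_origin   # deactivate cluster-wide
lvchange -aey mirror_sanity/exclusive_origin   # exclusive activation on this node

# Take and then remove a series of snapshots of the exclusively active mirror.
for i in $(seq 1 10); do
    lvcreate -s -n msnap_$i -L 20M mirror_sanity/exclusive_origin
done
for i in $(seq 1 10); do
    lvremove -f mirror_sanity/msnap_$i
done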
Reproduced again with helter_skelter, bumping pri/sev and adding the 'Regression' keyword.

[...]
Enabling device sdg on taft-01
Enabling device sdg on taft-02
Enabling device sdg on taft-03
Enabling device sdg on taft-04

Recreating PVs /dev/sdg1
  WARNING: Volume group helter_skelter is not consistent
Extending the recreated PVs back into VG helter_skelter

Attempting to reconvert volume(s) back to their original mirrored state on taft-04...
(it's possible this may be a redundant operation in some allocate scenarios)
taft-04: lvconvert --mirrorlog disk -m 3 -b helter_skelter/nonsyncd_secondary_4legs_1 /dev/sdf1:0-1000 /dev/sdg1:0-1000 /dev/sde1:0-1000 /dev/sdb1:0-1000 /dev/sdh1:0-150
taft-04: lvconvert --mirrorlog disk -m 3 -b helter_skelter/nonsyncd_secondary_4legs_2 /dev/sdf1:0-1000 /dev/sdg1:0-1000 /dev/sde1:0-1000 /dev/sdb1:0-1000 /dev/sdh1:0-150

Verifying the up conversions by checking if all original leg devices are back in the mirror(s)
Verifying device /dev/sdf1 *IS* one of the legs in the mirror(s)
Verifying device /dev/sdg1 *IS* one of the legs in the mirror(s)
Verifying device /dev/sde1 *IS* one of the legs in the mirror(s)
Verifying device /dev/sdb1 *IS* one of the legs in the mirror(s)
Checking for leftover '-missing_0_0' or 'unknown devices'

Waiting until all mirrors become fully syncd...
   0/2 mirror(s) are fully synced: ( 51.92% 44.25% )
   0/2 mirror(s) are fully synced: ( 79.58% 70.17% )
   1/2 mirror(s) are fully synced: ( 100.00% 98.33% )
   2/2 mirror(s) are fully synced: ( 100.00% 100.00% )

[HANG]
FWIW - comment #5 ended up being bug 677739.
The log attached to comment #3 seems to indicate the log daemon died there. Can I have access to that core?
Appear to have hit this again; however, there's still not much to go on.

Jul 26 18:47:48 taft-01 qarshd[7962]: Running cmdline: lvconvert --mirrorlog disk -m 2 -b helter_skelter/syncd_secondary_3legs_12 /dev/sde1:0-1000 /dev/sdb1:0-1000 /dev/sdg1:0-1000 /dev/sdd1:0-150
Jul 26 18:47:49 taft-01 lvm[13149]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_2_mimagetmp_3 for events.
Jul 26 18:47:49 taft-01 lvm[13149]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_2 for events.
Jul 26 18:47:50 taft-01 lvm[13149]: Monitoring mirror device helter_skelter-syncd_secondary_3legs_2 for events.
Jul 26 18:47:50 taft-01 lvm[13149]: helter_skelter-syncd_secondary_3legs_2 is now in-sync.
Jul 26 18:47:50 taft-01 lvm[13149]: helter_skelter-syncd_secondary_3legs_2 is now in-sync.
Jul 26 18:47:51 taft-01 lvm[13149]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_2 for events.
Jul 26 18:47:54 taft-01 lvm[13149]: Monitoring mirror device helter_skelter-syncd_secondary_3legs_2 for events.
Jul 26 18:47:54 taft-01 lvm[13149]: helter_skelter-syncd_secondary_3legs_2 is now in-sync.
Jul 26 18:47:55 taft-01 lvm[13149]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_3_mimagetmp_3 for events.
Jul 26 18:47:55 taft-01 lvm[13149]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_3 for events.
Jul 26 18:48:12 taft-01 kernel: device-mapper: dm-log-userspace: [ECqp7CPp] Request timed out: [9/12144258] - retrying
Jul 26 18:48:12 taft-01 kernel: device-mapper: dm-log-userspace: [ECqp7CPp] Request timed out: [3/12144263] - retrying
Jul 26 18:48:12 taft-01 kernel: device-mapper: dm-log-userspace: [Qtd2Lc63] Request timed out: [12/12144260] - retrying
Jul 26 18:48:12 taft-01 kernel: device-mapper: dm-log-userspace: [VDf6EWxB] Request timed out: [10/12144261] - retrying
Jul 26 18:48:12 taft-01 kernel: device-mapper: dm-log-userspace: [MALeSaAm] Request timed out: [8/12144262] - retrying

2.6.32-170.el6.x86_64
lvm2-2.02.83-3.el6 BUILT: Fri Mar 18 09:31:10 CDT 2011
lvm2-libs-2.02.83-3.el6 BUILT: Fri Mar 18 09:31:10 CDT 2011
lvm2-cluster-2.02.83-3.el6 BUILT: Fri Mar 18 09:31:10 CDT 2011
udev-147-2.35.el6 BUILT: Wed Mar 30 07:32:05 CDT 2011
device-mapper-1.02.62-3.el6 BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-libs-1.02.62-3.el6 BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-event-1.02.62-3.el6 BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-event-libs-1.02.62-3.el6 BUILT: Fri Mar 18 09:31:10 CDT 2011
cmirror-2.02.83-3.el6 BUILT: Fri Mar 18 09:31:10 CDT 2011
This is fairly easy to repro when testing with multiple mirrors. Something does seem to be dying here, but I can't figure out what it is. I'll attempt it next with clvmd debugging.

[root@taft-01 core_files]# ps -ef | grep corosync
root      2123     1 10 14:03 ?        00:10:43 corosync -f
root     20145  2064  0 15:48 pts/0    00:00:00 grep corosync
[root@taft-01 core_files]# ps -ef | grep cmirror
root      2339     1  4 14:05 ?        00:04:34 cmirrord
root     20147  2064  0 15:49 pts/0    00:00:00 grep cmirror
[root@taft-01 core_files]# ps -ef | grep dmeventd
root      2471     1  0 14:08 ?        00:00:12 /sbin/dmeventd
root     20149  2064  0 15:49 pts/0    00:00:00 grep dmeventd
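For the record, the clvmd debug logging used for the attachments below can be enabled roughly like this (from memory of the clvmd options; check the man page on the installed build, the -C propagation in particular is my recollection):

# Tell the already-running clvmd daemons to log debug output to syslog;
# -C asks every clvmd in the cluster to change its debug level, not just the local one.
clvmd -d2 -C

# Or restart one node's clvmd in the foreground with debug going to a file.
service clvmd stop
clvmd -d1 -f > /tmp/clvmd-debug.log 2>&1 &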
Created attachment 515755 [details] clvmd -d log from taft-01
Created attachment 515756 [details] clvmd -d log from taft-02
Created attachment 515757 [details] clvmd -d log from taft-03
Created attachment 515758 [details] clvmd -d log from taft-04
It could be that 'cmirrord' is somehow not responding but still alive. I haven't reproduced it again yet.
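cmirrord has no ping interface, so "alive but wedged" has to be inferred from the outside. A few things worth grabbing next time it happens (standard tools, nothing cmirrord-specific):

# Userspace backtrace and a core of the running daemon:
gdb -p $(pidof cmirrord) -batch -ex 'thread apply all bt'
gcore -o /tmp/cmirrord $(pidof cmirrord)

# Is it still making syscalls at all? (no output = stuck in one call)
strace -f -p $(pidof cmirrord) -tt -e trace=read,write

# And the kernel side: which mirrors use the userspace log, and is dm still retrying?
dmsetup table | grep userspace
dmesg | grep dm-log-userspace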
FWIW - saw this today while doing pvmove regression testing for BZ 852438.

=== Iteration 2 of 10 started on taft-01 at Fri Sep 14 11:39:59 CDT 2012 ===
FREE PVS: /dev/sdb1 /dev/sdb2 /dev/sdd1 /dev/sdg1
INUSE PVS:
    pv_shuffle_A=/dev/sdc2 /dev/sdg2 /dev/sdh2
    pv_shuffle_B=/dev/sdc1 /dev/sde1 /dev/sdf1
NOT INUSE PVS:
    pv_shuffle_A=/dev/sdf2 /dev/sdh1
    pv_shuffle_B=/dev/sdd2 /dev/sde2

Adding /dev/sdd1 to pv_shuffle_A
Adding /dev/sdb1 to pv_shuffle_B
Moving data from /dev/sdc2 to /dev/sdd1 on taft-01
Moving data from /dev/sdf1 to /dev/sdb1 on taft-01
taft-01: pvmove -v /dev/sdc2 /dev/sdd1
taft-01: pvmove -v /dev/sdf1 /dev/sdb1
    Finding volume group "pv_shuffle_B"
    Executing: /sbin/modprobe dm-log-userspace
    Finding volume group "pv_shuffle_A"
    Executing: /sbin/modprobe dm-log-userspace
    Archiving volume group "pv_shuffle_B" metadata (seqno 9).
    Archiving volume group "pv_shuffle_A" metadata (seqno 9).
    Creating logical volume pvmove0
    Creating logical volume pvmove0
    Moving 896 extents of logical volume pv_shuffle_B/linear
    Moving 299 extents of logical volume pv_shuffle_B/stripe
    Moving 299 extents of logical volume pv_shuffle_A/stripe
    Updating volume group metadata
    Updating volume group metadata
    Creating volume group backup "/etc/lvm/backup/pv_shuffle_A" (seqno 10).
Checking progress before waiting every 15 seconds
  /dev/sdc2: Moved: 4.0%
    Creating volume group backup "/etc/lvm/backup/pv_shuffle_B" (seqno 10).
Checking progress before waiting every 15 seconds
  /dev/sdf1: Moved: 2.1%
  /dev/sdc2: Moved: 27.8%
  /dev/sdf1: Moved: 8.3%
  /dev/sdc2: Moved: 49.2%
  /dev/sdf1: Moved: 13.1%
  /dev/sdc2: Moved: 77.9%
  /dev/sdf1: Moved: 20.8%
  /dev/sdc2: Moved: 100.0%
    Removing temporary pvmove LV
    Writing out final volume group after pvmove
    Creating volume group backup "/etc/lvm/backup/pv_shuffle_A" (seqno 12).
  /dev/sdf1: Moved: 29.6%
  /dev/sdf1: Moved: 37.7%
  /dev/sdf1: Moved: 44.8%
  /dev/sdf1: Moved: 54.2%
  /dev/sdf1: Moved: 62.1%
  /dev/sdf1: Moved: 68.0%
  /dev/sdf1: Moved: 73.8%
  /dev/sdf1: Moved: 75.0%
    Updating volume group metadata
pvmove appears stuck, copy percent hasn't changed in 1 minutes
  Error locking on node taft-04: Command timed out
  Error locking on node taft-03: Command timed out
  Error locking on node taft-02: Command timed out
  Error locking on node taft-01: Command timed out

Version:
2.6.32-279.el6.x86_64
lvm2-2.02.95-10.el6_3.1 BUILT: Tue Aug 28 09:31:04 CDT 2012
lvm2-libs-2.02.95-10.el6_3.1 BUILT: Tue Aug 28 09:31:04 CDT 2012
lvm2-cluster-2.02.95-10.el6_3.1 BUILT: Tue Aug 28 09:31:04 CDT 2012
udev-147-2.41.el6 BUILT: Thu Mar 1 13:01:08 CST 2012
device-mapper-1.02.74-10.el6_3.1 BUILT: Tue Aug 28 09:31:04 CDT 2012
device-mapper-libs-1.02.74-10.el6_3.1 BUILT: Tue Aug 28 09:31:04 CDT 2012
device-mapper-event-1.02.74-10.el6_3.1 BUILT: Tue Aug 28 09:31:04 CDT 2012
device-mapper-event-libs-1.02.74-10.el6_3.1 BUILT: Tue Aug 28 09:31:04 CDT 2012
cmirror-2.02.95-10.el6_3.1 BUILT: Tue Aug 28 09:31:04 CDT 2012
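The relevant part of that test is just a pvmove between shared PVs in a clustered VG, roughly the following (a sketch using names from the output above, not the actual pv_shuffle harness; the mirror log used by pvmove is what dm-log-userspace/cmirrord serve here):

# Add a free PV to the clustered VG and migrate extents onto it.
vgextend pv_shuffle_A /dev/sdd1
pvmove -v /dev/sdc2 /dev/sdd1    # takes cluster-wide locks through clvmd

# If a pvmove is left half-finished after the hang, it can be aborted or resumed:
pvmove --abort
pvmove                           # with no arguments, restarts any unfinished pvmove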
Hit this during 6.3.z errata regression testing.
Is this the same issue as 803271 and 816724, just another /lib/udev/rules.d/69-dm-lvmetad.rules issue?
(In reply to comment #28)
> Is this the same issue as 803271 and 816724, just another
> /lib/udev/rules.d/69-dm-lvmetad.rules issue?

Corey, do you run into these problems only with use_lvmetad=1? I don't see lvmetad mentioned above. If it's only with lvmetad, please try the latest build lvm2-2.02.98-3.el6 then.
(In reply to comment #28)
> Is this the same issue as 803271 and 816724, just another
> /lib/udev/rules.d/69-dm-lvmetad.rules issue?

Hi Corey, to answer the question: the pvmove deadlock in comment #26 is most likely related to the lvmetad udev rules. This used to be a fairly frequent issue, while I was not able to reproduce it with:
a) the lvmetad rules removed, or
b) the latest 6.4 build addressing the issue.

(In reply to comment #29)
> Corey, do you run into these problems only with use_lvmetad=1? I don't see
> lvmetad mentioned above. If it's only with lvmetad, please try the latest
> build lvm2-2.02.98-3.el6 then.

Peter, I cannot speak for Corey, but I have seen this happen with use_lvmetad=0 and, as I said, the issue is gone with the latest build.
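To rule lvmetad in or out on a given run, something like the following should be enough (a sketch; the dumpconfig path and the init script name are from memory, verify against the installed build):

# Check the effective setting (lvm.conf: global/use_lvmetad).
lvm dumpconfig global/use_lvmetad

# Disable lvmetad for a run: set use_lvmetad = 0 in the global{} section of
# /etc/lvm/lvm.conf, then stop the daemon so nothing reads stale state.
service lvm2-lvmetad stop

# Or temporarily move the udev rule mentioned above out of the way:
mv /lib/udev/rules.d/69-dm-lvmetad.rules /root/
udevadm control --reload-rules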
I can no longer reproduce this deadlock with lvmetad on or off, running the latest lvm rpms.

2.6.32-338.el6.x86_64
lvm2-2.02.98-3.el6 BUILT: Mon Nov 5 06:45:48 CST 2012
lvm2-cluster-2.02.98-3.el6 BUILT: Mon Nov 5 06:45:48 CST 2012
FWIW, the tests now see a ton of the following messages:

WARNING: configuration setting use_lvmetad overriden to 0 due to locking_type 3. Clustered environment not supported by lvmetad yet.
WARNING: configuration setting use_lvmetad overriden to 0 due to locking_type 3. Clustered environment not supported by lvmetad yet.
WARNING: configuration setting use_lvmetad overriden to 0 due to locking_type 3. Clustered environment not supported by lvmetad yet.