Description of problem:
This may be related to bug 676909.

Scenario: Kill primary leg of synced core log 2 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_primary_core_2legs_1
* sync:               1
* leg devices:        /dev/sdc1 /dev/sdb1
* log devices:
* no MDA devices:
* failpv(s):          /dev/sdc1
* failnode(s):        taft-01 taft-02 taft-03 taft-04
* leg fault policy:   allocate
* log fault policy:   remove
******************************************************

Creating mirror(s) on taft-02...
taft-02: lvcreate --mirrorlog core -m 1 -n syncd_primary_core_2legs_1 -L 600M helter_skelter /dev/sdc1:0-1000 /dev/sdb1:0-1000

PV=/dev/sdc1
        syncd_primary_core_2legs_1_mimage_0: 5
PV=/dev/sdc1
        syncd_primary_core_2legs_1_mimage_0: 5

Waiting until all mirrors become fully syncd...
   0/1 mirror(s) are fully synced: ( 60.33% )
   1/1 mirror(s) are fully synced: ( 100.00% )

Creating gfs2 on top of mirror(s) on taft-01...
Mounting mirrored gfs2 filesystems on taft-01...
Mounting mirrored gfs2 filesystems on taft-02...
Mounting mirrored gfs2 filesystems on taft-03...
Mounting mirrored gfs2 filesystems on taft-04...

Writing verification files (checkit) to mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Sleeping 10 seconds to get some outstanding GFS I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Disabling device sdc on taft-01
Disabling device sdc on taft-02
Disabling device sdc on taft-03
Disabling device sdc on taft-04

[DEADLOCK]

taft-04 lvm[2280]: Mirror status: 1 of 2 images failed.
taft-04 lvm[2280]: cluster request failed: Resource temporarily unavailable
taft-04 lvm[2280]: Failed to lock syncd_primary_core_2legs_1
taft-04 lvm[2280]: Repair of mirrored LV helter_skelter/syncd_primary_core_2legs_1 failed.
taft-04 lvm[2280]: Failed to remove faulty devices in helter_skelter-syncd_primary_core_2legs_1.

taft-02 lvm[2289]: Error locking on node taft-04: Volume group for uuid not found: rLZHraBz3JGhJldV368FLgLXXAabnJzSRYPAXy486mP3mqmx1zYSBHNsXf0FSycn
taft-02 lvm[2289]: Error locking on node taft-03: Volume group for uuid not found: rLZHraBz3JGhJldV368FLgLXXAabnJzSRYPAXy486mP3mqmx1zYSBHNsXf0FSycn
taft-02 lvm[2289]: Failed to lock syncd_primary_core_2legs_1
taft-02 lvm[2289]: Repair of mirrored LV helter_skelter/syncd_primary_core_2legs_1 failed.
taft-02 lvm[2289]: Failed to remove faulty devices in helter_skelter-syncd_primary_core_2legs_1.
taft-02 lvm[2289]: No longer monitoring mirror device helter_skelter-syncd_primary_core_2legs_1 for events.

Version-Release number of selected component (if applicable):
2.6.32-94.el6.x86_64
lvm2-2.02.83-3.el6                        BUILT: Fri Mar 18 09:31:10 CDT 2011
lvm2-libs-2.02.83-3.el6                   BUILT: Fri Mar 18 09:31:10 CDT 2011
lvm2-cluster-2.02.83-3.el6                BUILT: Fri Mar 18 09:31:10 CDT 2011
udev-147-2.31.el6                         BUILT: Wed Jan 26 05:39:15 CST 2011
device-mapper-1.02.62-3.el6               BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-libs-1.02.62-3.el6          BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-event-1.02.62-3.el6         BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-event-libs-1.02.62-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
cmirror-2.02.83-3.el6                     BUILT: Fri Mar 18 09:31:10 CDT 2011
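For anyone trying to reproduce this by hand, the core of the scenario is roughly the following. This is only a sketch: the lvcreate, lvs, and sysfs commands are the ones appearing in this report, but driving them manually (one node at a time) is an assumption about how the harness behaves.

  # create the 2-leg core-log cluster mirror
  lvcreate --mirrorlog core -m 1 -n syncd_primary_core_2legs_1 -L 600M \
      helter_skelter /dev/sdc1:0-1000 /dev/sdb1:0-1000

  # wait until Copy% reaches 100
  lvs -a -o +devices helter_skelter

  # then, on each node, fail the primary leg (sdc)
  echo offline > /sys/block/sdc/device/state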
Created attachment 486313 [details] log from taft-01
Created attachment 486314 [details] log from taft-02
Created attachment 486315 [details] log from taft-03
Created attachment 486316 [details] log from taft-04
This is reproducible.

Mar 18 15:46:01 taft-02 lvm[2631]: Mirror status: 1 of 2 images failed.
Mar 18 15:46:01 taft-02 lvm[2631]: cluster request failed: Resource temporarily unavailable
Mar 18 15:46:01 taft-02 lvm[2631]: Failed to lock mirror_stripe
Mar 18 15:46:01 taft-02 lvm[2631]: Repair of mirrored LV R9/mirror_stripe failed.
Mar 18 15:46:01 taft-02 lvm[2631]: Failed to remove faulty devices in R9-mirror_stripe.

Mar 18 15:45:59 taft-01 lvm[3932]: Error locking on node taft-03: Volume group for uuid not found: E9aYbyA8FZZCHIs9A46dJO3hr00AeYqI3AtJwHcwDuQMpyrq30DT7z7HJdFvVByR
Mar 18 15:45:59 taft-01 lvm[3932]: Error locking on node taft-04: Volume group for uuid not found: E9aYbyA8FZZCHIs9A46dJO3hr00AeYqI3AtJwHcwDuQMpyrq30DT7z7HJdFvVByR
Mar 18 15:45:59 taft-01 lvm[3932]: Error locking on node taft-02: Volume group for uuid not found: E9aYbyA8FZZCHIs9A46dJO3hr00AeYqI3AtJwHcwDuQMpyrq30DT7z7HJdFvVByR
Mar 18 15:45:59 taft-01 lvm[3932]: Failed to lock mirror_stripe
Mar 18 15:45:59 taft-01 lvm[3932]: Repair of mirrored LV R9/mirror_stripe failed.
Mar 18 15:45:59 taft-01 lvm[3932]: Failed to remove faulty devices in R9-mirror_stripe.
Mar 18 15:46:01 taft-01 lvm[3932]: No longer monitoring mirror device R9-mirror_stripe for events.

# sync stuck at 97%
[root@taft-04 ~]# lvs -a -o +devices
  /dev/sdd1: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdd1: read failed after 0 of 512 at 4096: Input/output error
  Couldn't find device with uuid yzcQZv-4Dp9-zhL1-Djg2-PifU-Mqvj-I8i7j3.
  LV                       VG  Attr   Log                 Copy%  Devices
  mirror_stripe            R9  mwi-ao mirror_stripe_mlog  97.46  mirror_stripe_mimage_0(0),mirror_stripe_mimage_1(0)
  [mirror_stripe_mimage_0] R9  Iwi-ao                            unknown device(0),/dev/sde1(0)
  [mirror_stripe_mimage_1] R9  Iwi-ao                            /dev/sdf1(0)
  [mirror_stripe_mlog]     R9  lwi-ao                            /dev/sdh1(0)
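A quick way to confirm where the sync is stuck (a sketch using the VG/LV names from the output above; both values should stop moving if the mirror really is wedged):

  # sync progress as LVM reports it
  lvs -a -o lv_name,copy_percent,devices R9

  # region counts as the kernel mirror target reports them
  dmsetup status R9-mirror_stripe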
Hit this again today while attempting to test device failure of 10 cmirrors.

[...]
taft-04 lvm[7207]: Mirror status: 1 of 2 images failed.
taft-04 lvm[7207]: cluster request failed: Resource temporarily unavailable
taft-04 lvm[7207]: Failed to lock syncd_secondary_2legs_11
taft-04 lvm[7207]: Repair of mirrored LV helter_skelter/syncd_secondary_2legs_11 failed.
taft-04 lvm[7207]: Failed to remove faulty devices in helter_skelter-syncd_secondary_2legs_11.
taft-04 lvm[7207]: dm_task_run failed, errno = 22, Invalid argument
taft-04 lvm[7207]: No longer monitoring mirror device helter_skelter-syncd_secondary_2legs_10 for events.

taft-04 kernel: INFO: task gfs2_quotad:2636 blocked for more than 120 seconds.
taft-04 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
taft-04 kernel: gfs2_quotad   D 0000000000000002     0  2636      2 0x00000080
taft-04 kernel:  ffff8801eec55c20 0000000000000046 ffff8801eec55b90 ffffffffa045dd7d
taft-04 kernel:  0000000000000000 ffff8801ffa22000 ffff8801eec55c50 ffffffffa045c536
taft-04 kernel:  ffff8802188ed038 ffff8801eec55fd8 000000000000f4e8 ffff8802188ed038
taft-04 kernel: Call Trace:
taft-04 kernel:  [<ffffffffa045dd7d>] ? dlm_put_lockspace+0x1d/0x40 [dlm]
taft-04 kernel:  [<ffffffffa045c536>] ? dlm_lock+0x96/0x1e0 [dlm]
taft-04 kernel:  [<ffffffffa0485c70>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
taft-04 kernel:  [<ffffffffa0485c7e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
[...]

[root@taft-04 ~]# lvs -a -o +devices
  /dev/sdg1: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdg1: read failed after 0 of 512 at 145669554176: Input/output error
  /dev/sdg1: read failed after 0 of 512 at 145669664768: Input/output error
  /dev/sdg1: read failed after 0 of 512 at 0: Input/output error
  /dev/sdg1: read failed after 0 of 512 at 4096: Input/output error
  Couldn't find device with uuid VhODFk-I677-XeHI-jduB-ACdG-Y4SV-RdYzvS.
  LV                                  Attr   Log                            Copy% Devices
  syncd_secondary_2legs_1             -wi-ao                                      /dev/sdc1(0)
  syncd_secondary_2legs_10            -wi-ao                                      /dev/sdc1(675)
  syncd_secondary_2legs_10_mimage_0   vwi-a-
  syncd_secondary_2legs_10_mimage_1   -wi---                                      unknown device(675)
  syncd_secondary_2legs_10_mlog       -wi-s-                                      /dev/sde1(9)
  syncd_secondary_2legs_11            mwi-ao syncd_secondary_2legs_11_mlog  98.67 syncd_secondary_2legs_11_mimage_0(0),syncd_secondary_2legs_11_mimage_1(0)
  [syncd_secondary_2legs_11_mimage_0] Iwi-ao                                      /dev/sdc1(750)
  [syncd_secondary_2legs_11_mimage_1] Iwi-ao                                      unknown device(750)
  [syncd_secondary_2legs_11_mlog]     lwi-ao                                      /dev/sde1(10)
  syncd_secondary_2legs_12            -wi-ao                                      /dev/sdc1(825)
  syncd_secondary_2legs_12_mimage_0   vwi-a-
  syncd_secondary_2legs_12_mimage_1   -wi---                                      unknown device(825)
  syncd_secondary_2legs_12_mlog       -wi-s-                                      /dev/sde1(11)
  syncd_secondary_2legs_2             -wi-ao                                      /dev/sdc1(75)
  syncd_secondary_2legs_2_mimage_0    vwi-a-
  syncd_secondary_2legs_2_mimage_1    -wi---                                      unknown device(75)
  syncd_secondary_2legs_2_mlog        -wi-s-                                      /dev/sde1(1)
  syncd_secondary_2legs_3             -wi-ao                                      /dev/sdc1(150)
  syncd_secondary_2legs_3_mimage_0    vwi-a-
  syncd_secondary_2legs_3_mimage_1    -wi---                                      unknown device(150)
  syncd_secondary_2legs_3_mlog        -wi-s-                                      /dev/sde1(2)
  syncd_secondary_2legs_4             -wi-ao                                      /dev/sdc1(225)
  syncd_secondary_2legs_4_mimage_0    vwi-a-
  syncd_secondary_2legs_4_mimage_1    -wi---                                      unknown device(225)
  syncd_secondary_2legs_4_mlog        -wi-s-                                      /dev/sde1(3)
  syncd_secondary_2legs_5             -wi-ao                                      /dev/sdc1(300)
  syncd_secondary_2legs_5_mimage_0    vwi-a-
  syncd_secondary_2legs_5_mimage_1    -wi---                                      unknown device(300)
  syncd_secondary_2legs_5_mlog        -wi-s-                                      /dev/sde1(4)
  syncd_secondary_2legs_6             -wi-ao                                      /dev/sdc1(375)
  syncd_secondary_2legs_6_mimage_0    vwi-a-
  syncd_secondary_2legs_6_mimage_1    -wi---                                      unknown device(375)
  syncd_secondary_2legs_6_mlog        -wi-s-                                      /dev/sde1(5)
  syncd_secondary_2legs_7             -wi-ao                                      /dev/sdc1(450)
  syncd_secondary_2legs_7_mimage_0    vwi-a-
  syncd_secondary_2legs_7_mimage_1    -wi---                                      unknown device(450)
  syncd_secondary_2legs_7_mlog        -wi-s-                                      /dev/sde1(6)
  syncd_secondary_2legs_8             -wi-ao                                      /dev/sdc1(525)
  syncd_secondary_2legs_8_mimage_0    vwi-a-
  syncd_secondary_2legs_8_mimage_1    -wi---                                      unknown device(525)
  syncd_secondary_2legs_8_mlog        -wi-s-                                      /dev/sde1(7)
  syncd_secondary_2legs_9             -wi-ao                                      /dev/sdc1(600)
  syncd_secondary_2legs_9_mimage_0    vwi-a-
  syncd_secondary_2legs_9_mimage_1    -wi---                                      unknown device(600)
  syncd_secondary_2legs_9_mlog        -wi-s-                                      /dev/sde1(8)

2.6.32-168.el6.x86_64
lvm2-2.02.83-3.el6                        BUILT: Fri Mar 18 09:31:10 CDT 2011
lvm2-libs-2.02.83-3.el6                   BUILT: Fri Mar 18 09:31:10 CDT 2011
lvm2-cluster-2.02.83-3.el6                BUILT: Fri Mar 18 09:31:10 CDT 2011
udev-147-2.35.el6                         BUILT: Wed Mar 30 07:32:05 CDT 2011
device-mapper-1.02.62-3.el6               BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-libs-1.02.62-3.el6          BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-event-1.02.62-3.el6         BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-event-libs-1.02.62-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
cmirror-2.02.83-3.el6                     BUILT: Fri Mar 18 09:31:10 CDT 2011
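To get more detail on the gfs2_quotad hang reported above, something like the following could be run on the stuck node. This is only a sketch: it assumes sysrq is enabled, and the PID is taken from the hung-task message above.

  # dump all blocked (D-state) tasks to the kernel log
  echo w > /proc/sysrq-trigger

  # kernel stack of the hung gfs2_quotad task
  cat /proc/2636/stack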
Created attachment 515316 [details] log from taft-01
Created attachment 515317 [details] log from taft-02
Created attachment 515319 [details] log from taft-03
Created attachment 515321 [details] log from taft-04
Is this bug due to timing issues? Is the problem detected and addressed when only some of the machines have had their device fail? If so, CLVM does not handle this case. It is a long-standing issue with CLVM that, while it can handle a device failing outright, it cannot handle a shared device that is inaccessible to only a subset of the machines in the cluster.
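One way to check whether the failure is being seen asymmetrically would be something like the following (a sketch; the node and device names are the ones used in this report, and it assumes ssh access between the nodes):

  for node in taft-01 taft-02 taft-03 taft-04; do
      echo "== $node =="
      ssh $node 'cat /sys/block/sdc/device/state; pvs /dev/sdc1'
  done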
The cmirror devices being failed are failed on all nodes in the cluster. Subset device failure testing was turned off years ago.
That's not quite what I'm asking... I know you are killing the device on all the nodes, but I'm wondering if they are killed at the same instant. If they are killed serially, then it is possible that the fault handling code is triggered when some, but not all, of the machines have had the device disabled. I'm trying to figure out if this is possibly what is happening here.
As close to the same instant as possible since they're run in the background, but technically they're run serially:

  foreach my $node (@cluster) {
      foreach my $device (@devices) {
          # run on $node:
          "echo offline > /sys/block/$device/device/state &"
      }
  }
The '&' helps a lot, so I'm going to assume they are "failed" at the same time - unless I can prove otherwise.
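For what it's worth, one way to push the failures out as close to simultaneously as possible would be to background the remote calls and wait for all of them, e.g. (a sketch; assumes passwordless ssh from the driving node, with the node and device names from this report):

  for node in taft-01 taft-02 taft-03 taft-04; do
      ssh $node 'echo offline > /sys/block/sdc/device/state' &
  done
  wait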
I think that this bug and bug 743112 may be duplicates, in that they are both responses to cluster locking failures, which I believe to be the result of improper dependency handling when dealing with cluster mirrors. In fact, since it was reported that bug 743112 might be reproducible in a single-machine test, this bug may fit the solution proposed for 743112 better than that bug does!

I have been using helter_skelter to test 743112. While I seem to have hit different problems, I have not hit this one or 743112. This bug will have to be revalidated once the patches are in place for 743112. It may already be fixed (with those patches).
I'm no longer able to reproduce this issue with the latest 6.3 rpms. Closing.