| Summary: | Leg and Log failure in a 2-way mirror can cause the Repair to fail | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | lvm2 | Assignee: | Jonathan Earl Brassow <jbrassow> |
| Status: | CLOSED WONTFIX | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.5 | CC: | agk, dwysocha, heinzm, jbrassow, msnitzer, prajnoha, prockai, thornber, zkabelac |
| Target Milestone: | rc | Keywords: | TestBlocker |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-07-26 02:33:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Here's the second case of what appears to be the exact same issue:
Scenario kill_secondary_and_log_synced_2_legs: Kill secondary leg and disk log of synced 2 leg mirror(s)
********* Mirror hash info for this scenario *********
* names: syncd_secondary_log_2legs_1
* sync: 1
* striped: 0
* leg devices: /dev/sdc1 /dev/sdh1
* log devices: /dev/sdd1
* no MDA devices:
* failpv(s): /dev/sdh1 /dev/sdd1
* failnode(s): virt-024.cluster-qe.lab.eng.brq.redhat.com
* lvmetad: 1
* leg fault policy: allocate
* log fault policy: allocate
******************************************************
Creating mirror(s) on virt-024.cluster-qe.lab.eng.brq.redhat.com...
virt-024.cluster-qe.lab.eng.brq.redhat.com: lvcreate -m 1 -n syncd_secondary_log_2legs_1 -L 500M helter_skelter /dev/sdc1:0-1000 /dev/sdh1:0-1000 /dev/sdd1:0-150
Current mirror/raid device structure(s):
LV Attr LSize Cpy%Sync Devices
syncd_secondary_log_2legs_1 mwi-a-m--- 500.00m 15.20 syncd_secondary_log_2legs_1_mimage_0(0),syncd_secondary_log_2legs_1_mimage_1(0)
[syncd_secondary_log_2legs_1_mimage_0] Iwi-aom--- 500.00m /dev/sdc1(0)
[syncd_secondary_log_2legs_1_mimage_1] Iwi-aom--- 500.00m /dev/sdh1(0)
[syncd_secondary_log_2legs_1_mlog] lwi-aom--- 4.00m /dev/sdd1(0)
Waiting until all mirror|raid volumes become fully syncd...
1/1 mirror(s) are fully synced: ( 100.00% )
Creating ext on top of mirror(s) on virt-024.cluster-qe.lab.eng.brq.redhat.com...
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on virt-024.cluster-qe.lab.eng.brq.redhat.com...
PV=/dev/sdh1
syncd_secondary_log_2legs_1_mimage_1: 5.1
PV=/dev/sdd1
syncd_secondary_log_2legs_1_mlog: 1.2
PV=/dev/sdh1
syncd_secondary_log_2legs_1_mimage_1: 5.1
PV=/dev/sdd1
syncd_secondary_log_2legs_1_mlog: 1.2
Writing verification files (checkit) to mirror(s) on...
---- virt-024.cluster-qe.lab.eng.brq.redhat.com ----
Sleeping 15 seconds to get some outsanding EXT I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
---- virt-024.cluster-qe.lab.eng.brq.redhat.com ----
Disabling device sdh on virt-024.cluster-qe.lab.eng.brq.redhat.com (lvmetad is on)
Disabling device sdd on virt-024.cluster-qe.lab.eng.brq.redhat.com (lvmetad is on)
Getting recovery check start time from /var/log/messages: Oct 5 04:03
Attempting I/O to cause mirror down conversion(s) on virt-024.cluster-qe.lab.eng.brq.redhat.com
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.304512 s, 138 MB/s
Verifying current sanity of lvm after the failure
Current mirror/raid device structure(s):
PV BKrdgC-KTeE-bfdt-HX5l-6V6t-H4Ey-Mo6Ny3 not recognised. Is the device missing?
[...]
PV iUqSL5-5i7L-tJFh-fI2G-olZt-7Uds-IYOKa4 not recognised. Is the device missing?
LV Attr LSize Cpy%Sync Devices
syncd_secondary_log_2legs_1 -wi-ao---- 500.00m /dev/sdc1(0)
syncd_secondary_log_2legs_1_mlog -wi------- 4.00m /dev/sdb1(0)
Verifying FAILED device /dev/sdh1 is *NOT* in the volume(s)
Verifying FAILED device /dev/sdd1 is *NOT* in the volume(s)
Verifying LOG device(s) /dev/sdd1 are *NOT* in the mirror/linear(s)
Verifying LEG device /dev/sdc1 *IS* in the volume(s)
verify the newly allocated dm devices were added as a result of the failures
Checking EXISTENCE of syncd_secondary_log_2legs_1_mimage_0 on: virt-024.cluster-qe.lab.eng.brq.redhat.com
syncd_secondary_log_2legs_1_mimage_0 on virt-024.cluster-qe.lab.eng.brq.redhat.com should now exist
** ARE THERE ENOUGH SPARE PVS TO ALLOCATE? **
FI_engine: recover() method failed
[root@virt-024 ~]# lvs -a -o +devices
PV BKrdgC-KTeE-bfdt-HX5l-6V6t-H4Ey-Mo6Ny3 not recognised. Is the device missing?
PV iUqSL5-5i7L-tJFh-fI2G-olZt-7Uds-IYOKa4 not recognised. Is the device missing?
PV BKrdgC-KTeE-bfdt-HX5l-6V6t-H4Ey-Mo6Ny3 not recognised. Is the device missing?
PV iUqSL5-5i7L-tJFh-fI2G-olZt-7Uds-IYOKa4 not recognised. Is the device missing?
LV VG Attr LSize Devices
syncd_secondary_log_2legs_1 helter_skelter -wi-ao---- 500.00m /dev/sdc1(0)
syncd_secondary_log_2legs_1_mlog helter_skelter -wi------- 4.00m /dev/sdb1(0)
LOG:
qarshd[18480]: Running cmdline: echo 1 > /sys/block/sdh/device/delete &
lvm[3710]: Secondary mirror device 253:4 has failed (D).
lvm[3710]: Device failure in helter_skelter-syncd_secondary_log_2legs_1.
qarshd[18488]: Running cmdline: echo 1 > /sys/block/sdd/device/delete &
lvm[3710]: PV iUqSL5-5i7L-tJFh-fI2G-olZt-7Uds-IYOKa4 not recognised. Is the device missing?
lvm[3710]: Mirror status: 1 of 2 images failed.
lvm[3710]: Trying to up-convert to 2 images, 1 logs.
lvm[3710]: /dev/sdd1: read failed after 0 of 1024 at 4096: Input/output error
lvm[3710]: Trying to up-convert to 2 images, 0 logs.
lvm[3710]: intermediate VG write failed.
lvm[3710]: WARNING: Failed to replace 1 of 1 logs in volume syncd_secondary_log_2legs_1
lvm[3710]: PV BKrdgC-KTeE-bfdt-HX5l-6V6t-H4Ey-Mo6Ny3 not recognised. Is the device missing?
lvm[3710]: PV iUqSL5-5i7L-tJFh-fI2G-olZt-7Uds-IYOKa4 not recognised. Is the device missing?
lvm[3710]: ABORTING: Mirror percentage check failed.
lvm[3710]: Repair of mirrored device helter_skelter-syncd_secondary_log_2legs_1 failed.
lvm[3710]: Failed to remove faulty devices in helter_skelter-syncd_secondary_log_2legs_1.
lvm[3710]: No longer monitoring mirror device helter_skelter-syncd_secondary_log_2legs_1 for events.
Intermediate operations are failing:

lvm[3522]: intermediate VG write failed.

This is going to cause things to fail. In the cases above, it causes the repair procedure to fail part way through. It fails after the mirror has been down-converted to linear, but before it can be up-converted and the old log removed.

A couple of things about this:
1) Users generally don't have 'mirror_image_fault_policy' set to "allocate" - this is not the default. The repair happens in two stages: remove the bad devices, then allocate new devices according to the user-set policy. It is the "allocate" portion that is failing.
2) There is no corruption. Clean-up is required because the command failed half-way through, but that isn't preventing things from continuing to operate. The user must - at their convenience - remove the now-visible log device and attempt a fresh up-convert.
3) Failing intermediate operations cannot be completely handled. The devices are failing in stages: one device fails during the repair operation triggered by the first failure. It would not be all that dissimilar to killing the power to the machine during a repair operation. Upon booting up, the user would find a similar situation with a half-completed repair.

I'm not sure to what extent lvmetad is responsible here. But here are the steps to test:
1) disable dmeventd (making it easier to perform the steps by hand)
2) kill one of the mirror devices
3) do some I/O to trigger the kernel to recognise the failure (this will block I/O)
4) use gdb to run 'lvconvert --repair' and stop before adding new mirror images
5) kill the second device (i.e. the log device)
6) proceed to perform vg_write during the final "allocate" stage

We can then determine if this is a special problem with lvmetad. (My guess is that although this has only showed up with lvmetad, it is not isolated to it.)
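The six manual test steps can be sketched as a transcript. This is a rough walk-through, not a verified reproducer: the device names and LV name are taken from this report's second case, the mount point is hypothetical, and the exact gdb stopping point is an assumption (anywhere after the faulty devices are removed but before new images are allocated).

```
# Hypothetical reproduction transcript; names borrowed from the second case above.
[root@virt-024 ~]# lvchange --monitor n helter_skelter/syncd_secondary_log_2legs_1   # 1) stop dmeventd monitoring
[root@virt-024 ~]# echo 1 > /sys/block/sdh/device/delete                             # 2) kill one mirror leg
[root@virt-024 ~]# dd if=/dev/zero of=/mnt/mirror/io bs=4M count=10                  # 3) I/O to a (hypothetical) mount to make the kernel see the failure
[root@virt-024 ~]# gdb --args lvconvert --repair --use-policies helter_skelter/syncd_secondary_log_2legs_1
(gdb) run         # 4) stop before new mirror images are added (breakpoint choice is an assumption)
[root@virt-024 ~]# echo 1 > /sys/block/sdd/device/delete                             # 5) kill the log device, from a second shell
(gdb) continue    # 6) let the final "allocate" stage attempt vg_write against the now-missing PV
```

If the repair then fails half-way as in the logs above, the clean-up would presumably be to remove the now-visible log LV and retry the up-convert, e.g. `lvremove helter_skelter/syncd_secondary_log_2legs_1_mlog` followed by `lvconvert -m 1 helter_skelter/syncd_secondary_log_2legs_1`.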
Description of problem:
I've seen this twice now on two different machines running similar test cases. One involved failing the primary and log device, and the other the secondary and log device. In both cases, it appears that mirror recovery had started just a bit before the second device was failed. Also, in both of these cases, lvmetad was running.

Scenario kill_primary_and_log_synced_2_legs: Kill primary leg and disk log of synced 2 leg mirror(s)
********* Mirror hash info for this scenario *********
* names: syncd_primary_log_2legs_1
* sync: 1
* striped: 0
* leg devices: /dev/sdd1 /dev/sda1
* log devices: /dev/sdc1
* no MDA devices:
* failpv(s): /dev/sdd1 /dev/sdc1
* failnode(s): virt-049.cluster-qe.lab.eng.brq.redhat.com
* lvmetad: 1
* leg fault policy: allocate
* log fault policy: allocate
******************************************************
Creating mirror(s) on virt-049.cluster-qe.lab.eng.brq.redhat.com...
virt-049.cluster-qe.lab.eng.brq.redhat.com: lvcreate -m 1 -n syncd_primary_log_2legs_1 -L 500M helter_skelter /dev/sdd1:0-1000 /dev/sda1:0-1000 /dev/sdc1:0-150
Current mirror/raid device structure(s):
LV Attr LSize Cpy%Sync Devices
syncd_primary_log_2legs_1 mwi-a-m--- 500.00m 16.00 syncd_primary_log_2legs_1_mimage_0(0),syncd_primary_log_2legs_1_mimage_1(0)
[syncd_primary_log_2legs_1_mimage_0] Iwi-aom--- 500.00m /dev/sdd1(0)
[syncd_primary_log_2legs_1_mimage_1] Iwi-aom--- 500.00m /dev/sda1(0)
[syncd_primary_log_2legs_1_mlog] lwi-aom--- 4.00m /dev/sdc1(0)
Waiting until all mirror|raid volumes become fully syncd...
1/1 mirror(s) are fully synced: ( 100.00% )
Creating ext on top of mirror(s) on virt-049.cluster-qe.lab.eng.brq.redhat.com...
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on virt-049.cluster-qe.lab.eng.brq.redhat.com...
PV=/dev/sdd1
syncd_primary_log_2legs_1_mimage_0: 5.1
PV=/dev/sdc1
syncd_primary_log_2legs_1_mlog: 1.2
PV=/dev/sdd1
syncd_primary_log_2legs_1_mimage_0: 5.1
PV=/dev/sdc1
syncd_primary_log_2legs_1_mlog: 1.2
Writing verification files (checkit) to mirror(s) on...
---- virt-049.cluster-qe.lab.eng.brq.redhat.com ----
Sleeping 15 seconds to get some outsanding EXT I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
Disabling device sdd on virt-049.cluster-qe.lab.eng.brq.redhat.com (lvmetad is on)
Disabling device sdc on virt-049.cluster-qe.lab.eng.brq.redhat.com (lvmetad is on)
Getting recovery check start time from /var/log/messages: Oct 5 04:07
Attempting I/O to cause mirror down conversion(s) on virt-049.cluster-qe.lab.eng.brq.redhat.com
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.268018 s, 156 MB/s
Verifying current sanity of lvm after the failure
Current mirror/raid device structure(s):
PV zzQtQB-zAWG-INcc-CcN6-4maz-6DsP-oHopY0 not recognised. Is the device missing?
[...]
PV DSvz2q-dvac-zxq5-KTQ5-BaKi-cLdo-Nc0lq1 not recognised. Is the device missing?
LV Attr LSize Cpy%Sync Devices
syncd_primary_log_2legs_1 -wi-ao---- 500.00m /dev/sda1(0)
syncd_primary_log_2legs_1_mlog -wi------- 4.00m /dev/sdb1(0)
Verifying FAILED device /dev/sdd1 is *NOT* in the volume(s)
Verifying FAILED device /dev/sdc1 is *NOT* in the volume(s)
Verifying LOG device(s) /dev/sdc1 are *NOT* in the mirror/linear(s)
Verifying LEG device /dev/sda1 *IS* in the volume(s)
verify the newly allocated dm devices were added as a result of the failures
Checking EXISTENCE of syncd_primary_log_2legs_1_mimage_0 on: virt-049.cluster-qe.lab.eng.brq.redhat.com
syncd_primary_log_2legs_1_mimage_0 on virt-049.cluster-qe.lab.eng.brq.redhat.com should now exist
** ARE THERE ENOUGH SPARE PVS TO ALLOCATE? **
FI_engine: recover() method failed
[root@virt-049 ~]# lvs -a -o +devices
PV zzQtQB-zAWG-INcc-CcN6-4maz-6DsP-oHopY0 not recognised. Is the device missing?
PV DSvz2q-dvac-zxq5-KTQ5-BaKi-cLdo-Nc0lq1 not recognised. Is the device missing?
PV zzQtQB-zAWG-INcc-CcN6-4maz-6DsP-oHopY0 not recognised. Is the device missing?
PV DSvz2q-dvac-zxq5-KTQ5-BaKi-cLdo-Nc0lq1 not recognised. Is the device missing?
LV VG Attr LSize Devices
syncd_primary_log_2legs_1 helter_skelter -wi-ao---- 500.00m /dev/sda1(0)
syncd_primary_log_2legs_1_mlog helter_skelter -wi------- 4.00m /dev/sdb1(0)
LOG:
qarshd[16771]: Running cmdline: echo 1 > /sys/block/sdd/device/delete &
lvm[3522]: Primary mirror device 253:3 has failed (D).
lvm[3522]: Device failure in helter_skelter-syncd_primary_log_2legs_1.
lvm[3522]: PV zzQtQB-zAWG-INcc-CcN6-4maz-6DsP-oHopY0 not recognised. Is the device missing?
lvm[3522]: Mirror status: 1 of 2 images failed.
lvm[3522]: Trying to up-convert to 2 images, 1 logs.
qarshd[16798]: Running cmdline: echo 1 > /sys/block/sdc/device/delete &
lvm[3522]: /dev/sdc1: read failed after 0 of 1024 at 4096: Input/output error
lvm[3522]: Trying to up-convert to 2 images, 0 logs.
lvm[3522]: intermediate VG write failed.
lvm[3522]: WARNING: Failed to replace 1 of 1 logs in volume syncd_primary_log_2legs_1
lvm[3522]: PV zzQtQB-zAWG-INcc-CcN6-4maz-6DsP-oHopY0 not recognised. Is the device missing?
lvm[3522]: ABORTING: Mirror percentage check failed.
lvm[3522]: Repair of mirrored device helter_skelter-syncd_primary_log_2legs_1 failed.
lvm[3522]: Failed to remove faulty devices in helter_skelter-syncd_primary_log_2legs_1.
lvm[3522]: No longer monitoring mirror device helter_skelter-syncd_primary_log_2legs_1 for events.
Version-Release number of selected component (if applicable):
2.6.32-420.el6.x86_64
lvm2-2.02.100-4.el6 BUILT: Fri Sep 27 16:05:32 CEST 2013
lvm2-libs-2.02.100-4.el6 BUILT: Fri Sep 27 16:05:32 CEST 2013
lvm2-cluster-2.02.100-4.el6 BUILT: Fri Sep 27 16:05:32 CEST 2013
udev-147-2.48.el6 BUILT: Fri Aug 9 13:09:50 CEST 2013
device-mapper-1.02.79-4.el6 BUILT: Fri Sep 27 16:05:32 CEST 2013
device-mapper-libs-1.02.79-4.el6 BUILT: Fri Sep 27 16:05:32 CEST 2013
device-mapper-event-1.02.79-4.el6 BUILT: Fri Sep 27 16:05:32 CEST 2013
device-mapper-event-libs-1.02.79-4.el6 BUILT: Fri Sep 27 16:05:32 CEST 2013
cmirror-2.02.100-4.el6 BUILT: Fri Sep 27 16:05:32 CEST 2013