Bug 1397589
| Summary: | Raid 1/4/5/6 device failure repair regression (Unable to extract RAID image while RAID array is not in-sync) | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | lvm2 | Assignee: | Heinz Mauelshagen <heinzm> |
| lvm2 sub component: | Mirroring and RAID (RHEL6) | QA Contact: | cluster-qe <cluster-qe> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | unspecified | CC: | agk, heinzm, jbrassow, msnitzer, prajnoha, prockai, zkabelac |
| Version: | 6.8 | Keywords: | Regression, TestBlocker |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | lvm2-2.02.143-10.el6 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-03-21 12:04:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1311765 | | |
| Bug Blocks: | | | |
|
Description
Corey Marthaler
2016-11-22 21:56:08 UTC
```
# 6.8 raid5 attempt
[root@host-091 ~]# lvs -a -o +devices
  WARNING: Device for PV h9L4KG-Rclw-ZDAd-nHxG-iLFN-YQkL-WmyADY not found or rejected by a filter.
  Couldn't find device for segment belonging to black_bird/synced_random_raid5_2legs_1_rimage_1 while checking used and assumed devices.
  LV                                     VG         Attr       LSize   Cpy%Sync Devices
  synced_random_raid5_2legs_1            black_bird rwi-aor-p- 504.00m 100.00   synced_random_raid5_2legs_1_rimage_0(0),synced_random_raid5_2legs_1_rimage_1(0),synced_random_raid5_2legs_1_rimage_2(0)
  [synced_random_raid5_2legs_1_rimage_0] black_bird iwi-aor--- 252.00m          /dev/sdc1(1)
  [synced_random_raid5_2legs_1_rimage_1] black_bird iwi-aor-p- 252.00m          unknown device(1)
  [synced_random_raid5_2legs_1_rimage_2] black_bird iwi-aor--- 252.00m          /dev/sdf1(1)
  [synced_random_raid5_2legs_1_rmeta_0]  black_bird ewi-aor---   4.00m          /dev/sdc1(0)
  [synced_random_raid5_2legs_1_rmeta_1]  black_bird ewi-aor-p-   4.00m          unknown device(0)
  [synced_random_raid5_2legs_1_rmeta_2]  black_bird ewi-aor---   4.00m          /dev/sdf1(0)

[root@host-091 ~]# lvconvert --yes --repair black_bird/synced_random_raid5_2legs_1
  WARNING: Device for PV h9L4KG-Rclw-ZDAd-nHxG-iLFN-YQkL-WmyADY not found or rejected by a filter.
  Couldn't find device for segment belonging to black_bird/synced_random_raid5_2legs_1_rimage_1 while checking used and assumed devices.
  Faulty devices in black_bird/synced_random_raid5_2legs_1 successfully replaced.

# 6.9 raid5 attempt
[root@host-078 ~]# lvs -a -o +devices
  WARNING: Device for PV zKjceH-1t0W-6r4f-Vi22-ssDp-a2B3-vxoBux not found or rejected by a filter.
  Couldn't find device for segment belonging to black_bird/synced_random_raid5_2legs_1_rimage_2 while checking used and assumed devices.
  LV                                     VG         Attr       LSize   Cpy%Sync Devices
  synced_random_raid5_2legs_1            black_bird rwi-aor-p- 504.00m 100.00   synced_random_raid5_2legs_1_rimage_0(0),synced_random_raid5_2legs_1_rimage_1(0),synced_random_raid5_2legs_1_rimage_2(0)
  [synced_random_raid5_2legs_1_rimage_0] black_bird iwi-aor--- 252.00m          /dev/sdc1(1)
  [synced_random_raid5_2legs_1_rimage_1] black_bird iwi-aor--- 252.00m          /dev/sdd1(1)
  [synced_random_raid5_2legs_1_rimage_2] black_bird iwi-aor-p- 252.00m          unknown device(1)
  [synced_random_raid5_2legs_1_rmeta_0]  black_bird ewi-aor---   4.00m          /dev/sdc1(0)
  [synced_random_raid5_2legs_1_rmeta_1]  black_bird ewi-aor---   4.00m          /dev/sdd1(0)
  [synced_random_raid5_2legs_1_rmeta_2]  black_bird ewi-aor-p-   4.00m          unknown device(0)

[root@host-078 ~]# lvconvert --yes --repair black_bird/synced_random_raid5_2legs_1
  WARNING: Device for PV zKjceH-1t0W-6r4f-Vi22-ssDp-a2B3-vxoBux not found or rejected by a filter.
  Couldn't find device for segment belonging to black_bird/synced_random_raid5_2legs_1_rimage_2 while checking used and assumed devices.
  Unable to extract RAID image while RAID array is not in-sync
  Failed to remove the specified images from black_bird/synced_random_raid5_2legs_1
  Failed to replace faulty devices in black_bird/synced_random_raid5_2legs_1.
```

Looks like this affects raid1 as well. This was a fully synced raid1 when the failure took place.

```
# allocation policy's automatic repair failed
Nov 22 17:52:37 host-078 lvm[1997]: Device #0 of raid1 array, black_bird-synced_primary_raid1_2legs_1, has failed.
Nov 22 17:52:37 host-078 lvm[1997]: WARNING: Device for PV 7QzEuP-y6sd-X0Nk-eqiI-uWco-R52P-l7NcAK not found or rejected by a filter.
Nov 22 17:52:37 host-078 lvm[1997]: Couldn't find device for segment belonging to black_bird/synced_primary_raid1_2legs_1_rimage_0 while checking used and assumed devices.
Nov 22 17:52:37 host-078 lvm[1997]: WARNING: Device for PV 7QzEuP-y6sd-X0Nk-eqiI-uWco-R52P-l7NcAK already missing, skipping.
Nov 22 17:52:37 host-078 lvm[1997]: WARNING: Device for PV 7QzEuP-y6sd-X0Nk-eqiI-uWco-R52P-l7NcAK not found or rejected by a filter.
Nov 22 17:52:37 host-078 lvm[1997]: Couldn't find device for segment belonging to black_bird/synced_primary_raid1_2legs_1_rimage_0 while checking used and assumed devices.
Nov 22 17:52:37 host-078 lvm[1997]: Unable to extract primary RAID image while RAID array is not in-sync (use --force option to replace)
Nov 22 17:52:37 host-078 lvm[1997]: Failed to remove the specified images from black_bird/synced_primary_raid1_2legs_1
Nov 22 17:52:37 host-078 lvm[1997]: Failed to replace faulty devices in black_bird/synced_primary_raid1_2legs_1.
Nov 22 17:52:37 host-078 lvm[1997]: Failed to process event for black_bird-synced_primary_raid1_2legs_1.

# the raid is in-sync
[root@host-078 ~]# lvs -a -o +devices
  WARNING: Device for PV 7QzEuP-y6sd-X0Nk-eqiI-uWco-R52P-l7NcAK not found or rejected by a filter.
  Couldn't find device for segment belonging to black_bird/synced_primary_raid1_2legs_1_rimage_0 while checking used and assumed devices.
  LV                                      VG         Attr       LSize   Cpy%Sync Devices
  synced_primary_raid1_2legs_1            black_bird rwi-aor-p- 500.00m 100.00   synced_primary_raid1_2legs_1_rimage_0(0),synced_primary_raid1_2legs_1_rimage_1(0),synced_primary_raid1_2legs_1_rimage_2(0)
  [synced_primary_raid1_2legs_1_rimage_0] black_bird iwi-aor-p- 500.00m          unknown device(1)
  [synced_primary_raid1_2legs_1_rimage_1] black_bird iwi-aor--- 500.00m          /dev/sdc1(1)
  [synced_primary_raid1_2legs_1_rimage_2] black_bird iwi-aor--- 500.00m          /dev/sdd1(1)
  [synced_primary_raid1_2legs_1_rmeta_0]  black_bird ewi-aor-p-   4.00m          unknown device(0)
  [synced_primary_raid1_2legs_1_rmeta_1]  black_bird ewi-aor---   4.00m          /dev/sdc1(0)
  [synced_primary_raid1_2legs_1_rmeta_2]  black_bird ewi-aor---   4.00m          /dev/sdd1(0)

[root@host-078 ~]# lvconvert --yes --repair black_bird/synced_primary_raid1_2legs_1
  WARNING: Device for PV 7QzEuP-y6sd-X0Nk-eqiI-uWco-R52P-l7NcAK not found or rejected by a filter.
  Couldn't find device for segment belonging to black_bird/synced_primary_raid1_2legs_1_rimage_0 while checking used and assumed devices.
  Unable to extract primary RAID image while RAID array is not in-sync (use --force option to replace)
  Failed to remove the specified images from black_bird/synced_primary_raid1_2legs_1
  Failed to replace faulty devices in black_bird/synced_primary_raid1_2legs_1.

# it's looking for the force now on all repairs?
[root@host-078 ~]# lvconvert --yes --force --repair black_bird/synced_primary_raid1_2legs_1
  WARNING: Device for PV 7QzEuP-y6sd-X0Nk-eqiI-uWco-R52P-l7NcAK not found or rejected by a filter.
  Couldn't find device for segment belonging to black_bird/synced_primary_raid1_2legs_1_rimage_0 while checking used and assumed devices.
  Faulty devices in black_bird/synced_primary_raid1_2legs_1 successfully replaced.
```

The issue here with 'raid1' is that we do not allow repair of a primary leg failure. The reason: with md raid we cannot tell the difference between a failure before and a failure after the initial synchronization - the md raid kernel code does not provide this information, and lvm2 does not yet store it in the lvm2 metadata. So, unfortunately, the user needs to use the '--force' option to repair a raid with a failed primary leg. There is no way dmeventd can do the allocation, since it does not use the --force option. At this moment it is up to the user to decide (by looking at the kernel message log) whether the initial sync completed and the 'other legs' are safe to use, or whether they still contain junk - that is why '--force' with 'lvconvert --repair' is needed ATM.

This is quite the change in behavior. This is something that "just worked" all through lvm raid history, and now will no longer work in 7.4 and 6.9 going forward?
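The post-fix policy discussed above (only a failed raid1 *primary* leg is refused without --force; other legs and raid levels repair normally) can be sketched as a tiny helper. This is an illustrative sketch for this writeup only, not lvm2 code; `needs_force` and its arguments are hypothetical names.

```shell
# Hypothetical sketch of the repair-policy decision this bug settled on:
# only raid1 with a failed primary leg (rimage_0) demands --force, because
# the kernel cannot prove the initial sync ever completed for that leg.
needs_force() {
  local raid_type=$1 failed_image=$2
  if [ "$raid_type" = "raid1" ] && [ "$failed_image" = "rimage_0" ]; then
    echo "yes"   # primary leg: user must confirm sync state and force
  else
    echo "no"    # secondary legs / raid4/5/6/10: repair as before
  fi
}

needs_force raid1 rimage_0   # prints "yes"
needs_force raid1 rimage_1   # prints "no"
needs_force raid5 rimage_1   # prints "no"
```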
I understand your argument that it may have never "just worked" properly, but this will require documentation to let users know that the allocate fault policies will no longer automatically repair raids when a primary leg that was in-sync fails, even though they did in every release prior.

Only raid1 should be restricted to reject repair unless --force is provided, per the argument in comment #4; raid4/5/6/10 should not be. Upstream commit e611f82a11fb wasn't added to the lvm2-2.02.143-9.el6 build, so the checks were not restricted to raid1.

The three scenarios listed in comments #0, #1, #2 now work as expected. That is, raid5 and raid6 are back to their normal behavior, and raid1 in 6.9 now requires a --force in order to repair a primary leg failure regardless of sync status.

```
2.6.32-671.el6.x86_64
lvm2-2.02.143-10.el6                      BUILT: Thu Nov 24 03:58:43 CST 2016
lvm2-libs-2.02.143-10.el6                 BUILT: Thu Nov 24 03:58:43 CST 2016
lvm2-cluster-2.02.143-10.el6              BUILT: Thu Nov 24 03:58:43 CST 2016
udev-147-2.73.el6_8.2                     BUILT: Tue Aug 30 08:17:19 CDT 2016
device-mapper-1.02.117-10.el6             BUILT: Thu Nov 24 03:58:43 CST 2016
device-mapper-libs-1.02.117-10.el6        BUILT: Thu Nov 24 03:58:43 CST 2016
device-mapper-event-1.02.117-10.el6       BUILT: Thu Nov 24 03:58:43 CST 2016
device-mapper-event-libs-1.02.117-10.el6  BUILT: Thu Nov 24 03:58:43 CST 2016
```

```
# raid6
Scenario kill_multiple_synced_raid6_3legs: Kill multiple legs of synced 3 leg raid6 volume(s)

********* RAID hash info for this scenario *********
* names:              synced_multiple_raid6_3legs_1
* sync:               1
* type:               raid6
* -m |-i value:       3
* leg devices:        /dev/mapper/mpathap1 /dev/mapper/mpathfp1 /dev/mapper/mpathhp1 /dev/mapper/mpathdp1 /dev/mapper/mpathep1
* spanned legs:       0
* manual repair:      0
* failpv(s):          /dev/mapper/mpathdp1 /dev/mapper/mpathep1
* failnode(s):        taft-04
* lvmetad:            0
* raid fault policy:  warn
******************************************************

Creating raids(s) on taft-04...
taft-04: lvcreate --type raid6 -i 3 -n synced_multiple_raid6_3legs_1 -L 500M black_bird /dev/mapper/mpathap1:0-2400 /dev/mapper/mpathfp1:0-2400 /dev/mapper/mpathhp1:0-2400 /dev/mapper/mpathdp1:0-2400 /dev/mapper/mpathep1:0-2400
[...]
Fault policy is warn... Manually repairing failed raid volumes
taft-04: 'lvconvert --yes --repair black_bird/synced_multiple_raid6_3legs_1'
  Couldn't find device with uuid 2xpwZb-6teA-p2q1-Ghnh-j62n-Y6Xz-swD5WS.
  Couldn't find device with uuid aRC94g-llmD-ceN2-5GOS-IsGB-Ecaf-k6cLJc.
  Couldn't find device for segment belonging to black_bird/synced_multiple_raid6_3legs_1_rimage_3 while checking used and assumed devices.
Waiting until all mirror|raid volumes become fully syncd...
  1/1 mirror(s) are fully synced: ( 100.00% )
```

```
# raid5
Scenario kill_random_synced_raid5_2legs: Kill random leg of synced 2 leg raid5 volume(s)

********* RAID hash info for this scenario *********
* names:              synced_random_raid5_2legs_1
* sync:               1
* type:               raid5
* -m |-i value:       2
* leg devices:        /dev/sde1 /dev/sdh1 /dev/sda1
* spanned legs:       0
* manual repair:      0
* failpv(s):          /dev/sdh1
* failnode(s):        host-076
* lvmetad:            0
* raid fault policy:  warn
******************************************************

Creating raids(s) on host-076...
host-076: lvcreate --type raid5 -i 2 -n synced_random_raid5_2legs_1 -L 500M black_bird /dev/sde1:0-2400 /dev/sdh1:0-2400 /dev/sda1:0-2400
[...]
Fault policy is warn... Manually repairing failed raid volumes
host-076: 'lvconvert --yes --repair black_bird/synced_random_raid5_2legs_1'
  /dev/sdh1: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdh1: read failed after 0 of 512 at 21467824128: Input/output error
  /dev/sdh1: read failed after 0 of 512 at 21467938816: Input/output error
  /dev/sdh1: read failed after 0 of 512 at 0: Input/output error
  /dev/sdh1: read failed after 0 of 512 at 4096: Input/output error
  Couldn't find device with uuid GlIYa2-2lz1-9R9M-hQls-B0Hd-Os1z-2kSXQy.
  Couldn't find device for segment belonging to black_bird/synced_random_raid5_2legs_1_rimage_1 while checking used and assumed devices.
Waiting until all mirror|raid volumes become fully syncd...
  0/1 mirror(s) are fully synced: ( 84.97% )
  1/1 mirror(s) are fully synced: ( 100.00% )
```

```
# raid1
Scenario kill_primary_synced_raid1_2legs: Kill primary leg of synced 2 leg raid1 volume(s)

********* RAID hash info for this scenario *********
* names:              synced_primary_raid1_2legs_1
* sync:               1
* type:               raid1
* -m |-i value:       2
* leg devices:        /dev/mapper/mpathhp1 /dev/mapper/mpathcp1 /dev/mapper/mpathfp1
* spanned legs:       0
* manual repair:      1
* failpv(s):          /dev/mapper/mpathhp1
* additional snap:    /dev/mapper/mpathcp1
* failnode(s):        taft-04
* lvmetad:            0
* raid fault policy:  allocate
******************************************************

Creating raids(s) on taft-04...
taft-04: lvcreate --type raid1 -m 2 -n synced_primary_raid1_2legs_1 -L 500M black_bird /dev/mapper/mpathhp1:0-2400 /dev/mapper/mpathcp1:0-2400 /dev/mapper/mpathfp1:0-2400
[...]
Manually repairing failed raid volumes
(but first, verify that a non-force repair attempt fails, check for bug 1311765)
  Couldn't find device for segment belonging to black_bird/synced_primary_raid1_2legs_1_rimage_0 while checking used and assumed devices.
  Unable to extract primary RAID image while RAID array is not in-sync (use --force option to replace)
  Failed to remove the specified images from black_bird/synced_primary_raid1_2legs_1
  Failed to replace faulty devices in black_bird/synced_primary_raid1_2legs_1.
taft-04: 'lvconvert --force --yes --repair black_bird/synced_primary_raid1_2legs_1'
  Couldn't find device for segment belonging to black_bird/synced_primary_raid1_2legs_1_rimage_0 while checking used and assumed devices.
Waiting until all mirror|raid volumes become fully syncd...
  1/1 mirror(s) are fully synced: ( 100.00% )
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0798.html
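For reference, the manual repair flow the scenarios above exercise can be outlined as below. This is a hedged sketch, not a transcript from the report: the volume group and LV names are placeholders, the commands require root on a host with real failed PVs, and on 6.9 and later the --force step applies only to a raid1 primary-leg failure.

```shell
# Sketch of the manual RAID repair flow (placeholder names; requires root
# and a VG with an actually failed physical volume - do not run as-is).
VG=black_bird
LV=synced_primary_raid1_2legs_1

# 1. Inspect the array; failed legs show the 'p' (partial) attribute and
#    'unknown device' in the Devices column.
lvs -a -o +devices "$VG"

# 2. Attempt an ordinary repair; lvm allocates a replacement leg from
#    free extents in the VG. This succeeds for raid4/5/6/10 and for
#    raid1 secondary legs.
lvconvert --yes --repair "$VG/$LV"

# 3. A failed raid1 *primary* leg is refused without --force, because
#    the kernel cannot confirm the initial sync ever completed. Check
#    the kernel log first, then force the replacement if the other
#    legs are known good.
lvconvert --yes --force --repair "$VG/$LV"
```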