Bug 1311765
| Summary: | non synced primary leg raid1 recovery allocation unable to take place | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | lvm2 | Assignee: | Heinz Mauelshagen <heinzm> |
| lvm2 sub component: | Mirroring and RAID (RHEL6) | QA Contact: | cluster-qe <cluster-qe> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | agk, heinzm, jbrassow, msnitzer, prajnoha, prockai, zkabelac |
| Version: | 6.8 | | |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | lvm2-2.02.143-8.el6 | Doc Type: | Enhancement |
Doc Text:

Cause: Data loss on a linear LV upconverted to raid1. A failed primary raid1 leg during initial resynchronization of an upconverted linear LV (e.g. "lvconvert -m1 $LinearLV") could be replaced, thus losing the previous linear LV and any of its still recoverable data.

Consequence: Any still recoverable data on the primary, previously linear leg is lost.

Fix: Reject repair of a raid1 LV with a failed primary leg during initial synchronization.

Result: A monitored raid1 LV is no longer silently repaired. When repairing manually, a message states that repair is only possible using --force, enabling the user to recover data from any still partially readable primary leg first.
| Story Points: | --- | | |
|---|---|---|---|
| Clone Of: | | Environment: | |
| Last Closed: | 2017-03-21 12:02:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1397589, 1446754 | | |
| Attachments: | | | |
Description (Corey Marthaler, 2016-02-24 23:10:45 UTC)
this might be another occurrence of the sync state not being valid.

(In reply to Jonathan Earl Brassow from comment #12)
> this might be another occurrence of the sync state not being valid.

bug 1210637 is the reference.

Based on discussion with Jon and Corey, it is actually the correct behaviour to prevent repairing failures of the primary leg of a raid1 array (the same applies to "mirror"). We should not silently allow automatic replacement, because the raid1 LV may be an upconverted linear one, and the user may lose data that could still be dd_rescue'd off of the primary leg. We agreed on requiring the --force option for such repairs. Without the force option, repairs of the primary leg will be rejected.

However, it appears there are cases where the automatic repair of a not-in-sync failed primary leg still succeeds:

```
Scenario kill_primary_non_synced_raid1_1legs: Kill primary leg of NON synced 1 leg raid1 volume(s)

********* RAID hash info for this scenario *********
* names:              non_synced_primary_raid1_2legs_1
* sync:               0
* type:               raid1
* -m |-i value:       1
* leg devices:        /dev/sdf1 /dev/sda1
* spanned legs:       0
* manual repair:      1
* failpv(s):          /dev/sdf1
* failnode(s):        host-114.virt.lab.msp.redhat.com
* lvmetad:            0
* raid fault policy:  allocate
******************************************************

Creating raids(s) on host-114.virt.lab.msp.redhat.com...
host-114.virt.lab.msp.redhat.com: lvcreate --type raid1 -m 1 -n non_synced_primary_raid1_2legs_1 -L 3G black_bird /dev/sdf1:0-2400 /dev/sda1:0-2400

Current mirror/raid device structure(s):
  LV                                          Attr       LSize Cpy%Sync Devices
  non_synced_primary_raid1_2legs_1            rwi-a-r--- 3.00g 0.00     non_synced_primary_raid1_2legs_1_rimage_0(0),non_synced_primary_raid1_2legs_1_rimage_1(0)
  [non_synced_primary_raid1_2legs_1_rimage_0] Iwi-aor--- 3.00g          /dev/sdf1(1)
  [non_synced_primary_raid1_2legs_1_rimage_1] Iwi-aor--- 3.00g          /dev/sda1(1)
  [non_synced_primary_raid1_2legs_1_rmeta_0]  ewi-aor--- 4.00m          /dev/sdf1(0)
  [non_synced_primary_raid1_2legs_1_rmeta_1]  ewi-aor--- 4.00m          /dev/sda1(0)

Creating ext on top of mirror(s) on host-114.virt.lab.msp.redhat.com...
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on host-114.virt.lab.msp.redhat.com...

PV=/dev/sdf1
        non_synced_primary_raid1_2legs_1_rimage_0: 1.0
        non_synced_primary_raid1_2legs_1_rmeta_0: 1.0

Writing verification files (checkit) to mirror(s) on...
        ---- host-114.virt.lab.msp.redhat.com ----

Verifying files (checkit) on mirror(s) on...
        ---- host-114.virt.lab.msp.redhat.com ----

Current sync percent just before failure ( 17.05% )

Disabling device sdf on host-114.virt.lab.msp.redhat.com

Getting recovery check start time from /var/log/messages: Apr  6 17:36
Attempting I/O to cause mirror down conversion(s) on host-114.virt.lab.msp.redhat.com
dd if=/dev/zero of=/mnt/non_synced_primary_raid1_2legs_1/ddfile count=10 bs=4M oflag=direct
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.358481 s, 117 MB/s

Apr  6 17:37:03 host-114 lvm[4619]: Device #0 of raid1 array, black_bird-non_synced_primary_raid1_2legs_1, has failed.
Apr  6 17:37:03 host-114 lvm[4619]: Couldn't find device with uuid TxkSOQ-FWnC-3zYx-IytA-sexH-Wjh8-Yz34er.
Apr  6 17:37:03 host-114 lvm[4619]: Couldn't find device with uuid TxkSOQ-FWnC-3zYx-IytA-sexH-Wjh8-Yz34er.
Apr  6 17:37:04 host-114 lvm[4619]: Faulty devices in black_bird/non_synced_primary_raid1_2legs_1 successfully replaced.

Verifying current sanity of lvm after the failure

Current mirror/raid device structure(s):
  Couldn't find device with uuid TxkSOQ-FWnC-3zYx-IytA-sexH-Wjh8-Yz34er.
  LV                                          Attr       LSize Cpy%Sync Devices
  non_synced_primary_raid1_2legs_1            rwi-aor--- 3.00g 100.00   non_synced_primary_raid1_2legs_1_rimage_0(0),non_synced_primary_raid1_2legs_1_rimage_1(0)
  [non_synced_primary_raid1_2legs_1_rimage_0] iwi-aor--- 3.00g          /dev/sdc1(1)
  [non_synced_primary_raid1_2legs_1_rimage_1] iwi-aor--- 3.00g          /dev/sda1(1)
  [non_synced_primary_raid1_2legs_1_rmeta_0]  ewi-aor--- 4.00m          /dev/sdc1(0)
  [non_synced_primary_raid1_2legs_1_rmeta_1]  ewi-aor--- 4.00m          /dev/sda1(0)

2.6.32-639.el6.x86_64
lvm2-2.02.143-7.el6                              BUILT: Wed Apr  6 10:08:33 CDT 2016
lvm2-libs-2.02.143-7.el6                         BUILT: Wed Apr  6 10:08:33 CDT 2016
lvm2-cluster-2.02.143-7.el6                      BUILT: Wed Apr  6 10:08:33 CDT 2016
udev-147-2.72.el6                                BUILT: Tue Mar  1 06:14:05 CST 2016
device-mapper-1.02.117-7.el6                     BUILT: Wed Apr  6 10:08:33 CDT 2016
device-mapper-libs-1.02.117-7.el6                BUILT: Wed Apr  6 10:08:33 CDT 2016
device-mapper-event-1.02.117-7.el6               BUILT: Wed Apr  6 10:08:33 CDT 2016
device-mapper-event-libs-1.02.117-7.el6          BUILT: Wed Apr  6 10:08:33 CDT 2016
device-mapper-persistent-data-0.6.2-0.1.rc7.el6  BUILT: Tue Mar 22 08:58:09 CDT 2016
cmirror-2.02.143-7.el6                           BUILT: Wed Apr  6 10:08:33 CDT 2016
```

I'm working on a patch to prevent automatic repair of the primary leg w/o the --force option, as stated in comment #16.

Created attachment 1144749 [details]
Preliminary patch to reject repair of a failed primary leg unless --force is given.

The patch adds detection of an in-sync ratio below 100% _and_ a failed primary leg, and prohibits the repair because of potential data loss (e.g. if this was an upconvert of a linear LV) unless the user intentionally requests it via the --force option to "lvconvert --repair $LV".
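Under the patched behaviour the operator is expected to salvage data before forcing the repair. A minimal shell sketch of that flow, reusing the VG/LV names from the test scenario above; the dd_rescue destination path and the assumption that the primary sub-LV is still partially readable are illustrative only, not part of the patch:

```
# 1. A plain repair is rejected while the raid1 LV has never completed its initial sync:
lvconvert --repair black_bird/non_synced_primary_raid1_2legs_1
#    Unable to extract primary RAID image while RAID array is not in-sync
#    (use --force option to replace)

# 2. Optionally copy whatever is still readable off the failed primary leg first.
#    The _rimage_0 device-mapper node exists only while the raid1 LV is active;
#    /backup/rimage_0.img is a hypothetical destination.
dd_rescue /dev/mapper/black_bird-non_synced_primary_raid1_2legs_1_rimage_0 /backup/rimage_0.img

# 3. Only then knowingly discard the primary leg and let lvconvert replace it:
lvconvert --force --yes --repair black_bird/non_synced_primary_raid1_2legs_1
```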
To add to comment #18, it appears that *only* 2-legged (-m 1) not-in-sync raid volumes have automatic repair take place when it should not:

```
## lvcreate -m 1
lvm[4000]: Faulty devices in black_bird/non_synced_primary_raid1_2legs_1 successfully replaced.
lvm[4000]: Device #0 of raid1 array, black_bird-non_synced_primary_raid1_2legs_1, has failed.
lvm[4000]: Couldn't find device with uuid f4PXe5-A0bK-keOu-T0lh-93Jc-xz9X-KBnWrg.
lvm[4000]: Couldn't find device with uuid f4PXe5-A0bK-keOu-T0lh-93Jc-xz9X-KBnWrg.
lvm[4000]: Faulty devices in black_bird/non_synced_primary_raid1_2legs_1 successfully replaced.

## lvcreate -m 2|3
lvm[4003]: Unable to extract primary RAID image while RAID array is not in-sync
lvm[4003]: Failed to remove the specified images from black_bird/non_synced_primary_raid1_3legs_1
lvm[4003]: Failed to replace faulty devices in black_bird/non_synced_primary_raid1_3legs_1.
lvm[4003]: Failed to process event for black_bird-non_synced_primary_raid1_3legs_1-real.
```

We have to reject repairs of non-synced raid1 LVs to prevent further data loss.

Upstream commit 8270ff5702e0 rejects repair of a non-synced raid1 LV unless the --force option is provided, as per the settlement in comment #16. This gives the user the option to rescue data off of a primary leg that failed during the initial sync of an upconvert from linear, or to knowingly enforce the repair.

Verified that, although this bug was originally about non-synced raid1 volumes unexpectedly not being repaired automatically, this is now the expected behavior (see comment #16). Also verified that the only way to repair a primary leg raid1 failure (whether or not it had previously been in sync) is now for the user to specifically attempt it with the '--force' option.

```
2.6.32-671.el6.x86_64
lvm2-2.02.143-10.el6                       BUILT: Thu Nov 24 03:58:43 CST 2016
lvm2-libs-2.02.143-10.el6                  BUILT: Thu Nov 24 03:58:43 CST 2016
lvm2-cluster-2.02.143-10.el6               BUILT: Thu Nov 24 03:58:43 CST 2016
udev-147-2.73.el6_8.2                      BUILT: Tue Aug 30 08:17:19 CDT 2016
device-mapper-1.02.117-10.el6              BUILT: Thu Nov 24 03:58:43 CST 2016
device-mapper-libs-1.02.117-10.el6         BUILT: Thu Nov 24 03:58:43 CST 2016
device-mapper-event-1.02.117-10.el6        BUILT: Thu Nov 24 03:58:43 CST 2016
device-mapper-event-libs-1.02.117-10.el6   BUILT: Thu Nov 24 03:58:43 CST 2016
```

### Non synced raid1 kill primary leg w/ warn fault policy

```
./black_bird -i 2 -f -e kill_primary_non_synced_raid1_1legs,kill_primary_non_synced_raid1_2legs
[...]
Fault policy is warn...
Manually repairing failed raid volumes
(but first, verify that a non-force repair attempt fails, check for bug 1311765)
[...]
Couldn't find device for segment belonging to black_bird/non_synced_primary_raid1_2legs_1_rimage_0 while checking used and assumed devices.
Unable to extract primary RAID image while RAID array is not in-sync (use --force option to replace)
Failed to remove the specified images from black_bird/non_synced_primary_raid1_2legs_1
Failed to replace faulty devices in black_bird/non_synced_primary_raid1_2legs_1.

taft-04: 'lvconvert --force --yes --repair black_bird/non_synced_primary_raid1_2legs_1'
[...]
Couldn't find device for segment belonging to black_bird/non_synced_primary_raid1_2legs_1_rimage_0 while checking used and assumed devices.

Waiting until all mirror|raid volumes become fully syncd...
   0/1 mirror(s) are fully synced: ( 41.08% )
   0/1 mirror(s) are fully synced: ( 71.47% )
   1/1 mirror(s) are fully synced: ( 100.00% )
```

### Non synced raid1 kill primary leg w/ allocate fault policy

```
./black_bird -i 2 -F -e kill_primary_non_synced_raid1_1legs,kill_primary_non_synced_raid1_2legs

# Automatic allocation attempt now fails as expected:
Nov 30 13:27:26 taft-04 lvm[32257]: Unable to extract primary RAID image while RAID array is not in-sync (use --force option to replace)
Nov 30 13:27:26 taft-04 lvm[32257]: Failed to remove the specified images from black_bird/non_synced_primary_raid1_2legs_1
Nov 30 13:27:26 taft-04 lvm[32257]: Failed to replace faulty devices in black_bird/non_synced_primary_raid1_2legs_1.
Nov 30 13:27:26 taft-04 lvm[32257]: Failed to process event for black_bird-non_synced_primary_raid1_2legs_1.
[...]
Nov 30 13:27:26 taft-04 lvm[32257]: Couldn't find device with uuid kfRQwe-NpU3-LKkk-wlev-NW3k-uThl-NLvdog.
Nov 30 13:27:26 taft-04 lvm[32257]: Couldn't find device for segment belonging to black_bird/non_synced_primary_raid1_2legs_1_rimage_0 while checking used and assumed devices.
Nov 30 13:27:26 taft-04 lvm[32257]: Couldn't find device with uuid kfRQwe-NpU3-LKkk-wlev-NW3k-uThl-NLvdog.
[...]
Nov 30 13:27:26 taft-04 lvm[32257]: Couldn't find device for segment belonging to black_bird/non_synced_primary_raid1_2legs_1_rimage_0 while checking used and assumed devices.
Nov 30 13:27:27 taft-04 lvm[32257]: Unable to extract primary RAID image while RAID array is not in-sync (use --force option to replace)
Nov 30 13:27:27 taft-04 lvm[32257]: Failed to remove the specified images from black_bird/non_synced_primary_raid1_2legs_2
Nov 30 13:27:27 taft-04 lvm[32257]: Failed to replace faulty devices in black_bird/non_synced_primary_raid1_2legs_2.
Nov 30 13:27:27 taft-04 lvm[32257]: Failed to process event for black_bird-non_synced_primary_raid1_2legs_2.

# Manual attempt w/o a --force also fails
Manually repairing failed raid volumes
(but first, verify that a non-force repair attempt fails, check for bug 1311765)
[...]
Couldn't find device with uuid kfRQwe-NpU3-LKkk-wlev-NW3k-uThl-NLvdog.
Couldn't find device for segment belonging to black_bird/non_synced_primary_raid1_2legs_1_rimage_0 while checking used and assumed devices.
Unable to extract primary RAID image while RAID array is not in-sync (use --force option to replace)
Failed to remove the specified images from black_bird/non_synced_primary_raid1_2legs_1
Failed to replace faulty devices in black_bird/non_synced_primary_raid1_2legs_1.

taft-04: 'lvconvert --force --yes --repair black_bird/non_synced_primary_raid1_2legs_1'
[...]
Couldn't find device with uuid kfRQwe-NpU3-LKkk-wlev-NW3k-uThl-NLvdog.
Couldn't find device for segment belonging to black_bird/non_synced_primary_raid1_2legs_1_rimage_0 while checking used and assumed devices.

(but first, verify that a non-force repair attempt fails, check for bug 1311765)
[...]
Couldn't find device with uuid kfRQwe-NpU3-LKkk-wlev-NW3k-uThl-NLvdog.
Couldn't find device for segment belonging to black_bird/non_synced_primary_raid1_2legs_2_rimage_0 while checking used and assumed devices.
Unable to extract primary RAID image while RAID array is not in-sync (use --force option to replace)
Failed to remove the specified images from black_bird/non_synced_primary_raid1_2legs_2
Failed to replace faulty devices in black_bird/non_synced_primary_raid1_2legs_2.

taft-04: 'lvconvert --force --yes --repair black_bird/non_synced_primary_raid1_2legs_2'
[...]
Couldn't find device with uuid kfRQwe-NpU3-LKkk-wlev-NW3k-uThl-NLvdog.
Couldn't find device for segment belonging to black_bird/non_synced_primary_raid1_2legs_2_rimage_0 while checking used and assumed devices.

Waiting until all mirror|raid volumes become fully syncd...
   0/2 mirror(s) are fully synced: ( 51.70% 18.53% )
   0/2 mirror(s) are fully synced: ( 73.46% 37.75% )
   0/2 mirror(s) are fully synced: ( 93.42% 58.54% )
   1/2 mirror(s) are fully synced: ( 100.00% 92.59% )
   2/2 mirror(s) are fully synced: ( 100.00% 100.00% )
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0798.html
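The two automatic responses exercised in the verification runs above, "warn" and "allocate", are selected by dmeventd's raid fault policy. A minimal sketch of checking and setting it, assuming the lvmconfig command shipped with this lvm2 version and a stock lvm.conf; the values shown are simply the two policies used in the test:

```
# Show which response dmeventd takes when a monitored raid LV loses a device:
lvmconfig activation/raid_fault_policy

# In /etc/lvm/lvm.conf, activation section:
#     raid_fault_policy = "warn"      # log only, leave repair to the admin
#     raid_fault_policy = "allocate"  # try to replace the failed leg automatically
```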