Bug 1420375
Summary: | impossible to replace a failed drive in lvm raid | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Yuri Arabadji <yuri>
Component: | lvm2 | Assignee: | Heinz Mauelshagen <heinzm>
lvm2 sub component: | Manual pages and Documentation | QA Contact: | cluster-qe <cluster-qe>
Status: | CLOSED NOTABUG | Docs Contact: |
Severity: | unspecified | |
Priority: | unspecified | CC: | agk, cmarthal, heinzm, jbrassow, mcsontos, msnitzer, prajnoha, prockai, zkabelac
Version: | 7.3 | |
Target Milestone: | rc | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-04-26 15:55:34 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: |
Description
Yuri Arabadji
2017-02-08 14:30:09 UTC
A failed raid volume needs to be repaired and consistent before devices can be additionally swapped out. This repair operation will do the swap itself and should happen automatically if you have the raid_fault_policy set to 'allocate'. If not, manual repair is done like this:

# Raid10 with failed leg
[root@host-113 ~]# lvs -a -o +devices
  /dev/sdc1: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdc1: read failed after 0 of 1024 at 22545367040: Input/output error
  /dev/sdc1: read failed after 0 of 1024 at 22545448960: Input/output error
  /dev/sdc1: read failed after 0 of 1024 at 0: Input/output error
  /dev/sdc1: read failed after 0 of 1024 at 4096: Input/output error
  Couldn't find device with uuid XLLqx9-X85X-xlxy-xpd1-Aset-KSYb-bjVScz.
  Couldn't find device for segment belonging to VG/raid10_lv_rimage_3 while checking used and assumed devices.
  LV                   VG Attr       LSize   Cpy%Sync Devices
  raid10_lv            VG rwi-aor-p- 504.00m 100.00   raid10_lv_rimage_0(0),raid10_lv_rimage_1(0),raid10_lv_rimage_2(0),raid10_lv_rimage_3(0)
  [raid10_lv_rimage_0] VG iwi-aor--- 252.00m          /dev/sdh1(1)
  [raid10_lv_rimage_1] VG iwi-aor--- 252.00m          /dev/sde1(1)
  [raid10_lv_rimage_2] VG iwi-aor--- 252.00m          /dev/sdb1(1)
  [raid10_lv_rimage_3] VG iwi-aor-p- 252.00m          unknown device(1)
  [raid10_lv_rmeta_0]  VG ewi-aor--- 4.00m            /dev/sdh1(0)
  [raid10_lv_rmeta_1]  VG ewi-aor--- 4.00m            /dev/sde1(0)
  [raid10_lv_rmeta_2]  VG ewi-aor--- 4.00m            /dev/sdb1(0)
  [raid10_lv_rmeta_3]  VG ewi-aor-p- 4.00m            unknown device(0)

[root@host-113 ~]# dmsetup status
  VG-raid10_lv_rmeta_2: 0 8192 linear
  VG-raid10_lv_rimage_1: 0 516096 linear
  VG-raid10_lv_rmeta_1: 0 8192 linear
  VG-raid10_lv_rimage_0: 0 516096 linear
  VG-raid10_lv_rmeta_0: 0 8192 linear
  VG-raid10_lv_rimage_3: 0 516096 linear
  VG-raid10_lv: 0 1032192 raid raid10 4 AAAD 1032192/1032192 idle 0
  VG-raid10_lv_rmeta_3: 0 8192 linear
  VG-raid10_lv_rimage_2: 0 516096 linear

[root@host-113 ~]# lvconvert --yes --repair VG/raid10_lv
  /dev/sdc1: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdc1: read failed after 0 of 1024 at 22545367040: Input/output error
  /dev/sdc1: read failed after 0 of 1024 at 22545448960: Input/output error
  /dev/sdc1: read failed after 0 of 1024 at 0: Input/output error
  /dev/sdc1: read failed after 0 of 1024 at 4096: Input/output error
  Couldn't find device with uuid XLLqx9-X85X-xlxy-xpd1-Aset-KSYb-bjVScz.
  Couldn't find device for segment belonging to VG/raid10_lv_rimage_3 while checking used and assumed devices.
  Faulty devices in VG/raid10_lv successfully replaced.

[root@host-113 ~]# lvs -a -o +devices
  /dev/sdc1: open failed: No such device or address
  Couldn't find device with uuid XLLqx9-X85X-xlxy-xpd1-Aset-KSYb-bjVScz.
  LV                   VG Attr       LSize   Cpy%Sync Devices
  raid10_lv            VG rwi-aor--- 504.00m 100.00   raid10_lv_rimage_0(0),raid10_lv_rimage_1(0),raid10_lv_rimage_2(0),raid10_lv_rimage_3(0)
  [raid10_lv_rimage_0] VG iwi-aor--- 252.00m          /dev/sdh1(1)
  [raid10_lv_rimage_1] VG iwi-aor--- 252.00m          /dev/sde1(1)
  [raid10_lv_rimage_2] VG iwi-aor--- 252.00m          /dev/sdb1(1)
  [raid10_lv_rimage_3] VG iwi-aor--- 252.00m          /dev/sda1(1)   <- newly allocated device
  [raid10_lv_rmeta_0]  VG ewi-aor--- 4.00m            /dev/sdh1(0)
  [raid10_lv_rmeta_1]  VG ewi-aor--- 4.00m            /dev/sde1(0)
  [raid10_lv_rmeta_2]  VG ewi-aor--- 4.00m            /dev/sdb1(0)
  [raid10_lv_rmeta_3]  VG ewi-aor--- 4.00m            /dev/sda1(0)   <- newly allocated device

[root@host-113 ~]# vgreduce --removemissing VG
  Couldn't find device with uuid XLLqx9-X85X-xlxy-xpd1-Aset-KSYb-bjVScz.
  Wrote out consistent volume group VG

# Now you can additionally swap to another device if you'd like
[root@host-113 ~]# lvconvert -d --replace /dev/sda1 VG/raid10_lv /dev/sdf1

[root@host-113 ~]# lvs -a -o +devices
  /dev/sdc1: open failed: No such device or address
  LV                   VG Attr       LSize   Cpy%Sync Devices
  raid10_lv            VG rwi-aor--- 504.00m 100.00   raid10_lv_rimage_0(0),raid10_lv_rimage_1(0),raid10_lv_rimage_2(0),raid10_lv_rimage_3(0)
  [raid10_lv_rimage_0] VG iwi-aor--- 252.00m          /dev/sdh1(1)
  [raid10_lv_rimage_1] VG iwi-aor--- 252.00m          /dev/sde1(1)
  [raid10_lv_rimage_2] VG iwi-aor--- 252.00m          /dev/sdb1(1)
  [raid10_lv_rimage_3] VG iwi-aor--- 252.00m          /dev/sdf1(1)   <- replaced device
  [raid10_lv_rmeta_0]  VG ewi-aor--- 4.00m            /dev/sdh1(0)
  [raid10_lv_rmeta_1]  VG ewi-aor--- 4.00m            /dev/sde1(0)
  [raid10_lv_rmeta_2]  VG ewi-aor--- 4.00m            /dev/sdb1(0)
  [raid10_lv_rmeta_3]  VG ewi-aor--- 4.00m            /dev/sdf1(0)   <- replaced device
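For reference, a minimal sketch of the lvm.conf knob that drives the automatic behavior mentioned above (setting name and values as shipped with RHEL 7 lvm2; the comments are my paraphrase):

  # /etc/lvm/lvm.conf -- activation section
  activation {
      # "warn"     - dmeventd only logs the device failure
      # "allocate" - dmeventd runs "lvconvert --repair" itself, pulling a
      #              replacement leg from spare space in the same VG
      raid_fault_policy = "allocate"
  }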
Yeah, guys, this all should be clearly documented. Imagine someone dealing with this for the first time.

When looking for a solution, I tried "4.4.3.8.3. Replacing a RAID device", because it's quite the obvious thing to try first. I mean, you've got a failing device, you search the docs for "device", "failure" and so on, you don't search for "fault policy".

The lvconvert man page describes 2 options - "--repair" and "--replace". I mean, it's natural to try to "replace" a failed drive first, and after that, "repair" the raid array with the replaced drive.

Also, --repair has an optional PhysicalVolume argument that isn't documented.

A Red Hat KB article titled "Recovering from LVM RAID device failure" with the steps outlined would be very handy.

ps: I randomly managed to stumble upon another solution, where you replace a failed PV (6.5. REPLACING A MISSING PHYSICAL VOLUME) with --norestorefile, keeping the PV UUID, then you --refresh the VG metadata, forcing an update of the _rimage_ table so it points at the new PV.

A Knowledge Base search [1] returned the following article: https://access.redhat.com/solutions/2537061. It is about thin pools though; maybe repairing a plain RAID LV should be mentioned there too.

I agree the LVM Administration Guide [2] is confusing. In section 4.4.3.8.3 it says:

  Therefore, rather than removing a failed device unconditionally and potentially allocating a replacement, LVM allows you to replace a device in a RAID volume in a one-step solution by using the --replace argument of the lvconvert command

which is apparently inaccurate. I would like to promote the lvmraid(7) man page, which is fine.

Re: the undocumented optional PV argument - any LVM command allocating space takes PVs. This used to be documented in earlier versions of the man page, but was removed recently. It's still in the --help output. Let's see if we can get it back.

[1]: https://access.redhat.com/search/#/?q=lvm%20raid%20replace%20failed%20device&p=1&srch=any&documentKind=
[2]: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Logical_Volume_Manager_Administration/LV.html#raid_volume_create
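A rough sketch of the two points above, reusing the VG, LV, UUID and device names from the example earlier (this is my reading of the optional PV argument and of the ps alternative, not a verified recipe):

  # The trailing PV argument to --repair restricts where the replacement
  # image/metadata sub-LVs may be allocated.
  [root@host-113 ~]# lvconvert --yes --repair VG/raid10_lv /dev/sdf1

  # Alternative path from the ps: recreate the missing PV on the new disk
  # with its old UUID (--norestorefile allows --uuid without a metadata
  # backup file), then reload the LV so the _rimage_/_rmeta_ mappings
  # pick up the recreated PV again.
  [root@host-113 ~]# pvcreate --uuid XLLqx9-X85X-xlxy-xpd1-Aset-KSYb-bjVScz --norestorefile /dev/sdf1
  [root@host-113 ~]# lvchange --refresh VG/raid10_lv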
(In reply to Yuri Arabadji from comment #2)
> Yeah, guys, this all should be clearly documented. Imagine someone dealing
> with this for the first time.
>
> When looking for a solution, I tried "4.4.3.8.3. Replacing a RAID device",
> because it's quite the obvious thing to try first. I mean, you've got a
> failing device, you search the docs for "device", "failure" and so on, you
> don't search for "fault policy".
>
> The lvconvert man page describes 2 options - "--repair" and "--replace".
> I mean, it's natural to try to "replace" a failed drive first, and after
> that, "repair" the raid array with the replaced drive.
>
> Also, --repair has an optional PhysicalVolume argument that isn't documented.
>
> A Red Hat KB article titled "Recovering from LVM RAID device failure" with
> the steps outlined would be very handy.
>
> ps: I randomly managed to stumble upon another solution, where you replace a
> failed PV (6.5. REPLACING A MISSING PHYSICAL VOLUME) with --norestorefile,
> keeping the PV UUID, then you --refresh the VG metadata, forcing an update
> of the _rimage_ table so it points at the new PV.

Yuri, you're right that there could be better documentation - especially the kbase article you mention. We are working on it.

Have you tried the lvmraid(7) man page? It describes in detail the options available when a failure occurs. The distinction between "lvconvert --repair" and "lvconvert --replace", and their use cases with respect to failed vs. still-accessible devices, are documented in lvmraid(7), in the section "Replacing Devices".
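As a quick reference, the distinction as I read it from lvmraid(7) and the walkthrough above, with the same example LV and hypothetical device names:

  # Failed / inaccessible device: --repair swaps the dead leg for new space
  # in the VG (also what dmeventd runs when raid_fault_policy = "allocate").
  [root@host-113 ~]# lvconvert --yes --repair VG/raid10_lv

  # Device still accessible but due to be retired: --replace moves the leg
  # off a named, working PV, optionally onto a chosen new PV.
  [root@host-113 ~]# lvconvert --replace /dev/sda1 VG/raid10_lv /dev/sdf1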