Bug 1420375
Summary: | impossible to replace a failed drive in lvm raid | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Yuri Arabadji <yuri>
Component: | lvm2 | Assignee: | Heinz Mauelshagen <heinzm>
lvm2 sub component: | Manual pages and Documentation | QA Contact: | cluster-qe <cluster-qe>
Status: | CLOSED NOTABUG | Docs Contact: |
Severity: | unspecified | |
Priority: | unspecified | CC: | agk, cmarthal, heinzm, jbrassow, mcsontos, msnitzer, prajnoha, prockai, zkabelac
Version: | 7.3 | |
Target Milestone: | rc | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-04-26 15:55:34 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: |
Description
Yuri Arabadji
2017-02-08 14:30:09 UTC
A failed raid volume needs to be repaired and consistent before devices can be additionally swapped out. This repair operation will do the swap itself and should happen automatically if you have the raid_fault_policy set to 'allocate'. If not, manual repair is done like this:

# Raid10 with failed leg
[root@host-113 ~]# lvs -a -o +devices
  /dev/sdc1: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdc1: read failed after 0 of 1024 at 22545367040: Input/output error
  /dev/sdc1: read failed after 0 of 1024 at 22545448960: Input/output error
  /dev/sdc1: read failed after 0 of 1024 at 0: Input/output error
  /dev/sdc1: read failed after 0 of 1024 at 4096: Input/output error
  Couldn't find device with uuid XLLqx9-X85X-xlxy-xpd1-Aset-KSYb-bjVScz.
  Couldn't find device for segment belonging to VG/raid10_lv_rimage_3 while checking used and assumed devices.
  LV                   VG Attr       LSize   Cpy%Sync Devices
  raid10_lv            VG rwi-aor-p- 504.00m 100.00   raid10_lv_rimage_0(0),raid10_lv_rimage_1(0),raid10_lv_rimage_2(0),raid10_lv_rimage_3(0)
  [raid10_lv_rimage_0] VG iwi-aor--- 252.00m          /dev/sdh1(1)
  [raid10_lv_rimage_1] VG iwi-aor--- 252.00m          /dev/sde1(1)
  [raid10_lv_rimage_2] VG iwi-aor--- 252.00m          /dev/sdb1(1)
  [raid10_lv_rimage_3] VG iwi-aor-p- 252.00m          unknown device(1)
  [raid10_lv_rmeta_0]  VG ewi-aor--- 4.00m            /dev/sdh1(0)
  [raid10_lv_rmeta_1]  VG ewi-aor--- 4.00m            /dev/sde1(0)
  [raid10_lv_rmeta_2]  VG ewi-aor--- 4.00m            /dev/sdb1(0)
  [raid10_lv_rmeta_3]  VG ewi-aor-p- 4.00m            unknown device(0)

[root@host-113 ~]# dmsetup status
  VG-raid10_lv_rmeta_2: 0 8192 linear
  VG-raid10_lv_rimage_1: 0 516096 linear
  VG-raid10_lv_rmeta_1: 0 8192 linear
  VG-raid10_lv_rimage_0: 0 516096 linear
  VG-raid10_lv_rmeta_0: 0 8192 linear
  VG-raid10_lv_rimage_3: 0 516096 linear
  VG-raid10_lv: 0 1032192 raid raid10 4 AAAD 1032192/1032192 idle 0
  VG-raid10_lv_rmeta_3: 0 8192 linear
  VG-raid10_lv_rimage_2: 0 516096 linear

[root@host-113 ~]# lvconvert --yes --repair VG/raid10_lv
  /dev/sdc1: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdc1: read failed after 0 of 1024 at 22545367040: Input/output error
  /dev/sdc1: read failed after 0 of 1024 at 22545448960: Input/output error
  /dev/sdc1: read failed after 0 of 1024 at 0: Input/output error
  /dev/sdc1: read failed after 0 of 1024 at 4096: Input/output error
  Couldn't find device with uuid XLLqx9-X85X-xlxy-xpd1-Aset-KSYb-bjVScz.
  Couldn't find device for segment belonging to VG/raid10_lv_rimage_3 while checking used and assumed devices.
  Faulty devices in VG/raid10_lv successfully replaced.

[root@host-113 ~]# lvs -a -o +devices
  /dev/sdc1: open failed: No such device or address
  Couldn't find device with uuid XLLqx9-X85X-xlxy-xpd1-Aset-KSYb-bjVScz.
  LV                   VG Attr       LSize   Cpy%Sync Devices
  raid10_lv            VG rwi-aor--- 504.00m 100.00   raid10_lv_rimage_0(0),raid10_lv_rimage_1(0),raid10_lv_rimage_2(0),raid10_lv_rimage_3(0)
  [raid10_lv_rimage_0] VG iwi-aor--- 252.00m          /dev/sdh1(1)
  [raid10_lv_rimage_1] VG iwi-aor--- 252.00m          /dev/sde1(1)
  [raid10_lv_rimage_2] VG iwi-aor--- 252.00m          /dev/sdb1(1)
  [raid10_lv_rimage_3] VG iwi-aor--- 252.00m          /dev/sda1(1)   <- newly allocated device
  [raid10_lv_rmeta_0]  VG ewi-aor--- 4.00m            /dev/sdh1(0)
  [raid10_lv_rmeta_1]  VG ewi-aor--- 4.00m            /dev/sde1(0)
  [raid10_lv_rmeta_2]  VG ewi-aor--- 4.00m            /dev/sdb1(0)
  [raid10_lv_rmeta_3]  VG ewi-aor--- 4.00m            /dev/sda1(0)   <- newly allocated device

[root@host-113 ~]# vgreduce --removemissing VG
  Couldn't find device with uuid XLLqx9-X85X-xlxy-xpd1-Aset-KSYb-bjVScz.
  Wrote out consistent volume group VG

# Now you can additionally swap to another device if you'd like
[root@host-113 ~]# lvconvert -d --replace /dev/sda1 VG/raid10_lv /dev/sdf1

[root@host-113 ~]# lvs -a -o +devices
  /dev/sdc1: open failed: No such device or address
  LV                   VG Attr       LSize   Cpy%Sync Devices
  raid10_lv            VG rwi-aor--- 504.00m 100.00   raid10_lv_rimage_0(0),raid10_lv_rimage_1(0),raid10_lv_rimage_2(0),raid10_lv_rimage_3(0)
  [raid10_lv_rimage_0] VG iwi-aor--- 252.00m          /dev/sdh1(1)
  [raid10_lv_rimage_1] VG iwi-aor--- 252.00m          /dev/sde1(1)
  [raid10_lv_rimage_2] VG iwi-aor--- 252.00m          /dev/sdb1(1)
  [raid10_lv_rimage_3] VG iwi-aor--- 252.00m          /dev/sdf1(1)   <- replaced device
  [raid10_lv_rmeta_0]  VG ewi-aor--- 4.00m            /dev/sdh1(0)
  [raid10_lv_rmeta_1]  VG ewi-aor--- 4.00m            /dev/sde1(0)
  [raid10_lv_rmeta_2]  VG ewi-aor--- 4.00m            /dev/sdb1(0)
  [raid10_lv_rmeta_3]  VG ewi-aor--- 4.00m            /dev/sdf1(0)   <- replaced device
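For reference, a minimal sketch of the lvm.conf knob that drives the automatic behavior mentioned above (setting name and values as shipped with RHEL 7 lvm2; the comments are my paraphrase):

  # /etc/lvm/lvm.conf -- activation section
  activation {
      # "warn"     - dmeventd only logs the device failure
      # "allocate" - dmeventd runs "lvconvert --repair" itself, pulling a
      #              replacement leg from spare space in the same VG
      raid_fault_policy = "allocate"
  }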
Yeah, guys, this all should be clearly documented. Imagine someone dealing with this for the first time.

When looking for a solution, I tried "4.4.3.8.3. Replacing a RAID device", because it's quite the obvious thing to try first. I mean, you've got a failing device, you search the docs for "device", "failure" and so on, you don't search for "fault policy".

The lvconvert man page describes 2 options - "--repair" and "--replace". I mean, it's natural to try to "replace" a failed drive first, and after that, "repair" the raid array with the replaced drive.

Also, --repair has an optional PhysicalVolume argument that isn't documented.

A Red Hat KB article titled "Recovering from LVM RAID device failure" with the steps outlined would be very handy.

ps: I randomly managed to stumble upon another solution, where you replace a failed PV (6.5. REPLACING A MISSING PHYSICAL VOLUME) with --norestorefile, keeping the PV UUID, then you --refresh the VG metadata, forcing an update of the _rimage_ table so it points at the new PV.

A Knowledge Base search [1] returned the following article: https://access.redhat.com/solutions/2537061. It is about thin pools though; maybe repairing a plain RAID LV should be mentioned there too.

I agree the LVM Administration Guide [2] is confusing. In section 4.4.3.8.3 it says:

  Therefore, rather than removing a failed device unconditionally and potentially allocating a replacement, LVM allows you to replace a device in a RAID volume in a one-step solution by using the --replace argument of the lvconvert command

which is apparently inaccurate. I would like to promote the lvmraid(7) man page, which is fine.

Re: the undocumented optional PV argument - any LVM command allocating space takes PVs. This used to be documented in earlier versions of the man page, but was removed recently. It's still in the --help output. Let's see if we can get it back.

[1]: https://access.redhat.com/search/#/?q=lvm%20raid%20replace%20failed%20device&p=1&srch=any&documentKind=
[2]: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Logical_Volume_Manager_Administration/LV.html#raid_volume_create
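A rough sketch of the two points above, reusing the VG, LV, UUID and device names from the example earlier (this is my reading of the optional PV argument and of the ps alternative, not a verified recipe):

  # The trailing PV argument to --repair restricts where the replacement
  # image/metadata sub-LVs may be allocated.
  [root@host-113 ~]# lvconvert --yes --repair VG/raid10_lv /dev/sdf1

  # Alternative path from the ps: recreate the missing PV on the new disk
  # with its old UUID (--norestorefile allows --uuid without a metadata
  # backup file), then reload the LV so the _rimage_/_rmeta_ mappings
  # pick up the recreated PV again.
  [root@host-113 ~]# pvcreate --uuid XLLqx9-X85X-xlxy-xpd1-Aset-KSYb-bjVScz --norestorefile /dev/sdf1
  [root@host-113 ~]# lvchange --refresh VG/raid10_lv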
(In reply to Yuri Arabadji from comment #2)
> Yeah, guys, this all should be clearly documented. Imagine someone dealing
> with this for the first time.
>
> When looking for a solution, I tried "4.4.3.8.3. Replacing a RAID device",
> because it's quite the obvious thing to try first. I mean, you've got a
> failing device, you search the docs for "device", "failure" and so on, you
> don't search for "fault policy".
>
> The lvconvert man page describes 2 options - "--repair" and "--replace".
> I mean, it's natural to try to "replace" a failed drive first, and after
> that, "repair" the raid array with the replaced drive.
>
> Also, --repair has an optional PhysicalVolume argument that isn't documented.
>
> A Red Hat KB article titled "Recovering from LVM RAID device failure" with
> the steps outlined would be very handy.
>
> ps: I randomly managed to stumble upon another solution, where you replace a
> failed PV (6.5. REPLACING A MISSING PHYSICAL VOLUME) with --norestorefile,
> keeping the PV UUID, then you --refresh the VG metadata, forcing an update
> of the _rimage_ table so it points at the new PV.

Yuri, you're right that there could be better documentation - especially the kbase article you mention. We are working on it.

Have you tried the lvmraid(7) man page? It describes in detail the options available when a failure occurs. The distinction between "lvconvert --repair" and "lvconvert --replace", and their use cases with respect to failed vs. still-accessible devices, are documented in lvmraid(7), in the section "Replacing Devices".
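As a quick reference, the distinction as I read it from lvmraid(7) and the walkthrough above, with the same example LV and hypothetical device names:

  # Failed / inaccessible device: --repair swaps the dead leg for new space
  # in the VG (also what dmeventd runs when raid_fault_policy = "allocate").
  [root@host-113 ~]# lvconvert --yes --repair VG/raid10_lv

  # Device still accessible but due to be retired: --replace moves the leg
  # off a named, working PV, optionally onto a chosen new PV.
  [root@host-113 ~]# lvconvert --replace /dev/sda1 VG/raid10_lv /dev/sdf1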