Bug 593119 - RFE: LVM RAID - Handle transient failures of RAID1 images
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2
Version: 6.0
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Jonathan Earl Brassow
QA Contact: Corey Marthaler
Keywords: FutureFeature
Duplicates: 319221
Depends On: 758552
Blocks: 697866 732458 756082
Reported: 2010-05-17 16:59 EDT by Jonathan Earl Brassow
Modified: 2012-08-27 10:54 EDT
CC: 14 users

See Also:
Fixed In Version: lvm2-2.02.95-1.el6
Doc Type: Enhancement
Doc Text:
LVM RAID fully supported with the exception of RAID logical volumes in HA-LVM.

The expanded RAID support in LVM is now fully supported in Red Hat Enterprise Linux 6.3. LVM now has the capability to create RAID 4/5/6 logical volumes and supports a new implementation of mirroring. The MD (software RAID) modules provide the backend support for these new features.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-06-20 10:51:01 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments

None
Description Jonathan Earl Brassow 2010-05-17 16:59:32 EDT
Implement a solution such that when an image (leg) of a mirror fails, the areas of the address space that change during its absence are tracked.  If/when the mirror device can be revived, the changes can be quickly copied over, bringing the mirror image back in sync.

If possible, record in the mirror (set) device which image is unavailable.  This will help verify the correctness of the device when it is re-enabled as part of the group, and should eliminate the need for the administrator to specify the names of the devices that need to be re-added.
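
(In MD terms, this is essentially a write-intent bitmap plus the superblock's record of failed members; in LVM RAID that metadata lives in the per-image rmeta sub-LVs visible in the examples below.)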
Comment 1 Jonathan Earl Brassow 2010-05-18 14:15:52 EDT
*** Bug 319221 has been marked as a duplicate of this bug. ***
Comment 2 Siddharth Nagar 2010-11-23 17:11:06 EST
Deferring to RHEL 6.2.
Comment 5 Jonathan Earl Brassow 2011-08-31 12:40:34 EDT
This feature comes for free with the inclusion of RAID in LVM.  This bug will define how it works, how to test it, and what the release requirements are.
Comment 7 Corey Marthaler 2011-08-31 14:52:26 EDT
QE reviewed this BZ for QA_ACK but was unable to ack due to a lack of
requirements or description of how the new feature is supposed to work or be
tested.

Please add all the device failure cases to be tested/supported in 6.3.

Please see
https://wiki.test.redhat.com/ClusterStorage/WhyNoAck
Comment 9 Jonathan Earl Brassow 2011-12-06 15:46:42 EST
This feature is now committed upstream in LVM version 2.02.89.

In order to handle transient failures, the user must set raid_fault_policy in lvm.conf to "warn".  This prevents the automated response from immediately replacing a device that suffers a failure; instead, the user is warned of the failure.
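
For reference, a minimal lvm.conf fragment (raid_fault_policy belongs in the "activation" section; the alternative value "allocate" replaces the failed device automatically):

    # Fragment of /etc/lvm/lvm.conf
    activation {
        # "warn"     = log the failure and leave recovery to the admin
        # "allocate" = automatically replace the failed device from free space
        raid_fault_policy = "warn"
    }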

Once informed of the failure, the user can take steps to restore the failing device and then, at the next appropriate time, simply deactivate and re-activate the logical volume.  This restores the device to the array and resynchronizes any portions of it that are out of sync.
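
A minimal sketch of that sequence, assuming the LV is vg/lv and that any filesystem on it is unmounted first (the mount point /mnt/lv here is hypothetical):

    umount /mnt/lv            # only if a filesystem is mounted on the LV
    lvchange -an vg/lv        # deactivate the RAID LV
    lvchange -ay vg/lv        # re-activate; out-of-sync regions are resynced
    mount /dev/vg/lv /mnt/lv  # remount if needed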

[ I've also been considering adding some code to 'lvconvert --repair' to check the RAID LV's status and "recycle" it if there is a device which is listed as failed but has come back.  This way, the user would not need to go through the cumbersome 'unmount, deactivate, activate, mount' process.  However, I've not addressed this idea in this bug. ]
Comment 13 Corey Marthaler 2011-12-21 18:54:21 EST
Adding QA ack for 6.3.
Comment 14 Jonathan Earl Brassow 2012-01-17 16:52:17 EST
Showing the feature in action...
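
# (Note: 'devices' is not a standard LVM command; it is presumably a local
# alias along these lines, which matches the output columns shown below:)
#
#   alias devices='lvs -a -o name,copy_percent,devices'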

# 3-way RAID1
[root@bp-01 ~]# devices vg
  LV            Copy%  Devices                                     
  lv            100.00 lv_rimage_0(0),lv_rimage_1(0),lv_rimage_2(0)
  [lv_rimage_0]        /dev/sde1(1)                                
  [lv_rimage_1]        /dev/sdf1(1)                                
  [lv_rimage_2]        /dev/sdg1(1)                                
  [lv_rmeta_0]         /dev/sde1(0)                                
  [lv_rmeta_1]         /dev/sdf1(0)                                
  [lv_rmeta_2]         /dev/sdg1(0)

                 
# Kill a device
[root@bp-01 ~]# off.sh sdf
Turning off sdf
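# (off.sh and on.sh are local test helpers, not shipped tools.  A plausible
# sketch, assuming SCSI devices that honor the sysfs "state" attribute:)
#
#   echo offline > /sys/block/sdf/device/state   # off.sh: simulate failure
#   echo running > /sys/block/sdf/device/state   # on.sh: bring it back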


# Writing to the LV reveals device failure (note: no problem with I/O)
[root@bp-01 ~]# dd if=/dev/zero of=/dev/vg/lv bs=4M count=1
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 0.14839 s, 28.3 MB/s



# LVM messages found in system log after array failure
# (Lack of instructions to restore device may warrant some discussion...)
Jan 17 15:36:41 bp-01 lvm[8599]: Device #1 of raid1 array, vg-lv, has failed.
Jan 17 15:36:41 bp-01 lvm[8599]: /dev/sdf1: read failed after 0 of 2048 at 250994294784: Input/output error
Jan 17 15:36:41 bp-01 lvm[8599]: /dev/sdf1: read failed after 0 of 2048 at 250994376704: Input/output error
Jan 17 15:36:41 bp-01 lvm[8599]: /dev/sdf1: read failed after 0 of 2048 at 0: Input/output error
Jan 17 15:36:41 bp-01 lvm[8599]: /dev/sdf1: read failed after 0 of 2048 at 4096: Input/output error
Jan 17 15:36:42 bp-01 lvm[8599]: Couldn't find device with uuid VxseDx-HGqr-1Fan-TmvI-DS4S-5Xn9-OKpObo.
Jan 17 15:36:43 bp-01 lvm[8599]: Issue 'lvconvert --repair vg/lv' to replace failed device


# 'lvs' output
[root@bp-01 ~]# devices vg
  /dev/sdf1: read failed after 0 of 2048 at 250994294784: Input/output error
  /dev/sdf1: read failed after 0 of 2048 at 250994376704: Input/output error
  /dev/sdf1: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdf1: read failed after 0 of 2048 at 4096: Input/output error
  Couldn't find device with uuid VxseDx-HGqr-1Fan-TmvI-DS4S-5Xn9-OKpObo.
  LV            Copy%  Devices                                     
  lv            100.00 lv_rimage_0(0),lv_rimage_1(0),lv_rimage_2(0)
  [lv_rimage_0]        /dev/sde1(1)                                
  [lv_rimage_1]        unknown device(1)                           
  [lv_rimage_2]        /dev/sdg1(1)                                
  [lv_rmeta_0]         /dev/sde1(0)                                
  [lv_rmeta_1]         unknown device(0)                           
  [lv_rmeta_2]         /dev/sdg1(0)                                


# Turn device back on
[root@bp-01 ~]# on.sh sdf
Turning on sdf


# 'lvs' shows device has recovered.
# (It might be worth a bug to have the Attr characters still report that
# one of the devices is considered "failed", at least until the LV
# is recycled.)
[root@bp-01 ~]# devices vg
  LV            Copy%  Devices                                     
  lv            100.00 lv_rimage_0(0),lv_rimage_1(0),lv_rimage_2(0)
  [lv_rimage_0]        /dev/sde1(1)                                
  [lv_rimage_1]        /dev/sdf1(1)                                
  [lv_rimage_2]        /dev/sdg1(1)                                
  [lv_rmeta_0]         /dev/sde1(0)                                
  [lv_rmeta_1]         /dev/sdf1(0)                                
  [lv_rmeta_2]         /dev/sdg1(0)


# 'dmsetup status', however, still shows the device as "failed"
[root@bp-01 ~]# dmsetup status vg-lv
0 204800 raid raid1 3 ADA 204800/204800
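
# (In the dm-raid status line, each health character maps to one image:
# 'A' = alive and in-sync, 'a' = alive but not in-sync, 'D' = dead/failed;
# the trailing fraction is the in-sync sector count.  "ADA" therefore marks
# the second image, on sdf, as failed.)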



# Recycle the LV
[root@bp-01 ~]# lvchange -an vg/lv; lvchange -ay vg/lv



# Immediately after recycling, 'dmsetup status' shows the device as "alive but recovering"
# (It would be nice if 'lvs' also showed this.)
[root@bp-01 ~]# dmsetup status vg-lv
0 204800 raid raid1 3 AaA 198016/204800


# Once the drive is back in sync, 'dmsetup status' shows it as 'A' again.
[root@bp-01 ~]# dmsetup status vg-lv
0 204800 raid raid1 3 AAA 204800/204800
[root@bp-01 ~]# devices vg
  LV            Copy%  Devices                                     
  lv            100.00 lv_rimage_0(0),lv_rimage_1(0),lv_rimage_2(0)
  [lv_rimage_0]        /dev/sde1(1)                                
  [lv_rimage_1]        /dev/sdf1(1)                                
  [lv_rimage_2]        /dev/sdg1(1)                                
  [lv_rmeta_0]         /dev/sde1(0)                                
  [lv_rmeta_1]         /dev/sdf1(0)                                
  [lv_rmeta_2]         /dev/sdg1(0)
Comment 17 Tom Coughlan 2012-03-28 17:49:33 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
The expanded RAID support in LVM moves from Tech. Preview to full suport in 6.4.
Comment 18 Tom Coughlan 2012-03-28 17:54:33 EDT
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1,3 @@
-The expanded RAID support in LVM moves from Tech. Preview to full suport in 6.4.
+The expanded RAID support in LVM moves from Tech. Preview to full support in 6.3.
+
+LVM now has the capability to create RAID 4/5/6 logical volumes and supports a new implementation of mirroring. The MD (software RAID) modules provide the backend support for these new features.
Comment 19 Martin Prpic 2012-04-06 08:00:32 EDT
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,3 +1,3 @@
-The expanded RAID support in LVM moves from Tech. Preview to full support in 6.3.
+LVM RAID fully supported
 
-LVM now has the capability to create RAID 4/5/6 logical volumes and supports a new implementation of mirroring. The MD (software RAID) modules provide the backend support for these new features.
+The expanded RAID support in LVM is now fully supported in Red Hat Enterprise Linux 6.3. LVM now has the capability to create RAID 4/5/6 logical volumes and supports a new implementation of mirroring. The MD (software RAID) modules provide the backend support for these new features.
Comment 20 Corey Marthaler 2012-05-03 18:46:40 EDT
Feature verified with the latest rpms.

2.6.32-269.el6.x86_64
lvm2-2.02.95-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012
lvm2-libs-2.02.95-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012
lvm2-cluster-2.02.95-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012
udev-147-2.41.el6    BUILT: Thu Mar  1 13:01:08 CST 2012
device-mapper-1.02.74-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012
device-mapper-libs-1.02.74-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012
device-mapper-event-1.02.74-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012
device-mapper-event-libs-1.02.74-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012
cmirror-2.02.95-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012
Comment 21 Jonathan Earl Brassow 2012-05-22 16:32:24 EDT
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,3 +1,3 @@
-LVM RAID fully supported
+LVM RAID fully supported with the exception of RAID logical volumes in HA-LVM.
 
 The expanded RAID support in LVM is now fully supported in Red Hat Enterprise Linux 6.3. LVM now has the capability to create RAID 4/5/6 logical volumes and supports a new implementation of mirroring. The MD (software RAID) modules provide the backend support for these new features.
Comment 23 errata-xmlrpc 2012-06-20 10:51:01 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0962.html
