Bug 985976

Summary: LVM RAID: Add ability to perform RAID scrubbing operations
Product: Red Hat Enterprise Linux 6 Reporter: Jonathan Earl Brassow <jbrassow>
Component: lvm2 Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 6.5 CC: agk, cmarthal, dwysocha, heinzm, jbrassow, msnitzer, prajnoha, prockai, slevine, thornber, zkabelac
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: lvm2-2.02.100-1.el6 Doc Type: Bug Fix
Doc Text:
RAID logical volumes that are created via LVM are now capable of performing scrubbing operations. Scrubbing operations are user-initiated checks to ensure that the RAID volume is consistent. For example, a scrubbing "check" operation on a RAID1 logical volume would determine if there are any sectors in the mirror set that are not the same.

There are two scrubbing operations that can be performed: "check" and "repair". The "check" operation will examine the logical volume for any discrepancies, but will not correct them. The "repair" operation will correct any discrepancies found.

Once a "check" operation is performed, the user can tell if any mismatches were found by examining the 'lv_attr' field in the output of an 'lvs' command. The user can also find out the number of discrepancies found by including the 'raid_mismatch_count' field in the 'lvs' output.

Here are a couple examples:

# To perform a "check" on a RAID logical volume, do:
~> lvchange --syncaction check vg/lv

# To perform a "repair" on a RAID logical volume, do:
~> lvchange --syncaction repair vg/lv

# To determine the mismatch count after a "check", do:
~> lvs -o +raid_mismatch_count vg/lv
Story Points: ---
Clone Of:
: 986443 Environment:
Last Closed: 2013-11-21 23:25:46 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 985920    
Bug Blocks: 986443, 986445    

Description Jonathan Earl Brassow 2013-07-18 16:04:15 UTC
From the upstream commit message:

commit ff64e3500f6acf93dce017388445c4828111d06f
Author: Jonathan Brassow <jbrassow>
Date:   Thu Apr 11 15:33:59 2013 -0500

    RAID:  Add scrubbing support for RAID LVs
    
    New options to 'lvchange' allow users to scrub their RAID LVs.
    Synopsis:
        lvchange --syncaction {check|repair} vg/raid_lv
    
    RAID scrubbing is the process of reading all the data and parity blocks in
    an array and checking to see whether they are coherent.  'lvchange' can
    now initiate the two scrubbing operations: "check" and "repair".  "check"
    will go over the array and record the number of discrepancies but not
    repair them.  "repair" will correct the discrepancies as it finds them.
    
    'lvchange --syncaction repair vg/raid_lv' is not to be confused with
    'lvconvert --repair vg/raid_lv'.  The former initiates a background
    synchronization operation on the array, while the latter is designed to
    repair/replace failed devices in a mirror or RAID logical volume.
    
    Additional reporting has been added for 'lvs' to support the new
    operations.  Two new printable fields (which are not printed by
    default) have been added: "syncaction" and "mismatches".  These
    can be accessed using the '-o' option to 'lvs', like:
        lvs -o +syncaction,mismatches vg/lv
    "syncaction" will print the current synchronization operation that the
    RAID volume is performing.  It can be one of the following:
            - idle:   All sync operations complete (doing nothing)
            - resync: Initializing an array or recovering after a machine failure
            - recover: Replacing a device in the array
            - check: Looking for array inconsistencies
            - repair: Looking for and repairing inconsistencies
    The "mismatches" field will print the number of discrepancies found during
    a check or repair operation.
    
    The 'Cpy%Sync' field already available to 'lvs' will print the progress
    of any of the above syncactions, including check and repair.
    
    Finally, the lv_attr field has changed to accommodate the scrubbing operation
    as well.  The role of the 'p'artial character in the lv_attr report field
    has expanded.  "Partial" is really an indicator for the health of a
    logical volume and it makes sense to extend this to include other health
    indicators as well, specifically:
            'm'ismatches:  Indicates that there are discrepancies in a RAID
                           LV.  This character is shown after a scrubbing
                           operation has detected that portions of the RAID
                           are not coherent.
            'r'efresh   :  Indicates that a device in a RAID array has suffered
                           a failure and the kernel regards it as failed -
                           even though LVM can read the device label and
                           considers the device to be ok.  The LV should be
                           'r'efreshed to notify the kernel that the device is
                           now available, or the device should be 'r'eplaced
                           if it is suspected of failing.
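
The health indicators described above can be decoded mechanically. The following is a minimal sketch, not part of the commit. The function name describe_lv_health is hypothetical, and it assumes the health indicator is the last character of the lv_attr string (as comment 1 of this report states) and that '-' is shown when no health flag is set. Other lvm2 versions may lay the field out differently.

```shell
# Decode the health character of an lv_attr string.
# Assumption: the health indicator is the final character of the field,
# and '-' means that no health flag is set.
describe_lv_health() {
    dh_attr=$1
    dh_health=${dh_attr#"${dh_attr%?}"}   # keep only the last character
    case $dh_health in
        p) echo "partial" ;;
        m) echo "mismatches" ;;
        r) echo "refresh needed" ;;
        -) echo "ok" ;;
        *) echo "unknown" ;;
    esac
}
```

A possible invocation would be: describe_lv_health "$(lvs --noheadings -o lv_attr vg/lv | tr -d ' ')"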

Comment 1 Jonathan Earl Brassow 2013-07-18 19:41:47 UTC
Testing procedure:

1) Testing proper 'lvs' output:
'lvs' has two new reportable fields (not printed by default): "syncaction" and "mismatches".  It must be possible to report these fields and have them be correct.  (This can be done while testing scrubbing functionality - i.e. #2 below)

Ex1> lvs -o syncaction --noheadings vg/lv
  Answer depends on current operation and can be:
     - idle:   All sync operations complete (doing nothing)
     - resync: Initializing an array or recovering after a machine failure
     - recover: Replacing a device in the array
     - check: Looking for array inconsistencies
     - repair: Looking for and repairing inconsistencies
An attempt should be made to validate these states are printed correctly.  During an initial sync, it should read 'resync'.  When finished, it should read 'idle'.  When replacing a device in the array, it should be 'recover'.  etc.
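
One way to script the state validation is sketched below. The helper names current_syncaction and expect_syncaction are hypothetical. The "syncaction" field name is the one used in this report, so it may need adjusting for the lvm2 version under test.

```shell
# Print the current sync action of an LV with the column padding removed.
# Assumption: the report field is named "syncaction", as in this bug report.
current_syncaction() {
    lvs --noheadings -o syncaction "$1" | tr -d '[:space:]'
}

# Succeed if the LV in $1 currently reports the sync action given in $2.
expect_syncaction() {
    es_actual=$(current_syncaction "$1")
    if [ "$es_actual" = "$2" ]; then
        return 0
    fi
    echo "expected syncaction '$2' on $1, got '$es_actual'" >&2
    return 1
}
```

For example, "expect_syncaction vg/lv resync" could be run during the initial sync and "expect_syncaction vg/lv idle" once it has finished.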

Ex2> lvs -o mismatches,attr --noheadings vg/lv
  Should be '0' at all times unless a "check" has been performed and discrepancies have been found in the array.  The last character in the attribute field should also read 'm' if there are mismatches after a "check" has been run.  (Note that in the case of a device failure or device transient failure, the 'p'artial and 'r'eplace/'r'efresh flags take precedence over the 'm'ismatches flag.)
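
The relationship between the two columns can be checked with a small helper. This is only a sketch: check_mismatch_flags is a hypothetical name, and it assumes one line of output with the mismatch count first and the attribute string second, with the health flag as the last attribute character.

```shell
# Validate one line of 'lvs -o mismatches,attr --noheadings' output.
# Returns 0 if consistent, 1 if mismatches are reported without the 'm'
# flag, and 2 if the line cannot be parsed.
check_mismatch_flags() {
    set -- $1                     # word splitting drops the column padding
    cm_count=$1
    cm_attr=$2
    case $cm_count in
        ''|*[!0-9]*) echo "unparseable mismatch count" >&2; return 2 ;;
    esac
    cm_health=${cm_attr#"${cm_attr%?}"}   # last attribute character
    case $cm_health in
        p|r) return 0 ;;          # failure flags take precedence over 'm'
    esac
    if [ "$cm_count" -gt 0 ] && [ "$cm_health" != m ]; then
        echo "mismatches=$cm_count but health flag is '$cm_health'" >&2
        return 1
    fi
    return 0
}
```

A possible invocation would be: check_mismatch_flags "$(lvs -o mismatches,attr --noheadings vg/lv)"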

2) Testing correctness of scrubbing operations:
There are currently 2 scrubbing operations which can be performed: "check" and "repair".  To test them, do the following (or similar):

for all RAID types {
  for all devices in the array {

    Create RAID array
    Wait for sync
      - 'lvs -o syncaction' should be "resync"
      - 'lvs -o sync_percent' should grow to 100%
    Perform "check" (lvchange --syncaction check vg/lv)
    Wait for sync
      - 'lvs -o syncaction' should be "check"
      - 'lvs -o sync_percent' should grow to 100%
    'lvs -o mismatches' should be 0

    Deactivate RAID array
    Write crap to $device (inner for loop)
      - if writing to PV directly, be sure to skip over LVM label, etc
      - lvm2/test/shell/lvchange-raid.sh has example of how to do this
    Activate RAID array

    Perform "check" (lvchange --syncaction check vg/lv)
    Wait for sync
      - 'lvs -o syncaction' should be "check"
      - 'lvs -o sync_percent' should grow to 100%
    'lvs -o mismatches' should be NON-ZERO

    Perform "repair" (lvchange --syncaction repair vg/lv)
    Wait for sync
      - 'lvs -o syncaction' should be "repair"
      - 'lvs -o sync_percent' should grow to 100%
    'lvs -o mismatches' should be 0

    Perform "check" (lvchange --syncaction check vg/lv)
    Wait for sync
      - 'lvs -o syncaction' should be "check"
      - 'lvs -o sync_percent' should grow to 100%
    'lvs -o mismatches' should be 0
  }
}
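
The "Perform ... / Wait for sync / check mismatches" steps repeat several times in the procedure above, so they could be wrapped in helpers along these lines. This is a sketch under assumptions: wait_for_idle, scrub_and_count, the POLL_INTERVAL variable and the default of 600 polls are all hypothetical, and the field names are the ones used in this report. It also does not guard against the short window in which the LV may still report "idle" right after the operation is requested.

```shell
# Poll until the LV reports "idle", i.e. the running sync operation is done.
# $1 = vg/lv, $2 = maximum number of polls (hypothetical default: 600).
wait_for_idle() {
    wi_lv=$1
    wi_tries=${2:-600}
    while [ "$wi_tries" -gt 0 ]; do
        wi_state=$(lvs --noheadings -o syncaction "$wi_lv" | tr -d '[:space:]')
        [ "$wi_state" = idle ] && return 0
        wi_tries=$((wi_tries - 1))
        sleep "${POLL_INTERVAL:-1}"
    done
    echo "timed out waiting for $wi_lv to become idle" >&2
    return 1
}

# Start one scrubbing operation ($2 = check or repair) on the LV in $1,
# wait for it to finish, then print the resulting mismatch count.
scrub_and_count() {
    lvchange --syncaction "$2" "$1" || return 1
    wait_for_idle "$1" || return 1
    lvs --noheadings -o mismatches "$1" | tr -d '[:space:]'
}
```

With these, the test after corrupting a device might read: [ "$(scrub_and_count vg/lv check)" -gt 0 ], followed by a "repair" and a final "check" that is expected to print 0.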

3) Other sanity checks:
  - You must not be able to start a "check"/"repair" while another sync operation is happening.
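
    A sketch of how the first check might be scripted follows. The helper
    name assert_scrub_rejected is hypothetical, and the sketch assumes that
    lvchange reports the refusal through a non-zero exit status.

```shell
# Try to start a scrub ($2 = check or repair) on the LV in $1 while another
# sync operation is running. Returns 0 if lvchange refuses, 1 if it accepts.
# Assumption: a refusal is signalled by a non-zero exit status.
assert_scrub_rejected() {
    if lvchange --syncaction "$2" "$1" >/dev/null 2>&1; then
        echo "unexpected: '$2' started on $1 during another sync operation" >&2
        return 1
    fi
    return 0
}
```

    This would be called while 'lvs -o syncaction' still reports "resync",
    for example: assert_scrub_rejected vg/lv check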
  - If throttling is available, it should function on "check" and "repair".  (https://bugzilla.redhat.com/show_bug.cgi?id=969171#c3 is the RHEL6 test suggestions for throttling.)

Comment 3 Jonathan Earl Brassow 2013-07-19 15:06:41 UTC
In addition to the commit in comment 0, there is a fix for a segfault that must also go in:

commit 4eea66019157abd992c8802564b675fd97420c01
Author: Jonathan Brassow <jbrassow>
Date:   Fri Jul 19 10:01:48 2013 -0500

    RAID: Fix segfault when reporting raid_syncaction field on older kernel
    
    The status printed for dm-raid targets on older kernels does not include
    the syncaction field.  This is handled by dev_manager_raid_status() just
    fine by populating the raid status structure with NULL for that field.
    However, lv_raid_sync_action() does not properly handle that field being
    NULL.  So, check for it and return 0 if it is NULL.

Comment 6 errata-xmlrpc 2013-11-21 23:25:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1704.html