Bug 1089369 - RAID1 not successfully repaired upon failure when lvmetad is running.
Summary: RAID1 not successfully repaired upon failure when lvmetad is running.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: lvm2
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Assignee: Petr Rockai
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On: 1085553
Blocks:
 
Reported: 2014-04-18 14:17 UTC by Nenad Peric
Modified: 2021-09-03 12:40 UTC (History)
8 users

Fixed In Version: lvm2-2.02.115-1.el7
Doc Type: Bug Fix
Doc Text:
Cause: When using lvmetad, dmeventd could see metadata that was not up to date at the time of a RAID volume repair.
Consequence: The repair would not proceed, because based on the outdated information the RAID volume appeared healthy.
Fix: The repair code now forces a refresh of metadata for the PVs that host the RAID volume.
Result: Automatic RAID volume repair using dmeventd and manual repair using lvconvert --repair now work as expected with or without lvmetad enabled.
Clone Of:
Environment:
Last Closed: 2015-03-05 13:08:21 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2015:0513 0 normal SHIPPED_LIVE lvm2 bug fix and enhancement update 2015-03-05 16:14:41 UTC

Description Nenad Peric 2014-04-18 14:17:23 UTC
Description of problem:

If a device which holds one leg of a mirror fails in such a way that it returns I/O errors, the RAID does not get repaired (the fault policy is set to "allocate").

/var/log/messages claims that the raid1 has been repaired; however, this is not true.


Version-Release number of selected component (if applicable):


How reproducible:

Every time

Steps to Reproduce:

Create a RAID volume and make one of its devices fail with I/O errors (the device should still be present in the system but return errors on access). An easy way is to unmap the iSCSI mapping on the storage server.
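When no iSCSI storage is at hand, a comparable failure mode (device still present, but every I/O returns an error) can be simulated with the device-mapper "error" target. A rough sketch, assuming the PV sits on a dm device created for testing; the device, VG, and LV names below are examples, not taken from this report, and the commands need root:

```shell
# Name of the dm device backing the PV we want to "fail" (example name).
PV_DM=testpv
# Device size in 512-byte sectors, needed for the replacement table.
SIZE=$(blockdev --getsz "/dev/mapper/$PV_DM")

# Swap the device's table for the 'error' target: the node stays
# present in the system, but every read/write now returns EIO.
dmsetup suspend "$PV_DM"
dmsetup load "$PV_DM" --table "0 $SIZE error"
dmsetup resume "$PV_DM"

# Generate I/O on the RAID LV so dmeventd actually sees the failure.
dd if=/dev/vg/raid of=/dev/null bs=1M count=4 iflag=direct
```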


Actual results:

The failure is detected; these are the messages from the log:


Apr 18 16:04:55 bucek-03 lvm[3453]: Device #0 of raid1 array, vg-raid, has failed.
Apr 18 16:04:55 bucek-03 lvm[3453]: /dev/sdb1: read failed after 0 of 1024 at 99994566656: Input/output error
Apr 18 16:04:55 bucek-03 lvm[3453]: /dev/sdb1: read failed after 0 of 1024 at 99994685440: Input/output error
Apr 18 16:04:55 bucek-03 lvm[3453]: /dev/sdb1: read failed after 0 of 1024 at 0: Input/output error
Apr 18 16:04:55 bucek-03 lvm[3453]: /dev/sdb1: read failed after 0 of 1024 at 4096: Input/output error
Apr 18 16:04:55 bucek-03 lvm[3453]: Faulty devices in vg/raid successfully replaced.


However, this is not true:
(after pvscan --cache)

[root@bucek-03 ~]# lvs -a -o+devices
  PV QL855h-zBg4-ZO1A-Vatp-XhUF-gU6O-8G5h8j not recognised. Is the device missing?
  PV QL855h-zBg4-ZO1A-Vatp-XhUF-gU6O-8G5h8j not recognised. Is the device missing?
  LV              VG            Attr       LSize   Pool Origin Data%  Move Log Cpy%Sync Convert Devices                          
  home            rhel_bucek-03 -wi-ao---- 224.88g                                              /dev/sda2(1024)                  
  root            rhel_bucek-03 -wi-ao----  50.00g                                              /dev/sda2(58592)                 
  swap            rhel_bucek-03 -wi-ao----   4.00g                                              /dev/sda2(0)                     
  raid            vg            rwi-a-r-p-   2.00g                               100.00         raid_rimage_0(0),raid_rimage_1(0)
  [raid_rimage_0] vg            iwi-aor-p-   2.00g                                              unknown device(1)                
  [raid_rimage_1] vg            iwi-aor---   2.00g                                              /dev/sdc1(1)                     
  [raid_rmeta_0]  vg            ewi-aor-p-   4.00m                                              unknown device(0)                
  [raid_rmeta_1]  vg            ewi-aor---   4.00m                                              /dev/sdc1(0)              


The raid LV is marked as partial, even though a replacement device should have been allocated; there are five unused PVs in that VG. 
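The "partial" state can be read straight off the Attr column above: per lvm(8), the ninth character of the attribute string is 'p' when the LV is missing one or more of its PVs. A small sketch that checks exactly that bit (plain string handling, no LVM needed):

```shell
#!/bin/sh
# Return success if an lvs "Attr" string marks the LV as (p)artial.
# Per lvm(8), the partial/health flag is the 9th character of the field.
is_partial() {
    attr=$1
    [ "$(printf '%s' "$attr" | cut -c9)" = "p" ]
}

# Attr strings taken from the lvs output above:
is_partial "rwi-a-r-p-" && echo "raid: partial"            # -> partial
is_partial "iwi-aor---" || echo "raid_rimage_1: healthy"   # -> healthy
```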

Manually running lvconvert --repair does fix it, but that means the automatic raid_fault_policy is not honored when lvmetad is enabled and running.
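The manual workaround amounts to refreshing lvmetad's view of the metadata and then repairing. A sketch using the vg/raid names from this report (requires root and the failed state above):

```shell
# Force lvmetad to re-read on-disk metadata so the failure is visible.
pvscan --cache

# Replace the failed leg from spare PVs in the VG without prompting.
lvconvert --repair -y vg/raid

# Confirm that no leg is left on "unknown device" anymore.
lvs -a -o +devices vg
```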

Expected results:

That the RAID repair finishes successfully (and automatically), based on the policies set in lvm.conf, provided enough PVs are available in the VG.
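For reference, the policy in question lives in the activation section of lvm.conf; a minimal excerpt (values per lvm.conf(5)):

```
# /etc/lvm/lvm.conf (excerpt)
activation {
    # "warn"     = only log the device failure
    # "allocate" = automatically replace failed RAID legs
    #              using free PVs in the same VG
    raid_fault_policy = "allocate"
}
```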

Comment 1 Nenad Peric 2014-04-18 14:18:25 UTC
Forgot to write versions:

lvm2-2.02.105-14.el7.x86_64
device-mapper-1.02.84-14.el7.x86_64

Comment 3 Jonathan Earl Brassow 2014-04-22 19:22:29 UTC
This is a duplicate of bug 1085553, but I'd rather allow this bug to stay open and dependent on 1085553, since the duplication is not immediately obvious.

Comment 4 RHEL Program Management 2014-04-30 05:47:54 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 5 Petr Rockai 2014-07-29 11:48:31 UTC
Should be fixed along with bug 1085553 and for same reason as bug 892991.

Comment 7 Nenad Peric 2015-01-20 15:48:54 UTC
This was tested with a scratch build and works there. 

Jan 20 16:39:10 tardis-03 lvm[2085]: Monitoring RAID device vg-raid_lv for events.
Jan 20 16:39:10 tardis-03 lvm[2085]: Monitoring mirror device vg-mirror_lv for events.
Jan 20 16:39:10 tardis-03 lvm: 2 logical volume(s) in volume group "vg" now active
Jan 20 16:39:10 tardis-03 lvm[2085]: vg-mirror_lv is now in-sync.
Jan 20 16:42:15 tardis-03 lvm[2085]: Device #1 of raid1 array, vg-raid_lv, has failed.
Jan 20 16:42:15 tardis-03 lvm[2085]: /dev/sdf1: read failed after 0 of 2048 at 0: Input/output error
Jan 20 16:42:15 tardis-03 lvm[2085]: No PV label found on /dev/sdf1.
Jan 20 16:42:15 tardis-03 lvm[2085]: WARNING: Device for PV Y6H7MU-ZVl5-nztA-Xlne-eANW-BgVQ-lM4KlU not found or rejected by a filter.
Jan 20 16:42:23 tardis-03 lvm[2085]: Monitoring RAID device vg-raid_lv for events.
Jan 20 16:42:32 tardis-03 lvm[2085]: Monitoring RAID device vg-raid_lv for events.
Jan 20 16:42:32 tardis-03 lvm[2085]: Faulty devices in vg/raid_lv successfully replaced.
Jan 20 16:42:32 tardis-03 lvm[2085]: raid1 array, vg-raid_lv, is not in-sync.
Jan 20 16:42:32 tardis-03 lvm[2085]: raid1 array, vg-raid_lv, is not in-sync.
Jan 20 16:42:34 tardis-03 lvm[2085]: device-mapper: waitevent ioctl on  failed: Interrupted system call
Jan 20 16:42:34 tardis-03 lvm[2085]: No longer monitoring RAID device vg-raid_lv for events.
Jan 20 16:42:36 tardis-03 lvm[2085]: device-mapper: waitevent ioctl on  failed: Interrupted system call
Jan 20 16:42:36 tardis-03 lvm[2085]: No longer monitoring RAID device vg-raid_lv for events.
Jan 20 16:42:58 tardis-03 lvm[2085]: raid1 array, vg-raid_lv, is now in-sync.


Will open a separate bug for these device-mapper errors, since they appear after any RAID/mirror sync has completed. 

Marking this one VERIFIED with:

3.10.0-223.el7.x86_64

lvm2-2.02.114-6.el7    BUILT: Tue Jan 20 14:49:01 CET 2015
lvm2-libs-2.02.114-6.el7    BUILT: Tue Jan 20 14:49:01 CET 2015
lvm2-cluster-2.02.114-6.el7    BUILT: Tue Jan 20 14:49:01 CET 2015
device-mapper-1.02.92-6.el7    BUILT: Tue Jan 20 14:49:01 CET 2015
device-mapper-libs-1.02.92-6.el7    BUILT: Tue Jan 20 14:49:01 CET 2015
device-mapper-event-1.02.92-6.el7    BUILT: Tue Jan 20 14:49:01 CET 2015
device-mapper-event-libs-1.02.92-6.el7    BUILT: Tue Jan 20 14:49:01 CET 2015
device-mapper-persistent-data-0.4.1-2.el7    BUILT: Wed Nov 12 19:39:46 CET 2014
cmirror-2.02.114-6.el7    BUILT: Tue Jan 20 14:49:01 CET 2015

Comment 10 errata-xmlrpc 2015-03-05 13:08:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0513.html

