Bug 1089369 - RAID1 not successfully repaired upon failure when lvmetad is running.
Summary: RAID1 not successfully repaired upon failure when lvmetad is running.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: lvm2
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Assignee: Petr Rockai
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On: 1085553
Blocks:
 
Reported: 2014-04-18 14:17 UTC by Nenad Peric
Modified: 2021-09-03 12:40 UTC (History)
8 users

Fixed In Version: lvm2-2.02.115-1.el7
Doc Type: Bug Fix
Doc Text:
Cause: When using lvmetad, dmeventd could see metadata that was not up to date at the time of a RAID volume repair.
Consequence: The repair would not proceed, because based on the outdated information the RAID volume appeared healthy.
Fix: The repair code now forces a refresh of metadata for the PVs that host the RAID volume.
Result: Automatic RAID volume repair using dmeventd and manual repair using lvconvert --repair now work as expected with or without lvmetad enabled.
Clone Of:
Environment:
Last Closed: 2015-03-05 13:08:21 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2015:0513 0 normal SHIPPED_LIVE lvm2 bug fix and enhancement update 2015-03-05 16:14:41 UTC

Description Nenad Peric 2014-04-18 14:17:23 UTC
Description of problem:

If a device which holds one leg of a mirror fails in such a way that it returns I/O errors, the RAID does not get repaired (the fault policy is set to "allocate").

/var/log/messages claims that the raid1 has been repaired; however, this is not true.


Version-Release number of selected component (if applicable):


How reproducible:

Every time

Steps to Reproduce:

Create a RAID volume and make one of its devices fail with I/O errors (the device should still be present in the system but return errors on access). An easy way is to unmap the iSCSI mapping on the storage server.
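When no iSCSI storage is at hand, a comparable failure mode (device still present, but every I/O returns an error) can be simulated with the device-mapper "error" target. A rough sketch, assuming the PV sits on a dm device created for testing; the device, VG, and LV names below are examples, not taken from this report, and the commands need root:

```shell
# Name of the dm device backing the PV we want to "fail" (example name).
PV_DM=testpv
# Device size in 512-byte sectors, needed for the replacement table.
SIZE=$(blockdev --getsz "/dev/mapper/$PV_DM")

# Swap the device's table for the 'error' target: the node stays
# present in the system, but every read/write now returns EIO.
dmsetup suspend "$PV_DM"
dmsetup load "$PV_DM" --table "0 $SIZE error"
dmsetup resume "$PV_DM"

# Generate I/O on the RAID LV so dmeventd actually sees the failure.
dd if=/dev/vg/raid of=/dev/null bs=1M count=4 iflag=direct
```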


Actual results:

The failure is detected; these are the messages from the log:


Apr 18 16:04:55 bucek-03 lvm[3453]: Device #0 of raid1 array, vg-raid, has failed.
Apr 18 16:04:55 bucek-03 lvm[3453]: /dev/sdb1: read failed after 0 of 1024 at 99994566656: Input/output error
Apr 18 16:04:55 bucek-03 lvm[3453]: /dev/sdb1: read failed after 0 of 1024 at 99994685440: Input/output error
Apr 18 16:04:55 bucek-03 lvm[3453]: /dev/sdb1: read failed after 0 of 1024 at 0: Input/output error
Apr 18 16:04:55 bucek-03 lvm[3453]: /dev/sdb1: read failed after 0 of 1024 at 4096: Input/output error
Apr 18 16:04:55 bucek-03 lvm[3453]: Faulty devices in vg/raid successfully replaced.


However, this is not true:
(after pvscan --cache)

[root@bucek-03 ~]# lvs -a -o+devices
  PV QL855h-zBg4-ZO1A-Vatp-XhUF-gU6O-8G5h8j not recognised. Is the device missing?
  PV QL855h-zBg4-ZO1A-Vatp-XhUF-gU6O-8G5h8j not recognised. Is the device missing?
  LV              VG            Attr       LSize   Pool Origin Data%  Move Log Cpy%Sync Convert Devices                          
  home            rhel_bucek-03 -wi-ao---- 224.88g                                              /dev/sda2(1024)                  
  root            rhel_bucek-03 -wi-ao----  50.00g                                              /dev/sda2(58592)                 
  swap            rhel_bucek-03 -wi-ao----   4.00g                                              /dev/sda2(0)                     
  raid            vg            rwi-a-r-p-   2.00g                               100.00         raid_rimage_0(0),raid_rimage_1(0)
  [raid_rimage_0] vg            iwi-aor-p-   2.00g                                              unknown device(1)                
  [raid_rimage_1] vg            iwi-aor---   2.00g                                              /dev/sdc1(1)                     
  [raid_rmeta_0]  vg            ewi-aor-p-   4.00m                                              unknown device(0)                
  [raid_rmeta_1]  vg            ewi-aor---   4.00m                                              /dev/sdc1(0)              


The raid LV is marked as partial, even though a replacement device should have been allocated; there are five unused PVs in that VG. 
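The "partial" state can be read straight off the Attr column above: per lvm(8), the ninth character of the attribute string is 'p' when the LV is missing one or more of its PVs. A small sketch that checks exactly that bit (plain string handling, no LVM needed):

```shell
#!/bin/sh
# Return success if an lvs "Attr" string marks the LV as (p)artial.
# Per lvm(8), the partial/health flag is the 9th character of the field.
is_partial() {
    attr=$1
    [ "$(printf '%s' "$attr" | cut -c9)" = "p" ]
}

# Attr strings taken from the lvs output above:
is_partial "rwi-a-r-p-" && echo "raid: partial"            # -> partial
is_partial "iwi-aor---" || echo "raid_rimage_1: healthy"   # -> healthy
```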

Manually running lvconvert --repair does fix it, but that means the automatic raid_fault_policy is not honored when lvmetad is enabled and running.
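The manual workaround amounts to refreshing lvmetad's view of the metadata and then repairing. A sketch using the vg/raid names from this report (requires root and the failed state above):

```shell
# Force lvmetad to re-read on-disk metadata so the failure is visible.
pvscan --cache

# Replace the failed leg from spare PVs in the VG without prompting.
lvconvert --repair -y vg/raid

# Confirm that no leg is left on "unknown device" anymore.
lvs -a -o +devices vg
```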

Expected results:

That the RAID repair finishes successfully (and automatically), based on the policies set in lvm.conf, provided enough PVs are available in the VG.
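For reference, the policy in question lives in the activation section of lvm.conf; a minimal excerpt (values per lvm.conf(5)):

```
# /etc/lvm/lvm.conf (excerpt)
activation {
    # "warn"     = only log the device failure
    # "allocate" = automatically replace failed RAID legs
    #              using free PVs in the same VG
    raid_fault_policy = "allocate"
}
```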

Comment 1 Nenad Peric 2014-04-18 14:18:25 UTC
Forgot to write versions:

lvm2-2.02.105-14.el7.x86_64
device-mapper-1.02.84-14.el7.x86_64

Comment 3 Jonathan Earl Brassow 2014-04-22 19:22:29 UTC
This is a duplicate of bug 1085553, but I'd rather allow this bug to stay open and dependent on 1085553, since the duplication is not immediately obvious.

Comment 4 RHEL Program Management 2014-04-30 05:47:54 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 5 Petr Rockai 2014-07-29 11:48:31 UTC
Should be fixed along with bug 1085553 and for same reason as bug 892991.

Comment 7 Nenad Peric 2015-01-20 15:48:54 UTC
This was tested with a scratch build and works there. 

Jan 20 16:39:10 tardis-03 lvm[2085]: Monitoring RAID device vg-raid_lv for events.
Jan 20 16:39:10 tardis-03 lvm[2085]: Monitoring mirror device vg-mirror_lv for events.
Jan 20 16:39:10 tardis-03 lvm: 2 logical volume(s) in volume group "vg" now active
Jan 20 16:39:10 tardis-03 lvm[2085]: vg-mirror_lv is now in-sync.
Jan 20 16:42:15 tardis-03 lvm[2085]: Device #1 of raid1 array, vg-raid_lv, has failed.
Jan 20 16:42:15 tardis-03 lvm[2085]: /dev/sdf1: read failed after 0 of 2048 at 0: Input/output error
Jan 20 16:42:15 tardis-03 lvm[2085]: No PV label found on /dev/sdf1.
Jan 20 16:42:15 tardis-03 lvm[2085]: WARNING: Device for PV Y6H7MU-ZVl5-nztA-Xlne-eANW-BgVQ-lM4KlU not found or rejected by a filter.
Jan 20 16:42:23 tardis-03 lvm[2085]: Monitoring RAID device vg-raid_lv for events.
Jan 20 16:42:32 tardis-03 lvm[2085]: Monitoring RAID device vg-raid_lv for events.
Jan 20 16:42:32 tardis-03 lvm[2085]: Faulty devices in vg/raid_lv successfully replaced.
Jan 20 16:42:32 tardis-03 lvm[2085]: raid1 array, vg-raid_lv, is not in-sync.
Jan 20 16:42:32 tardis-03 lvm[2085]: raid1 array, vg-raid_lv, is not in-sync.
Jan 20 16:42:34 tardis-03 lvm[2085]: device-mapper: waitevent ioctl on  failed: Interrupted system call
Jan 20 16:42:34 tardis-03 lvm[2085]: No longer monitoring RAID device vg-raid_lv for events.
Jan 20 16:42:36 tardis-03 lvm[2085]: device-mapper: waitevent ioctl on  failed: Interrupted system call
Jan 20 16:42:36 tardis-03 lvm[2085]: No longer monitoring RAID device vg-raid_lv for events.
Jan 20 16:42:58 tardis-03 lvm[2085]: raid1 array, vg-raid_lv, is now in-sync.


Will open a separate bug for these device-mapper errors, since they appear after any RAID/mirror sync has completed. 

Marking this one VERIFIED with:

3.10.0-223.el7.x86_64

lvm2-2.02.114-6.el7    BUILT: Tue Jan 20 14:49:01 CET 2015
lvm2-libs-2.02.114-6.el7    BUILT: Tue Jan 20 14:49:01 CET 2015
lvm2-cluster-2.02.114-6.el7    BUILT: Tue Jan 20 14:49:01 CET 2015
device-mapper-1.02.92-6.el7    BUILT: Tue Jan 20 14:49:01 CET 2015
device-mapper-libs-1.02.92-6.el7    BUILT: Tue Jan 20 14:49:01 CET 2015
device-mapper-event-1.02.92-6.el7    BUILT: Tue Jan 20 14:49:01 CET 2015
device-mapper-event-libs-1.02.92-6.el7    BUILT: Tue Jan 20 14:49:01 CET 2015
device-mapper-persistent-data-0.4.1-2.el7    BUILT: Wed Nov 12 19:39:46 CET 2014
cmirror-2.02.114-6.el7    BUILT: Tue Jan 20 14:49:01 CET 2015

Comment 10 errata-xmlrpc 2015-03-05 13:08:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0513.html

