Bug 1446754 - all primary raid1 failures whether in sync or not now require user intervention
Summary: all primary raid1 failures whether in sync or not now require user intervention
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: lvm2
Version: 7.4
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Heinz Mauelshagen
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On: 1311765
Blocks:
 
Reported: 2017-04-28 19:17 UTC by Corey Marthaler
Modified: 2021-09-03 12:37 UTC
CC List: 9 users

Fixed In Version: lvm2-2.02.171-4.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-01 21:52:19 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:2222 0 normal SHIPPED_LIVE lvm2 bug fix and enhancement update 2017-08-01 18:42:41 UTC

Description Corey Marthaler 2017-04-28 19:17:03 UTC
Description of problem:
Feel free to close this as "NOTABUG".

This is basically a CYA bug for QA, since it's a change in behavior from all prior rhel7 releases.

All automatic repair attempts in rhel7.4 triggered by the "allocate" raid_fault_policy are now expected to fail, as they were in rhel6.9 (see the lvm.conf sketch after the links below).
See related 6.9 comments:
https://bugzilla.redhat.com/show_bug.cgi?id=1311765#c16
https://bugzilla.redhat.com/show_bug.cgi?id=1311765#c26
https://bugzilla.redhat.com/show_bug.cgi?id=1397589#c4
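
For context, the policy in question lives in the activation section of lvm.conf; a minimal sketch of the relevant stanza (surrounding settings omitted) would be:

  activation {
      # "warn" only logs the leg failure and waits for the admin;
      # "allocate" tells dmeventd to try to replace the failed image
      # with free extents from another PV in the volume group
      raid_fault_policy = "allocate"
  }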



Apr 28 14:09:45 host-116 lvm[28051]: Couldn't find device with uuid 8RpgKN-YbFE-XpIi-CSrI-fDdF-TaIA-XtwwSS.
Apr 28 14:09:45 host-116 lvm[28051]: WARNING: Couldn't find all devices for LV black_bird/synced_primary_raid1_2legs_1_rimage_0 while checking used and assumed devices.
Apr 28 14:09:45 host-116 lvm[28051]: WARNING: Couldn't find all devices for LV black_bird/synced_primary_raid1_2legs_1_rmeta_0 while checking used and assumed devices.
Apr 28 14:09:45 host-116 lvm[28051]: Unable to extract primary RAID image while RAID array is not in-sync (use --force option to replace).
Apr 28 14:09:45 host-116 lvm[28051]: Failed to remove the specified images from black_bird/synced_primary_raid1_2legs_1.
Apr 28 14:09:45 host-116 lvm[28051]: Failed to replace faulty devices in black_bird/synced_primary_raid1_2legs_1.
Apr 28 14:09:45 host-116 lvm[28051]: Repair of RAID device black_bird-synced_primary_raid1_2legs_1 failed.
Apr 28 14:09:45 host-116 lvm[28051]: Failed to process event for black_bird-synced_primary_raid1_2legs_1.



Version-Release number of selected component (if applicable):
3.10.0-651.el7.x86_64

lvm2-2.02.170-2.el7    BUILT: Thu Apr 13 14:37:43 CDT 2017
lvm2-libs-2.02.170-2.el7    BUILT: Thu Apr 13 14:37:43 CDT 2017
lvm2-cluster-2.02.170-2.el7    BUILT: Thu Apr 13 14:37:43 CDT 2017
device-mapper-1.02.139-2.el7    BUILT: Thu Apr 13 14:37:43 CDT 2017
device-mapper-libs-1.02.139-2.el7    BUILT: Thu Apr 13 14:37:43 CDT 2017
device-mapper-event-1.02.139-2.el7    BUILT: Thu Apr 13 14:37:43 CDT 2017
device-mapper-event-libs-1.02.139-2.el7    BUILT: Thu Apr 13 14:37:43 CDT 2017
device-mapper-persistent-data-0.7.0-0.1.rc6.el7    BUILT: Mon Mar 27 10:15:46 CDT 2017

Comment 4 Jonathan Earl Brassow 2017-05-19 13:20:24 UTC
This will have to be fixed.  The key part of the summary is "whether in sync or not".  The rationale for not automatically repairing a primary leg while it is still syncing is clear, but repair should definitely be allowed if the RAID1 is in sync at the time the failure happens.

[root@bp-01 ~]# devices
  WARNING: Not using lvmetad because a repair command was run.
  /dev/sdb1: read failed after 0 of 4096 at 898387345408: Input/output error
  /dev/sdb1: read failed after 0 of 4096 at 898387402752: Input/output error
  /dev/sdb1: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdb1: read failed after 0 of 4096 at 4096: Input/output error
  Couldn't find device with uuid dmVM0n-K1JI-wJ71-7Jto-o8r3-5IK4-QlsDke.
  WARNING: Couldn't find all devices for LV vg/raid1_rimage_0 while checking used and assumed devices.
  WARNING: Couldn't find all devices for LV vg/raid1_rmeta_0 while checking used and assumed devices.
  LV               Attr       Cpy%Sync Devices
  home             -wi-ao----          /dev/sda2(2016)
  root             -wi-ao----          /dev/sda2(106177)
  swap             -wi-ao----          /dev/sda2(0)
  raid1            rwi-a-r-p- 100.00   raid1_rimage_0(0),raid1_rimage_1(0)
  [raid1_rimage_0] Iwi-aor-p-          [unknown](1)
  [raid1_rimage_1] iwi-aor---          /dev/sdc1(1)
  [raid1_rmeta_0]  ewi-aor-p-          [unknown](0)
  [raid1_rmeta_1]  ewi-aor---          /dev/sdc1(0)
[root@bp-01 ~]# lvconvert --repair vg/raid1
  WARNING: Disabling lvmetad cache for repair command.
  WARNING: Not using lvmetad because of repair.
  /dev/sdb1: read failed after 0 of 4096 at 898387345408: Input/output error
  /dev/sdb1: read failed after 0 of 4096 at 898387402752: Input/output error
  /dev/sdb1: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdb1: read failed after 0 of 4096 at 4096: Input/output error
  Couldn't find device with uuid dmVM0n-K1JI-wJ71-7Jto-o8r3-5IK4-QlsDke.
  WARNING: Couldn't find all devices for LV vg/raid1_rimage_0 while checking used and assumed devices.
  WARNING: Couldn't find all devices for LV vg/raid1_rmeta_0 while checking used and assumed devices.
Attempt to replace failed RAID images (requires full device resync)? [y/n]: y
  Unable to extract primary RAID image while RAID array is not in-sync (use --force option to replace).
  Failed to remove the specified images from vg/raid1.
  Failed to replace faulty devices in vg/raid1.
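
For reference, a minimal sketch of the manual workaround the message above points at, reusing the VG/LV names from this session (whether forcing the extraction is appropriate depends on how far the array had synced before the failure):

  # Cpy%Sync of 100.00 in the 'devices' output above indicates the mirror had
  # finished synchronizing before /dev/sdb1 failed
  lvs -a -o name,attr,copy_percent,devices vg
  # repair by hand; --force overrides the "not in-sync" check this bug is about
  lvconvert --repair --force vg/raid1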

Comment 6 Jonathan Earl Brassow 2017-06-01 02:24:04 UTC
I've got a patch I'm testing that should fix things up.  Probably have it by tomorrow.

Comment 7 Jonathan Earl Brassow 2017-06-06 15:49:08 UTC
patches committed upstream:

commit 88e649628863e78b101c584c513053fc9461c24d
Author: Jonathan Brassow <jbrassow>
Date:   Tue Jun 6 10:43:12 2017 -0500

    lvconvert:  linear -> raid1 upconvert should cause "recover" not "resync"

* and *

commit acaf3a5d47fd65b2e385a516544f8e6ec8d89b2d
Author: Jonathan Brassow <jbrassow>
Date:   Tue Jun 6 10:43:49 2017 -0500

    lvconvert:  Don't require a 'force' option during RAID repair.
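
For context, the first commit concerns the initial synchronization started by a linear -> raid1 upconvert, i.e. something like the sketch below (VG/LV names are illustrative, not from this bug):

  # add a mirror leg to an existing linear LV; with the fix this starts a
  # "recover" of just the new leg rather than a full "resync" of the array
  lvconvert --type raid1 -m 1 vg/lv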

Comment 9 Corey Marthaler 2017-06-28 17:25:42 UTC
Marking verified with the latest rpms.

# already synced 
lvm[28754]: Faulty devices in black_bird/synced_primary_raid1_2legs_1 successfully replaced.

# not yet in sync (but also not a linear upconvert)
lvm[28754]: Faulty devices in black_bird/non_synced_primary_raid1_2legs_1 successfully replaced.

LVM can once again automatically repair failed raid1 primary devices when the raid fault policy is set to "allocate". That said, the caveat in bug 1446780 still exists, leaving the user unsure how many times the device(s) were actually repaired, or whether the array was believed to be in sync during the repair process.
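
A rough sketch of how the automatic-repair path can be confirmed on a test host (not the exact QA harness; log wording per the messages quoted above):

  # confirm dmeventd is configured to replace failed raid legs automatically
  lvmconfig activation/raid_fault_policy
  # after a primary leg fails, the repair result is logged via syslog
  grep "successfully replaced" /var/log/messages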


3.10.0-688.el7.x86_64
lvm2-2.02.171-7.el7    BUILT: Thu Jun 22 08:35:15 CDT 2017
lvm2-libs-2.02.171-7.el7    BUILT: Thu Jun 22 08:35:15 CDT 2017
lvm2-cluster-2.02.171-7.el7    BUILT: Thu Jun 22 08:35:15 CDT 2017
device-mapper-1.02.140-7.el7    BUILT: Thu Jun 22 08:35:15 CDT 2017
device-mapper-libs-1.02.140-7.el7    BUILT: Thu Jun 22 08:35:15 CDT 2017
device-mapper-event-1.02.140-7.el7    BUILT: Thu Jun 22 08:35:15 CDT 2017
device-mapper-event-libs-1.02.140-7.el7    BUILT: Thu Jun 22 08:35:15 CDT 2017
device-mapper-persistent-data-0.7.0-0.1.rc6.el7    BUILT: Mon Mar 27 10:15:46 CDT 2017

Comment 10 errata-xmlrpc 2017-08-01 21:52:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2222

