Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1446780

Summary:

automatic "allocate" fault policy raid repair is failing (Unable to replace devices while it is not in-sync)

Product:

Red Hat Enterprise Linux 7

Reporter:

Corey Marthaler <cmarthal>

Component:

lvm2

Assignee:

Heinz Mauelshagen <heinzm>

lvm2 sub component:

Mirroring and RAID

QA Contact:

cluster-qe <cluster-qe>

Status:

CLOSED ERRATA

Docs Contact:

Severity:

high

Priority:

unspecified

CC:

agk, cmarthal, heinzm, jbrassow, mcsontos, msnitzer, mthacker, prajnoha, prockai, zkabelac

Version:

7.4

Keywords:

Regression

Target Milestone:

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

lvm2-2.02.171-8.el7

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2017-08-01 21:52:19 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
log file during the failure of two raids sharing the same PV	none
test cases that continue to fail	none
scenarios still failing	none

Description Corey Marthaler 2017-04-28 21:26:52 UTC

Description of problem:
In each of these cases, the raids are verified to be in sync before the device failure is attempted. In fact, "is now in-sync" can be seen before, dmeventd even attempts to repair the raid volumes. The end result is these raids are eventually successfully replaced, you'll see "successfully replaced" multiple times. 

So the issues are:
1. In the past you would never see "Repair of RAID... failed" followed by "successfully replaced". It was always one or the other i believe, so our tests keyed off that when verifying if automatic repair worked or failed.

2. Are raids supposed to be going in and out of sync like that? Three "is now in-sync" messages reported for a raid that was already in sync to begin with.

3. Are things happening multiple times now that should only be happening once?




# Raid1 (synced non primary device failed)                                                                                                            
Apr 28 12:07:39 host-116 lvm[28051]: Faulty devices in black_bird/synced_random_raid1_2legs_1 successfully replaced.
Apr 28 12:07:39 host-116 lvm[28051]: raid1 array, black_bird-synced_random_raid1_2legs_1, is now in-sync.
Apr 28 12:07:39 host-116 lvm[28051]: WARNING: Disabling lvmetad cache for repair command.
Apr 28 12:07:39 host-116 lvm[28051]: WARNING: Not using lvmetad because of repair.
Apr 28 12:07:40 host-116 lvm[28051]: Couldn't find device with uuid NztTsr-rIOz-8MBc-nq0f-yCQB-GYdm-kZXTym.
Apr 28 12:07:40 host-116 lvm[28051]: Unable to replace devices in black_bird/synced_random_raid1_2legs_1 while it is not in-sync.
Apr 28 12:07:40 host-116 lvm[28051]: Failed to replace faulty devices in black_bird/synced_random_raid1_2legs_1.
Apr 28 12:07:40 host-116 lvm[28051]: Repair of RAID device black_bird-synced_random_raid1_2legs_1 failed.
Apr 28 12:07:40 host-116 lvm[28051]: Failed to process event for black_bird-synced_random_raid1_2legs_1.
Apr 28 12:07:40 host-116 lvm[28051]: raid1 array, black_bird-synced_random_raid1_2legs_1, is not in-sync.
Apr 28 12:07:43 host-116 kernel: md: mdX: recovery done.
Apr 28 12:07:43 host-116 lvm[28051]: raid1 array, black_bird-synced_random_raid1_2legs_1, is now in-sync.
Apr 28 12:07:43 host-116 lvm[28051]: WARNING: Disabling lvmetad cache for repair command.
Apr 28 12:07:43 host-116 lvm[28051]: WARNING: Not using lvmetad because of repair.
Apr 28 12:07:43 host-116 lvm[28051]: Couldn't find device with uuid NztTsr-rIOz-8MBc-nq0f-yCQB-GYdm-kZXTym.
Apr 28 12:07:43 host-116 lvm[28051]: black_bird/synced_random_raid1_2legs_1 does not contain devices specified to replace.
Apr 28 12:07:43 host-116 lvm[28051]: Faulty devices in black_bird/synced_random_raid1_2legs_1 successfully replaced.
Apr 28 12:07:43 host-116 lvm[28051]: raid1 array, black_bird-synced_random_raid1_2legs_1, is not in-sync.
Apr 28 12:07:43 host-116 lvm[28051]: raid1 array, black_bird-synced_random_raid1_2legs_1, is now in-sync.


# Raid 4 (synced non primary device failed)
Apr 28 10:59:38 host-116 kernel: md: recovery of RAID array mdX
Apr 28 10:59:38 host-116 lvm[28051]: Faulty devices in black_bird/synced_random_raid4_3legs_1 successfully replaced.
Apr 28 10:59:39 host-116 lvm[28051]: raid4 array, black_bird-synced_random_raid4_3legs_1, is now in-sync.
Apr 28 10:59:39 host-116 lvm[28051]: WARNING: Disabling lvmetad cache for repair command.
Apr 28 10:59:39 host-116 lvm[28051]: WARNING: Not using lvmetad because of repair.
Apr 28 10:59:39 host-116 lvm[28051]: Couldn't find device with uuid fVixHo-2w5w-ebg0-Ct8i-CX14-vuDo-rhPEcw.
Apr 28 10:59:39 host-116 lvm[28051]: Unable to replace devices in black_bird/synced_random_raid4_3legs_1 while it is not in-sync.
Apr 28 10:59:39 host-116 lvm[28051]: Failed to replace faulty devices in black_bird/synced_random_raid4_3legs_1.
Apr 28 10:59:39 host-116 lvm[28051]: Repair of RAID device black_bird-synced_random_raid4_3legs_1 failed.
Apr 28 10:59:39 host-116 lvm[28051]: Failed to process event for black_bird-synced_random_raid4_3legs_1.
Apr 28 10:59:39 host-116 lvm[28051]: raid4 array, black_bird-synced_random_raid4_3legs_1, is not in-sync.
Apr 28 10:59:41 host-116 kernel: md: mdX: recovery done.
Apr 28 10:59:41 host-116 lvm[28051]: raid4 array, black_bird-synced_random_raid4_3legs_1, is now in-sync.
Apr 28 10:59:41 host-116 lvm[28051]: WARNING: Disabling lvmetad cache for repair command.
Apr 28 10:59:41 host-116 lvm[28051]: WARNING: Not using lvmetad because of repair.
Apr 28 10:59:41 host-116 lvm[28051]: Couldn't find device with uuid fVixHo-2w5w-ebg0-Ct8i-CX14-vuDo-rhPEcw.
Apr 28 10:59:41 host-116 lvm[28051]: black_bird/synced_random_raid4_3legs_1 does not contain devices specified to replace.
Apr 28 10:59:41 host-116 lvm[28051]: Faulty devices in black_bird/synced_random_raid4_3legs_1 successfully replaced.
Apr 28 10:59:41 host-116 lvm[28051]: raid4 array, black_bird-synced_random_raid4_3legs_1, is not in-sync.
Apr 28 10:59:41 host-116 lvm[28051]: raid4 array, black_bird-synced_random_raid4_3legs_1, is now in-sync.



# Raid 5 (synced non primary device failed)
Apr 28 11:00:47 host-073 lvm[1582]: Faulty devices in black_bird/synced_random_raid5_3legs_1 successfully replaced.
Apr 28 11:00:48 host-073 lvm[1582]: raid5_ls array, black_bird-synced_random_raid5_3legs_1, is now in-sync.
Apr 28 11:00:48 host-073 lvm[1582]: WARNING: Disabling lvmetad cache for repair command.
Apr 28 11:00:48 host-073 lvm[1582]: WARNING: Not using lvmetad because of repair.
Apr 28 11:00:48 host-073 lvm[1582]: Couldn't find device with uuid RUK27A-U9HU-3ORn-4pc5-0kP4-HIv5-82AWno.
Apr 28 11:00:48 host-073 lvm[1582]: Unable to replace devices in black_bird/synced_random_raid5_3legs_1 while it is not in-sync.
Apr 28 11:00:48 host-073 lvm[1582]: Failed to replace faulty devices in black_bird/synced_random_raid5_3legs_1.
Apr 28 11:00:48 host-073 lvm[1582]: Repair of RAID device black_bird-synced_random_raid5_3legs_1 failed.
Apr 28 11:00:48 host-073 lvm[1582]: Failed to process event for black_bird-synced_random_raid5_3legs_1.
Apr 28 11:00:48 host-073 lvm[1582]: raid5_ls array, black_bird-synced_random_raid5_3legs_1, is not in-sync.
Apr 28 11:00:51 host-073 kernel: md: mdX: recovery done.
Apr 28 11:00:51 host-073 lvm[1582]: raid5_ls array, black_bird-synced_random_raid5_3legs_1, is now in-sync.
Apr 28 11:00:51 host-073 lvm[1582]: WARNING: Disabling lvmetad cache for repair command.
Apr 28 11:00:51 host-073 lvm[1582]: WARNING: Not using lvmetad because of repair.
Apr 28 11:00:51 host-073 lvm[1582]: Couldn't find device with uuid RUK27A-U9HU-3ORn-4pc5-0kP4-HIv5-82AWno.
Apr 28 11:00:51 host-073 lvm[1582]: black_bird/synced_random_raid5_3legs_1 does not contain devices specified to replace.
Apr 28 11:00:51 host-073 lvm[1582]: Faulty devices in black_bird/synced_random_raid5_3legs_1 successfully replaced.
Apr 28 11:00:51 host-073 lvm[1582]: raid5_ls array, black_bird-synced_random_raid5_3legs_1, is not in-sync.
Apr 28 11:00:51 host-073 lvm[1582]: raid5_ls array, black_bird-synced_random_raid5_3legs_1, is now in-sync.



# Raid 6 (synced non primary device failed)
Apr 28 11:32:49 host-073 lvm[1582]: Faulty devices in black_bird/synced_random_raid6_3legs_1 successfully replaced.
Apr 28 11:32:49 host-073 kernel: md: recovery of RAID array mdX
Apr 28 11:32:50 host-073 lvm[1582]: raid6_zr array, black_bird-synced_random_raid6_3legs_1, is now in-sync.
Apr 28 11:32:50 host-073 lvm[1582]: WARNING: Disabling lvmetad cache for repair command.
Apr 28 11:32:50 host-073 lvm[1582]: WARNING: Not using lvmetad because of repair.
Apr 28 11:32:50 host-073 lvm[1582]: Couldn't find device with uuid gHz2SG-ZW3p-Epg2-EncO-HVRI-sgti-yiOxQ7.
Apr 28 11:32:50 host-073 lvm[1582]: Unable to replace devices in black_bird/synced_random_raid6_3legs_1 while it is not in-sync.
Apr 28 11:32:50 host-073 lvm[1582]: Failed to replace faulty devices in black_bird/synced_random_raid6_3legs_1.
Apr 28 11:32:50 host-073 lvm[1582]: Repair of RAID device black_bird-synced_random_raid6_3legs_1 failed.
Apr 28 11:32:50 host-073 lvm[1582]: Failed to process event for black_bird-synced_random_raid6_3legs_1.
Apr 28 11:32:50 host-073 lvm[1582]: raid6_zr array, black_bird-synced_random_raid6_3legs_1, is not in-sync.
Apr 28 11:32:56 host-073 kernel: md: mdX: recovery done.
Apr 28 11:32:56 host-073 lvm[1582]: raid6_zr array, black_bird-synced_random_raid6_3legs_1, is now in-sync.
Apr 28 11:32:56 host-073 lvm[1582]: WARNING: Disabling lvmetad cache for repair command.
Apr 28 11:32:56 host-073 lvm[1582]: WARNING: Not using lvmetad because of repair.
Apr 28 11:32:56 host-073 lvm[1582]: Couldn't find device with uuid gHz2SG-ZW3p-Epg2-EncO-HVRI-sgti-yiOxQ7.
Apr 28 11:32:56 host-073 lvm[1582]: black_bird/synced_random_raid6_3legs_1 does not contain devices specified to replace.
Apr 28 11:32:56 host-073 lvm[1582]: Faulty devices in black_bird/synced_random_raid6_3legs_1 successfully replaced.
Apr 28 11:32:56 host-073 lvm[1582]: raid6_zr array, black_bird-synced_random_raid6_3legs_1, is not in-sync.
Apr 28 11:32:56 host-073 lvm[1582]: raid6_zr array, black_bird-synced_random_raid6_3legs_1, is now in-sync.



# Raid 10 (synced non primary device failed) 
Apr 28 11:19:58 host-073 kernel: md/raid10:mdX: active with 5 out of 6 devices
Apr 28 11:19:59 host-073 kernel: md: mdX: recovery interrupted.
Apr 28 11:19:59 host-073 kernel: md: recovery of RAID array mdX
Apr 28 11:19:59 host-073 lvm[1582]: Faulty devices in black_bird/synced_random_raid10_3legs_1 successfully replaced.
Apr 28 11:19:59 host-073 lvm[1582]: raid10 array, black_bird-synced_random_raid10_3legs_1, is now in-sync.
Apr 28 11:19:59 host-073 lvm[1582]: WARNING: Disabling lvmetad cache for repair command.
Apr 28 11:19:59 host-073 lvm[1582]: WARNING: Not using lvmetad because of repair.
Apr 28 11:20:00 host-073 lvm[1582]: Couldn't find device with uuid ecLpK1-qjBZ-72UE-2FS5-ny0Z-QVw2-obAOE1.
Apr 28 11:20:00 host-073 lvm[1582]: Unable to replace devices in black_bird/synced_random_raid10_3legs_1 while it is not in-sync.
Apr 28 11:20:00 host-073 lvm[1582]: Failed to replace faulty devices in black_bird/synced_random_raid10_3legs_1.
Apr 28 11:20:00 host-073 lvm[1582]: Repair of RAID device black_bird-synced_random_raid10_3legs_1 failed.
Apr 28 11:20:00 host-073 lvm[1582]: Failed to process event for black_bird-synced_random_raid10_3legs_1.
Apr 28 11:20:00 host-073 lvm[1582]: raid10 array, black_bird-synced_random_raid10_3legs_1, is not in-sync.
Apr 28 11:20:01 host-073 systemd: Started Session 10 of user root.
Apr 28 11:20:01 host-073 systemd: Starting Session 10 of user root.
Apr 28 11:20:02 host-073 kernel: md: mdX: recovery done.
Apr 28 11:20:02 host-073 lvm[1582]: raid10 array, black_bird-synced_random_raid10_3legs_1, is now in-sync.
Apr 28 11:20:02 host-073 lvm[1582]: WARNING: Disabling lvmetad cache for repair command.
Apr 28 11:20:02 host-073 lvm[1582]: WARNING: Not using lvmetad because of repair.
Apr 28 11:20:02 host-073 lvm[1582]: Couldn't find device with uuid ecLpK1-qjBZ-72UE-2FS5-ny0Z-QVw2-obAOE1.
Apr 28 11:20:02 host-073 lvm[1582]: black_bird/synced_random_raid10_3legs_1 does not contain devices specified to replace.
Apr 28 11:20:02 host-073 lvm[1582]: Faulty devices in black_bird/synced_random_raid10_3legs_1 successfully replaced.
Apr 28 11:20:02 host-073 lvm[1582]: raid10 array, black_bird-synced_random_raid10_3legs_1, is not in-sync.
Apr 28 11:20:02 host-073 lvm[1582]: raid10 array, black_bird-synced_random_raid10_3legs_1, is now in-sync.



Version-Release number of selected component (if applicable):
3.10.0-651.el7.x86_64

lvm2-2.02.170-2.el7    BUILT: Thu Apr 13 14:37:43 CDT 2017
lvm2-libs-2.02.170-2.el7    BUILT: Thu Apr 13 14:37:43 CDT 2017
lvm2-cluster-2.02.170-2.el7    BUILT: Thu Apr 13 14:37:43 CDT 2017
device-mapper-1.02.139-2.el7    BUILT: Thu Apr 13 14:37:43 CDT 2017
device-mapper-libs-1.02.139-2.el7    BUILT: Thu Apr 13 14:37:43 CDT 2017
device-mapper-event-1.02.139-2.el7    BUILT: Thu Apr 13 14:37:43 CDT 2017
device-mapper-event-libs-1.02.139-2.el7    BUILT: Thu Apr 13 14:37:43 CDT 2017
device-mapper-persistent-data-0.7.0-0.1.rc6.el7    BUILT: Mon Mar 27 10:15:46 CDT 2017

Comment 3 Heinz Mauelshagen 2017-05-26 14:14:08 UTC

Corey,
I assume the fail/repair is immediately tried after the initial sync
(we got a presumed race relative to updating the sync status in the kernel we haven't been able to address as yet)?
If so, does introducing a delay after the initial sync not expose the issue?

Comment 4 Corey Marthaler 2017-05-26 17:25:57 UTC

No. The fail/repair is tried quite awhile after the raid has been verified to be in sync, like well over a minute and after I/O the filesystem has been run.

Comment 5 Corey Marthaler 2017-05-26 17:28:22 UTC

Created attachment 1282702 [details]
log file during the failure of two raids sharing the same PV

Notice how much time passes before the failure takes place after creation and sycn'ing. And, the number of times repair is attempted and fails then passes.

Comment 6 Jonathan Earl Brassow 2017-06-06 15:49:21 UTC

patches committed upstream:

commit 88e649628863e78b101c584c513053fc9461c24d
Author: Jonathan Brassow <jbrassow>
Date:   Tue Jun 6 10:43:12 2017 -0500

    lvconvert:  linear -> raid1 upconvert should cause "recover" not "resync"

* and *

commit acaf3a5d47fd65b2e385a516544f8e6ec8d89b2d
Author: Jonathan Brassow <jbrassow>
Date:   Tue Jun 6 10:43:49 2017 -0500

    lvconvert:  Don't require a 'force' option during RAID repair.

Comment 8 Corey Marthaler 2017-06-15 21:28:17 UTC

Created attachment 1288203 [details]
test cases that continue to fail

Here are at least two test cases that continue to "fail" in the exact same way as describe originally in comment #0. The automatic allocation does eventually succeed but not before "in-sync" failures.

Comment 9 Corey Marthaler 2017-06-20 23:18:50 UTC

Created attachment 1289853 [details]
scenarios still failing

Following up with the above comment #8, many of the same scenarios listed in the original comment #0 continue to exhibit the same behavior. The only two differences this one run, were that raid1 and raid10 didn't fail to repair, that said, they still had the multiple "is now in-sync." and "successfully replaced." messages that are a regression from when there was only one pass or fail message. The other raid types continued to fail.

Comment 10 Zdenek Kabelac 2017-06-23 21:40:25 UTC

I'm getting here a bit confused.


The raid repair code does 'require' to have raid in-sync to be 'repairable'

If another 'repairable' action is triggered while repair runs in progress - this could appear problematic.

I've also upstreamed code which was fixing transiently failed rimage device.
Unsure if that's been part of problem here, as it's unclear how the 'bad' was created.

Comment 11 Zdenek Kabelac 2017-06-23 22:09:52 UTC

Current upstream has few patches worth retesting this scenario - so Jon if you can check with 0016b79c8b3223de7155fb4ec5fa799f7af1c811 - is problem fixed ?

Comment 13 Jonathan Earl Brassow 2017-06-25 21:42:18 UTC

For some reason, we are seeing a storm of events that are all seemingly grabbing status at roughly the same time.  The statuses bounce around in random ways that are not in-line with what the kernel is producing AFAICT.  It almost seems that dmeventd is somehow reordering some of the events as they are being processed.  I haven't come up with a great way of solving this problem.

Adding a 'sleep' to the end of the
        /* Loop awaiting/analyzing device events. */
        while (thread->events) {
loop does a better job of processing events.  I believe this only affects one device (not the whole of all devices being monitored).  If this is true, it shouldn't hurt us to make the sleep somewhat long, allow for the initial event to be processed and then gather another event a bit later.  Remember that RAID can tolerate long delays before reparative action is taken (mirror doesn't do so well with that, but the initial event would be enough to clean things up there too).

I'm currently testing with the above idea (a sleep(2)) and things are working very well.  I will continue to run some of corey's tests for a bit.

Comment 14 Jonathan Earl Brassow 2017-06-25 21:57:47 UTC

nope.  even after adding the sleep, there are still cases that are failing.

To be clear, the RAID is successfully repaired, but then the subsequent events just add confusion on top.  They almost immediately report the RAID is in-sync (which it isn't by then, it is in the process of recovery), then they say a device has failed (which it hasn't, it has already been repaired) and that repair has failed (it didn't), then that the array is in-sync, then not in-sync, and so on.  It is a mess.  It is a harmless mess, but it is a mess.

We could easily punt this until a z-stream release, because again, the repair does happen - the extra events serve only to confuse, but not harm the user.

I will need to know from corey if there is actually a case that is _failing_ to repair - not merely saying there was a failure to repair.  That will determine whether we are forced to block the release for this repair or we can fix it afterwards.

Comment 15 Jonathan Earl Brassow 2017-06-25 22:13:24 UTC

... and the insane part is that I am only receiving 2 events during this whole period!

before the failure:
Name:              vg-raid6
State:             ACTIVE
Read Ahead:        8192
Tables present:    LIVE
Open count:        0
Event number:      1
Major, minor:      253, 13
Number of targets: 1
UUID: LVM-TvGXzbyBplKUjn9LafTl8SF27W8cqY8VTQDcGKlEW2JiTcXFgGYD6EJHUPwKEIPa

after the failure (and all the processing):
Name:              vg-raid6
State:             ACTIVE
Read Ahead:        8192
Tables present:    LIVE
Open count:        0
Event number:      3
Major, minor:      253, 13
Number of targets: 1
UUID: LVM-TvGXzbyBplKUjn9LafTl8SF27W8cqY8VTQDcGKlEW2JiTcXFgGYD6EJHUPwKEIPa

1 event for the failure, and one for the change in table?  (MD emits an event when there is an error and when the sync thread is reaped - so this seems correct.)

So, why are we getting so many responses to these limited events!?!?  It's as if dmeventd_raid.c:_process_raid_event() is getting called 2-3 times for each event!


I'm going to take out the sleep that I put in, run from pristine upstream LVM sources, and see how many _process_raid_event() calls we get from the number of events received.

Comment 16 Jonathan Earl Brassow 2017-06-25 22:29:05 UTC

from pristine sources I get the following 'dmsetup info' at the stated phases...

before initial sync finishes:
Name:              vg-raid6
State:             ACTIVE
Read Ahead:        8192
Tables present:    LIVE
Open count:        0
Event number:      0
Major, minor:      253, 13
Number of targets: 1
UUID: LVM-TvGXzbyBplKUjn9LafTl8SF27W8cqY8Vbio1pWUoAaBWLT2PRhBaOboEMZYKwrPr

after initial sync finishes:
Name:              vg-raid6
State:             ACTIVE
Read Ahead:        8192
Tables present:    LIVE
Open count:        0
Event number:      1
Major, minor:      253, 13
Number of targets: 1
UUID: LVM-TvGXzbyBplKUjn9LafTl8SF27W8cqY8Vbio1pWUoAaBWLT2PRhBaOboEMZYKwrPr

after failure repair (before fully synced):
Name:              vg-raid6
State:             ACTIVE
Read Ahead:        8192
Tables present:    LIVE
Open count:        0
Event number:      3
Major, minor:      253, 13
Number of targets: 1
UUID: LVM-TvGXzbyBplKUjn9LafTl8SF27W8cqY8Vbio1pWUoAaBWLT2PRhBaOboEMZYKwrPr

after repair sync:
Name:              vg-raid6
State:             ACTIVE
Read Ahead:        8192
Tables present:    LIVE
Open count:        0
Event number:      4
Major, minor:      253, 13
Number of targets: 1
UUID: LVM-TvGXzbyBplKUjn9LafTl8SF27W8cqY8Vbio1pWUoAaBWLT2PRhBaOboEMZYKwrPr


Now, /only/ counting responses from dmeventd after being fully synced, I get (some non-relevent output clipped):

1) Jun 25 22:17:38 bp-01 lvm[43576]: raid6_zr array, vg-raid6, is now in-sync.
2) Jun 25 22:17:38 bp-01 lvm[43576]: WARNING: Device #0 of raid6_zr array, vg-raid6, has failed.
2b) Jun 25 22:17:39 bp-01 lvm[43576]: Faulty devices in vg/raid6 successfully replaced.
3) Jun 25 22:17:39 bp-01 lvm[43576]: raid6_zr array, vg-raid6, is now in-sync.
4) Jun 25 22:17:39 bp-01 lvm[43576]: WARNING: Device #0 of raid6_zr array, vg-raid6, has failed.
4b) Jun 25 22:17:39 bp-01 lvm[43576]: Failed to process event for vg-raid6.
5) Jun 25 22:17:39 bp-01 lvm[43576]: raid6_zr array, vg-raid6, is not in-sync.
*) Jun 25 22:18:17 bp-01 kernel: md: mdX: recovery done.
6) Jun 25 22:18:17 bp-01 lvm[43576]: raid6_zr array, vg-raid6, is now in-sync.
7) Jun 25 22:18:17 bp-01 lvm[43576]: WARNING: Device #0 of raid6_zr array, vg-raid6, has failed.
7b) Jun 25 22:18:17 bp-01 lvm[43576]: vg/raid6 does not contain devices specified to replace.
7c?) Jun 25 22:18:17 bp-01 lvm[43576]: Faulty devices in vg/raid6 successfully replaced.
8) Jun 25 22:18:17 bp-01 lvm[43576]: raid6_zr array, vg-raid6, is not in-sync.
9) Jun 25 22:18:17 bp-01 lvm[43576]: raid6_zr array, vg-raid6, is now in-sync.

3 events (note the resync finished in there) to 9 responses!!!  Some of these are clearly being repeated - and they seem to contain the same kernel status they had each time they are run.  Thus, I believe #1 and #3 are a repeat of the initial sync (they shouldn't be there at all), #2, #4, #7 are likely from the same event, etc.

Something has definitely gone wrong here.

Comment 18 Zdenek Kabelac 2017-06-26 18:18:23 UTC

Appeared to be an issue with reuse of dm_task struct where list pointer were not properly reininilized for next reuse for DM status call.

Fixed with following upstream patch:

https://www.redhat.com/archives/lvm-devel/2017-June/msg00118.html

Comment 21 Corey Marthaler 2017-06-28 22:24:01 UTC

Fix verified in the latest rpms.

3.10.0-686.el7.x86_64
lvm2-2.02.171-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
lvm2-libs-2.02.171-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
lvm2-cluster-2.02.171-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-1.02.140-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-libs-1.02.140-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-event-1.02.140-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-event-libs-1.02.140-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-persistent-data-0.7.0-0.1.rc6.el7    BUILT: Mon Mar 27 10:15:46 CDT 2017


Jun 28 16:08:43 host-081 lvm[1299]: Faulty devices in black_bird/synced_primary_raid1_2legs_1 successfully replaced.
Jun 28 16:08:43 host-081 lvm[1299]: raid1 array, black_bird-synced_primary_raid1_2legs_1-real, is not in-sync.
Jun 28 16:09:08 host-081 lvm[1299]: raid1 array, black_bird-synced_primary_raid1_2legs_1-real, is now in-sync.

Comment 22 errata-xmlrpc 2017-08-01 21:52:19 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2222