Bug 1446780
| Summary: | automatic "allocate" fault policy raid repair is failing (Unable to replace devices while it is not in-sync) | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Corey Marthaler <cmarthal> | ||||||||
| Component: | lvm2 | Assignee: | Heinz Mauelshagen <heinzm> | ||||||||
| lvm2 sub component: | Mirroring and RAID | QA Contact: | cluster-qe <cluster-qe> | ||||||||
| Status: | CLOSED ERRATA | Docs Contact: | |||||||||
| Severity: | high | ||||||||||
| Priority: | unspecified | CC: | agk, cmarthal, heinzm, jbrassow, mcsontos, msnitzer, mthacker, prajnoha, prockai, zkabelac | ||||||||
| Version: | 7.4 | Keywords: | Regression | ||||||||
| Target Milestone: | rc | ||||||||||
| Target Release: | --- | ||||||||||
| Hardware: | x86_64 | ||||||||||
| OS: | Linux | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | lvm2-2.02.171-8.el7 | Doc Type: | If docs needed, set a value | ||||||||
| Doc Text: | Story Points: | --- | |||||||||
| Clone Of: | Environment: | ||||||||||
| Last Closed: | 2017-08-01 21:52:19 UTC | Type: | Bug | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Embargoed: | |||||||||||
| Attachments: |
|
||||||||||
|
Description
Corey Marthaler
2017-04-28 21:26:52 UTC
Corey, I assume the fail/repair is immediately tried after the initial sync (we got a presumed race relative to updating the sync status in the kernel we haven't been able to address as yet)? If so, does introducing a delay after the initial sync not expose the issue? No. The fail/repair is tried quite awhile after the raid has been verified to be in sync, like well over a minute and after I/O the filesystem has been run. Created attachment 1282702 [details]
log file during the failure of two raids sharing the same PV
Notice how much time passes before the failure takes place after creation and sycn'ing. And, the number of times repair is attempted and fails then passes.
patches committed upstream:
commit 88e649628863e78b101c584c513053fc9461c24d
Author: Jonathan Brassow <jbrassow>
Date: Tue Jun 6 10:43:12 2017 -0500
lvconvert: linear -> raid1 upconvert should cause "recover" not "resync"
* and *
commit acaf3a5d47fd65b2e385a516544f8e6ec8d89b2d
Author: Jonathan Brassow <jbrassow>
Date: Tue Jun 6 10:43:49 2017 -0500
lvconvert: Don't require a 'force' option during RAID repair.
Created attachment 1288203 [details] test cases that continue to fail Here are at least two test cases that continue to "fail" in the exact same way as describe originally in comment #0. The automatic allocation does eventually succeed but not before "in-sync" failures. Created attachment 1289853 [details] scenarios still failing Following up with the above comment #8, many of the same scenarios listed in the original comment #0 continue to exhibit the same behavior. The only two differences this one run, were that raid1 and raid10 didn't fail to repair, that said, they still had the multiple "is now in-sync." and "successfully replaced." messages that are a regression from when there was only one pass or fail message. The other raid types continued to fail. I'm getting here a bit confused. The raid repair code does 'require' to have raid in-sync to be 'repairable' If another 'repairable' action is triggered while repair runs in progress - this could appear problematic. I've also upstreamed code which was fixing transiently failed rimage device. Unsure if that's been part of problem here, as it's unclear how the 'bad' was created. Current upstream has few patches worth retesting this scenario - so Jon if you can check with 0016b79c8b3223de7155fb4ec5fa799f7af1c811 - is problem fixed ? For some reason, we are seeing a storm of events that are all seemingly grabbing status at roughly the same time. The statuses bounce around in random ways that are not in-line with what the kernel is producing AFAICT. It almost seems that dmeventd is somehow reordering some of the events as they are being processed. I haven't come up with a great way of solving this problem.
Adding a 'sleep' to the end of the
/* Loop awaiting/analyzing device events. */
while (thread->events) {
loop does a better job of processing events. I believe this only affects one device (not the whole of all devices being monitored). If this is true, it shouldn't hurt us to make the sleep somewhat long, allow for the initial event to be processed and then gather another event a bit later. Remember that RAID can tolerate long delays before reparative action is taken (mirror doesn't do so well with that, but the initial event would be enough to clean things up there too).
I'm currently testing with the above idea (a sleep(2)) and things are working very well. I will continue to run some of corey's tests for a bit.
nope. even after adding the sleep, there are still cases that are failing. To be clear, the RAID is successfully repaired, but then the subsequent events just add confusion on top. They almost immediately report the RAID is in-sync (which it isn't by then, it is in the process of recovery), then they say a device has failed (which it hasn't, it has already been repaired) and that repair has failed (it didn't), then that the array is in-sync, then not in-sync, and so on. It is a mess. It is a harmless mess, but it is a mess. We could easily punt this until a z-stream release, because again, the repair does happen - the extra events serve only to confuse, but not harm the user. I will need to know from corey if there is actually a case that is _failing_ to repair - not merely saying there was a failure to repair. That will determine whether we are forced to block the release for this repair or we can fix it afterwards. ... and the insane part is that I am only receiving 2 events during this whole period! before the failure: Name: vg-raid6 State: ACTIVE Read Ahead: 8192 Tables present: LIVE Open count: 0 Event number: 1 Major, minor: 253, 13 Number of targets: 1 UUID: LVM-TvGXzbyBplKUjn9LafTl8SF27W8cqY8VTQDcGKlEW2JiTcXFgGYD6EJHUPwKEIPa after the failure (and all the processing): Name: vg-raid6 State: ACTIVE Read Ahead: 8192 Tables present: LIVE Open count: 0 Event number: 3 Major, minor: 253, 13 Number of targets: 1 UUID: LVM-TvGXzbyBplKUjn9LafTl8SF27W8cqY8VTQDcGKlEW2JiTcXFgGYD6EJHUPwKEIPa 1 event for the failure, and one for the change in table? (MD emits an event when there is an error and when the sync thread is reaped - so this seems correct.) So, why are we getting so many responses to these limited events!?!? It's as if dmeventd_raid.c:_process_raid_event() is getting called 2-3 times for each event! I'm going to take out the sleep that I put in, run from pristine upstream LVM sources, and see how many _process_raid_event() calls we get from the number of events received. from pristine sources I get the following 'dmsetup info' at the stated phases... before initial sync finishes: Name: vg-raid6 State: ACTIVE Read Ahead: 8192 Tables present: LIVE Open count: 0 Event number: 0 Major, minor: 253, 13 Number of targets: 1 UUID: LVM-TvGXzbyBplKUjn9LafTl8SF27W8cqY8Vbio1pWUoAaBWLT2PRhBaOboEMZYKwrPr after initial sync finishes: Name: vg-raid6 State: ACTIVE Read Ahead: 8192 Tables present: LIVE Open count: 0 Event number: 1 Major, minor: 253, 13 Number of targets: 1 UUID: LVM-TvGXzbyBplKUjn9LafTl8SF27W8cqY8Vbio1pWUoAaBWLT2PRhBaOboEMZYKwrPr after failure repair (before fully synced): Name: vg-raid6 State: ACTIVE Read Ahead: 8192 Tables present: LIVE Open count: 0 Event number: 3 Major, minor: 253, 13 Number of targets: 1 UUID: LVM-TvGXzbyBplKUjn9LafTl8SF27W8cqY8Vbio1pWUoAaBWLT2PRhBaOboEMZYKwrPr after repair sync: Name: vg-raid6 State: ACTIVE Read Ahead: 8192 Tables present: LIVE Open count: 0 Event number: 4 Major, minor: 253, 13 Number of targets: 1 UUID: LVM-TvGXzbyBplKUjn9LafTl8SF27W8cqY8Vbio1pWUoAaBWLT2PRhBaOboEMZYKwrPr Now, /only/ counting responses from dmeventd after being fully synced, I get (some non-relevent output clipped): 1) Jun 25 22:17:38 bp-01 lvm[43576]: raid6_zr array, vg-raid6, is now in-sync. 2) Jun 25 22:17:38 bp-01 lvm[43576]: WARNING: Device #0 of raid6_zr array, vg-raid6, has failed. 2b) Jun 25 22:17:39 bp-01 lvm[43576]: Faulty devices in vg/raid6 successfully replaced. 3) Jun 25 22:17:39 bp-01 lvm[43576]: raid6_zr array, vg-raid6, is now in-sync. 4) Jun 25 22:17:39 bp-01 lvm[43576]: WARNING: Device #0 of raid6_zr array, vg-raid6, has failed. 4b) Jun 25 22:17:39 bp-01 lvm[43576]: Failed to process event for vg-raid6. 5) Jun 25 22:17:39 bp-01 lvm[43576]: raid6_zr array, vg-raid6, is not in-sync. *) Jun 25 22:18:17 bp-01 kernel: md: mdX: recovery done. 6) Jun 25 22:18:17 bp-01 lvm[43576]: raid6_zr array, vg-raid6, is now in-sync. 7) Jun 25 22:18:17 bp-01 lvm[43576]: WARNING: Device #0 of raid6_zr array, vg-raid6, has failed. 7b) Jun 25 22:18:17 bp-01 lvm[43576]: vg/raid6 does not contain devices specified to replace. 7c?) Jun 25 22:18:17 bp-01 lvm[43576]: Faulty devices in vg/raid6 successfully replaced. 8) Jun 25 22:18:17 bp-01 lvm[43576]: raid6_zr array, vg-raid6, is not in-sync. 9) Jun 25 22:18:17 bp-01 lvm[43576]: raid6_zr array, vg-raid6, is now in-sync. 3 events (note the resync finished in there) to 9 responses!!! Some of these are clearly being repeated - and they seem to contain the same kernel status they had each time they are run. Thus, I believe #1 and #3 are a repeat of the initial sync (they shouldn't be there at all), #2, #4, #7 are likely from the same event, etc. Something has definitely gone wrong here. Appeared to be an issue with reuse of dm_task struct where list pointer were not properly reininilized for next reuse for DM status call. Fixed with following upstream patch: https://www.redhat.com/archives/lvm-devel/2017-June/msg00118.html Fix verified in the latest rpms. 3.10.0-686.el7.x86_64 lvm2-2.02.171-8.el7 BUILT: Wed Jun 28 13:28:58 CDT 2017 lvm2-libs-2.02.171-8.el7 BUILT: Wed Jun 28 13:28:58 CDT 2017 lvm2-cluster-2.02.171-8.el7 BUILT: Wed Jun 28 13:28:58 CDT 2017 device-mapper-1.02.140-8.el7 BUILT: Wed Jun 28 13:28:58 CDT 2017 device-mapper-libs-1.02.140-8.el7 BUILT: Wed Jun 28 13:28:58 CDT 2017 device-mapper-event-1.02.140-8.el7 BUILT: Wed Jun 28 13:28:58 CDT 2017 device-mapper-event-libs-1.02.140-8.el7 BUILT: Wed Jun 28 13:28:58 CDT 2017 device-mapper-persistent-data-0.7.0-0.1.rc6.el7 BUILT: Mon Mar 27 10:15:46 CDT 2017 Jun 28 16:08:43 host-081 lvm[1299]: Faulty devices in black_bird/synced_primary_raid1_2legs_1 successfully replaced. Jun 28 16:08:43 host-081 lvm[1299]: raid1 array, black_bird-synced_primary_raid1_2legs_1-real, is not in-sync. Jun 28 16:09:08 host-081 lvm[1299]: raid1 array, black_bird-synced_primary_raid1_2legs_1-real, is now in-sync. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2222 |