Bug 1210637 - Raid array occasionally returns it's not in-sync
Summary: Raid array occasionally returns it's not in-sync
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2
Version: 6.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Heinz Mauelshagen
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-04-10 08:59 UTC by Zdenek Kabelac
Modified: 2017-12-04 22:46 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-04 22:46:03 UTC
Target Upstream Version:
Embargoed:


Attachments
Log from lvconvert-repair-raid.sh (394.05 KB, text/plain) - 2015-04-10 08:59 UTC, Zdenek Kabelac
Repair fail log (978.64 KB, text/plain) - 2015-05-20 08:53 UTC, Zdenek Kabelac
Complete lvm2 test script used (791 bytes, application/x-shellscript) - 2017-12-04 22:38 UTC, Heinz Mauelshagen

Description Zdenek Kabelac 2015-04-10 08:59:05 UTC
Created attachment 1013056 [details]
Log from lvconvert-repair-raid.sh

Description of problem:

The current RHEL6 kernel 2.6.32-540.el6.x86_64 with the raid target v1.3.5
occasionally reports that the raid array is not in sync.

This is causing problems in our test suite (log attached),
resulting in the following:

LV percent: 0.000000
Unable to extract RAID image while RAID array is not in-sync
Failed to remove the specified images from LVMTEST612vg/LV1
Failed to replace faulty devices in LVMTEST612vg/LV1.

This is the result of the following command sequence:

# lvcreate --type raid6 -i 3 -l 2 -n $lv1 $vg "$dev1" "$dev2" "$dev3" "$dev4" "$dev5"
# aux wait_for_sync $vg $lv1
# aux disable_dev "$dev4" "$dev5"
# lvconvert -y --repair $vg/$lv1
 WARNING: LVMTEST612vg/LV1 is not in-sync.
....
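
For context, 'aux wait_for_sync' is a helper from the lvm2 test suite that waits until the raid LV reports being in sync. A minimal, hypothetical shell sketch of that kind of polling (based only on the dm-raid status line format shown in the comments below, not on the actual helper):

wait_raid_in_sync() {
    # Poll "dmsetup status" until the sync ratio (field 7, "cur/total")
    # shows completion.  This naive check trusts a single sample, which is
    # exactly what the unreliable kernel reporting described below breaks.
    local dm_name=$1 ratio
    while :; do
        set -- $(dmsetup status "$dm_name")
        ratio=$7
        [ -n "$ratio" ] && [ "${ratio%/*}" = "${ratio#*/}" ] && return 0
        sleep 1
    done
}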


Version-Release number of selected component (if applicable):
kernel-2.6.32-540.el6.x86_64
lvm2  2.02.118

How reproducible:
Occasionally; the status is reported wrongly only intermittently.

Steps to Reproduce:
See the command sequence in the description above.

Actual results:
Repair fails with "Unable to extract RAID image while RAID array is not in-sync".

Expected results:
The kernel raid status reliably reports the 'in-sync' state once synchronization has completed.

Additional info:

Comment 3 Heinz Mauelshagen 2015-04-15 19:46:09 UTC
I was able to see the following by the end of a resynchronization:

0 2097152 raid raid6 4 AAAa 69632/1048576 recover 0
...
[root.122.3 ~]# while true;do dmsetup status r1-r;done|grep " 0/"
0 2097152 raid raid6 4 aaaa 0/1048576 recover 0
0 2097152 raid raid6 4 aaaa 0/1048576 resync 0
^C
[root.122.3 ~]# dmsetup status r1-r

[root.122.3 ~]# dmsetup status r1-r
0 2097152 raid raid6_n_6 4 AAAA 1048576/1048576 idle 0


So raid_status() reports a bogus interim 0/1048576 recover/resync status
before it ends up with 1048576/1048576 in state "idle".
raid_status() accesses flags and sector values unlocked, in parallel
with an md thread.

If userspace reads that interim status, it explains the problem.

Comment 5 Zdenek Kabelac 2015-05-04 12:22:59 UTC
For now, lvm2 has added a 'cruel hack' to work around this unreliable reporting from the target driver: it re-reads the value when 0 is reported:

https://www.redhat.com/archives/lvm-devel/2015-May/msg00032.html

Unsure how good or bad this idea is ATM.
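
The actual workaround lives in lvm2's C code (the lvm-devel posting above); purely as a shell illustration of the idea, re-reading the status once whenever a suspicious 0 counter shows up:

# Illustration only (not the lvm2 patch): if the reported sync counter is 0,
# sample the status a second time before trusting it, since "0/<total>" can
# be a transient artifact of the unlocked read in the kernel's raid_status().
raid_sync_ratio() {
    local dm_name=$1 ratio
    set -- $(dmsetup status "$dm_name")
    ratio=$7
    if [ "${ratio%/*}" = "0" ]; then
        set -- $(dmsetup status "$dm_name")
        ratio=$7
    fi
    echo "$ratio"
}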

Comment 6 Jonathan Earl Brassow 2015-05-12 16:20:59 UTC
With the workaround described in comment 5, this can be deferred until 6.8 (and it should be fixed in RHEL7 if the problem exists there).

Comment 7 Jonathan Earl Brassow 2015-05-12 16:21:56 UTC
The code commit described in comment 5 should be removed once the problem is fixed.

Comment 8 Zdenek Kabelac 2015-05-20 08:53:19 UTC
Created attachment 1027575 [details]
Repair fail  log

The patch from comment 5 makes the problem appear less often, but it is still reachable, as visible in the attached test log trace:

Repair is run after the raid array announced it is in sync.
The command checks twice:

## DEBUG: activate/dev_manager.c:958   LV percent: 0.000000

and the repair fails; however, right after that, 'dmsetup status' reports:

## DMSTATUS: @PREFIX@vg-LV1: 0 3072 raid raid6_zr 5 AAADD 1024/1024 idle 0

So the status reporting is unreliable and needs a kernel fix.

Comment 9 Zdenek Kabelac 2015-05-27 12:04:43 UTC
As suggested by Heinz -

https://www.redhat.com/archives/lvm-devel/2015-May/msg00228.html

the lvm test suite has now changed its test for array-in-sync; with this more complex test we should reduce the number of false 'in-sync' array repairs.

If this works, we can then decide how the 'lvs' percent should be reported.

Should we never report 100% if the array is not idle?

What are all the transition states?
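
As a rough, hypothetical illustration of such a stricter check (not the actual test-suite change), the test can require both a complete sync ratio and the 'idle' sync action from the dm-raid status line before treating the array as in-sync:

# Hypothetical stricter in-sync test: field 7 is the "cur/total" sync ratio,
# field 8 is the sync action (idle, resync, recover, ...).  Only report
# in-sync when the counters match and the action is "idle".
raid_is_in_sync() {
    local dm_name=$1
    set -- $(dmsetup status "$dm_name")
    local ratio=$7 action=$8
    [ "${ratio%/*}" = "${ratio#*/}" ] && [ "$action" = "idle" ]
}

Such a helper would be called on the mapped device name (vg-lv) before attempting a repair.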

Comment 12 Jonathan Earl Brassow 2017-10-12 15:00:35 UTC
(In reply to Heinz Mauelshagen from comment #3)
> I was able to see the following by the end of a resynchronization:
> 
> 0 2097152 raid raid6 4 AAAa 69632/1048576 recover 0
> ...
> [root.122.3 ~]# while true;do dmsetup status r1-r;done|grep " 0/"
> 0 2097152 raid raid6 4 aaaa 0/1048576 recover 0
> 0 2097152 raid raid6 4 aaaa 0/1048576 resync 0
> ^C
> [root.122.3 ~]# dmsetup status r1-r
> 
> [root.122.3 ~]# dmsetup status r1-r
> 0 2097152 raid raid6_n_6 4 AAAA 1048576/1048576 idle 0
> 
> 
> So raid_status() reports bogus interim 0/1048576 recover/resync status
> before it ends up with 1048576/1048576 in state "idle".
> raid_status() accesses flags and sector values unlocked in parallel
> with an md thread.
> 
> If userspace reads that interim status, it explains the problem.

You are performing an up-convert. This is a problem that I recently patched; the fix should be going into kernel 4.14 soon.

However, Zdenek is creating the raid6... it would have to use the "recover" sync action to be fixed by my patch, I think.

Comment 13 Heinz Mauelshagen 2017-12-04 22:38:51 UTC
Created attachment 1362843 [details]
Complete lvm2 test script used

Comment 14 Heinz Mauelshagen 2017-12-04 22:46:03 UTC
(In reply to Heinz Mauelshagen from comment #13)
> Created attachment 1362843 [details]
> Complete lvm2 test script used

My test result with the attached script:
### 1 tests: 1 passed, 0 skipped, 0 timed out, 0 warned, 0 failed

This is fixed by the patch series
https://www.redhat.com/archives/dm-devel/2017-December/msg00012.html
and by the linux-dm.git dm-4.16 branch commits
7931be992ba68da384f99a9a9d0a41d5b7ee2843 to
c8704bb994d53f2cd60948af925c3b7616b7845b.

