Bug 1210637

Summary: RAID array occasionally reports it is not in-sync
Product: Red Hat Enterprise Linux 6
Reporter: Zdenek Kabelac <zkabelac>
Component: lvm2
Assignee: Heinz Mauelshagen <heinzm>
lvm2 sub component: Mirroring and RAID (RHEL6)
QA Contact: cluster-qe <cluster-qe>
Status: CLOSED NEXTRELEASE
Severity: unspecified
Priority: unspecified
CC: agk, heinzm, jbrassow, mcsontos, msnitzer, prajnoha, prockai, zkabelac
Version: 6.7
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-12-04 22:46:03 UTC
Type: Bug
Attachments:
  Log from lvconvert-repair-raid.sh
  Repair fail log
  Complete lvm2 test script used

Description Zdenek Kabelac 2015-04-10 08:59:05 UTC
Created attachment 1013056 [details]
Log from lvconvert-repair-raid.sh

Description of problem:

Current RHEL6 kernel 2.6.32-540.el6.x86_64 with raid target v1.3.5
occasionally reports that a RAID array is not in sync.

This causes problems in our test suite (log attached),
resulting in:

LV percent: 0.000000
Unable to extract RAID image while RAID array is not in-sync
Failed to remove the specified images from LVMTEST612vg/LV1
Failed to replace faulty devices in LVMTEST612vg/LV1.

This is the result of the following sequence:

# lvcreate --type raid6 -i 3 -l 2 -n $lv1 $vg "$dev1" "$dev2" "$dev3" "$dev4" "$dev5"
# aux wait_for_sync $vg $lv1
# aux disable_dev "$dev4" "$dev5"
# lvconvert -y --repair $vg/$lv1
 WARNING: LVMTEST612vg/LV1 is not in-sync.
....
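(For reference, the same steps outside the test suite look roughly like this; the device names are illustrative, and the polling loop stands in for 'aux wait_for_sync' by reading the sync-action field of the dm-raid status line shown in the comments below:)

  # create the array and wait for the initial synchronization to finish;
  # field 8 of the 'dmsetup status' raid line is the sync action
  lvcreate --type raid6 -i 3 -l 2 -n LV1 vg /dev/sd[b-f]
  while [ "$(dmsetup status vg-LV1 | awk '{print $8}')" != "idle" ]; do
      sleep 1
  done
  # fail two of the legs ('aux disable_dev' swaps them for error targets),
  # then attempt the repair that trips the warning
  lvconvert -y --repair vg/LV1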


Version-Release number of selected component (if applicable):
kernel-2.6.32-540.el6.x86_64
lvm2  2.02.118

How reproducible:
Occasionally; the failure is intermittent.

Steps to Reproduce:
1. Create a raid6 LV and wait for the initial synchronization to finish.
2. Disable two of the underlying devices.
3. Run 'lvconvert -y --repair' on the LV (see the exact sequence above).

Actual results:
The repair fails with "Unable to extract RAID image while RAID array is not in-sync" even though synchronization has completed.

Expected results:
The kernel RAID status reliably reports the 'in-sync' state.

Additional info:

Comment 3 Heinz Mauelshagen 2015-04-15 19:46:09 UTC
I was able to see the following by the end of a resynchronization:

0 2097152 raid raid6 4 AAAa 69632/1048576 recover 0
...
[root.122.3 ~]# while true;do dmsetup status r1-r;done|grep " 0/"
0 2097152 raid raid6 4 aaaa 0/1048576 recover 0
0 2097152 raid raid6 4 aaaa 0/1048576 resync 0
^C
[root.122.3 ~]# dmsetup status r1-r

[root.122.3 ~]# dmsetup status r1-r
0 2097152 raid raid6_n_6 4 AAAA 1048576/1048576 idle 0


So raid_status() reports bogus interim 0/1048576 recover/resync status
before it ends up with 1048576/1048576 in state "idle".
raid_status() accesses flags and sector values unlocked in parallel
with an md thread.

If userspace reads that interim status, it explains the matter.

Comment 5 Zdenek Kabelac 2015-05-04 12:22:59 UTC
For now, lvm2 has added a 'cruel hack' to defeat this unreliable reporting from the target driver: the value is reread whenever 0 is reported:

https://www.redhat.com/archives/lvm-devel/2015-May/msg00032.html

It is unclear at the moment how good or bad this idea is.
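In rough shell form, the idea of the hack is the following (the actual change lives in lvm2's C status-parsing code, see the commit above; this sketch and its function name are illustrative only):

  # re-read the sync ratio once when the suspicious transient 0/N shows up;
  # field 7 of the 'dmsetup status' raid line is the sync ratio
  get_sync_ratio() {
      local ratio
      ratio=$(dmsetup status "$1" | awk '{print $7}')
      if [ "${ratio%%/*}" = "0" ]; then
          # 0/N may be the bogus interim value raid_status() can emit
          ratio=$(dmsetup status "$1" | awk '{print $7}')
      fi
      echo "$ratio"
  }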

Comment 6 Jonathan Earl Brassow 2015-05-12 16:20:59 UTC
With the workaround described in comment 5, this can be deferred until 6.8 (and should be fixed in RHEL7 if the problem exists there).

Comment 7 Jonathan Earl Brassow 2015-05-12 16:21:56 UTC
The code commit described in comment 5 should be removed once the problem is fixed.

Comment 8 Zdenek Kabelac 2015-05-20 08:53:19 UTC
Created attachment 1027575 [details]
Repair fail log

The patch from comment 5 makes the failure appear less often, but the problem is still reachable, as visible in the attached test log trace:

The repair is run after the RAID array has announced it is in sync.
The command checks twice:

## DEBUG: activate/dev_manager.c:958   LV percent: 0.000000

and the repair fails; however, right after that, 'dmsetup status' reports:

## DMSTATUS: @PREFIX@vg-LV1: 0 3072 raid raid6_zr 5 AAADD 1024/1024 idle 0

So the status reporting is unreliable and needs a kernel fix.

Comment 9 Zdenek Kabelac 2015-05-27 12:04:43 UTC
As suggested by Heinz in

https://www.redhat.com/archives/lvm-devel/2015-May/msg00228.html

the lvm2 test suite has changed its array-in-sync test; with this more thorough check (sketched below) we should see fewer repairs triggered on falsely 'in-sync' arrays.

If this works, we can then decide how the 'lvs' percentage should be reported.

Should we never report 100% while the sync action is not 'idle'?

What are all the transition states?
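The stricter check amounts to something like this (a sketch; the field positions follow the 'dmsetup status' lines quoted earlier, and the function name is illustrative):

  # treat the array as in-sync only when the sync counters agree AND the
  # sync action is "idle"; a complete-looking ratio during recover/resync
  # may still be the transient value described in comment 3
  raid_in_sync() {
      local ratio action
      read -r ratio action < <(dmsetup status "$1" | awk '{print $7, $8}')
      [ "${ratio%/*}" = "${ratio#*/}" ] && [ "$action" = "idle" ]
  }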

Comment 12 Jonathan Earl Brassow 2017-10-12 15:00:35 UTC
(In reply to Heinz Mauelshagen from comment #3)
> I was able to see the following by the end of a resynchronization:
> 
> 0 2097152 raid raid6 4 AAAa 69632/1048576 recover 0
> ...
> [root.122.3 ~]# while true;do dmsetup status r1-r;done|grep " 0/"
> 0 2097152 raid raid6 4 aaaa 0/1048576 recover 0
> 0 2097152 raid raid6 4 aaaa 0/1048576 resync 0
> ^C
> [root.122.3 ~]# dmsetup status r1-r
> 
> [root.122.3 ~]# dmsetup status r1-r
> 0 2097152 raid raid6_n_6 4 AAAA 1048576/1048576 idle 0
> 
> 
> So raid_status() reports bogus interim 0/1048576 recover/resync status
> before it ends up with 1048576/1048576 in state "idle".
> raid_status() accesses flags and sector values unlocked in parallel
> with an md thread.
> 
> If uspace accesses that interim status it explains the matter.

You are performing an up-convert. This is a problem that I recently patched, and the fix should be going into kernel 4.14 soon.

However, Zdenek is creating the raid6... it would have to use the "recover" sync action to be fixed by my patch, I think.

Comment 13 Heinz Mauelshagen 2017-12-04 22:38:51 UTC
Created attachment 1362843 [details]
Complete lvm2 test script used

Comment 14 Heinz Mauelshagen 2017-12-04 22:46:03 UTC
(In reply to Heinz Mauelshagen from comment #13)
> Created attachment 1362843 [details]
> Complete lvm2 test script used

My test result with the attached script:
### 1 tests: 1 passed, 0 skipped, 0 timed out, 0 warned, 0 failed

This is fixed by the patch series
https://www.redhat.com/archives/dm-devel/2017-December/msg00012.html
and by the linux-dm.git dm-4.16 branch commits
7931be992ba68da384f99a9a9d0a41d5b7ee2843 through
c8704bb994d53f2cd60948af925c3b7616b7845b.