Bug 1210637
| Summary: | Raid array occasionally returns it's not in-sync | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Zdenek Kabelac <zkabelac> |
| Component: | lvm2 | Assignee: | Heinz Mauelshagen <heinzm> |
| lvm2 sub component: | Mirroring and RAID (RHEL6) | QA Contact: | cluster-qe <cluster-qe> |
| Status: | CLOSED NEXTRELEASE | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | agk, heinzm, jbrassow, mcsontos, msnitzer, prajnoha, prockai, zkabelac |
| Version: | 6.7 | | |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-12-04 22:46:03 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
I was able to see the following by the end of a resynchronization:

```
0 2097152 raid raid6 4 AAAa 69632/1048576 recover 0
...
[root.122.3 ~]# while true;do dmsetup status r1-r;done|grep " 0/"
0 2097152 raid raid6 4 aaaa 0/1048576 recover 0
0 2097152 raid raid6 4 aaaa 0/1048576 resync 0
^C
[root.122.3 ~]# dmsetup status r1-r

[root.122.3 ~]# dmsetup status r1-r
0 2097152 raid raid6_n_6 4 AAAA 1048576/1048576 idle 0
```

So raid_status() reports a bogus interim 0/1048576 recover/resync status before it ends up with 1048576/1048576 in state "idle". raid_status() accesses flags and sector values unlocked, in parallel with an md thread. If userspace reads that interim status, it explains the matter.

For now, lvm2 has added a 'cruel hack' to defeat this unreliable reporting from the target driver: it rereads the value when 0 is reported: https://www.redhat.com/archives/lvm-devel/2015-May/msg00032.html
Unsure how good or bad this idea is at the moment.

With the workaround described in comment 5, this can be deferred until 6.8 (and should be fixed in RHEL7 if it exists there). The code commit described in comment 5 should be removed once the problem is fixed.

Created attachment 1027575 [details]
Repair fail log

The patch from comment 5 makes the problem appear less often, but it is still reachable, as visible in the attached test log trace. Repair is run after the raid array announced it is in sync - the command checks twice:

```
## DEBUG: activate/dev_manager.c:958  LV percent: 0.000000
```

and the repair fails. However, right after that, 'dmsetup status' reports this:

```
## DMSTATUS: @PREFIX@vg-LV1: 0 3072 raid raid6_zr 5 AAADD 1024/1024 idle 0
```

So the status reporting is unreliable and needs a kernel fix.

As suggested by Heinz - https://www.redhat.com/archives/lvm-devel/2015-May/msg00228.html - the lvm test suite has now changed its test for array-in-sync, so with this more complex test we should be able to reduce false 'in-sync' array repairs. If this works, we can then decide how the 'lvs' percent should be reported. Should we never report 100% if the sync action is not 'idle'? What are all the transition states?

(In reply to Heinz Mauelshagen from comment #3)
> I was able to see the following by the end of a resynchronization:
>
> 0 2097152 raid raid6 4 AAAa 69632/1048576 recover 0
> ...
> [root.122.3 ~]# while true;do dmsetup status r1-r;done|grep " 0/"
> 0 2097152 raid raid6 4 aaaa 0/1048576 recover 0
> 0 2097152 raid raid6 4 aaaa 0/1048576 resync 0
> ^C
> [root.122.3 ~]# dmsetup status r1-r
>
> [root.122.3 ~]# dmsetup status r1-r
> 0 2097152 raid raid6_n_6 4 AAAA 1048576/1048576 idle 0
>
> So raid_status() reports bogus interim 0/1048576 recover/resync status
> before it ends up with 1048576/1048576 in state "idle".
> raid_status() accesses flags and sector values unlocked in parallel
> with an md thread.
>
> If uspace accesses that interim status it explains the matter.

You are performing an up-convert. This is a problem that I recently patched, and the fix should be going into kernel 4.14 soon. However, Zdenek is creating the raid6... it would have to use the "recover" sync action to be fixed by my patch, I think.

Created attachment 1362843 [details]
Complete lvm2 test script used
(In reply to Heinz Mauelshagen from comment #13)
> Created attachment 1362843 [details]
> Complete lvm2 test script used

My test result with the attached script:

```
### 1 tests: 1 passed, 0 skipped, 0 timed out, 0 warned, 0 failed
```

This is fixed by the patch series https://www.redhat.com/archives/dm-devel/2017-December/msg00012.html (and linux-dm.git, dm-4.16 branch, commits 7931be992ba68da384f99a9a9d0a41d5b7ee2843 to c8704bb994d53f2cd60948af925c3b7616b7845b).
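For reference, below is a minimal sketch of the stricter array-in-sync check discussed above: only trust the status when the sync ratio is complete and the sync action is "idle". The helper name, polling loop, and field positions are illustrative assumptions based on the dm-raid status lines shown in the transcripts; this is not the actual lvm2 test-suite code.

```sh
# Hypothetical helper (not the lvm2 test-suite implementation): treat the
# array as in-sync only when the sync ratio is complete AND the kernel
# reports the "idle" sync action, so the transient "0/<total>" readings
# described above are never trusted.
wait_for_raid_sync() {
    local dm_name=$1 ratio action cur total
    while true; do
        # dm-raid status line:
        #   start len raid <type> <#devs> <health> <cur>/<total> <sync_action> <mismatch_cnt>
        set -- $(dmsetup status "$dm_name")
        ratio=$7 action=$8
        cur=${ratio%/*} total=${ratio#*/}
        if [ "$cur" = "$total" ] && [ "$action" = "idle" ]; then
            return 0
        fi
        sleep 1
    done
}

# Example (the dm name of an LV is usually <vg>-<lv>):
# wait_for_raid_sync LVMTEST612vg-LV1
```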
Created attachment 1013056 [details]
Log from lvconvert-repair-raid.sh

Description of problem:

The current RHEL6 kernel 2.6.32-540.el6.x86_64 with raid target v1.3.5 occasionally reports that a raid array is not in sync. This causes problems in our test suite (log attached), resulting in:

```
LV percent: 0.000000
Unable to extract RAID image while RAID array is not in-sync
Failed to remove the specified images from LVMTEST612vg/LV1
Failed to replace faulty devices in LVMTEST612vg/LV1.
```

as the result of this sequence:

```
# lvcreate --type raid6 -i 3 -l 2 -n $lv1 $vg "$dev1" "$dev2" "$dev3" "$dev4" "$dev5"
# aux wait_for_sync $vg $lv1
# aux disable_dev "$dev4" "$dev5"
# lvconvert -y --repair $vg/$lv1
WARNING: LVMTEST612vg/LV1 is not in-sync.
....
```

Version-Release number of selected component (if applicable):
kernel-2.6.32-540.el6.x86_64
lvm2 2.02.118

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
Kernel raid status reliably returns 'in-sync' status.

Additional info:
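As an illustration of the "reread when 0 is reported" idea from comment 5, a rough userspace sketch follows. The helper name, retry count, and sleep interval are made up for illustration; this is not the actual lvm2 commit linked above.

```sh
# Rough sketch of the "reread the value when 0 is reported" workaround:
# a bogus interim "0/<total>" reading disappears on a re-read, so only
# accept a zero numerator after a few attempts.
raid_sync_ratio() {
    local dm_name=$1 ratio try
    for try in 1 2 3; do
        # Field 7 of the dm-raid status line is "<cur>/<total>".
        ratio=$(dmsetup status "$dm_name" | awk '{ print $7 }')
        case $ratio in
            0/*) sleep 1 ;;   # possibly the bogus interim value, retry
            *)   break ;;
        esac
    done
    echo "$ratio"
}

# Example: raid_sync_ratio r1-r
```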