Bug 1102564
| Summary: | lvchange --syncaction is not detecting corruptions | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Zdenek Kabelac <zkabelac> |
| Component: | lvm2 | Assignee: | Heinz Mauelshagen <heinzm> |
| lvm2 sub component: | Mirroring and RAID (RHEL6) | QA Contact: | Cluster QE <mspqa-list> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | unspecified | ||
| Priority: | unspecified | CC: | agk, cmarthal, heinzm, jbrassow, msnitzer, nperic, prajnoha, prockai, zkabelac |
| Version: | 6.5 | Keywords: | Reopened |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2015-02-26 11:14:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1075263 | ||
Description
Zdenek Kabelac
2014-05-29 08:32:31 UTC
I think this bug is a failure of the test. The 'dd' cannot be relied upon to have the contents on the disk directly after the command completes unless you run 'sync' afterwards or use 'oflag=direct' with the 'dd' command.

BEFORE (w/o direct I/O flag):
# hmmm it's still in 'dd' buffer and not on real disk ??
# anyway skip over with 'should'
should check lv_field $vg/$lv1 raid_mismatch_count "128"
#lvchange-syncaction-raid.sh:31+ should check lv_field LVMTEST39786vg/LV1 raid_mismatch_count 128
lv_field: lv=LVMTEST39786vg/LV1, field="raid_mismatch_count", actual="0", expected="128"
TEST WARNING: Ignoring command failure.
# Ensure it's all on disk now

AFTER (with direct I/O flag):
# hmmm it's still in 'dd' buffer and not on real disk ??
# anyway skip over with 'should'
should check lv_field $vg/$lv1 raid_mismatch_count "128"
#foo.sh:31+ should check lv_field LVMTEST40248vg/LV1 raid_mismatch_count 128
# Ensure it's all on disk now

The RAID array cannot possibly know about cached write blocks that come in orthogonally to the array's write path. A real-life corruption would emanate from the disk and be caught. For these tests that simulate corruption on disk, we must ensure that the bad data gets to the disk before checking for corruption.

I will fix the test as the fix for this bug and then close it NOTABUG. If I am missing some concern, please advise.

commit 4454a580dfd966d0cd132a2fd7d0cbb0df7e46a6
Author: Jonathan Brassow <jbrassow>
Date: Fri May 30 17:26:10 2014 -0500
test: use direct I/O when injecting bad data into RAID images
When directly corrupting RAID images for the purpose of testing,
we must use direct I/O (or a 'sync' after the 'dd') to ensure that
the writes are not caught in the buffer cache in a way that is not
reachable by the top-level RAID device.
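
The injection step described in the commit message could look roughly like the sketch below. `inject_bad_data` is a hypothetical helper, not code from the lvm2 test suite; it tries direct I/O first and, as an assumption about what is acceptable in a test, falls back to a buffered write followed by 'sync' where O_DIRECT is not supported. `conv=notrunc` is only there so the sketch can also be tried on a regular file.

```shell
#!/bin/sh
# Sketch: overwrite the first KiB of a RAID image (or any file) so that the
# bad data is on the medium when the function returns.
# inject_bad_data is a hypothetical helper name.
inject_bad_data()
{
    target=$1
    # Preferred: bypass the buffer cache entirely with direct I/O.
    if dd if=/dev/urandom of="$target" bs=1K count=1 \
          conv=notrunc oflag=direct 2>/dev/null
    then
        return 0
    fi
    # Fallback (e.g. file systems without O_DIRECT support):
    # buffered write, then flush the buffer cache explicitly.
    dd if=/dev/urandom of="$target" bs=1K count=1 conv=notrunc 2>/dev/null &&
        sync
}
```

In a test this would be called on a RAID image device, e.g. `inject_bad_data "$dev2"`, before requesting the check.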
I don't think you could close this bug yet.

Of course, for 'raid1', 'oflag=direct' or 'sync' solves the problem.

BUT the lvm2 tool should issue a disk sync on its own (just like we sync the disk before we create, e.g., a snapshot). So IMHO for raid1, lvchange should ensure that all in-flight disk operations are on disk before starting the RAID check function, so that it gives a result for the current state of the system.

2nd BUT: this will not help for the raid4,5,6 case at all - that needs reactivation of the device.

This fix will have to wait for 6.7.

As Jon already elaborated, this is a test case flaw: writing to the legs directly via the buffer cache without syncing. O_DIRECT obviously makes the sync superfluous, hence solving the issue on RAID1.

The part of the bug related to RAID4/5/6 not detecting inconsistencies whilst checking persists, though. If that regresses, MD_RECOVERY_CHECK may not invalidate stripe cache entries before checking. Analyzing that next.

Analysis of the RAID4/5/6 personalities (drivers/md/raid5.c) does not show any revalidation of active stripes when MD_RECOVERY_CHECK is requested. Writing 1K of random data with direct I/O to the beginning of an array leg updates that block on the media, but the previous, correct block content is still present in the related active stripe and is thus used for the check operation on that very stripe.

I am discussing sensible revalidation options with Neil Brown now, but I think this is just an artificial case caused by the test data injection and not a real-life data corruption flaw to be covered by MD (any rogue user with appropriate credentials can corrupt any data on block devices).

To work around this in testing, writing enough data to the RAID device after the initial stripe to fill the stripe cache and then running "lvchange --syncaction check ..." should do.

Created a test case which'll show that check triggers, along the lines of the comment above.

Workaround proposal based on a 4G test RAID5:
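#!/bin/sh
#
# Test RAID4/5/6 "--syncaction check" working around the stripe cache in the md-raid personalities
#

# Wait until the running sync action on LV $1 has completed, then print its mismatch count.
wait_for_syncaction()
{
    percent="0"
    # copy_percent is printed as "100.00" or "100,00" depending on the locale
    while [ "$percent" != "100.00" ] && [ "$percent" != "100,00" ]
    do
        sleep 1
        percent="$(lvs -ocopy_percent --noheadings "$1" 2>/dev/null | tr -d ' ')"
    done
    lvs -oraid_mismatch_count --noheadings "$1" | tr -d ' '
}

LV=/dev/mapper/evo-raid5
dev2=${LV}_rimage_1

# Bring the array into a consistent state and record the baseline mismatch count
lvchange --syncaction repair "$LV"
mismatch_1=$(wait_for_syncaction "$LV")

# Corrupt the first 1K of the second image, bypassing the buffer cache
dd if=/dev/urandom of="$dev2" bs=1K count=1 oflag=direct >/dev/null 2>&1

lvchange --syncaction check "$LV"
mismatch_2=$(wait_for_syncaction "$LV")

if [ "$mismatch_1" -ne "$mismatch_2" ]
then
    echo "$mismatch_2 sectors mismatching!"
else
    echo "No mismatch?! :-("
fi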
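
For the "fill the stripe cache" part of the workaround, a rough estimate of how much array data one full pass through the cache covers may help when sizing the write. The sketch below is an assumption-laden calculation, not something taken from this report: it assumes 4 KiB pages, one page per member device per cache entry, and the md default of 256 entries. `stripe_cache_span_bytes` is a hypothetical helper, and kernels that resize the cache dynamically will behave differently.

```shell
#!/bin/sh
# Sketch: estimate the amount of array data covered by one full pass
# through the stripe cache. Assumes 4 KiB pages and one page per member
# device per cache entry; stripe_cache_span_bytes is a hypothetical helper.
stripe_cache_span_bytes()
{
    entries=$1      # number of stripe cache entries (assumed md default: 256)
    data_disks=$2   # member devices minus parity devices
    echo $((entries * 4096 * data_disks))
}

# Example: default cache on a 3-device RAID5 (2 data + 1 parity)
stripe_cache_span_bytes 256 2   # prints 2097152
```

Under these assumptions, writing a few MiB past the corrupted stripe should be enough to evict it on a small test array. On the 4G array used in the script, the preceding full-array pass already cycles the cache many times.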
Neil Brown confirms (respective mail pasted underneath for completeness) that md-raid456 assumes a drive never spontaneously changes its content (as forced here by the O_DIRECT write to a leg), thus no stripe revalidation occurs.

My test proposal in comment #8 works around that for the time being.

He's thinking about introducing a mempool, which would address the issue in question by keeping stripe cache entries only during times of concurrent access and returning them to the mempool afterwards, hence invalidating them.

> Hi Neil,
>
> I'm analyzing a data consistency bug related to a data injection in one
> of our tests (i.e. "dd oflag=direct" of random data to the first KB of a
> RAID5 leg, followed by a request for an array check, does _not_ find any
> inconsistencies).
>
> Looking at raid5.c, the active stripe does not get revalidated, thus the
> check has to succeed based on the correct block content still being
> present in sh->dev[i].page.
>
> Is that correct?
>
> If so, any bit rot on a stripe would not be spotted unless a read/write
> would eventually cause an I/O error?
>
> Did you think about revalidating stripe cache entries on check in order to
> spot such out-of-band data corruptions?
>
> Thanks,
> Heinz

(Sorry for empty reply - clicked the wrong button).

Yes, md/raid5 assumes that the drive never spontaneously changes its content - what was read recently is probably still there. I think that is a reasonable assumption.

Data doesn't usually survive in the stripe cache for very long, so "bit rot" is very unlikely to cause an inconsistency that remains hidden by the cache for long.

Unless the array is tiny, a 'check' will re-use all of the stripe cache multiple times, so doing two consecutive 'check's will read what is really on the devices.

I have thought about replacing the fixed-size stripe cache with a mempool. When stripes become idle I would then return them to the pool and forget any content they might have. That would make this particular symptom go away as we would not cache data from one request to the next, only during concurrent requests.

NeilBrown

(In reply to Heinz Mauelshagen from comment #9)

Given the rationale explaining the design and the potential workaround, closing as WONTFIX.

FWIW: "lvchange --refresh $LV" on a raid4/5/6 LV will reload the mapping, thus dropping and reinitializing the respective RAID stripe cache.
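
The refresh-based alternative from the closing comment could be scripted along the following lines. This is a minimal sketch, not a tested procedure: `refresh_and_check` is a hypothetical helper, and the `LVCHANGE` override is an assumption added only so the command sequence can be dry-run (for example with `LVCHANGE="echo lvchange"`) on a machine without LVM.

```shell
#!/bin/sh
# Sketch: drop the RAID4/5/6 stripe cache via a refresh, then start a check.
# LVCHANGE may be overridden for a dry run; refresh_and_check is a
# hypothetical helper name.
LVCHANGE=${LVCHANGE:-lvchange}

refresh_and_check()
{
    lv=$1
    # Reload the mapping; per the comment above, this drops and
    # reinitializes the stripe cache.
    $LVCHANGE --refresh "$lv" || return 1
    # Start the scrub; its completion still has to be polled by the caller,
    # e.g. via the copy_percent field of lvs.
    $LVCHANGE --syncaction check "$lv"
}
```

After the check has finished, `lvs -oraid_mismatch_count --noheadings $LV` reports the mismatch count, as in the test script earlier in this report.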