Bug 1102564

Summary:	lvchange --syncaction is not detecting corruptions
Product:	Red Hat Enterprise Linux 6	Reporter:	Zdenek Kabelac <zkabelac>
Component:	lvm2	Assignee:	Heinz Mauelshagen <heinzm>
lvm2 sub component:	Mirroring and RAID (RHEL6)	QA Contact:	Cluster QE <mspqa-list>
Status:	CLOSED WONTFIX	Docs Contact:
Severity:	unspecified
Priority:	unspecified	CC:	agk, cmarthal, heinzm, jbrassow, msnitzer, nperic, prajnoha, prockai, zkabelac
Version:	6.5	Keywords:	Reopened
Target Milestone:	rc
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2015-02-26 11:14:14 UTC	Type:	Bug
Bug Blocks:	1075263

Description Zdenek Kabelac 2014-05-29 08:32:31 UTC

Description of problem:

lvchange --syncaction is currently not correctly detecting inconsistencies
in raid arrays.

For raid1, when a leg is corrupted, the corruption is not seen by syncaction until a 'sync' has been run. For raid4,5,6 it is even necessary to deactivate and reactivate the whole LV before the corruption is detected.

Version-Release number of selected component (if applicable):
lvm2 2.02.106

How reproducible:


Steps to Reproduce:
1. create  raid array
2. corrupt leg  (dd)
3. run  lvchange --syncaction check
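
A minimal sketch of these three steps follows. The helper names `run` and `reproduce`, the VG/LV names, the 64M size and the `/dev/mapper/<vg>-<lv>_rimage_1` path are assumptions for illustration (the path assumes names without hyphens); with DRY_RUN=1 the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch of the three reproduction steps. VG/LV names and the size are
# placeholders; with DRY_RUN=1 the commands are printed instead of executed.
run()
{
        if [ "${DRY_RUN:-0}" = "1" ]
        then
                echo "$*"
        else
                "$@"
        fi
}

reproduce()
{
        vg="$1"
        lv="$2"

        # 1. create raid array (raid1 with one mirror, i.e. two legs)
        run lvcreate --type raid1 -m 1 -L 64M -n "$lv" "$vg" &&
        # 2. corrupt leg (dd), buffered and without a following 'sync'
        run dd if=/dev/urandom of="/dev/mapper/$vg-${lv}_rimage_1" bs=1K count=1 &&
        # 3. run lvchange --syncaction check
        run lvchange --syncaction check "$vg/$lv"
}
```

In a real run the initial synchronization of the new array would presumably have to finish before step 2, and the result would be read from the raid_mismatch_count field of lvs.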

Actual results:
Corruption is detected after deactivation and activation of volume

Expected results:
Reactivation should not be necessary

Additional info:

Comment 2 Jonathan Earl Brassow 2014-05-30 22:13:18 UTC

I think this bug is a failure of the test.  The 'dd' cannot be relied upon to have the contents on the disk directly after the command completes unless you run 'sync' afterwards or use 'oflag=direct' with the 'dd' command.

BEFORE (w/o direct I/O flag):
# hmmm it's still in 'dd' buffer and not on real disk ??
# anyway skip over with 'should'
should check lv_field $vg/$lv1 raid_mismatch_count "128"
#lvchange-syncaction-raid.sh:31+ should check lv_field LVMTEST39786vg/LV1 raid_mismatch_count 128
lv_field: lv=LVMTEST39786vg/LV1, field="raid_mismatch_count", actual="0", expected="128"
TEST WARNING: Ignoring command failure.
# Ensure it's all on disk now

AFTER (with direct I/O flag):
# hmmm it's still in 'dd' buffer and not on real disk ??
# anyway skip over with 'should'
should check lv_field $vg/$lv1 raid_mismatch_count "128"
#foo.sh:31+ should check lv_field LVMTEST40248vg/LV1 raid_mismatch_count 128
# Ensure it's all on disk now

The RAID array cannot possibly know about cached write blocks that come in orthogonally to the array's write path.  A real-life corruption would emanate from the disk and be caught.  For these tests that simulate corruption on disk, we must ensure that the bad data gets to the disk before checking for corruption.

I will fix the test as the fix for this bug and then close it NOTABUG.  If I am missing some concern, please advise.

Comment 3 Jonathan Earl Brassow 2014-05-30 22:30:13 UTC

commit 4454a580dfd966d0cd132a2fd7d0cbb0df7e46a6
Author: Jonathan Brassow <jbrassow>
Date:   Fri May 30 17:26:10 2014 -0500

    test: use direct I/O when injecting bad data into RAID images
    
    When directly corrupting RAID images for the purpose of testing,
    we must use direct I/O (or a 'sync' after the 'dd') to ensure that
    the writes are not caught in the buffer cache in a way that is not
    reachable by the top-level RAID device.
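
The injection step this commit message describes might look like the sketch below. The helper name `corrupt_leg` and its fallback from direct I/O to a buffered write plus 'sync' are assumptions for illustration, not code taken from the lvm2 test suite.

```shell
#!/bin/sh
# Hypothetical helper (not from the lvm2 test suite): overwrite the first
# 1 KiB of a RAID image with random data and make sure it reaches the media.
corrupt_leg()
{
        leg="$1"

        # Preferred: direct I/O bypasses the buffer cache entirely.
        if dd if=/dev/urandom of="$leg" bs=1K count=1 \
              oflag=direct conv=notrunc 2>/dev/null
        then
                return 0
        fi

        # Fallback when direct I/O is refused (e.g. alignment or filesystem
        # restrictions): buffered write, then flush the cache explicitly.
        dd if=/dev/urandom of="$leg" bs=1K count=1 conv=notrunc 2>/dev/null &&
                sync
}
```

It would be called with the path of one image, for example `corrupt_leg "${LV}_rimage_1"`, before running "lvchange --syncaction check".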

Comment 4 Zdenek Kabelac 2014-05-31 07:27:23 UTC

I don't think you can close this bug yet.

Of course for 'raid1', 'oflag=direct' or 'sync' solves the problem.

BUT - the lvm2 tool should issue a disk sync on its own
(just like we sync the disk before we create e.g. a snapshot).

So IMHO for raid1, lvchange should ensure that all in-flight disk operations are on disk before it starts the raid check, so that the result reflects the current state of the system.

The 2nd BUT is that this will not help in the raid4,5,6 case at all - it needs reactivation of the device.

Comment 5 Jonathan Earl Brassow 2014-08-27 04:26:03 UTC

this fix will have to wait for 6.7

Comment 6 Heinz Mauelshagen 2014-09-25 13:25:03 UTC

As Jon already explained, this is a test case flaw: the test writes to the legs directly via the buffer cache without syncing. O_DIRECT obviously makes the sync superfluous, hence solves the issue on RAID1.

The part of the bug related to RAID4/5/6 not detecting inconsistencies whilst checking persists though.

If that regresses, MD_RECOVERY_CHECK may not invalidate stripe cache entries before checking. Analyzing that next.

Comment 7 Heinz Mauelshagen 2014-09-25 15:41:32 UTC

Analysis of the RAID4/5/6 personalities (drivers/md/raid5.c) does not show any revalidation of active stripes when requested to do MD_RECOVERY_CHECK.

Writing 1K of random data with direct I/O to the beginning of an array leg
updates that block on the media, but the previous correct block content
is still present in the related active stripe and is thus used
for the check operation on that very stripe.

I am discussing sensible revalidation options with Neil Brown now, but
I think this is just an artificial case caused by the test data injection
and not a real-life data corruption flaw to be covered by MD (any rogue user
with appropriate credentials can corrupt any data on block devices).

To work around this in testing, writing enough data to the raid device
after the initial stripe to fill the stripe cache and then running
"lvchange --syncaction check ..." should do.

Comment 8 Heinz Mauelshagen 2014-09-25 17:18:27 UTC

Created a test case, based on a 4G test RAID5, which shows that the check triggers along the lines of the workaround proposed in comment 7:

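#!/bin/sh
#
# Test RAID4/5/6 "--syncaction check" working around the stripe cache in the md-raid personalities
#

# Wait until the sync action on LV $1 has finished, then print its mismatch count
wait_for_syncaction()
{
        percent="0"

        # The decimal separator depends on the locale, so accept both forms
        while [ "$percent" != "100.00" ] && [ "$percent" != "100,00" ]
        do
                sleep 1
                percent="$(lvs -ocopy_percent --noheadings "$1" 2>/dev/null | tr -d ' ')"
        done

        lvs -oraid_mismatch_count --noheadings "$1" | tr -d ' '
}

LV=/dev/mapper/evo-raid5
dev2=${LV}_rimage_1

# Baseline: repair first and record the mismatch count
lvchange --syncaction repair "$LV"
mismatch_1=$(wait_for_syncaction "$LV")

# Corrupt the first 1K of one leg, bypassing the buffer cache
dd if=/dev/urandom of="$dev2" bs=1K count=1 oflag=direct >/dev/null 2>&1

lvchange --syncaction check "$LV"
mismatch_2=$(wait_for_syncaction "$LV")

if [ "$mismatch_1" -ne "$mismatch_2" ]
then
        echo "$mismatch_2 sectors mismatching!"
else
        echo "No mismatch?! :-("
fi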

Comment 9 Heinz Mauelshagen 2014-09-26 10:27:56 UTC

Neil Brown confirms (respective mail pasted underneath for completeness) that md-raid456 assumes a drive never spontaneously changes its content (which is what the O_DIRECT write to a leg forces), thus no stripe revalidation occurs.

My test proposal in comment #8 works around that for the time being.

He's thinking about introducing a mempool, which would address the issue in question by keeping stripe cache entries only during times of concurrent access
and returning them to the mempool afterwards, hence invalidating them.

> Hi Neil,
>
> I'm analyzing a data consistency bug related to a data injection in one of our tests
> (i.e. writing random data with "dd oflag=direct" to the first KB of a raid 5 leg and
> then requesting an array check does _not_ find any inconsistencies).
>
> Looking at raid5.c, the active stripe does not get revalidated, thus the check has
> to succeed based on the correct block content still being present in sh->dev[i].page.
>
> Is that correct?
>
> If so, any bit rot on a stripe would not be spotted unless a read/write would
> eventually cause an io error?
>
> Did you think about revalidating stripe cache entries on check in order to
> spot such out-of-band data corruptions?
>
> Thanks,
> Heinz

(Sorry for empty reply - clicked the wrong button).

Yes, md/raid5 assumes that the drive never spontaneously changes its
content - what was read recently is probably still there.
I think that is a reasonable assumption.

Data doesn't usually survive in the stripe cache for very long, so "bit rot"
is very unlikely to cause an inconsistency that remains hidden by the cache
for long.

Unless the array is tiny, a 'check' will re-use all of the stripe cache
multiple times, so doing two consecutive 'check's will read what is really on
the devices.

I have thought about replacing the fixed-size stripe cache with a mempool.
When stripes become idle I would then return them to the pool and forget any
content they might have.  That would make this particular symptom go away
as we would not cache data from one request to the next, only during
concurrent requests.


NeilBrown
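
As a rough illustration of why only a tiny array (or the very first stripes of a larger one) is affected, the amount of array data covered by the stripe cache can be estimated. The sketch assumes each cache entry holds one page per device; the helper name, the 256-entry count (believed to be the md default, not confirmed for this setup) and the 4 KiB page size are assumptions, not values taken from this bug.

```shell
#!/bin/sh
# Hypothetical helper: estimate how many bytes must pass through the RAID LV
# to cycle every stripe cache entry once.
#   $1 = stripe cache entries (assumed 256)
#   $2 = number of data devices (e.g. 2 for a 3-leg RAID5)
#   $3 = page size in bytes (assumed 4096)
stripe_cache_fill_bytes()
{
        echo $(( $1 * $2 * $3 ))
}
```

Calling `stripe_cache_fill_bytes 256 2 4096` prints 2097152, i.e. about 2 MiB of array data held in the cache at once under these assumptions, which a check over the 4G test array from comment 8 would cycle through many times.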

Comment 13 Heinz Mauelshagen 2015-02-26 11:14:14 UTC

(In reply to Heinz Mauelshagen from comment #9)

Given the rationale explaining the design and the potential workaround, closing as WONTFIX.

FWIW:
"lvchange --refresh $LV" on a raid4/5/6 LV will reload the mapping, thus dropping and reinitializing the respective RAID stripe cache.