Bug 182659
| Summary: | kernel dm: snapshots of the same origin with differing chunk sizes causes corruption |
|---|---|
| Product: | Red Hat Enterprise Linux 4 |
| Component: | kernel |
| Version: | 4.0 |
| Hardware: | All |
| OS: | Linux |
| Status: | CLOSED WONTFIX |
| Severity: | high |
| Priority: | high |
| Reporter: | Corey Marthaler <cmarthal> |
| Assignee: | Mikuláš Patočka <mpatocka> |
| CC: | agk, bstevens, coughlan, dwysocha, jbaron, jbrassow, mbroz, riek, syeghiay |
| Doc Type: | Bug Fix |
| Last Closed: | 2010-05-18 21:57:08 UTC |
| Bug Blocks: | 422551, 430698, 461297 |
Description
Corey Marthaler 2006-02-23 21:36:41 UTC
Fascinating. I never thought to try that before! It's supposed to work:

```
/*
 * Remember, different snapshots can have
 * different chunk sizes.
 */
```

I haven't reproduced this yet: I created two snapshots with different chunk sizes and wrote to the origin, but saw no corruption on the snapshots. Is that what your test is doing, or is it something more complicated? Is the output above from a *single* test, i.e. do you need *four* snapshots to reproduce this, and only if number 4 has a different chunk size?

I believe I was able to reproduce this with just two snapshots. I'll try it again and see...

And what does the 'data comparison error' tell you? Can you narrow down which data is corrupt? Where did it come from - in which step was it written, how much data was written in that step, and what proportion of it was corrupt? How much data is corrupt, e.g. everything written in a particular step, or just one or more chunks? Is 'Started at byte 3278848' the start of a section of data that was written, or part way through? [The easiest way to find out might be to reduce the amount of data written to a minimum and see whether it fails every time, never, or only some proportion of the time.] I'm trying to establish whether there's any randomness in the pattern of failures (e.g. caused by a race condition) or whether it's deterministic - given the same setup and I/O, would it always fail in exactly the same way? I looked at the code and haven't spotted the problem yet.

Also, can you run the validation after writing *twice* in succession, to confirm that it's not related to the known race condition in bug 175830? That bug can be detected because two successive reads of the "same" place on disk will return different data - corrupt data the first time, but correct data the second time. Also, the proportion of corrupt data will be small. If this were a general problem with different chunk sizes, the amount of corrupt data would presumably be larger and proportional to the amount of data written.

Parse error in "can you run the validation after writing *twice* in succession". Better: after the write phase, can you run the validation *twice* in succession? Also try a sync and a 10-second sleep after writing, before the read check, to see whether that makes any difference.

This is reproducible with just two snapshots, one with chunk size 16 and one with chunk size 32. However, we tried this with smaller origin and snapshot volumes (4M each, then 12M each) and were unable to reproduce it. So we tried again with the sizes from the original report (4G), only this time we completely filled the origin volume with the letter "A" before creating the two snapshots with different chunk sizes (so from here on out, no matter what we write to the origin, the snapshots should only ever contain "A"s; a sketch of this setup follows below). We then ran the following types of I/O requests to the origin, verifying both snapshots each time:

A. direct, overlapped, random
B. direct, non-overlapped, sequential
C. direct, non-overlapped, reverse
D. buffered, non-overlapped, reverse
E. sync, non-overlapped, sequential

We only hit the bug in the direct I/O cases, and only on the snapshot with the 32k chunk size.
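For anyone wanting to recreate that setup, here is a minimal sketch. It assumes an existing volume group named snapper with enough free space; the volume names and chunk sizes match the logs in this report, but the fill step is illustrative and not the exact b_iogen/b_doio invocation:

```sh
# Origin volume, filled end-to-end with 'A' so any later deviation
# on a snapshot is immediately detectable.
lvcreate -L 4G -n origin snapper
dd if=/dev/zero bs=1M count=4096 | tr '\0' 'A' \
    | dd of=/dev/snapper/origin bs=1M

# Two snapshots of the same origin with differing chunk sizes.
lvcreate -s -c 16k -L 2.5G -n snap1 /dev/snapper/origin
lvcreate -s -c 32k -L 2.5G -n snap2 /dev/snapper/origin

# From this point on, writes to the origin must never leak into the
# snapshots: both should read back as all 'A's.
```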
Here is a little more detail on what those failures looked like on that snapshot (remember, it should only contain "A"s):

```
# direct, overlap, random
cmdline:
[root@link-08 tmp]# /tmp/b_iogen -o -m random -f direct -i 5s -s write,writev -t1000b -T10000b -d /dev/snapper/origin:/dev/snapper/snap2 | /tmp/b_doio -vD 2> /tmp/foobar

request that failed:
----- xior ----
magic:     0xfeed10
type:      16 (write)
path:      /dev/snapper/origin
syscall:   write
oflags:    16386 (O_RDWR|O_DIRECT)
offset:    1222144000
count:     4740608
pattern:   V
devmap[0]: /dev/snapper/snap2

# at offset 1222144000 + 2543616 in the 32k chunk size snap:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAo*V:7986:link-08:b_doio*V:7986:link-08:b_doio*V:7986:link-08:b_doio*V:7986:link-08:b_doio*V:7986:link-08:b
```

```
# direct, sequential, non overlapping
cmdline:
/tmp/b_iogen -f direct -i 5s -s write,writev -t1000b -T10000b -d /dev/snapper/origin:/dev/snapper/snap2 | /tmp/b_doio -vD 2> /tmp/glarch

request that failed:
----- xior ----
magic:     0xfeed10
type:      16 (write)
path:      /dev/snapper/origin
syscall:   writev
oflags:    16386 (O_RDWR|O_DIRECT)
offset:    39311872
count:     4895232
pattern:   Q
devmap[0]: /dev/snapper/snap2

# at offset 39311872 + 1664512 in the 32k chunk size snap:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA:b_doio*Q:7989:link-08:b_doio*Q:7989:link-08:b_doio*Q:7989:link-08:b_doio*Q:7989:link-08:b_doio*Q:7989:link-08:b_doio*Q:798
```

```
# direct, reverse, non overlapping
cmdline:
/tmp/b_iogen -m reverse -f direct -i 5s -s write,writev -t1000b -T10000b -d /dev/snapper/origin:/dev/snapper/snap2 | /tmp/b_doio -vD 2> /tmp/doodie

request that failed:
----- xior ----
magic:     0xfeed10
type:      16 (write)
path:      /dev/snapper/origin
syscall:   write
oflags:    16386 (O_RDWR|O_DIRECT)
offset:    4033847296
count:     3115008
pattern:   J
devmap[0]: /dev/snapper/snap2

# at offset 4033847296 + 1843200 in the 32k chunk size snap:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAoio*J:7999:link-08:b_doio*J:7999:link-08:b_doio*J:7999:link-08:b_doio*J:7999:link-
```

Can you repeat the snapshot read test twice? I.e. you say '# at offset 4033847296 + 1843200 in the 32k chunk size snap:' - if you run something that reads the snapshot a second or third time, do you still see the same corruption? In other words, can you confirm that the corruption is genuinely on the disk and not a one-off bug in the read-back process? And if you repeat one of the tests completely, does the offset of the failure change? Does the test show *every* bit of corruption, or only the first it finds? (I.e. is it just a small piece of data that's corrupted, or is there lots more not shown? What do you mean by "count:" and "+ 1843200" etc.?) The repeated test would need to be B, to avoid the randomness within the test - the question is whether the failure always happens at exactly or approximately the same offset, or is completely random.

And can you reproduce it with smaller devices? (I.e. an order-of-magnitude test - 4 megabytes isn't enough and 4 gigabytes is, so what about 40M or 400M? How much RAM is in the machine? Does less RAM make the problem appear with a smaller device size?)

As I read it, the evidence so far points to some race condition, so we need to vary different things one at a time to see what makes the problem worse and what makes it go away.
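One low-tech way to answer the "is it genuinely on disk" question is to re-read the reported region with dd, bypassing the test harness entirely. A sketch using the reverse-I/O failure above (the sector arithmetic is mine: 4033847296 + 1843200 = 4035690496 bytes, which is sector 7882208):

```sh
# Dump 4 KB starting at the first reported corrupt byte.
dd if=/dev/snapper/snap2 bs=512 skip=7882208 count=8 2>/dev/null | od -c | head -20

# Read the same region again: if the identical non-'A' bytes show up
# both times, the corruption is on disk, not a transient read error.
dd if=/dev/snapper/snap2 bs=512 skip=7882208 count=8 2>/dev/null | od -c | head -20
```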
If it is a race, it might be worth trying a kernel with the patches I already submitted to U4, in case they have a positive impact. (It's just possible this is another manifestation of a problem already solved.)

I can add a few bits to answer some of the questions Alasdair has. The test stops checking the buffer for errors once it finds the first bad byte; it then dumps out a chunk beyond the corruption. We have some extra code in place right now to dump the entire "expected" buffer as well as the buffer we read; if that output would help, we can add it to the bug.

As for the "count:" and "+ XXXXXXXX" in the report: the xior shown:

```
----- xior ----
magic:     0xfeed10
type:      16 (write)
path:      /dev/snapper/origin
syscall:   write
oflags:    16386 (O_RDWR|O_DIRECT)
offset:    4033847296
count:     3115008
pattern:   J
devmap[0]: /dev/snapper/snap2
```

would cause b_doio to generate a pattern buffer 3115008 bytes long using 'J' as a seed, then write it (in this example using write(); the test can request writev, pwrite, etc.) to /dev/snapper/origin, starting at offset 4033847296. Thus the repeating b_doio*J:7999: pattern seen in the snapshot read is what was written to the origin - the read is returning a mix of the original 'A' data and what was just written to the origin, rather than just random data.

Prior to the write(), b_doio pre-reads the snapshot(s) (devmap in the xior) and holds that buffer, writes to the origin, re-reads the origin and checks the buffer, then re-reads all of the snapshots and checks each buffer against what it read off that snapshot prior to the write to the origin. So '# at offset 4033847296 + 1843200 in the 32k chunk size snap' means b_doio detected the first byte of bad data 1843200 bytes into the buffer it read starting at offset 4033847296 on the snapshot. (The full buffer size is 'count', as in the xior.)

We can store the xior stream on disk and replay it, which should let us see whether we hit the failure at exactly the same place. We can add a multiple re-read at the snapshot verify step as well.

Thanks for adding those details. Definitely repeat the "then re-reads all of the snaps and checks that buffer against what we read off the snap prior to the write to the origin" step twice, and report whether it returns wrong data the first and/or second time. The result of that test should show whether or not this is the same as bug 175830. If so, having two different chunk sizes must be making the race more likely.

A re-read of the snapshot data shows the same corrupted data as the first read:

```
*** DATA COMPARISON ERROR /dev/snapper/snap2 (Offset 387887616) ***
Started at byte 1674752
Expected 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) ...
Got      0x3a(:) 0x6c(l) 0x69(i) 0x6e(n) 0x6b(k) 0x2d(-) 0x30(0) 0x38(8) 0x3a(:) 0x62(b) ...

*** DATA COMPARISON ERROR /dev/snapper/snap2 (Offset 387887616) ***
Started at byte 1674752
Expected 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) ...
Got      0x3a(:) 0x6c(l) 0x69(i) 0x6e(n) 0x6b(k) 0x2d(-) 0x30(0) 0x38(8) 0x3a(:) 0x62(b) ...
```
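The pre-read / write / re-read sequence described above can be approximated with plain shell tools for anyone without the b_doio harness. A rough sketch, not the b_doio implementation; the offset and count are taken from the xior above (both happen to be 512-aligned), and the temp-file names are my own:

```sh
#!/bin/bash
ORIGIN=/dev/snapper/origin
SNAP=/dev/snapper/snap2
OFFSET=4033847296            # byte offset of the write (from the xior)
COUNT=3115008                # bytes written (from the xior)
SKIP=$((OFFSET / 512))
BLOCKS=$((COUNT / 512))

# 1. Pre-read the snapshot region and hold the "before" image.
dd if=$SNAP of=/tmp/snap.before bs=512 skip=$SKIP count=$BLOCKS

# 2. Write a recognizable pattern over the same region of the origin.
head -c $COUNT /dev/zero | tr '\0' 'J' > /tmp/pattern
dd if=/tmp/pattern of=$ORIGIN bs=512 seek=$SKIP oflag=direct

# 3. Re-read the snapshot; it must still match the "before" image,
#    since a snapshot's contents are frozen at creation time.
dd if=$SNAP of=/tmp/snap.after bs=512 skip=$SKIP count=$BLOCKS
cmp /tmp/snap.before /tmp/snap.after && echo OK || echo "snapshot changed!"
```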
After seeing the failure, we then waited 10-15 minutes and did a straight dd off of the snapshot device where the corruption was noticed, and it was still there:

```
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:
```

Put into NEEDINFO state, as we are waiting for the problem reporter to confirm whether the problem remains in RHEL4 Update 4 (4.4).

This issue is still present in the latest RHEL 4.4 rpms:

```
[root@taft-02 bin]# rpm -q lvm2
lvm2-2.02.06-6.0.RHEL4B
[root@taft-02 bin]# rpm -q device-mapper
device-mapper-1.02.07-4.0.RHEL4B
```

Still here... 2.6.9-55.8.ELsmp, lvm2-2.02.27-1.el4:

```
[root@link-08 ~]# lvs -a -o +chunksize
  LV           VG         Attr   LSize  Origin Snap%  Move Log Copy%  Chunk
  LogVol00     VolGroup00 -wi-ao 72.44G                                    0
  LogVol01     VolGroup00 -wi-ao  1.94G                                    0
  block_snap16 snapper    swi-a-  2.50G origin   9.43                 16.00K
  block_snap32 snapper    swi-a-  2.50G origin   0.18                 32.00K
  origin       snapper    owi-a-  4.00G                                    0

Making snapshot block_snap32 of origin volume
Running block level I/O to the origin and verifying on snapshot 32
b_iogen starting up with the following:
Iterations:   15s
Seed:         9597
Offset-mode:  random
Single Pass:  off
Overlap Flag: on
Mintrans:     512000
Maxtrans:     5120000
Syscalls:     write writev
Flags:        direct
Test Devices:
    Path                 Size (bytes)
    ---------------------------------------------------------------
    /dev/snapper/origin  4294967296
Snap Devices:
    /dev/snapper/block_snap32

*** DATA COMPARISON ERROR /dev/snapper/block_snap32 (Offset 993153024) ***
Started at byte 1077249
Expected 0x2a(*) 0x48(H) 0x3a(:) 0x39(9) 0x34(4) 0x35(5) 0x34(4) 0x3a(:) 0x6c(l) 0x69(i) ...
Got      0x69(i) 0x6f(o) 0x2a(*) 0x51(Q) 0x3a(:) 0x39(9) 0x35(5) 0x39(9) 0x38(8) 0x3a(:) ...

block level IO failed with snapshot 32
```

I found the reason for this bug during source code review (although I was not yet able to reproduce it). Imagine that you have two snapshot devices with different chunk sizes: on the first snapshot you have a big chunk "A", and on the second snapshot you have two smaller chunks "B" and "C" covering the same origin range:

```
=====A=====
==B== ==C==
```

Now if you write to the origin at a location that hits chunks B and A, you create pending exceptions for B and A, link A's primary_pe pointer to B, and add the bio to B's queue. When both of these relocations finish, you dispatch the write to the origin device. If another write simultaneously hits chunks C and A, you create a pending exception for C, you don't link A's primary_pe pointer to C (because it is already linked to B), and you queue the bio on C. When the relocation of C finishes, you dispatch the second write to the origin, but the relocation of A may not have finished yet. That corrupts data on the snapshot with the big chunk size. I am working on a solution; it needs some code rewrite.

Updating PM score.

The patch is here: http://people.redhat.com/mpatocka/patches/kernel/merging/2.6.28/dm-snapshot-rework-origin-write.patch - but it is waiting for upstream acceptance, so it wouldn't be good to try to push the bug into RHEL now.
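To make the overlap in that analysis concrete: with 16k and 32k chunks, every 32k chunk on one snapshot spans two 16k chunks on the other, so two origin writes landing in adjacent 16k chunks can both require a copy-out of the *same* 32k chunk. A small illustrative calculation using the byte offset and length from the first failure above (the arithmetic is mine, not test output):

```sh
# Which chunk indices does one origin write touch on each snapshot?
offset=1222144000; len=4740608
for chunk in 16384 32768; do
    first=$((offset / chunk))
    last=$(( (offset + len - 1) / chunk ))
    echo "chunk size $chunk: write covers chunks $first..$last"
done
# chunk size 16384: write covers chunks 74593..74883
# chunk size 32768: write covers chunks 37296..37441
```

Every write that large keeps hundreds of pending exceptions in flight on both snapshots at once, which is exactly the situation where the primary_pe race described above can fire.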
Since the RHEL 4.8 External Beta has begun and this Bugzilla remains unresolved, it has been rejected, as it is not proposed as an exception or blocker.