Bug 182659
| Summary: | kernel dm: snapshots of the same origin with differing chunk sizes causes corruption |
|---|---|
| Product: | Red Hat Enterprise Linux 4 |
| Component: | kernel |
| Version: | 4.0 |
| Hardware: | All |
| OS: | Linux |
| Status: | CLOSED WONTFIX |
| Severity: | high |
| Priority: | high |
| Reporter: | Corey Marthaler <cmarthal> |
| Assignee: | Mikuláš Patočka <mpatocka> |
| CC: | agk, bstevens, coughlan, dwysocha, jbaron, jbrassow, mbroz, riek, syeghiay |
| Doc Type: | Bug Fix |
| Last Closed: | 2010-05-18 21:57:08 UTC |
| Bug Blocks: | 422551, 430698, 461297 |
Description
Corey Marthaler 2006-02-23 21:36:41 UTC
Fascinating. I never thought to try that before! It's supposed to work:

```
/*
 * Remember, different snapshots can have
 * different chunk sizes.
 */
```

I haven't reproduced this yet: I created two snapshots with different chunk sizes and wrote to the origin, but saw no corruption on the snapshots. Is that what your test is doing, or is it something more complicated? Is the output above from a *single* test, i.e. do you need *four* snapshots to reproduce this, and only if number 4 has a different chunk size?

I believe I was able to reproduce this with just two snapshots. I'll try it again and see...

And what does the 'data comparison error' tell you? Can you narrow down which data is corrupt? Where did it come from - in which step was it written, how much data was written in that step, and what proportion of it was corrupt? How much data is corrupt, e.g. everything written in a particular step, or just one or more chunks? Is 'Started at byte 3278848' the start of a section of data that was written, or part way through? [The easiest way to find out might be to reduce the amount of data written to a minimum and see whether it fails every time, never, or only some proportion of the time.] I'm trying to establish whether there's any randomness in the pattern of failures (e.g. caused by a race condition) or whether it's deterministic - given the same setup and I/O, would it always fail in exactly the same way? I looked at the code and haven't spotted the problem yet.

Also, can you run the validation after writing *twice* in succession, to confirm that it's not related to the known race condition in bug 175830? That bug can be detected because two successive reads of the "same" place on disk will return different data - corrupt data the first time, but correct data the second time. Also, the proportion of corrupt data will be small. If this were a general problem with different chunk sizes, the amount of corrupt data would presumably be larger and proportional to the amount of data written.

Parse error in "can you run the validation after writing *twice* in succession". Better: after the write phase, can you run the validation *twice* in succession? Also try a sync and a 10-second sleep after writing, before the read check, to see whether that makes any difference.

This is reproducible with just two snapshots, one with chunk size 16 and one with chunk size 32. However, we tried this with smaller origin and snapshot volumes (4M each, then 12M each) and were unable to reproduce it. So we tried again with the sizes from the original report (4G), only this time we completely filled the origin volume with the letter "A" before creating the two snapshots with different chunk sizes (so from here on out, no matter what we write to the origin, the snapshots should only ever contain "A"s; a sketch of this setup follows below). We then ran the following types of I/O requests to the origin, verifying both snapshots each time:

A. direct, overlapped, random
B. direct, non-overlapped, sequential
C. direct, non-overlapped, reverse
D. buffered, non-overlapped, reverse
E. sync, non-overlapped, sequential

We only hit the bug in the direct I/O cases, and only on the snapshot with the 32k chunk size.
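For anyone wanting to recreate that setup, here is a minimal sketch. It assumes an existing volume group named snapper with enough free space; the volume names and chunk sizes match the logs in this report, but the fill step is illustrative and not the exact b_iogen/b_doio invocation:

```sh
# Origin volume, filled end-to-end with 'A' so any later deviation
# on a snapshot is immediately detectable.
lvcreate -L 4G -n origin snapper
dd if=/dev/zero bs=1M count=4096 | tr '\0' 'A' \
    | dd of=/dev/snapper/origin bs=1M

# Two snapshots of the same origin with differing chunk sizes.
lvcreate -s -c 16k -L 2.5G -n snap1 /dev/snapper/origin
lvcreate -s -c 32k -L 2.5G -n snap2 /dev/snapper/origin

# From this point on, writes to the origin must never leak into the
# snapshots: both should read back as all 'A's.
```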
Here is a little more detail on what those failures looked like on that snapshot (remember, it should only contain "A"s):

```
# direct, overlap, random
cmdline:
[root@link-08 tmp]# /tmp/b_iogen -o -m random -f direct -i 5s -s write,writev -t1000b -T10000b -d /dev/snapper/origin:/dev/snapper/snap2 | /tmp/b_doio -vD 2> /tmp/foobar

request that failed:
----- xior ----
magic:     0xfeed10
type:      16 (write)
path:      /dev/snapper/origin
syscall:   write
oflags:    16386 (O_RDWR|O_DIRECT)
offset:    1222144000
count:     4740608
pattern:   V
devmap[0]: /dev/snapper/snap2

# at offset 1222144000 + 2543616 in the 32k chunk size snap:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAo*V:7986:link-08:b_doio*V:7986:link-08:b_doio*V:7986:link-08:b_doio*V:7986:link-08:b_doio*V:7986:link-08:b
```

```
# direct, sequential, non overlapping
cmdline:
/tmp/b_iogen -f direct -i 5s -s write,writev -t1000b -T10000b -d /dev/snapper/origin:/dev/snapper/snap2 | /tmp/b_doio -vD 2> /tmp/glarch

request that failed:
----- xior ----
magic:     0xfeed10
type:      16 (write)
path:      /dev/snapper/origin
syscall:   writev
oflags:    16386 (O_RDWR|O_DIRECT)
offset:    39311872
count:     4895232
pattern:   Q
devmap[0]: /dev/snapper/snap2

# at offset 39311872 + 1664512 in the 32k chunk size snap:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA:b_doio*Q:7989:link-08:b_doio*Q:7989:link-08:b_doio*Q:7989:link-08:b_doio*Q:7989:link-08:b_doio*Q:7989:link-08:b_doio*Q:798
```

```
# direct, reverse, non overlapping
cmdline:
/tmp/b_iogen -m reverse -f direct -i 5s -s write,writev -t1000b -T10000b -d /dev/snapper/origin:/dev/snapper/snap2 | /tmp/b_doio -vD 2> /tmp/doodie

request that failed:
----- xior ----
magic:     0xfeed10
type:      16 (write)
path:      /dev/snapper/origin
syscall:   write
oflags:    16386 (O_RDWR|O_DIRECT)
offset:    4033847296
count:     3115008
pattern:   J
devmap[0]: /dev/snapper/snap2

# at offset 4033847296 + 1843200 in the 32k chunk size snap:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAoio*J:7999:link-08:b_doio*J:7999:link-08:b_doio*J:7999:link-08:b_doio*J:7999:link-
```

Can you repeat the snapshot read test twice? I.e. you say '# at offset 4033847296 + 1843200 in the 32k chunk size snap:' - if you run something that reads the snapshot a second or third time, do you still see the same corruption? In other words, can you confirm that the corruption is genuinely on the disk and not a one-off bug in the read-back process? And if you repeat one of the tests completely, does the offset of the failure change? Does the test show *every* bit of corruption, or only the first it finds? (I.e. is it just a small piece of data that's corrupted, or is there lots more not shown? What do you mean by "count:" and "+ 1843200" etc.?) The repeated test would need to be B, to avoid the randomness within the test - the question is whether the failure always happens at exactly or approximately the same offset, or is completely random.

And can you reproduce it with smaller devices? (I.e. an order-of-magnitude test - 4 megabytes isn't enough and 4 gigabytes is, so what about 40M or 400M? How much RAM is in the machine? Does less RAM make the problem appear with a smaller device size?)

As I read it, the evidence so far points to some race condition, so we need to vary different things one at a time to see what makes the problem worse and what makes it go away.
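One low-tech way to answer the "is it genuinely on disk" question is to re-read the reported region with dd, bypassing the test harness entirely. A sketch using the reverse-I/O failure above (the sector arithmetic is mine: 4033847296 + 1843200 = 4035690496 bytes, which is sector 7882208):

```sh
# Dump 4 KB starting at the first reported corrupt byte.
dd if=/dev/snapper/snap2 bs=512 skip=7882208 count=8 2>/dev/null | od -c | head -20

# Read the same region again: if the identical non-'A' bytes show up
# both times, the corruption is on disk, not a transient read error.
dd if=/dev/snapper/snap2 bs=512 skip=7882208 count=8 2>/dev/null | od -c | head -20
```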
If it is a race, it might be worth trying a kernel with the patches I already submitted to U4, in case they have a positive impact. (It's just possible this is another manifestation of a problem already solved.)

I can add a few bits to answer some of the questions Alasdair has. The test stops checking the buffer for errors once it finds the first bad byte; it then dumps out a chunk beyond the corruption. We have some extra code in place right now to dump the entire "expected" buffer as well as the buffer we read; if that output would help, we can add it to the bug.

As for the "count:" and "+ XXXXXXXX" in the report: the xior shown:

```
----- xior ----
magic:     0xfeed10
type:      16 (write)
path:      /dev/snapper/origin
syscall:   write
oflags:    16386 (O_RDWR|O_DIRECT)
offset:    4033847296
count:     3115008
pattern:   J
devmap[0]: /dev/snapper/snap2
```

would cause b_doio to generate a pattern buffer 3115008 bytes long using 'J' as a seed, then write it (in this example using write(); the test can request writev, pwrite, etc.) to /dev/snapper/origin, starting at offset 4033847296. Thus the repeating b_doio*J:7999: pattern seen in the snapshot read is what was written to the origin - the read is returning a mix of the original 'A' data and what was just written to the origin, rather than just random data.

Prior to the write(), b_doio pre-reads the snapshot(s) (devmap in the xior) and holds that buffer, writes to the origin, re-reads the origin and checks the buffer, then re-reads all of the snapshots and checks each buffer against what it read off that snapshot prior to the write to the origin. So '# at offset 4033847296 + 1843200 in the 32k chunk size snap' means b_doio detected the first byte of bad data 1843200 bytes into the buffer it read starting at offset 4033847296 on the snapshot. (The full buffer size is 'count', as in the xior.)

We can store the xior stream on disk and replay it, which should let us see whether we hit the failure at exactly the same place. We can add a multiple re-read at the snapshot verify step as well.

Thanks for adding those details. Definitely repeat the "then re-reads all of the snaps and checks that buffer against what we read off the snap prior to the write to the origin" step twice, and report whether it returns wrong data the first and/or second time. The result of that test should show whether or not this is the same as bug 175830. If so, having two different chunk sizes must be making the race more likely.

A re-read of the snapshot data shows the same corrupted data as the first read:

```
*** DATA COMPARISON ERROR /dev/snapper/snap2 (Offset 387887616) ***
Started at byte 1674752
Expected 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) ...
Got      0x3a(:) 0x6c(l) 0x69(i) 0x6e(n) 0x6b(k) 0x2d(-) 0x30(0) 0x38(8) 0x3a(:) 0x62(b) ...

*** DATA COMPARISON ERROR /dev/snapper/snap2 (Offset 387887616) ***
Started at byte 1674752
Expected 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) 0x41(A) ...
Got      0x3a(:) 0x6c(l) 0x69(i) 0x6e(n) 0x6b(k) 0x2d(-) 0x30(0) 0x38(8) 0x3a(:) 0x62(b) ...
```
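The pre-read / write / re-read sequence described above can be approximated with plain shell tools for anyone without the b_doio harness. A rough sketch, not the b_doio implementation; the offset and count are taken from the xior above (both happen to be 512-aligned), and the temp-file names are my own:

```sh
#!/bin/bash
ORIGIN=/dev/snapper/origin
SNAP=/dev/snapper/snap2
OFFSET=4033847296            # byte offset of the write (from the xior)
COUNT=3115008                # bytes written (from the xior)
SKIP=$((OFFSET / 512))
BLOCKS=$((COUNT / 512))

# 1. Pre-read the snapshot region and hold the "before" image.
dd if=$SNAP of=/tmp/snap.before bs=512 skip=$SKIP count=$BLOCKS

# 2. Write a recognizable pattern over the same region of the origin.
head -c $COUNT /dev/zero | tr '\0' 'J' > /tmp/pattern
dd if=/tmp/pattern of=$ORIGIN bs=512 seek=$SKIP oflag=direct

# 3. Re-read the snapshot; it must still match the "before" image,
#    since a snapshot's contents are frozen at creation time.
dd if=$SNAP of=/tmp/snap.after bs=512 skip=$SKIP count=$BLOCKS
cmp /tmp/snap.before /tmp/snap.after && echo OK || echo "snapshot changed!"
```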
After seeing the failure, we then waited 10-15 minutes and did a straight dd off of the snapshot device where the corruption was noticed, and it was still there:

```
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:link-08:b_doio*R:12127:
```

Put into NEEDINFO state, as we are waiting for the problem reporter to confirm whether the problem remains in RHEL4 Update 4 (4.4).

This issue is still present in the latest RHEL 4.4 rpms:

```
[root@taft-02 bin]# rpm -q lvm2
lvm2-2.02.06-6.0.RHEL4B
[root@taft-02 bin]# rpm -q device-mapper
device-mapper-1.02.07-4.0.RHEL4B
```

Still here... 2.6.9-55.8.ELsmp, lvm2-2.02.27-1.el4:

```
[root@link-08 ~]# lvs -a -o +chunksize
  LV           VG         Attr   LSize  Origin Snap%  Move Log Copy%  Chunk
  LogVol00     VolGroup00 -wi-ao 72.44G                                    0
  LogVol01     VolGroup00 -wi-ao  1.94G                                    0
  block_snap16 snapper    swi-a-  2.50G origin   9.43                 16.00K
  block_snap32 snapper    swi-a-  2.50G origin   0.18                 32.00K
  origin       snapper    owi-a-  4.00G                                    0

Making snapshot block_snap32 of origin volume
Running block level I/O to the origin and verifying on snapshot 32
b_iogen starting up with the following:
Iterations:   15s
Seed:         9597
Offset-mode:  random
Single Pass:  off
Overlap Flag: on
Mintrans:     512000
Maxtrans:     5120000
Syscalls:     write writev
Flags:        direct
Test Devices:
    Path                 Size (bytes)
    ---------------------------------------------------------------
    /dev/snapper/origin  4294967296
Snap Devices:
    /dev/snapper/block_snap32

*** DATA COMPARISON ERROR /dev/snapper/block_snap32 (Offset 993153024) ***
Started at byte 1077249
Expected 0x2a(*) 0x48(H) 0x3a(:) 0x39(9) 0x34(4) 0x35(5) 0x34(4) 0x3a(:) 0x6c(l) 0x69(i) ...
Got      0x69(i) 0x6f(o) 0x2a(*) 0x51(Q) 0x3a(:) 0x39(9) 0x35(5) 0x39(9) 0x38(8) 0x3a(:) ...

block level IO failed with snapshot 32
```

I found the reason for this bug during source code review (although I was not yet able to reproduce it). Imagine that you have two snapshot devices with different chunk sizes: on the first snapshot you have a big chunk "A", and on the second snapshot you have two smaller chunks "B" and "C" covering the same origin range:

```
=====A=====
==B== ==C==
```

Now if you write to the origin at a location that hits chunks B and A, you create pending exceptions for B and A, link A's primary_pe pointer to B, and add the bio to B's queue. When both of these relocations finish, you dispatch the write to the origin device. If another write simultaneously hits chunks C and A, you create a pending exception for C, you don't link A's primary_pe pointer to C (because it is already linked to B), and you queue the bio on C. When the relocation of C finishes, you dispatch the second write to the origin, but the relocation of A may not have finished yet. That corrupts data on the snapshot with the big chunk size. I am working on a solution; it needs some code rewrite.

Updating PM score.

The patch is here: http://people.redhat.com/mpatocka/patches/kernel/merging/2.6.28/dm-snapshot-rework-origin-write.patch - but it is waiting for upstream acceptance, so it wouldn't be good to try to push the bug into RHEL now.
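To make the overlap in that analysis concrete: with 16k and 32k chunks, every 32k chunk on one snapshot spans two 16k chunks on the other, so two origin writes landing in adjacent 16k chunks can both require a copy-out of the *same* 32k chunk. A small illustrative calculation using the byte offset and length from the first failure above (the arithmetic is mine, not test output):

```sh
# Which chunk indices does one origin write touch on each snapshot?
offset=1222144000; len=4740608
for chunk in 16384 32768; do
    first=$((offset / chunk))
    last=$(( (offset + len - 1) / chunk ))
    echo "chunk size $chunk: write covers chunks $first..$last"
done
# chunk size 16384: write covers chunks 74593..74883
# chunk size 32768: write covers chunks 37296..37441
```

Every write that large keeps hundreds of pending exceptions in flight on both snapshots at once, which is exactly the situation where the primary_pe race described above can fire.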
Since the RHEL 4.8 External Beta has begun and this Bugzilla remains unresolved, it has been rejected, as it is not proposed as an exception or blocker.