Bug 174742
Summary: | corrupted snapshot filesystems: Error reading/writing snapshot | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Corey Marthaler <cmarthal> |
Component: | lvm2 | Assignee: | Alasdair Kergon <agk> |
Status: | CLOSED NOTABUG | QA Contact: | |
Severity: | medium | Docs Contact: | |
Priority: | medium | | |
Version: | 4.3 | CC: | agk, jbrassow, kanderso, rkenna |
Target Milestone: | --- | | |
Target Release: | --- | | |
Hardware: | All | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2006-01-25 19:11:41 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
Corey Marthaler
2005-12-01 20:43:20 UTC
I've just got an 'end of device' error after writing a single large piece of data to the origin that was too big for the snapshot, but haven't succeeded in reproducing it :-( See comments 17 & 18 of bug 174636.

I've been unable to reproduce this on kernel 2.6.9-25.EL. Is this reproducible enough that you can leave a test running for a certain period of time and be fairly sure you'll see it happen? Are the snapshots being mounted read-write? If so, can you change the mount options to mount them *read only* and see if you can still get it to go wrong? Try with the 1st read-write like now and the 2nd read-only. Then with both read-only.

I'm still seeing the filesystem corruption when running the test snapper on the latest kernel/rpms, though not necessarily after the second snapshot attempt, but after creating a few snaps, then recreating the origin and then attempting to write to the snaps.

[root@link-08 ~]# uname -ar
Linux link-08 2.6.9-28.ELsmp #1 SMP Fri Jan 13 17:08:22 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
[root@link-08 ~]# rpm -q device-mapper
device-mapper-1.02.02-3.0.RHEL4
[root@link-08 ~]# rpm -q lvm2
lvm2-2.02.01-1.3.RHEL4
[root@link-08 ~]# rpm -q lvm2-cluster
lvm2-cluster-2.02.01-1.2.RHEL4

The snapshots are obviously being mounted read-write and are able to be written to before the failures start.

/dev/mapper/snapper-snap1 on /mnt/snap1 type ext3 (rw)
/dev/mapper/snapper-snap2 on /mnt/snap2 type ext3 (rw)
/dev/mapper/snapper-origin on /mnt/origin type ext3 (rw)
/dev/mapper/snapper-snap3 on /mnt/snap3 type ext3 (rw)

I can try this read-only like you suggest and then see if I get read errors...

Yes, please try read-only. We need to try to narrow down what could be happening. Try adding 'sync' or 'sleep' at multiple places in the test and see if it goes away. How many CPUs in the test machine? Can you attach the actual sequence of commands you're running when you see the problem?
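Before re-running the read-only variant suggested above, it helps to confirm how each snapshot is actually mounted. The sketch below is an illustration, not part of the bug's test: a small helper that classifies a `/proc/mounts`-style line as `ro` or `rw`; the device names in the examples are the ones from this report.

```shell
# Classify one /proc/mounts-style line as "ro" or "rw" by scanning
# its comma-separated option field (field 4). Illustration only.
mount_mode() {
    echo "$1" | awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro" || opts[i] == "rw") { print opts[i]; exit }
    }'
}

# Lines corresponding to the mount output in this report:
mount_mode '/dev/mapper/snapper-snap1 /mnt/snap1 ext3 rw 0 0'          # rw
mount_mode '/dev/mapper/snapper-snap2 /mnt/snap2 ext3 ro,noatime 0 0'  # ro
```

Checking `/proc/mounts` rather than the test script's own bookkeeping avoids a run where a remount silently failed and the snapshot was still read-write.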
Does everything run in a single sequence, or are multiple commands issued in parallel? (I'm wondering if this could be the first reproducible test for bug 175830 - but I need to find out more of its characteristics first - there are still other simpler things that might be wrong giving similar symptoms.)

I've got this narrowed down to re-mkfs'ing a snapshotted origin with a mounted snapshot. This happens whether or not the snapshot is mounted read-only or read-write.

A. Write to an origin fs
B. Snapshot the origin
C. Mount the snapshot fs
D. Remake the origin fs
E. Attempt to write to the snapshot (if rw), or attempt to umount and then remount the snapshot (if ro)
[the snapshot filesystem will then be corrupt]

I've still not managed to reproduce this yet. Can you post the actual script you're using, as I had to assume the order of the steps you didn't list? (e.g. you can't remake the origin fs without unmounting the origin first - but at which point do you do that unmount?) Also try repeating the test with 'sync; sleep 10' inserted between every command.

Aha! By changing the placement of the umount I've found something that goes wrong... but only the "unchecked fs" warnings of bug 178705 - I haven't identified any actual snapshot corruption yet, and I don't see the warning with ext3. As well as the sync;sleep test, can you reproduce with ext3 (with a journal) rather than ext2?

I can't try anything right now as my machines are dedicated to regression testing of RHEL4U3.

I was also able to hit this by mkfs'ing an *unmounted* snapshot. So here are the more detailed steps I did to reproduce this every time:

A. Write to an origin fs
# echo "AAAAAAAAAAAAAAAAAAAAAAAAAAAA" > /mnt/origin
# echo "BBBBBBBBBBBBBBBBBBBBBBBBBBBB" > /mnt/origin
# echo "CCCCCCCCCCCCCCCCCCCCCCCCCCCC" > /mnt/origin
B. Snapshot the origin
# lvcreate -s /dev/snapper/origin -L 50M -n snap1
C. Mount the snapshot fs
# mount /dev/snapper/snap1 /mnt/snap1
D. Unmount the origin
# umount /mnt/origin
E.
Remake the origin fs
# mkfs /dev/snapper/origin
F. Choose one of the following:
F1. Attempt to write to the snapshot (if rw)
# echo "DDDDDDDDDDDDDDDDDDDDDDDD" > /mnt/snap1
or F2. Attempt to umount and then remount the snapshot (if ro)
# umount /mnt/snap1
# mount /dev/snapper/snap1 /mnt/snap1
or F3. Mkfs a snapshot
# umount /mnt/snap1
# mkfs /dev/snapper/snap1

Regarding:
# echo "AAAAAAAAAAAAAAAAAAAAAAAAAAAA" > /mnt/origin
# echo "BBBBBBBBBBBBBBBBBBBBBBBBBBBB" > /mnt/origin
# echo "CCCCCCCCCCCCCCCCCCCCCCCCCCCC" > /mnt/origin
- did you mean /mnt/origin/A, B and C? With F1, F2 and F3, how does the corruption show up? Do you have to run something else to see it, or do the commands or kernel give errors?

Yes:
# echo "AAAAAAAAAAAAAAAAAAAAAAAAAAAA" > /mnt/origin/A
The corruption shows up just as mentioned in the original bug report. You wait about 1 - 3 seconds and then boom: tons of I/O errors and ext errors in the log, and input/output errors on any read/write attempt of the snapshot.

From "1310720 inodes, 2621440 blocks" I deduce the origin is about 10GB. From comment 16 the snapshot is 50MB. The mkfs is making 160MB of changes to the origin. That doesn't fit into 50MB, so the snapshot simply gets dropped. The fix here is to allocate more space for the snapshot in the test script.

Running out of space is a hazard when you're using snapshots - it's important to monitor the amount of free disk space left (e.g. with 'lvs') and extend them if necessary before they fill up. This is mentioned in the HOWTO http://www.tldp.org/HOWTO/LVM-HOWTO/snapshots_backup.html and elsewhere. Maybe one day there'll be automatic handling of situations like this. In the meantime, we can close this one - not a bug.
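The reproducer steps A-F above can be collected into one sequence. Below is a hedged, dry-run sketch: `run` only prints each command, so the script is safe to execute anywhere; dropping the `run` prefix (on a scratch machine, as root) would reproduce for real. The device names, mount points, and the too-small 50M snapshot size come from this report.

```shell
# Dry-run reproducer for steps A-F1: each command is printed, not executed.
run() { CMDS="${CMDS}$*
"; echo "+ $*"; }

run sh -c 'echo AAAA > /mnt/origin/A'                # A. write to the origin fs
run lvcreate -s /dev/snapper/origin -L 50M -n snap1  # B. snapshot the origin
run mount /dev/snapper/snap1 /mnt/snap1              # C. mount the snapshot fs
run umount /mnt/origin                               # D. unmount the origin
run mkfs /dev/snapper/origin                         # E. remake the origin fs
run sh -c 'echo DDDD > /mnt/snap1/D'                 # F1. write to the snapshot
```

As the closing comment explains, step E alone writes more changes to the origin than the 50M snapshot can absorb, so by step F1 the snapshot has already been invalidated.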
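The closing deduction can be checked with simple arithmetic. The numbers below are the ones quoted in this bug (2621440 blocks, 160MB of mkfs changes, a 50MB snapshot); the 4096-byte block size is an assumption consistent with the ~10GB figure, and the threshold test is an illustration, not lvm2's actual logic.

```shell
# Back-of-envelope check of the closing analysis.
blocks=2621440                 # from the mkfs output quoted in the bug
block_size=4096                # bytes; assumed ext3 block size
origin_gib=$(( blocks * block_size / 1073741824 ))
echo "origin size: ${origin_gib} GiB"        # prints "origin size: 10 GiB"

cow_needed_mb=160              # changes mkfs writes to the origin
snap_size_mb=50                # from: lvcreate -L 50M
if [ "$cow_needed_mb" -gt "$snap_size_mb" ]; then
    echo "snapshot would be invalidated (dropped)"
fi
```

In practice, `lvs` would show the snapshot's usage climbing toward 100% during the mkfs; extending it beforehand with something like `lvextend -L +200M /dev/snapper/snap1` (the size here is illustrative) would avoid the drop.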