Description of problem:
Unlike bz 173209 where the origin fs becomes corrupt, here the 2nd snapshot of that fs becomes corrupt.

# snapper test output:
Creating origin volume
  Logical volume "origin" created
Placing an ext filesystem on origin volume
mke2fs 1.35 (28-Feb-2004)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
1310720 inodes, 2621440 blocks
131072 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=2684354560
80 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632

Writing inode tables: done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 25 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
Mounting origin volume
Writing files ABC to origin volume
Making 1st snapshot of origin volume
  Rounding up size to full physical extent 52.00 MB
  Logical volume "snap1" created
Mounting 1st snap volume
Checking files on 1st snap volume
/mnt/snap1/A
/mnt/snap1/B
/mnt/snap1/C
Writing files DEF to origin volume
Making 2nd snapshot of origin volume
  Rounding up size to full physical extent 52.00 MB
  Logical volume "snap2" created
Mounting 2nd snap volume
/dev/snapper/snap2: Input/output error
mount: /dev/snapper/snap2: can't read superblock
couldn't mount 2nd snap on /mnt/snap2

SYSLOG after 2nd snap mount attempt:
Dec 1 14:29:37 morph-01 kernel: attempt to access beyond end of device
Dec 1 14:29:37 morph-01 kernel: dm-6: rw=17, want=2736122528, limit=106496
Dec 1 14:29:37 morph-01 kernel: device-mapper: Error reading/writing snapshot
Dec 1 14:29:38 morph-01 kernel: device-mapper: snapshot is marked invalid
Dec 1 14:29:38 morph-01 kernel: Buffer I/O error on device dm-7, logical block 0
Dec 1 14:29:38 morph-01 kernel: Buffer I/O error on device dm-7, logical block 1
Dec 1 14:29:38 morph-01 kernel: Buffer I/O error on device dm-7, logical block 2
Dec 1 14:29:38 morph-01 kernel: Buffer I/O error on device dm-7, logical block 3
Dec 1 14:29:38 morph-01 kernel: Buffer I/O error on device dm-7, logical block 4
Dec 1 14:29:38 morph-01 kernel: Buffer I/O error on device dm-7, logical block 5
Dec 1 14:29:38 morph-01 kernel: Buffer I/O error on device dm-7, logical block 6
Dec 1 14:29:38 morph-01 kernel: Buffer I/O error on device dm-7, logical block 7
Dec 1 14:29:38 morph-01 kernel: Buffer I/O error on device dm-7, logical block 8
Dec 1 14:29:38 morph-01 kernel: Buffer I/O error on device dm-7, logical block 9
Dec 1 14:29:38 morph-01 kernel: Buffer I/O error on device dm-7, logical block 10
Dec 1 14:29:38 morph-01 kernel: Buffer I/O error on device dm-7, logical block 11
Dec 1 14:29:38 morph-01 kernel: Buffer I/O error on device dm-7, logical block 12
Dec 1 14:29:38 morph-01 kernel: Buffer I/O error on device dm-7, logical block 13
Dec 1 14:29:38 morph-01 kernel: Buffer I/O error on device dm-7, logical block 14
Dec 1 14:29:38 morph-01 kernel: Buffer I/O error on device dm-7, logical block 15
Dec 1 14:29:38 morph-01 kernel: Buffer I/O error on device dm-7, logical block 0
Dec 1 14:29:38 morph-01 kernel: Buffer I/O error on device dm-7, logical block 8
Dec 1 14:29:38 morph-01 kernel: FAT: unable to read boot sector

Version-Release number of selected component (if applicable):
device-mapper-1.02.01-1.0.RHEL4
lvm2-2.02.01-1.0.RHEL4

How reproducible:
I've only seen this twice now.
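For reference, the test sequence above boils down to roughly these commands (a reconstruction from the output, not the actual snapper script; the VG name "snapper" and the mount points are taken from the logs, and the origin size is inferred from the mke2fs block count):

# lvcreate -n origin -L 10G snapper
# mkfs /dev/snapper/origin
# mount /dev/snapper/origin /mnt/origin
(write files A, B, C to /mnt/origin)
# lvcreate -s /dev/snapper/origin -L 50M -n snap1
# mount /dev/snapper/snap1 /mnt/snap1
(check files A, B, C on /mnt/snap1)
(write files D, E, F to /mnt/origin)
# lvcreate -s /dev/snapper/origin -L 50M -n snap2
# mount /dev/snapper/snap2 /mnt/snap2    <- this is the mount that fails with I/O errors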
I've just got an 'end of device' error after writing a single large piece of data to the origin that was too big for the snapshot.
But I haven't succeeded in reproducing it :-(
See comments 17 & 18 of bug 174636.

I've been unable to reproduce this on kernel 2.6.9-25.EL.
Is this reproducible enough that you can leave a test running for a certain period of time and be fairly sure you'll see it happen? Are the snapshots being mounted read-write? If so, can you change the mount options to mount them *read-only* and see if you can still get it to go wrong? Try with the 1st read-write as now and the 2nd read-only, then with both read-only.
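i.e. something like this, using the device names from this report (1st snap read-write as now, 2nd read-only):

# mount /dev/snapper/snap1 /mnt/snap1
# mount -o ro /dev/snapper/snap2 /mnt/snap2

and then a second run with '-o ro' on both mounts.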
I'm still seeing the filesystem corruption when running the snapper test on the latest kernel/rpms, though not necessarily after the second snapshot attempt but after creating a few snaps, then recreating the origin and then attempting to write to the snaps.

[root@link-08 ~]# uname -ar
Linux link-08 2.6.9-28.ELsmp #1 SMP Fri Jan 13 17:08:22 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
[root@link-08 ~]# rpm -q device-mapper
device-mapper-1.02.02-3.0.RHEL4
[root@link-08 ~]# rpm -q lvm2
lvm2-2.02.01-1.3.RHEL4
[root@link-08 ~]# rpm -q lvm2-cluster
lvm2-cluster-2.02.01-1.2.RHEL4

The snapshots are obviously being mounted read-write and can be written to before the failures start:

/dev/mapper/snapper-snap1 on /mnt/snap1 type ext3 (rw)
/dev/mapper/snapper-snap2 on /mnt/snap2 type ext3 (rw)
/dev/mapper/snapper-origin on /mnt/origin type ext3 (rw)
/dev/mapper/snapper-snap3 on /mnt/snap3 type ext3 (rw)

I can try this read-only as you suggest and then see if I get read errors...
Yes, please try read-only. We need to try to narrow down what could be happening. Try adding 'sync' or 'sleep' at multiple places in the test and see if it goes away. How many CPUs in the test machine? Can you attach the actual sequence of commands you're running when you see the problem? Does everything run in a single sequence, or are multiple commands issued in parallel?
(I'm wondering if this could be the first reproducible test for bug 175830 - but need to find out more of its characteristics first - there are still other simpler things that might be wrong giving similar symptoms.)
I've got this narrowed down to re-mkfs'ing a snapshotted origin while a snapshot is mounted. This happens whether the snapshot is mounted read-only or read-write.

A. Write to an origin fs
B. Snapshot the origin
C. Mount the snapshot fs
D. Remake the origin fs
E. Attempt to write to the snapshot (if rw)
   or
E. Attempt to umount and then remount the snapshot (if ro)

[the snapshot filesystem will then be corrupt]
I've still not managed to reproduce this yet. Can you post the actual script you're using, as I had to assume the order of the steps you didn't list? (e.g. you can't remake the origin fs in step D without unmounting the origin first - but at which point do you do that unmount?)
Also try repeating the test with 'sync; sleep 10' inserted between every command.
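For example, something like this (a sketch assuming commands along the lines of the snapper test - substitute the actual script lines):

# echo "AAAA..." > /mnt/origin/A
# sync; sleep 10
# lvcreate -s /dev/snapper/origin -L 50M -n snap1
# sync; sleep 10
# mount /dev/snapper/snap1 /mnt/snap1
# sync; sleep 10

...and so on between every command.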
Aha! By changing the placement of the umount I've found something that goes wrong...
But only the "unchecked fs" warnings from bug 178705 - I've not identified any actual snapshot corruption yet.
And I don't see the warning with ext3.
As well as the sync;sleep test, can you reproduce this with ext3 (i.e. with a journal) rather than ext2?
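i.e. using something along these lines for the mkfs steps (same device name as in the steps above):

# mkfs -t ext3 /dev/snapper/origin

(or equivalently 'mke2fs -j' to get a journal).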
I can't try anything right now as my machines are dedicated to regression testing of RHEL4U3.

I was also able to hit this by mkfs'ing an *unmounted* snapshot. So here are the more detailed steps I did to reproduce this every time.

A. Write to an origin fs
# echo "AAAAAAAAAAAAAAAAAAAAAAAAAAAA" > /mnt/origin
# echo "BBBBBBBBBBBBBBBBBBBBBBBBBBBB" > /mnt/origin
# echo "CCCCCCCCCCCCCCCCCCCCCCCCCCCC" > /mnt/origin

B. Snapshot the origin
# lvcreate -s /dev/snapper/origin -L 50M -n snap1

C. Mount the snapshot fs
# mount /dev/snapper/snap1 /mnt/snap1

D. Unmount the origin
# umount /mnt/origin

E. Remake the origin fs
# mkfs /dev/snapper/origin

F. Choose one of the following:

F1. Attempt to write to the snapshot (if rw)
# echo "DDDDDDDDDDDDDDDDDDDDDDDD" > /mnt/snap1

or

F2. Attempt to umount and then remount the snapshot (if ro)
# umount /mnt/snap1
# mount /dev/snapper/snap1 /mnt/snap1

or

F3. Mkfs a snapshot
# umount /mnt/snap1
# mkfs /dev/snapper/snap1
# echo "AAAAAAAAAAAAAAAAAAAAAAAAAAAA" > /mnt/origin
# echo "BBBBBBBBBBBBBBBBBBBBBBBBBBBB" > /mnt/origin
# echo "CCCCCCCCCCCCCCCCCCCCCCCCCCCC" > /mnt/origin

Do you mean /mnt/origin/A, B and C?
With F1, F2 and F3 how does the corruption show up? Do you have to run something else to see it, or do the commands or kernel give errors?
Yes:

# echo "AAAAAAAAAAAAAAAAAAAAAAAAAAAA" > /mnt/origin/A

The corruption shows up just as described in the original bug report. You wait 1-3 seconds and then boom: tons of I/O errors and ext errors in the log, and input/output errors on any read/write attempt on the snapshot.
From "1310720 inodes, 2621440 blocks" I deduce the origin is about 10GB. From comment 16 the snapshot is 50MB. The mkfs is making 160MB of changes to the origin. That doesn't fit into 50MB, so the snapshot simply gets dropped.

The fix here is to allocate more space for the snapshot in the test script. Running out of space is a hazard when you're using snapshots - it's important to monitor the amount of free disk space left (e.g. with 'lvs') and extend the snapshots if necessary before they fill up. This is mentioned in the HOWTO http://www.tldp.org/HOWTO/LVM-HOWTO/snapshots_backup.html and elsewhere.

Maybe one day there'll be automatic handling of situations like this. In the meantime, we can close this one - not a bug.
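For completeness, a sketch of what that looks like with the names from this report (the sizes here are purely illustrative):

# lvs snapper
(the Snap% column shows how full each snapshot is)

# lvextend -L +100M /dev/snapper/snap1
(grow a snapshot before it fills up)

or simply create the snapshot large enough up front to absorb a full re-mkfs of the origin:

# lvcreate -s /dev/snapper/origin -L 200M -n snap1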