Red Hat Bugzilla – Bug 125646
files restored from dump are corrupt
Last modified: 2013-07-02 19:00:39 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040510
Description of problem:
When dumping only a set of files (not an entire partition), in some cases
files in the dump archive are corrupted. My colleague Don Tiessen
discovered this with dump-0.4b27-3 on Red Hat 7.3, and I confirmed it and
did some additional testing with dump-0.4b33-3 on Fedora Core 2.
The corruption is not always reproducible. Sometimes (but only
sometimes) everything goes fine; sometimes some files are corrupted.
What I did was to set up the directory /root/test and copy the current set
of official RPM updates for Fedora Core 2 into it as test files. Then I did:
# cd /root
# dump -0 -f test1.dump test
# restore -ivf test1.dump
# mv root root1
# cd root1/test
# for a in *; do diff $a /root/test/$a; done
I repeated this test three times (changing 1 to 2 and 3 in the above
commands). The first and third time I ended up with some files corrupted;
the second time everything was correctly restored. Dump and restore
did not print any error messages.
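The manual round-trip test above can be automated. The helper below is a hypothetical sketch (the paths and the dump/restore invocations are only assumptions based on this report); the comparison uses cmp so that corrupt files of identical size are still caught. The demonstration at the bottom runs the check on two throwaway directories, so it works even without dump installed; in the real test, the second directory would be the restored tree.

```shell
#!/bin/sh
# Compare every file in $1 against the file of the same name in $2
# and report any mismatch. With dump/restore this would be called as:
#   dump -0 -f test1.dump /root/test
#   restore -rf test1.dump          # into a scratch directory
#   verify_roundtrip /root/test scratch/root/test
verify_roundtrip() {
    src="$1"; restored="$2"
    for f in "$src"/*; do
        name=$(basename "$f")
        # cmp -s is silent and exits non-zero when contents differ
        if ! cmp -s "$f" "$restored/$name"; then
            echo "CORRUPT: $name"
        fi
    done
    echo "verification done"
}

# Demonstration on two throwaway directories (no dump/restore involved):
tmp=$(mktemp -d)
mkdir "$tmp/orig" "$tmp/copy"
printf 'abc' > "$tmp/orig/a"; printf 'abc' > "$tmp/copy/a"   # identical
printf 'xyz' > "$tmp/orig/b"; printf 'xyJ' > "$tmp/copy/b"   # same size, corrupt
verify_roundtrip "$tmp/orig" "$tmp/copy"   # prints "CORRUPT: b"
rm -rf "$tmp"
```

Checksums or cmp are preferable to diff here, since diff's "Binary files ... differ" output does not say where or how badly the file diverged.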
The results of one run:
Binary files ipsec-tools-0.2.5-2.i386.rpm and /root/test/ipsec-tools-0.2.5-2.i386.rpm differ
Binary files kdelibs-3.2.2-6.i386.rpm and /root/test/kdelibs-3.2.2-6.i386.rpm differ
Binary files kdelibs-devel-3.2.2-6.i386.rpm and /root/test/kdelibs-devel-3.2.2-6.i386.rpm differ
Binary files subversion-perl-1.0.2-2.1.i386.rpm and /root/test/subversion-perl-1.0.2-2.1.i386.rpm differ
The results of another run:
Binary files cups-1.1.20-11.1.i386.rpm and /root/test/cups-1.1.20-11.1.i386.rpm differ
Binary files php-ldap-4.3.6-5.i386.rpm and /root/test/php-ldap-4.3.6-5.i386.rpm differ
Binary files php-pear-4.3.6-5.i386.rpm and /root/test/php-pear-4.3.6-5.i386.rpm differ
Binary files subversion-1.0.2-2.1.i386.rpm and /root/test/subversion-1.0.2-2.1.i386.rpm differ
The corrupted files have the same size as the original files.
Then I did two additional rounds of tests.
In the second round of tests, I put the dump file on a different partition
(original files on hda1, dump on hda6). In this case I was not able
to reproduce the problem (but it might be that I wasn't trying hard
enough).
In the third round of tests, I dumped the entire partition. Again, I was not
able to reproduce the problem (but it might be that I wasn't trying hard
enough).
I'm reporting the bug with high severity, since it might result in
loss of valuable data (we all trust utilities such as dump or tar to
keep our data safe).
Version-Release number of selected component (if applicable):
dump-0.4b33-3 (originally seen with dump-0.4b27-3)
Steps to Reproduce:
1. dump -0 -f dump_file dir
2. restore -ivf dump_file
3. diff between original and restored files
Actual Results: The restored files were different from the original files.
Expected Results: The restored files should be the same as the original files.
Your trust is misplaced if you are using tar and/or dump to back up
active filesystems. dump has never promised reliable archives on active
Unix systems, and the problem is worse on Linux: because of
the lack of a raw character device, output read from the block device is
subject to change while cached by the kernel.
I'm pretty sure that's what the problem is, but I
will upgrade to -b36 and double check that the problem
exists there as well.
I know all about the dangers of backing up files on an active file
system. However, I expect that files that are not active (no processes
have them open for writing) should be backed up fine. The files that
I was trying to archive using dump were untouched by any process on
the system for a long enough time that the kernel had certainly flushed
all pages to the disk.
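Rather than relying on the kernel's writeback timers, the cached state can be pushed out explicitly before dumping. The sketch below is an assumption on my part, not a guarantee of consistency (dirty pages can reappear the moment anything writes again); the device name is a placeholder and the commands need root, so the function is only defined here, not executed.

```shell
#!/bin/sh
# Hypothetical pre-dump flush: write out dirty pages, then invalidate the
# block-device buffers so a subsequent raw read sees on-disk data.
flush_before_dump() {
    dev="$1"                      # e.g. /dev/hda1 (placeholder)
    sync                          # commit dirty pages to disk
    blockdev --flushbufs "$dev"   # drop stale block-device buffers
}

echo "flush_before_dump defined (not run)"
```

Even with this, the window between the flush and the end of the dump is unprotected, which is exactly the race being discussed in this report.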
Anyhow, in case (parts of) the files are still in the cache and not
committed to disk, shouldn't the kernel return the information from the cache?
Jeff: you are probably correct, this looks very much like an active
filesystem problem.
Aleksandar: dump bypasses the kernel cache when it accesses the raw
disk, so if the kernel hasn't flushed the data, dump will not see it.
Even worse, if the kernel has flushed only part of the data (for
example, flushed the inode metadata but not the data blocks), dump can
see an invalid filesystem.
Most of the time dumping a mounted filesystem works just fine, but
there is no guarantee, and you'd better run restore -C to verify it.
On the other hand, dumping an unmounted filesystem (or a filesystem
snapshot created by LVM/EVMS) is 100% guaranteed to be valid.
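The snapshot workflow described here can be sketched as follows. The volume-group, mount-point and archive names are hypothetical, and the function is only defined, not executed, since lvcreate and dump need root and a real volume group:

```shell
#!/bin/sh
# Hypothetical LVM snapshot dump: dump a read-only snapshot so the image
# cannot change mid-dump, then use restore -C to verify the archive.
snapshot_dump() {
    vg=vg0; lv=home                            # placeholder volume names
    lvcreate -s -L 1G -n "${lv}_snap" "/dev/$vg/$lv"
    mount -o ro "/dev/$vg/${lv}_snap" /mnt/snap
    dump -0 -f /backup/home.dump /mnt/snap     # dump the frozen view
    umount /mnt/snap
    lvremove -f "/dev/$vg/${lv}_snap"
    restore -C -f /backup/home.dump            # compare archive vs. disk
}

echo "snapshot_dump defined (not run)"
```

The snapshot gives dump a filesystem image that cannot change underneath it, which is what removes the race with the kernel cache.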
Then I guess we can mark this as NOTABUG (I'll leave it to Jeff to
do that).
Stelian, if I understood you correctly: if I do something like
raidhotremove a disk from the mirror, dump it, and raidhotadd it back, the
Linux kernel will flush all data to the disk prior to it being removed
from the RAID device? Is that documented and guaranteed behaviour? Or did
you have something else in mind?
Jeff: I released 0.4b37 today, which fixes a filesystem offset
calculation which could also lead to read errors or data corruption.
Make sure you package 0.4b37 if you decide to upgrade.
Aleksandar: no need for raidhotremove. I was talking about:
dump 0f /dev/tape /dev/whatever
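Stelian's one-liner above assumes the filesystem is offline. A hedged sketch of that workflow (device and archive paths are placeholders; the function is defined but not executed, since unmounting a live filesystem is disruptive):

```shell
#!/bin/sh
# Hypothetical offline dump: unmount first so no cache can change the
# on-disk data underneath dump, then remount via the /etc/fstab entry.
dump_unmounted() {
    dev="$1"; archive="$2"       # e.g. /dev/hda6, /dev/tape (placeholders)
    umount "$dev" || return 1    # refuse to dump a still-mounted filesystem
    dump -0 -f "$archive" "$dev"
    mount "$dev"                 # remount using the fstab entry
}

echo "dump_unmounted defined (not run)"
```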
Stelian: The workaround doesn't work in my case (or I would be doing
it in the first place, and there wouldn't be this bug report). It would
be nice if /dev/whatever could be unmounted. However, it can't, so the
backup must be done on a mounted file system. No way around it.
Booting single user or from CD is not an option either (for starters,
no physical access to console, not to mention other restrictions).
Never mind; on Linux this obviously can't be done in a safe and simple
way with minimum (application) downtime like on Solaris, using
lockfs -fw; metaoffline; lockfs -u; ufsdump; metaonline (which only
syncs the differences, if any, after metaonline and takes seconds to
complete, instead of resyncing the entire metadevice like raidhotadd,
which takes a very long time for large volumes). Time to stop typing,
I'm going too much off topic anyway...
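For what it's worth, the Solaris lockfs -fw / lockfs -u pair has a rough Linux counterpart in fsfreeze (not yet available at the time of this exchange). The mount point is a placeholder and the function is only defined, not executed:

```shell
#!/bin/sh
# Hypothetical freeze-dump-thaw, analogous to lockfs -fw ... lockfs -u:
# fsfreeze -f blocks new writes and flushes dirty data, so a dump taken
# while frozen sees a consistent image.
frozen_dump() {
    mnt="$1"; archive="$2"          # e.g. /home, /dev/tape (placeholders)
    fsfreeze -f "$mnt"              # block writes, flush to disk
    dump -0 -f "$archive" "$mnt"
    fsfreeze -u "$mnt"              # thaw the filesystem
}

echo "frozen_dump defined (not run)"
```

Applications block on writes only for the duration of the dump, which keeps downtime low, though not as low as the Solaris metaoffline trick above.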
I would consider it a kernel bug if 'raidhotremove' doesn't flush the
data to the hardware device.
Anyway, you may want to look at LVM (or EVMS) snapshots, this may be
the only way for you to run dump without problems.
Just in case, please give 0.4b37 a try too (you can download the
.src.rpm or binary rpm from dump.sf.net); a corruption fix is included
and, who knows, you may be hitting exactly this bug...
Alex, have you tested dump 0.4b37? Does the restore bug occur
again? I doubt it helps: even if Stelian tried hard to fix the problem,
dump cannot be safely used on mounted (even idle) filesystems
without the EVMS kernel patches. If you agree, I'll close this bug as
NOTABUG.
I've downloaded and installed 0.4b37. The fix Stelian made fixed
this bug too (or maybe it was the same bug). If you agree, it can be
closed as CURRENTRELEASE or NEXTRELEASE (whichever is appropriate).
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.