Red Hat Bugzilla – Bug 125646
files restored from dump are corrupt
Last modified: 2013-07-02 19:00:39 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040510
Description of problem:
When dumping only a set of files (not an entire partition), in some cases
files in the dump archive are corrupted. My colleague Don Tiessen
discovered this with dump-0.4b27-3 on Red Hat 7.3, and I confirmed it and
did some additional testing with dump-0.4b33-3 on Fedora Core 2.
The corruption is not always reproducible. Sometimes (but only
sometimes) everything goes fine; sometimes some files are corrupted.
What I did was to set up the directory /root/test and copy the current set
of official RPM updates for Fedora Core 2 into it as test files. Then I did:
# cd /root
# dump -0 -f test1.dump test
# restore -ivf test1.dump
# mv root root1
# cd root1/test
# for a in *; do diff $a /root/test/$a; done
I repeated this test three times (changing 1 to 2 and 3 in the above
commands). The first and third time I ended up with some files corrupted;
the second time everything was correctly restored. Dump and restore
did not print any error messages.
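The manual round-trip test above can be automated. The helper below is a hypothetical sketch (the paths and the dump/restore invocations are only assumptions based on this report); the comparison uses cmp so that corrupt files of identical size are still caught. The demonstration at the bottom runs the check on two throwaway directories, so it works even without dump installed; in the real test, the second directory would be the restored tree.

```shell
#!/bin/sh
# Compare every file in $1 against the file of the same name in $2
# and report any mismatch. With dump/restore this would be called as:
#   dump -0 -f test1.dump /root/test
#   restore -rf test1.dump          # into a scratch directory
#   verify_roundtrip /root/test scratch/root/test
verify_roundtrip() {
    src="$1"; restored="$2"
    for f in "$src"/*; do
        name=$(basename "$f")
        # cmp -s is silent and exits non-zero when contents differ
        if ! cmp -s "$f" "$restored/$name"; then
            echo "CORRUPT: $name"
        fi
    done
    echo "verification done"
}

# Demonstration on two throwaway directories (no dump/restore involved):
tmp=$(mktemp -d)
mkdir "$tmp/orig" "$tmp/copy"
printf 'abc' > "$tmp/orig/a"; printf 'abc' > "$tmp/copy/a"   # identical
printf 'xyz' > "$tmp/orig/b"; printf 'xyJ' > "$tmp/copy/b"   # same size, corrupt
verify_roundtrip "$tmp/orig" "$tmp/copy"   # prints "CORRUPT: b"
rm -rf "$tmp"
```

Checksums or cmp are preferable to diff here, since diff's "Binary files ... differ" output does not say where or how badly the file diverged.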
The results of one run:
Binary files ipsec-tools-0.2.5-2.i386.rpm and /root/test/ipsec-tools-0.2.5-2.i386.rpm differ
Binary files kdelibs-3.2.2-6.i386.rpm and /root/test/kdelibs-3.2.2-6.i386.rpm differ
Binary files kdelibs-devel-3.2.2-6.i386.rpm and /root/test/kdelibs-devel-3.2.2-6.i386.rpm differ
Binary files subversion-perl-1.0.2-2.1.i386.rpm and /root/test/subversion-perl-1.0.2-2.1.i386.rpm differ
The results of another run:
Binary files cups-1.1.20-11.1.i386.rpm and /root/test/cups-1.1.20-11.1.i386.rpm differ
Binary files php-ldap-4.3.6-5.i386.rpm and /root/test/php-ldap-4.3.6-5.i386.rpm differ
Binary files php-pear-4.3.6-5.i386.rpm and /root/test/php-pear-4.3.6-5.i386.rpm differ
Binary files subversion-1.0.2-2.1.i386.rpm and /root/test/subversion-1.0.2-2.1.i386.rpm differ
The corrupted files have the same size as the original files.
Then I did two additional rounds of tests.
In the second round of tests, I put the dump file on a different partition
(original files on hda1, dump on hda6). In this case I was not able
to reproduce the problem (but it might be that I wasn't trying hard
enough).
In the third round of tests, I dumped the entire partition. Again, I was not
able to reproduce the problem (but it might be that I wasn't trying hard
enough).
I'm reporting the bug with high severity, since it might result in
loss of valuable data (we all trust utilities such as dump or tar to
keep our data safe).
Version-Release number of selected component (if applicable):
dump-0.4b33-3 (originally seen with dump-0.4b27-3)
Steps to Reproduce:
1. dump -0 -f dump_file dir
2. restore -ivf dump_file
3. diff between original and restored files
Actual Results: The restored files were different from the original files.
Expected Results: The restored files should be the same as the original files.
Your trust is misplaced if you are using tar and/or dump to back up
active filesystems. dump has never promised reliable archives on active
Unix systems, and the problem is worse on Linux: because of
the lack of a raw character device, output read from the block device is
subject to change while cached by the kernel.
I'm pretty sure that's what the problem is, but I
will upgrade to -b36 and double check that the problem
exists there as well.
I know all about the dangers of backing up files on an active file
system. However, I expect that files that are not active (no processes
have them open for writing) should be backed up fine. The files that
I was trying to archive using dump were untouched by any process on
the system for a long enough time that the kernel had certainly flushed
all pages to the disk.
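Rather than relying on the kernel's writeback timers, the cached state can be pushed out explicitly before dumping. The sketch below is an assumption on my part, not a guarantee of consistency (dirty pages can reappear the moment anything writes again); the device name is a placeholder and the commands need root, so the function is only defined here, not executed.

```shell
#!/bin/sh
# Hypothetical pre-dump flush: write out dirty pages, then invalidate the
# block-device buffers so a subsequent raw read sees on-disk data.
flush_before_dump() {
    dev="$1"                      # e.g. /dev/hda1 (placeholder)
    sync                          # commit dirty pages to disk
    blockdev --flushbufs "$dev"   # drop stale block-device buffers
}

echo "flush_before_dump defined (not run)"
```

Even with this, the window between the flush and the end of the dump is unprotected, which is exactly the race being discussed in this report.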
Anyhow, in case (parts of) the files are still in the cache and not
committed to disk, shouldn't the kernel return the information from the cache?
Jeff: you are probably correct, this looks very much like an active
filesystem problem.
Aleksandar: dump bypasses the kernel cache when it accesses the raw
disk, so if the kernel hasn't flushed the data, dump will not see it.
Even worse, if the kernel has flushed only part of the data (for
example, flushed the inode metadata but not the data blocks), dump can
see an invalid filesystem.
Most of the time dumping a mounted filesystem works just fine, but
there is no guarantee, and you'd better run restore -C to verify it.
On the other hand, dumping an unmounted filesystem (or a filesystem
snapshot created by LVM/EVMS) is 100% guaranteed to be valid.
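The snapshot workflow described here can be sketched as follows. The volume-group, mount-point and archive names are hypothetical, and the function is only defined, not executed, since lvcreate and dump need root and a real volume group:

```shell
#!/bin/sh
# Hypothetical LVM snapshot dump: dump a read-only snapshot so the image
# cannot change mid-dump, then use restore -C to verify the archive.
snapshot_dump() {
    vg=vg0; lv=home                            # placeholder volume names
    lvcreate -s -L 1G -n "${lv}_snap" "/dev/$vg/$lv"
    mount -o ro "/dev/$vg/${lv}_snap" /mnt/snap
    dump -0 -f /backup/home.dump /mnt/snap     # dump the frozen view
    umount /mnt/snap
    lvremove -f "/dev/$vg/${lv}_snap"
    restore -C -f /backup/home.dump            # compare archive vs. disk
}

echo "snapshot_dump defined (not run)"
```

The snapshot gives dump a filesystem image that cannot change underneath it, which is what removes the race with the kernel cache.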
Then I guess we can mark this as NOTABUG (I'll leave it to Jeff to
do that).
Stelian, if I understood you correctly: if I do something like
raidhotremove a disk from the mirror, dump it, and raidhotadd it back, the
Linux kernel will flush all data to the disk prior to it being removed
from the RAID device? Is that documented and guaranteed behaviour? Or did
you have something else in mind?
Jeff: I released 0.4b37 today, which fixes a filesystem offset
calculation which could also lead to read errors or data corruption.
Make sure you package 0.4b37 if you decide to upgrade.
Aleksandar: no need for raidhotremove. I was talking about:
dump 0f /dev/tape /dev/whatever
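Stelian's one-liner above assumes the filesystem is offline. A hedged sketch of that workflow (device and archive paths are placeholders; the function is defined but not executed, since unmounting a live filesystem is disruptive):

```shell
#!/bin/sh
# Hypothetical offline dump: unmount first so no cache can change the
# on-disk data underneath dump, then remount via the /etc/fstab entry.
dump_unmounted() {
    dev="$1"; archive="$2"       # e.g. /dev/hda6, /dev/tape (placeholders)
    umount "$dev" || return 1    # refuse to dump a still-mounted filesystem
    dump -0 -f "$archive" "$dev"
    mount "$dev"                 # remount using the fstab entry
}

echo "dump_unmounted defined (not run)"
```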
Stelian: The workaround doesn't work in my case (or I would be doing
it in the first place, and there wouldn't be this bug report). It would
be nice if /dev/whatever could be unmounted. However, it can't, so the
backup must be done on a mounted file system. No way around it.
Booting single user or from CD is not an option either (for starters,
no physical access to console, not to mention other restrictions).
Never mind; on Linux this obviously can't be done in a safe and simple
way with minimum (application) downtime like on Solaris, using
lockfs -fw; metaoffline; lockfs -u; ufsdump; metaonline (which only
syncs the differences, if any, after metaonline and takes seconds to
complete, instead of resyncing the entire metadevice like raidhotadd,
which takes a very long time for large volumes). Time to stop typing,
I'm going too much off topic anyway...
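For what it's worth, the Solaris lockfs -fw / lockfs -u pair has a rough Linux counterpart in fsfreeze (not yet available at the time of this exchange). The mount point is a placeholder and the function is only defined, not executed:

```shell
#!/bin/sh
# Hypothetical freeze-dump-thaw, analogous to lockfs -fw ... lockfs -u:
# fsfreeze -f blocks new writes and flushes dirty data, so a dump taken
# while frozen sees a consistent image.
frozen_dump() {
    mnt="$1"; archive="$2"          # e.g. /home, /dev/tape (placeholders)
    fsfreeze -f "$mnt"              # block writes, flush to disk
    dump -0 -f "$archive" "$mnt"
    fsfreeze -u "$mnt"              # thaw the filesystem
}

echo "frozen_dump defined (not run)"
```

Applications block on writes only for the duration of the dump, which keeps downtime low, though not as low as the Solaris metaoffline trick above.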
I would consider it a kernel bug if 'raidhotremove' doesn't flush the
data to the hardware device.
Anyway, you may want to look at LVM (or EVMS) snapshots, this may be
the only way for you to run dump without problems.
Just in case, please give 0.4b37 a try too (you can download the
.src.rpm or binary rpm from dump.sf.net); a corruption fix is included
and, who knows, you may be hitting exactly this bug...
Alex, have you tested dump 0.4b37? Does the restore bug occur
again? I doubt it helps: even if Stelian tried hard to fix the problem,
dump cannot be safely used on mounted (even idle) filesystems
without the EVMS kernel patches. If you agree, I'll close this bug as
NOTABUG.
I've downloaded and installed 0.4b37. The fix Stelian made fixed
this bug too (or maybe it was the same bug). If you agree, it can be
closed as CURRENTRELEASE or NEXTRELEASE (whichever is appropriate).
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.