Bug 440160 - data corruption on installed partition.
Summary: data corruption on installed partition.
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: anaconda
Version: rawhide
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Assignee: Anaconda Maintenance Team
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: F9Blocker
 
Reported: 2008-04-01 22:52 UTC by Dave Jones
Modified: 2015-01-04 22:30 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2008-04-13 22:38:41 UTC
Type: ---
Embargoed:



Description Dave Jones 2008-04-01 22:52:35 UTC
I've been chasing a bug where, after installing on the Asus Eee PC, we face
filesystem corruption immediately on reboot. On ext3 we end up with orphan inodes,
and on ext2 it looks something like http://www.codemonkey.org.uk/ext2.txt

My theory is that anaconda isn't cleanly unmounting the filesystem before it
reboots.  I notice that right at the end, when the reboot button appears, that
anaconda still has a load of files left open, which prevents me from manually
umounting /mnt/sysimage on tty2.  Do we close() these files before we do the reboot?

Even after moving /sbin/reboot out of the way, when I click reboot, somehow
anaconda forces a reboot.  Is there a way that I can make anaconda just exit
instead of doing the reboot?

Comment 1 Jeremy Katz 2008-04-01 23:13:31 UTC
You've still got X running, so of course there are still a lot of files open. 
We end up killing processes and then doing the unmount of filesystems, etc. 
/sbin/reboot is just a convenience binary that kicks off that process, so moving
it out of the way wouldn't be expected to do anything. It's much like a running
system: if you moved reboot out of the way and sent init the appropriate signal,
you'd still get rebooted.

You should be able to add 'nokill' to the kernel command line, and then we won't
go and kill all processes, but you still might need to be quick with the ctrl-s
to avoid the actual reboot.

Comment 2 Dave Jones 2008-04-02 00:19:51 UTC
I don't seem to be quick enough.  I've tried twice now to no avail.


Comment 3 Dave Jones 2008-04-02 20:29:13 UTC
OK, I managed in a text mode install to ctrl-s long enough to see that right
before the reboot, anaconda spits out a message saying it couldn't umount
/mnt/sysimage.  Given the small disk size, I'm guessing there's never enough
memory pressure to force everything out to disk, so it stays cached until we
sync() or umount the device.  Since neither of those occurs, we end up with
corruption when we reboot.

I suspect we're leaking an fd somewhere? 
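
The guess about unwritten data can be checked from tty2 before the reboot. A
minimal sketch, assuming awk is available in the installer environment (it may
not be in a busybox image); dirty_kb is a hypothetical helper name, not
something anaconda provides. It reads the Dirty: line from /proc/meminfo, which
reports page-cache data not yet written back to disk.

```shell
#!/bin/bash
# dirty_kb: print the amount of dirty (not yet written back) page-cache data
# in kB. The optional argument is a meminfo-style file, defaulting to
# /proc/meminfo; it exists so the function can be tried against a saved copy.
dirty_kb() {
    awk '$1 == "Dirty:" { print $2 }' "${1:-/proc/meminfo}"
}
```

Running dirty_kb, then sync, then dirty_kb again should show the figure drop
towards zero if a large amount of data really was still sitting in memory.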

Comment 4 Dave Jones 2008-04-02 20:29:44 UTC
if only fuser worked too.. that segfaults :-/
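
When fuser is unusable, the same information can be approximated by reading
/proc directly. A rough sketch, assuming a shell with `local` and a readlink
binary are present; holders_under is a hypothetical helper name. It only looks
at cwd, root, exe and open file descriptors, so it will miss files that are
merely memory-mapped.

```shell
#!/bin/bash
# holders_under DIR [PROCDIR]
# Print the PID of every process whose cwd, root, exe, or an open fd points
# at DIR or something below it. PROCDIR defaults to /proc; the second argument
# exists only so the function can be tried against a fake tree.
holders_under() {
    local target=${1%/} proc=${2:-/proc}
    local piddir link dest
    for piddir in "$proc"/[0-9]*; do
        [ -d "$piddir" ] || continue
        for link in "$piddir"/cwd "$piddir"/root "$piddir"/exe "$piddir"/fd/*; do
            # Unreadable or missing links are skipped silently.
            dest=$(readlink "$link" 2>/dev/null) || continue
            case $dest in
                "$target"|"$target"/*)
                    echo "${piddir##*/}"
                    break   # one match per process is enough
                    ;;
            esac
        done
    done
}
```

For example, holders_under /mnt/sysimage run on tty2 would list the processes
that keep the umount from succeeding.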

Comment 5 Jeremy Katz 2008-04-03 03:13:36 UTC
Gotta love busybox some days...  just kicked off an install, let's see if it
reproduces it.

Comment 6 Jeremy Katz 2008-04-03 03:59:46 UTC
Okay, I at least reproduced the failure to unmount.  And have a hunch or five as
to why.  Will poke at it some more tomorrow. 

Comment 7 Jeremy Katz 2008-04-03 21:05:52 UTC
Bah, failure to unmount is because I was running with 'nokill'.  In which case
it makes sense given the way we do things now.  

I wasn't able to reproduce this at all today, even on davej's eeepc.

Comment 8 Chris Lumens 2008-04-09 13:39:19 UTC
While I haven't seen the file system corruption before, I have seen what I
believe to be related error messages at the end of anaconda where a LOOP_CLR_FD
fails and we are unable to unmount /mnt/sysimage.  However, I am not able to
reproduce and determine whether or not it is to blame for this bug report.

Comment 9 Jeremy Katz 2008-04-09 14:35:54 UTC
Chris -- have you seen it other than running with 'nokill'?  If so, then yeah,
that's one we need to worry about.  But I spent a while doing different types of
installs to hit it without success on Sunday.

Comment 10 Dave Jones 2008-04-09 15:14:15 UTC
just before I left for the week, I managed to reproduce this at home on another
laptop too, so it's not anything special to the eee.

I also managed to capture a screenshot of the failing umount on the eee, which
I've put up at http://www.codemonkey.org.uk/fail.jpg , though it sounds like
Chris is already familiar with that failure mode.

I really can't explain why I'm seeing this failure over and over at home, but
not at the office, with the same bits.  To be sure: I am syncing from the right
bits, right? Here's my sync script:

#!/bin/bash

TAG='latest'

cd rawhide || exit 1

rsync --delete -avzP \
  rsync://wallace.redhat.com/fedora-enchilada/linux/development/x86_64/os/ x86_64/ \
  --exclude SRPMS --exclude debug \
  --exclude 'openoffice.org-langpack*' --exclude repodata/repoview \
  --exclude 'kde-i18n*' --exclude headers --exclude .repodata
rsync -avzP \
  'rsync://wallace.redhat.com/fedora-enchilada/linux/development/x86_64/os/Packages/openoffice.org-langpack-en*' \
  x86_64/Packages/
if [ -f x86_64/isolinux/vmlinuz ]; then
  cp x86_64/isolinux/vmlinuz /tftpboot/X86PC/UNDI/pxelinux/davej/vmlinuz64
  cp x86_64/isolinux/initrd.img /tftpboot/X86PC/UNDI/pxelinux/davej/initrd64.img
else
  echo Rawhide incomplete.
fi

for i in x86_64/Packages/*i[356]86*.rpm
do
  rsync -a "$i" i386/Packages/
done
rsync -avP x86_64/Packages/*noarch*.rpm i386/Packages/

rsync --delete -avzP \
  rsync://wallace.redhat.com/fedora-enchilada/linux/development/i386/os/ i386/ \
  --exclude SRPMS --exclude debug \
  --exclude 'openoffice.org-langpack*' --exclude repodata/repoview \
  --exclude 'kde-i18n*' --exclude headers --exclude .repodata
rsync -avzP \
  'rsync://wallace.redhat.com/fedora-enchilada/linux/development/i386/os/Packages/openoffice.org-langpack-en*' \
  i386/Packages/


if [ -f i386/isolinux/vmlinuz ]; then
  cp i386/isolinux/vmlinuz /tftpboot/X86PC/UNDI/pxelinux/davej/vmlinuz
  cp i386/isolinux/initrd.img /tftpboot/X86PC/UNDI/pxelinux/davej/initrd.img
else
  echo Rawhide incomplete.
fi


The non-eee reproduction I managed was also an NFS install, but that one was by
hand, no kickstart involved.


Comment 11 Jeremy Katz 2008-04-09 16:03:55 UTC
In your picture, there's not any comments about killing processes... are you
sure 'nokill' isn't lingering in your pxe config from when you were trying it
last week?  As that (well, or running init in test mode which is even more
likely) is the only way I can see that processes wouldn't be killed.

And if processes don't get killed things will start to go wacky, no question. 
We could probably improve that situation (it used to be better), but there are
some dragons there that I really don't want to fight at this late stage of the game.
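
A quick way to rule a lingering 'nokill' in or out is to look at the command
line the installer kernel actually booted with, rather than at the PXE config.
A minimal sketch; has_nokill is a hypothetical helper name, and it assumes
arguments are separated by single spaces, as they normally are in
/proc/cmdline.

```shell
#!/bin/bash
# has_nokill CMDLINE
# Succeeds if 'nokill' appears as a separate, whole argument in the given
# kernel command line string. Padding both sides with spaces keeps arguments
# that merely contain the text (such as 'nokillx') from matching.
has_nokill() {
    case " $1 " in
        *" nokill "*) return 0 ;;
    esac
    return 1
}
```

From tty2 this could be used as:
has_nokill "$(cat /proc/cmdline)" && echo "nokill is set"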

Comment 12 Dave Jones 2008-04-09 20:01:59 UTC
nggh. boy do I feel dumb.

somehow I added that to the wrong stanza in my pxeconfig. 
So yes, my 'install by hand' was also using nokill.

So, NOTABUG I guess.


Comment 13 Jeremy Katz 2008-04-10 00:20:44 UTC
Or at least, "try again when you get home, let us know if it still happens".

Comment 14 Dave Jones 2008-04-13 22:19:50 UTC
As expected, it works fine without nokill.
Is it worth leaving this open though, and repurposing it to improve nokill
handling of this situation?

Comment 15 Jeremy Katz 2008-04-13 22:38:41 UTC
There's not really any way to improve nokill's handling here -- the problem is
that the shell is one of the things we really don't want to kill, but it's
running out of stage2, which is on the filesystem, so we can't unmount it.

