Red Hat Bugzilla – Full Text Bug Listing
|Summary:||data corruption on installed partition.|
|Product:||[Fedora] Fedora||Reporter:||Dave Jones <davej>|
|Component:||anaconda||Assignee:||Anaconda Maintenance Team <anaconda-maint-list>|
|Status:||CLOSED NOTABUG||QA Contact:||Fedora Extras Quality Assurance <extras-qa>|
|Fixed In Version:||Doc Type:||Bug Fix|
|Last Closed:||2008-04-13 18:38:41 EDT||Type:||---|
Description Dave Jones 2008-04-01 18:52:35 EDT
I've been chasing a bug where, after installing onto the Asus Eee PC, we immediately face filesystem corruption on the first reboot. On ext3 we end up with orphan inodes, and on ext2 it looks something like http://www.codemonkey.org.uk/ext2.txt My theory is that anaconda isn't cleanly unmounting the filesystem before it reboots. I notice that right at the end, when the reboot button appears, anaconda still has a load of files left open, which prevents me from manually umounting /mnt/sysimage on tty2. Do we close() these files before we do the reboot? Even after moving /sbin/reboot out of the way, when I click reboot, somehow anaconda forces a reboot. Is there a way that I can make anaconda just exit instead of doing the reboot?
Comment 1 Jeremy Katz 2008-04-01 19:13:31 EDT
You've still got X running, so of course there are still a lot of files open. We end up killing processes and then doing the unmount of filesystems, etc. /sbin/reboot is just a convenience binary that kicks off that process, so moving it out of the way wouldn't be expected to do anything. It's much like moving reboot out of the way on a running system and then sending init the appropriate signal: you'd still get rebooted. You should be able to add 'nokill' to the kernel command line, and then we won't go and kill all processes, but you still might need to be quick with the ctrl-s to avoid the actual reboot.
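Since 'nokill' is passed on the kernel command line, one way to confirm whether a given boot actually carries it is to look at /proc/cmdline from a shell. The following is a minimal sketch, not anything anaconda itself ships; the function name has_boot_flag and its optional second argument are made up for illustration.

```shell
# has_boot_flag FLAG [CMDLINE_FILE]
# Succeeds if FLAG appears as a whole word in the kernel command line.
# CMDLINE_FILE defaults to /proc/cmdline; the override exists only so the
# function can be tried against a sample file.
has_boot_flag() {
    flag=$1
    file=${2:-/proc/cmdline}
    # Put each whitespace-separated word on its own line, then look for
    # an exact, fixed-string, whole-line match.
    tr -s ' \t' '\n' < "$file" | grep -qxF -- "$flag"
}
```

From tty2 this could be used as: has_boot_flag nokill && echo "nokill is active for this boot"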
Comment 2 Dave Jones 2008-04-01 20:19:51 EDT
I don't seem to be quick enough. I've tried twice now to no avail.
Comment 3 Dave Jones 2008-04-02 16:29:13 EDT
OK, I managed in a text mode install to ctrl-s long enough to see that right before the reboot, anaconda spits out a message saying it couldn't umount /mnt/sysimage. Given the small disk size, I'm guessing that memory pressure isn't forcing everything to be written out to disk until we sync() or umount the device. Given neither of those occurs, we end up with corruption when we reboot. I suspect we're leaking an fd somewhere?
Comment 4 Dave Jones 2008-04-02 16:29:44 EDT
if only fuser worked too.. that segfaults :-/
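When fuser is unavailable, the same question ("who is holding /mnt/sysimage open?") can be approximated by walking /proc by hand. The sketch below assumes a shell that supports `local` and a working readlink; the function name holders and its optional second argument are invented for illustration. It only looks at open file descriptors, cwd, exe and root links, so files that are merely mmap'd would not show up.

```shell
# holders MOUNTPOINT [PROC_DIR]
# Prints "PID<TAB>PATH" for every /proc link that points at or below
# MOUNTPOINT. PROC_DIR defaults to /proc and is overridable for testing.
holders() {
    local mnt=${1%/} proc=${2:-/proc} link target pid
    for link in "$proc"/[0-9]*/fd/* "$proc"/[0-9]*/cwd \
                "$proc"/[0-9]*/exe "$proc"/[0-9]*/root; do
        # Unreadable links (other users' processes, unmatched globs)
        # are skipped silently.
        target=$(readlink "$link" 2>/dev/null) || continue
        case $target in
            "$mnt"|"$mnt"/*)
                pid=${link#"$proc"/}   # strip the leading "/proc/"
                pid=${pid%%/*}         # keep only the PID component
                printf '%s\t%s\n' "$pid" "$target"
                ;;
        esac
    done
}
```

Run as "holders /mnt/sysimage" just before the reboot, this would list each process that still pins the mount, which is the information the failing umount does not give.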
Comment 5 Jeremy Katz 2008-04-02 23:13:36 EDT
Gotta love busybox some days... just kicked off an install, let's see if it reproduces it.
Comment 6 Jeremy Katz 2008-04-02 23:59:46 EDT
Okay, I at least reproduced the failure to unmount. And have a hunch or five as to why. Will poke at it some more tomorrow.
Comment 7 Jeremy Katz 2008-04-03 17:05:52 EDT
Bah, failure to unmount is because I was running with 'nokill'. In which case it makes sense given the way we do things now. I wasn't able to reproduce this at all today, even on davej's eeepc.
Comment 8 Chris Lumens 2008-04-09 09:39:19 EDT
While I haven't seen the file system corruption before, I have seen what I believe to be related error messages at the end of anaconda where a LOOP_CLR_FD fails and we are unable to unmount /mnt/sysimage. However, I am not able to reproduce and determine whether or not it is to blame for this bug report.
Comment 9 Jeremy Katz 2008-04-09 10:35:54 EDT
Chris -- have you seen it other than running with 'nokill'? If so, then yeah, that's one we need to worry about. But I spent a while doing different types of installs to hit it, without success, on Sunday.
Comment 10 Dave Jones 2008-04-09 11:14:15 EDT
Just before I left for the week, I managed to reproduce this at home on another laptop too, so it's not anything special to the eee. I also managed to capture a screenshot of the failing umount on the eee, which I've put up at http://www.codemonkey.org.uk/fail.jpg , though it sounds like Chris is already familiar with that failure mode. I really can't explain why I'm seeing this failure over and over at home, but not at the office, with the same bits. To be sure, I am syncing from the right bits, right? Here's my sync script:

#!/bin/bash
TAG='latest'
cd rawhide
rsync --delete -avzP rsync://wallace.redhat.com/fedora-enchilada/linux/development/x86_64/os/ x86_64/ --exclude SRPMS --exclude debug \
  --exclude openoffice.org-langpack* --exclude repodata/repoview --exclude kde-i18n* --exclude headers --exclude .repodata --delete
rsync -avzP rsync://wallace.redhat.com/fedora-enchilada/linux/development/x86_64/os/Packages/openoffice.org-langpack-en* x86_64/Packages/
if [ -f x86_64/isolinux/vmlinuz ]; then
  cp x86_64/isolinux/vmlinuz /tftpboot/X86PC/UNDI/pxelinux/davej/vmlinuz64
  cp x86_64/isolinux/initrd.img /tftpboot/X86PC/UNDI/pxelinux/davej/initrd64.img
else
  echo Rawhide incomplete.
fi
for i in x86_64/Packages/*i86*.rpm
do
  rsync -a $i i386/Packages/
done
rsync -avP x86_64/Packages/*noarch*.rpm i386/Packages/
rsync --delete -avzP rsync://wallace.redhat.com/fedora-enchilada/linux/development/i386/os/ i386/ --exclude SRPMS --exclude debug \
  --exclude openoffice.org-langpack* --exclude repodata/repoview --exclude kde-i18n* --exclude headers --exclude .repodata --delete
rsync -avzP rsync://wallace.redhat.com/fedora-enchilada/linux/development/i386/os/Packages/openoffice.org-langpack-en* i386/Packages/
if [ -f i386/isolinux/vmlinuz ]; then
  cp i386/isolinux/vmlinuz /tftpboot/X86PC/UNDI/pxelinux/davej/vmlinuz
  cp i386/isolinux/initrd.img /tftpboot/X86PC/UNDI/pxelinux/davej/initrd.img
else
  echo Rawhide incomplete.
fi

The non-eee reproduction I managed was also an NFS install, but that one was by hand, no kickstart involved.
Comment 11 Jeremy Katz 2008-04-09 12:03:55 EDT
In your picture, there aren't any comments about killing processes... are you sure 'nokill' isn't lingering in your pxe config from when you were trying it last week? That (well, or running init in test mode, which is even less likely) is the only way I can see that processes wouldn't be killed. And if processes don't get killed things will start to go wacky, no question. We could probably improve that situation (it used to be better), but there are some dragons there that I really don't want to fight at this late stage of the game.
Comment 12 Dave Jones 2008-04-09 16:01:59 EDT
nggh. boy do I feel dumb. somehow I added that to the wrong stanza in my pxeconfig. So yes, my 'install by hand' was also using nokill. So, NOTABUG I guess.
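A stray boot option in the wrong stanza is easy to miss by eye. As a rough sketch, assuming a PXELINUX-style config made of "label" and "append" lines, the following lists every label whose append line carries a given option. The function name labels_with_flag is made up for illustration, and the config path is whatever your setup uses; none of this is taken from the reporter's actual file.

```shell
# labels_with_flag FLAG CONFIG_FILE
# Prints the name of each label whose "append" line contains FLAG as a
# whole word. An append line seen before any label is reported as
# "(global)". Keywords are matched case-insensitively.
labels_with_flag() {
    awk -v flag="$1" '
        BEGIN { label = "(global)" }
        tolower($1) == "label"  { label = $2 }
        tolower($1) == "append" {
            for (i = 2; i <= NF; i++)
                if ($i == flag) { print label; break }
        }
    ' "$2"
}
```

For example, "labels_with_flag nokill pxelinux.cfg/default" (the path here is an assumed placeholder) would show at a glance which install targets still boot with nokill.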
Comment 13 Jeremy Katz 2008-04-09 20:20:44 EDT
Or at least, "try again when you get home, let us know if it still happens".
Comment 14 Dave Jones 2008-04-13 18:19:50 EDT
As expected, it works fine without nokill. Is it worth leaving this open though, and repurposing it to improve nokill's handling of this situation?
Comment 15 Jeremy Katz 2008-04-13 18:38:41 EDT
There's not really any way to improve nokill's handling here -- the problem is that the shell is one of the things we really don't want to kill, but it's running out of stage2, which is on the filesystem, so we can't unmount it.