Bug 521107 - fsck cannot clean up filesystem, eventually hangs forever
Summary: fsck cannot clean up filesystem, eventually hangs forever
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: e2fsprogs
Version: 5.1
Hardware: x86_64
OS: Linux
Target Milestone: rc
Target Release: ---
Assignee: Eric Sandeen
QA Contact: BaseOS QE
Depends On:
Reported: 2009-09-03 17:13 UTC by Dave
Modified: 2011-01-26 21:20 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2011-01-26 21:20:57 UTC
Target Upstream Version:

Attachments
Output of fsck.ext2 command on the filesystem (775.34 KB, text/plain)
2009-09-03 17:13 UTC, Dave
Output of fsck.ext3 command (349.64 KB, text/plain)
2009-09-03 17:26 UTC, Dave

Description Dave 2009-09-03 17:13:13 UTC
Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:
Always with this particular corrupt filesystem

Steps to Reproduce:
1. Run fsck on the filesystem.
2. After a while, fsck stops printing output and climbs to 100% CPU utilization.
3. fsck hangs indefinitely and Ctrl-C does not stop the process; only kill can stop it.

Actual results:
Unclean filesystem, 600 GB of lost data

Expected results:
fsck cleans up the filesystem and we continue operating

Additional info:
I have the output and a core dump from fsck.ext2, and the same from fsck.ext3 (both show the same issue).  I'm also attaching core dumps of both processes.

Comment 1 Dave 2009-09-03 17:13:45 UTC
Created attachment 359705
Output of fsck.ext2 command on the filesystem

Comment 2 Dave 2009-09-03 17:17:09 UTC
Running strace -p <PID> against fsck.ext2 after it is "hung" produces an empty output file, yet the process is using 100% of one CPU.

I've also uploaded the core, core-bz521107.bz2, to dropbox.redhat.com/incoming; this one is for fsck.ext2.
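
An empty strace combined with 100% CPU usually means the process is looping in userspace without making any system calls. A minimal sketch of how that could be confirmed on a live process, assuming gdb is installed and the caller is root or owns the process; the function names proc_state and proc_backtrace are hypothetical helpers, not part of e2fsprogs:

```shell
#!/bin/sh
# Sketch: inspect a process that burns CPU but makes no system calls.
# proc_state and proc_backtrace are illustrative names (assumptions).

# Print the scheduler state letter from /proc
# (R = running, S = sleeping, D = uninterruptible sleep).
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

# Attach gdb non-interactively, print one userspace backtrace, detach.
proc_backtrace() {
    command -v gdb >/dev/null 2>&1 || { echo "gdb not found" >&2; return 1; }
    gdb -p "$1" -batch -ex 'bt'
}
```

Calling proc_state <PID> a few times and seeing R each time, followed by proc_backtrace <PID>, would show which function the loop is in without waiting for a core dump to be analyzed.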

Comment 3 Dave 2009-09-03 17:26:12 UTC
Created attachment 359706
Output of fsck.ext3 command

Comment 4 Dave 2009-09-03 17:27:41 UTC
I also put core-fsck.ext3-bz521107.bz2 on dropbox.redhat.com/incoming; this is the core dump of the fsck.ext3 process.

Comment 5 Eric Sandeen 2009-09-03 17:42:03 UTC
Could you please create an "e2image -r" of the problematic filesystem, compress it, and provide it for analysis?  I can probably work backwards from the corefile, but with a filesystem image I could verify any fix.  If there is concern about sensitive filenames, the -s option will scramble them up in the image.
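
For reference, a minimal sketch of the request above. It assumes the installed e2image accepts "-" as the image file to write a raw image to standard output (documented in the e2image man page for raw images), so the image can be compressed on the fly. The wrapper name make_image and the DRY_RUN variable are hypothetical, added only so the command can be reviewed before it is run as root:

```shell
#!/bin/sh
# Sketch: create a scrambled, compressed raw metadata image of a filesystem.
# make_image and DRY_RUN are illustrative names (assumptions).
# Usage: make_image <block-device> <output-file.bz2>
make_image() {
    dev=$1
    out=$2
    if [ -n "${DRY_RUN:-}" ]; then
        # Only show what would be executed.
        echo "e2image -s -r $dev - | bzip2 > $out"
        return 0
    fi
    # -r: raw image, -s: scramble directory entries (file names).
    # Note: in plain sh the pipeline's exit status is bzip2's, so a failure
    # in e2image must be spotted from its error message on stderr.
    e2image -s -r "$dev" - | bzip2 > "$out"
}
```

For example, DRY_RUN=1 make_image /dev/mapper/almcrpprd03VG-localmnt2 fsimage-bz521107.img.bz2 prints the pipeline without touching the device.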


Comment 6 Dave 2009-09-03 17:51:11 UTC
Thanks Eric, 

Here's the output of my command:

[root@almcrpstg01 workspace]# e2image -s -r /dev/mapper/almcrpprd03VG-localmnt2  fsimage-bz521107.img
e2image 1.39 (29-May-2006)
e2image: A block group is missing an inode table while getting next inode

It creates a 0 byte image file:

[root@almcrpstg01 workspace]# ls -lah fsimage-bz521107.img 
-rw------- 1 root bin 0 Sep  3 10:50 fsimage-bz521107.img

Comment 7 Dave 2009-09-03 17:51:43 UTC
Note: I also opened service request 1949408 for this.

Comment 8 Eric Sandeen 2009-09-03 18:10:33 UTC
Ah crud.

Just in case more recent e2fsprogs can handle this, you might try installing e4fsprogs (userspace for the ext4 tech preview) and running e4image ... but I bet it dies the same way.

I'll try to look backwards from the core.

Any idea what happened to this filesystem?


Comment 9 Dave 2009-09-03 19:54:24 UTC
I think it was pretty straightforward in this case.  My understanding is that we initiated a reboot (using the reboot command manually), most probably while some application was still using this filesystem.  I'm presuming that the process didn't get killed off before the system rebooted.

Before that reboot, we noted a syslog error message saying that multipathd segfaulted.  When the system came back up, this filesystem would not fsck.

Comment 10 Eric Sandeen 2009-09-03 20:00:57 UTC
When e2fsck is running are you getting any errors in dmesg from the storage?

e2fsck should handle it more gracefully of course, but I wonder if everything got put back together again properly after the reboot ...
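
A minimal sketch of such a check, assuming the kernel log is read with dmesg while e2fsck runs. The filter name storage_errors and the pattern list are assumptions chosen for illustration and are not exhaustive:

```shell
#!/bin/sh
# Sketch: keep only kernel log lines that look storage-related.
# storage_errors and its patterns are illustrative (assumptions).
# Reads log text on stdin; exits non-zero when nothing matches.
storage_errors() {
    grep -iE 'i/o error|ext[234]-fs error|multipath|scsi error'
}

# Typical use while e2fsck is running in another terminal:
#   dmesg | storage_errors
```

An empty result does not prove the storage is healthy; it only means none of these patterns appeared in the current kernel ring buffer.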


Comment 11 Dave 2009-09-03 20:19:57 UTC
I've looked and I just don't see anything.  Unfortunately, this was a down application, and we had to completely wipe this filesystem to get our internal customer back up, so I can't try running fsck against the broken filesystem again.  Also, I'm not confident that we can reproduce this error reliably (though I will try).

Comment 12 Eric Sandeen 2011-01-26 21:20:57 UTC
I wasn't able to work out from the core what was wrong, and without access to the broken image this will be nigh impossible to fix. I'm afraid I'll have to close this one.
