Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Always, with this particular corrupt filesystem
Steps to Reproduce:
1. Run fsck
2. After a while, fsck stops printing output and spins at 100% CPU utilization
3. fsck hangs forever and Ctrl-C doesn't stop the process; only kill can stop it
Actual results:
Unclean filesystem, 600 GB of lost data
Expected results:
fsck cleans up the filesystem and we continue operating
I have the output of the fsck command and a core dump from fsck.ext2, and the same from fsck.ext3 (both hit the same issue). I'm also attaching core dumps of both processes.
Created attachment 359705 [details]
Output of fsck.ext2 command on the filesystem
Running strace -p <PID> against fsck.ext2 after it is "hung" produces an empty output file, yet the process is using 100% of one CPU.
I've also uploaded the core, core-bz521107.bz2, to dropbox.redhat.com/incoming . . . this is for fsck.ext2
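An empty strace combined with 100% CPU usually suggests the process is looping in userspace without making system calls. A minimal sketch for capturing more detail from such a process follows, assuming procps ps and gdb are installed; the helper names proc_state and proc_backtrace are purely illustrative and not part of e2fsprogs.

```shell
# Illustrative helper (hypothetical name): show state and CPU usage for a PID.
# A state of "R" with high %CPU and no syscalls in strace points at a userspace loop.
# Usage: proc_state <PID>
proc_state() {
    ps -o pid= -o stat= -o pcpu= -o comm= -p "$1"
}

# Illustrative helper (hypothetical name): attach gdb briefly and print a
# backtrace to see where the process is spinning. Usually needs root.
# Usage: proc_backtrace <PID>
proc_backtrace() {
    gdb -p "$1" -batch -ex bt
}
```

For example, proc_state <PID> followed by proc_backtrace <PID>, repeated a few times, can show whether the process stays in the same function.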
Created attachment 359706 [details]
Output of fsck.ext3 command
I also put core-fsck.ext3-bz521107.bz2 on dropbox.redhat.com/incoming; this is the core dump of the fsck.ext3 process.
Could you please create an "e2image -r" of the problematic filesystem, compress it, and provide it for analysis? I can probably work backwards from the corefile, but with a filesystem image I could verify any fix. If there is concern about sensitive filenames, the -s option will scramble them up in the image.
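As a rough sketch of the request above, the raw image can be written to stdout and compressed in one pass. This assumes bash, e2image, and bzip2 are available; the helper name make_raw_image and both of its arguments are hypothetical placeholders.

```shell
# Illustrative helper (hypothetical name): dump filesystem metadata as a raw
# image and compress it on the fly.
# Usage: make_raw_image <device> <output.bz2>
make_raw_image() {
    dev="$1"
    out="$2"
    # Refuse to run if the device path does not exist.
    if [ ! -e "$dev" ]; then
        echo "no such device: $dev" >&2
        return 2
    fi
    # "-" sends the raw image to stdout so it can be piped straight into bzip2.
    # Add -s before -r to scramble directory entries.
    # pipefail (bash) keeps an e2image failure from being masked by bzip2.
    ( set -o pipefail; e2image -r "$dev" - | bzip2 > "$out" )
}
```

The filesystem should be unmounted (or mounted read-only) while the image is taken.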
Here's the output of my command:
[root@almcrpstg01 workspace]# e2image -s -r /dev/mapper/almcrpprd03VG-localmnt2 fsimage-bz521107.img
e2image 1.39 (29-May-2006)
e2image: A block group is missing an inode table while getting next inode
It creates a 0-byte image file:
[root@almcrpstg01 workspace]# ls -lah fsimage-bz521107.img
-rw------- 1 root bin 0 Sep 3 10:50 fsimage-bz521107.img
Note: also opened service request 1949408 for this.
Just in case more recent e2fsprogs can handle this, you might try installing e4fsprogs (userspace for the ext4 tech preview) and running e4image ... but I bet it dies the same way.
I'll try to look backwards from the core.
Any idea what happened to this filesystem?
I think it was pretty straightforward in this case. My understanding is that we initiated a reboot (manually, using the reboot command), most probably while some application was still using this filesystem. I'm presuming that the process didn't get killed off before the system rebooted.
Before that reboot, we noted a syslog error message saying that multipathd had segfaulted. When the system came back up, fsck could not complete on this filesystem.
When e2fsck is running are you getting any errors in dmesg from the storage?
e2fsck should handle it more gracefully of course, but I wonder if everything got put back together again properly after the reboot ...
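One way to check for storage errors while e2fsck runs is to filter the kernel log for storage-related messages. The sketch below is only an assumption about which patterns are worth looking for; the list is not exhaustive and the helper name storage_errors is hypothetical.

```shell
# Illustrative filter (hypothetical name): keep only kernel log lines that look
# storage-related. The pattern list is an assumption, not an exhaustive set.
# Usage: dmesg | storage_errors
storage_errors() {
    grep -iE 'I/O error|SCSI error|multipath|EXT[23]-fs error'
}
```

Running dmesg | storage_errors before and after the e2fsck run makes it easier to spot messages that appeared during the check.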
I've looked and I just don't see anything. Unfortunately, the application on this filesystem was down, and we had to completely wipe the filesystem to get our internal customer back up. So I can't try running fsck on the broken filesystem again. Also, I'm not confident that we can reproduce this error reliably (though I will try).
I wasn't able to sort out from the core what was wrong, and I am afraid that without access to the broken image, this will be nigh impossible to fix... I'm afraid I'll have to close this one.