Bug 521107

Summary: fsck cannot clean up filesystem, eventually hangs forever
Product: Red Hat Enterprise Linux 5 Reporter: Dave <dave.costakos>
Component: e2fsprogs Assignee: Eric Sandeen <esandeen>
Status: CLOSED INSUFFICIENT_DATA QA Contact: BaseOS QE <qe-baseos-auto>
Severity: high Docs Contact:
Priority: low    
Version: 5.1 CC: fhirtz, sct, zbrown
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-01-26 21:20:57 UTC Type: ---
Attachments:
Description                                     Flags
Output of fsck.ext2 command on the filesystem   none
Output of fsck.ext3 command                     none

Description Dave 2009-09-03 17:13:13 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:
Always with this particular corrupt filesystem


Steps to Reproduce:
1. Run fsck on the filesystem.
2. After a while, fsck stops printing output and spins up to 100% CPU utilization.
3. fsck hangs forever and Ctrl-C doesn't stop the process. Only kill can stop it.
  
Actual results:
Unclean filesystem, 600 GB of lost data


Expected results:
fsck cleans up the filesystem and we continue operating


Additional info:
I have the output of fsck.ext2 and of fsck.ext3 (both result in the same issue), which I am attaching, along with core dumps of both processes.

Comment 1 Dave 2009-09-03 17:13:45 UTC
Created attachment 359705 [details]
Output of fsck.ext2 command on the filesystem

Comment 2 Dave 2009-09-03 17:17:09 UTC
Running strace -p <PID> against fsck.ext2 after it "hangs" produces an empty output file, yet the process is using 100% of one CPU.

I've also uploaded the core dump, core-bz521107.bz2, to dropbox.redhat.com/incoming. This one is for fsck.ext2.
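
An empty strace trace combined with 100% CPU usually suggests the process is looping in user space without making system calls. A minimal sketch for checking that from /proc, assuming a Linux /proc/<PID>/stat layout in which fields 14 and 15 are utime and stime in clock ticks. The helper name cpu_ticks is hypothetical and not part of e2fsprogs:

```shell
#!/bin/sh
# Print "utime stime" (in clock ticks) from a /proc/<PID>/stat line on stdin.
# Field 2 (the command name) may contain spaces, so everything up to the
# last ") " is stripped first; after that, field 14 becomes $12 and
# field 15 becomes $13.
cpu_ticks() {
    sed 's/^.*) //' | awk '{ print $12, $13 }'
}
```

Running `cpu_ticks < /proc/<PID>/stat` twice, a second or so apart, gives two samples to compare: if the first number (user time) keeps growing while the second (system time) stays flat, the process is likely spinning in its own code rather than waiting on the kernel or the storage.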

Comment 3 Dave 2009-09-03 17:26:12 UTC
Created attachment 359706 [details]
Output of fsck.ext3 command

Comment 4 Dave 2009-09-03 17:27:41 UTC
I also put core-fsck.ext3-bz521107.bz2 on dropbox.redhat.com/incoming. It is the core dump of the fsck.ext3 process.

Comment 5 Eric Sandeen 2009-09-03 17:42:03 UTC
Could you please create an "e2image -r" of the problematic filesystem, compress it, and provide it for analysis?  I can probably work backwards from the corefile, but with a filesystem image I could verify any fix.  If there is concern about sensitive filenames, the -s option will scramble them up in the image.

Thanks,
-Eric
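
The request above could be sketched roughly as follows, assuming the installed e2image accepts "-" as the output file for raw images (the form shown in the e2image man page) and that bzip2 is available. The function name save_image and the E2IMAGE override variable are hypothetical, added only for illustration:

```shell
#!/bin/sh
# Write a compressed raw (-r) metadata image of the filesystem on device $1
# to file $2, scrambling filenames (-s) in the image.
# E2IMAGE is a hypothetical override for the e2image binary to run.
# Note: the pipeline's exit status is bzip2's, so an e2image failure has to
# be spotted from its error output or from a suspiciously small result file.
save_image() {
    ${E2IMAGE:-e2image} -s -r "$1" - | bzip2 > "$2"
}
```

For example, `save_image /dev/mapper/<VG>-<LV> fsimage.e2i.bz2`, run as root against the unmounted filesystem. Drop -s if the filenames are not sensitive.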

Comment 6 Dave 2009-09-03 17:51:11 UTC
Thanks Eric, 

Here's the output of my command:

[root@almcrpstg01 workspace]# e2image -s -r /dev/mapper/almcrpprd03VG-localmnt2  fsimage-bz521107.img
e2image 1.39 (29-May-2006)
e2image: A block group is missing an inode table while getting next inode

It creates a 0 byte image file:

[root@almcrpstg01 workspace]# ls -lah fsimage-bz521107.img 
-rw------- 1 root bin 0 Sep  3 10:50 fsimage-bz521107.img

Comment 7 Dave 2009-09-03 17:51:43 UTC
Note:  also opened service request 1949408 for this.

Comment 8 Eric Sandeen 2009-09-03 18:10:33 UTC
Ah crud.

Just in case more recent e2fsprogs can handle this, you might try installing e4fsprogs (userspace for the ext4 tech preview) and running e4image ... but I bet it dies the same way.

I'll try to look backwards from the core.

Any idea what happened to this filesystem?

-Eric

Comment 9 Dave 2009-09-03 19:54:24 UTC
I think it was pretty straightforward in this case.  My understanding is that we initiated a reboot (using the reboot command manually), most probably while some application was still using this filesystem.  I'm presuming that the process didn't get killed off before the system rebooted.

Before that reboot, we noted a syslog error message saying that multipathd segfaulted.  When the system came back up, this filesystem would not fsck.

Comment 10 Eric Sandeen 2009-09-03 20:00:57 UTC
When e2fsck is running, are you getting any errors in dmesg from the storage?

e2fsck should handle it more gracefully of course, but I wonder if everything got put back together again properly after the reboot ...

Thanks,
-Eric
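
One way to look for such errors is to filter the kernel log while e2fsck runs. A minimal sketch, assuming a POSIX shell with grep -E. The function name storage_errors is hypothetical, and the pattern list is illustrative rather than exhaustive:

```shell
#!/bin/sh
# Filter kernel log text on stdin for lines that commonly indicate storage
# trouble (block-layer I/O errors, SCSI errors, multipath path failures).
# The patterns are an assumption about typical message wording and will
# not catch every driver's messages.
storage_errors() {
    grep -Ei 'I/O error|SCSI error|medium error|failing path'
}
```

Usage would be `dmesg | storage_errors`. grep exits non-zero when nothing matches, which here just means no line matched these particular patterns, not that the storage is known to be healthy.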

Comment 11 Dave 2009-09-03 20:19:57 UTC
I've looked and I just don't see anything.  Unfortunately, this was also a down application: we had to completely wipe this filesystem to get our internal customer back up, so I can't try running fsck on the broken filesystem again.  I'm also not confident that we can reproduce this error reliably (though I will try).

Comment 12 Eric Sandeen 2011-01-26 21:20:57 UTC
I wasn't able to sort out from the core what was wrong, and without access to the broken image, this will be nigh impossible to fix. I'm afraid I'll have to close this one.