521107 – fsck cannot clean up filesystem, eventually hangs forever

Summary: fsck cannot clean up filesystem, eventually hangs forever

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	e2fsprogs
Sub Component:
Version:	5.1
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Eric Sandeen
QA Contact:	BaseOS QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2009-09-03 17:13 UTC by Dave
Modified:	2011-01-26 21:20 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-01-26 21:20:57 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Output of fsck.ext2 command on the filesystem (775.34 KB, text/plain) 2009-09-03 17:13 UTC, Dave	no flags
Output of fsck.ext3 command (349.64 KB, text/plain) 2009-09-03 17:26 UTC, Dave	no flags

Description Dave 2009-09-03 17:13:13 UTC

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:
Always with this particular corrupt filesystem


Steps to Reproduce:
1. Run fsck.
2. After a while fsck stops printing output and spins up to 100% CPU utilization.
3. fsck hangs forever and Ctrl-C doesn't stop the process. Only kill can stop it.
  
Actual results:
Unclean filesystem, 600 GB of lost data


Expected results:
fsck cleans up the filesystem and we continue operating


Additional info:
I have the fsck output and a core dump for fsck.ext2, and the same for fsck.ext3 (both result in the same issue).  I'm also attaching core dumps of both processes.

Comment 1 Dave 2009-09-03 17:13:45 UTC

Created attachment 359705
Output of fsck.ext2 command on the filesystem

Comment 2 Dave 2009-09-03 17:17:09 UTC

Running strace -p <PID> on fsck.ext2 after it is "hung" produces an empty file (no system calls are traced), yet the process is using 100% of one CPU.
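
An empty strace combined with a fully busy CPU usually suggests a loop in user space that makes no system calls. One way to check that, sketched here under the assumption of a Linux /proc filesystem, is to sample the user and system CPU counters from /proc/<PID>/stat. The function name cpu_ticks is made up for illustration and is not part of any tool mentioned in this bug.

```shell
#!/bin/sh
# Hypothetical helper: print "utime stime" (in clock ticks) for a PID,
# taken from /proc/<PID>/stat (fields 14 and 15 as described in proc(5)).
cpu_ticks() {
    # The comm field may contain spaces, so drop everything up to the
    # last ") " first; utime and stime are then fields 12 and 13.
    sed 's/^.*) //' "/proc/$1/stat" | awk '{ print $12, $13 }'
}

# Usage sketch (PID is a placeholder):
#   cpu_ticks "$PID"; sleep 5; cpu_ticks "$PID"
```

If the first number keeps climbing between samples while the second stays flat, the process is spinning in user space, which would be consistent with the empty strace.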

I've also uploaded the core dump core-bz521107.bz2 to dropbox.redhat.com/incoming; this one is for fsck.ext2.

Comment 3 Dave 2009-09-03 17:26:12 UTC

Created attachment 359706
Output of fsck.ext3 command

Comment 4 Dave 2009-09-03 17:27:41 UTC

Also put core-fsck.ext3-bz521107.bz2 on dropbox.redhat.com/incoming; this is the core dump of the fsck.ext3 process.

Comment 5 Eric Sandeen 2009-09-03 17:42:03 UTC

Could you please create an "e2image -r" of the problematic filesystem, compress it, and provide it for analysis?  I can probably work backwards from the corefile, but with a filesystem image I could verify any fix.  If there is concern about sensitive filenames, the -s option will scramble them up in the image.

Thanks,
-Eric
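
The request above could be carried out roughly as follows. This is only a sketch: the device path and output file name are placeholders, the wrapper function image_cmd is hypothetical, and writing the raw image to "-" and piping it into bzip2 follows the example in the e2image man page. Whether -s is available depends on the e2fsprogs version installed. The function prints the pipeline instead of running it, so it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical wrapper: print (do not run) a pipeline that writes a raw
# metadata image of DEV to stdout and compresses it into OUT.
# Pass "scramble" as a third argument to add -s (scramble filenames).
image_cmd() {
    dev=$1 out=$2 opts="-r"
    [ "${3:-}" = "scramble" ] && opts="-s -r"
    printf 'e2image %s %s - | bzip2 > %s\n' "$opts" "$dev" "$out"
}

# Placeholder device and file name; review the printed command before use.
image_cmd /dev/VG/LV fsimage.e2i.bz2 scramble
```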

Comment 6 Dave 2009-09-03 17:51:11 UTC

Thanks Eric, 

Here's the output of my command:

[root@almcrpstg01 workspace]# e2image -s -r /dev/mapper/almcrpprd03VG-localmnt2  fsimage-bz521107.img
e2image 1.39 (29-May-2006)
e2image: A block group is missing an inode table while getting next inode

It creates a 0 byte image file:

[root@almcrpstg01 workspace]# ls -lah fsimage-bz521107.img 
-rw------- 1 root bin 0 Sep  3 10:50 fsimage-bz521107.img

Comment 7 Dave 2009-09-03 17:51:43 UTC

Note:  also opened service request 1949408 for this.

Comment 8 Eric Sandeen 2009-09-03 18:10:33 UTC

Ah crud.

Just in case more recent e2fsprogs can handle this, you might try installing e4fsprogs (userspace for the ext4 tech preview) and running e4image ... but I bet it dies the same way.

I'll try to look backwards from the core.

Any idea what happened to this filesystem?

-Eric

Comment 9 Dave 2009-09-03 19:54:24 UTC

I think it was pretty straightforward in this case.  My understanding is that we initiated a reboot (using the reboot command manually), most probably while some application was still using this filesystem.  I'm presuming that the process didn't get killed off before the system rebooted.

Before that reboot, we noted a syslog error message saying that multipathd segfaulted.  When the system came back up, this filesystem would not fsck.

Comment 10 Eric Sandeen 2009-09-03 20:00:57 UTC

When e2fsck is running are you getting any errors in dmesg from the storage?

e2fsck should handle it more gracefully of course, but I wonder if everything got put back together again properly after the reboot ...

Thanks,
-Eric
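
One way to look for storage errors while e2fsck runs is to filter the kernel log for block-layer and multipath messages. The sketch below is an assumption-laden illustration: the pattern list is a guess at common message fragments and is not exhaustive, and the function name storage_errors is invented here.

```shell
#!/bin/sh
# Hypothetical filter: read kernel log text on stdin and keep lines that
# look like storage trouble. The patterns are illustrative, not complete.
storage_errors() {
    # "|| true" keeps the exit status zero when nothing matches.
    grep -i -E 'I/O error|SCSI error|medium error|multipath' || true
}

# Usage sketch: dmesg | storage_errors
```

Empty output only means none of these patterns appeared; it does not prove the storage path is healthy.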

Comment 11 Dave 2009-09-03 20:19:57 UTC

I've looked and I just don't see anything.  Unfortunately, this is also a down application.  We had to completely wipe this filesystem to get our internal customer back up, so I can't try running fsck against the broken filesystem again.  Also, I'm not confident that we can reproduce this error reliably (though I will try).

Comment 12 Eric Sandeen 2011-01-26 21:20:57 UTC

I wasn't able to sort out from the core what was wrong, and I am afraid that without access to the broken image, this will be nigh impossible to fix... I'm afraid I'll have to close this one.
