439148 – e2fsprogs still doesn't repair large software raids (320GB)

Summary: e2fsprogs still doesn't repair large software raids (320GB)

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	e2fsprogs
Sub Component:
Version:	4.8
Hardware:	i386
OS:	Linux
Priority:	low
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Eric Sandeen
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-03-27 11:06 UTC by Curtis Falany
Modified:	2009-07-30 20:25 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-07-30 20:25:31 UTC
Target Upstream Version:
Embargoed:

Attachments
fsck report from boot to single user (791 bytes, text/plain) 2008-03-28 05:33 UTC, Curtis Falany
debug output from boot to single user (275.52 KB, text/plain) 2008-03-28 05:35 UTC, Curtis Falany
Data you requested with my apologies (1.22 KB, text/plain) 2008-04-25 00:33 UTC, Curtis Falany

Description Curtis Falany 2008-03-27 11:06:32 UTC

Description of problem:  A motherboard (M/B) failure left an error in the
software RAID: fsck reports "resize inode not valid" and "unexpected
inconsistency".


Version-Release number of selected component (if applicable): 1.35-12.11 (latest
from up2date)


How reproducible:  Run fsck on the failed /dev/md0; fsck fails to repair the
error.  I have no idea how to recreate the error itself.


Steps to Reproduce:
1. Run fsck -y /dev/md0
2. Answer yes
  
Actual results:  no change - error still exists


Expected results:  error repaired 


Additional info:  This was posted as fixed in the latest release but apparently
was not...

Comment 1 Eric Sandeen 2008-03-27 14:10:48 UTC

Could you attach the full e2fsck output, as well as the output of the following
command, so I can hopefully see what is wrong with the inode:

debugfs -c /dev/md0
debugfs: stat <7>

Thanks,
-Eric

Comment 2 Curtis Falany 2008-03-28 05:33:56 UTC

Created attachment 299434 [details]
fsck report from boot to single user

Comment 3 Curtis Falany 2008-03-28 05:35:17 UTC

Created attachment 299435 [details]
debug output from boot to single user

Comment 4 Curtis Falany 2008-03-28 05:39:00 UTC

The results of fsck -y /dev/md0 seem different when run after booting as a 
single user than they do when run after the system drops to 'maintenance' mode.

I will try to get another set from the 'maintenance' failure but time presses.  
The customer wants his server back up.

Comment 5 Eric Sandeen 2008-03-28 14:02:48 UTC

Hm, so, this time no messages about the resize inode, but a different error?

Perhaps making an e2image of the fs for safekeeping & later analysis would be good.

In the meantime...

It appears that the only 2 problems are now 2 directory inodes in lost+found/
which have bad (unfixable?) parent entries...

So it looks like the rest of the fs is fixed, w.r.t. the pressing time issue.

Can you try "stat <15302615>" and "ls <15302615>" in debugfs, and the same for
the other inode (37017175)?  I'm guessing they might say "size 0".

Unless you know it's critical data in those lost+found files, we can maybe just
nuke them with debugfs, with either kill <inode> or rm <inode> and perhaps a
subsequent repair.

I've found one other report of this problem, but no solution yet, can't tell if
it's fixed upstream.  I hope with the stat information I can recreate it here to
investigate.

Comment 6 Eric Sandeen 2008-04-11 15:54:54 UTC

Can you provide the information requested?

Thanks,
-Eric

Comment 7 Curtis Falany 2008-04-25 00:33:18 UTC

Created attachment 303715 [details]
Data you requested with my apologies

You were right about the length.  I just had to get the computer back from the
customer for a while.

Comment 8 Eric Sandeen 2008-04-25 01:30:24 UTC

Thanks!  If you do have the computer for a bit now you might make an e2image of
the filesystem,  something like

    e2image -r /dev/sda1 - | bzip2 > sda1.e2i.bz2 

and we could do further analysis later on that, if needed.

I'll see if I can work out anything from the info provided.

Comment 9 Eric Sandeen 2008-04-25 04:05:26 UTC

If there's any chance of getting access to the image, it'd be greatly helpful. 
So far I've not been able to recreate a filesystem with corruption which behaves
in the same way...

Comment 10 Eric Sandeen 2008-04-25 04:34:36 UTC

The thing that's very odd here is that in order to get a message like:

'..' in /lost+found/#15302615 (15302615) is <The NULL inode> (0), should be
/lost+found (11).

then, well, the inode nr. needs to be 0, as it says.  But that was in pass3, and
in pass2, we do this:

static int check_dotdot(e2fsck_t ctx,
                        struct ext2_dir_entry *dirent,
                        struct dir_info *dir, struct problem_context *pctx)
{
        int             problem = 0;

        if (!dirent->inode)
                problem = PR_2_MISSING_DOT_DOT;

and since !dirent->inode (this is the inode nr; it's 0) then we'd get:

Pass 2: Checking directory structure
Missing '..' in directory inode 15302615.

I'm just not seeing how we can get

'..' in /lost+found/#15302615 (15302615) is <The NULL inode> (0), should be
/lost+found (11).

without first seeing:

Missing '..' in directory inode 15302615.
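
The ordering argument above can be sketched as a toy model. This is not e2fsck's code: the struct, enum, and function names are made up for illustration, and it assumes both passes look at the same '..' inode number, which is exactly the assumption the observed output seems to contradict.

```c
/* Toy model only: every name here is hypothetical, not e2fsck's real API. */

enum toy_finding {
    TOY_NONE,
    TOY_PASS2_MISSING_DOTDOT,  /* "Missing '..' in directory inode N." */
    TOY_PASS3_BAD_DOTDOT       /* "'..' in ... is X, should be Y." */
};

struct toy_dir {
    unsigned int ino;          /* directory inode number */
    unsigned int dotdot_ino;   /* inode number stored in the '..' entry */
    unsigned int parent_ino;   /* parent the directory should point at */
};

/* Pass 2 analogue: a '..' entry with inode number 0 is reported as missing. */
enum toy_finding toy_pass2(const struct toy_dir *d)
{
    return d->dotdot_ino == 0 ? TOY_PASS2_MISSING_DOTDOT : TOY_NONE;
}

/* Pass 3 analogue: '..' must match the expected parent. */
enum toy_finding toy_pass3(const struct toy_dir *d)
{
    return d->dotdot_ino != d->parent_ino ? TOY_PASS3_BAD_DOTDOT : TOY_NONE;
}
```

With ino 15302615, a '..' inode of 0, and expected parent 11, both toy checks fire in this model, so a pass 3 complaint about a NULL '..' without the pass 2 message suggests the two passes are not seeing the same value.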

Comment 11 Curtis Falany 2008-04-25 13:34:50 UTC

The image should fit on a 40 GB or so disk.

How about I make the image copy to a USB drive and ship it to you?

Comment 12 Curtis Falany 2008-04-25 13:45:29 UTC

Consider this:  When we booted the system with /dev/md0 in fstab, the automatic 
mount attempt runs a filesystem check, which /dev/md0 fails.  This drops 
the system into a single user 'maintenance mode.'  The time or two we ran 
e2fsprogs or fsck in this 'maintenance mode,' we received a different, but 
consistent, report on /dev/md0.

I haven't put /dev/md0 back into fstab because I am afraid our problem will 
change while we are looking at it.  The system runs because I manually 
mount /dev/md0 and ignore the recommended file check.

Comment 13 Eric Sandeen 2008-04-29 15:18:17 UTC

(sorry, somehow I missed the bug updates for a couple days)

> How about I make the image copy to a USB drive and ship it to you?

Is 40GB compressed?  I'd hoped it would be a bit smaller.

Before we resort to that, as long as you have the image now, let me see if I can
have you run a few other things to try to debug the problem.  I'll look into
this a bit more, and follow up with some requests for more info.  Another option
might be guest access to a machine where I could do some debugging.  But, if you
are amenable to physically sending the image, perhaps that would be the fastest
path.

Thanks,
-Eric

Comment 14 Curtis Falany 2008-04-29 18:13:31 UTC

I'll see what I can do about a 'guest' system.

Comment 15 Eric Sandeen 2008-06-03 04:58:29 UTC

Any news about getting access to a filesystem image one way or another?

Thanks,
-Eric

Comment 16 Curtis Falany 2008-06-28 21:43:41 UTC

We had the techs out this AM to review the situation and the possibilities.  We 
have been unable to duplicate the problem.

What additional reports or files can we produce and attach to this bug report 
that would assist you?  That would be easier than punching you through two 
firewalls.

If we can arrange for you to view the production system as a last resort, can 
you learn anything in a NON-destructive mode that would help?  Due to the 
nature of the information, we would also have to have an executed 
non-disclosure agreement for access.

Thanks

Comment 17 Eric Sandeen 2008-06-29 02:30:43 UTC

Do you still have a copy of the bad fs?  And, was e2image really 40GB compressed?

When you say you can't duplicate the problem... does that mean you no longer
have a filesystem which exhibits this failing fsck behavior, or?

Just a note, I'm out of the office 'til next Thurs.

Thanks,
-Eric

Comment 18 Eric Sandeen 2008-06-29 04:19:27 UTC

I also might be able to give you a custom e2fsck which would print more info as
it goes... that might be the simplest route at first.

I'll cook something up when I get back.

-Eric

Comment 19 Eric Sandeen 2008-12-02 20:03:10 UTC

Any more info on this, or will it be lost to the mists of time... :)  (do you still have a copy of the bad fs?)

Comment 20 Eric Sandeen 2009-07-30 20:25:31 UTC

I'm afraid that without more info, we're not going to be able to fix this one.  If you are able to provide anything else that might offer a clue, feel free to re-open.

thanks,
-Eric
