Bug 443145 - strange filesystem corruption on reboot
Summary: strange filesystem corruption on reboot
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.5
Hardware: i386
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Assignee: Eric Sandeen
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-04-18 20:17 UTC by andy
Modified: 2011-05-05 17:34 UTC
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-05-05 17:34:52 UTC
Target Upstream Version:
Embargoed:


Attachments
fsck output from IBM Blade running Redhat ES 4 update 4 (105.15 KB, text/plain)
2008-04-28 19:32 UTC, andy

Description andy 2008-04-18 20:17:50 UTC
Description of problem:
Observed in RHEL 4.5, 2.6.9-55.ELsmp (mpt version 3.02.73rh) on IBM HS21 Blade,
which uses LSI SAS1064E controller.  2 drives configured as RAID 1. Filesystem
is ext3.


On reboot, a host may boot with a very dirty filesystem.  Fsck tends to fail,
requiring manual intervention.  After cleanup, files will be missing in
seemingly random locations, though /var/log was reasonably consistent in having
missing files.

The only consistent observation is that boxes exhibiting this issue appear to
have more than 200 days of uptime.  We have not yet found a manual trigger for
this behavior.

IBM case 12RG9KK also opened on this issue.

Particulars:
IBM HS21 blade, SAS1064E controller, fw 0.10.15.00 as reported by LSI's cfggen
utility.
Redhat 4.5 2.6.9-55.ELsmp kernel, mpt version 3.02.73rh, ext3 filesystem.
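For anyone checking versions the same way, roughly what we ran (a sketch; the
cfggen invocation is from memory):

dmesg | grep -i 'Fusion MPT'         # mpt driver announces its version at load
modinfo mptbase | grep -i version    # module version as packaged
cfggen 0 display                     # LSI utility; reports controller firmware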

Bugzilla case opened on advice of jscalia for tracking purposes.

How reproducible:
Unfortunately, we haven't deduced a reliable, on-demand test case yet.  Reboot
and hope.

Comment 1 Eric Sandeen 2008-04-21 14:22:31 UTC
How often does this happen?  Do any errors appear in the logs prior to reboot,
are there any problems reported as the system is shutting down?

How does fsck fail?  Do you mean that fsck -a fails, but ultimately fsck is able
to repair the fs (or at least run to completion) when manually invoked?  Are the
missing files those which fsck reported moving to lost+found/ ?  What problems
were found with the filesystem?

There's just not a lot to go on at this point.  Any more information you can
gather would be helpful.

-Eric

Comment 2 andy 2008-04-21 14:45:50 UTC
Agreed, we've been trying to capture as much information as possible.  Once a
host is recovered, we've seen nothing unusual in the logs (assuming the logfile
survived the recovery).  We've seen a total of 8 examples so far, 6 in the last
month.

When I say that fsck fails (I wish I had captured one of those), it reaches a
point during boot where it can no longer continue and drops into single user
mode for repair.  At that point we've been rebooting into rescue mode and
running fsck by hand.  That fsck completes successfully, but leaves us unsure
which files might have been moved into lost+found.
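For next time, roughly the recovery procedure we've been following (a sketch;
device and rescue-mode mount point assumed):

fsck -y /dev/sda1                        # full repair pass from rescue mode
mount /dev/sda1 /mnt/sysimage            # remount the repaired root
ls -l /mnt/sysimage/lost+found | head    # eyeball what fsck orphaned
find /mnt/sysimage/lost+found -type f | wc -l   # how many files were salvaged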

Fsck primarily complained that the number of inodes found for a file was
greater than it expected, which smells like directory corruption to us.

We're actively working to try to replicate this somehow (without waiting 200
days).  I would leave this ticket at low priority and have it remain a
collection point for data as we can discover it.

Comment 3 Eric Sandeen 2008-04-21 14:53:25 UTC
Sounds like a plan, thanks.

Comment 4 andy 2008-04-24 01:33:21 UTC
Here's what a failure looks like on boot.  The fsck error (duplicate blocks
shared across 4 files) isn't one of the usual ones I come across, and I'm rather
curious what sort of event or action could cause it.  Transcription is from the
remote console window (actually.. multiple screenshots.. it keeps rebooting
because apparently /etc/passwd has bitten the dust).

Checking root filesystem
/ contains a file system with errors, check forced.
/:
Duplicate or bad block in use!
/: Duplicate/bad block(s) in inode: 7733410: 15495780 15495781 15495810 15495811
/: Duplicate/bad block(s) in inode: 7734079: 15495781
/: Duplicate/bad block(s) in inode: 7735469: 15495780
/: Duplicate/bad block(s) in inode: 7735473: 15495810
/: Duplicate/bad block(s) in inode: 7735519: 15495811
/: File /var/cache/man/whatis (inode #7733410, mod time Sun Apr 6 04:22:38 2008)
 has 4 duplicate block(s), shared with 4 file(s):
/:     /var/spool/clientmqueue/qfm3G501Ck000622 (inode #7735519, mod time Wed
Apr 16 01:00:01 2008)
/:     /var/spool/clientmqueue/qfm388229w004126 (inode #7735473, mod time Tue
Apr 8 04:02:02 2008)
/:     /var/spool/clientmqueue/dfm3G501Ck000622 (inode #7734079, mod time Wed
Apr 16 01:00:01 2008)
/:     /var/spool/clientmqueue/dfm388229w004126 (inode #7735469, mod time Tue
Apr 8 04:02:02 2008)
/:

/: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.

That's when it attempts to drop to a single-user shell; sulogin can't find the
password db, so it reboots.  I can boot this in rescue mode, fsck the system and
collect the errors/warnings, but I'll wait and see whether Red Hat wants more
data collected before I make filesystem changes.  Just let me know.

I've got 2 blades exhibiting this particular event; both went belly up at the
same time, after they were rebooted following a change in hostname.  This is the
same sort of behavior that we saw on the other 6 blade servers.


Comment 5 Eric Sandeen 2008-04-24 02:03:55 UTC
The boxes are totally fine for the 200(-ish) days prior to the reboot, and this
only shows up when you actually bring them up again?

Above, the files in question were last modified in the last week or two (Apr 6,
8, 16)...

Before you repair it, perhaps you can do dumpe2fs, or even make an e2image -r
image, to preserve the current state.  Depending on whether or not filenames are
sensitive, perhaps the image could be provided for further analysis.
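Something along these lines (a sketch, assuming the root filesystem is /dev/sda1):

dumpe2fs /dev/sda1 > sda1.dumpe2fs    # superblock and block-group summaries
e2image -r /dev/sda1 sda1.e2i         # raw metadata-only image (sparse file)
bzip2 sda1.e2i                        # compresses very well, per the e2image(8) manpage

Note the raw image contains no file data, but directory blocks (and thus
filenames) are included, which is why I ask about sensitivity.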

Is there anything else "unique" about these blades?  I've not seen this sort of
thing anywhere else, so perhaps something unique about your setup could be
triggering it.  I suppose I should ask, are there any 3rd party modules in use?

Thanks,
-Eric



Comment 6 andy 2008-04-24 13:01:55 UTC
Yep, that seems to be the attendant behavior.  Our first thought was write-cache
bugs or other issues with the RAID controller, but problem blades have had both
the SAS1064E and LSI53C1020/1030 controllers (SAS and SCSI).  Nothing consistent.
 Also, I hate to say it, but all of the Windows blades have been just fine.

I saw the file modification times, but they don't correspond to any human
intervention.  The hosts had been retired from previous work and had simply been
sitting around for a bit before being re-purposed.  I'm guessing regular cron
maintenance tasks.

On the second host that's broken now, its boot-time fsck output is similar:
/1: Duplicate/bad block(s) in inode 426133: 873016
/1: Duplicate/bad block(s) in inode 427934: 873016
/1: (There are 2 inodes containing duplicate/bad blocks.)
/1: File /var/log/prelink.log (inode #426133, mod time Thu Feb 21 04:02:03 2008)
 has 1 duplicate block(s), shared with 1 file(s):
/1:      /var/spool/clientmqueue/dfm2160166001562 (inode #427934, mod time Sat
Mar 1 01:00:01 2008)

I'll create filesystem images today.  Can those be attached to the bugzilla ticket?

Comment 7 Eric Sandeen 2008-04-24 13:36:23 UTC
Let's see how big the images are (bzipped per e2image manpage instructions...).
 I'm not sure how big an attachment you can make, or how public you want the
images to be, if they have any sensitive/private information.  ftp or whatnot
would be ok too.

Hmm; the 2nd host also has a file in clientmqueue with duplicately-mapped
blocks?  That's .... interesting ....



Comment 8 andy 2008-04-25 19:23:41 UTC
Compressed e2image file mailed to you; compressed down, it's about a meg.

Comment 9 andy 2008-04-28 18:43:20 UTC
New host crashed today in the middle of a tar operation.  We grabbed the e2image
file and will append errors to this ticket.

Comment 10 andy 2008-04-28 19:24:33 UTC
Fsck output is attached.  There was so much of it, in fact, that we couldn't
even keep up with the capture (mental note: typescript next time).  Disk image
is compressing.

Comment 11 andy 2008-04-28 19:32:56 UTC
Created attachment 304033 [details]
fsck output from IBM Blade running Redhat ES 4 update 4

Comment 12 Eric Sandeen 2008-05-02 14:48:31 UTC
so:

Illegal block #12 (192503584) in inode 1393785.  CLEARED.
Illegal block #13 (1734438497) in inode 1393785.  CLEARED.
Illegal block #14 (540701285) in inode 1393785.  CLEARED.
Illegal block #15 (1987208558) in inode 1393785.  CLEARED.
Illegal block #16 (1952526384) in inode 1393785.  CLEARED.
Illegal block #17 (1952525932) in inode 1393785.  CLEARED.
Illegal block #18 (1663971436) in inode 1393785.  CLEARED.
Illegal block #19 (1870226786) in inode 1393785.  CLEARED.
Illegal block #20 (1848534126) in inode 1393785.  CLEARED.
Illegal block #21 (1091204197) in inode 1393785.  CLEARED.
Illegal block #22 (153391975) in inode 1393785.  CLEARED.

... 

So, it appears that the first indirect block for this inode is not an indirect
block at all, but contains file data (the block numbers above actually translate
into an email address at cbeyond.net...!)
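To illustrate the decoding (a sketch; ext3 stores block pointers as
little-endian 32-bit values, so each bogus "block number" is really just 4
bytes of file data):

printf '%08x\n' 1734438497    # the block #13 entry above -> 67616e61
# read those bytes in on-disk (little-endian) order: 61 6e 61 67 = "anag"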

(Pointers to the first 12 blocks (0..11) are stored directly in the inode;
pointers for block 12 and beyond are gathered from indirect blocks.)
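For scale, a quick back-of-the-envelope (4 KiB block size assumed):

echo $(( 4096 / 4 ))           # 1024 pointers per indirect block
echo $(( 12 + 4096/4 - 1 ))    # file blocks 12..1035 covered by the first indirect block

so one stray data write into that block can garble the mapping for up to 4 MB
of the file.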

So either the inode points to the wrong indirect block, or the indirect block
was never updated with metadata, or it got overwritten by data... pretty hard to
say which, from the information at hand.

I'll look at the image you mailed, now... but wow, that fsck output really is a
mess.

Can you give me any more information about the machines; is the kernel tainted?
 How is storage set up (lvm? md? simple partitions?) etc...

Thanks,
-Eric

Comment 13 andy 2008-05-02 15:04:00 UTC
Stock kernel, right from the kickstart.

Storage is onboard LSI hardware raid, 2 drives in RAID 1.  Two partitions on
/dev/sda, one swap and one root.
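If it helps, roughly how we confirm that layout (a sketch):

fdisk -l /dev/sda      # two partitions: swap + root
swapon -s              # confirms which one is swap
cat /proc/scsi/scsi    # shows the LSI logical volume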

Comment 14 Eric Sandeen 2008-05-02 18:46:45 UTC
I wonder if there would be any value at all in checking each of the 2 legs of
the mirror independently?   Oh... if it's hardware raid, is that even possible?

Looking at the image you sent, no clues as to *why* things went wrong yet, but I
can characterize it.

/var/cache/man/whatis shares blocks with several files in /var/spool/clientmqueue/*

the "whatis" file is 51 blocks, timestamped Apr 6; it came first.
the mqueue files are each 1 block, timestamped Apr 8 and 16, so they came later.

each file has a single value for atime, mtime, & ctime.

the single-block mqueue files seem to have "stolen" blocks from the whatis file,
which is laid out like:

BLOCKS:
(0-11):15495780-15495791, (IND):15495792, (12-15):15495793-15495796,
(16-23):15495798-15495805, (24-49):15495810-15495835
TOTAL: 51

the first 2 blocks (15495780, 15495781) are duplicated, as are the blocks in the
last extent (15495810, 15495811)

Because the pointers to the first 2 blocks are stored in the inode itself, which
looks perfectly fine, and because they are contiguous with a nice string of
blocks in the file, it seems the mqueue files that came along later were somehow
wrongly given those same 2 blocks (rather than, say, the whatis inode getting
corrupted).
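For the record, this sort of block map can be pulled with debugfs, against
either the device or the raw e2image file (image filename assumed):

debugfs -R 'stat /var/cache/man/whatis' sda1.e2i    # inode details plus the BLOCKS: map
debugfs -R 'icheck 15495780 15495781' sda1.e2i      # which inodes claim those blocks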

Comment 15 Eric Sandeen 2008-05-02 18:49:22 UTC
Oh, also, the filesystem was marked as having errors.  There should be something
in the logs as to why... any chance of finding that?

-/bin/sh-3.00# fsck -y /dev/sda1
fsck 1.35 (28-Feb-2004)
WARNING: couldn't open /etc/fstab: No such file or directory
e2fsck 1.35 (28-Feb-2004)
/ contains a file system with errors, check forced.
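If the logs survived the repair, something like this might turn up the original
complaint (log paths assumed):

grep -i 'EXT3-fs error' /var/log/messages*
grep -iE 'aborting journal|remounting filesystem read-only' /var/log/messages*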

Comment 16 Eric Sandeen 2008-12-04 22:01:19 UTC
One other question on this; did the machines lose power prior to the reboot?  Are you still experiencing these problems?

Thanks,
-Eric

Comment 17 Eric Sandeen 2011-05-05 17:34:52 UTC
In NEEDINFO since 2008, and RHEL4 is now very late in life.  Closing WONTFIX unless something new comes up here.

