146247 – EXT3-fs error results in corrupt file system

Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	kernel
Version:	3.0
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Stephen Tweedie
QA Contact:	Brian Brock

Reported:	2005-01-26 13:42 UTC by Keith Winston
Modified:	2007-11-30 22:07 UTC
CC List:	3 users
Doc Type:	Bug Fix
Last Closed:	2005-01-27 12:36:40 UTC

Description Keith Winston 2005-01-26 13:42:11 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5)
Gecko/20041107 Firefox/1.0

Description of problem:
About three days after upgrading to kernel 2.4.21-20, I received
hundreds of EXT3 error messages in /var/log/messages:

Dec 21 01:05:01 linux01 kernel: EXT3-fs error (device cciss0(104,1)):
ext3_new_block: Allocating block in system zone - block = 96
Dec 21 01:05:01 linux01 kernel: EXT3-fs error (device cciss0(104,1)):
ext3_new_block: Allocating block in system zone - block = 160
Dec 21 01:05:01 linux01 kernel: EXT3-fs error (device cciss0(104,1)):
ext3_new_block: Allocating block in system zone - block = 161
Dec 21 01:05:01 linux01 kernel: EXT3-fs error (device cciss0(104,1)):
ext3_new_block: Allocating block in system zone - block = 162
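
For triage, the messages above can be condensed into a per-device list of block numbers. The following is a minimal sketch, assuming the log keeps exactly the format quoted here; the function name `system_zone_blocks` is hypothetical, and whitespace is collapsed first so that lines wrapped as in this report still match.

```python
import re
from collections import defaultdict

# Matches one "Allocating block in system zone" message in the format
# quoted in this report, e.g. device "cciss0(104,1)" and block "96".
PATTERN = re.compile(
    r"EXT3-fs error \(device (?P<dev>[^\s()]+\(\d+,\d+\))\):\s+"
    r"ext3_new_block: Allocating block in system zone - block = (?P<block>\d+)"
)

def system_zone_blocks(log_text):
    """Return {device: [block, ...]} for every matching message in log_text."""
    # Collapse all whitespace so a message wrapped over two lines is
    # rejoined before matching.
    flat = " ".join(log_text.split())
    blocks = defaultdict(list)
    for match in PATTERN.finditer(flat):
        blocks[match.group("dev")].append(int(match.group("block")))
    return dict(blocks)
```

Run against the saved /var/log/messages text, this shows at a glance which low-numbered blocks were being handed out.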

At the time, I was using the SMP kernel on an Intel P4 with
hyperthreading.  Brand new HP server, Compaq RAID SCSI controller with
two RAID1 disks.  This system showed no errors and was up for over a
year without problems.

When I noticed the errors, the machine was still running.  An 'ls' in
the root directory showed no files.  The hard disks showed no
indication of a problem.  I brought the machine to run level 1 to run
fsck.  Fsck repaired hundreds of file system errors.  When it was
finished, there were no files left.  The entire system had been
destroyed and it would not boot.

I was forced to do a bare metal restore from tape and downgraded the
kernel to the one that shipped on the RHEL 3 CDs.  I also ran the UP
kernel instead of SMP.  Since the kernel downgrade, the server has
been up for a month with no sign of the previous errors.

I speculate that the problem is in the ext3 code and gets triggered
when using the SMP kernel.

Version-Release number of selected component (if applicable):
3

How reproducible:
Didn't try

Steps to Reproduce:
1. Use kernel vmlinuz-2.4.21-20.ELsmp with ext3

Additional info:

Comment 2 Stephen Tweedie 2005-01-27 11:19:37 UTC

Unfortunately there's not enough information to even begin to diagnose this. 
Without the e2fsck logs or full kernel logs all I can see is that there was a
bitmap corruption in the first part of the filesystem.

Now, corruption near the start of the filesystem would also be consistent with
losing the root directory.  If that happened, e2fsck would recover most of your
files, but would place them into /lost+found.  "No files left" is not a
situation that e2fsck normally results in --- but if the root dir is gone,
finding all files moved into /lost+found is expected.

So it looks as if we lost some data at the beginning of the disk, including the
initial block bitmap and root directory.  But even that is just a guess; as for
guessing *why*, I've no idea, but it could easily be bad hardware, some other
component of the kernel or driver stomping on memory, a raid fault, etc.

The problem description also sounds odd: if an "ls" showed no files in the root
directory, how did you successfully manage to bring it to runlevel 1?

This sounds undiagnosable as it stands, but I'll ask to see if anyone knows
about problems with cciss in that release.

Comment 3 Keith Winston 2005-01-27 12:18:45 UTC

Even though an "ls" showed no files in the root directory, I could
"cd" into a directory that I knew was there, like "cd bin".  Then I
could "ls" and see all the files in /bin.  Very strange.  Somehow, it
got to runlevel 1.

After the e2fsck, the only thing left was the lost+found directory,
but there were no files or partial files in lost+found.  It was empty.
I only found this out after booting from a rescue CD because the
reboot after the e2fsck failed.  The boot loader could not find the
kernel.

I found an old bug from 2002 where the ext3_new_block error showed up
when many small files were written to disk at the same time.  From
reading the kernel mailing list archives, it was not clear that the
problem was completely resolved, only mitigated.  It is also possible
that the bug is in the HP SCSI RAID driver (cciss).

I would gladly provide you with logs and core dumps if I had them, but
everything was blown away after the e2fsck.  I only have logs from the
backup the night before it died, showing all the ext3 errors.

Comment 4 Stephen Tweedie 2005-01-27 12:36:40 UTC

The ability to "cd" through the empty directory was probably because
"ls" was trying to read the directory --- which was broken --- whereas
"cd" was simply traversing the existing in-core directory cache tree
(which would have been pinned in memory because executable files from
/bin were still running.)
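
The difference between the two operations can be sketched from user space: listing a directory reads its entries, while entering a known subdirectory is a lookup by name. This is only an illustration under that assumption; the helper `probe_directory` is hypothetical, and whether a given lookup is answered from the in-core cache is decided by the kernel, not by this code.

```python
import os

def probe_directory(path, expected_names):
    """Compare enumerating a directory with looking up known names in it.

    Returns (listed, reachable, hidden), where hidden holds names that
    resolve by lookup but do not appear in the directory listing.
    """
    try:
        # What "ls" relies on: reading the directory's entries.
        listed = sorted(os.listdir(path))
    except OSError:
        listed = []

    reachable = []
    for name in expected_names:
        try:
            # What "cd bin" relies on: resolving one name under path.
            os.stat(os.path.join(path, name))
            reachable.append(name)
        except OSError:
            pass

    hidden = [name for name in reachable if name not in listed]
    return listed, reachable, hidden
```

On a healthy filesystem `hidden` is empty. A result such as `probe_directory("/", ["bin"])` returning an empty listing with "bin" still reachable would match the behaviour described in comment 3.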

The ext3_new_block error simply marks that ext3 has detected certain
on-disk data corruption.  On its own it tells us nothing about how it
got into that state.

Running e2fsck on a live, still-mounted filesystem is excessively
dangerous, and may have contributed to the problem.
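
One way to guard against this is to check /proc/mounts before running e2fsck. The following is a minimal sketch: the helper name `mounted_at` and the device path in the usage note are assumptions for illustration, and the root filesystem may be listed under a different name (for example `rootfs` or `/dev/root`), so an empty result is not proof that a device is unmounted.

```python
def mounted_at(device, mounts_text):
    """Return the mount points listed for device in /proc/mounts-style text.

    Each line of /proc/mounts has the form:
        device mountpoint fstype options dump pass
    """
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == device:
            points.append(fields[1])
    return points
```

For example, reading `/proc/mounts` into a string and calling `mounted_at("/dev/cciss/c0d0p1", text)` (a hypothetical device path) returns a non-empty list if the device is still mounted, in which case e2fsck should not be run against it.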

ext3 is tested very extensively, and I've got no reason to believe it
still has massive fs-destroying bugs.  

About all I can suggest is that you open a support ticket to find out
if there are any known problems with the particular combination of
CCISS driver version, firmware version, disks, and other software you
may have running.
