Bug 57592 - disk corruption on 2.4.9-13smp w/adaptec SCSI
Status: CLOSED WORKSFORME
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.2
Hardware: i386 Linux
Priority: high
Severity: medium
Assigned To: Arjan van de Ven
QA Contact: Brock Organ

Reported: 2001-12-16 22:55 EST by Preston Brown
Modified: 2007-03-26 23:50 EDT

Doc Type: Bug Fix
Last Closed: 2003-06-07 19:48:48 EDT

Attachments
ksymoops output (3.97 KB, text/plain)
2001-12-16 22:56 EST, Preston Brown

Description Preston Brown 2001-12-16 22:55:50 EST
Description of Problem:

Since upgrading from 7.1, I've been getting occasional corruption on my SCSI
drive, along with errors in the system log.  I have a 2-way Pentium II system
with 512 MB RAM and an Adaptec SCSI controller.

Version-Release number of selected component (if applicable):
2.4.9-13smp

How Reproducible:
Randomly.  I get oopses and other strange ext3 errors in my logs.

The latest oops, decoded with ksymoops, is attached.
Comment 1 Preston Brown 2001-12-16 22:56:31 EST
Created attachment 40787
ksymoops output
Comment 2 Stephen Tweedie 2001-12-17 07:25:09 EST
The BUG() is coming from 

		J_ASSERT_JH(jh, buffer_uptodate(jh2bh(jh)));

which is somewhere deep in the journaling layer getting upset about the fact
that there have been IO failures elsewhere.  That shouldn't happen, but the
failure is in an inode table, which is a bit hard to recover from if the block
goes bad after you've already started using it.

Part of the underlying problem here is the stupid block device layer, which only
has one bit of error state and which heavy-handedly marks blocks as being
non-uptodate if a write error occurs.  That should be fixed for 2.5, but all
filesystems will have the problem in 2.4 that they cannot reliably tell what
blocks are actually uptodate in the presence of write errors.  So the ext3
assert fail should probably be relaxed: I'll reproduce this and fix.
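
To make the point about a single bit of error state concrete, here is a toy userspace model in C. It is not kernel code: `toy_buffer`, `TOY_UPTODATE`, `toy_end_write` and the two check functions are hypothetical names invented for illustration, and the `write_failed` field stands in for extra state that, per the comment above, the 2.4 block layer does not keep. The relaxed check is only one guess at what relaxing the assertion could look like, not the fix that was actually made.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: a single state bit, loosely mirroring "uptodate". */
enum { TOY_UPTODATE = 1 };

struct toy_buffer {
    unsigned state;     /* the one bit of state the layer exposes */
    bool write_failed;  /* hypothetical extra state, assumed for this sketch */
};

/*
 * Write completion. With only one bit available, a write error can be
 * signalled only by clearing "uptodate", even though the in-memory copy
 * of the block is still perfectly valid.
 */
static void toy_end_write(struct toy_buffer *bh, bool ok)
{
    assert(bh != NULL);
    if (!ok) {
        bh->state &= ~(unsigned)TOY_UPTODATE;
        bh->write_failed = true;
    }
}

/* Strict check, in the spirit of the assertion quoted above. */
static bool toy_strict_check(const struct toy_buffer *bh)
{
    return (bh->state & TOY_UPTODATE) != 0;
}

/* Relaxed check: tolerate a cleared bit when a write error explains it. */
static bool toy_relaxed_check(const struct toy_buffer *bh)
{
    return (bh->state & TOY_UPTODATE) != 0 || bh->write_failed;
}
```

Starting from a buffer with `TOY_UPTODATE` set, calling `toy_end_write(&bh, false)` makes `toy_strict_check` fail while `toy_relaxed_check` still passes, which is the situation the strict assertion cannot distinguish from a buffer that was never read.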

Just so that I can decode the trace a little more accurately, can you tell me
which kernel version this is?  We have 3 different 2.4.9-13smp kernels: one each
for i586, i686 and athlon.

I don't know if this answers your problem: it's not clear from the report
whether you are wanting the SCSI IO errors or the filesystem fixed.
Comment 3 Preston Brown 2001-12-17 13:53:50 EST
This is with the i686 -13smp kernel.

I hadn't had any I/O errors for several days leading up to this crash, and the 
disk had been fsck'd, so I was assuming this might not be due to those 
previous SCSI I/O problems.  I will update the report if anything else happens 
of consequence.
Comment 4 Need Real Name 2002-01-11 12:16:02 EST
I'm seeing essentially the same problem on an IBM xSeries 350 with integrated
Adaptec U160 SCSI.  A secondary /data filesystem sits on an external RAID
attached through the Adaptec SCSI.  The primary boot filesystems are on an IBM
RaidServ 4LX RAID controller and are fine.  The filesystem on the Adaptec is
getting I/O errors and hangs, the load average goes to 9+, and so on.  The
machine runs the Red Hat 2.4.9-13smp kernel with 2 x 700 MHz P3 Xeon processors
and 1.5 GB RAM.  I am planning to boot back to the stock 2.4.7 kernel to see if
it's more stable.

The server I described has an _exact_ twin sitting right beside it as a
load-balanced, fault-tolerant redundant server, and it's experiencing the exact
same problem.  I'm fairly certain it's not faulty hardware.
Comment 5 Stephen Tweedie 2002-01-11 12:30:06 EST
Wendell, what sort of errors are being reported in your logs?  If there are no
scsi errors being reported then this may be a new fs bug; otherwise it's likely
to be the fs getting confused by errors coming back from the scsi layer, and
that may indicate a driver or controller fault.
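
The triage rule in the comment above can be sketched as a small C helper that scans log lines. The marker strings are assumptions about what a given kernel prints, not a verified list, and should be adjusted to match the actual logs; `triage_log` and the `enum triage` values are hypothetical names used only for this illustration.

```c
#include <stddef.h>
#include <string.h>

enum triage {
    TRIAGE_NO_ERRORS,  /* nothing matched */
    TRIAGE_FS_ONLY,    /* filesystem errors only: possibly a new fs bug */
    TRIAGE_IO_LAYER    /* I/O errors present: suspect driver or controller */
};

/* Assumed substrings; check them against what your kernel really logs. */
static const char *const io_markers[] = { "I/O error", "SCSI disk error" };
static const char *const fs_markers[] = { "EXT3-fs error" };

/* Return 1 if the line contains any of the n marker substrings. */
static int has_any(const char *line, const char *const *markers, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strstr(line, markers[i]) != NULL)
            return 1;
    return 0;
}

/* I/O errors take precedence: the fs may only be reacting to them. */
static enum triage triage_log(const char *const *lines, size_t count)
{
    int saw_io = 0, saw_fs = 0;

    for (size_t i = 0; i < count; i++) {
        saw_io |= has_any(lines[i], io_markers,
                          sizeof io_markers / sizeof io_markers[0]);
        saw_fs |= has_any(lines[i], fs_markers,
                          sizeof fs_markers / sizeof fs_markers[0]);
    }
    if (saw_io)
        return TRIAGE_IO_LAYER;
    if (saw_fs)
        return TRIAGE_FS_ONLY;
    return TRIAGE_NO_ERRORS;
}
```

A result of `TRIAGE_FS_ONLY` corresponds to the "may be a new fs bug" branch, and `TRIAGE_IO_LAYER` to the case where the filesystem is likely reacting to errors from below.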
Comment 6 Alan Cox 2003-06-07 19:48:48 EDT
No reply in over a year - closing
