Bug 57592

Summary:

disk corruption on 2.4.9-13smp w/adaptec SCSI

Product:

[Retired] Red Hat Linux

Reporter:

Preston Brown <pbrown>

Component:

kernel

Assignee:

Arjan van de Ven <arjanv>

Status:

CLOSED WORKSFORME

QA Contact:

Brock Organ <borgan>

Severity:

medium

Priority:

high

Version:

7.2

CC:

sct, wendell

Hardware:

i386

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2003-06-07 23:48:48 UTC

Attachments:

Description	Flags
ksymoops output	none

Description Preston Brown 2001-12-17 03:55:50 UTC

Description of Problem:


Since upgrading from 7.1, I've been getting occasional corruption on my SCSI
drive, along with errors in the system log.  I have a 2-way Pentium II system
with 512 MB RAM and an Adaptec SCSI controller.

Version-Release number of selected component (if applicable):
2.4.9-13smp

How Reproducible:
Randomly.  I get oopses and other strange ext3 errors in my logs.

The latest oops, decoded with ksymoops, is attached.

Comment 1 Preston Brown 2001-12-17 03:56:31 UTC

Created attachment 40787
ksymoops output

Comment 2 Stephen Tweedie 2001-12-17 12:25:09 UTC

The BUG() is coming from 

		J_ASSERT_JH(jh, buffer_uptodate(jh2bh(jh)));

which is somewhere deep in the journaling layer getting upset about the fact
that there have been IO failures elsewhere.  That shouldn't happen, but the
failure is in an inode table, which is a bit hard to recover from if the block
goes bad after you've already started using it.

Part of the underlying problem here is the stupid block device layer, which only
has one bit of error state and which heavy-handedly marks blocks as being
non-uptodate if a write error occurs.  That should be fixed for 2.5, but all
filesystems will have the problem in 2.4 that they cannot reliably tell what
blocks are actually uptodate in the presence of write errors.  So the ext3
assert fail should probably be relaxed: I'll reproduce this and fix.
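
To illustrate the single-bit problem described above, here is a minimal sketch in
plain C.  It is not the kernel's code: the struct and every sketch_* name are
hypothetical, and it only models the idea that one "uptodate" flag has to stand
for both "contents are valid" and "the last write succeeded".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a buffer with a single bit of state. */
struct sketch_buffer {
    bool uptodate;   /* the only state a filesystem can inspect */
};

/* A successful read fills the buffer, so its contents become valid. */
static void sketch_end_read(struct sketch_buffer *b, bool io_ok)
{
    b->uptodate = io_ok;
}

/* A failed write clears the same flag, even though the in-memory
 * copy of the data is still perfectly valid. */
static void sketch_end_write(struct sketch_buffer *b, bool io_ok)
{
    if (!io_ok)
        b->uptodate = false;
}

/* Same shape as the journaling check: it can only see the one flag. */
static bool sketch_journal_check(const struct sketch_buffer *b)
{
    return b->uptodate;
}

/* Returns true if the check would fire after a write error hits a
 * buffer whose contents were valid beforehand. */
static bool sketch_write_error_trips_check(void)
{
    struct sketch_buffer b = { .uptodate = false };

    sketch_end_read(&b, true);          /* contents now valid in memory */
    assert(sketch_journal_check(&b));   /* check passes at this point */

    sketch_end_write(&b, false);        /* write-back fails on the device */
    return !sketch_journal_check(&b);   /* flag is gone, check would fire */
}
```

In this model the check cannot distinguish "never read" from "read fine, then a
write failed", which is the ambiguity that makes a strict assertion on the flag
too harsh once IO errors are possible.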

Just so that I can decode the trace a little more accurately, can you tell me
which kernel version this is?  We have 3 different 2.4.9-13smp kernels: one each
for i586, i686 and athlon.

I don't know if this answers your problem: it's not clear from the report
whether you are wanting the SCSI IO errors or the filesystem fixed.

Comment 3 Preston Brown 2001-12-17 18:53:50 UTC

This is with the i686 -13smp kernel.

I hadn't had any I/O errors for several days leading up to this crash, and the 
disk had been fsck'd, so I was assuming this might not be due to those 
previous SCSI I/O problems.  I will update the report if anything else happens 
of consequence.

Comment 4 Need Real Name 2002-01-11 17:16:02 UTC

I'm seeing basically the exact same problem. This is an IBM xSeries 350 with
integrated Adaptec U160 SCSI and a secondary /data filesystem on an external
RAID attached to the Adaptec SCSI. The primary boot filesystems are on an IBM
RaidServ 4LX RAID controller and are fine. The filesystem on the Adaptec is
getting I/O errors and hangs, the load average goes to 9+, etc. The machine
runs the 2.4.9-13smp Red Hat kernel with 2 x 700 MHz P3 Xeon processors and
1.5 GB RAM. I am planning on booting back to the stock 2.4.7 kernel to see if
it's more stable.

The server I described has an _exact_ twin sitting right beside it as a load-
balanced fault-tolerant redundant server and it's experiencing the exact same 
problem. I'm fairly certain it's not faulty hardware.

Comment 5 Stephen Tweedie 2002-01-11 17:30:06 UTC

Wendell, what sort of errors are being reported in your logs?  If there are no
scsi errors being reported then this may be a new fs bug; otherwise it's likely
to be the fs getting confused by errors coming back from the scsi layer, and
that may indicate a driver or controller fault.

Comment 6 Alan Cox 2003-06-07 23:48:48 UTC

No reply in over a year - closing