57937 – System panics with "scsi_free:Bad offset" after 3-48 hours of use

Summary: System panics with "scsi_free:Bad offset" after 3-48 hours of use

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	7.2
Hardware:	i686
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Arjan van de Ven
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2002-01-02 19:08 UTC by Joshua M. Thompson
Modified:	2008-08-01 16:22 UTC (History)
CC List:	0 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2004-09-30 15:39:19 UTC
Embargoed:

Description Joshua M. Thompson 2002-01-02 19:08:37 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Q312461)

Description of problem:
Background: two MySQL servers running on identical Dell PowerEdge 6350
systems:

Dual PIII 550 processors
512 MB RAM
Dell PercRAID controller (using aacraid driver) running the OS drives
Dual Qlogic 2200 FC controllers
Dual dual-port Intel EEPro 100 cards

The machines share a Dell fibre channel drive array with two controllers; 
each machine has one qlogic card connected to each of the controllers on 
the array. The machines mount separate raid arrays from the unit and do 
not share drives. The array is used solely to store the MySQL database 
files.

Previously these machines ran for months on Red Hat 7.1 using kernel
2.4.9-12 i686 enterprise with no incidents. Recently we upgraded to 7.2 and
installed the 2.4.9-13 i686 enterprise kernel. Now the servers will run
anywhere from 3 hours to 2 days and then panic with the error:

scsi_free:Bad offset
In interrupt handler - not syncing

At that point the machine is stone dead.

The upgrade was done Friday and the crashes occurred over the holiday 
weekend when the database usage was much lighter than normal. At the 
moment I am totally perplexed because the 2.4.9-12 and 2.4.9-13 kernels 
are identical from what I can tell.

One last note. This happens with the plain smp kernels as well.


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Install RH 7.2
2. Upgrade to 2.4.9-13 kernel RPMs
3. Use machine and wait


Additional info:

Comment 1 Arjan van de Ven 2002-01-04 13:30:45 UTC

Ok this is strange. Very strange.
There is VERY little difference between 2.4.9-12 and 2.4.9-13: only the 3D
acceleration for XFree86 is different.

Comment 2 Michael K. Johnson 2002-01-04 15:55:34 UTC

Did you by any chance convert any (or all) of your filesystems to ext3
during the upgrade?  That would change the load pattern on the driver
and could possibly show up bugs in the drivers that would otherwise be
hidden; going from "never triggered" to "once every few hours/days"
doesn't necessarily take a huge change.

Comment 3 Joshua M. Thompson 2002-01-04 17:48:07 UTC

I think we have a winner, because one minor thing that was done while upgrading (at least, it _should_ be minor, but obviously isn't) is upping the journal size on the database partitions, since they're running in data=journal mode. Originally they were at the default of 4 MB and I had upped them to 32 MB; I've now moved them down to 8 MB and so far things are quite stable.

Comment 4 Michael K. Johnson 2002-01-04 18:30:08 UTC

OK, that's interesting.  I think it points to a driver bug in the qla2x00
driver rather than an ext3 bug.

Comment 5 Joshua M. Thompson 2002-01-08 16:57:09 UTC

Looking briefly at some of the scsi code I am wondering if this is actually
a scsi subsystem issue and not a qlogic issue. The error in question
("scsi_free:bad offset") occurs in the scsi_free() function in
drivers/scsi/scsi_dma.c, which is not called directly from the qlogic drivers. It
_is_ however called from the scsi scatter-gather code in scsi_lib.c in quite
a few places that could be interesting. Unfortunately since the error freezes
the box without any sort of dump info there's no way (at the moment) to get the call backtrace at the time of the error.

Since we're dying every two days I'll happily test any kernels you want, either newer ones or something with more debugging turned on.
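
For illustration only, here is a simplified user-space model of the kind of check that could produce a "Bad offset" failure. It assumes scsi_free() looks up the pointer it is handed in a fixed pool of DMA pages tracked by a per-page sector bitmap, and gives up when the pointer lies in none of them. All names here (pool_alloc, pool_free, POOL_PAGES, the status codes) are hypothetical and this is not the kernel source; the real function panics where this sketch returns a status.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_PAGES       4                  /* hypothetical pool size */
#define PAGE_BYTES       4096
#define SECTOR_BYTES     512
#define SECTORS_PER_PAGE (PAGE_BYTES / SECTOR_BYTES)

enum pool_status {
    POOL_OK,
    POOL_BAD_ALIGNMENT,   /* length invalid or runs past the page end */
    POOL_NOT_ALLOCATED,   /* sectors were not marked as in use */
    POOL_BAD_OFFSET       /* pointer is not inside any pool page */
};

static unsigned char pool[POOL_PAGES][PAGE_BYTES];
static unsigned int  used[POOL_PAGES];      /* one bit per 512-byte sector */

/* Allocate len bytes (a multiple of 512, at most one page). NULL if full. */
void *pool_alloc(size_t len)
{
    if (len == 0 || len % SECTOR_BYTES != 0 || len > PAGE_BYTES)
        return NULL;

    unsigned int nbits = (unsigned int)(len / SECTOR_BYTES);
    assert(nbits <= SECTORS_PER_PAGE);
    unsigned int mask = (1u << nbits) - 1;

    for (unsigned int page = 0; page < POOL_PAGES; page++) {
        for (unsigned int sector = 0; sector + nbits <= SECTORS_PER_PAGE; sector++) {
            if ((used[page] & (mask << sector)) == 0) {
                used[page] |= mask << sector;
                return &pool[page][sector * SECTOR_BYTES];
            }
        }
    }
    return NULL;
}

/* Free a region; reports why the free is invalid instead of panicking. */
enum pool_status pool_free(void *obj, size_t len)
{
    uintptr_t addr = (uintptr_t)obj;

    for (unsigned int page = 0; page < POOL_PAGES; page++) {
        uintptr_t base = (uintptr_t)pool[page];
        if (addr < base || addr >= base + PAGE_BYTES)
            continue;                       /* not in this page, keep looking */

        if (len == 0 || len % SECTOR_BYTES != 0 || len > PAGE_BYTES)
            return POOL_BAD_ALIGNMENT;

        unsigned int sector = (unsigned int)((addr - base) / SECTOR_BYTES);
        unsigned int nbits  = (unsigned int)(len / SECTOR_BYTES);
        unsigned int mask   = (1u << nbits) - 1;

        if (sector + nbits > SECTORS_PER_PAGE)
            return POOL_BAD_ALIGNMENT;
        if ((used[page] & (mask << sector)) != (mask << sector))
            return POOL_NOT_ALLOCATED;

        used[page] &= ~(mask << sector);
        return POOL_OK;
    }
    /* Searched every page without a match: the "Bad offset" case. */
    return POOL_BAD_OFFSET;
}
```

Under this model, the failure means the caller passed a pointer that never came from the pool (or was corrupted after allocation). That would be consistent with the suspicion that the scatter-gather callers in scsi_lib.c are worth examining, though it does not prove it.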

Comment 6 Bugzilla owner 2004-09-30 15:39:19 UTC

Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/
