Red Hat Bugzilla – Bug 57937
System panics with "scsi_free:Bad offset" after 3-48 hours of use
Last modified: 2008-08-01 12:22:52 EDT
Description of problem:
Background: two MySQL servers running on identical Dell PowerEdge 6350
Dual PIII 550 processors
512 MB RAM
Dell PercRAID controller (using aacraid driver) running the OS drives
Dual Qlogic 2200 FC controllers
Dual dual-port Intel EEPro 100 cards
The machines share a Dell fibre channel drive array with two controllers;
each machine has one qlogic card connected to each of the controllers on
the array. The machines mount separate raid arrays from the unit and do
not share drives. The array is used solely to store the MySQL database.
Previously these machines ran for months on Red Hat 7.1 using kernel
2.4.9-12 i686 enterprise with no incidents. Recently we upgraded to 7.2 and
installed the 2.4.9-13 i686 enterprise kernel. Now the servers will run
anywhere from 3 hours to 2 days and then panic with the error:
In interrupt handler - not syncing
At that point the machine is stone dead.
The upgrade was done Friday and the crashes occurred over the holiday
weekend when the database usage was much lighter than normal. At the
moment I am totally perplexed because the 2.4.9-12 and 2.4.9-13 kernels
are identical from what I can tell.
One last note. This happens with the plain smp kernels as well.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Install RH 7.2
2. Upgrade to 2.4.9-13 kernel RPMs
3. Use machine and wait
Ok this is strange. Very strange.
There is VERY little difference between 2.4.9-12 and 2.4.9-13: only the 3D
acceleration for XFree86 is different.
Did you by any chance convert any (or all) of your filesystems to ext3
during the upgrade? That would change the load pattern on the driver
and could possibly show up bugs in the drivers that would otherwise be
hidden; going from "never triggered" to "once every few hours/days"
doesn't necessarily take a huge change.
I think we have a winner, because one minor thing that was done while
upgrading (at least, it _should_ be minor, but obviously isn't) is upping
the journal size on the database partitions, since they're running in
data=journal mode. Originally they were at the default of 4 MB and I had
upped them to 32 MB; I've now moved them down to 8 MB and so far things are
quite stable.
OK, that's interesting. I think it points to a driver bug in the qla2x00
driver rather than an ext3 bug.
Looking briefly at some of the SCSI code, I am wondering if this is actually
a SCSI subsystem issue and not a qlogic issue. The error in question
("scsi_free:Bad offset") occurs in the scsi_free() function in
drivers/scsi/scsi_dma.c, which is not called directly from the qlogic
drivers. It _is_, however, called from the SCSI scatter-gather code in
scsi_lib.c in quite a few places that could be interesting. Unfortunately,
since the error freezes the box without any sort of dump info, there's no
way (at the moment) to get the call backtrace at the time of the error.
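To illustrate why a "Bad offset" panic points at whoever called scsi_free()
rather than at the allocator itself, here is a simplified user-space sketch
of a sector-granular page pool. It assumes (from memory of the 2.4 source,
not checked against 2.4.9-13) that scsi_free() gives up with this message
when the pointer it is handed does not fall inside any page of the SCSI DMA
pool. All names, sizes, and return codes below are hypothetical; this is not
the kernel's code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pool geometry, chosen only for illustration. */
#define POOL_PAGES       4
#define PAGE_BYTES       4096
#define SECTOR_BYTES     512
#define SECTORS_PER_PAGE (PAGE_BYTES / SECTOR_BYTES)

enum pool_status {
    POOL_OK = 0,
    POOL_BAD_ALIGNMENT,  /* length is zero or runs past the end of the page */
    POOL_NOT_ALLOCATED,  /* sectors being freed were not marked in use */
    POOL_BAD_OFFSET      /* pointer is not inside any page of the pool */
};

static unsigned char pool_pages[POOL_PAGES][PAGE_BYTES];
static unsigned int  pool_used[POOL_PAGES];  /* one bit per 512-byte sector */

/* Hand out len bytes (a multiple of the sector size) from the first fit. */
void *pool_alloc(size_t len)
{
    if (len == 0 || len % SECTOR_BYTES != 0 || len > PAGE_BYTES)
        return NULL;

    unsigned int nbits = (unsigned int)(len / SECTOR_BYTES);
    unsigned int mask = (1u << nbits) - 1u;

    for (int page = 0; page < POOL_PAGES; page++) {
        for (unsigned int s = 0; s + nbits <= SECTORS_PER_PAGE; s++) {
            if ((pool_used[page] & (mask << s)) == 0) {
                pool_used[page] |= mask << s;
                return &pool_pages[page][s * SECTOR_BYTES];
            }
        }
    }
    return NULL;
}

/* Release a range; report an error instead of panicking as a kernel would. */
enum pool_status pool_free(const void *obj, size_t len)
{
    uintptr_t addr = (uintptr_t)obj;

    for (int page = 0; page < POOL_PAGES; page++) {
        uintptr_t base = (uintptr_t)pool_pages[page];

        if (addr < base || addr >= base + PAGE_BYTES)
            continue;  /* not in this page, try the next one */

        if (len == 0 || len > PAGE_BYTES)
            return POOL_BAD_ALIGNMENT;

        unsigned int sector = (unsigned int)((addr - base) / SECTOR_BYTES);
        unsigned int nbits = (unsigned int)(len / SECTOR_BYTES);
        if (nbits == 0 || sector + nbits > SECTORS_PER_PAGE)
            return POOL_BAD_ALIGNMENT;

        unsigned int mask = ((1u << nbits) - 1u) << sector;
        if ((pool_used[page] & mask) != mask)
            return POOL_NOT_ALLOCATED;

        pool_used[page] &= ~mask;
        return POOL_OK;
    }

    /* Fell through every page: the caller passed a pointer the pool never
     * handed out. This is the case the sketch maps to "Bad offset". */
    return POOL_BAD_OFFSET;
}
```

In this model the allocator only reaches the last return when a caller frees
an address the pool never owned, for example a stale or corrupted
scatter-gather entry. If the real function behaves similarly, that would fit
the suspicion that the problem is in one of the callers (scsi_lib.c or a
driver) rather than in scsi_dma.c itself.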
Since we're dying every two days I'll happily test any kernels you want, either newer ones or something with more debugging turned on.
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.
The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases,
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/