From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Q312461)

Description of problem:
Background: two MySQL servers running on identical Dell PowerEdge 6350 systems:

  Dual PIII 550 processors
  512 MB RAM
  Dell PercRAID controller (using aacraid driver) running the OS drives
  Dual Qlogic 2200 FC controllers
  Dual dual-port Intel EEPro 100 cards

The machines share a Dell fibre channel drive array with two controllers; each machine has one Qlogic card connected to each of the controllers on the array. The machines mount separate RAID arrays from the unit and do not share drives. The array is used solely to store the MySQL database files.

Previously these machines ran for months on Red Hat 7.1 using kernel 2.4.9-12 i686 enterprise with no incidents. Recently we upgraded to 7.2 and installed the 2.4.9-13 i686 enterprise kernel. Now the servers will run anywhere from 3 hours to 2 days and then panic with the error:

  scsi:Bad offset
  In interrupt handler - not syncing

At that point the machine is stone dead. The upgrade was done Friday and the crashes occurred over the holiday weekend, when the database usage was much lighter than normal. At the moment I am totally perplexed because the 2.4.9-12 and 2.4.9-13 kernels are identical from what I can tell.

One last note: this happens with the plain smp kernels as well.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Install RH 7.2
2. Upgrade to 2.4.9-13 kernel RPMs
3. Use machine and wait

Additional info:
Ok this is strange. Very strange. There is VERY little difference between 2.4.9-12 and 2.4.9-13: only the 3D acceleration for XFree86 is different.
Did you by any chance convert any (or all) of your filesystems to ext3 during the upgrade? That would change the load pattern on the driver and could possibly show up bugs in the drivers that would otherwise be hidden; going from "never triggered" to "once every few hours/days" doesn't necessarily take a huge change.
I think we have a winner, because one minor thing that was done while upgrading (at least, it _should_ be minor, but obviously isn't) was upping the journal size on the database partitions, since they're running in data=journal mode. Originally they were at the default of 4 MB and I had upped them to 32 MB; I've now moved them down to 8 MB and so far things are quite stable.
OK, that's interesting. I think it points to a driver bug in the qla2x00 driver rather than an ext3 bug.
Looking briefly at some of the scsi code, I am wondering if this is actually a scsi subsystem issue and not a qlogic issue. The error in question ("scsi_free:bad offset") occurs in the scsi_free() function in drivers/scsi/scsi_dma.c, which is not called directly from the qlogic drivers. It _is_, however, called from the scsi scatter-gather code in scsi_lib.c in quite a few places that could be interesting. Unfortunately, since the error freezes the box without any sort of dump info, there's no way (at the moment) to get the call backtrace at the time of the error. Since we're dying every two days I'll happily test any kernels you want, either newer ones or something with more debugging turned on.
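For anyone following along, here is a rough user-space model of the kind of check that appears to produce that message. It is a sketch only, not the kernel code: it assumes (from a reading of scsi_dma.c that nothing in this report confirms) that scsi_free() searches a fixed list of DMA pool pages for the pointer it is handed and panics with "Bad offset" when the pointer falls inside none of them. The names DmaPool and PoolPanic, and the page and sector sizes, are hypothetical and chosen for illustration.

```python
PAGE_SIZE = 4096          # assumed page size for this model
SECTOR_SIZE = 512         # assumed allocation granularity
SECTORS_PER_PAGE = PAGE_SIZE // SECTOR_SIZE


class PoolPanic(Exception):
    """Stands in for a kernel panic in this user-space model."""


class DmaPool:
    """Toy pool allocator: a list of page base addresses plus one bitmap per page."""

    def __init__(self, page_addrs):
        self.page_addrs = list(page_addrs)
        self.used = [0] * len(self.page_addrs)  # bit n set = sector n in use

    @staticmethod
    def _mask(length):
        # Round the request up to whole sectors and build a run of 1 bits.
        nbits = (length + SECTOR_SIZE - 1) // SECTOR_SIZE
        return nbits, (1 << nbits) - 1

    def alloc(self, length):
        """Return the address of a free run of sectors, or None if none fits."""
        nbits, mask = self._mask(length)
        for i, base in enumerate(self.page_addrs):
            for sector in range(SECTORS_PER_PAGE - nbits + 1):
                if self.used[i] & (mask << sector) == 0:
                    self.used[i] |= mask << sector
                    return base + sector * SECTOR_SIZE
        return None

    def free(self, addr, length):
        """Release a run of sectors; 'panic' if addr is not inside the pool."""
        _, mask = self._mask(length)
        for i, base in enumerate(self.page_addrs):
            if base <= addr < base + PAGE_SIZE:
                sector = (addr - base) // SECTOR_SIZE
                if self.used[i] & (mask << sector) != (mask << sector):
                    raise PoolPanic("trying to free unused memory")
                self.used[i] &= ~(mask << sector)
                return
        # No pool page contains this address: the caller handed back a
        # pointer that never came from this pool (or has been corrupted).
        raise PoolPanic("bad offset")
```

In this model, freeing an address that alloc() never returned, for example DmaPool([0x10000]).free(0x30000, 512), raises PoolPanic("bad offset"). If the real check works along these lines, the panic would suggest that some caller passed scsi_free() a buffer that did not come from the DMA pool, which fits the suspicion that the scatter-gather paths are worth a look.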
Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/