Description of problem:
Compaq 8500, 8 P3 CPU, 4GB RAM, QLA2200 FC
RH 7.2, all errata, kernel 2.4.18-17.7.x
Under heavy load (backups, a real-time monitoring system, and lookupd
data) across the fibre-channel card, the system locks up. After
reboot, /var/log/messages contains 36 lines like:
kernel: qlogicfc0 : no handle slots, this should not happen
kernel: hostdata->queued is 4d, in_ptr: 38
The '4d' and 'in_ptr' values will vary.
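For anyone comparing lockups across boots, the messages above can be tallied with a short script. This is only a sketch: the two regular expressions are modeled on the lines quoted in this report, the function name summarize is made up for illustration, and the queued/in_ptr values are kept as strings because the report does not say what base they are printed in.

```python
import re

# Patterns modeled on the two log lines quoted in this report; other
# driver versions may word the messages differently (assumption).
NO_SLOTS = re.compile(r"qlogicfc\d+\s*: no handle slots")
QUEUED = re.compile(r"hostdata->queued is (\w+), in_ptr: (\w+)")


def summarize(lines):
    """Count 'no handle slots' messages and collect (queued, in_ptr) pairs."""
    count = 0
    pairs = []
    for line in lines:
        if NO_SLOTS.search(line):
            count += 1
        match = QUEUED.search(line)
        if match:
            # Keep the raw text; the numeric base is not confirmed.
            pairs.append((match.group(1), match.group(2)))
    return count, pairs


# Demo on the two lines quoted above.
sample = [
    "kernel: qlogicfc0 : no handle slots, this should not happen",
    "kernel: hostdata->queued is 4d, in_ptr: 38",
]
count, pairs = summarize(sample)
print(count, pairs)
```

To run it against a real log, pass an open file object for /var/log/messages to summarize() in place of the sample list.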
Version-Release number of selected component (if applicable):
Steps to Reproduce:
2. Run under heavy load
Actual Results: system locks up
Expected Results: system does not lock up
This URL contains information about changes made in 2.5 that supposedly fix
this problem. Looks like a change was made in the way drivers need to handle
locks (per device vs. global).
I consider 2.4.18-17.7.x to be an extremely buggy kernel. This is the third
bug related to this kernel I've filed since I upgraded an RH7.2 box to this
kernel yesterday. Was any QA done on this kernel at all?
Please use the qla2200 driver instead; that one is actually supported.
Will do. Under 2.4.9-31, I was using the qla2x00 without incident, but that
disappeared in the new release. The qla2200 driver under that kernel never
worked for us, so I didn't bother to try it again.
Okay, switching to the supported qla2200 driver appears to fix the problems
with the machine lockup (and another bug, 77803, though I have no idea why or
how), but the kjournald thread for the ext3 partition that is on the RAID
accessed through that card is taking up around 11% of the total CPU on the box,
whereas before it took up around 2%. Why the increase? Is that qla2200 driver
Under 2.4.9-31 and the qla2x00 driver, we didn't have that much journal
activity, but we were also running under a different VM. Under the qlogicfc0
driver and the new VM, we had basically the same system usage as the qla2x00
The 77803 bug is likely due to dropped interrupts if a driver change fixes it.
As for the kjournald overhead, that could be a number of things, including
bounce buffer overhead. We'd need to see a kernel profile to have any hope of
diagnosing it. (Boot with the kernel parameter "profile=2"; see man readprofile
for how to extract the info.)
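Once readprofile produces data, its default output (ticks, symbol name, normalized load) can be ranked to see where the time goes. The sketch below assumes that three-column layout; the function names parse_readprofile and top_symbols are made up here, and the symbol example_fn and its numbers in the demo are invented placeholders, not real profile data from this machine.

```python
def parse_readprofile(text):
    """Parse 'ticks symbol load' rows, skipping the trailing 'total' row."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # not a data row in the assumed layout
        ticks, symbol, load = parts
        if not ticks.isdigit() or symbol == "total":
            continue
        rows.append((int(ticks), symbol, float(load)))
    return rows


def top_symbols(text, limit=10):
    """Return the rows with the most ticks, highest first."""
    rows = parse_readprofile(text)
    return sorted(rows, key=lambda row: row[0], reverse=True)[:limit]


# Invented sample in the assumed layout; example_fn is a placeholder.
sample = """\
     4 _stext       0.0500
    12 example_fn   0.1000
    16 total        0.0000
"""
for ticks, symbol, load in top_symbols(sample):
    print(ticks, symbol, load)
```

Sorting by ticks shows where samples landed most often; sorting by the load column instead would favor short, hot functions.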
At the risk of being taken for an idiot, when I enable profiling (with
profile=2), no matter what, I always get:
# readprofile -m /boot/System.map-2.4.18-18.7.xbigmem
4 _stext 0.0500
4 total 0.0000
No matter what. Do I need to do something else to enable accurate profiling on
this machine? The system is under heavy load. The /proc/profile file is
constantly being updated (according to its timestamp), but it's always the same
size, and it always contains that same data (in -v, everything is set to 0).
This is with 2.4.18-18.7.xbigmem.
You need to specify nmi_watchdog=1 in addition to profile=.
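A quick way to confirm that both parameters actually reached the running kernel is to inspect the boot command line, which Linux exposes in /proc/cmdline. This is a rough sketch: the helper names boot_params and missing_profiling_params are made up, the sample command line is hypothetical, and the naive whitespace split ignores quoted values.

```python
def boot_params(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_profiling_params(cmdline, required=("profile", "nmi_watchdog")):
    """Return the required parameter names absent from the command line."""
    params = boot_params(cmdline)
    return [name for name in required if name not in params]


# Hypothetical command line with profile= set but nmi_watchdog= missing.
sample = "ro root=/dev/sda1 profile=2"
print(missing_profiling_params(sample))
```

On a live system, the string read from /proc/cmdline would be passed in place of the sample.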
Created attachment 86072 [details]
output from readprofile -v
Fixed in the 2.4.20-20 errata?