From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)
Description of problem:
Compaq 8500, 8 P3 CPU, 4GB RAM, QLA2200 FC
RH 7.2, all errata, kernel 2.4.18-17.7.x
Under heavy load (backups, running real-time monitoring system, and lookupd data) across the fibre-channel card, the system locks up. Upon reboot, /var/log/messages contains 36 lines like:
kernel: qlogicfc0 : no handle slots, this should not happen
kernel: hostdata->queued is 4d, in_ptr: 38
The '4d' and 'in_ptr' values will vary.
Version-Release number of selected component (if applicable):
How reproducible: Always
Steps to Reproduce:
1. boot system
2. run under heavy load
3. wait
Actual Results: system locks up
Expected Results: system does not lock up
Additional info:
http://ldm.bkbits.net:8080/linux-2.5-cpu/cset@1.621.1.10?nav=index.html%7CChangeSet@-4w
This URL contains information about changes made in 2.5 that supposedly fix this problem. It looks like a change was made in the way drivers need to handle locks (per device vs. global).
I consider 2.4.18-17.7.x to be an extremely buggy kernel. This is the third bug related to this kernel I've filed since I upgraded an RH7.2 box to this kernel yesterday. Was any QA done on this kernel at all?
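The log signature quoted in the report can be counted with a small helper. This is only a sketch: the function name `count_handle_slot_errors` is made up for illustration, and it assumes the message text matches the line quoted in the report exactly.

```shell
# Hypothetical helper: count qlogicfc "no handle slots" messages in a log file.
# $1 is the path to the log file (for example /var/log/messages).
count_handle_slot_errors() {
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'no handle slots, this should not happen' "$1"
}

# Example use on a live system (not executed here):
#   count_handle_slot_errors /var/log/messages
```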
Please use the qla2200 driver instead; that one is actually supported.
Will do. Under 2.4.9-31, I was using the qla2x00 driver without incident, but that driver disappeared in the new release. The qla2200 driver under that kernel never worked for us, so I didn't bother to try it again.
Okay, switching to the supported qla2200 driver appears to fix the machine lockups (and, for reasons I don't understand, another bug, 77803). However, the kjournald thread for the ext3 partition on the RAID accessed through that card now takes up around 11% of the total CPU on the box, whereas before it took up around 2%. Why the increase? Is the qla2200 driver that poor? Under 2.4.9-31 and the qla2x00 driver, we didn't have that much journal activity, but we were also running under a different VM. Under the qlogicfc0 driver and the new VM, we had basically the same system usage as with the qla2x00 driver.
If a driver change fixes it, the 77803 bug is likely due to dropped interrupts. As for the kjournald overhead, that could be a number of things, including bounce buffer overhead. We'd need to see a kernel profile to have any hope of diagnosing it. (Boot with the kernel parameter "profile=2"; see man readprofile for how to extract the info.)
At the risk of being taken for an idiot: when I enable profiling (with profile=2), I always get the same output, no matter what:
# readprofile -m /boot/System.map-2.4.18-18.7.xbigmem
     4 _stext                     0.0500
     4 total                      0.0000
Do I need to do something else to enable accurate profiling on this machine? The system is under heavy load. The /proc/profile file is constantly being updated (according to its timestamp), but it's always the same size, and it always contains that same data (in -v, everything is set to 0). This is with 2.4.18-18.7.xbigmem.
You also need to specify nmi_watchdog=1 in addition to profile=
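To confirm that both boot parameters actually reached the kernel before collecting a profile, the command line can be inspected. The following is a minimal sketch: the function names `has_boot_param` and `check_profiling_params` are hypothetical, and it assumes a POSIX-style shell and that the booted command line is readable from /proc/cmdline.

```shell
# Hypothetical helper: succeed if any word of the command line ($1)
# starts with the given prefix ($2), e.g. "profile=".
has_boot_param() {
    for word in $1; do          # intentional word splitting on whitespace
        case "$word" in
            "$2"*) return 0 ;;
        esac
    done
    return 1
}

# Succeed only if both profile= and nmi_watchdog= are present.
check_profiling_params() {
    has_boot_param "$1" "profile=" && has_boot_param "$1" "nmi_watchdog="
}

# Example use on a live system (not executed here):
#   check_profiling_params "$(cat /proc/cmdline)" || echo "profiling parameters missing"
```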
Created attachment 86072: output from readprofile -v
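For reading readprofile output, it can help to rank symbols by tick count. This sketch assumes the default three-column output (ticks, symbol, load) shown earlier in the thread; the function name `rank_profile` is made up for illustration.

```shell
# Hypothetical helper: read readprofile output on stdin, drop the "total"
# summary line, and print the N busiest symbols ($1, default 20).
rank_profile() {
    awk '$2 != "total"' | sort -nr | head -n "${1:-20}"
}

# Example use on a live system (not executed here):
#   readprofile -m /boot/System.map-2.4.18-18.7.xbigmem | rank_profile 20
```

A profile in which only _stext has ticks, as reported earlier in the thread, suggests that no useful samples were being collected.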
Fixed in the 2.4.20-20 errata?
Yes.