Description of Problem:

We have three identical machines (each is a dual CPU PIII, with 4 SCSI disks configured as Software RAID1 pairs using ext3) running RedHat 7.3 and kernel-smp-2.4.18-4. These machines are mail hubs for our University.

Actual load on these systems isn't very high, and the load average figures reported by e.g. uptime normally reflect this. However, at apparently random points in the day the load average on the system peaks for a number of minutes, despite the fact that vmstat and iostat continue to report no obvious load on the system.

Example:

Right now::

    4:01pm  up 6 days, 22:49,  1 user,  load average: 0.17, 0.24, 0.23

Same time yesterday afternoon when the effect was observed::

    4:05pm  up 5 days, 22:54,  3 users,  load average: 2.49, 2.64, 2.23

Output of "mpstat 10" on system reporting high load yesterday afternoon::

    Linux 2.4.18-4smp (purple.csi.cam.ac.uk)    06/17/02

    16:06:29     CPU   %user   %nice %system   %idle    intr/s
    16:06:39     all    0.15    0.00    0.25   99.60    160.10
    16:06:49     all    1.00    0.00    0.95   98.05    284.80
    16:06:59     all    0.45    0.00    1.40   98.15    273.60
    16:07:09     all    0.40    0.00    0.55   99.05    233.50
    16:07:19     all    0.10    0.00    0.05   99.85    124.20

Output of "iostat 10" on same system reporting high load yesterday afternoon::

    Linux 2.4.18-4smp (purple.csi.cam.ac.uk)    06/17/02
    . . .
    Device:        tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
    dev8-0        8.10         0.00       160.80          0       1608
    dev8-1        8.20         0.00       160.80          0       1608
    dev8-2        4.50         0.00        80.80          0        808
    dev8-3        4.50         0.00        80.80          0        808

    Device:        tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
    dev8-0       23.10         0.00       394.40          0       3944
    dev8-1       23.40         0.00       394.40          0       3944
    dev8-2       13.80         0.00       280.80          0       2808
    dev8-3       13.80         0.00       280.80          0       2808

    Device:        tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
    dev8-0       16.30         0.00       304.00          0       3040
    dev8-1       16.50         0.00       304.00          0       3040
    dev8-2       19.30         0.00       396.80          0       3968
    dev8-3       19.40         0.00       396.80          0       3968

I realise that load average is a fairly meaningless statistic.
It's only important to us at all because mail transports (we use Exim) include load average cutouts, and the (apparently bogus) high numbers are preventing us from using these properly at the moment.

Sorry about the rather vague problem report. If you can suggest some sensible diagnostic tools then I would be happy to use them. I have run various tools to check that someone hasn't hacked in to run some kind of rootkit quietly behind our backs.
The most useful piece of information would be which processes are in "D" state while the load average is spiking. (load average = processes running + processes in "D" state)

If possible, a sysrq-t dump during such a spike could also help in diagnosis.
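One way to catch the "D" state processes during a spike might be to sample ps in a loop and keep only the interesting rows. The following is a rough sketch, not a tested recipe for this kernel: the function names are made up for illustration, and the column list given to ps (pid,stat,wchan,comm) is an assumption that may need adjusting for the installed procps.

```shell
#!/bin/sh
# Keep the header line plus any process whose state starts with "D"
# (uninterruptible sleep). Reads "ps" output on stdin.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Count processes that are running ("R") or in "D" state, i.e. the two
# groups that the load average is made of. Reads "ps" output on stdin.
# Note that on a live system the ps process itself shows up as "R".
count_r_and_d() {
    awk 'NR > 1 && $2 ~ /^[RD]/ { n++ } END { print n + 0 }'
}

# Take one timestamped sample per second for the given number of
# seconds (default 60). Hypothetical helper; nothing calls it here.
watch_d_state() {
    count=${1:-60}
    while [ "$count" -gt 0 ]; do
        date
        ps -eo pid,stat,wchan,comm | filter_d_state
        sleep 1
        count=$((count - 1))
    done
}
```

During a spike, something like `watch_d_state 300 >> /tmp/dstate.log` would leave a record that can be compared against the uptime figures afterwards. The wchan column gives a hint of where in the kernel each blocked process is waiting.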
> The most useful piece of information would be which processes are in "D"
> state while the load average is spiking. (load average = processes
> running + processes

Nothing as far as we can see (this was the first thing that we looked for).

> if possible, a sysrq-t dump during such a spike could also help in diagnosis

How do I do this?

Thanks for the amazingly fast response here. I'm going to get us signed up for a real RedHat support contract: we pay Sun large amounts of money each year for telephone/email support, but the responses that I have had from RedHat are consistently faster and better than anything we get from Sun, without any support contract at all. It seems only fair that we try to reward RedHat for the excellent service that you provide.
> How do I do this?

1) echo 1 > /proc/sys/kernel/sysrq

   This enables the "magic SysRq key".

2) Hit the following three keys at the same time:

   Alt - "SysRq" (e.g. Print Screen) and "t"

   (If you use the spacebar instead of "t" you get a brief menu of possible options.)

The kernel will then dump the "threadinfo" information to /var/log/messages; basically this is the state of all processes and where they are in the kernel.
> 2) hit the following three keys at the same time
> alt - "sysrq" (eg printscreen) and "t"

Is this possible on a headless system which is using a serial console? I suspect that the answer is no!
Actually the answer is yes... Instead of Alt-SysRq you can send a "break" and then the "t" key (sending a break is Ctrl-A F in minicom).

Note: I haven't tried this recently, but it's supposed to work.
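For a headless box there may be one more route: some kernels expose /proc/sysrq-trigger, and writing "t" to it requests the same task dump as the key combination. Whether the 2.4.18-4 kernel in this report provides that file is not confirmed here, so the sketch below checks for it first. The function name and the PROC_ROOT override are hypothetical; the override exists only so the function can be tried against a scratch directory.

```shell
#!/bin/sh
# Enable magic SysRq and request a task dump without a keyboard.
# PROC_ROOT is a hypothetical override for dry runs; it defaults to /proc.
sysrq_task_dump() {
    proc_root=${PROC_ROOT:-/proc}
    trigger="$proc_root/sysrq-trigger"

    # Not every kernel has the trigger file; bail out if it is missing.
    if [ ! -w "$trigger" ]; then
        echo "no writable $trigger; use the keyboard or a serial break" >&2
        return 1
    fi

    # Mirrors step 1 from the earlier comment: enable magic SysRq.
    echo 1 > "$proc_root/sys/kernel/sysrq"

    # Ask for the task dump, as Alt-SysRq-t would.
    echo t > "$trigger"
}
```

Run as root during a load spike, `sysrq_task_dump` should leave the dump in the kernel log, from where it normally ends up in /var/log/messages as described above.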
Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/