Bug 66965 - Strange Load Average Artifacts
Summary: Strange Load Average Artifacts
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.3
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Arjan van de Ven
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2002-06-19 08:50 UTC by David Carter
Modified: 2008-08-01 16:22 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-09-30 15:39:41 UTC
Embargoed:



Description David Carter 2002-06-19 08:50:57 UTC
Description of Problem:

We have three identical machines (each a dual-CPU PIII, with 4 SCSI
disks configured as software RAID1 pairs using ext3) running Red Hat Linux 7.3
and kernel-smp-2.4.18-4. These machines are mail hubs for our University.

Actual load on these systems isn't very high, and the load average figures
reported by, e.g., uptime normally reflect this. However, at apparently random
points in the day the load average on the system peaks for a number of
minutes, even though vmstat and iostat continue to report no obvious load
on the system. Example:

Right now:
  4:01pm  up 6 days, 22:49,  1 user,  load average: 0.17, 0.24, 0.23

Same time yesterday afternoon when the effect was observed:
  4:05pm  up 5 days, 22:54,  3 users,  load average: 2.49, 2.64, 2.23

Output of "mpstat 10" on system reporting high load yesterday afternoon::

Linux 2.4.18-4smp (purple.csi.cam.ac.uk)        06/17/02

16:06:29     CPU   %user   %nice %system   %idle    intr/s
16:06:39     all    0.15    0.00    0.25   99.60    160.10
16:06:49     all    1.00    0.00    0.95   98.05    284.80
16:06:59     all    0.45    0.00    1.40   98.15    273.60
16:07:09     all    0.40    0.00    0.55   99.05    233.50
16:07:19     all    0.10    0.00    0.05   99.85    124.20

Output of "iostat 10" on same system reporting high load yesterday afternoon::

Linux 2.4.18-4smp (purple.csi.cam.ac.uk)        06/17/02
 . . .
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dev8-0            8.10         0.00       160.80          0       1608
dev8-1            8.20         0.00       160.80          0       1608
dev8-2            4.50         0.00        80.80          0        808
dev8-3            4.50         0.00        80.80          0        808

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dev8-0           23.10         0.00       394.40          0       3944
dev8-1           23.40         0.00       394.40          0       3944
dev8-2           13.80         0.00       280.80          0       2808
dev8-3           13.80         0.00       280.80          0       2808

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dev8-0           16.30         0.00       304.00          0       3040
dev8-1           16.50         0.00       304.00          0       3040
dev8-2           19.30         0.00       396.80          0       3968
dev8-3           19.40         0.00       396.80          0       3968

I realise that load average is a fairly meaningless statistic. It's only
important to us at all because mail transports (we use Exim) include load
average cutouts, and the (apparently bogus) high numbers are preventing us
from using those properly at the moment.
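
For context, the Exim cutouts in question are main-configuration options
along the lines of queue_only_load and deliver_queue_load_max (the names
assume Exim 3/4; check your version's documentation). A minimal sketch of
how to see the thresholds a running Exim is actually using:

  # Print the load-average cutouts from the running configuration
  # (assumes an Exim 3/4-style "exim -bP" option dump).
  exim -bP queue_only_load deliver_queue_load_max

  # The kernel's current view of the load average, for comparison.
  cat /proc/loadavg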

Sorry about the rather vague problem report. If you can suggest some sensible
diagnostic tools then I would be happy to use them. I have run various tools
to check that someone hasn't hacked in and quietly installed some kind of
rootkit behind our backs.

Comment 1 Arjan van de Ven 2002-06-19 08:54:51 UTC
The most useful piece of information would be which processes are in "D" state
while the load average is spiking. (load average = processes running + processes
in "D" state)
If possible, a sysrq-t dump during such a spike could also help in diagnosis.
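
A minimal sketch of capturing that while a spike is in progress (assumes a
procps-style ps; run it repeatedly, or from cron, until the effect recurs):

  # List processes currently in uninterruptible sleep ("D" state).
  ps axo stat,pid,user,wchan,comm | awk '$1 ~ /^D/'

  # The first three fields are the 1/5/15-minute load averages; the
  # fourth is running tasks / total tasks.
  cat /proc/loadavg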


Comment 2 David Carter 2002-06-19 09:08:33 UTC
> The most useful piece of information would be which processes are in "D"
> state while the load average is spiking. (load average = processes
> running + processes

Nothing as far as we can see (this was the first thing that we looked for).

> If possible, a sysrq-t dump during such a spike could also help in diagnosis.

How do I do this?

Thanks for the amazingly fast response here. I'm going to get us signed up
for a real Red Hat support contract: we pay Sun large amounts of money each
year for telephone/email support, but the responses that I have had from
Red Hat are consistently faster and better than anything we get from Sun,
without any support contract at all. It seems only fair that we try to
reward Red Hat for the excellent service that you provide.


Comment 3 Arjan van de Ven 2002-06-19 09:26:49 UTC
> How do I do this?

1) echo 1 > /proc/sys/kernel/sysrq

This enables the "magic sysrq key".

2) Hit the following three keys at the same time:
   Alt + "SysRq" (i.e. Print Screen) + "t"
(If you use the spacebar instead of "t" you get a brief menu of the possible options.)

The kernel will then dump the "threadinfo" information to /var/log/messages;
basically this is the state of all processes and where they are in the kernel.
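
A sketch of the whole sequence from a root shell, plus pulling the dump back
out afterwards (the "SysRq" log marker is an assumption and its exact text
varies between kernel versions):

  echo 1 > /proc/sys/kernel/sysrq          # enable the magic sysrq key
  # ...press Alt+SysRq+T on the console during the next load spike...
  grep -i sysrq /var/log/messages | tail   # confirm the dump arrived
  dmesg | less                             # the task dump is also in the kernel ring buffer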

Comment 4 David Carter 2002-06-19 09:59:08 UTC
> 2) Hit the following three keys at the same time:
>    Alt + "SysRq" (i.e. Print Screen) + "t"

Is this possible on a headless system which is using a serial console?
I suspect that the answer is no!


Comment 5 Arjan van de Ven 2002-06-19 10:11:08 UTC
Actually, the answer is yes:
instead of Alt+SysRq you can send a "break" and then the "t" key
(sending a break is Ctrl-A F in minicom).

Note: I haven't tried this recently, but it's supposed to work.
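
Two other keyboard-free routes worth noting, both of which are assumptions to
verify on this particular setup (GNU screen's default break binding, and
whether the 2.4.18 kernel ships /proc/sysrq-trigger):

  # If the serial console is attached through GNU screen rather than
  # minicom, Ctrl-A b sends the break; then press "t" as above.

  # If the running kernel exposes /proc/sysrq-trigger, the dump can be
  # requested with no console interaction at all:
  echo 1 > /proc/sys/kernel/sysrq
  echo t > /proc/sysrq-trigger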

Comment 6 Bugzilla owner 2004-09-30 15:39:41 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/


