Bug 49805 - Load goes to 1.0 and stays there when real load is 0.0x
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: kernel   
Version: 7.0
Hardware: i386
OS: Linux
Target Milestone: ---
Assignee: Arjan van de Ven
QA Contact: Brock Organ
Depends On:
Reported: 2001-07-24 04:01 UTC by Scott Dowdle
Modified: 2007-04-18 16:35 UTC (History)
0 users

Doc Type: Bug Fix
Last Closed: 2001-08-03 15:32:32 UTC

Attachments
My boot message as given by the dmesg command... reporting complete list of hardware on problem system. (3.54 KB, text/plain)
2001-07-24 04:03 UTC, Scott Dowdle

Description Scott Dowdle 2001-07-24 04:01:07 UTC
Description of problem:
After installing the 2.2.19-7.0.8 kernel update and running for a while, my load (as reported by w) goes to 1.0 and stays there even when
nothing is going on.  The average load of the system is usually 0.0x.  I saw similar behavior with the original kernel package that shipped with
RHL 7.0 when mounting .iso files through the loopback interface, but that bug was fixed in the previous kernel update package.  I'm not sure
whether the bug in this kernel is related to mounting .iso image files, to having a Cyrix MII processor, or to something else.  Attached is
my dmesg report for a complete list of hardware.
Example: 10:02pm  up 1 day,  2:44,  1 user,  load average: 1.00, 1.00, 1.00

$ free
             total       used       free     shared    buffers     cached
Mem:         95840      89796       6044          0       4940      50328
-/+ buffers/cache:      34528      61312
Swap:       136072          0     136072

top shows 98% CPU free.

How reproducible:

Steps to Reproduce:
1.  Boot system
2.  Operate with light load for a few hours
3. System load reported as 1.0 1.0 1.0 with w.

Actual Results:  Load goes to 1.0 and stays there.  The system seems to perform ok so it appears to be a clerical / cosmetic error.

Expected Results:  Load should report accurately.

Additional info:

Comment 1 Scott Dowdle 2001-07-24 04:03:48 UTC
Created attachment 24667
My boot message as given by the dmesg command... reporting complete list of hardware on problem system.

Comment 2 Scott Dowdle 2001-07-24 04:19:04 UTC
Oh, and another question... I'm not sure with what kernel it started but... I see it on all 7.0 and 7.1 kernels... Why don't I have any shared memory?

Kernel 2.2.19-7.0.8 on previously mentioned Cyrix MII system on RHL 7.0.
[dowdle@ns public_html]$ free
             total       used       free     shared    buffers     cached
Mem:         95840      87948       7892          0       4960      50336

Kernel 2.4.2-2 on Pentium II on RHL 7.1 system
[dowdle@sandstone dowdle]$ free
             total       used       free     shared    buffers     cached
Mem:        255572     199088      56484          0       7780      99424

Kernel 2.4.7 compiled by myself just yesterday on an AMD Athlon on RHL 7.1 system.
[dowdle@amd dowdle]$ free
             total       used       free     shared    buffers     cached
Mem:        126668     123736       2932          0      15804      38640

Comment 3 Scott Dowdle 2001-07-27 14:57:25 UTC
Never mind on the question about shared memory.  I did some reading and found out
that calculating the total shared memory on a system was rather expensive to do
and has been dropped... but the info is still listed (even if it is 0) to
maintain compatibility.

IN OTHER NEWS... my system load has been slowly growing over the past 4 days and
now says:

8:56am  up 4 days, 13:38,  5 users,  load average: 5.02, 5.01, 4.99

Not bad for a system that is totally responsive with plenty of free memory and a
CPU that is 98% idle.  Obviously it is just a cosmetic miscalculation... BUT
some applications look at the load and alter their behaviour as a result of it. 
By default I think sendmail starts queueing email at a load of 7 (or whatever
Red Hat's default setting for that value is)... so this bug has the potential to
disrupt anything that takes notice of the system load.
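
For a program that reacts to load, the mismatch described above can be checked mechanically. The following is a minimal Python sketch, not anything taken from sendmail or Red Hat: it parses the load averages out of a w/uptime header line like the one quoted above and flags a high load on a mostly idle CPU. The function names and the 90% idle threshold are illustrative assumptions.

```python
import re


def parse_load_average(line):
    """Extract the 1-, 5- and 15-minute load averages from a w/uptime header line."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", line
    )
    if match is None:
        raise ValueError("no load average found in: %r" % line)
    return tuple(float(value) for value in match.groups())


def load_looks_inflated(load_1min, cpu_idle_percent,
                        load_threshold=1.0, idle_threshold=90.0):
    """Heuristic (an assumption, not a kernel rule): a load at or above
    load_threshold while the CPU is mostly idle suggests that tasks are
    blocked rather than actually consuming CPU."""
    return load_1min >= load_threshold and cpu_idle_percent >= idle_threshold


# Header line copied from the report above; 98% idle is the figure given there.
line = "8:56am  up 4 days, 13:38,  5 users,  load average: 5.02, 5.01, 4.99"
one, five, fifteen = parse_load_average(line)
print(one, five, fifteen)                               # 5.02 5.01 4.99
print(load_looks_inflated(one, cpu_idle_percent=98.0))  # True
```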

Comment 4 Arjan van de Ven 2001-07-30 13:40:11 UTC
Could you run "ps -waux" as root, and see which processes are in "D" state (the
STAT column)?

Comment 5 Scott Dowdle 2001-08-03 14:51:42 UTC
root 23391 0.0 2.6 3900 2512 ? D Aug01 0:06 /usr/bin/analog /usr/local/analog/www.montanalinux.org-analog.cfg

To answer your question, I have 4 copies of the above listed process that are 
currently listed with a status of "D"... which I learned (from the ps 
man page) means "uninterruptible sleep usually IO".  I have a daily cron job 
that runs a script where analog parses all of my apache virtual host web server 
logs.  Only one seems to be getting stuck.  I'll have to look at it.
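
Picking the "D" rows out of a long ps listing by eye is error-prone, so here is a minimal Python sketch that filters them from captured ps output. It assumes the BSD-style column order (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND) and a START field without spaces; the function name is hypothetical, and the second sample row is made up for illustration.

```python
def find_uninterruptible(ps_output):
    """Return (pid, stat, command) for every row whose STAT starts with 'D'.

    Assumes columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND.
    """
    stuck = []
    for row in ps_output.splitlines():
        # Split into at most 11 fields so the command keeps its arguments.
        fields = row.split(None, 10)
        if len(fields) < 11 or fields[1] == "PID":
            continue  # skip blank, truncated, and header rows
        pid, stat, command = fields[1], fields[7], fields[10]
        if stat.startswith("D"):
            stuck.append((int(pid), stat, command))
    return stuck


# First data row is the one reported above; the second is an invented sleeper.
sample = """\
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 23391 0.0 2.6 3900 2512 ? D Aug01 0:06 /usr/bin/analog /usr/local/analog/www.montanalinux.org-analog.cfg
root 1 0.0 0.1 1120 68 ? S Jul29 0:04 init [3]
"""
for pid, stat, command in find_uninterruptible(sample):
    print(pid, stat, command)
```

In practice the text would come from running ps as root and capturing its output; that step is left out here so the sketch stays independent of any particular ps version.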

So, is my high load my own fault, the result of telling analog to do something 
really stupid so that it got confused and became a deadlocked process?  That 
would certainly explain why my load would jump up a notch once a day.
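
The once-a-day step fits how Linux computes load: tasks in uninterruptible sleep are counted together with runnable ones, and the three averages are exponentially damped samples of that count. The Python sketch below is a floating-point approximation under stated assumptions (a sample every 5 seconds and windows of 1, 5 and 15 minutes); the kernel itself uses fixed-point arithmetic, and the function name is hypothetical.

```python
import math

SAMPLE_INTERVAL = 5.0           # assumed seconds between load samples
WINDOWS = (60.0, 300.0, 900.0)  # 1-, 5- and 15-minute averaging windows


def simulate_load(active_tasks, seconds, start=(0.0, 0.0, 0.0)):
    """Approximate the three load averages after `seconds` during which
    `active_tasks` tasks are constantly runnable or stuck in "D" state."""
    loads = list(start)
    decays = [math.exp(-SAMPLE_INTERVAL / window) for window in WINDOWS]
    for _ in range(int(seconds / SAMPLE_INTERVAL)):
        for i, decay in enumerate(decays):
            # Exponential moving average: old value decays, new sample blends in.
            loads[i] = loads[i] * decay + active_tasks * (1.0 - decay)
    return tuple(round(value, 2) for value in loads)


# One permanently stuck task, three hours later, on an otherwise idle box.
print(simulate_load(1, 3 * 3600))   # (1.0, 1.0, 1.0)
# Five stuck tasks, a day later.
print(simulate_load(5, 24 * 3600))  # (5.0, 5.0, 5.0)
```

Under this model each stuck process adds about 1.0 to all three averages once they settle, no matter how idle the CPU is.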

Any way to kill a deadlocked process?  I've tried the traditional things. How 
about rm -rf /proc/<pid>/ :)

Comment 6 Arjan van de Ven 2001-08-03 14:55:48 UTC
Such a deadlock SHOULD not happen. Q: could it be that the analog that gets
stuck coincides with a logrotate or so ?

(Unless it's evil file locking stuff)

Comment 7 Scott Dowdle 2001-08-03 15:32:28 UTC
Well, I'm not sure what is causing the deadlock condition but I've since 
noticed that any process that tries to access that particular file (a web 
server access log) becomes deadlocked (oddly enough, except for apache).  I 
attempted the following: 1) stopped apache, 2) cp suspectfile newname, 3) 
delete suspectfile, 4) mv newname suspectfile.  I got deadlocked on step two.  
I suspect some form of file corruption.  I edited my apache config file and 
told it to use a different file for that virtual host... so the suspect file 
isn't used anymore.  I did a shutdown -r now (I'm at work and the machine is at 
home so I'm glad it did indeed come back up)... and will check the disk when I 
get a chance.  I'm updating the analog script to use the new file as well.

If and when I get the suspect file fixed, I'll just append the new file to it 
and call it good.

So, to summarize, it appears that I'm getting deadlocked processes as a result 
of a corrupted file(system).  I'll wait a day or two to make sure this is the 
case before I close this bug report... if you don't mind.

I greatly appreciate your help!
