Bug 112129

Summary:	system is very sluggish when nice 0 tasks are running (scheduler problem)
Product:	[Fedora] Fedora
Reporter:	Jeremy Sanders <jss>
Component:	kernel
Assignee:	Ingo Molnar <mingo>
Status:	CLOSED WONTFIX
Severity:	medium
Priority:	medium
Version:	1
Hardware:	i686
OS:	Linux
Last Closed:	2004-09-29 19:51:22 UTC

Description Jeremy Sanders 2003-12-15 10:26:31 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5) Gecko/20031007 Firebird/0.7

Description of problem:
When a CPU-intensive process is running at nice 0, other processes run
very slowly. For instance, logging in takes about 10 seconds when it
normally completes in 1 second. The nfsd processes on the machine are
also very sluggish (they run at nice 0 too). Accessing the NFS share
from another machine often produces warnings like these in the system
logs:

nfs: server xpc5.ast.cam.ac.uk not responding, still trying
nfs: server xpc5.ast.cam.ac.uk OK


Version-Release number of selected component (if applicable):
kernel-2.4.22-1.2129.nptl

How reproducible:
Always

Steps to Reproduce:
1. Start a CPU-bound process at nice 0.
2. Log in.

Actual Results:  Logging in takes tens of seconds.

Expected Results:  Logging in takes a couple of seconds.

Additional info:

This is an Intel P4 2.8 GHz machine.
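The reproduction steps above can be scripted. The following is a minimal sketch, not something the reporter ran: the helper names `elapsed' and `with_cpu_hog' are hypothetical, the busy loop stands in for the CPU-intensive job, and GNU `date' (for the %N nanoseconds format) plus a /bin/sh that supports `local' (bash and dash both do) are assumed.

```sh
#!/bin/sh
# Sketch only: time a command while a CPU-bound process competes for the CPU.
# Hypothetical helpers; assumes GNU date (for %N) and awk.

# elapsed CMD...: run CMD with its output discarded, print wall-clock seconds.
elapsed() {
    local start end
    start=$(date +%s.%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s.%N)
    awk -v s="$start" -v e="$end" 'BEGIN { printf "%.2f\n", e - s }'
}

# with_cpu_hog CMD...: start one busy loop in the background (it inherits the
# caller's niceness, normally 0), time CMD with elapsed, then stop the loop.
with_cpu_hog() {
    local hog
    sh -c 'while :; do :; done' >/dev/null 2>&1 &
    hog=$!
    elapsed "$@"
    kill "$hog"
    wait "$hog" 2>/dev/null
    return 0
}
```

For example, compare `elapsed ssh localhost true' with `with_cpu_hog ssh localhost true'; a non-interactive ssh session only approximates a full login. On a machine with more than one logical CPU (a hyper-threaded P4, for instance, shows two), start one busy loop per CPU, since a single loop leaves the others idle.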

Comment 1 Robert Haas 2004-03-06 02:18:42 UTC

I have a related problem (or possibly the same problem).

I have two FC1 boxes, both of which appear to exhibit extreme
sluggishness during periods of heavy disk utilization. I started the
following loop on a separate machine (not running FC1):

while true; do time ssh root@prism date; sleep 3; done
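A timestamped variant of that loop makes it easier to line each measurement up with `vmstat 2' output taken on the server. This is a sketch, not the script used here: `probe_loop' is a hypothetical helper, and GNU `date' (for %N) plus a shell that supports `local' are assumed.

```sh
#!/bin/sh
# Sketch only: timestamped latency probe, like the ssh timing loop above.
# probe_loop is a hypothetical helper; assumes GNU date (for %N) and awk.

# probe_loop COUNT INTERVAL CMD...: run CMD COUNT times, sleeping INTERVAL
# seconds between runs, and print "HH:MM:SS seconds" for each run.
probe_loop() {
    local count="$1" interval="$2" i=0 stamp start end
    shift 2
    while [ "$i" -lt "$count" ]; do
        stamp=$(date +%H:%M:%S)
        start=$(date +%s.%N)
        "$@" >/dev/null 2>&1
        end=$(date +%s.%N)
        awk -v t="$stamp" -v s="$start" -v e="$end" \
            'BEGIN { printf "%s %.2f\n", t, e - s }'
        i=$((i + 1))
        # No sleep after the final run.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

Run from the client as, for example, `probe_loop 100 3 ssh root@prism date' to take 100 samples 3 seconds apart, matching the sleep in the loop above.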

Each ssh took around 0.4 seconds to complete.  Then I ran this command
on prism, an FC1 box running kernel 2.4.22-1.2115.nptl:

dd if=/dev/zero of=foo bs=1k count=51242880

After the dd had been running for 10-15 seconds, the ssh became very
slow. The times for the next few runs were: 114 s, 81 s, 21 s, 24 s,
13 s, 38 s, 17 s, 34 s, 24 s, 40 s, 205 s. `vmstat 2' output from the
machine looked like this:

 1  5    256  10492  27264 938920    0    0     0 15586  134    24  0  1  0 99
 1  3    256   9792  27292 939592    0    0     0 15840  135   339  3 35  0 62
 0  4    256   5400  27380 922512    0    0     2 15072  134   824  4 84  0 12
 0  4    256   3448  27376 893588    0    0     0 16032  136   116  0 12  0 88
 0  5    256   6296  27376 870804    0    0     0 15392  135   151  0 16  0 84
 0  4    256  30796  27376 870808    0    0     2 14624  137    27  0  7  0 93
 0  4    256  60908  27376 870808    0    0     0 15906  136    24  0  2  0 98
 1  3    256  43708  27408 902264    0    0     0 15230  137    26  1 27  0 72
 0  4    256   9244  27500 936220    0    0     2 15906  136   526  3 85  0 11
 0  4    256   9068  27500 936220    0    0     0 15104  133    23  0  1  0 99
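To check the reading of these numbers, a small awk summary can average the rows. This sketch assumes the 16-column layout printed by procps 3.x `vmstat' (r b swpd free buff cache si so bi bo in cs us sy id wa); that layout is an assumption, not something stated in this report, so compare it with the header line on the machine in question. `summarize_vmstat' is a hypothetical helper.

```sh
#!/bin/sh
# Sketch only. ASSUMPTION: 16-column procps 3.x vmstat layout
#   r b swpd free buff cache si so bi bo in cs us sy id wa
# summarize_vmstat is a hypothetical helper.

# summarize_vmstat: read vmstat output on stdin, skip header lines, and print
# the mean of bo (blocks sent to a block device per second) and of the four
# CPU percentage columns.
summarize_vmstat() {
    awk '
        # Data rows start with a number and have at least 16 fields.
        $1 ~ /^[0-9]+$/ && NF >= 16 {
            n++
            bo += $10; us += $13; sy += $14; id += $15; wa += $16
        }
        END {
            if (n == 0) { print "no data rows" > "/dev/stderr"; exit 1 }
            printf "rows=%d bo=%.0f us=%.1f sy=%.1f id=%.1f wa=%.1f\n",
                n, bo / n, us / n, sy / n, id / n, wa / n
        }'
}
```

Fed the ten rows above, it prints `rows=10 bo=15469 us=1.1 sy=27.0 id=0.0 wa=71.8'; under the assumed layout, the CPU spent most of its time waiting for I/O rather than running user code.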

I've also had this problem when downloading large files via `curl'
over a 100 Mbps connection. If I start `curl' to download an ISO
image (i.e. a big file) from another machine on my local network and
then try to open a new terminal window, I rarely get a prompt before
`curl' finishes.

CPU load doesn't appear to be the issue, as the `vmstat' output above
shows fairly clearly. In fact, I tried to reproduce the original
reporter's problem by running:

perl -e 'while (1) {}'

That increased the time for the ssh command to complete somewhat, but
it was still under a second: nothing like the huge delays caused by
writing to disk.

This is a 700 MHz Pentium III machine. I will try this again under
the kernel described above and verify that the behavior is the same.
I'm fairly sure it will be, because I just applied all the latest
patches to another FC1 box and it still shows the same sluggishness,
which is very noticeable during interactive use if you do anything
that makes heavy use of the disk.

Incidentally, even though the following should produce a lot of disk
activity, at least the first time it is run, it doesn't reproduce the
problem:

find / -exec cat {} \; > /dev/null

Either this doesn't produce sufficiently sustained disk I/O, or the
problem involves writing rather than reading, or... something.
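One way to separate the write and read hypotheses is to time the same probe under each load in turn. The sketch below is an illustration, not something run for this report: `max_latency_under' is a hypothetical helper, the load is passed as a shell command string and must finish on its own, and GNU `date' (for %N) plus a shell that supports `local' are assumed.

```sh
#!/bin/sh
# Sketch only: report the slowest run of a probe command while a bounded
# load runs in the background. Hypothetical helper; assumes GNU date (%N).

# max_latency_under LOAD COUNT INTERVAL PROBE...
#   LOAD      shell command string for the background load (should end on its own)
#   COUNT     number of probe runs
#   INTERVAL  seconds to sleep between probe runs
#   PROBE     command whose wall-clock time is measured
max_latency_under() {
    local load="$1" count="$2" interval="$3" i=0 max=0 start end load_pid
    shift 3
    sh -c "$load" >/dev/null 2>&1 &
    load_pid=$!
    while [ "$i" -lt "$count" ]; do
        start=$(date +%s.%N)
        "$@" >/dev/null 2>&1
        end=$(date +%s.%N)
        # Keep the largest duration seen so far, formatted to two decimals.
        max=$(awk -v s="$start" -v e="$end" -v m="$max" \
            'BEGIN { d = e - s; if (d > m) m = d; printf "%.2f\n", m }')
        i=$((i + 1))
        sleep "$interval"
    done
    # Let the load finish so one comparison does not overlap the next.
    wait "$load_pid"
    echo "$max"
}
```

For example, run it once with a write load such as dd if=/dev/zero of=foo bs=1k count=1000000 (an arbitrary, illustrative size; remove foo afterwards) and once with the find command above as the load, using the same probe (say, ssh localhost date) both times. If only the write load produces long delays, that points at writing rather than reading.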

Comment 2 Ingo Molnar 2004-08-18 11:46:52 UTC

(oops, orphaned bug.)

Does this occur with current kernels too (and in particular with FC2)?

Comment 3 Jeremy Sanders 2004-08-18 12:24:09 UTC

The situation appears a lot better for me. Using
kernel-2.6.7-1.494.2.2 in FC2, logging in takes around 3 seconds
with nothing else running, and around 4 seconds with a nice 0 process
running at the same time. Both times seem longer than the unloaded
time I reported above, but the configuration could easily be
different.

I haven't seen any of the "NFS server not responding" messages since
switching to FC2.

Comment 4 David Lawrence 2004-09-29 19:51:22 UTC

Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/