Bug 140123

Summary: CPU scheduling issue
Product: Red Hat Enterprise Linux 3
Reporter: Wendy Cheng <nobody+wcheng>
Component: kernel
Assignee: Ingo Molnar <mingo>
Status: CLOSED WONTFIX
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 3.0
CC: dowdle, k.georgiou, peterm, petrides, riel, sct, tao, tburke
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2005-09-15 20:41:38 UTC
Attachments:
Description        Flags
test program -1    none
scheduler patch    none

Description Wendy Cheng 2004-11-19 21:18:24 UTC

Description of problem:
Various customers have reported troublesome CPU scheduling issues.
Our test runs show that the RHEL 3 scheduler allows a CPU hog to get
more than its fair share of CPU time. In some customer production
environments this results in extremely poor performance relative to
other kernels such as AS2.1 or Fedora.

The problem is best demonstrated by a test program that forks a
child (a CPU hog) that continuously sends signals to the parent. The
parent sits in a sleep(1) loop, but its signal handler counts
the number of times it gets to run. After the count reaches 10,000,000,
the parent notifies the child (kill with SIGTERM) and exits the test run.
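
For readers without the attachment, the test described above can be
sketched roughly as follows. This is NOT the attached hang.c; it is a
minimal illustration written from the description only. The choice of
SIGUSR1, the function name run_signal_test(), and the getppid() safety
check in the child are all assumptions made for this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Bumped by the parent's handler each time a signal is delivered.
 * sig_atomic_t is an int on Linux, so 10,000,000 fits. */
static volatile sig_atomic_t signal_count = 0;

static void count_signal(int signo)
{
    (void)signo;
    signal_count++;
}

/* Hypothetical helper: fork a CPU-hog child that floods the parent with
 * SIGUSR1 (an assumed choice of signal) until the parent's handler has
 * run `target` times.  Returns the final count, or -1 on error. */
long run_signal_test(long target)
{
    struct sigaction sa;
    pid_t parent = getpid();
    pid_t child;

    signal_count = 0;
    sa.sa_handler = count_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGUSR1, &sa, NULL) == -1)
        return -1;

    child = fork();
    if (child == -1)
        return -1;

    if (child == 0) {
        /* Child: never sleeps, just keeps signalling the parent.  The
         * getppid() check (an addition for safety) stops the loop if
         * the parent dies before sending SIGTERM. */
        while (getppid() == parent)
            kill(parent, SIGUSR1);
        _exit(0);
    }

    /* Parent: sleep(1) returns early whenever the handler has run. */
    while (signal_count < target)
        sleep(1);

    /* Notify the child and reap it; a late SIGUSR1 may interrupt
     * waitpid(), so retry on EINTR. */
    kill(child, SIGTERM);
    while (waitpid(child, NULL, 0) == -1 && errno == EINTR)
        ;

    return (long)signal_count;
}
```

Calling run_signal_test(10000000) from a small main() and running the
result under "time" should approximate the scenario in this report.
Standard signals are not queued, so the count reflects how often the
parent actually gets to run, not how many signals the child sends.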

With an upstream scheduler test patch that I put together during U2-U3
IO performance troubleshooting, we observe astonishing differences
with and without the patch:

1. With the patch, the test program completes in 1m43.448s in our test
environment. Without the patch, it was still running after 5m44.032s,
when I interrupted it with <ctrl><c>.
2. Without the patch, the child hogs more than 90% of the CPU (as shown
by the top command). With the patch, the child and parent share the CPU
in a reasonable 40%/60% split.

The very same test kernel was subsequently sent to other customers,
who also reported positive results. I will clean up the patch a little
over the weekend and attach it to this bugzilla.


Version-Release number of selected component (if applicable):
2.4.21-20.ELsmp

How reproducible:
Always

Steps to Reproduce:
1. Untar the attached test program and run "make".
2. Run "time ./hang".
3. Note the program's completion time and monitor CPU usage with the
"top" command.

Comment 1 Wendy Cheng 2004-11-19 21:57:39 UTC
Created attachment 107097
test program -1

The executable location (currently set to /usr/src/wendy/) needs to be
modified to match your environment (see hang.c for details).

Comment 2 Wendy Cheng 2004-11-19 22:06:21 UTC
The test program (comment #1) is based on: 

 http://www.hpl.hp.com/research/linux/kernel/o1-starve.php

by David Mosberger from HP Palo Alto.


Comment 8 Scott Dowdle 2004-12-13 22:35:54 UTC
This is the second place I've tacked on my problem asking if it
appears to be related to the bug described.  Please advise.  speedycgi
is a parent process that talks to children over a long period of
time... so does it fit this?

- - - - -

I've been having a big performance problem with RHEL AS 3 and the
2.4.21-20 kernel.  The situation is frustrating.  I'm running
OpenWebMail using SpeedyCGI.  SpeedyCGI speeds things up and is
supposed to be a good thing.  After the kernel upgrade, I see the
following in /var/log/messages:

Dec 12 04:08:38 mail kernel: application bug: speedy_backend(20437) has SIGCHLD set to SIG_IGN but calls wait().
Dec 12 04:08:38 mail kernel: (see the NOTES section of 'man 2 wait'). Workaround activated.

The iowait from top is always very high.  The machine gets bogged down
and has to be restarted every few days when it gets into situations
where the load is in the 20s or higher and just won't come down.  Even
when it's doing almost nothing it has a load average of 0.76.  I am NOT
running on underpowered hardware.

I've been told that I need to do some vm tweaking.  My attempts have
helped a bit here and there but have not solved the problem.

Are my performance issues related to use of perl / openwebmail /
speedycgi and this issue interacting?

Please help.


Comment 9 Wendy Cheng 2004-12-13 22:45:44 UTC
For comment #8, the warning messages in your /var/log/messages file
are *not* related to this issue.

What did "kernel upgrade" mean? From AS2.1 to RHEL 3?

RHEL 3 does have some vm issues that would cause similar symptoms. I
would suggest either trying out the newest U4 beta kernel (.27EL
will be out the door very soon) or reporting this issue to Red Hat support.

Comment 10 Scott Dowdle 2004-12-13 23:11:18 UTC
The machine in question is running RHEL AS 3 U4.  It was installed as
RHEL AS 3 U3 if I remember correctly.  By kernel upgrade, I meant from
whatever kernel came before.  Perhaps the problem has existed for some
time.  I only noticed it sometime after using the latest kernel.

Regarding RH Support.  I'm at a college.  I bought 3 copies of RHEL ES
2.1 and before getting them installed... (about a month later), Red
Hat came out with academic pricing.  As I understood it at the time
there wasn't an upgrade path from ES to AS so when it came time to
install, I just bought AS 3 at academic pricing and let the RHEL ES
rot... although they did get registered but not really used.

Anyway, my point here is that I filed a support request with Red Hat
support but never got a response back.  I'm assuming it was because I
reported it against AS Academic Edition which doesn't come with email
/ phone support.  I thought about filing the bug against the unused ES
2.1's but the problem didn't apply to those.

Since I'm not sure where to get beta kernels, I'll wait until the
.27EL is released.

Hunting down performance reports that looked related to mine, I found
them going back to March and April (on Dell's Linux support forums for
example) so I was starting to think that Red Hat was ignoring these
problems or just not being vocal about trouble fixing the issue.  I
certainly appreciate all that Red Hat does for the community with all
of the development.

Thank you for replying to me even if this is in an off topic way. :)

Comment 11 Scott Dowdle 2004-12-14 16:54:46 UTC
Ok, update.  I had a brainfart yesterday with version numbers.  So, I
did a clean install of RHEL AS 3 U2 (Academic Edition) this summer. 
Ran that for a while and didn't notice the load/performance problem. 
It may have existed but not as bad.  Then updated to U3.  Then the
problem got really bad.  Is that vague enough for you? :)

I did pull my head out and went to the RHEL AS 3 beta channel and
downloaded the kernel-smp-2.4.21-27.EL beta kernel.  I've been running
it for about 12 hours now and have not seen much of a change.  The
system seems more responsive at the command prompt (when I'm ssh'ed
in) but the load is still way too high for what the machine is doing.
The RAM caching and swap seem to have come to reasonable levels.
Perhaps the load accounting is somewhat off?  Anyway, haven't been
running it long enough to make a definitive call... but thus far, it
does NOT appear that this kernel fixes my problem.

So, I have contacted Red Hat Support but I went through that in a
previous submission.

Doing my best to research the problem on my own, I came across
references to people having similar problems with the RHEL kernel
dating back to March and April of this year.  In most cases it appears
they were told to contact Red Hat Support and tweak the vm system. 
That appears to have worked for many people.  Of course they would
have to do various system dumps and it appears their vm settings were
customized to their load.

So far, I haven't found the right balance.

This leads me to some questions about Red Hat.  I do not mean to
question the integrity of the employees... but my perception from this
whole issue is that:

1) Perhaps in adding support for enterprise class hardware, the
complexity of making the scheduling system and the vm system work on
all loads has increased to a degree that Red Hat is having a problem
making it work for most people most of the time without tweaking

or

2) Red Hat has, one way or another, shipped a somewhat broken kernel
in an effort to make their support system needed by a significant
number of customers

I don't have enough data to pick from the two and chances are it is
some combination of both.

I will continue to work on this problem until it is resolved but given
the Academic Edition status, I'm not a legitimate support customer and
have to suffer through it.

So, does the new beta kernel fix the problem?  Not for me, or at least
that's how it appears now.  Can Red Hat support help me?  No, I don't
qualify.  Will the next kernel update fix the problem?  Not sure, hope so.

Have I given you enough information about my hardware, software and
load?  No.  But if you ask for specific things, I'll provide them.  I
guess I could see what everyone else was asked and just provide the
information but I'm not sure anyone wants to hear it... because they
have seen this problem over and over already and tuning my specific
system would actually be end user support.

I also run, as mentioned previously, some third-party software
(MailScanner, clamav, OpenWebMail w/speedycgi) and I'm sure those
don't play well in the support mix... but I can tell you there are a
lot of people out there using those because they work well.

Thank you for trying.



Comment 12 Wendy Cheng 2004-12-14 17:40:51 UTC
Let's avoid filling this bugzilla with unrelated issues.

I'll follow up on the above two comments via email.


Comment 14 Wendy Cheng 2004-12-15 15:39:17 UTC
Created attachment 108619
scheduler patch

This patch was used to build the test kernel run by the above 4 customers.
It is included for reference.

Comment 22 Ernie Petrides 2005-09-15 20:41:38 UTC
Closing as recommended by last comment.