Bug 103121 - LTC4108-[perf][SPECweb99] RHEL 3 Beta 1 4-way performs better than 8-way
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i386 Linux
Priority: medium, Severity: medium
Assigned To: Ingo Molnar
Depends On:
Blocks: 101028 103278
Reported: 2003-08-26 15:34 EDT by IBM Bug Proxy
Modified: 2007-11-30 17:06 EST (History)
Doc Type: Bug Fix
Last Closed: 2004-08-18 08:00:55 EDT


Attachments
rhel.errors (3.56 KB, text/plain)
2003-09-11 00:03 EDT, IBM Bug Proxy
getconfig script (2.67 KB, text/plain)
2003-09-15 14:58 EDT, Ingo Molnar

Description IBM Bug Proxy 2003-08-26 15:34:18 EDT
The following has been reported by IBM LTC:
[perf][SPECweb99] RHEL 3 Beta 1  4-way performs better than 8-way
Hardware Environment:
  Server:  Netfinity 8500r  (8 x 900 MHz)
           28 GB RAM
           4 Intel e1000 dual-port gigabit adapters

 Clients: 8 xSeries x330  (2 x 1GHz)
          1.5 GB RAM
          1 Intel e1000 gigabit adapter
            

Software Environment:
RHEL 3 Beta 1  plus all errata as of Aug 11  (via up2date)
SPECweb99 v1.02
Apache 2.0.47 w/mod_specweb


Steps to Reproduce:
1. Run SPECweb99 benchmark
2.
3.

Actual Results:
8-way performs poorly, 4-way performs better than 8-way

Expected Results:
Better 8-way performance

Additional Information:
Metric is "Simultaneous Connections" in SPECweb99

RHEL 3 Beta 1
8-way = 1500  us 16  sy 84  id 0
4-way = 1600  us 29  sy 71  id 0

A 2.5.73 kernel.org kernel built on the same RHEL 3 Beta 1 install sees 3600
8-way, and 2400 4-way.
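
To make the scaling gap explicit, the figures quoted above can be reduced to an 8-way/4-way ratio. This is only a sketch over the numbers in this report; the dictionary layout and the `scaling_ratio` helper are illustrative names, not part of any benchmark tooling.

```python
# Simultaneous-connection figures quoted in the report above.
results = {
    "RHEL 3 Beta 1": {"4-way": 1600, "8-way": 1500},
    "2.5.73 kernel.org": {"4-way": 2400, "8-way": 3600},
}


def scaling_ratio(four_way, eight_way):
    """Return 8-way throughput relative to 4-way (2.0 would be perfect scaling)."""
    return eight_way / four_way


ratios = {
    name: scaling_ratio(r["4-way"], r["8-way"]) for name, r in results.items()
}

for name, ratio in ratios.items():
    print(f"{name}: 8-way/4-way = {ratio:.2f}")
```

A ratio below 1.0 means the extra four CPUs reduce throughput, which is the behaviour being reported for RHEL 3 Beta 1.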

I suspect TCP locking problems as seen in pre-Beta 1 release of RHEL 3:

8-way @ 1500 conns
------------------
  0.000%        1 .text.lock.tcp_input
  0.003%        5 .text.lock.tcp_minisocks
  0.221%      339 .text.lock.tcp_timer
  0.793%     1213 .text.lock.tcp_ipv4
  2.027%     3098 .text.lock.tcp

4-way @ 1600 conns
------------------
  0.001%        1 .text.lock.tcp_input
  0.003%        2 .text.lock.tcp_minisocks
  0.004%        3 .text.lock.tcp_timer
  0.188%      117 .text.lock.tcp_ipv4
  0.300%      186 .text.lock.tcp
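
For a quick comparison, the two lock profiles above can be totalled. The sketch below assumes the first column is the share of all profile ticks and the second column is the raw tick count; the parsing helpers are illustrative and not part of any profiling tool.

```python
import re

# Lock-profile excerpts copied from the report above.
PROFILE_8WAY = """\
  0.000%        1 .text.lock.tcp_input
  0.003%        5 .text.lock.tcp_minisocks
  0.221%      339 .text.lock.tcp_timer
  0.793%     1213 .text.lock.tcp_ipv4
  2.027%     3098 .text.lock.tcp
"""

PROFILE_4WAY = """\
  0.001%        1 .text.lock.tcp_input
  0.003%        2 .text.lock.tcp_minisocks
  0.004%        3 .text.lock.tcp_timer
  0.188%      117 .text.lock.tcp_ipv4
  0.300%      186 .text.lock.tcp
"""

# Assumed line layout: "<percent>%  <ticks>  <section name>".
LINE_RE = re.compile(r"^\s*([\d.]+)%\s+(\d+)\s+(\S+)\s*$")


def parse_profile(text):
    """Return a list of (percent, ticks, name) tuples for matching lines."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            rows.append((float(match.group(1)), int(match.group(2)), match.group(3)))
    return rows


def totals(rows):
    """Return (summed percent rounded to 3 places, summed ticks)."""
    return round(sum(p for p, _, _ in rows), 3), sum(t for _, t, _ in rows)


total_8 = totals(parse_profile(PROFILE_8WAY))
total_4 = totals(parse_profile(PROFILE_4WAY))
print("8-way TCP lock sections:", total_8)
print("4-way TCP lock sections:", total_4)
```

Under those assumptions the TCP lock sections account for roughly 3.0% of ticks on the 8-way run versus roughly 0.5% on the 4-way run.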



What additional information would be helpful in debugging this issue?


*  SPEC(tm) and the benchmark name SPECweb(tm) are registered
trademarks of the Standard Performance Evaluation Corporation.
This benchmarking was performed for research purposes only,
and is non-compliant, with the following deviations from the
rules -

  1 - Runs were shorter than 1200 seconds.

  2 - Access_log wasn't kept for full accounting.  It was
      written, but deleted every 200 seconds.
Comment 1 David Miller 2003-08-27 02:10:08 EDT
Not enough information to analyze this.  Please obtain more detailed
locking profiles, or standard kernel profiles during an 8-way run.
Comment 2 IBM Bug Proxy 2003-08-27 12:04:22 EDT
------ Additional Comments From wilsont@us.ibm.com  2003-27-08 11:37 -------
/usr/sbin/readprofile does not work for me on Beta 1. (all totals 0)

1- I can send raw captures of /proc/profile if that is helpful.

2- Can RedHat supply a working readprofile?

Thanks. 
Comment 3 Arjan van de Ven 2003-08-27 12:13:57 EDT
you need to enable the nmi watchdog as well:
nmi_watchdog=1 profile=1

Comment 4 IBM Bug Proxy 2003-08-28 19:14:03 EDT
------ Additional Comments From wilsont@us.ibm.com  2003-28-08 18:36 -------
Can't complete a run with those kernel args-- double fault.

The box is then locked.  How do I get at the stack values for debugging?

NPTL-related?  assume_kernel? 
Comment 5 Arjan van de Ven 2003-08-29 04:28:32 EDT
If this is one of those IBM machines where the BIOS corrupts registers if you
use the NMI, then I think we're out of luck on this one....
Comment 6 IBM Bug Proxy 2003-09-02 12:49:01 EDT
------ Additional Comments From wilsont@us.ibm.com  2003-02-09 12:08 -------
Would captures of /proc/profile be useful? 
Comment 7 Arjan van de Ven 2003-09-02 12:52:43 EDT
only when decoded by readprofile; the addresses at which modules get loaded vary
per machine and even between reboots.....
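
To illustrate why raw samples are only useful together with the matching symbol addresses, here is a minimal sketch of symbol attribution. It is not the /proc/profile binary format or the readprofile implementation; the map contents, symbol names, and sample addresses are all hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def parse_symbol_map(text):
    """Parse 'address type name' lines into a sorted list of (address, name)."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def attribute_samples(symbols, sample_addresses):
    """Count samples per symbol: each sample belongs to the nearest symbol at or below it."""
    starts = [addr for addr, _ in symbols]
    counts = Counter()
    for pc in sample_addresses:
        index = bisect_right(starts, pc) - 1
        name = symbols[index][1] if index >= 0 else "<unknown>"
        counts[name] += 1
    return counts


# Hypothetical symbol table and sampled program counters.
EXAMPLE_MAP = """\
c0105000 T example_func_a
c0106000 T example_func_b
"""
samples = [0xC0105010, 0xC0105FF0, 0xC0106004, 0xBFFF0000]

counts = attribute_samples(parse_symbol_map(EXAMPLE_MAP), samples)
print(counts)
```

If the symbol table comes from a different boot, where modules were loaded at other addresses, the same samples would be attributed to the wrong functions, which is the problem described above.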
Comment 8 IBM Bug Proxy 2003-09-02 20:52:20 EDT
------ Additional Comments From wilsont@us.ibm.com  2003-02-09 19:44 -------
I've removed the system management card from the SPECweb system (Netfinity
8500r).  In theory, that'll make the race condition under which the register
corruption happens much less likely to occur.  I'm producing profiles now.  With
luck, I'll have data for you tomorrow AM. 
Comment 9 IBM Bug Proxy 2003-09-04 20:57:14 EDT
------ Additional Comments From wilsont@us.ibm.com  2003-04-09 18:31 -------
Even with the system management card removed, I am unable to get far enough into
a run to capture profiles.

In apic.c, can I just add a call to x86_do_profile under the CONFIG_SMP case in
smp_local_timer_interrupt?

It would seem that then I'd be able to profile without having to enable NMIs and
do it through nmi_watchdog_tick.

Seem sane? 
Comment 10 Ingo Molnar 2003-09-08 17:54:27 EDT
yes, that change should work. The resulting profiling info should be taken with
a grain of salt - irq-disabled overhead (irq handler overhead, etc.) won't show
up, and the overhead might be added to some unrelated code. But it's better than
nothing.

also, please try a newer kernel.

on a related note, why cannot the BIOS do a proper return from the SMI handler
if it interrupts an NMI handler? How does the BIOS solve the case when the BIOS
itself handles an NMI and is interrupted by an SMI?
Comment 11 IBM Bug Proxy 2003-09-10 19:32:20 EDT
------ Additional Comments From wilsont@us.ibm.com  2003-10-09 19:27 -------
The profiling worked, but the profiles look strange.

When trying a new kernel, up2date fails on mkinitrd when updating
linux-2.4.21-1.1931.2.423.ent.

The source installed OK but when I try and build, it errors out in the e1000
drivers-- which I need.  I may be able to use other e1000 drivers, but that
would negate the point of trying to demonstrate a networking problem.

I did an rpm -e of the kernel stuff, and then an up2date -p to re-sync with RHN,
but up2date still refuses to try the update again.

I'll attach the errors from up2date and the kernel build. 
Comment 12 IBM Bug Proxy 2003-09-11 00:03:26 EDT
Created attachment 94406
rhel.errors
Comment 13 Arjan van de Ven 2003-09-11 04:00:19 EDT
The message "All of your loopback devices are in use." most likely means you
were running a kernel where you didn't compile in loop device support.
That's mandatory for being able to install kernels.
Comment 14 IBM Bug Proxy 2003-09-15 12:36:03 EDT
------ Additional Comments From wilsont@us.ibm.com  2003-15-09 11:56 -------
Very slight improvement with 2.4.21-1.1931.2.399.entsmp.  Picked up ~200
additional SPECweb simultaneous connections, but still very idle-- nearly 50%.
Other 2.4.21 kernels reach 0% idle and ~2700 simultaneous connections on this
hardware.

I'll up2date all errata and re-test, then modify again for profiling on local
timer interrupts. 
Comment 15 IBM Bug Proxy 2003-09-15 14:42:43 EDT
------ Additional Comments From wilsont@us.ibm.com  2003-15-09 14:38 -------
up2date fails when run on stock beta2 as installed from ISO CDs.

...
Installing...
...
 149:wvdial                 ########################################### [100%]
New Up2date available
Traceback (most recent call last):
  File "/usr/sbin/up2date", line 1148, in ?
    #    try:
  File "/usr/sbin/up2date", line 747, in main
    sys.exit(batchRun(options.list, pkgNames,
  File "/usr/sbin/up2date", line 1014, in batchRun
    # quiet mode for rhn_check
  File "/usr/share/rhn/up2date_client/up2dateBatch.py", line 76, in run
    self.__installPackages()
  File "/usr/share/rhn/up2date_client/up2dateBatch.py", line 145, in __installPackages
    self.kernelsToInstall = up2date.installPackages(self.packagesToInstall, self.rpmCallback)
  File "/usr/share/rhn/up2date_client/up2date.py", line 719, in installPackages
    if "kernel" in hdr['Providename']:
  File "/usr/share/rhn/up2date_client/up2date.py", line 769, in runPkgSpecialCases
    return kernels
TypeError: installBootLoader() takes exactly 1 argument (3 given)
[root@x4408way1 root]#


re-running up2date produces:

[root@x4408way1 root]# up2date -u

Fetching package list for channel: rhel3-beta1-as-i386...
########################################

Fetching Obsoletes list for channel: rhel3-beta1-as-i386...

Fetching rpm headers...
########################################
The following Packages were marked to be skipped by your configuration:

Name                                    Version        Rel  Reason
-------------------------------------------------------------------------------
initscripts                             7.31.1.EL      1    Config modified

All packages are currently up to date
[root@x4408way1 root]#

Is my system OK to re-boot, given the installBootLoader() error, and that my
initscripts may be out of sync?

Are things in an OK state to proceed with testing? 
Comment 16 Ingo Molnar 2003-09-15 14:58:57 EDT
Created attachment 94504
getconfig script

could you please run the attached script and attach the result? I suspect it's
some of the TCP settings that is causing problems, but i'm not sure.

also, could you run 'top -b d 10 > top.log' during the test and attach top.log?
Similarly, please run 'vmstat 10 > vmstat.log' too during the test and attach
the resulting vmstat.log.
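
Once vmstat.log exists, its CPU columns can be averaged with something like the sketch below. It assumes a column header line containing `us`, `sy` and `id`; the sample log text is hypothetical and only mimics the general vmstat layout.

```python
def parse_vmstat(text):
    """Return one dict per data row, keyed by the column names from the header line."""
    columns = None
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        # The column header is the line that names the CPU fields.
        if "us" in tokens and "sy" in tokens and "id" in tokens:
            columns = tokens
            continue
        if columns and len(tokens) == len(columns) and all(t.isdigit() for t in tokens):
            rows.append(dict(zip(columns, map(int, tokens))))
    return rows


def mean_cpu(rows):
    """Average the user, system and idle percentages over all rows."""
    count = len(rows)
    return {key: sum(row[key] for row in rows) / count for key in ("us", "sy", "id")}


# Hypothetical log contents, not data from this bug.
SAMPLE_LOG = """\
   procs                      memory      swap          io     system         cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs  us  sy  id
 4  0  0      0 100000  20000 500000    0    0     0    10 9000 12000  20  60  20
 5  0  0      0 100000  20000 500000    0    0     0    12 9100 12500  30  50  20
 1  0  0      0 100000  20000 500000    0    0     0     8 2000  3000  10  40  50
"""

rows = parse_vmstat(SAMPLE_LOG)
averages = mean_cpu(rows)
print(averages)
```

Reading the column names from the header rather than using fixed positions keeps the sketch usable if the vmstat version on the test machine prints a different set of columns.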
Comment 17 IBM Bug Proxy 2003-09-26 11:26:34 EDT
------ Additional Comments From wilsont@us.ibm.com  2003-26-09 11:24 -------
Why did it take until the 25th for Ingo's reply to show up in this defect? 
Comment 18 Ingo Molnar 2003-09-26 13:01:55 EDT
I made the comment on the 15th, and got the email acknowledgement from bugzilla
a couple of minutes later, so the bugzilla side seems to be OK. Are you at IBM
running some other bug tracking system that feeds into bugzilla?
Comment 19 IBM Bug Proxy 2003-09-26 14:31:32 EDT
------ Additional Comments From wilsont@us.ibm.com  2003-26-09 14:25 -------
> Are you at IBM running some other bug tracking system that feeds into bugzilla?

  Yes, that's it exactly.  And, it appears the "Internal only" flag fails to
work as well.  :-)

  I re-ran up2date this morning, and have the benchmark running now.  I'll run
your data-collection script once it reaches maximum load.  Thanks.  I apologize
for the delay. 
Comment 20 IBM Bug Proxy 2003-09-27 09:17:15 EDT
------ Additional Comments From wilsont@us.ibm.com  2003-26-09 19:13 -------
A change after the beta 2 ISOs has greatly helped networking.  SPECweb is
currently still running from this morning and is well beyond the point at which
I expected the benchmark conformance to drop off.

I'll send benchmark results and the output of your 'getconfig' script once
SPECweb reaches its maximum conformance point. 
Comment 21 IBM Bug Proxy 2003-09-29 17:46:13 EDT
------ Additional Comments From wilsont@us.ibm.com  2003-29-09 17:04 -------
It still falls off a cliff-- only later now.  There is an associated huge drop
in interrupt rate when this happens... probably NAPI kicking in (since it is
enabled by default in your kernel).  e1000 NAPI has never worked for me.  I'm
re-building to try running without NAPI (unless there is a module option to turn
it off?  NAPI_HOWTO.txt is not helpful here.). 
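
A drop like the one described could be located in a series of interrupt-rate samples (for example the `in` column of a vmstat log) with a sketch such as the following. The threshold, the helper name `find_cliff`, and the sample values are all assumptions chosen for illustration.

```python
def find_cliff(rates, drop_fraction=0.5):
    """Return the index of the first sample below drop_fraction of the mean
    of all earlier samples, or None if no such sample exists."""
    total = 0
    for index, rate in enumerate(rates):
        if index > 0 and rate < drop_fraction * (total / index):
            return index
        total += rate
    return None


# Hypothetical interrupts-per-second samples, not measurements from this bug.
interrupts = [9000, 9200, 9100, 8900, 2000, 1800]

cliff = find_cliff(interrupts)
print("first sample after the drop:", cliff)
```

Matching that index against the benchmark timeline would show whether the interrupt-rate drop and the loss of conformance happen at the same point in the run.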
Comment 22 IBM Bug Proxy 2003-09-29 18:11:27 EDT
------ Additional Comments From wilsont@us.ibm.com  2003-29-09 18:06 -------
Using the RedHat-supplied /boot/config-2.4.21-3.ELsmp with the only change being
'CONFIG_E1000_NAPI=y' to '# CONFIG_E1000_NAPI is not set', the kernel will
compile fine, but modules will not.  Is there another way to turn off NAPI
without my having to fight this build process again?  Every time I go through
this I have to make so many changes in order to make things compile that my
resulting kernel is in no way similar to your released kernel-- which is what
I'm trying to help you test. 
Comment 23 Arjan van de Ven 2003-09-30 02:50:21 EDT
Our source and configs are identical to what we ship.
HOWEVER
You have to issue a make mrproper in the /usr/src/linux-2.4 directory before
doing ANYTHING because that directory comes preconfigured (to allow external
modules to build) and the 2.4 kernel makefiles don't have complete dependencies
to wipe these  :(
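
The config change described in Comment 22 can be scripted so that it is applied the same way each time. The sketch below only rewrites the config text; the `disable_option` helper and the sample contents are illustrative, and the edited file would be put in place as `.config` after the `make mrproper` step, since mrproper removes any existing `.config`.

```python
def disable_option(config_text, option):
    """Return config_text with 'OPTION=<value>' replaced by '# OPTION is not set'."""
    out = []
    for line in config_text.splitlines():
        # Match the exact option name so CONFIG_E1000 and CONFIG_E1000_NAPI stay distinct.
        if line.startswith(option + "="):
            out.append(f"# {option} is not set")
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical three-line excerpt standing in for the shipped config file.
sample = "CONFIG_E1000=m\nCONFIG_E1000_NAPI=y\nCONFIG_SMP=y\n"

new_config = disable_option(sample, "CONFIG_E1000_NAPI")
print(new_config, end="")
```

Changing only the one line keeps the rest of the configuration identical to the shipped file, which is the point of testing against the released kernel.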
Comment 24 IBM Bug Proxy 2003-10-01 17:48:19 EDT
------ Additional Comments From wilsont@us.ibm.com  2003-01-10 11:13 -------
OK, turning off NAPI was the answer.  The system now loads up and runs as it
should.  Thanks! 
Comment 27 IBM Bug Proxy 2005-03-27 12:52:11 EST
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ACCEPTED                    |CLOSED
             Impact|------                      |Performance




------- Additional Comments From khoa@us.ibm.com  2005-03-27 12:50 EST -------
Bug clean-up time.  I'd like to close this bug report based on Comment #28
above.  Thanks. 
