Bug 175180

Summary: SPECWeb99_SSL on RHEL 4 Performance -30% versus upstream
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.0
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: high
Reporter: John Shakshober <dshaks>
Assignee: David Miller <davem>
QA Contact: Brian Brock <bbrock>
CC: davej, jbaron, kajtzu, mingo
Keywords: FutureFeature
Doc Type: Enhancement
Fixed In Version: rhel4.4
Last Closed: 2007-05-23 13:45:10 UTC
Attachments:
Lockmeter Results
Oprofile Results

Description John Shakshober 2005-12-07 13:14:56 UTC
Description of problem:

Reported by an AMD partner, as the SUT is large and not onsite at Red Hat.

What we have observed is that when running the SPECweb99_SSL benchmark, the
RHEL 2.6.9-22.EL kernel exhibits a dramatic fall-off in the number of
conforming connections that can be supported, starting at about 5860
connections in our configuration (see below for details).  As a comparison,
the SuSE SLES 9 SP2 x86-64 kernel can support 8835 conforming connections with
the same hardware configuration.  We investigated this by running oprofile and
lockmeter against the 2.6.9-22.EL kernel.  Here is what we found:

 
System Description

------------------

 

Here is a description of the test environment:

 

The server is a Sun v40z system (4-socket, dual-core Opteron 880 (2.4 GHz), 32
GB of 400 MHz DDR memory).  There are 4 dual-port Intel GB NICs, each an Intel
82546GB adapter.  The server software is Zeus 4.2r4.  8 Zeus processes are
running; each is bound (via taskset) to a specific core, and the IRQ for the
associated adapter is also bound to that core.  Each Zeus process listens on a
separate IP address, so all packets and interrupts are handled by the core
associated with the Zeus process for that IP address.  The client network
consists of 24 clients, each a Sun v20z, and there are 8 subnets, so each Zeus
process is getting its requests from a private subnet.  (This is pretty
standard stuff for running this kind of submission.)

 

The file system on the server is ext2, implemented as a software RAID on top of
3 disks.  There is a separate dedicated disk for the log files as well as a
separate disk for the installed O/S.  The file systems are mounted with the
noatime and nodiratime options to reduce update overhead.

 

As mentioned above, the benchmark is SPECweb99_SSL.
Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Set up SPECweb99_SSL on large system(s)
2. Run SPECweb99_SSL and tune
3. Profile with oprofile and lockmeter
  
Actual results:



Oprofile Results

----------------

 

(When we turned on oprofile, it introduced a small amount of overhead, so we
had to recalibrate where the knee occurred.  Instead of happening at 5860
connections, it happened in the vicinity of 5760 connections, which is why the
names of the attached files don't quite match the number above.)

 

If we compare the oprofile results for 5750 connections with those for 5760
connections (see attachments 5750.report.tmp.recalc and
5760.report.tmp.recalc), the only thing that appears to be different is that at
5760 connections the system is now spending 10.7% of its time in
__write_lock_failed(), whereas at 5750 that number is 0.0033%.  That appears
significant to us and indicates that there is some read/write spin lock in the
system that is causing contention.  Given that, we built a 2.6.9-22.EL kernel
with the lockmeter patch applied and re-ran the experiment.

 

(The recalc suffix on the file names refers to the fact that when we ran the
experiments, we didn't have the oprofile options quite right, so we got data
for the kernel and all user processes.  Rather than re-run the experiments, we
just wrote a little Perl script to filter out the non-kernel observations and
produce a kernel-only profile.)

 

Lockmeter Results

-----------------

 

Running with lockmeter enabled introduces a fair amount of overhead in the
system, so we didn't ramp up the experiment with lockmeter running.  Instead,
we booted the lockmeter-enabled kernel, got the system up to the desired number
of connections, and then turned the lockmeter statistics on for 300 seconds.
In this case we compare outputs for 5800 and 5860 connections (see
lockstat.out.5800 and lockstat.out.5860) -- ignore the spin lock statistics at
the top and look at the RW lock READ and WRITE reports.  What we see is that
there appears to be some contention for unix_table_lock in the report for 5860
connections, but no contention for that lock in the report for 5800
connections.  Writers see contention for this lock roughly 30% of the time they
try to acquire it, and the lock is busy in read mode nearly 20% of the time.

 

The unix_table_lock appears to protect the hash table of sockets used for
AF_UNIX.  This is surprising, since AF_UNIX is only used for communication on
the local machine; I would have expected to see contention in the AF_INET
family instead.  However, I'm not really a Linux networking expert, so I may be
reading this incorrectly.  Nonetheless, this appears to be significant and
likely explains the time we saw spent in the __write_lock_failed() routine
above, since that is where a process goes when it tries to acquire an rwlock in
write mode and can't because a reader (or another writer) is holding the lock.
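
To make the pattern concrete, here is a tiny userspace analogy (illustrative
only -- this is not the kernel code, and the names are made up) of what we
believe is happening: lookups take the lock in read mode while socket
creation/teardown take it in write mode, so with enough readers in flight each
writer has to wait, which on the kernel side shows up as time spent in
__write_lock_failed().

/* Userspace sketch of the contention pattern (hypothetical names).
 * Build with: gcc -O2 -pthread rwlock_demo.c -o rwlock_demo */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t table_lock = PTHREAD_RWLOCK_INITIALIZER;
static long table_entries;              /* stand-in for the AF_UNIX socket hash */

static void *lookup_worker(void *arg)   /* like a lookup under read_lock() */
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_rwlock_rdlock(&table_lock);
        (void)table_entries;            /* "hash walk" under the read lock */
        pthread_rwlock_unlock(&table_lock);
    }
    return NULL;
}

static void *create_release_worker(void *arg) /* like unix_create1()/unix_release_sock() */
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_rwlock_wrlock(&table_lock);   /* must wait for every reader to drain */
        table_entries++;
        pthread_rwlock_unlock(&table_lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t readers[8], writer;

    for (int i = 0; i < 8; i++)
        pthread_create(&readers[i], NULL, lookup_worker, NULL);
    pthread_create(&writer, NULL, create_release_worker, NULL);

    for (int i = 0; i < 8; i++)
        pthread_join(readers[i], NULL);
    pthread_join(writer, NULL);
    printf("entries: %ld\n", table_entries);
    return 0;
}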

 

Expected results:

Should have been able to saturate the CPUs and scale to at least 8800 connections.

Additional info:

This appears to be an AF_UNIX bottleneck as opposed to an AF_INET bottleneck.
5600.report.tmp.recalc:      18   583438     0.0031    99.6723   __write_lock_failed
5750.report.tmp.recalc:      38  1157755     0.0033    99.6552   __write_lock_failed
5760.report.tmp.recalc:  241230  2257574    10.6854    28.7934   __write_lock_failed
5780.report.tmp.recalc:  195230  2018453     9.6723    29.1407   __write_lock_failed

Comment 1 John Shakshober 2005-12-07 13:18:01 UTC
Created attachment 121968 [details]
Lockmeter Results

Comment 2 John Shakshober 2005-12-07 13:19:42 UTC
Created attachment 121969 [details]
Oprofile Results

Comment 3 John Shakshober 2006-01-04 19:52:20 UTC
Also, the most recent kernel-perf test results show TCP/IP to be up to 15-23%
higher for the 2.6.13 and 2.6.15 kernels compared to 2.6.9.  Any idea what the
changes to the upstream kernel are?

http://kernel-perf.sourceforge.net/results.machine_id=1.html

Comment 4 David Miller 2006-01-04 21:25:42 UTC
It's very likely the massive TSO TCP rewrite that was done upstream.
We can never merge that in; it's way too invasive.


Comment 5 John Shakshober 2006-01-04 21:52:02 UTC
We are already turning TSO off via ethtool for the benchmark.
So perhaps this does NOT correlate with what we are seeing in SPECweb99_SSL.

I will recheck with AMD (until we get our setup running by the end of Jan 06).


Comment 6 David Miller 2006-01-04 22:06:32 UTC
Yes, please recheck that just to make sure.

It could be something as simple as a driver update, but asking
"what changed" in networking from 2.6.9 to 2.6.{13,15}... that's
like tens of megabytes of networking changes.

This performance report sounds totally different from the SSL issue this
particular bug report is about.  I think these extra data points will end up
being distractions from the core issue, which appears to be simply an issue of
SMP lock cache-line ping-pong in the AF_UNIX code.

I recommended that they put the unix_table_lock in a separate data section or
similar, then retest.  I never heard back after making that recommendation.


Comment 7 John Shakshober 2006-01-11 20:17:48 UTC
The rwlock unix_table_lock shows high utilization, hold, and wait times for the
write phase.

RWLOCK WRITES     HOLD           WAIT (ALL)           WAIT (WW) wait on writer
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )( %CPU)   MEAN(  MAX )     TOTAL NOWAIT SPIN(  WW )  NAME
 0.01% 32.9%  0.5us(1795us) 2071us(9063us)( 1.5%)  197us(3760us)     59294  67.1% 32.7%(0.20%)  unix_table_lock
 0.00%  4.8%  0.3us( 622us) 1789us(8666us)(0.10%)  1.5us(  46us)     29960  95.2%  4.4%(0.31%)    unix_create1+0x15b
 0.01% 61.7%  0.8us(1795us) 2093us(9063us)( 1.4%)  988us(3760us)     29334  38.3% 61.6%(0.08%)    unix_release_sock+0x2c

As you can see, there are ~2 ms mean and ~9 ms max wait times overall, and
~1 ms mean, ~3.7 ms max wait times on writers.  Even the max hold times are
high, ~1.7 ms...  And unix_release_sock() just clears out a link; it should not
hold the lock that long.  rwlocks favor readers, so writers could get starved.
Also, the rwlock does _not_ disable interrupts, so that could be causing long
hold times and holding off others.

Why there is UNIX-domain activity I do not know.  Is there some daemon running
that is causing this activity that SuSE does not run?

I have looked at SuSE SP2 2.6.5-7.191 and I don't see any difference from RHEL
4 U2 in the UNIX-domain code (files net/unix/af_unix.c and garbage.c).

In kernel.org as of 15 Dec 05 there is a patch to make unix_table_lock a
spinlock for performance reasons.

Comment 8 David Miller 2006-01-11 23:44:20 UTC
Yes, upstream changes it to a spinlock, but that shouldn't account for the kind
of performance hit you are seeing here.

Any number of daemons on the machine could be using AF_UNIX; why not use tools
such as "netstat" and "lsof" to find out who has them open? :-)

I've asked you to place the unix_table_lock in a separate cache line and test
what that does to your results.  Has that been done yet?  I think I recommended
that a few weeks ago, so you should have had a chance to perform that quick
check for us by now; it's a one-line change.
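
For the record, a minimal sketch of what that one-line change might look like
in net/unix/af_unix.c (hypothetical -- this exact patch was not built or tested
here): keep the existing definition and just add the alignment attribute.

/* Hypothetical one-liner: give unix_table_lock its own cache line so the
 * writers bouncing this lock don't also bounce whatever data happens to share
 * the line with it (2.6.9-era spelling of the alignment attribute). */
rwlock_t unix_table_lock ____cacheline_aligned_in_smp = RW_LOCK_UNLOCKED;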


Comment 9 John Shakshober 2006-01-12 02:17:22 UTC
Yup - I still do not have a system to reproduce on, which is why I am poring
through the data again.  This load requires 16 clients to drive a single
server.  Will try to pad the lock out if the compiler is doing the wrong thing.
Do you have the one-liner you want us to try?

I understand there are a ton of changes upstream... thanks for the comments.


Comment 10 John Shakshober 2006-02-23 14:46:23 UTC
I still do not have a setup at Red Hat to attempt to reproduce.

The upstream kernels contain significant network performance optimizations.
We recommend using FC5 2.6.15 kernels for future performance benchmarking.


Comment 11 Daniel Riek 2006-08-14 15:22:08 UTC
Found this in a cleanup.

Not sure if this is a feature or a bug.  The SuSE kernel that was compared is
not newer than our kernel, is it?

Putting this on 4.6 for a review.


Comment 13 Tim Burke 2007-05-23 13:45:10 UTC
Shak notes the following, based on this bug being closed as CURRENTRELEASE.

This bottleneck was not observed in SPECweb2005 for either the 4.4 or 4.5
kernels.  We have since published 7 of the top ten SPECweb2005 results with
various partners.  This ticket was for AMD64, which with HP owns the #1, #9,
and #10 spots on the list below.
Want me to enter it into the BZ?  Dave can perhaps comment on why this went
away with SPECweb2005?

Shak