Description of problem: Reported by an AMD partner, as the SUT is large and not onsite at Red Hat. What we have observed is that when running the SPECweb99_SSL benchmark, the RHEL 2.6.9-22.EL kernel exhibits a dramatic fall-off in the number of conforming connections it can support, starting at about 5860 connections in our configuration (see below for details). By comparison, the SuSE SLES 9 SP2 x86-64 kernel supports 8835 conforming connections with the same hardware configuration. We investigated this by running oprofile and lockmeter against the 2.6.9-22.EL kernel. Here is what we found.

System Description
------------------
The server is a Sun v40z (4-socket, dual-core Opteron 880 at 2.4 GHz, 32 GB of 400 MHz DDR memory). There are 4 dual-port Intel GB NICs, each an Intel 82546GB adapter. The server software is Zeus 4.2r4. 8 Zeus processes are running; each is bound (via taskset) to a specific core, and the IRQ for the associated adapter is bound to the same core. Each Zeus process listens on a separate IP address, so all packets and interrupts for that address are handled by the core running the corresponding Zeus process. The client network consists of 24 clients, each a Sun v20z, spread across 8 subnets, so each Zeus process gets its requests from a private subnet. (This is pretty standard for this kind of submission.) The file system on the server is ext2, implemented as software RAID on top of 3 disks. There is a separate dedicated disk for the log files, as well as a separate disk for the installed O/S. The file systems are mounted with the noatime and nodiratime options to reduce update overhead. As mentioned above, the benchmark is SPECweb99_SSL.
Version-Release number of selected component (if applicable):

How reproducible: Always

Steps to Reproduce:
1. Set up SPECweb99 on large system(s)
2. Run SPECweb99_SSL and tune
3. Profile with oprofile and lockmeter

Actual results:

Oprofile Results
----------------
(When we turned on oprofile, it introduced a small amount of overhead, so we had to recalibrate where the knee occurred. Instead of happening at 5860 connections, it happened in the vicinity of 5760, which is why the names of the attached files don't quite match the number above.)

If we compare the oprofile results for 5750 connections to those for 5760 connections (see attachments 5750.report.tmp.recalc and 5760.report.tmp.recalc), the only apparent difference is that at 5760 connections the system spends 10.7% of its time in __write_lock_failed(), whereas at 5750 that number is 0.0033%. That appears significant to us and indicates that some read/write spinlock in the system is causing contention. Given that, we built a 2.6.9-22.EL kernel with the lockmeter patch applied and re-ran the experiment. (The .recalc suffix on the file names refers to the fact that when we first ran the experiments we didn't have the oprofile options quite right, so we collected data for the kernel and all user processes. Rather than re-run the experiments, we wrote a little Perl script to filter out the non-kernel samples and produce a kernel-only profile.)

Lockmeter Results
-----------------
Running with lockmeter introduces a fair amount of overhead, so we didn't ramp up the experiment with lockmeter running. Instead, we booted the lockmeter-enabled kernel, got the system up to the desired number of connections, and then turned the lockmeter statistics on for 300 seconds.
In this case we compare the outputs for 5800 and 5860 connections (see lockstat.out.5800 and lockstat.out.5860). Ignore the spin lock statistics at the top and look at the RW lock READ and WRITE reports: there is contention for unix_table_lock in the 5860 report, but no contention for that lock in the 5800 report. Writers see contention for this lock roughly 30% of the time they try to acquire it, and the lock is busy in read mode nearly 20% of the time. unix_table_lock appears to protect the hash table of sockets used for AF_UNIX. This is surprising, since AF_UNIX is only used for communication on the local machine; I would have expected to see contention in the AF_INET family instead. However, I'm not really a Linux networking expert, so I may be reading this incorrectly. Nonetheless, this appears significant and likely explains the time spent in the __write_lock_failed() routine above, since that is where a process spins when it tries to take an rwlock in write mode and cannot because a reader (or another writer) holds the lock.

Expected results: Should have been able to saturate the CPUs and scale to at least 8800 connections.

Additional info: This appears to be an AF_UNIX bottleneck as opposed to an AF_INET bottleneck.

5600.report.tmp.recalc:      18   583438   0.0031  99.6723  __write_lock_failed
5750.report.tmp.recalc:      38  1157755   0.0033  99.6552  __write_lock_failed
5760.report.tmp.recalc:  241230  2257574  10.6854  28.7934  __write_lock_failed
5780.report.tmp.recalc:  195230  2018453   9.6723  29.1407  __write_lock_failed
Created attachment 121968 [details] Lockmeter Results
Created attachment 121969 [details] Oprofile Results
Also, the most recent kernel-perf test results show TCP/IP throughput to be up to 15-23% higher for the 2.6.13 and 2.6.15 kernels compared to 2.6.9. Any idea what the relevant upstream kernel changes are? http://kernel-perf.sourceforge.net/results.machine_id=1.html
It's very likely the massive TSO TCP rewrite that was done upstream. We can never merge that in, it's way too invasive.
We are already turning TSO off via ethtool for the benchmark, so perhaps this does NOT correlate with what we are seeing in SPECweb99_SSL. I will recheck with AMD (until we get our own setup running by the end of Jan 06).
Yes, please recheck that just to make sure. It could be something as simple as a driver update, but asking "what changed" in networking from 2.6.9 to 2.6.{13,15}... that's tens of megabytes of networking changes. This performance report sounds totally different from the SSL issue this particular bug report is about. I think these extra data points will end up being distractions from the core issue, which appears to be simply SMP lock cache-line ping-pong in the AF_UNIX code. I recommended that they put unix_table_lock in a separate data section (or similar) and retest. I never heard back after making that recommendation.
The rwlock unix_table_lock shows high utilization, hold, and wait times for the write phase:

RWLOCK WRITES        HOLD              WAIT (ALL)                 WAIT (WW = wait on writer)
  UTIL    CON   MEAN (  MAX )    MEAN (  MAX ) ( %CPU)    MEAN (  MAX )    TOTAL  NOWAIT   SPIN( WW )   NAME
  0.01%  32.9%  0.5us (1795us)  2071us (9063us) ( 1.5%)   197us (3760us)   59294   67.1%  32.7%(0.20%)  unix_table_lock
  0.00%   4.8%  0.3us ( 622us)  1789us (8666us) (0.10%)   1.5us (  46us)   29960   95.2%   4.4%(0.31%)  unix_create1+0x15b
  0.01%  61.7%  0.8us (1795us)  2093us (9063us) ( 1.4%)   988us (3760us)   29334   38.3%  61.6%(0.08%)  unix_release_sock+0x2c

As you can see, the overall wait times are ~2 ms mean and ~9 ms max, and writers waiting on writers see ~1 ms mean and ~3.7 ms max. Even the max hold times are high at ~1.8 ms, and unix_release_sock() just unlinks the socket from the table, so it should not hold the lock that long. rwlocks favor readers, so writers can get starved. Also, this rwlock does _not_ disable interrupts, which could be causing the long hold times and holding off other CPUs. Why there is Unix-domain activity I do not know. Is there some daemon running that causes this activity that SuSE does not run? I have looked at SuSE SP2 2.6.5-7.191 and I don't see any difference from RHEL 4 U2 in the Unix-domain code (net/unix/af_unix.c or net/unix/garbage.c). In kernel.org as of 15 Dec 05 there is a patch that makes unix_table_lock a spinlock for performance reasons.
Yes, upstream changes it to a spinlock, but that shouldn't account for the kind of performance killing you see here. Any number of daemons on the machine could be using AF_UNIX; why not use tools such as "netstat" and "lsof" to find out who has those sockets open? :-) I've asked for unix_table_lock to be placed in a separate cache line, and for a retest to see what that does to your results. Has that been done yet? I think I recommended it a few weeks ago, so you should have had a chance to perform that quick check for us by now; it's a one-line change.
Yup - I still do not have a system to reproduce on, which is why I am poring through the data again. This load requires 16 clients to drive a single server. I will try to pad the lock out in case the compiler is doing the wrong thing. Do you have the one-liner you want to try? I understand there are a ton of changes upstream... thanks for the comments.
I still do not have a setup at Red Hat to attempt a reproduction. The upstream kernels contain significant network performance optimizations; we recommend using the FC5 2.6.15 kernels for future performance benchmarking.
Found this in a cleanup. Not sure if this is a feature or a bug. The SuSE kernel that was compared is not newer than our kernel, is it? Putting this on 4.6 for a review.
Shak notes the following, based on this being closed as CURRENTRELEASE: This bottleneck was not observed in SPECweb2005 with either the 4.4 or 4.5 kernels. We have since published 7 of the top ten SPECweb2005 results with various partners. This ticket was for AMD64, which (with HP) owns the #1, #9, and #10 spots on the list below. Want me to enter this into the BZ? Dave, perhaps you can comment on why this went away with SPECweb2005? Shak