The following has been reported by IBM LTC:
[perf][SPECweb99] RHEL 3 Beta 1 4-way performs better than 8-way

Hardware Environment:
Server:
  Netfinity 8500r (8 x 900 MHz)
  28 GB RAM
  4 Intel e1000 dual-port gigabit adapters
Clients:
  8 xSeries x330 (2 x 1GHz)
  1.5 GB RAM
  1 Intel e1000 gigabit adapter

Software Environment:
  RHEL 3 Beta 1 plus all errata as of Aug 11 (via up2date)
  SPECweb99 v1.02
  Apache 2.0.47 w/mod_specweb

Steps to Reproduce:
1. Run SPECweb99 benchmark

Actual Results:
8-way performs poorly, 4-way performs better than 8-way

Expected Results:
Better 8-way performance

Additional Information:
Metric is "Simultaneous Connections" in SPECweb99

RHEL 3 Beta 1
8-way = 1500 us 16 sy 84 id 0
4-way = 1600 us 29 sy 71 id 0

A 2.5.73 kernel.org kernel built on the same RHEL 3 Beta 1 install sees 3600 8-way, and 2400 4-way.

I suspect TCP locking problems as seen in pre-Beta 1 release of RHEL 3:

8-way @ 1500 conns
------------------
0.000%     1 .text.lock.tcp_input
0.003%     5 .text.lock.tcp_minisocks
0.221%   339 .text.lock.tcp_timer
0.793%  1213 .text.lock.tcp_ipv4
2.027%  3098 .text.lock.tcp

4-way @ 1600 conns
------------------
0.001%     1 .text.lock.tcp_input
0.003%     2 .text.lock.tcp_minisocks
0.004%     3 .text.lock.tcp_timer
0.188%   117 .text.lock.tcp_ipv4
0.300%   186 .text.lock.tcp

What additional information would be helpful in debugging this issue?

* SPEC(tm) and the benchmark name SPECweb(tm) are registered trademarks of the Standard Performance Evaluation Corporation. This benchmarking was performed for research purposes only, and is non-compliant, with the following deviations from the rules:
1 - Runs were shorter than 1200 seconds.
2 - Access_log wasn't kept for full accounting. It was written, but deleted every 200 seconds.
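To put the figures from the report side by side, here is a small Python sketch that sums the .text.lock.tcp* ticks and compares the 4-way to 8-way scaling of the two kernels. Every number is copied from the report; the variable names are illustrative only, and the sketch makes no claim about where the contention comes from.

```python
# Lock-section profile ticks, copied from the report.
locks_8way = {"tcp_input": 1, "tcp_minisocks": 5, "tcp_timer": 339,
              "tcp_ipv4": 1213, "tcp": 3098}
locks_4way = {"tcp_input": 1, "tcp_minisocks": 2, "tcp_timer": 3,
              "tcp_ipv4": 117, "tcp": 186}

total_8way = sum(locks_8way.values())
total_4way = sum(locks_4way.values())

# SPECweb99 "Simultaneous Connections": 8-way result divided by 4-way result.
rhel_scaling = 1500 / 1600      # RHEL 3 Beta 1
vanilla_scaling = 3600 / 2400   # 2.5.73 kernel.org kernel

print("TCP lock ticks, 8-way:", total_8way)
print("TCP lock ticks, 4-way:", total_4way)
print("8-way/4-way connections, RHEL 3 Beta 1:", rhel_scaling)
print("8-way/4-way connections, 2.5.73:", vanilla_scaling)
```

The 8-way run spends roughly fifteen times as many profile ticks in the TCP lock sections as the 4-way run, while delivering fewer connections.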
Not enough information to analyze this. Please obtain more detailed locking profiles, or standard kernel profiles during an 8-way run.
------ Additional Comments From wilsont.com 2003-27-08 11:37 ------- /usr/sbin/readprofile does not work for me on Beta 1. (all totals 0) 1- I can send raw captures of /proc/profile if that is helpful. 2- Can RedHat supply a working readprofile? Thanks.
you need to enable the nmi watchdog as well: nmi_watchdog=1 profile=1
------ Additional Comments From wilsont.com 2003-28-08 18:36 ------- Can't complete a run with those kernel args-- double fault. The box is then locked. How do I get at the stack values for debugging? NPTL-related? assume_kernel?
If this is one of those IBM machines where the BIOS corrupts registers if you use the NMI, then I think we're out of luck on this one....
------ Additional Comments From wilsont.com 2003-02-09 12:08 ------- Would captures of /proc/profile be useful?
only when decoded by readprofile; the addresses at which modules get loaded vary per machine and even between reboots.....
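A rough Python sketch of why raw counters are not useful on their own: the same histogram attributes ticks to different functions depending on the symbol table used to decode it. This is a simplified model, not the real /proc/profile binary layout or the readprofile implementation. The function name attribute_ticks, the addresses, the bin size, and the symbol names are all hypothetical.

```python
from bisect import bisect_right
from collections import Counter

def attribute_ticks(counters, text_start, step, symbols):
    """Map histogram bins to symbols.

    counters: tick count per bin; bin i starts at text_start + i * step.
    symbols:  list of (address, name) tuples sorted by address.
    """
    addresses = [addr for addr, _ in symbols]
    ticks = Counter()
    for i, count in enumerate(counters):
        if count == 0:
            continue
        addr = text_start + i * step
        idx = bisect_right(addresses, addr) - 1  # last symbol at or below addr
        if idx < 0:
            continue  # bin lies below the first known symbol
        ticks[symbols[idx][1]] += count
    return ticks

# Hypothetical data: identical counters, two different load layouts.
counters = [3, 0, 5, 0, 7, 0]
layout_a = [(0x1000, "func_a"), (0x1010, "func_b")]
layout_b = [(0x1000, "func_a"), (0x1008, "func_b")]

result_a = dict(attribute_ticks(counters, 0x1000, 4, layout_a))
result_b = dict(attribute_ticks(counters, 0x1000, 4, layout_b))
print(result_a)
print(result_b)
```

Without the symbol layout from the machine and boot that produced the counters, the attribution cannot be trusted.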
------ Additional Comments From wilsont.com 2003-02-09 19:44 ------- I've removed the system management card from the SPECweb system (Netfinity 8500r). In theory, that'll make the race condition under which the register corruption happens much less likely to occur. I'm producing profiles now. With luck, I'll have data for you tomorrow AM.
------ Additional Comments From wilsont.com 2003-04-09 18:31 ------- Even with the system management card removed, I am unable to get far enough into a run to capture profiles. In apic.c, can I just add a call to x86_do_profile under the CONFIG_SMP case in smp_local_timer_interrupt? It would seem that then I'd be able to profile without having to enable NMIs and do it through nmi_watchdog_tick. Seem sane?
yes, that change should work. The resulting profiling info should be taken with a grain of salt - irq-disabled overhead (irq handler overhead, etc.) won't show up, and the overhead might be added to some unrelated code. But it's better than nothing. also, please try a newer kernel. on a related note, why can't the BIOS do a proper return from the SMI handler if it interrupts an NMI handler? How does the BIOS solve the case when the BIOS itself handles an NMI and is interrupted by an SMI?
------ Additional Comments From wilsont.com 2003-10-09 19:27 ------- The profiling worked, but the profiles look strange. When trying a new kernel, up2date fails on mkinitrd when updating linux-2.4.21-1.1931.2.423.ent. The source installed OK but when I try and build, it errors out in the e1000 drivers-- which I need. I may be able to use other e1000 drivers, but that would negate the point of trying to demonstrate a networking problem. I did an rpm -e of the kernel stuff, and then an up2date -p to re-sync with RHN, but up2date still refuses to try the update again. I'll attach the errors from up2date and the kernel build.
Created attachment 94406: rhel.errors
"All of your loopback devices are in use." means that, most likely, you were running a kernel that did not have loop compiled in. That is mandatory for being able to install kernels.
------ Additional Comments From wilsont.com 2003-15-09 11:56 ------- Very slight improvement with 2.4.21-1.1931.2.399.entsmp. Picked up ~200 additional SPECweb simultaneous connections, but still very idle-- nearly 50%. Other 2.4.21 will go 0% idle and ~2700 simultaneous connections on this hardware. I'll up2date all errata and re-test, then modify again for profiling on local timer interrupts.
------ Additional Comments From wilsont.com 2003-15-09 14:38 -------
up2date fails when run on stock beta2 as installed from ISO CDs.

...
Installing...
...
 149:wvdial ########################################### [100%]
New Up2date available
Traceback (most recent call last):
  File "/usr/sbin/up2date", line 1148, in ?
    # try:
  File "/usr/sbin/up2date", line 747, in main
    sys.exit(batchRun(options.list, pkgNames,
  File "/usr/sbin/up2date", line 1014, in batchRun
    # quiet mode for rhn_check
  File "/usr/share/rhn/up2date_client/up2dateBatch.py", line 76, in run
    self.__installPackages()
  File "/usr/share/rhn/up2date_client/up2dateBatch.py", line 145, in __installPackages
    self.kernelsToInstall = up2date.installPackages(self.packagesToInstall, self.rpmCallback)
  File "/usr/share/rhn/up2date_client/up2date.py", line 719, in installPackages
    if "kernel" in hdr['Providename']:
  File "/usr/share/rhn/up2date_client/up2date.py", line 769, in runPkgSpecialCases
    return kernels
TypeError: installBootLoader() takes exactly 1 argument (3 given)
[root@x4408way1 root]#

re-running up2date produces:

[root@x4408way1 root]# up2date -u
Fetching package list for channel: rhel3-beta1-as-i386...
########################################
Fetching Obsoletes list for channel: rhel3-beta1-as-i386...
Fetching rpm headers...
########################################
The following Packages were marked to be skipped by your configuration:

Name                                    Version        Rel     Reason
-------------------------------------------------------------------------------
initscripts                             7.31.1.EL      1       Config modified

All packages are currently up to date
[root@x4408way1 root]#

Is my system OK to re-boot, given the installBootLoader() error, and that my initscripts may be out of sync? Are things in an OK state to proceed with testing?
Created attachment 94504: getconfig script
could you please run the attached script and attach the result? I suspect it's some of the TCP settings that is causing problems, but i'm not sure. also, could you run 'top -b d 10 > top.log' during the test and attach top.log? Similarly, please run 'vmstat 10 > vmstat.log' too during the test and attach the resulting vmstat.log.
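Once a vmstat.log exists, a quick summary of the CPU columns can help show how idle the box is under load. The Python sketch below is one possible way to do that, assuming the usual vmstat layout in which a header line names the us, sy and id columns. The function name mean_cpu and the sample log are hypothetical and are not data from this bug. The first data row is dropped because vmstat's first report is an average since boot.

```python
def mean_cpu(vmstat_text):
    """Average the us/sy/id columns of a vmstat log, skipping the first sample."""
    columns = None
    rows = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The column-name header is recognised by the CPU column labels.
        if "us" in fields and "sy" in fields and "id" in fields:
            columns = fields
            continue
        if columns and len(fields) == len(columns) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(columns, map(int, fields))))
    rows = rows[1:]  # first report is the since-boot average
    if not rows:
        raise ValueError("no usable vmstat samples found")
    return {key: sum(r[key] for r in rows) / len(rows) for key in ("us", "sy", "id")}

# Hypothetical sample, not measured on the system in this report.
sample_log = """\
procs memory swap io system cpu
 r b swpd free buff cache si so bi bo in cs us sy id wa
 1 0 0 100 10 50 0 0 0 0 100 200 5 5 90 0
 4 0 0 100 10 50 0 0 0 0 9000 4000 16 34 50 0
 6 0 0 100 10 50 0 0 0 0 9200 4200 18 36 46 0
"""

summary = mean_cpu(sample_log)
print(summary)
```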
------ Additional Comments From wilsont.com 2003-26-09 11:24 ------- Why did it take until the 25th for Ingo's reply to show up in this defect?
I made the comment on the 15th, and got the email acknowledgement from bugzilla a couple of minutes later, so the bugzilla side seems to be OK. Are you at IBM running some other bug tracking system that feeds into bugzilla?
------ Additional Comments From wilsont.com 2003-26-09 14:25 -------
> Are you at IBM running some other bug tracking system that feeds into bugzilla?

Yes, that's it exactly. And, it appears the "Internal only" flag fails to work as well. :-)

I re-ran up2date this morning, and have the benchmark running now. I'll run your data-collection script once it reaches maximum load. Thanks. I apologize for the delay.
------ Additional Comments From wilsont.com 2003-26-09 19:13 ------- A change after the beta 2 ISOs has greatly helped networking. SPECweb is currently still running from this morning and is well beyond the point at which I expected the benchmark conformance to drop off. I'll send benchmark results and the output of your 'getconfig' script once SPECweb reaches its maximum conformance point.
------ Additional Comments From wilsont.com 2003-29-09 17:04 ------- It still falls off a cliff-- only later now. There is an associated huge drop in interrupt rate when this happens... probably NAPI kicking in (since it is enabled by default in your kernel). e1000 NAPI has never worked for me. I'm re-building to try running without NAPI (unless there is a module option to turn it off? NAPI_HOWTO.txt is not helpful here.).
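One way to pin down when the interrupt rate collapses is to scan a series of interrupts-per-interval samples (for example the "in" column of a vmstat log) for the first sample that falls well below the recent average. The Python sketch below illustrates the idea. The function name find_cliff, the 50% threshold, the three-sample window, and the sample values are all assumptions made for illustration, not measurements from this system.

```python
def find_cliff(rates, drop_fraction=0.5, window=3):
    """Return the index of the first sample below drop_fraction of the
    mean of the preceding `window` samples, or None if there is none."""
    for i in range(window, len(rates)):
        baseline = sum(rates[i - window:i]) / window
        if baseline > 0 and rates[i] < drop_fraction * baseline:
            return i
    return None

# Hypothetical interrupt-rate samples.
samples = [9000, 9100, 9200, 9150, 3000, 2900]
cliff_at = find_cliff(samples)
print("interrupt rate drops at sample", cliff_at)
```

Lining that index up with the benchmark's conformance log would show whether the drop in interrupt rate precedes or follows the fall-off in connections.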
------ Additional Comments From wilsont.com 2003-29-09 18:06 ------- Using the RedHat-supplied /boot/config-2.4.21-3.ELsmp with the only change being 'CONFIG_E1000_NAPI=y' to '# CONFIG_E1000_NAPI is not set', the kernel will compile fine, but modules will not. Is there another way to turn off NAPI without my having to fight this build process again? Every time I go through this I have to make so many changes in order to make things compile that my resulting kernel is in no way similar to your released kernel-- which is what I'm trying to help you test.
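For reference, the single config change described here can be expressed as a small text transformation. The Python sketch below rewrites a 'CONFIG_...=y' line into the '# CONFIG_... is not set' form quoted in the comment. The helper name disable_option and the sample config text are hypothetical. The sketch only edits text; it does not run or fix the kernel build.

```python
import re

def disable_option(config_text, option):
    """Replace 'OPTION=<value>' with '# OPTION is not set' in kernel config text."""
    pattern = re.compile(r"^%s=.*$" % re.escape(option), re.MULTILINE)
    return pattern.sub("# %s is not set" % option, config_text)

# Hypothetical excerpt; only the CONFIG_E1000_NAPI line comes from the comment.
sample_config = "CONFIG_E1000=m\nCONFIG_E1000_NAPI=y\nCONFIG_E100=m\n"
new_config = disable_option(sample_config, "CONFIG_E1000_NAPI")
print(new_config)
```

Anchoring the pattern on the full option name followed by '=' leaves similarly named options such as CONFIG_E1000 untouched.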
Our source and configs are identical to what we ship. HOWEVER, you have to issue a make mrproper in the /usr/src/linux-2.4 directory before doing ANYTHING, because that directory comes preconfigured (to allow external modules to build) and the 2.4 kernel makefiles don't have complete dependencies to wipe these :(
------ Additional Comments From wilsont.com 2003-01-10 11:13 ------- OK, turning off NAPI was the answer. The system now loads up and runs as it should. Thanks!
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ACCEPTED                    |CLOSED
             Impact|------                      |Performance

------- Additional Comments From khoa.com 2005-03-27 12:50 EST -------
Bug clean-up time. I'd like to close this bug report based on Comment #28 above. Thanks.