Description of problem: There is a noticeable slowdown in the performance of the gettimeofday() call in update 2 (U2) compared to update 1 (U1). It appears to be partly related to the switch from the PIT timer in U1 to the PM timer in U2. However, there is also a synchronization issue that prevents the gettimeofday() system call in U2 from scaling across processors. The kernel boot-line directives for selecting other timers don't appear to work on x86_64: the PM timer is always used no matter which timer is selected at boot time. I created a U2 kernel that booted with the PIT timer, and the performance of a single thread increased to what U1 was able to do. However, there is still no scaling across multiple processors. I am attaching a PDF chart showing the output of a bmark.c test, which I am also attaching. Steps to Reproduce: 1. Run the bmark gettimeofday test on both U1 and U2.
Created attachment 124758 [details] program to measure gettimeofday performance and scaling This is the program I used to generate the gettimeofday() perf data.
Created attachment 124759 [details] Chart showing bmark test data from U1, U2, and custom U2. The U2-PIT test was run with a kernel that had the PM timer config file option turned off.
Is this amd system or intel?
AMD.
Created attachment 124794 [details] gettimeofday fixes. I think that with this patch, and with 'nohpet' and 'nopmtimer' passed on the command line, the TSC will be used. You can verify the gettimeofday timer via 'dmesg | grep timer.c'. I think the straight line is just going to be the default, unless the command-line arguments are passed for AMD systems. It would be interesting to see how upstream benchmarks too. Thanks.
Created attachment 124817 [details] upstream reference patch
Also, passing 'nopmtimer' and 'nohpet' on the command line with the shipping kernel should get the better scaling behavior.
What issue do the patches in comment #5 and #6 address? This was a regression introduced when the PM timer changes went into U2. It looks like the regression is probably in time.c, since it exists even when using the PIT. Since the PM timer patches brought us in line with upstream, the lack of scalability almost certainly exists there as well (although it would certainly be useful to know this for sure).
The patch in comment #5 addresses the fact that 'unsynchronized_tsc' will almost always return true because the clustermap data structure is not properly initialized. This means that Intel chips never use the TSC for gtod, which they really should. I agree that the 'flatline' in the chart is now going to be the default for most x86_64 systems. However, I suspect the 'PIT' line in the chart is incorrect and is likely one of the other timers.
The patch in comment #5 may be a good thing to investigate for Intel-related time scalability issues, but this BZ affects AMD, not Intel. AMD doesn't use the TSC for timekeeping because it's deemed unreliable. This is why the PM timer patch went into U2. It would be nice to see a graph of PIT, HPET and PM timer scalability for U2 and U3 to see whether this is truly a generic regression or a regression that only affects one method of timekeeping.
How can I force HPET timer usage in the stock kernel? I'm using RHEL4 U3 .29. With nohpet and nopmtimer, use of PIT/TSC was forced according to time.c. With nopmtimer only, use of PIT/TSC was also forced according to time.c. With no options, the PM timer was used, as expected.
Created attachment 124829 [details] Graph of RHEL4 U3 versus 2.6.15.4 When using PM timer the 2.6.15.4 kernel maps directly with RHEL4 U3. Using nopmtimer in RHEL4 U3 shows scaling unlike my initial test with RHEL4 U2. However, 2.6.15.4 with nopmtimer starts over 2x higher at 1 thread (6,736,148 vs 3,145,266).
OK, that looks much better :) I suspect that the TSC code might run faster with this patch: http://marc.theaimsgroup.com/?l=git-commits-head&m=113705338714125&w=2. The comment says it's Intel-specific, but it really isn't. If you think it's important for us to scale better, we might try backporting it. I'm also curious how you encountered this issue in the first place.
I went back and ran the stock U2 kernel with the nopmtimer flag, and it performs just like U3 did in my latest comparison graph. It scales linearly, starting from 3.1 million gettimeofday() calls/sec at 1 thread. In my original test with U2 forcing PIT, I had recompiled the U2 kernel with CONFIG_X86_PM_TIMER unset. That is what produced a higher 1-thread number in my first chart but exhibited the same flat line across multiple threads. We had originally been doing some testing with IA-64 gettimeofday() calls to solve a similar non-scaling issue on that architecture. A comparison test was run on an Opteron box running U3, and we were surprised to find that it didn't show any scaling. We stepped back to U2 and U1 to see how they performed and discovered the U1/U2 delta.
Created attachment 128670 [details] patch to resolve "nopmtimer" not calling the virtual gettimeofday syscall
Committed in stream U4 build 35.3. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
Created attachment 128895 [details] Partner help from Andy to plot the new RHEL4 U4 kernel against other data
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0575.html