Bug 461184
Summary: | Significant regression in time() performance | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Jay Turner <jturner> | ||||||
Component: | kernel | Assignee: | Prarit Bhargava <prarit> | ||||||
Status: | CLOSED ERRATA | QA Contact: | Zhang Kexin <kzhang> | ||||||
Severity: | high | Docs Contact: | |||||||
Priority: | urgent | ||||||||
Version: | 5.3 | CC: | ahecox, dhoward, duck, dzickus, james.leddy, jburke, jpirko, jplans, kzhang, prarit, qcai, srevivo, woodard | ||||||
Target Milestone: | beta | Keywords: | Regression, ZStream | ||||||
Target Release: | --- | ||||||||
Hardware: | All | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2009-01-20 19:40:46 UTC | Type: | --- | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 466427 | ||||||||
Attachments: |
|
Description
Jay Turner
2008-09-04 19:16:51 UTC
This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP. Jay, My patch for 250708 is responsible for this. I'm surprised it is such a significant performance reduction :(. Here is some background on the issue and some options. In the past two or three years, RH has *not* run any gettimeofday() or time() tests on the RHEL5 kernel. This was because there was a mistaken belief that the HW was inherently flakey. jburke, now cc'd, can shed more light on the situation if needed. I finally upon the real issue in 250708 -- rounding errors were compounding and returning an occasional "off-by-one" result on gettimeofday() queries. After patching the kernel, I turned on the gettimeofday testsuite in RHTS and noted that all tests but one passed. The test that didn't pass was tests/kernel/syscalls/gettimeofday/timedifftest-qa.c This test does time() gettimeofday() time() and compares the values to determine if "time" is increasing. If time decreases, the test fails. [Aside: The difference between time() and gettimeofday() is that time() returns the # of seconds since Jan 1, 1970 and gettimeofday() returns the # of seconds and microseconds since Jan 1, 1970.] The test failed after my initial patch for 250708 because the value returned in time() does not take into account the amount of time since the last timer interrupt and "now", whereas gettimeofday() does take this small amount into account in the calculation. This resulted in an occasional error in the value that time() reports. For example, A timer interrupt is handled. t(int) = 10001 seconds time() is called and returns t(time) = t(int) = 10001 seconds gettimeofday is called and returns t(time) = t(int) + delta(now-t(int)) = 10001 + 1 = 10002 seconds time() is called again, and returns t(time) = t(int) = 10001 seconds To fix this problem I changed time() to now call gettimeofday() and return the macroscopic portion of the gettimeofday() calculation. Coincidentally, this solution matched upstream, and AFAICT, is the only method to correctly synchronize time() and gettimeofday(). With the change a call to time() results in a read of the clock which takes some time and IMO is responsible for the performance hit. Without this fix, the timedifftest-qa.c test _will fail_ 100% of the time. IIRC, a similar test is performed in the ltp testsuite and it will fail as well. I did not realize that the change to time() would result in such a significant reduction in performance. Here are some options to resolve the problem: 1. Keep the code as is because the test that is failing is a valid test. This option results in a stable system ;) 2. Remove the change to time() to return only the time that the last timer interrupt was serviced. This would also require a change to the gettimeofday testsuite to disable the timedifftest-qa.c test. Believe it or not, I'm leaning toward option #2 because the time()/gettimeofday()/time() test never worked so we are no worse off than we were before the patch for 250708, and at least gettimeofday() would be fixed. Thoughts/concerns? P. I think either resolution would be adequate: - users of time() should not expect resolution higher than the timer interrupt, so the old solution seems fine. - however, because of this, they also shouldn't be calling time() thousands of times a second or whatever would be necessary to introduce a meaningful penalty. So a slowdown should be fine as well. The test as coded seems invalid and reverting the behavior (and removing the test) seem safer (if not better) to me. At worst, we can say the time() is providing a fast, low resolution timer while gettimeofday() is providing a more expensive high-resolution timer. That jives with what I think of those facilities already anyway... -Andrew This is the plan of action: 1. jburke will disable the timedifftest-qa.c test in RHTS, so the rest of the gettimeofday() tests can run. 2. I will post a patch to revert the 2nd last chunk of the patch (the part that changes the time() syscall) in 250708 to RHKL after confirmation that the test has been disabled. This patch will NOT revert changes to the gettimeofday() syscall. P. Created attachment 317400 [details]
RHEL5 fix for this issue
Revert time() syscall changes made in 250708.
P.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release. Created attachment 317410 [details]
RHEL5 fix for this issue
Updated patch.
Er ... s/jlayton/jturner/ P. in kernel-2.6.18-117.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Testing on test1014.test shows we've gained the performance back with the change using the -117.el5.x86_64 kernel. Need to conduct broader testing across other arches. on 2.6.18-116 kernel, run several time, result are about the same. [root@hp-xw4200-01 bin-x86_64]# ./time -E -C 300 -L -S -W -N "time" -I 50 Running: time# ./time -E -C 300 -L -S -W -N time -I 50 prc thr usecs/call samples errors cnt/samp time 1 1 0.07790 298 0 2000 # # STATISTICS usecs/call (raw) usecs/call (outliers removed) # min 0.07738 0.07738 # max 0.10644 0.08007 # mean 0.07816 0.07802 # median 0.07790 0.07790 # stddev 0.00188 0.00071 # standard error 0.00011 0.00004 # 99% confidence level 0.00025 0.00010 # skew 11.81892 1.25615 # kurtosis 169.77581 0.69024 # time correlation -0.00000 -0.00000 # # elasped time 0.04898 # number of samples 298 # number of outliers 4 # getnsecs overhead 113 # # DISTRIBUTION # counts usecs/call means # 283 0.00000 |******************************** 0.07792 # # 15 > 95% |* 0.07985 # # mean of 95% 0.07792 # 95th %ile 0.07943 for 0.05433 seconds with 2.6.18-125 kernel, run several times,last line is about 0.02 seconds. don't know whether this means the bug is fixed. just for record. Kexin, you can use the below command to compare two (or more) runs. Makes the data much easier to parse. '/usr/local/libMicro/multiview <res1> <res2> <resX> > /tmp/compare.html' tested on two machines: hp-xw9400-01.rhts.bos.redhat.com & hp-xw6800-01.rhts.bos.redhat.com on hp-xw6800-01.rhts.bos.redhat.com, 2.6.18-113.el5(former) and 2.6.18-128.el5(latter) is tested for 6 six times, result is similar to following: BENCHMARK USECS USECS [percentage] time 0.651200 0.00965[+6648.2%] on hp-xw9400-01.rhts.bos.redhat.com, 2.6.18-116.el5(former) and 2.6.18-126.el5(latter) is tested for 6 six times, result is similar to following: BENCHMARK USECS USECS [percentage] time 0.093400 0.02377[ +292.9%] An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-0225.html |