Bug 461184

Summary:

Significant regression in time() performance

Product:

Red Hat Enterprise Linux 5

Reporter:

Jay Turner <jturner>

Component:

kernel

Assignee:

Prarit Bhargava <prarit>

Status:

CLOSED ERRATA

QA Contact:

Zhang Kexin <kzhang>

Severity:

high

Docs Contact:

Priority:

urgent

Version:

5.3

CC:

ahecox, dhoward, duck, dzickus, james.leddy, jburke, jpirko, jplans, kzhang, prarit, qcai, srevivo, woodard

Target Milestone:

beta

Keywords:

Regression, ZStream

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2009-01-20 19:40:46 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

466427

Attachments:

Description	Flags
RHEL5 fix for this issue	none
RHEL5 fix for this issue	none

Description Jay Turner 2008-09-04 19:16:51 UTC

Description of problem:
I'm seeing a significant regression in the performance of time() under the 5.3-candidate kernels with respect to the 5.2 kernels.

Version-Release number of selected component (if applicable):
Starting with 2.6.18-98.el5.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install the libMicro package from http://cobalt.devel.redhat.com/libMicro/
2. 2 options here; either 'cd /usr/local/libMicro/; ./bench > /tmp/libMicro.out' which will run the entire testsuite OR 'cd /usr/local/libMicro/bin-x86_64/ ; ./time -E -C 300 -L -S -W -N "time" -I 50'
  
Actual results:
There's a significant increase in the median time for time() to execute with respect to the 5.2 kernel.


Additional info:
'/usr/local/libMicro/multiview <res1> <res2> <resX> > /tmp/compare.html'
will generate a comparison chart as can be seen at
http://cobalt.devel.redhat.com/libMicro/compare.html

Comment 1 RHEL Program Management 2008-09-04 19:17:15 UTC

This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

Comment 2 Prarit Bhargava 2008-09-22 13:29:51 UTC

Jay,

My patch for 250708 is responsible for this.  I'm surprised it is such a significant performance reduction :(.

Here is some background on the issue and some options.

In the past two or three years, RH has *not* run any gettimeofday() or time() tests on the RHEL5 kernel.  This was because there was a mistaken belief that the HW was inherently flakey.  jburke, now cc'd, can shed more light on the situation if needed.

I finally upon the real issue in 250708 -- rounding errors were compounding and returning an occasional "off-by-one" result on gettimeofday() queries.

After patching the kernel, I turned on the gettimeofday testsuite in RHTS and noted that all tests but one passed.  The test that didn't pass was

tests/kernel/syscalls/gettimeofday/timedifftest-qa.c

This test does

time()
gettimeofday()
time()

and compares the values to determine if "time" is increasing.  If time decreases, the test fails.

[Aside: The difference between time() and gettimeofday() is that time() returns the # of seconds since Jan 1, 1970 and gettimeofday() returns the # of seconds and microseconds since Jan 1, 1970.]

The test failed after my initial patch for 250708 because the value returned in time() does not take into account the amount of time since the last timer interrupt and "now", whereas gettimeofday() does take this small amount into account in the calculation.  This resulted in an occasional error in the value that time() reports.

For example,

A timer interrupt is handled.  t(int) = 10001 seconds

time() is called and returns t(time) = t(int) = 10001 seconds

gettimeofday is called and returns t(time) = t(int) + delta(now-t(int)) = 10001 + 1 = 10002 seconds

time() is called again, and returns t(time) = t(int) = 10001 seconds 

To fix this problem I changed time() to now call gettimeofday() and return the macroscopic portion of the gettimeofday() calculation.  Coincidentally, this solution matched upstream, and AFAICT, is the only method to correctly synchronize time() and gettimeofday().

With the change a call to time() results in a read of the clock which takes some time and IMO is responsible for the performance hit.

Without this fix, the timedifftest-qa.c test _will fail_ 100% of the time.  IIRC, a similar test is performed in the ltp testsuite and it will fail as well.

I did not realize that the change to time() would result in such a significant reduction in performance.

Here are some options to resolve the problem:

1.  Keep the code as is because the test that is failing is a valid test.  This option results in a stable system ;)

2.  Remove the change to time() to return only the time that the last timer interrupt was serviced.  This would also require a change to the gettimeofday testsuite to disable the timedifftest-qa.c test.

Believe it or not, I'm leaning toward option #2 because the time()/gettimeofday()/time() test never worked so we are no worse off than we were before the patch for 250708, and at least gettimeofday() would be fixed.

Thoughts/concerns?

P.

Comment 3 Andrew Hecox 2008-09-22 14:39:34 UTC

I think either resolution would be adequate: 

 - users of time() should not expect resolution higher than the timer interrupt, so the old solution seems fine.
 - however, because of this, they also shouldn't be calling time() thousands of times a second or whatever would be necessary to introduce a meaningful penalty. So a slowdown should be fine as well.

The test as coded seems invalid and reverting the behavior (and removing the test) seem safer (if not better) to me. At worst, we can say the time() is providing a fast, low resolution timer while gettimeofday() is providing a more expensive high-resolution timer. That jives with what I think of those facilities already anyway...

-Andrew

Comment 4 Prarit Bhargava 2008-09-22 14:46:47 UTC

This is the plan of action:

1.  jburke will disable the timedifftest-qa.c test in RHTS, so the rest of the gettimeofday() tests can run.

2.  I will post a patch to revert the 2nd last chunk of the patch (the part that changes the time() syscall) in 250708 to RHKL after confirmation that the test has been disabled.  This patch will NOT revert changes to the gettimeofday() syscall.

P.

Comment 5 Prarit Bhargava 2008-09-22 18:01:53 UTC

Created attachment 317400 [details]
RHEL5 fix for this issue

Revert time() syscall changes made in 250708.

P.

Comment 7 RHEL Program Management 2008-09-22 18:23:50 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 8 Prarit Bhargava 2008-09-22 19:15:10 UTC

Created attachment 317410 [details]
RHEL5 fix for this issue

Updated patch.

Comment 13 Prarit Bhargava 2008-09-23 13:19:40 UTC

Er ... s/jlayton/jturner/

P.

Comment 19 Don Zickus 2008-09-30 16:01:49 UTC

in kernel-2.6.18-117.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 21 Jay Turner 2008-10-01 12:36:07 UTC

Testing on test1014.test shows we've gained the performance back with the change using the -117.el5.x86_64 kernel.  Need to conduct broader testing across other arches.

Comment 24 Zhang Kexin 2008-12-11 10:47:11 UTC

on 2.6.18-116 kernel, run several time, result are about the same.

[root@hp-xw4200-01 bin-x86_64]# ./time -E -C 300 -L -S -W -N "time" -I 50
Running:                time# ./time -E -C 300 -L -S -W -N time -I 50 
             prc thr   usecs/call      samples   errors cnt/samp 
time           1   1      0.07790          298        0     2000 
#
# STATISTICS         usecs/call (raw)          usecs/call (outliers removed)
#                    min      0.07738                 0.07738
#                    max      0.10644                 0.08007
#                   mean      0.07816                 0.07802
#                 median      0.07790                 0.07790
#                 stddev      0.00188                 0.00071
#         standard error      0.00011                 0.00004
#   99% confidence level      0.00025                 0.00010
#                   skew     11.81892                 1.25615
#               kurtosis    169.77581                 0.69024
#       time correlation     -0.00000                -0.00000
#
#           elasped time      0.04898
#      number of samples          298
#     number of outliers            4
#      getnsecs overhead          113
#
# DISTRIBUTION
#             counts   usecs/call                                         means
#                283      0.00000 |********************************     0.07792
#
#                 15        > 95% |*                                    0.07985
#
#        mean of 95%      0.07792
#          95th %ile      0.07943
 for      0.05433 seconds

with 2.6.18-125 kernel,

run several times,last line is about 0.02 seconds.

don't know whether this means the bug is fixed. just for record.

Comment 25 Jay Turner 2008-12-11 13:03:24 UTC

Kexin, you can use the below command to compare two (or more) runs.  Makes the data much easier to parse.
'/usr/local/libMicro/multiview <res1> <res2> <resX> > /tmp/compare.html'

Comment 26 Zhang Kexin 2008-12-23 10:09:04 UTC

tested on two machines: hp-xw9400-01.rhts.bos.redhat.com & hp-xw6800-01.rhts.bos.redhat.com

on hp-xw6800-01.rhts.bos.redhat.com, 2.6.18-113.el5(former) and 2.6.18-128.el5(latter) is tested for 6 six times, result is similar to following:

BENCHMARK 	USECS 	USECS [percentage]
time 	      0.651200    0.00965[+6648.2%]

on hp-xw9400-01.rhts.bos.redhat.com, 2.6.18-116.el5(former) and 2.6.18-126.el5(latter) is tested for 6 six times, result is similar to following:

BENCHMARK 	USECS 	USECS [percentage]
time 	      0.093400    0.02377[ +292.9%]

Comment 29 errata-xmlrpc 2009-01-20 19:40:46 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html