Bug 633037

Summary: Process' user time incorrectly accounted as system time
Product: [Fedora] Fedora Reporter: Michael Chapman <redhat-bugzilla>
Component: kernelAssignee: Stanislaw Gruszka <sgruszka>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: low    
Version: 13CC: anton, dougsland, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, sgruszka
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-02-05 21:03:45 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
lsmod none

Description Michael Chapman 2010-09-12 15:04:02 UTC
Created attachment 446761 [details]
lsmod

Description of problem:

Processes on my system that have been running for a while are incorrectly accounting their CPU time as system time rather than user time. When a process is launched, it starts off correctly accounting user and system time. However there comes a point after which *all* CPU time for the process is accounted as system time. The user time for the process never increases hereafter, no matter how much time in userspace the process subsequently consumes.

The actual transition point between correct and incorrect behaviour seems to be roughly related to the CPU load generated by the process: a process that is attempting to use all CPU time will start to be mis-accounted after approximately 65 seconds, whereas one that's only consuming 50% CPU may take a few minutes. It is not simply a case of reaching a certain amount of user time before the problem occurs, and I have not been able to pin down an exact relation. For a given load, however, the problem will consistently begin at the same time.

Version-Release number of selected component (if applicable):

kernel-2.6.33.8-149.fc13.i686 AND
kernel-2.6.34.6-47.fc13.i686 AND
kernel-2.6.34.6-54.fc13.i686

How reproducible: Always

Steps to Reproduce:

Problem was first noticed while running "openssl speed", which uses as much CPU as it can get. For example:

$ openssl speed rsa
Doing 512 bit private rsa's for 10s: 18132 512 bit private RSA's in 9.90s
Doing 512 bit public rsa's for 10s: 265444 512 bit public RSA's in 9.84s
Doing 1024 bit private rsa's for 10s: 4319 1024 bit private RSA's in 9.91s
Doing 1024 bit public rsa's for 10s: 86704 1024 bit public RSA's in 9.88s
Doing 2048 bit private rsa's for 10s: 536 2048 bit private RSA's in 9.92s
Doing 2048 bit public rsa's for 10s: 24839 2048 bit public RSA's in 9.95s
Doing 4096 bit private rsa's for 10s: 99 4096 bit private RSA's in 0.00s
Doing 4096 bit public rsa's for 10s: 5158 4096 bit public RSA's in 0.00s
OpenSSL 1.0.0a-fips 1 Jun 2010
built on: Fri Jun  4 14:20:39 UTC 2010
options:bn(64,32) md2(int) r
Expected results:
c4(1x,char) des(ptr,risc1,16,long) aes(partial) blowfish(idx) 
compiler: gcc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DKRB5_MIT -DL_ENDIAN -DTERMIO -Wall -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i686 -mtune=atom -fasynchronous-unwind-tables -Wa,--noexecstack -DOPENSSL_BN_ASM_PART_WORDS -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DRMD160_ASM -DAES_ASM -DWHIRLPOOL_ASM
                  sign    verify    sign/s verify/s
rsa  512 bits 0.000546s 0.000037s   1831.5  26976.0
rsa 1024 bits 0.002295s 0.000114s    435.8   8775.7
rsa 2048 bits 0.018507s 0.000401s     54.0   2496.4
rsa 4096 bits 0.000000s 0.000000s      inf      inf

Note how the 4096-bit RSAs were completed in "0.00s", even though these took 10 seconds of wall-clock time to complete -- user time for the process did not increment during these two benchmarks.

Another demonstration of the problem:

$ perl -MBSD::Resource -le '$, = " "; while (1) { 1 foreach 1..10_000_000; print +(getrusage)[0,1] }'
0.734888 0.013997
1.44578 0.013997
2.160671 0.013997
2.840568 0.013997
...
62.268533 0.033994
62.932432 0.033994
63.611329 0.033994
64.321221 0.033994
64.997118 0.033994
64.997118 0.713891   <- Problem starts here
64.997118 1.406786
64.997118 2.120677
64.997118 2.798574
64.997118 3.464473
64.997118 4.133371

Additional info:

lsmod attached. My kernel is currently tainted with the NVIDIA driver, however I have verified that the problem occurs without this driver loaded.

Comment 1 Michael Chapman 2010-09-14 07:09:05 UTC
https://bugzilla.kernel.org/show_bug.cgi?id=16559

This looks like exactly the same issue.

Comment 2 Stanislaw Gruszka 2010-09-14 10:52:22 UTC
I'm able to reproduce problem on 2.6.33 kernel using perl command. Can not reproduce on 2.6.36-rc3 kernel (but on different machine, not sure if that matters).

Comment 3 Stanislaw Gruszka 2010-09-14 12:36:42 UTC
Architecture make difference. I can reproduce bug with 2.6.36-rc3 kernel on i686, but not in x86_64.

Comment 4 Stanislaw Gruszka 2010-09-14 14:37:56 UTC
Posted proposed fix upstream:
http://lkml.org/lkml/2010/9/14/180

Comment 5 Chuck Ebbert 2010-09-21 21:44:16 UTC
Fixed in 2.6.35.5-29.fc14 and 2.6.34.7-57.fc13