Bug 812148
Created attachment 577384 [details]
Added output of CPU usage
Bob Ardent pointed out to me that I was using a sleep time of 10 microseconds, but RHEL 5 will actually sleep for 1 millisecond while RHEL 6 will sleep for 100 microseconds, which artificially increased the amount of processing done on RHEL 6. I changed the sleep time to 1 millisecond to put the comparison on even ground for both OSes.
I also added some code to output the CPU usage of the process. On RHEL 5 it's always 0, but on RHEL 6 it starts out at 0 and then jumps up to a non-zero value. The user percentage stays relatively stable, but the system percentage seems to be slowly increasing. I let it run for a few minutes and it roughly doubled in that time, but I'm not sure what the long-term trend would be.
Created attachment 577389 [details]
Reset prevStat after each calculation
The prevStat struct wasn't being updated after the stats were calculated each time, which is what was causing the slowly increasing system percentage. With that fixed, the values now only occasionally appear as non-zero.
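For reference, a minimal sketch of the kind of /proc-based CPU usage sampling described above, assuming the standard utime/stime fields (14 and 15) of /proc/self/stat; the field parsing, interval, and output format here are my own assumptions, not the attached code:

/* Sketch: sample utime/stime from /proc/self/stat and report CPU usage
 * between samples. The previous sample must be updated after every
 * calculation, otherwise the reported percentage grows without bound. */
#include <stdio.h>
#include <unistd.h>

struct cpu_sample { unsigned long utime, stime; };

static int read_self_stat(struct cpu_sample *s)
{
    FILE *f = fopen("/proc/self/stat", "r");
    int n;
    if (!f)
        return -1;
    /* Skip the first 13 fields; utime and stime are fields 14 and 15. */
    n = fscanf(f, "%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
               &s->utime, &s->stime);
    fclose(f);
    return n == 2 ? 0 : -1;
}

int main(void)
{
    long hz = sysconf(_SC_CLK_TCK);          /* clock ticks per second */
    struct cpu_sample prev = {0, 0}, cur;
    double interval = 1.0;                   /* sample once per second */

    read_self_stat(&prev);
    for (;;) {
        sleep(1);
        if (read_self_stat(&cur))
            break;
        printf("user: %.1f%%  sys: %.1f%%\n",
               100.0 * (cur.utime - prev.utime) / hz / interval,
               100.0 * (cur.stime - prev.stime) / hz / interval);
        prev = cur;                          /* the fix: reset prev each time */
    }
    return 0;
}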
So, given the information in comments #2 & #3, is this still an issue? Yes, I believe there is still an issue. Even with the increased sleep time of 1 millisecond, the process shows up in top with a CPU usage of ~0.3% on a Dell Precision T3400, when it should be essentially 0% like that seen with RHEL 5. In isolation, 0.3% doesn't seem like a lot, but when there are several processes calling select() to perform a sleep in a portable manner, it increases the CPU usage dramatically and unnecessarily. The man page also states that this can be done: "Some code calls select() with all three sets empty, nfds zero, and a non-NULL timeout as a fairly portable way to sleep with subsecond precision."

Created attachment 580227 [details]
Just doing a select for 1 ms of sleep
To further emphasize that running several of these eats up a non-trivial amount of CPU, I've attached a .cpp file that just does the select; you can then spawn 100 of them with something like the script at the bottom of this comment. On a RHEL 5 machine, the CPU usage stays close to 0%, as you'd expect, but on RHEL 6 it shows non-trivial values in both user and system CPU usage.
#!/bin/bash
for i in {1..100}
do
./test_select_small &
done
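For context, the core of such a test is just a select() call with empty fd sets and a 1 ms timeout in a loop; this is a minimal sketch of that pattern (my own reconstruction, not the attached test_select_small source):

/* Sketch: use select() with no fds as a portable ~1 ms sleep, as described
 * in the select(2) man page. The process should be asleep almost all of
 * the time and use essentially no CPU. */
#include <sys/select.h>
#include <stddef.h>

int main(void)
{
    for (;;) {
        struct timeval tv;
        tv.tv_sec = 0;
        tv.tv_usec = 1000;              /* 1 ms; reset every call, since
                                           Linux modifies the timeout */
        select(0, NULL, NULL, NULL, &tv);
    }
    return 0;
}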
I believe that this higher CPU usage is related to the CFS or some other kernel mechanism, because here's a snippet of the output from perf with vdso enabled:
# Events: 16K cycles
#
# Overhead Samples Command Shared Object Symbol
# ........ .......... ....... ................... ....................................
#
9.87% 1888 test_select_sma [kernel.kallsyms] [k] find_busiest_group
5.00% 813 test_select_sma [kernel.kallsyms] [k] system_call
4.69% 868 test_select_sma [kernel.kallsyms] [k] schedule
4.29% 723 test_select_sma [vdso] [.] 0x2ef416
3.15% 536 test_select_sma [kernel.kallsyms] [k] do_select
3.08% 568 test_select_sma [kernel.kallsyms] [k] native_write_msr_safe
3.06% 523 test_select_sma [kernel.kallsyms] [k] copy_to_user
2.73% 445 test_select_sma [kernel.kallsyms] [k] read_tsc
2.47% 462 test_select_sma [kernel.kallsyms] [k] update_curr
2.39% 429 test_select_sma [kernel.kallsyms] [k] native_sched_clock
2.32% 398 test_select_sma [kernel.kallsyms] [k] restore_nocheck
2.25% 394 test_select_sma [kernel.kallsyms] [k] sched_clock_local
2.15% 411 test_select_sma [kernel.kallsyms] [k] cpumask_next_and
...
And with vdso disabled:
# Events: 5K cycles
#
# Overhead Samples Command Shared Object Symbol
# ........ .......... ....... ................... ....................................
#
11.36% 623 test_select_sma [kernel.kallsyms] [k] find_busiest_group
5.39% 294 test_select_sma [kernel.kallsyms] [k] schedule
3.74% 203 test_select_sma [kernel.kallsyms] [k] do_select
3.25% 179 test_select_sma [kernel.kallsyms] [k] system_call
3.22% 174 test_select_sma libc-2.12.so [.] __GI_select
3.19% 175 test_select_sma ld-2.12.so [.] _dl_sysinfo_int80
3.10% 238 test_select_sma [kernel.kallsyms] [k] native_write_msr_safe
2.88% 157 test_select_sma [kernel.kallsyms] [k] core_sys_select
2.74% 148 test_select_sma [kernel.kallsyms] [k] update_curr
2.59% 142 test_select_sma [kernel.kallsyms] [k] copy_to_user
2.53% 137 test_select_sma [kernel.kallsyms] [k] read_tsc
2.50% 139 test_select_sma [kernel.kallsyms] [k] native_sched_clock
2.28% 127 test_select_sma [kernel.kallsyms] [k] sched_clock_local
2.15% 122 test_select_sma [kernel.kallsyms] [k] cpumask_next_and
2.00% 109 test_select_sma [kernel.kallsyms] [k] find_next_bit
...
Created attachment 580576 [details]
Doing a sleep and some work in threads to demonstrate high CPU usage
I compiled test_select_work with the following command on RHEL 5:
cc test_select_work.c -O2 -DSLEEP_TYPE=0 -Wall -Wextra -lm -lpthread -o test_select_work
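For readers without the attachment, the general shape of such a test is roughly the following: each thread alternately sleeps ~1 ms and does a fixed chunk of busy work, timing each iteration with gettimeofday(). This is a hedged, single-threaded sketch under my own assumptions about the parameters (SLEEP_TYPE 0 = no sleep, 1 = select), not the actual test_select_work.c; build it with -lm as in the command above:

/* Sketch: one worker iteration - optional 1 ms sleep followed by a fixed
 * amount of CPU work, with the elapsed time of each iteration recorded.
 * SLEEP_TYPE semantics and the work loop are illustrative assumptions. */
#include <sys/select.h>
#include <sys/time.h>
#include <math.h>
#include <stdio.h>

#ifndef SLEEP_TYPE
#define SLEEP_TYPE 1    /* 0 = no sleep, 1 = select() */
#endif

static double now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

static double one_iteration(long work_size)
{
    double start = now_us();
    volatile double acc = 0.0;
    long i;
#if SLEEP_TYPE == 1
    struct timeval timeout = { 0, 1000 };          /* ~1 ms sleep */
    select(0, NULL, NULL, NULL, &timeout);
#endif
    for (i = 0; i < work_size; ++i)                /* busy work */
        acc += sqrt((double)i);
    return now_us() - start;                       /* time per iteration, us */
}

int main(void)
{
    double total = 0.0;
    int i;
    for (i = 0; i < 1000; ++i)
        total += one_iteration(10000);
    printf("avg time_per_iteration: %.1f us\n", total / 1000.0);
    return 0;
}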
I then found a work size that required about 1 ms per iteration on a Dell Precision m6500, and got the following results on RHEL 5:
./test_select_work 1000 10000 300 4
time_per_iteration: min: 911.5 us avg: 913.7 us max: 917.1 us stddev: 2.4 us
./test_select_work 1000 10000 300 8
time_per_iteration: min: 1802.6 us avg: 1803.9 us max: 1809.1 us stddev: 2.1 us
./test_select_work 1000 10000 300 40
time_per_iteration: min: 7580.4 us avg: 8567.3 us max: 9022.0 us stddev: 299.6 us
And the following on RHEL 6:
./test_select_work 1000 10000 300 4
time_per_iteration: min: 914.6 us avg: 975.7 us max: 1034.5 us stddev: 50.0 us
./test_select_work 1000 10000 300 8
time_per_iteration: min: 1683.9 us avg: 1771.8 us max: 1810.8 us stddev: 43.4 us
./test_select_work 1000 10000 300 40
time_per_iteration: min: 7997.1 us avg: 8709.1 us max: 9061.8 us stddev: 310.0 us
On both versions, these results were about what I expected with the average amount of time per iteration scaling relatively linearly. I then recompiled with -DSLEEP_TYPE=1 and got the following results on RHEL 5:
./test_select_work 1000 10000 300 4
time_per_iteration: min: 1803.3 us avg: 1902.8 us max: 2001.5 us stddev: 113.8 us
./test_select_work 1000 10000 300 8
time_per_iteration: min: 1997.1 us avg: 2002.0 us max: 2010.8 us stddev: 5.0 us
./test_select_work 1000 10000 300 40
time_per_iteration: min: 6958.4 us avg: 8397.9 us max: 9423.7 us stddev: 619.7 us
And the following results on RHEL 6:
./test_select_work 1000 10000 300 4
time_per_iteration: min: 2107.1 us avg: 2143.1 us max: 2177.7 us stddev: 30.3 us
./test_select_work 1000 10000 300 8
time_per_iteration: min: 2903.3 us avg: 2903.8 us max: 2904.3 us stddev: 0.3 us
./test_select_work 1000 10000 300 40
time_per_iteration: min: 8877.7 us avg: 9016.3 us max: 9112.6 us stddev: 62.9 us
On RHEL 5, the results were about what I expected (4 threads taking twice as long because of the 1 ms sleep, the 8 threads taking about the same amount of time since each thread now sleeps for about half the time, and a still fairly linear increase at 40 threads).
However, with RHEL 6, the time taken with 4 threads increased by about 15% more than the expected doubling, and the 8 thread case increased by about 45% more than the expected slight increase. The increase in the 4 thread case seems to be because RHEL 6 actually sleeps for a handful of microseconds more than 1 ms while RHEL 5 only sleeps about 900 us, but this doesn't explain the unexpectedly large increase in the 8 and 40 thread cases.
I saw similar types of behaviour with all 3 -DSLEEP_TYPE values. This indicates to me that the time/CPU usage increased in a non-trivial manner from RHEL 5 to RHEL 6 and has a negative impact on the performance of applications that make use of the kernel sleeping mechanisms (select, poll, usleep).
Created attachment 581243 [details]
Doing a sleep and some work in threads to demonstrate high CPU usage

I made some improvements to the program to make comparisons a bit easier and the standard deviation a bit more meaningful. I also made a git repo to track the code; it's available at git://github.com/daveisfera/test_sleep.git

I also ran the tests on a Dell Vostro 200 (dual core CPU) with several different OS versions. I realize that several of these will have different patches applied and won't be "pure kernel code" or directly related to the kernel used in RHEL, but this was the simplest way I could run the tests on different versions of the kernel and get comparisons. I generated plots with gnuplot and will attach them as well.

All of these tests were run with the executable built on Ubuntu 11.10 (except for CentOS 6.2, which had gcc available, so I re-compiled it there), using the "run_test" script to be attached and this command: ./run_test 1000 10 1000 250 8 4

The first interesting result is with Ubuntu 7.10: usleep seems to behave as expected, but select and poll took significantly longer than expected. Then with CentOS 5.6 the opposite is true, so I'm not sure what conclusion can be drawn from those results. I tried running the Ubuntu 8.04-4 LiveCD (because that's when CFS was introduced), but it wouldn't boot on the Dell Vostro 200.

CentOS 6.2 is where things start to get interesting, because a 10-15% increase in the execution time is seen with select and poll. With usleep, that penalty doesn't seem to be present for low numbers of threads, but it seems to reach similar levels once the number of threads is 2x the number of cores. Ubuntu 11.10, Ubuntu 12.04, and Fedora 16 all show results basically in line with CentOS 6.2, but with usleep always carrying a penalty comparable to the other sleep methods. The one exception to this is that on Fedora 16, poll is lower than the other sleep methods with 3 threads.

One final test that I did was to install the PPA on Ubuntu 11.10 that has BFS instead of CFS. Info about it can be found at: https://launchpad.net/~chogydan/+archive/ppa It doesn't seem to be suffering from this sleep issue (or at least suffers in a different way), because the results for 1-3 threads are at 2 ms as expected (basically no penalty), but then at 4 threads it's about 20% higher than the no sleep methods and it seems to maintain more of a penalty at higher numbers of threads. I also didn't see the system CPU usage when running the previous test with BFS (but this might just be an accounting thing, since CFS and BFS account for CPU usage differently).

I know that these results aren't 100% conclusive, but they seem to indicate that something about the sleeping mechanism in the kernel is using more CPU than I expect it to, and that this is the cause of the increased CPU usage. It doesn't appear to be the scheduler itself, because then a similar sort of penalty would have been seen with sched_yield. It appears to be something to do with the sleep mechanisms' interaction with the scheduler (particularly CFS more than BFS). Is there any more data I can gather or tests I can run to help diagnose this problem?

Created attachment 581244 [details]
Script to run test for a given number of threads and sleep types
Created attachment 581245 [details]
Results on Dell Vostro 200 with Ubuntu 7.10
Created attachment 581246 [details]
Results on Dell Vostro 200 with CentOS 5.6
Created attachment 581247 [details]
Results on Dell Vostro 200 with CentOS 6.2
Created attachment 581248 [details]
Results on Dell Vostro 200 with Ubuntu 11.10
Created attachment 581249 [details]
Results on Dell Vostro 200 with Ubuntu 12.04
Created attachment 581250 [details]
Results on Dell Vostro 200 with Fedora 16
Created attachment 581251 [details]
Results on Dell Vostro 200 with Ubuntu 11.10 with BFS
Dave, the one test I would run (and planned to do so myself once I can get out from under my backlog) would be to compare standard RHEL 5 with RHEL 5 + glibc from RHEL 6. If that doesn't show a performance problem, then we can conclude that glibc isn't likely the problem and reassign to the kernel team. This is generally looking like a kernel issue, and that test would seal the deal for me. Obviously, if it's a kernel issue, then it is in everyone's best interest to get it assigned to the kernel team, who are the most qualified to fix kernel issues.

How do I put the glibc from RHEL 6 on a RHEL 5 machine? Is it as simple as installing the .rpm from RHEL 6?

Something like: rpm --upgrade --force --nodeps glibc-<whatever>.rpm glibc-common-<whatever>.rpm
Use a scratch box, as the "--force --nodeps" will likely leave your system in an undesirable state since it's bypassing most of the rpm dependency checks, duplicate file checks, etc. In fact, make sure your test program is already on the system and compiled. I can't stress enough, use a test box :-)

I tried that on a test box, and also with '--nomd5', but it gave me this error:
error: unpacking of archive failed: cpio: Bad magic
I'm guessing that means it didn't actually install. Is there something I can do to get it to install?

Ah, joys. Yes, there's a difference in the RPM payload formats. If it can wait a few more days, I'll get the test run. I'll probably just extract the files I need from the relevant RPM rather than work around the payload format changes between RHEL 5 and RHEL 6. Sorry for wasting your time.

No worries. I was just hoping to help out however I could to diagnose the cause of the problem. Would that run of rpm have changed anything before the failure? I tried running the test just to see what the results would be like, and I definitely got different results than I did before, but I'm guessing that might be because it left the system in some sort of intermediate state with only part of the change applied.

rpm -qa glibc-\* will show the versions installed. If that's correct, then rpm -Va glibc-\* will highlight any files with checksums that differ from the original rpm installation. We might have to force-install back to the original version.

So I'm seeing a fair amount of variation in the most recent testcase. It's a bit hard to know how much variation to expect. Regardless, what I'm trying to do at this stage is rule out glibc as a component of this problem so the problem can be addressed by the proper development team. I think using the testcase from c#3 is sufficient for those needs. First I ran it on a RHEL 5 box with stock glibc; the user & system times were consistently zero. Then I hacked up the same RHEL 5 box to use glibc from RHEL 6 and re-ran the test; the user & system times were still consistently zero. I would have run the test on that same box under RHEL 6, but that box unfortunately doesn't support RHEL 6. So I ran the test on my F16 system, and I could consistently see some time being reported in user/system space (system far more often than user). The RHEL 5 tests seem to rule out glibc as a component. There clearly is a reproducible issue, as evidenced by my F16 tests. This is most likely a kernel issue, so I'm going to reassign to the kernel team for further analysis.

Since RHEL 6.3 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker.
Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

This issue appears to be unexpectedly increasing the CPU usage of any program that makes use of the kernel's sleeping mechanisms, so I don't agree with closing it. If it needs to be moved to a different version of RHEL to be more thoroughly investigated, then I am ok with that, but I am not ok with it being closed until this specific issue is identified and resolved.

Created attachment 582713 [details]
Doing a sleep and some work in threads to demonstrate high CPU usage

I added measurements of the user and system CPU usage from /proc/<pid>/task/<tid>/stat as an output of the test to try and get another point of observation. I also found an issue with the way the mean and standard deviation were being updated that was introduced when I added the outer iteration loop, so I will add the new plots that have the corrected mean and standard deviation measurements.

Here's the summary of what I observed. I will compare them in pairs this time because I think that is a bit more informative.

CentOS 5.6 vs CentOS 6.2:
The wall clock time (gettimeofday) per iteration on CentOS 5.6 is more varied than 6.2, but this makes sense since CFS should do a better job of giving the processes equal CPU time, resulting in more consistent results. It's also pretty clear that CentOS 6.2 is more accurate and consistent in the amount of time that it sleeps for with the different sleeping mechanisms. The "penalty" is definitely apparent on 6.2 with a low number of threads (visible on the gettimeofday and user time plots), but it seems to be reduced with a higher number of threads (the difference in user time may just be an accounting thing, since the user time measurements are so coarse). The system time plot shows that the sleep mechanisms in 6.2 are consuming more system time than they did in 5.6, which corresponds with the previous results of the simple test of 50 processes just calling select consuming a non-trivial amount of CPU on 6.2 but not 5.6. Something I believe is worth noting is that the use of sched_yield() doesn't induce the same penalty as seen with the sleep methods. My conclusion from this is that it's not the scheduler itself that is the source of the issue, but the interaction of the sleep methods with the scheduler.

Ubuntu 7.10 vs Ubuntu 8.04-4:
The difference in the kernel version between these two is smaller than that between CentOS 5.6 and 6.2, but they still span the time period when CFS was introduced. The first interesting result is that select and poll seem to be the only sleep mechanisms that have the "penalty" on 8.04, and that penalty continues to a higher number of threads than what was seen with CentOS 6.2. The user times for select and poll on Ubuntu 7.10 are unreasonably low, so this appears to be some sort of accounting issue that existed then, but I believe it is not relevant to the current issue/discussion. The system time does seem to be higher with Ubuntu 8.04 than with Ubuntu 7.10, but this difference is FAR less distinct than what was seen with CentOS 5.6 vs 6.2.

Notes on Ubuntu 11.10 and Ubuntu 12.04:
The first thing to note here is that the plots for Ubuntu 12.04 were comparable to those from 11.10, so they are not shown, to avoid unnecessary redundancy.
Overall, the plots for Ubuntu 11.10 show the same sort of trend that was observed with CentOS 6.2 (which indicates that this is a kernel issue in general and not just a RHEL issue). The one exception is that the system time appears to be a bit higher with Ubuntu 11.10 than with CentOS 6.2, but once again, the resolution on this measurement is very coarse, so I think any conclusion other than "it appears to be a bit higher" would be stepping onto thin ice.

Ubuntu 11.10 vs Ubuntu 11.10 with BFS:
A PPA that uses BFS with the Ubuntu kernel can be found at https://launchpad.net/~chogydan/+archive/ppa and this was installed to generate this comparison. I couldn't find an easy way to run CentOS 6.2 with BFS, so I went with this comparison; since the results for Ubuntu 11.10 compare so well with CentOS 6.2, I believe it is a fair and meaningful comparison. The major point of note is that with BFS only select and nanosleep induce the "penalty" at low numbers of threads, but it seems to induce a similar "penalty" (if not a greater one) to that seen with CFS for a higher number of threads. The other interesting point is that the system time appears to be lower with BFS than with CFS. Once again, this is starting to step onto thin ice because of the coarseness of the data, but some difference does appear to be present, and this result does match the simple 50 process select loop test, which showed less CPU usage with BFS than with CFS. The conclusion that I draw from these two points is that BFS does not solve the problem but at least seems to reduce its effects in some areas.

Conclusion:
As previously stated, I don't believe that this is an issue with the scheduler itself, but with the interaction between the sleeping mechanisms and the scheduler. I consider this increased CPU usage in processes that should be sleeping and using little to no CPU a regression from CentOS 5.6 and a major hurdle for any program that wants to use an event loop or polling style of mechanism. Please let me know if there's any other data I can get or tests I can run to help further diagnose the problem.

Created attachment 582723 [details]
gettimeofday results on Dell Vostro 200 with CentOS 5.6
Created attachment 582724 [details]
utime results on Dell Vostro 200 with CentOS 5.6
Created attachment 582725 [details]
stime results on Dell Vostro 200 with CentOS 5.6
Created attachment 582726 [details]
gettimeofday results on Dell Vostro 200 with CentOS 6.2
Created attachment 582727 [details]
utime results on Dell Vostro 200 with CentOS 6.2
Created attachment 582728 [details]
stime results on Dell Vostro 200 with CentOS 6.2
Created attachment 582729 [details]
gettimeofday results on Dell Vostro 200 with Ubuntu 7.10
Created attachment 582730 [details]
utime results on Dell Vostro 200 with Ubuntu 7.10
Created attachment 582731 [details]
stime results on Dell Vostro 200 with Ubuntu 7.10
Created attachment 582732 [details]
gettimeofday results on Dell Vostro 200 with Ubuntu 8.04-4
Created attachment 582733 [details]
utime results on Dell Vostro 200 with Ubuntu 8.04-4
Created attachment 582734 [details]
stime results on Dell Vostro 200 with Ubuntu 8.04-4
Created attachment 582735 [details]
gettimeofday results on Dell Vostro 200 with Ubuntu 11.10
Created attachment 582736 [details]
utime results on Dell Vostro 200 with Ubuntu 11.10
Created attachment 582737 [details]
stime results on Dell Vostro 200 with Ubuntu 11.10
Created attachment 582738 [details]
gettimeofday results on Dell Vostro 200 with Ubuntu 11.10 with BFS
Created attachment 582739 [details]
utime results on Dell Vostro 200 with Ubuntu 11.10 with BFS
Created attachment 582740 [details]
stime results on Dell Vostro 200 with Ubuntu 11.10 with BFS
I apologize for uploading so many files. The comment I just added and the images are also available at the stackexchange question I posted, so that might be an easier way to view the plots alongside my comments. The link is: http://unix.stackexchange.com/questions/37391/high-cpu-usage-with-cfs

Created attachment 583676 [details]
Results on Dell Precision m6500
Here are the results from running the test on a Dell Precision m6500 with RHEL 5.5 and 6.2. This is the original hardware that the issue was noticed on, and it exhibits some different and unexpected behaviour which I assume is at least partially due to HyperThreading (Intel i7 X 940). I've attached a gzipped tarball of the .svg files, and once again I'll discuss the comparisons in pairs.
RHEL 5.5 vs RHEL 6.2:
On RHEL 5.5, the gettimeofday and user time results look very similar to what was seen on the Dell Vostro 200 and what is expected (user time being fairly flat up to 4 threads, then a linear progression up to 8, and flat again after that). However, on RHEL 6.2, there's an unexpected upward trend starting at 4 threads that then goes away above 8 threads. The only explanation I can think of for this is that the scheduler is putting all the tasks on the first 4 processors (causing competition on the logical cores) instead of spreading them over the 8 processors (so all 4 physical cores can be utilized). I'm guessing that this is done so the other processors can be utilized to a smaller degree and potentially allowed to enter a lower power state. This appears to come at the cost of processing, but it isn't a concern for our project since we should typically be operating with more than 4 CPU-bound processes.
The system time is also higher and more sporadic on RHEL 5.5, but as before, the measurements are coarse and the uncertainty high, so it's hard to draw anything more than general conclusions from this.
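One way to check the core-packing hypothesis above would be to have each worker report which logical CPU it runs on; here is a minimal sketch using sched_getcpu() (my own illustration, not part of the attached tests; build with -lpthread):

/* Sketch: each thread periodically reports the logical CPU it is running on,
 * which makes it easy to see whether the scheduler packs threads onto
 * HyperThreading siblings or spreads them across physical cores. */
#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg)
{
    long id = (long)arg;
    int i;
    for (i = 0; i < 5; ++i) {
        printf("thread %ld on cpu %d\n", id, sched_getcpu());
        usleep(1000);                     /* ~1 ms sleep, as in the tests */
    }
    return NULL;
}

int main(void)
{
    enum { NTHREADS = 4 };
    pthread_t t[NTHREADS];
    long i;
    for (i = 0; i < NTHREADS; ++i)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (i = 0; i < NTHREADS; ++i)
        pthread_join(t[i], NULL);
    return 0;
}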
RHEL 5.5 (No HyperThreading) vs RHEL 6.2 (No HyperThreading):
I disabled one of the cores on each of the processors to get results without HyperThreading. These results basically mirror the results on the Dell Vostro 200; the one exception is that the gap between the workloads which sleep and the one which doesn't sleep is wider, which again is because of the previously mentioned use of logical processors by the scheduler.
RHEL 6.2 (built on RHEL 5.5) vs RHEL 6.2 (built on RHEL 6.2):
The final comparison I did was to run the executable built on RHEL 5.5 on RHEL 6.2. In the past there wasn't a difference in results between the two executables, but now that I make use of a direct system call to get the thread id, the executable built on RHEL 6.2 can't run on RHEL 5.5 because of the newer glibc version, and the executable built on RHEL 5.5 gives different results than the executable built on RHEL 6.2.
The results from running the RHEL 5.5 executable on RHEL 6.2 are those titled rhel_6.2_old*.
The general shapes of the plots are basically identical, but the "penalty" seems to be higher with the executable built on RHEL 5.5 than with the one built on RHEL 6.2. I'm not sure what conclusion to draw from this, but it is definitely an interesting result.
Can anyone tell me if this is an issue with the upstream kernel as well? We/I back-ported the CFS code from upstream into RHEL 6.2, so we should see it there as well.
Larry Woodman

Yes, we see the same sort of behavior on Fedora 16 with both 32 and 64-bit kernels:
kernel-3.3.4-3.fc16.i686
kernel-3.3.4-3.fc16.x86_64
We've also seen this with current generation Arch and Ubuntu kernels, so it seems to be associated with the upstream kernel.

Created attachment 584416 [details]
Results on ProLiant DL380 G4
These are the results from running the test on a ProLiant DL380 G4. The rhel_6.2_old results are from running the executable built on RHEL 5.3 on RHEL 6.2 and the rhel_6.2 are with the executable that was rebuilt on RHEL 6.2.
Probably the most interesting part of these results is that the sched_yield results track with the other sleep types once it hits 8 threads (in previous results it tracked with the no sleep test). It's also interesting that the "penalty" is increasing once 8 threads is hit. On other hardware, the "penalty" was either constant or decreasing with larger numbers of threads.
Created attachment 584419 [details]
Results on Xeon W35xx
These are the results from running on a Xeon W3503 CPU with RHEL 6.2 x86 and a Xeon W3520 CPU with RHEL 6.2 x64.
The results on the W3503 with RHEL 6.2 x86 look very similar to the results on most of the other systems that don't have HyperThreading.
The gettimeofday and utime results on the W3520 look similar to the HyperThreading results seen on the Dell m6500, with the exception that all the sleep types other than nanosleep seem better behaved on x64. The stime results show an unexpected hump around 13 threads.
I ran the test on the W3503 out to 16 threads just to make sure it wasn't something related to the number of threads, but no hump was seen, so this appears to be something related to the x64 build or to the W3520 CPU itself.
Dave, some late breaking news on this. So far we have managed to reduce this problem by increasing /proc/sys/kernel/sched_migration_cost. This is what the upstream tuning guide says about it:

sched_migration_cost
Amount of time after the last execution that a task is considered to be "cache hot" in migration decisions. A "hot" task is less likely to be migrated, so increasing this variable reduces task migrations. The default value is 500000 (ns). If the CPU idle time is higher than expected when there are runnable processes, try reducing this value. If tasks bounce between CPUs or nodes too often, try increasing it.

And this is what the scheduler does with it:
----------------------------------------------------------------
/*
 * schedule() is the main scheduler function.
 */
asmlinkage void __sched schedule(void)
{
...
	if (unlikely(!rq->nr_running))
		idle_balance(cpu, rq);

/*
 * idle_balance is called by schedule() if this_cpu is about to become
 * idle. Attempts to pull tasks from other CPUs.
 */
static void idle_balance(int this_cpu, struct rq *this_rq)
{
	struct sched_domain *sd;
	int pulled_task = 0;
	unsigned long next_balance = jiffies + HZ;

	this_rq->idle_stamp = this_rq->clock;

	if (this_rq->avg_idle < sysctl_sched_migration_cost)
		return;
------------------------------------------------------------------
So basically all the CPUs with nothing to run end up in idle_balance() unless sched_migration_cost is larger than the average idle time for the whole run queue. Currently it's set to 500000 ns; please try doubling it until you don't see this problem happening. Please let me know if this solves your problem; it does solve others.
Larry Woodman

So the good news is that setting kernel.sched_migration_cost to 1 ms does appear to positively affect the results of the test_sleep.c application that I have been running on different machines, but it appears that this might just be a result of tuning it to match the specific behavior of test_sleep.c (1 ms of sleep then 1 ms of work in a loop). When I run the simple test program that I made on the framework that we use, it only decreased CPU usage by about 5-10% (it went from around 3.5% to 3.25%), so I think it's a step in the right direction, but not a complete solution for a real workload. Also, as a point of note, before submitting the bug I tried tuning some of the parameters as described here: http://doc.opensuse.org/documentation/html/openSUSE/opensuse-tuning/cha.tuning.taskscheduler.html None of them got me anything more than the 5-10% decrease in CPU usage that I saw by increasing kernel.sched_migration_cost.

Any further word on potential fixes for this issue? The previously mentioned fix did show a mild improvement in the simple test cases, but nothing all that significant in real world programs. As another data point, I created one more test where I ran 4 processing programs in parallel on the Dell m6500 with HyperThreading disabled (so only 4 cores); on RHEL 5.5 it took ~128 seconds but it took ~138 on RHEL 6.2, so I believe this is one more data point indicating that there is a serious regression between RHEL 5 and RHEL 6. Is there any profiling I can run or any log files I can gather for this sort of program to help diagnose the problem?
Thanks,
Dave

Created attachment 595158 [details]
Simpler version of test_sleep program
In order to demonstrate the behavior that I mentioned in my last comment, I simplified the testing program that I had been using before. I compiled it on RHEL 5.5 with the following command:
gcc -Wall -Wextra -O2 test_sleep_simple.c -lrt -o test_sleep_simple
I then ran it on a Dell m6500 with RHEL 5.5 and RHEL 6.2 with the following command:
time ./test_sleep_simple 1000 100000 325 <sleep_type of 0-6>
The following is the time portion of the output:
RHEL 5.5
9.745u 0.000s 0:09.75 99.8% - No Sleep
9.750u 0.023s 0:09.78 99.8% - sched_yield
9.893u 0.003s 0:19.89 49.7% - select
9.875u 0.014s 0:19.89 49.6% - poll
9.888u 0.002s 0:30.12 32.8% - usleep
9.907u 0.012s 0:29.92 33.1% - pthread_cond_timedwait
9.892u 0.003s 0:30.10 32.8% - nanosleep
RHEL 6.2
7.037u 0.000s 0:07.04 99.8% - No Sleep
7.024u 0.008s 0:07.03 99.8% - sched_yield
7.981u 0.030s 0:19.08 41.9% - select
8.085u 0.067s 0:19.14 42.5% - poll
8.056u 0.027s 0:19.08 42.2% - usleep
8.055u 0.035s 0:19.13 42.2% - pthread_cond_timedwait
8.033u 0.022s 0:19.07 42.2% - nanosleep
The following things jumped out at me from these results:
1) RHEL 6 is more efficient than RHEL 5 (RHEL 6 took about 70% of the time that RHEL 5 took with the no sleep workload).
2) On RHEL 5, select and poll work as expected (50% CPU usage) but the other 3 sleep methods appear to be sleeping for twice as long as requested (not a big deal, but just interesting to note).
3) On RHEL 6, all the sleep methods are consistent but are using more CPU than expected (user time increased by 1 second and wall time increased by 12 seconds instead of the expected 10 seconds).
I did the same test with the executable built on RHEL 6 and with that one I needed a work_size of 400 to get a no sleep execution time of 10 seconds on RHEL 5 (I'm guessing that it's some optimization that gcc 4.4.6 is doing that 4.1.2 didn't do), but the same general patterns were maintained as far as CPU usage and execution time were concerned.
I also tried running with a sched_min_granularity of 4000000 (8x the default value) and it didn't have a significant impact on the results.
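For reference, the sleep_type values map to the different sleeping mechanisms listed in the output above; here is a hedged function sketch of what such a dispatch might look like (the ordering is my assumption based on the labels above and may not match the actual test_sleep_simple.c; build with -lrt and -lpthread as in the compile command earlier):

/* Sketch: sleep for ~1 ms using the mechanism selected by sleep_type.
 * 0 = no sleep, 1 = sched_yield, 2 = select, 3 = poll, 4 = usleep,
 * 5 = pthread_cond_timedwait, 6 = nanosleep (assumed ordering). */
#include <sys/select.h>
#include <poll.h>
#include <pthread.h>
#include <sched.h>
#include <time.h>
#include <unistd.h>

static void do_sleep(int sleep_type)
{
    switch (sleep_type) {
    case 1:
        sched_yield();
        break;
    case 2: {
        struct timeval tv = { 0, 1000 };       /* 1 ms, no fds */
        select(0, NULL, NULL, NULL, &tv);
        break;
    }
    case 3:
        poll(NULL, 0, 1);                      /* 1 ms timeout, no fds */
        break;
    case 4:
        usleep(1000);
        break;
    case 5: {
        static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t c = PTHREAD_COND_INITIALIZER;
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        ts.tv_nsec += 1000000;                 /* +1 ms */
        if (ts.tv_nsec >= 1000000000) { ts.tv_sec++; ts.tv_nsec -= 1000000000; }
        pthread_mutex_lock(&m);
        pthread_cond_timedwait(&c, &m, &ts);
        pthread_mutex_unlock(&m);
        break;
    }
    case 6: {
        struct timespec req = { 0, 1000000 };  /* 1 ms */
        nanosleep(&req, NULL);
        break;
    }
    default:
        break;                                 /* 0: no sleep */
    }
}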
Created attachment 598759 [details]
Update to simpler version of test_sleep program
This is an update to the simpler test program that makes it perform more consistently on CPUs with small caches, because it doesn't allocate any memory anymore. It also returns the pseed value that it calculates so that the compiler won't optimize away the loops (gcc 4.6 was doing this with the no sleep case), and it now gives the same results when built on RHEL 5 and RHEL 6 (the previous version would run faster when built on RHEL 6).
On the Dell m6500 that it was run on before, it gave similar results to those previously listed, but it had to be run with the command (because it now takes a larger number of iterations to reach the target 1 ms time):
time ./test_sleep_simple 1000 10000 825 <sleep_type>
NOTE: The previous comment said num_iterations was 100000, but 10000 is the correct value.
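The "return the computed value" trick mentioned above is a common way to stop the optimizer from removing a benchmark loop; here is a minimal illustration of the idea (my own example, not the actual test_sleep_simple code):

/* Sketch: if the result of the work loop is never used, gcc (especially 4.6+
 * at -O2) may delete the loop entirely; returning or printing the accumulated
 * value forces the work to actually happen. */
#include <stdio.h>

static unsigned int do_work(unsigned int seed, long iterations)
{
    long i;
    for (i = 0; i < iterations; ++i)
        seed = seed * 1103515245u + 12345u;   /* cheap LCG-style "work" */
    return seed;
}

int main(void)
{
    unsigned int result = do_work(1u, 10000000L);
    /* Using the result (printing and returning it) keeps the loop from being
     * optimized away as dead code. */
    printf("result: %u\n", result);
    return (int)(result & 0x7f);
}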
I just opened another bug report (Bug #852888) about degraded performance when using mutexes and semaphores. I believe that whatever is causing that issue may also be causing this issue, so I thought it was worth making note of it in this bug report.

Madper, the kernel_firmware is now in http://people.redhat.com/~lwoodman/RHEL6/
Larry

Is that kernel_firmware something that I can or should do testing on to see how it impacts my program and tests?

Dave, sorry for the noise; I was just trying to have some internal testing done to get to the bottom of this issue. I'll let you know as soon as I have something for you to test.
Larry Woodman

No worries. I figured it was probably an internal thing, but just wanted to check whether it was something that would be worthwhile for me to try. At any rate, it's really good to hear that work is being done, because I've been contacted by several people on the internet asking about the status of this bug, and so far I've had basically no idea whether anything was even happening with it.

Can we get customers to try the kernel in http://people.redhat.com/~lwoodman/RHEL6/ to verify this is the same issue they are running into???
Thanks, Larry Woodman

Larry, here was the response from my customer:
"We tried the RPMs RedHat provided and found that they did make a noticeable difference but we are still far from the RHEL5 performance.
To take the example of the nanosleep test process (which we provided the source code for). Recall that in RHEL5 the cost is < 1 microsecond per call.
> New RHEL6 w/idle=mwait: 5 us per nanosleep
> New RHEL6 plain: 12 us per nanosleep
> official RHEL6 w/idle=mwait 10 us per nanosleep
> official RHEL6 plain: 19 us per nanosleep
So this is a good step but does not get us back to where we were.
We still think that more work is needed."
I am attaching their nanotest.c code.
-- Jeffrey
Created attachment 734215 [details]
Customer provided reproducer program
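A per-call nanosleep cost like the numbers quoted above can be estimated by timing a large number of very short sleeps and dividing; this is my own reconstruction of that idea, not the attached nanotest.c (the zero-length request and iteration count are assumptions; build with -lrt on RHEL 5/6):

/* Sketch: measure the average overhead of a nanosleep() call by timing many
 * zero-length sleeps with clock_gettime(). */
#include <stdio.h>
#include <time.h>

int main(void)
{
    enum { ITERATIONS = 100000 };
    struct timespec req = { 0, 0 };            /* zero-length sleep request */
    struct timespec start, end;
    double elapsed_us;
    int i;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < ITERATIONS; ++i)
        nanosleep(&req, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    elapsed_us = (end.tv_sec - start.tv_sec) * 1e6 +
                 (end.tv_nsec - start.tv_nsec) / 1e3;
    printf("%.2f us per nanosleep call\n", elapsed_us / ITERATIONS);
    return 0;
}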
I have asked the customer to expand on how they came up with the numbers in comment #86. They have also come back with these questions: Why do you have at least these configuration parameters on, and what is their impact?

CONFIG_FTRACE_SYSCALLS=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE=y

Each system call has an extra function call in its prologue:

SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, int, mode)
{
ffffffff81173c70 <sys_open>:
ffffffff81173c70:       55                      push   %rbp
ffffffff81173c71:       48 89 e5                mov    %rsp,%rbp
ffffffff81173c74:       e8 c7 70 e9 ff          callq  ffffffff8100ad40 <mcount>
        long ret;

Figuring out exactly what mcount() does isn't obvious; it's assembly. Is Red Hat seeing the same/similar results?

Both of the tests that I've used to compare RHEL 5 and RHEL 6 show no significant difference between the default kernel and this new kernel.

*** Bug 903114 has been marked as a duplicate of this bug. ***

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6062a8dc0517bce23e3c2f7d2fea5e22411269a3
I just noticed talk of that change in the current kernel tree and I was wondering if it would have any impact on this issue.

I was just testing the sleeptimerrevert patch on my reproducer environment with the following result:

results-2.6.32-343.el6.x86_64/output-test-14.log:6265912976
results-2.6.32-343.el6.x86_64/output-test-1.log:54696239922
results-2.6.32-343.el6.x86_64/output-test-20.log:55084464390
results-2.6.32-343.el6.x86_64/output-test-20.log:7446574809
results-2.6.32-343.el6.x86_64/output-test-2.log:54627952951
results-2.6.32-343.el6.x86_64/output-test-2.log:5978905013
results-2.6.32-343.el6.x86_64/output-test-3.log:54687347578
results-2.6.32-343.el6.x86_64/output-test-3.log:1596921109
results-2.6.32-343.el6.x86_64/output-test-4.log:54570175593
results-2.6.32-343.el6.x86_64/output-test-4.log:4999146375
results-2.6.32-343.el6.x86_64/output-test-5.log:54739667037
results-2.6.32-343.el6.x86_64/output-test-5.log:1966401707
results-2.6.32-343.el6.x86_64/output-test-6.log:46539858293
results-2.6.32-343.el6.x86_64/output-test-6.log:5642055594
results-2.6.32-352.el6.x86_64/output-test-16.log:19528680165
results-2.6.32-352.el6.x86_64/output-test-5.log:10199927590
results-2.6.32-355.el6.x86_64/output-test-13.log:7284024091
results-2.6.32-355.el6.x86_64/output-test-1.log:34699167648
results-2.6.32-355.el6.x86_64/output-test-21.log:52228621648
results-2.6.32-355.el6.x86_64/output-test-5.log:18552327403
results-2.6.32-355.el6.x86_64/output-test-5.log:3084000215
results-2.6.32-355.el6.x86_64/output-test-9.log:19658483173
results-2.6.32-373.el6.sleeptimerevert.x86_64_mwait/output-test-6.log:32219195237
results-2.6.32-373.el6.sleeptimerevert.x86_64/output-test-7.log:53183818474

As you can see, the "hang" just does not happen as often as before, but it still does happen (the output should be approximately 1000000000, i.e. 10^9 nanoseconds, or 1 second). The test mainly floods the console and does a sleep 1 in the meantime, logging the time difference:

old=$(date "+%s%N")
while true; do now=$(date "+%s%N"); echo $(($now-$old)); old=$now; sleep 1; done

Hi Larry,
With the weather last week I did not get a chance to come down and see you about this. There is some confusion about this for the customer, although they are saying the sleeprevert patch is helping. What are the implications of this revert, and what is implicated when going back to it? Reading the BZ, there was talk of perf, ftrace and other implications. If you have a chance, will you post an update?
Thanks,
Laurence

Larry,
Looking at the sleeprevert patch .config file and comparing to the stock .431 kernel, there is nothing here that suggests any issue with ftrace, perf, or other performance related patch support.

[loberman@dhcp-33-21 boot]$ diff -Nur config-2.6.32-431.3.1.el6.x86_64 config-2.6.32-431.el6.sleeprevert.x86_64
--- config-2.6.32-431.3.1.el6.x86_64    2013-12-13 07:16:18.000000000 -0500
+++ config-2.6.32-431.el6.sleeprevert.x86_64    2013-12-16 15:07:55.000000000 -0500
@@ -1,7 +1,7 @@
 #
 # Automatically generated make config: don't edit
-# Linux kernel version: 2.6.32-431.3.1.el6.x86_64
-# Fri Dec 13 06:55:20 2013
+# Linux kernel version: 2.6.32-431.el6.sleeprevert.x86_64
+# Mon Dec 16 15:02:22 2013
 #
 CONFIG_64BIT=y
 # CONFIG_X86_32 is not set

So I see no reason for the customer's concern around perf, strace, etc.
loberman

Hi,
This regression in the reported stime and utime was due to the patch/fix introduced by fixing Bug 748559. There is really no wall clock performance degradation; the stime and utime values reported are increased because of the accounting bug fix for 748559. Hence the reported values increased, but there is no actual performance degradation.

Just a quick summary: the fix in 748559 corrects the reporting/accounting of stime and utime in /proc per process in a multi-threaded environment; therefore, please do expect an increase in stime and utime reporting. There should not be any real wall clock performance degradation, and if there is, please file a separate bug for further investigation.

I've heard through other channels that this is going to be closed as "Not A Bug" because of the resolution of Bugzilla #748559. Is it possible for me to receive some of the details of what that resolution was and why this is not considered a bug, even though there are still 3 other Bugzillas that this blocks and I have received several queries on the internet from people having issues because of this? If possible, can I be granted access to Bugzilla #748559?

Is there a chance to make this bz public? Dave Johansen, engineering, what do you think? We have a number of cases along the lines of "my application consumes more resources on RHEL 6 than on RHEL 5", causing a lot of debugging effort. Having this bz public helps the customers to understand the details.

I'm a co-worker of Dave's, and I'm certain that this should be public. We usually post issues to bugzilla so that they *are* public and have the opportunity to have more eyes on them. Some Red Hat process seems to keep marking the bugs that we initiate as private - it's annoying, especially when we reference bugs to our customers and they can't view them. But I'll let Dave confirm that this bug can be open. Thanks.

Yes, I agree with Bob. This bugzilla was opened as public and was public for a while, but was made private for reasons that are unknown to me. It's actually been annoying that it's private, because several people have emailed me asking for info about the bugzilla and I've had to relay information/updates to them instead of just letting them view it here.

Making this Bugzilla public.
Created attachment 577173 [details]
Example of calls to select with high CPU usage

Description of problem:
The select function shows unexpectedly high CPU usage.

Version-Release number of selected component (if applicable):
2.12

How reproducible:
Always

Steps to Reproduce:
1. Compile test_select.cpp
2. Run the generated executable

Actual results:
CPU usage of ~4% on a Dell Precision T3400.

Expected results:
Insignificant amount of CPU usage, since the program should basically just be sleeping.

Additional info:
Running the same code on RHEL 5 shows no significant CPU usage.