Bug 812148
Created attachment 577384 [details]
Added output of CPU usage
Bob Ardent pointed out to me that I was using a sleep time of 10 microseconds, but RHEL 5 will actually sleep for 1 millisecond while RHEL 6 will sleep for 100 microseconds, which artificially increased the amount of processing done on RHEL 6. I changed the sleep time to 1 millisecond to put the comparison on even ground for both OSes.
I also added some code to output the CPU usage of the process. On RHEL 5 it's always 0, but on RHEL 6 it starts out at 0 and then jumps up to a non-zero value. The user percentage stays relatively stable, but the system percentage seems to be slowly increasing. I let it run for a few minutes and it roughly doubled in that time, but I'm not sure what the long-term trend would be.
Created attachment 577389 [details]
Reset prevStat after each calculation
The prevStat struct wasn't being updated after the stats were calculated each time, which is what was causing the slowly increasing system percentage. With that fixed, the values now only occasionally appear as non-zero.
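For reference, a minimal sketch of the kind of /proc-based CPU usage sampling described above, assuming the standard utime/stime fields (14 and 15) of /proc/self/stat; the field parsing, interval, and output format here are my own assumptions, not the attached code:

/* Sketch: sample utime/stime from /proc/self/stat and report CPU usage
 * between samples. The previous sample must be updated after every
 * calculation, otherwise the reported percentage grows without bound. */
#include <stdio.h>
#include <unistd.h>

struct cpu_sample { unsigned long utime, stime; };

static int read_self_stat(struct cpu_sample *s)
{
    FILE *f = fopen("/proc/self/stat", "r");
    int n;
    if (!f)
        return -1;
    /* Skip the first 13 fields; utime and stime are fields 14 and 15. */
    n = fscanf(f, "%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
               &s->utime, &s->stime);
    fclose(f);
    return n == 2 ? 0 : -1;
}

int main(void)
{
    long hz = sysconf(_SC_CLK_TCK);          /* clock ticks per second */
    struct cpu_sample prev = {0, 0}, cur;
    double interval = 1.0;                   /* sample once per second */

    read_self_stat(&prev);
    for (;;) {
        sleep(1);
        if (read_self_stat(&cur))
            break;
        printf("user: %.1f%%  sys: %.1f%%\n",
               100.0 * (cur.utime - prev.utime) / hz / interval,
               100.0 * (cur.stime - prev.stime) / hz / interval);
        prev = cur;                          /* the fix: reset prev each time */
    }
    return 0;
}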
So, given the information in comments #2 & #3, is this still an issue? Yes, I believe there is still an issue. Even with the increased sleep time of 1 millisecond, the process shows up in top with a CPU usage of ~0.3% on a Dell Precision T3400, when it should be essentially 0% like that seen with RHEL 5. In isolation, 0.3% doesn't seem like a lot, but when there are several processes calling select() to perform a sleep in a portable manner, it increases the CPU usage dramatically and unnecessarily. The man page also states that this can be done: "Some code calls select() with all three sets empty, nfds zero, and a non-NULL timeout as a fairly portable way to sleep with subsecond precision."

Created attachment 580227 [details]
Just doing a select for 1 ms of sleep
To further emphasize that running several of these eats up a non-trivial amount of CPU, I've attached a .cpp file that just does the select; you can then spawn 100 of them with something like the script at the bottom of this comment. On a RHEL 5 machine, the CPU usage stays close to 0%, as you'd expect, but on RHEL 6 it shows non-trivial values in both user and system CPU usage.
#!/bin/bash
for i in {1..100}
do
./test_select_small &
done
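For context, the core of such a test is just a select() call with empty fd sets and a 1 ms timeout in a loop; this is a minimal sketch of that pattern (my own reconstruction, not the attached test_select_small source):

/* Sketch: use select() with no fds as a portable ~1 ms sleep, as described
 * in the select(2) man page. The process should be asleep almost all of
 * the time and use essentially no CPU. */
#include <sys/select.h>
#include <stddef.h>

int main(void)
{
    for (;;) {
        struct timeval tv;
        tv.tv_sec = 0;
        tv.tv_usec = 1000;              /* 1 ms; reset every call, since
                                           Linux modifies the timeout */
        select(0, NULL, NULL, NULL, &tv);
    }
    return 0;
}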
I believe that this higher CPU usage is related to the CFS or some other kernel mechanism, because here's a snippet of the output from perf with vdso enabled:
# Events: 16K cycles
#
# Overhead Samples Command Shared Object Symbol
# ........ .......... ....... ................... ....................................
#
9.87% 1888 test_select_sma [kernel.kallsyms] [k] find_busiest_group
5.00% 813 test_select_sma [kernel.kallsyms] [k] system_call
4.69% 868 test_select_sma [kernel.kallsyms] [k] schedule
4.29% 723 test_select_sma [vdso] [.] 0x2ef416
3.15% 536 test_select_sma [kernel.kallsyms] [k] do_select
3.08% 568 test_select_sma [kernel.kallsyms] [k] native_write_msr_safe
3.06% 523 test_select_sma [kernel.kallsyms] [k] copy_to_user
2.73% 445 test_select_sma [kernel.kallsyms] [k] read_tsc
2.47% 462 test_select_sma [kernel.kallsyms] [k] update_curr
2.39% 429 test_select_sma [kernel.kallsyms] [k] native_sched_clock
2.32% 398 test_select_sma [kernel.kallsyms] [k] restore_nocheck
2.25% 394 test_select_sma [kernel.kallsyms] [k] sched_clock_local
2.15% 411 test_select_sma [kernel.kallsyms] [k] cpumask_next_and
...
And with vdso disabled:
# Events: 5K cycles
#
# Overhead Samples Command Shared Object Symbol
# ........ .......... ....... ................... ....................................
#
11.36% 623 test_select_sma [kernel.kallsyms] [k] find_busiest_group
5.39% 294 test_select_sma [kernel.kallsyms] [k] schedule
3.74% 203 test_select_sma [kernel.kallsyms] [k] do_select
3.25% 179 test_select_sma [kernel.kallsyms] [k] system_call
3.22% 174 test_select_sma libc-2.12.so [.] __GI_select
3.19% 175 test_select_sma ld-2.12.so [.] _dl_sysinfo_int80
3.10% 238 test_select_sma [kernel.kallsyms] [k] native_write_msr_safe
2.88% 157 test_select_sma [kernel.kallsyms] [k] core_sys_select
2.74% 148 test_select_sma [kernel.kallsyms] [k] update_curr
2.59% 142 test_select_sma [kernel.kallsyms] [k] copy_to_user
2.53% 137 test_select_sma [kernel.kallsyms] [k] read_tsc
2.50% 139 test_select_sma [kernel.kallsyms] [k] native_sched_clock
2.28% 127 test_select_sma [kernel.kallsyms] [k] sched_clock_local
2.15% 122 test_select_sma [kernel.kallsyms] [k] cpumask_next_and
2.00% 109 test_select_sma [kernel.kallsyms] [k] find_next_bit
...
Created attachment 580576 [details]
Doing a sleep and some work in threads to demonstrate high CPU usage
I compiled test_select_work with the following command on RHEL 5:
cc test_select_work.c -O2 -DSLEEP_TYPE=0 -Wall -Wextra -lm -lpthread -o test_select_work
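For readers without the attachment, the general shape of such a test is roughly the following: each thread alternately sleeps ~1 ms and does a fixed chunk of busy work, timing each iteration with gettimeofday(). This is a hedged, single-threaded sketch under my own assumptions about the parameters (SLEEP_TYPE 0 = no sleep, 1 = select), not the actual test_select_work.c; build it with -lm as in the command above:

/* Sketch: one worker iteration - optional 1 ms sleep followed by a fixed
 * amount of CPU work, with the elapsed time of each iteration recorded.
 * SLEEP_TYPE semantics and the work loop are illustrative assumptions. */
#include <sys/select.h>
#include <sys/time.h>
#include <math.h>
#include <stdio.h>

#ifndef SLEEP_TYPE
#define SLEEP_TYPE 1    /* 0 = no sleep, 1 = select() */
#endif

static double now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

static double one_iteration(long work_size)
{
    double start = now_us();
    volatile double acc = 0.0;
    long i;
#if SLEEP_TYPE == 1
    struct timeval timeout = { 0, 1000 };          /* ~1 ms sleep */
    select(0, NULL, NULL, NULL, &timeout);
#endif
    for (i = 0; i < work_size; ++i)                /* busy work */
        acc += sqrt((double)i);
    return now_us() - start;                       /* time per iteration, us */
}

int main(void)
{
    double total = 0.0;
    int i;
    for (i = 0; i < 1000; ++i)
        total += one_iteration(10000);
    printf("avg time_per_iteration: %.1f us\n", total / 1000.0);
    return 0;
}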
I then found a work size that required about 1 ms per iteration on a Dell Precision m6500, and got the following results on RHEL 5:
./test_select_work 1000 10000 300 4
time_per_iteration: min: 911.5 us avg: 913.7 us max: 917.1 us stddev: 2.4 us
./test_select_work 1000 10000 300 8
time_per_iteration: min: 1802.6 us avg: 1803.9 us max: 1809.1 us stddev: 2.1 us
./test_select_work 1000 10000 300 40
time_per_iteration: min: 7580.4 us avg: 8567.3 us max: 9022.0 us stddev: 299.6 us
And the following on RHEL 6:
./test_select_work 1000 10000 300 4
time_per_iteration: min: 914.6 us avg: 975.7 us max: 1034.5 us stddev: 50.0 us
./test_select_work 1000 10000 300 8
time_per_iteration: min: 1683.9 us avg: 1771.8 us max: 1810.8 us stddev: 43.4 us
./test_select_work 1000 10000 300 40
time_per_iteration: min: 7997.1 us avg: 8709.1 us max: 9061.8 us stddev: 310.0 us
On both versions, these results were about what I expected with the average amount of time per iteration scaling relatively linearly. I then recompiled with -DSLEEP_TYPE=1 and got the following results on RHEL 5:
./test_select_work 1000 10000 300 4
time_per_iteration: min: 1803.3 us avg: 1902.8 us max: 2001.5 us stddev: 113.8 us
./test_select_work 1000 10000 300 8
time_per_iteration: min: 1997.1 us avg: 2002.0 us max: 2010.8 us stddev: 5.0 us
./test_select_work 1000 10000 300 40
time_per_iteration: min: 6958.4 us avg: 8397.9 us max: 9423.7 us stddev: 619.7 us
And the following results on RHEL 6:
./test_select_work 1000 10000 300 4
time_per_iteration: min: 2107.1 us avg: 2143.1 us max: 2177.7 us stddev: 30.3 us
./test_select_work 1000 10000 300 8
time_per_iteration: min: 2903.3 us avg: 2903.8 us max: 2904.3 us stddev: 0.3 us
./test_select_work 1000 10000 300 40
time_per_iteration: min: 8877.7 us avg: 9016.3 us max: 9112.6 us stddev: 62.9 us
On RHEL 5, the results were about what I expected (4 threads taking twice as long because of the 1 ms sleep, the 8 threads taking about the same amount of time since each thread now sleeps for about half the time, and a still fairly linear increase at 40 threads).
However, with RHEL 6, the time taken with 4 threads increased by about 15% more than the expected doubling, and the 8 thread case increased by about 45% more than the expected slight increase. The increase in the 4 thread case seems to be because RHEL 6 actually sleeps for a handful of microseconds more than 1 ms while RHEL 5 only sleeps about 900 us, but this doesn't explain the unexpectedly large increase in the 8 and 40 thread cases.
I saw similar types of behaviour with all 3 -DSLEEP_TYPE values. This indicates to me that the time/CPU usage increased in a non-trivial manner from RHEL 5 to RHEL 6 and has a negative impact on the performance of applications that make use of the kernel sleeping mechanisms (select, poll, usleep).
Created attachment 581243 [details]
Doing a sleep and some work in threads to demonstrate high CPU usage

I made some improvements to the program to make comparisons a bit easier and the standard deviation a bit more meaningful. I also made a git repo to track the code; it's available at git://github.com/daveisfera/test_sleep.git

I also ran the tests on a Dell Vostro 200 (dual core CPU) with several different OS versions. I realize that several of these will have different patches applied and won't be "pure kernel code" or directly related to the kernel used in RHEL, but this was the simplest way I could run the tests on different versions of the kernel and get comparisons. I generated plots with gnuplot and will attach them as well.

All of these tests were run with the executable built on Ubuntu 11.10 (except for CentOS 6.2, which had gcc available, so I re-compiled it there), using the "run_test" script to be attached and this command: ./run_test 1000 10 1000 250 8 4

The first interesting result is with Ubuntu 7.10: usleep seems to behave as expected, but select and poll took significantly longer than expected. Then with CentOS 5.6 the opposite is true, so I'm not sure what conclusion can be drawn from those results. I tried running the Ubuntu 8.04-4 LiveCD (because that's when CFS was introduced), but it wouldn't boot on the Dell Vostro 200.

CentOS 6.2 is where things start to get interesting, because a 10-15% increase in the execution time is seen with select and poll. With usleep, that penalty doesn't seem to be present for low numbers of threads, but it seems to reach similar levels once the number of threads is 2x the number of cores. Ubuntu 11.10, Ubuntu 12.04, and Fedora 16 all show results basically in line with CentOS 6.2, but with usleep always carrying a penalty comparable to the other sleep methods. The one exception to this is that on Fedora 16, poll is lower than the other sleep methods with 3 threads.

One final test that I did was to install the PPA on Ubuntu 11.10 that has BFS instead of CFS. Info about it can be found at: https://launchpad.net/~chogydan/+archive/ppa It doesn't seem to be suffering from this sleep issue (or at least suffers in a different way), because the results for 1-3 threads are at 2 ms as expected (basically no penalty), but then at 4 threads it's about 20% higher than the no sleep methods and it seems to maintain more of a penalty at higher numbers of threads. I also didn't see the system CPU usage when running the previous test with BFS (but this might just be an accounting thing, since CFS and BFS account for CPU usage differently).

I know that these results aren't 100% conclusive, but they seem to indicate that something about the sleeping mechanism in the kernel is using more CPU than I expect it to, and that this is the cause of the increased CPU usage. It doesn't appear to be the scheduler itself, because then a similar sort of penalty would have been seen with sched_yield. It appears to be something to do with the sleep mechanisms' interaction with the scheduler (particularly CFS more than BFS). Is there any more data I can gather or tests I can run to help diagnose this problem?

Created attachment 581244 [details]
Script to run test for a given number of threads and sleep types
Created attachment 581245 [details]
Results on Dell Vostro 200 with Ubuntu 7.10
Created attachment 581246 [details]
Results on Dell Vostro 200 with CentOS 5.6
Created attachment 581247 [details]
Results on Dell Vostro 200 with CentOS 6.2
Created attachment 581248 [details]
Results on Dell Vostro 200 with Ubuntu 11.10
Created attachment 581249 [details]
Results on Dell Vostro 200 with Ubuntu 12.04
Created attachment 581250 [details]
Results on Dell Vostro 200 with Fedora 16
Created attachment 581251 [details]
Results on Dell Vostro 200 with Ubuntu 11.10 with BFS
Dave, the one test I would run (and planned to do so myself once I can get out from under my backlog) would be to compare standard RHEL 5 with RHEL 5 + glibc from RHEL 6. If that doesn't show a performance problem, then we can conclude that glibc isn't likely the problem and reassign to the kernel team. This is generally looking like a kernel issue, and that test would seal the deal for me. Obviously, if it's a kernel issue, then it is in everyone's best interest to get it assigned to the kernel team, who are the most qualified to fix kernel issues.

How do I put the glibc from RHEL 6 on a RHEL 5 machine? Is it as simple as installing the .rpm from RHEL 6?

Something like: rpm --upgrade --force --nodeps glibc-<whatever>.rpm glibc-common-<whatever>.rpm
Use a scratch box, as the "--force --nodeps" will likely leave your system in an undesirable state since it's bypassing most of the rpm dependency checks, duplicate file checks, etc. In fact, make sure your test program is already on the system and compiled. I can't stress enough, use a test box :-)

I tried that on a test box, and also with '--nomd5', but it gave me this error:
error: unpacking of archive failed: cpio: Bad magic
I'm guessing that means it didn't actually install. Is there something I can do to get it to install?

Ah, joys. Yes, there's a difference in the RPM payload formats. If it can wait a few more days, I'll get the test run. I'll probably just extract the files I need from the relevant RPM rather than work around the payload format changes between RHEL 5 and RHEL 6. Sorry for wasting your time.

No worries. I was just hoping to help out however I could to diagnose the cause of the problem. Would that run of rpm have changed anything before the failure? I tried running the test just to see what the results would be like, and I definitely got different results than I did before, but I'm guessing that might be because it left the system in some sort of intermediate state with only part of the change applied.

rpm -qa glibc-\* will show the versions installed. If that's correct, then rpm -Va glibc-\* will highlight any files with checksums that differ from the original rpm installation. We might have to force-install back to the original version.

So I'm seeing a fair amount of variation in the most recent testcase. It's a bit hard to know how much variation to expect. Regardless, what I'm trying to do at this stage is rule out glibc as a component of this problem so the problem can be addressed by the proper development team. I think using the testcase from c#3 is sufficient for those needs. First I ran it on a RHEL 5 box with stock glibc; the user & system times were consistently zero. Then I hacked up the same RHEL 5 box to use glibc from RHEL 6 and re-ran the test; the user & system times were still consistently zero. I would have run the test on that same box under RHEL 6, but that box unfortunately doesn't support RHEL 6. So I ran the test on my F16 system, and I could consistently see some time being reported in user/system space (system far more often than user). The RHEL 5 tests seem to rule out glibc as a component. There clearly is a reproducible issue, as evidenced by my F16 tests. This is most likely a kernel issue, so I'm going to reassign to the kernel team for further analysis.

Since RHEL 6.3 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker.
Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

This issue appears to be unexpectedly increasing the CPU usage of any program that makes use of the kernel's sleeping mechanisms, so I don't agree with closing it. If it needs to be moved to a different version of RHEL to be more thoroughly investigated, then I am ok with that, but I am not ok with it being closed until this specific issue is identified and resolved.

Created attachment 582713 [details]
Doing a sleep and some work in threads to demonstrate high CPU usage

I added measurements of the user and system CPU usage from /proc/<pid>/task/<tid>/stat as an output of the test to try and get another point of observation. I also found an issue with the way the mean and standard deviation were being updated that was introduced when I added the outer iteration loop, so I will add the new plots that have the corrected mean and standard deviation measurements.

Here's the summary of what I observed. I will compare them in pairs this time because I think that is a bit more informative.

CentOS 5.6 vs CentOS 6.2:
The wall clock time (gettimeofday) per iteration on CentOS 5.6 is more varied than 6.2, but this makes sense since CFS should do a better job of giving the processes equal CPU time, resulting in more consistent results. It's also pretty clear that CentOS 6.2 is more accurate and consistent in the amount of time that it sleeps for with the different sleeping mechanisms. The "penalty" is definitely apparent on 6.2 with a low number of threads (visible on the gettimeofday and user time plots), but it seems to be reduced with a higher number of threads (the difference in user time may just be an accounting thing, since the user time measurements are so coarse). The system time plot shows that the sleep mechanisms in 6.2 are consuming more system time than they did in 5.6, which corresponds with the previous results of the simple test of 50 processes just calling select consuming a non-trivial amount of CPU on 6.2 but not 5.6. Something I believe is worth noting is that the use of sched_yield() doesn't induce the same penalty as seen with the sleep methods. My conclusion from this is that it's not the scheduler itself that is the source of the issue, but the interaction of the sleep methods with the scheduler.

Ubuntu 7.10 vs Ubuntu 8.04-4:
The difference in the kernel version between these two is smaller than that between CentOS 5.6 and 6.2, but they still span the time period when CFS was introduced. The first interesting result is that select and poll seem to be the only sleep mechanisms that have the "penalty" on 8.04, and that penalty continues to a higher number of threads than what was seen with CentOS 6.2. The user times for select and poll on Ubuntu 7.10 are unreasonably low, so this appears to be some sort of accounting issue that existed then, but I believe it is not relevant to the current issue/discussion. The system time does seem to be higher with Ubuntu 8.04 than with Ubuntu 7.10, but this difference is FAR less distinct than what was seen with CentOS 5.6 vs 6.2.

Notes on Ubuntu 11.10 and Ubuntu 12.04:
The first thing to note here is that the plots for Ubuntu 12.04 were comparable to those from 11.10, so they are not shown, to avoid unnecessary redundancy.
Overall, the plots for Ubuntu 11.10 show the same sort of trend that was observed with CentOS 6.2 (which indicates that this is a kernel issue in general and not just a RHEL issue). The one exception is that the system time appears to be a bit higher with Ubuntu 11.10 than with CentOS 6.2, but once again, the resolution on this measurement is very coarse, so I think any conclusion other than "it appears to be a bit higher" would be stepping onto thin ice.

Ubuntu 11.10 vs Ubuntu 11.10 with BFS:
A PPA that uses BFS with the Ubuntu kernel can be found at https://launchpad.net/~chogydan/+archive/ppa and this was installed to generate this comparison. I couldn't find an easy way to run CentOS 6.2 with BFS, so I went with this comparison; since the results for Ubuntu 11.10 compare so well with CentOS 6.2, I believe it is a fair and meaningful comparison. The major point of note is that with BFS only select and nanosleep induce the "penalty" at low numbers of threads, but it seems to induce a similar "penalty" (if not a greater one) to that seen with CFS for a higher number of threads. The other interesting point is that the system time appears to be lower with BFS than with CFS. Once again, this is starting to step onto thin ice because of the coarseness of the data, but some difference does appear to be present, and this result does match the simple 50 process select loop test, which showed less CPU usage with BFS than with CFS. The conclusion that I draw from these two points is that BFS does not solve the problem but at least seems to reduce its effects in some areas.

Conclusion:
As previously stated, I don't believe that this is an issue with the scheduler itself, but with the interaction between the sleeping mechanisms and the scheduler. I consider this increased CPU usage in processes that should be sleeping and using little to no CPU a regression from CentOS 5.6 and a major hurdle for any program that wants to use an event loop or polling style of mechanism. Please let me know if there's any other data I can get or tests I can run to help further diagnose the problem.

Created attachment 582723 [details]
gettimeofday results on Dell Vostro 200 with CentOS 5.6
Created attachment 582724 [details]
utime results on Dell Vostro 200 with CentOS 5.6
Created attachment 582725 [details]
stime results on Dell Vostro 200 with CentOS 5.6
Created attachment 582726 [details]
gettimeofday results on Dell Vostro 200 with CentOS 6.2
Created attachment 582727 [details]
utime results on Dell Vostro 200 with CentOS 6.2
Created attachment 582728 [details]
stime results on Dell Vostro 200 with CentOS 6.2
Created attachment 582729 [details]
gettimeofday results on Dell Vostro 200 with Ubuntu 7.10
Created attachment 582730 [details]
utime results on Dell Vostro 200 with Ubuntu 7.10
Created attachment 582731 [details]
stime results on Dell Vostro 200 with Ubuntu 7.10
Created attachment 582732 [details]
gettimeofday results on Dell Vostro 200 with Ubuntu 8.04-4
Created attachment 582733 [details]
utime results on Dell Vostro 200 with Ubuntu 8.04-4
Created attachment 582734 [details]
stime results on Dell Vostro 200 with Ubuntu 8.04-4
Created attachment 582735 [details]
gettimeofday results on Dell Vostro 200 with Ubuntu 11.10
Created attachment 582736 [details]
utime results on Dell Vostro 200 with Ubuntu 11.10
Created attachment 582737 [details]
stime results on Dell Vostro 200 with Ubuntu 11.10
Created attachment 582738 [details]
gettimeofday results on Dell Vostro 200 with Ubuntu 11.10 with BFS
Created attachment 582739 [details]
utime results on Dell Vostro 200 with Ubuntu 11.10 with BFS
Created attachment 582740 [details]
stime results on Dell Vostro 200 with Ubuntu 11.10 with BFS
I apologize for uploading so many files. The comment I just added and the images are also available at the stackexchange question I posted, so that might be an easier way to view the plots alongside my comments. The link is: http://unix.stackexchange.com/questions/37391/high-cpu-usage-with-cfs

Created attachment 583676 [details]
Results on Dell Precision m6500
Here are the results from running the test on a Dell Precision m6500 with RHEL 5.5 and 6.2. This is the original hardware that the issue was noticed on, and it exhibits some different and unexpected behaviour which I assume is at least partially due to HyperThreading (Intel i7 X 940). I've attached a gzipped tarball of the .svg files, and once again I'll discuss the comparisons in pairs.
RHEL 5.5 vs RHEL 6.2:
On RHEL 5.5, the gettimeofday and user time results look very similar to what was seen on the Dell Vostro 200 and what is expected (user time being fairly flat up to 4 threads, then a linear progression up to 8, and flat again after that). However, on RHEL 6.2, there's an unexpected upward trend starting at 4 threads that then goes away above 8 threads. The only explanation I can think of for this is that the scheduler is putting all the tasks on the first 4 processors (causing competition on the logical cores) instead of spreading them over the 8 processors (so all 4 physical cores can be utilized). I'm guessing that this is done so the other processors can be utilized to a smaller degree and potentially allowed to enter a lower power state. This appears to come at the cost of processing, but it isn't a concern for our project since we should typically be operating with more than 4 CPU-bound processes.
The system time is also higher and more sporadic on RHEL 5.5, but as before, the measurements are coarse and the uncertainty high, so it's hard to draw anything more than general conclusions from this.
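One way to check the core-packing hypothesis above would be to have each worker report which logical CPU it runs on; here is a minimal sketch using sched_getcpu() (my own illustration, not part of the attached tests; build with -lpthread):

/* Sketch: each thread periodically reports the logical CPU it is running on,
 * which makes it easy to see whether the scheduler packs threads onto
 * HyperThreading siblings or spreads them across physical cores. */
#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg)
{
    long id = (long)arg;
    int i;
    for (i = 0; i < 5; ++i) {
        printf("thread %ld on cpu %d\n", id, sched_getcpu());
        usleep(1000);                     /* ~1 ms sleep, as in the tests */
    }
    return NULL;
}

int main(void)
{
    enum { NTHREADS = 4 };
    pthread_t t[NTHREADS];
    long i;
    for (i = 0; i < NTHREADS; ++i)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (i = 0; i < NTHREADS; ++i)
        pthread_join(t[i], NULL);
    return 0;
}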
RHEL 5.5 (No HyperThreading) vs RHEL 6.2 (No HyperThreading):
I disabled one of the cores on each of the processors to get results without HyperThreading. These results basically mirror the results on the Dell Vostro 200; the one exception is that the gap between the workloads which sleep and the one which doesn't sleep is wider, which again is because of the previously mentioned use of logical processors by the scheduler.
RHEL 6.2 (built on RHEL 5.5) vs RHEL 6.2 (built on RHEL 6.2):
The final comparison I did was to run the executable built on RHEL 5.5 on RHEL 6.2. In the past there wasn't a difference in results between the two executables, but now that I make use of a direct system call to get the thread id, the executable built on RHEL 6.2 can't run on RHEL 5.5 because of the newer glibc version, and the executable built on RHEL 5.5 gives different results than the executable built on RHEL 6.2.
The results from running the RHEL 5.5 executable on RHEL 6.2 are those titled rhel_6.2_old*.
The general shapes of the plots are basically identical, but the "penalty" seems to be higher with the executable built on RHEL 5.5 than with the one built on RHEL 6.2. I'm not sure what conclusion to draw from this, but it is definitely an interesting result.
Can anyone tell me if this is an issue with the upstream kernel as well? We/I back-ported the CFS code from upstream into RHEL 6.2, so we should see it there as well.
Larry Woodman

Yes, we see the same sort of behavior on Fedora 16 with both 32 and 64-bit kernels:
kernel-3.3.4-3.fc16.i686
kernel-3.3.4-3.fc16.x86_64
We've also seen this with current generation Arch and Ubuntu kernels, so it seems to be associated with the upstream kernel.

Created attachment 584416 [details]
Results on ProLiant DL380 G4
These are the results from running the test on a ProLiant DL380 G4. The rhel_6.2_old results are from running the executable built on RHEL 5.3 on RHEL 6.2 and the rhel_6.2 are with the executable that was rebuilt on RHEL 6.2.
Probably the most interesting part of these results is that the sched_yield results track with the other sleep types once it hits 8 threads (in previous results it tracked with the no sleep test). It's also interesting that the "penalty" is increasing once 8 threads is hit. On other hardware, the "penalty" was either constant or decreasing with larger numbers of threads.
Created attachment 584419 [details]
Results on Xeon W35xx
These are the results from running on a Xeon W3503 CPU with RHEL 6.2 x86 and a Xeon W3520 CPU with RHEL 6.2 x64.
The results on the W3503 with RHEL 6.2 x86 look very similar to the results on most of the other systems that don't have HyperThreading.
The gettimeofday and utime results on the W3520 look similar to the HyperThreading results seen on the Dell m6500, with the exception that all the sleep types other than nanosleep seem better behaved on x64. The stime results show an unexpected hump around 13 threads.
I ran the test on the W3503 out to 16 threads just to make sure it wasn't something related to the number of threads, but no hump was seen, so this appears to be something related to the x64 build or to the W3520 CPU itself.
Dave, some late breaking news on this. So far we have managed to reduce this problem by increasing /proc/sys/kernel/sched_migration_cost. This is what the upstream tuning guide says about it:

sched_migration_cost
Amount of time after the last execution that a task is considered to be "cache hot" in migration decisions. A "hot" task is less likely to be migrated, so increasing this variable reduces task migrations. The default value is 500000 (ns). If the CPU idle time is higher than expected when there are runnable processes, try reducing this value. If tasks bounce between CPUs or nodes too often, try increasing it.

And this is what the scheduler does with it:
----------------------------------------------------------------
/*
 * schedule() is the main scheduler function.
 */
asmlinkage void __sched schedule(void)
{
...
	if (unlikely(!rq->nr_running))
		idle_balance(cpu, rq);

/*
 * idle_balance is called by schedule() if this_cpu is about to become
 * idle. Attempts to pull tasks from other CPUs.
 */
static void idle_balance(int this_cpu, struct rq *this_rq)
{
	struct sched_domain *sd;
	int pulled_task = 0;
	unsigned long next_balance = jiffies + HZ;

	this_rq->idle_stamp = this_rq->clock;

	if (this_rq->avg_idle < sysctl_sched_migration_cost)
		return;
------------------------------------------------------------------
So basically all the CPUs with nothing to run end up in idle_balance() unless sched_migration_cost is larger than the average idle time for the whole run queue. Currently it's set to 500000 ns; please try doubling it until you don't see this problem happening. Please let me know if this solves your problem; it does solve others.
Larry Woodman

So the good news is that setting kernel.sched_migration_cost to 1 ms does appear to positively affect the results of the test_sleep.c application that I have been running on different machines, but it appears that this might just be a result of tuning it to match the specific behavior of test_sleep.c (1 ms of sleep then 1 ms of work in a loop). When I run the simple test program that I made on the framework that we use, it only decreased CPU usage by about 5-10% (it went from around 3.5% to 3.25%), so I think it's a step in the right direction, but not a complete solution for a real workload. Also, as a point of note, before submitting the bug I tried tuning some of the parameters as described here: http://doc.opensuse.org/documentation/html/openSUSE/opensuse-tuning/cha.tuning.taskscheduler.html None of them got me anything more than the 5-10% decrease in CPU usage that I saw by increasing kernel.sched_migration_cost.

Any further word on potential fixes for this issue? The previously mentioned fix did show a mild improvement in the simple test cases, but nothing all that significant in real world programs. As another data point, I created one more test where I ran 4 processing programs in parallel on the Dell m6500 with HyperThreading disabled (so only 4 cores); on RHEL 5.5 it took ~128 seconds but it took ~138 on RHEL 6.2, so I believe this is one more data point indicating that there is a serious regression between RHEL 5 and RHEL 6. Is there any profiling I can run or any log files I can gather for this sort of program to help diagnose the problem?
Thanks,
Dave

Created attachment 595158 [details]
Simpler version of test_sleep program
In order to demonstrate the behavior that I mentioned in my last comment, I simplified the testing program that I had been using before. I compiled it on RHEL 5.5 with the following command:
gcc -Wall -Wextra -O2 test_sleep_simple.c -lrt -o test_sleep_simple
I then ran it on a Dell m6500 with RHEL 5.5 and RHEL 6.2 with the following command:
time ./test_sleep_simple 1000 100000 325 <sleep_type of 0-6>
The following is the time portion of the output:
RHEL 5.5
9.745u 0.000s 0:09.75 99.8% - No Sleep
9.750u 0.023s 0:09.78 99.8% - sched_yield
9.893u 0.003s 0:19.89 49.7% - select
9.875u 0.014s 0:19.89 49.6% - poll
9.888u 0.002s 0:30.12 32.8% - usleep
9.907u 0.012s 0:29.92 33.1% - pthread_cond_timedwait
9.892u 0.003s 0:30.10 32.8% - nanosleep
RHEL 6.2
7.037u 0.000s 0:07.04 99.8% - No Sleep
7.024u 0.008s 0:07.03 99.8% - sched_yield
7.981u 0.030s 0:19.08 41.9% - select
8.085u 0.067s 0:19.14 42.5% - poll
8.056u 0.027s 0:19.08 42.2% - usleep
8.055u 0.035s 0:19.13 42.2% - pthread_cond_timedwait
8.033u 0.022s 0:19.07 42.2% - nanosleep
The following things jumped out at me from these results:
1) RHEL 6 is more efficient than RHEL 5 (RHEL 6 took about 70% of the time that RHEL 5 took with the no sleep workload).
2) On RHEL 5, select and poll work as expected (50% CPU usage) but the other 3 sleep methods appear to be sleeping for twice as long as requested (not a big deal, but just interesting to note).
3) On RHEL 6, all the sleep methods are consistent but are using more CPU than expected (user time increased by 1 second and wall time increased by 12 seconds instead of the expected 10 seconds).
I did the same test with the executable built on RHEL 6 and with that one I needed a work_size of 400 to get a no sleep execution time of 10 seconds on RHEL 5 (I'm guessing that it's some optimization that gcc 4.4.6 is doing that 4.1.2 didn't do), but the same general patterns were maintained as far as CPU usage and execution time were concerned.
I also tried running with a sched_min_granularity of 4000000 (8x the default value) and it didn't have a significant impact on the results.
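For reference, the sleep_type values map to the different sleeping mechanisms listed in the output above; here is a hedged function sketch of what such a dispatch might look like (the ordering is my assumption based on the labels above and may not match the actual test_sleep_simple.c; build with -lrt and -lpthread as in the compile command earlier):

/* Sketch: sleep for ~1 ms using the mechanism selected by sleep_type.
 * 0 = no sleep, 1 = sched_yield, 2 = select, 3 = poll, 4 = usleep,
 * 5 = pthread_cond_timedwait, 6 = nanosleep (assumed ordering). */
#include <sys/select.h>
#include <poll.h>
#include <pthread.h>
#include <sched.h>
#include <time.h>
#include <unistd.h>

static void do_sleep(int sleep_type)
{
    switch (sleep_type) {
    case 1:
        sched_yield();
        break;
    case 2: {
        struct timeval tv = { 0, 1000 };       /* 1 ms, no fds */
        select(0, NULL, NULL, NULL, &tv);
        break;
    }
    case 3:
        poll(NULL, 0, 1);                      /* 1 ms timeout, no fds */
        break;
    case 4:
        usleep(1000);
        break;
    case 5: {
        static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t c = PTHREAD_COND_INITIALIZER;
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        ts.tv_nsec += 1000000;                 /* +1 ms */
        if (ts.tv_nsec >= 1000000000) { ts.tv_sec++; ts.tv_nsec -= 1000000000; }
        pthread_mutex_lock(&m);
        pthread_cond_timedwait(&c, &m, &ts);
        pthread_mutex_unlock(&m);
        break;
    }
    case 6: {
        struct timespec req = { 0, 1000000 };  /* 1 ms */
        nanosleep(&req, NULL);
        break;
    }
    default:
        break;                                 /* 0: no sleep */
    }
}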
Created attachment 598759 [details]
Update to simpler version of test_sleep program
This is an update to the simpler test program that makes it perform more consistently on CPUs with small caches, because it doesn't allocate any memory anymore. It also returns the pseed value that it calculates so that the compiler won't optimize away the loops (gcc 4.6 was doing this with the no sleep case), and it now gives the same results when built on RHEL 5 and RHEL 6 (the previous version would run faster when built on RHEL 6).
On the Dell m6500 that it was run on before, it gave similar results to those previously listed, but it had to be run with the command (because it now takes a larger number of iterations to reach the target 1 ms time):
time ./test_sleep_simple 1000 10000 825 <sleep_type>
NOTE: The previous comment said num_iterations was 100000, but 10000 is the correct value.
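The "return the computed value" trick mentioned above is a common way to stop the optimizer from removing a benchmark loop; here is a minimal illustration of the idea (my own example, not the actual test_sleep_simple code):

/* Sketch: if the result of the work loop is never used, gcc (especially 4.6+
 * at -O2) may delete the loop entirely; returning or printing the accumulated
 * value forces the work to actually happen. */
#include <stdio.h>

static unsigned int do_work(unsigned int seed, long iterations)
{
    long i;
    for (i = 0; i < iterations; ++i)
        seed = seed * 1103515245u + 12345u;   /* cheap LCG-style "work" */
    return seed;
}

int main(void)
{
    unsigned int result = do_work(1u, 10000000L);
    /* Using the result (printing and returning it) keeps the loop from being
     * optimized away as dead code. */
    printf("result: %u\n", result);
    return (int)(result & 0x7f);
}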
I just opened another bug report (Bug #852888) about degraded performance when using mutexes and semaphores. I believe that whatever is causing that issue may also be causing this issue, so I thought it was worth making note of it in this bug report.

Madper, the kernel_firmware is now in http://people.redhat.com/~lwoodman/RHEL6/
Larry

Is that kernel_firmware something that I can or should do testing on to see how it impacts my program and tests?

Dave, sorry for the noise; I was just trying to have some internal testing done to get to the bottom of this issue. I'll let you know as soon as I have something for you to test.
Larry Woodman

No worries. I figured it was probably an internal thing, but just wanted to check whether it was something that would be worthwhile for me to try. At any rate, it's really good to hear that work is being done, because I've been contacted by several people on the internet asking about the status of this bug, and so far I've had basically no idea whether anything was even happening with it.

Can we get customers to try the kernel in http://people.redhat.com/~lwoodman/RHEL6/ to verify this is the same issue they are running into???
Thanks, Larry Woodman

Larry, here was the response from my customer:
"We tried the RPMs RedHat provided and found that they did make a noticeable difference but we are still far from the RHEL5 performance.
To take the example of the nanosleep test process (which we provided the source code for). Recall that in RHEL5 the cost is < 1 microsecond per call.
> New RHEL6 w/idle=mwait: 5 us per nanosleep
> New RHEL6 plain: 12 us per nanosleep
> official RHEL6 w/idle=mwait 10 us per nanosleep
> official RHEL6 plain: 19 us per nanosleep
So this is a good step but does not get us back to where we were.
We still think that more work is needed."
I am attaching their nanotest.c code.
-- Jeffrey
Created attachment 734215 [details]
Customer provided reproducer program
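A per-call nanosleep cost like the numbers quoted above can be estimated by timing a large number of very short sleeps and dividing; this is my own reconstruction of that idea, not the attached nanotest.c (the zero-length request and iteration count are assumptions; build with -lrt on RHEL 5/6):

/* Sketch: measure the average overhead of a nanosleep() call by timing many
 * zero-length sleeps with clock_gettime(). */
#include <stdio.h>
#include <time.h>

int main(void)
{
    enum { ITERATIONS = 100000 };
    struct timespec req = { 0, 0 };            /* zero-length sleep request */
    struct timespec start, end;
    double elapsed_us;
    int i;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < ITERATIONS; ++i)
        nanosleep(&req, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    elapsed_us = (end.tv_sec - start.tv_sec) * 1e6 +
                 (end.tv_nsec - start.tv_nsec) / 1e3;
    printf("%.2f us per nanosleep call\n", elapsed_us / ITERATIONS);
    return 0;
}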
I have asked the customer to expand on how they came up with the numbers in comment #86. They have also come back with these questions: Why do you have at least these configuration parameters on, and what is their impact?

CONFIG_FTRACE_SYSCALLS=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE=y

Each system call has an extra function call in its prologue:

SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, int, mode)
{
ffffffff81173c70 <sys_open>:
ffffffff81173c70:       55                      push   %rbp
ffffffff81173c71:       48 89 e5                mov    %rsp,%rbp
ffffffff81173c74:       e8 c7 70 e9 ff          callq  ffffffff8100ad40 <mcount>
        long ret;

Figuring out exactly what mcount() does isn't obvious; it's assembly. Is Red Hat seeing the same/similar results?

Both of the tests that I've used to compare RHEL 5 and RHEL 6 show no significant difference between the default kernel and this new kernel.

*** Bug 903114 has been marked as a duplicate of this bug. ***

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6062a8dc0517bce23e3c2f7d2fea5e22411269a3
I just noticed talk of that change in the current kernel tree and I was wondering if it would have any impact on this issue.

I was just testing the sleeptimerrevert patch on my reproducer environment with the following result:

results-2.6.32-343.el6.x86_64/output-test-14.log:6265912976
results-2.6.32-343.el6.x86_64/output-test-1.log:54696239922
results-2.6.32-343.el6.x86_64/output-test-20.log:55084464390
results-2.6.32-343.el6.x86_64/output-test-20.log:7446574809
results-2.6.32-343.el6.x86_64/output-test-2.log:54627952951
results-2.6.32-343.el6.x86_64/output-test-2.log:5978905013
results-2.6.32-343.el6.x86_64/output-test-3.log:54687347578
results-2.6.32-343.el6.x86_64/output-test-3.log:1596921109
results-2.6.32-343.el6.x86_64/output-test-4.log:54570175593
results-2.6.32-343.el6.x86_64/output-test-4.log:4999146375
results-2.6.32-343.el6.x86_64/output-test-5.log:54739667037
results-2.6.32-343.el6.x86_64/output-test-5.log:1966401707
results-2.6.32-343.el6.x86_64/output-test-6.log:46539858293
results-2.6.32-343.el6.x86_64/output-test-6.log:5642055594
results-2.6.32-352.el6.x86_64/output-test-16.log:19528680165
results-2.6.32-352.el6.x86_64/output-test-5.log:10199927590
results-2.6.32-355.el6.x86_64/output-test-13.log:7284024091
results-2.6.32-355.el6.x86_64/output-test-1.log:34699167648
results-2.6.32-355.el6.x86_64/output-test-21.log:52228621648
results-2.6.32-355.el6.x86_64/output-test-5.log:18552327403
results-2.6.32-355.el6.x86_64/output-test-5.log:3084000215
results-2.6.32-355.el6.x86_64/output-test-9.log:19658483173
results-2.6.32-373.el6.sleeptimerevert.x86_64_mwait/output-test-6.log:32219195237
results-2.6.32-373.el6.sleeptimerevert.x86_64/output-test-7.log:53183818474

As you can see, the "hang" just does not happen as often as before, but it still does happen (the output should be approximately 1000000000, i.e. 10^9 nanoseconds, or 1 second). The test mainly floods the console and does a sleep 1 in the meantime, logging the time difference:

old=$(date "+%s%N")
while true; do now=$(date "+%s%N"); echo $(($now-$old)); old=$now; sleep 1; done

Hi Larry,
With the weather last week I did not get a chance to come down and see you about this. There is some confusion about this for the customer, although they are saying the sleeprevert patch is helping. What are the implications of this revert, and what is implicated when going back to it? Reading the BZ, there was talk of perf, ftrace and other implications. If you have a chance, will you post an update?
Thanks,
Laurence

Larry,
Looking at the sleeprevert patch .config file and comparing to the stock .431 kernel, there is nothing here that suggests any issue with ftrace, perf, or other performance related patch support.

[loberman@dhcp-33-21 boot]$ diff -Nur config-2.6.32-431.3.1.el6.x86_64 config-2.6.32-431.el6.sleeprevert.x86_64
--- config-2.6.32-431.3.1.el6.x86_64    2013-12-13 07:16:18.000000000 -0500
+++ config-2.6.32-431.el6.sleeprevert.x86_64    2013-12-16 15:07:55.000000000 -0500
@@ -1,7 +1,7 @@
 #
 # Automatically generated make config: don't edit
-# Linux kernel version: 2.6.32-431.3.1.el6.x86_64
-# Fri Dec 13 06:55:20 2013
+# Linux kernel version: 2.6.32-431.el6.sleeprevert.x86_64
+# Mon Dec 16 15:02:22 2013
 #
 CONFIG_64BIT=y
 # CONFIG_X86_32 is not set

So I see no reason for the customer's concern around perf, strace, etc.
loberman

Hi,
This regression in the reported stime and utime was due to the patch/fix introduced by fixing Bug 748559. There is really no wall clock performance degradation; the stime and utime values reported are increased because of the accounting bug fix for 748559. Hence the reported values increased, but there is no actual performance degradation.

Just a quick summary: the fix in 748559 corrects the reporting/accounting of stime and utime in /proc per process in a multi-threaded environment; therefore, please do expect an increase in stime and utime reporting. There should not be any real wall clock performance degradation, and if there is, please file a separate bug for further investigation.

I've heard through other channels that this is going to be closed as "Not A Bug" because of the resolution of Bugzilla #748559. Is it possible for me to receive some of the details of what that resolution was and why this is not considered a bug, even though there are still 3 other Bugzillas that this blocks and I have received several queries on the internet from people having issues because of this? If possible, can I be granted access to Bugzilla #748559?

Is there a chance to make this bz public? Dave Johansen, engineering, what do you think? We have a number of cases along the lines of "my application consumes more resources on RHEL 6 than on RHEL 5", causing a lot of debugging effort. Having this bz public helps the customers to understand the details.

I'm a co-worker of Dave's, and I'm certain that this should be public. We usually post issues to bugzilla so that they *are* public and have the opportunity to have more eyes on them. Some Red Hat process seems to keep marking the bugs that we initiate as private - it's annoying, especially when we reference bugs to our customers and they can't view them. But I'll let Dave confirm that this bug can be open. Thanks.

Yes, I agree with Bob. This bugzilla was opened as public and was public for a while, but was made private for reasons that are unknown to me. It's actually been annoying that it's private, because several people have emailed me asking for info about the bugzilla and I've had to relay information/updates to them instead of just letting them view it here.

Making this Bugzilla public.
Created attachment 577173 [details]
Example of calls to select with high CPU usage

Description of problem:
The select function shows unexpectedly high CPU usage.

Version-Release number of selected component (if applicable):
2.12

How reproducible:
Always

Steps to Reproduce:
1. Compile test_select.cpp
2. Run the generated executable

Actual results:
CPU usage of ~4% on a Dell Precision T3400.

Expected results:
Insignificant amount of CPU usage, since the program should basically just be sleeping.

Additional info:
Running the same code on RHEL 5 shows no significant CPU usage.