Bug 1378529
Description
Victor Stinner
2016-09-22 16:27:07 UTC
Here is the full story of how I noticed the bug. Last week, I ran a benchmark for a full day and everything was fine, but at the end of the day, for no obvious reason, the benchmark suddenly became 2x faster!?

Last benchmark before the bug:
--------
$ python3 -m perf show pgo_seed_4.json
Median +- std dev: 20.4 ms +- 0.1 ms
--------

Benchmark when the bug occurred:
--------
haypo@smithers$ python3 -m perf show pgo_seed_5.json
ERROR: the benchmark is very unstable, the standard deviation is very high (stdev/median: 38%)!
Try to rerun the benchmark with more runs, samples and/or loops
Median +- std dev: 11.0 ms +- 4.1 ms

$ python3 -m perf dump -q pgo_seed_5.json |less
Run 1: samples (3): 20.4 ms (+86%), 20.4 ms (+86%), 20.4 ms (+86%)
Run 2: samples (3): 20.7 ms (+88%), 20.7 ms (+88%), 20.7 ms (+88%)
Run 3: samples (3): 20.3 ms (+85%), 20.3 ms (+85%), 20.3 ms (+85%)
Run 4: samples (3): 20.3 ms (+85%), 20.3 ms (+85%), 20.3 ms (+85%)
(...)
Run 103: samples (3): 20.3 ms (+85%), 20.3 ms (+85%), 20.3 ms (+85%)
Run 104: samples (3): 20.3 ms (+85%), 20.3 ms (+85%), 20.3 ms (+85%)
Run 105: samples (3): 20.3 ms (+85%), 20.3 ms (+85%), 20.3 ms (+85%)
Run 106: samples (3): 11.0 ms, 11.0 ms, 11.0 ms
Run 107: samples (3): 11.2 ms, 11.2 ms, 11.2 ms
Run 108: samples (3): 11.1 ms, 11.1 ms, 11.1 ms
(...)
Run 398: samples (3): 11.0 ms, 11.0 ms, 11.0 ms
Run 399: samples (3): 10.9 ms, 10.8 ms, 10.9 ms
Run 400: samples (3): 10.8 ms, 10.8 ms, 10.8 ms
--------

I don't understand what happened between run 105 and run 106. The problem is that for this benchmark I isolated 3 physical cores (so 6 logical cores) and used all 6 logical cores for the benchmark, so I don't know exactly on which physical or logical CPU the benchmark ran.

6 hours later, when the system was idle, I was still able to reproduce the major performance difference.

Performance on CPU 2:
--------
$ python3 -m perf show cpu_bug2.json
Median +- std dev: 11.0 ms +- 0.2 ms
--------

Performance on CPU 3:
--------
$ python3 -m perf show cpu_bug3.json
Median +- std dev: 20.3 ms +- 0.2 ms
--------

CPU 3 seems to be 2x slower.

I'm discussing the issue with Srinivas Pandruvada and Rafael Wysocki, who work on the Linux intel_pstate driver. Srinivas Pandruvada asked me to add rcu_nocbs=3,7 to the kernel parameters: this parameter seems to work around the bug!

I tried the Linux command line "... isolcpus=3,7 nohz_full=3,7 rcu_nocbs=3,7". Using this command line, I'm unable to reproduce the C0 bug anymore!

* When idle, cpupower monitor shows me that CPUs 3 and 7 spend 0% of their time in the C0 state, so the CPUs are seen as idle, as expected. Confirmed by turbostat: the CPU frequency is 0 MHz (or something like 0.01 MHz) when the CPUs are idle.
* When running the benchmark on one CPU: cpupower monitor shows me that the tested CPU spends 100% of its time in the C0 state, as expected since it's active, and says that its frequency is 1825 MHz. turbostat confirms that the CPU is active: Busy% close to 100% and Avg_MHz around 1720 MHz.
* After the benchmark, the behaviour changes compared to the default mode without rcu_nocbs=3,7: CPUs 3 and 7 go back to 0% in C0 and their frequency goes back to 0 MHz (or 0.01 MHz).

I rechecked after a fresh boot and I failed to reproduce the bug. In fact, the bug only occurs with the performance governor. So I ran the test again after a fresh boot with the performance governor and the Linux command line "(...) isolcpus=3,7 nohz_full=3,7 rcu_nocbs=3,7", and I was unable to reproduce the bug. I confirm that "rcu_nocbs=3,7" works around the issue.
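For reference, a minimal sketch of how such a per-CPU comparison can be reproduced with taskset and the perf module (hedged example: the timeit statement below is only a placeholder for the real benchmark, and the CPU numbers depend on the machine):

--------
$ taskset -c 2 python3 -m perf timeit -o cpu_bug2.json 'sorted(range(10**4))'
$ taskset -c 3 python3 -m perf timeit -o cpu_bug3.json 'sorted(range(10**4))'
$ python3 -m perf show cpu_bug2.json
$ python3 -m perf show cpu_bug3.json
--------

On Linux, the affinity set by taskset is inherited by child processes, so the perf worker processes should stay on the chosen CPU.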
Created attachment 1203940 [details]
Script to tune the Linux kernel to run a benchmark on isolated CPUs: set the governor to performance, mask IRQs, etc.

> I rechecked after a fresh boot and I failed to reproduce the bug. In fact, the bug only occurs with the performance governor.

Just after a fresh boot with isolated CPUs, I run my isolcpus.py script which tunes the system:

* Set the scaling governor of all CPUs to "performance"
* Try to write 1 into /sys/devices/system/cpu/intel_pstate/no_turbo
* Create a CPU mask excluding the isolated CPUs and write this mask into /proc/irq/default_smp_affinity and into all /proc/irq/<IRQ number>/smp_affinity

I attached my script to this issue.

FYI I wrote an article about how I noticed, analyzed and then identified the bug: https://haypo.github.io/intel-cpus-part2.html

Update:

* When using NOHZ_FULL on a CPU, intel_pstate is not called back by the scheduler and so doesn't update the P-state of the CPU.
* It doesn't seem possible to work around the issue by forcing a P-state on the CPU using wrmsr.
* In short, the frequency of a CPU using NOHZ_FULL doesn't depend on its own workload, but on the workload of the other CPUs. If another CPU is active, the CPU will probably run at full speed (ex: 3.4 GHz, or higher if Turbo Boost is enabled). If the other CPUs are idle, the CPU will probably run at the lowest speed (ex: 1.6 GHz).

If you want a stable CPU frequency: force the CPU frequency, or don't use NOHZ_FULL.

--

I found a faster, simpler and more reliable scenario to reproduce the bug:

* Boot Linux with nohz_full=3,7 (no more isolcpus, rcu_nocbs or anything like that, only nohz_full!)
* Read the P-state of CPU 7: rdmsr -p 7 0x198 --bitfield 15:0
* Start stressing CPU 7
* Read the P-state of CPU 7 again: rdmsr -p 7 0x198 --bitfield 15:0
* The P-state is not updated, and the CPU frequency is not updated either

The P-state of CPU 7 is not updated, whereas it works as expected on CPUs not part of the nohz_full CPU set.

Example on CPU 5 (ok):

* rdmsr -p 5 0x198 --bitfield 15:0 => 1000
* Start stressing CPU 5: turbostat taskset -c 5 python3 -c 'while 1: pass'
* rdmsr -p 5 0x198 --bitfield 15:0 => 2200
* Interrupt the stress test (CTRL+c)
* According to turbostat, CPU 5 used frequency 3.3 GHz
* Sleep 10 seconds
* rdmsr -p 5 0x198 --bitfield 15:0 => 1000

Example on CPU 7 using NOHZ_FULL (BUG!):

* rdmsr -p 7 0x198 --bitfield 15:0 => 1100
* Start stressing CPU 7: turbostat taskset -c 7 python3 -c 'while 1: pass'
* rdmsr -p 7 0x198 --bitfield 15:0 => 1000   ~~~ NO CHANGE!
* Interrupt the stress test (CTRL+c)
* According to turbostat, CPU 7 used frequency 1.6 GHz   ~~~ I expect 3.3 GHz
* Sleep 10 seconds
* rdmsr -p 7 0x198 --bitfield 15:0 => 1000

--

I played with Ftrace:

# cd /sys/kernel/debug/tracing
# echo function > current_tracer
# echo intel_pstate_update_util > set_ftrace_filter
# cat trace_pipe |grep '\[00[37]\]'

Notes:

* On a regular CPU (ex: CPU 5), intel_pstate_update_util() is called by task_tick_fair() every millisecond (1000x/sec).
* On CPU 7 using NOHZ_FULL, intel_pstate_update_util() is called once per second.

It seems like the scheduler is idle on CPU 7 and so doesn't call the callback used by the intel_pstate driver to update the P-state of the CPUs.

--

I tried to update the P-state manually using MSR 199H, but it seems like intel_pstate immediately overrides my choice.

# wrmsr -p 3 0x199 0x2200; wrmsr -p 7 0x199 0x2200; rdmsr -p 7 0x198 --bitfield 15:0; sleep 1; rdmsr -p 7 0x198 --bitfield 15:0
2200
1000

So it doesn't seem possible to work around the issue manually.
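A minimal sketch consolidating the reproduction steps above into a single script (assumptions: msr-tools is installed, the msr kernel module can be loaded, CPU 7 was booted with nohz_full=7, and the script is run as root; this is an illustration, not one of the attached scripts):

--------
#!/bin/sh
# Show that the P-state of a nohz_full CPU is not raised under load.
modprobe msr
echo "idle P-state:   $(rdmsr -p 7 0x198 --bitfield 15:0)"
taskset -c 7 python3 -c 'while 1: pass' &   # busy loop pinned to CPU 7
sleep 5
echo "loaded P-state: $(rdmsr -p 7 0x198 --bitfield 15:0)"   # stays at the idle value on a buggy kernel
kill $!
--------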
If you want a stable CPU frequency on CPUs using NOHZ_FULL:

* Force the CPU frequency
* Don't use NOHZ_FULL

Example of command to force the CPU frequency by setting the maximum frequency as the scaling minimum frequency:

for cpu in /sys/devices/system/cpu/cpu*/cpufreq; do cat $cpu/cpuinfo_max_freq | sudo tee $cpu/scaling_min_freq; done

By the way, HyperThreading is unrelated to this bug: the bug can be reproduced with and without HyperThreading.

Created attachment 1205282 [details]
Script to reproduce the 2x slow down on CPU using NOHZ_FULL using the performance governor
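Following up on the frequency-forcing loop above, one quick way to check that the minimum frequency was actually raised on every CPU (a hedged example; the exact cpufreq files exposed depend on the driver):

--------
$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq
$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
--------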
Created attachment 1205283 [details]
perf JSON file of 10 fast runs and then 10 slow runs of the benchmark for the performance 2x slower issue
Created attachment 1205284 [details]
Output of performance_bug.sh
I wrote a shell script reproducing the 2x slower issue when running a benchmark on a CPU using NOHZ_FULL; CPUs 3 and 7 are using NOHZ_FULL. (A rough sketch of the approach follows below.)
* It seems like running turbostat makes CPU 3 fast, so the script starts by running turbostat
* First run of the benchmark: Median +- std dev: 12.9 ms +- 0.2 ms (FAST)
* Sleep 2 minutes (120 seconds)
* Second run of the benchmark: Median +- std dev: 23.6 ms +- 0.3 ms (SLOW)
I added commands to get debug information.
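A rough, hypothetical sketch of the structure such a script might have (this is not the attached performance_bug.sh; the perf timeit statement is only a placeholder for the real benchmark, and the CPU number is an example):

--------
#!/bin/sh
# Run turbostat briefly first: it seems to wake up the nohz_full CPU (see above).
turbostat sleep 5

# First run of the benchmark pinned to the nohz_full CPU: expected to be fast.
taskset -c 3 python3 -m perf timeit -o bench_fast.json 'sorted(range(10**4))'

# Let the system go idle.
sleep 120

# Second run: on an affected kernel this one is about 2x slower.
taskset -c 3 python3 -m perf timeit -o bench_slow.json 'sorted(range(10**4))'

python3 -m perf show bench_fast.json
python3 -m perf show bench_slow.json
--------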
The value of MSR 198H and MSR 199H is 2200H for all CPUs, both when the benchmark is fast and when it is slow. So these registers are not enough to explain the performance drop.
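For reference, one way to read these two registers on every CPU is rdmsr from msr-tools (a hedged example, not taken from the attached script; MSR 198H is IA32_PERF_STATUS, the current P-state, and MSR 199H is IA32_PERF_CTL, the requested P-state):

--------
# rdmsr -a 0x198 --bitfield 15:0
# rdmsr -a 0x199 --bitfield 15:0
--------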
I don't see any significant difference before (fast)/after (slow). When the benchmark is fast, the CPU temperature is between 44°C and 45°C; when it's slow, the temperature is between 47°C and 49°C. But that may be completely unrelated, or a side effect of something else.
Created attachment 1205306 [details]
performance-bug2.sh: updated script creating the virtual env, with more comments, etc.

I forgot to mention the requirements for performance-bug-2.sh (and performance-bug.sh): one CPU must run with NOHZ_FULL. Example of Linux cmdline: "... isolcpus=3,7 nohz_full=3,7 rcu_nocbs=3,7". But "... nohz_full=3,7" is enough to reproduce the bug.

The CPU=3 and LAST_CPU=7 variables should be updated in performance-bug-2.sh depending on the computer: $CPU is a CPU running with NOHZ_FULL, $LAST_CPU is the identifier of the last CPU (ex: 7 if there are 8 CPUs).

Created attachment 1205339 [details]
performance2.log: log where the bug occurs
Oops, I previously attached the wrong log, one where the bug wasn't reproduced :-/
performance2.log shows the bug:
* Bench 1: Median +- std dev: 12.9 ms +- 0.2 ms
* Bench 2: Median +- std dev: 23.6 ms +- 0.3 ms
Oooh, on #linux-rt of OFTC, JackWinter_ and clark explained to me how to use /dev/cpu_dma_latency to change the minimum C-state dynamically (without rebooting).

When writing 0 into /dev/cpu_dma_latency (and keeping the device open), all (logical) CPUs are stuck in the C0 state and don't go to sleep anymore. In this case, my benchmark becomes 2x slower on all CPUs, not only on the CPUs using NOHZ_FULL. It seems like forcing C0 on all logical CPUs with HyperThreading is a bad idea :-/ ~~~ CASE 1

When writing 10 (a minimum latency of 10 µs) into /dev/cpu_dma_latency (and keeping the device open), the CPUs are able to move to the C1 state. The benchmark runs at nominal speed on all CPUs. Moreover, my shell script (performance_bug-2.sh) is unable to reproduce the bug anymore. ~~~ CASE 2

I see two cases:

* CASE 1: Sometimes, a pair of logical CPUs (of a physical CPU core) using NOHZ_FULL are both stuck in C0: performance is divided by two.
* CASE 2: Sometimes, a logical CPU is not woken up and stays in a deep C-state, and so runs slower than the nominal speed.

I understand that the C-state of CPUs using NOHZ_FULL is not properly updated. Again, it might be related to the specific behaviour of the scheduler on CPUs using NOHZ_FULL, or maybe to an issue with intel_idle and NOHZ_FULL.

> CASE 1: Sometimes, a pair of logical CPUs (of a physical CPU core) using NOHZ_FULL are both stuck in C0: performance is divided by two

Ah, this case seems to be known by some users and is related to HyperThreading. John McCalpin's message in 2014: https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/535130
=> "Of course the right thing to do is never use HyperThreading and "idle=poll" at the same time!"

Extract:

"""
If you set "idle=poll" and you are using HyperThreading, then the kernel idle loop will be executing instructions and fighting for resources in the physical processor cores.

If the LINPACK benchmark decides to use only two threads (since there are only two physical cores), then there will be two "idle" threads that spin in a tight polling loop waiting to be assigned a process to run. This would be fine if they were on their own cores, but with HyperThreading they will slow down the compute threads.

You should get an improvement in performance if you can force the code to use 4 logical processors instead of 2. The overall result will be slower because the code is typically blocked for one thread per L2 cache (so you will get lots of extra L2 misses), but at least they will be getting real work done instead of just fighting for issue slots.
"""

*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There are a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 24 kernel bugs.

Fedora 24 has now been rebased to 4.10.9-100.fc24. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel. If you have moved on to Fedora 26 and are still experiencing this issue, please change the version to Fedora 26.

If you experience different issues, please open a new bug report for those.

I'm not sure that the issue can really be called a bug. It seems technically impossible to update the C-state/P-state of a CPU using NOHZ_FULL, since NOHZ_FULL means "no tick interrupt".
The Intel CPU drivers are implemented with a callback in the Linux scheduler, whereas NOHZ_FULL disables the scheduler tick as well. The problem is that the Intel CPU drivers are not aware that a few CPUs have the tick disabled, and so are not aware that the workload of the CPUs using NOHZ_FULL is not taken into account to compute the C-state/P-state.

I stopped using NOHZ_FULL, so I now close this bug.
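Going back to the /dev/cpu_dma_latency experiment above: the latency request only applies while the device stays open, so a process has to keep the file descriptor open for the duration of the test. A minimal, hypothetical sketch of how that can be done (this is an illustration, not what was used on #linux-rt):

--------
# Hold /dev/cpu_dma_latency at 0 (all CPUs pinned to C0) until the process exits;
# closing the file descriptor releases the request.
sudo python3 -c '
import os, struct, time
fd = os.open("/dev/cpu_dma_latency", os.O_WRONLY)
os.write(fd, struct.pack("i", 0))   # 32-bit value, in microseconds
time.sleep(3600)                    # keep the fd open while the benchmark runs
'
--------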