1378529 – intel_pstate driver doesn't support NOHZ_FULL

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	24
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-09-22 16:27 UTC by Victor Stinner
Modified:	2017-04-12 11:15 UTC
CC List:	9 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2017-04-12 11:15:53 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
Script to tune Linux kernel to run a benchmark on isolated CPUs: set governor to performance, mask IRQ, etc. (4.97 KB, text/plain) 2016-09-22 23:25 UTC, Victor Stinner	no flags	Details
Script to reproduce the 2x slow down on CPU using NOHZ_FULL using the performance governor (774 bytes, text/plain) 2016-09-27 16:36 UTC, Victor Stinner	no flags	Details
perf JSON file of 10 fast runs and then 10 slow runs of the benchmark for the performance 2x slower issue (8.92 KB, text/plain) 2016-09-27 16:36 UTC, Victor Stinner	no flags	Details
Output of performance_bug.sh (10.74 KB, text/plain) 2016-09-27 16:53 UTC, Victor Stinner	no flags	Details
performance-bug2.sh: Update script creating the virtual env, with more comments, etc. (1.21 KB, text/plain) 2016-09-27 19:40 UTC, Victor Stinner	no flags	Details
performance2.log: log where the bug occurs (14.33 KB, text/plain) 2016-09-27 22:33 UTC, Victor Stinner	no flags	Details

Description Victor Stinner 2016-09-22 16:27:07 UTC

First, my CPU is a "Sandybridge-DT" and "HWP is not available on this system" according to my colleague Prarit Bhargava :-)

When using isolated CPUs (isolcpus kernel parameter), it seems like sometimes isolated cores are 2x slower than expected. When running CPU-bound benchmarks on these isolated CPUs, sometimes a benchmark suddenly becomes 2x FASTER with no obvious explanation.

It happened to me twice that a benchmark suddenly became 2x FASTER, and I didn't do anything special to change the CPU speed or anything else.

I explicitly disabled Turbo Boost in the BIOS. Trying to disable it in the intel_pstate driver confirms that it's disabled:
-------
$ echo 1|sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
[sudo] password for haypo:
tee: /sys/devices/system/cpu/intel_pstate/no_turbo: Operation not permitted
-------

cpupower also confirms that it's disabled:
-------
$ sudo cpupower frequency-info
  boost state support:
    Supported: no
    Active: no
-------

As expected, the CPU frequency is 3.4 GHz when I run my benchmark. The strange thing is that the CPU frequency is also 3.4 GHz when the benchmark suddenly becomes 2x faster. I'm using /proc/cpuinfo to read the CPU frequency. I already noticed that turbostat is much more reliable than /proc/cpuinfo on my Intel CPU. Sadly, it seems like using turbostat has an impact on the bug, so I cannot verify the exact CPU frequency when the bug occurs.

By forcing the CPU frequency to 1.6 GHz (still using the intel_pstate driver), I reproduced the 2x slowdown. So I understand that when the bug occurs the isolated CPU runs at 1.6 GHz even though /proc/cpuinfo reports 3.4 GHz, whereas it runs at 3.4 GHz when the benchmark seems "faster".
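
As a quick sanity check of that reasoning, the sketch below compares the slowdown expected from a drop from 3.4 GHz to 1.6 GHz with the slowdown between the two benchmark medians quoted further down in this report (267 ns and 521 ns). The function names are made up for illustration, and the check only shows that the numbers are compatible with the 1.6 GHz hypothesis; it does not prove it.

```python
# Frequencies reported by lscpu for the i7-2600 in this report.
MAX_MHZ = 3400
MIN_MHZ = 1600


def expected_slowdown(fast_mhz: float, slow_mhz: float) -> float:
    """Slowdown factor of a purely CPU-bound task when the clock drops."""
    return fast_mhz / slow_mhz


def observed_slowdown(fast_ns: float, slow_ns: float) -> float:
    """Slowdown factor measured from two benchmark medians."""
    return slow_ns / fast_ns


if __name__ == "__main__":
    print(f"expected from frequency ratio: {expected_slowdown(MAX_MHZ, MIN_MHZ):.2f}x")
    print(f"observed from medians:         {observed_slowdown(267, 521):.2f}x")
```

Both factors are close to 2, which is why a CPU silently running at its minimum frequency looked like a plausible explanation at this point.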



Version-Release number of selected component (if applicable):

* Fedora 24
* Linux kernel 4.7.3-200.fc24.x86_64
* kernel-tools 4.7.3 (release 200.fc24), kernel-tools contains turbostat & cpupower



How reproducible:

Oh, this is the tricky part...

The bug occurs randomly. But it seems like I found a way to trigger the bug manually! It seems like running "cpupower monitor" helps to reproduce the bug.



Steps to reproduce
==================

My computer
-----------



Output::

    $ cat /etc/fedora-release
    Fedora release 24 (Twenty Four)
    $ uname -a
    Linux smithers 4.7.3-200.fc24.x86_64 #1 SMP Wed Sep 7 17:31:21 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
    $ lscpu -a -e
    CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ    MINMHZ
    0   0    0      0    0:0:0:0       yes    3400,0000 1600,0000
    1   0    0      1    1:1:1:0       yes    3400,0000 1600,0000
    2   0    0      2    2:2:2:0       yes    3400,0000 1600,0000
    3   0    0      3    3:3:3:0       yes    3400,0000 1600,0000
    4   0    0      0    0:0:0:0       yes    3400,0000 1600,0000
    5   0    0      1    1:1:1:0       yes    3400,0000 1600,0000
    6   0    0      2    2:2:2:0       yes    3400,0000 1600,0000
    7   0    0      3    3:3:3:0       yes    3400,0000 1600,0000
    root@smithers$ cat /proc/cpuinfo
    processor	: 0
    vendor_id	: GenuineIntel
    cpu family	: 6
    model		: 42
    model name	: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
    stepping	: 7
    microcode	: 0x29
    cpu MHz		: 1599.768
    cache size	: 8192 KB
    physical id	: 0
    siblings	: 8
    core id		: 0
    cpu cores	: 4
    apicid		: 0
    initial apicid	: 0
    fpu		: yes
    fpu_exception	: yes
    cpuid level	: 13
    wp		: yes
    flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm arat pln pts
    bugs		:
    bogomips	: 6822.45
    clflush size	: 64
    cache_alignment	: 64
    address sizes	: 36 bits physical, 48 bits virtual
    power management:

    (... 7 more logical CPUs ...)


Prepare
-------

* Enable HyperThreading if it was disabled
* Isolate one physical CPU core using the isolcpus kernel command line parameter, e.g. "isolcpus=3,7".
  You may also enable NOHZ_FULL on these isolated CPUs: "isolcpus=3,7 nohz_full=3,7".
  Moreover, you may also disable IRQs on these isolated CPUs.
* Install the Python "perf" module

Install the Python perf module in a virtual environment, so ``rm -rf
bug_pstate`` will remove everything later::

    $ python3 -m venv bug_pstate
    $ bug_pstate/bin/python -m pip install perf
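
Before running the benchmark, it can help to confirm that the kernel really booted with the parameters from the list above. The following is a minimal sketch, not part of the original scripts: the function names are hypothetical, and it assumes the plain CPU-list syntax used in this report ("3,7" or ranges such as "2-3").

```python
def parse_cpu_list(text: str) -> list[int]:
    """Parse a kernel CPU list such as "3,7" or "2-3,6-7" into sorted CPU numbers."""
    cpus = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def cpus_from_cmdline(cmdline: str, parameter: str) -> list[int]:
    """Return the CPUs listed for `parameter` (e.g. "nohz_full") in a kernel command line."""
    prefix = parameter + "="
    for word in cmdline.split():
        if word.startswith(prefix):
            return parse_cpu_list(word[len(prefix):])
    return []  # parameter not present on the command line


if __name__ == "__main__":
    # Example string only; on a real system pass the content of /proc/cmdline.
    example = "ro quiet isolcpus=3,7 nohz_full=3,7"
    print(cpus_from_cmdline(example, "isolcpus"))
    print(cpus_from_cmdline(example, "nohz_full"))
```

On a live system, `cpus_from_cmdline(open("/proc/cmdline").read(), "nohz_full")` gives the CPUs to pass to `--affinity`.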


Before C0 bug
-------------

* Run the benchmark on CPU 7 in terminal 1

Terminal 1 before monitoring, the benchmark is "fast"::

    $ bug_pstate/bin/python -m perf timeit -s 'from functools import partial; f = lambda x, y: None; g = partial(f, 1)' 'g(2)' -v -w0 --metadata -p 10 --affinity=7
    Pin process to CPUs: 7
    (...)
    Run 9/10: samples (3): 267 ns, 267 ns, 267 ns
    Run 10/10: samples (3): 268 ns, 268 ns, 268 ns

    Metadata:
    - cpu_affinity: 7
    - cpu_config: 7=driver:intel_pstate, intel_pstate:no turbo, governor:performance, nohz_full, isolated
    - cpu_count: 8
    - cpu_freq: 7=3400 MHz
    - cpu_model_name: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
    (...)

    Median +- std dev: 267 ns +- 1 ns


Trigger the C0 bug
------------------

* Start cpupower monitor in terminal 2

Initial state, Mperf C0 at 0% for CPUs 3 and 7, the CPU is idle::

    root@smithers$ while true; do cpupower monitor -i 2; done
        |Nehalem                    || SandyBridge        || Mperf              || Idle_Stats
    CPU | C3   | C6   | PC3  | PC6  || C7   | PC2  | PC7  || C0   | Cx   | Freq || POLL | C1-S | C1E- | C3-S | C6-S
       0|  1,37| 83,00|  0,76| 59,63||  0,00| 16,90|  0,00||  3,20| 96,80|  2055||  0,00|  0,32|  0,10|  0,76| 95,74
       4|  1,37| 83,00|  0,76| 59,63||  0,00| 16,90|  0,00||  0,57| 99,43|  3120||  0,00|  4,53|  0,40|  0,00| 94,50
       1|  0,12| 96,88|  0,76| 59,63||  0,00| 16,90|  0,00||  1,27| 98,73|  3068||  0,00|  0,10|  0,05|  0,11| 98,47
       5|  0,12| 96,88|  0,76| 59,63||  0,00| 16,90|  0,00||  0,49| 99,51|  3219||  0,00|  0,00|  0,02|  0,00| 99,50
       2|  0,07| 97,53|  0,76| 59,63||  0,00| 16,90|  0,00||  1,20| 98,80|  3032||  0,00|  0,00|  0,06|  0,11| 98,64
       6|  0,07| 97,53|  0,76| 59,63||  0,00| 16,90|  0,00||  0,64| 99,36|  3054||  0,00|  0,00|  0,00|  0,00| 99,35
       3|  0,00| 99,76|  0,76| 59,63||  0,00| 16,90|  0,00||  0,00|100,00|  2513||  0,00|  0,00|  0,00|  0,00| 99,99
       7|  0,00| 99,76|  0,76| 59,63||  0,00| 16,90|  0,00||  0,00|100,00|  3276||  0,00|  0,00|  0,00|  0,00|100,00
    (...)

* Run the benchmark again in terminal 1, this time on CPU 3

The second run of the benchmark is fine::

    $ bug_pstate/bin/python -m perf timeit -s 'from functools import partial; f = lambda x, y: None; g = partial(f, 1)' 'g(2)' -v -w0 --metadata -p 10 --affinity=3
    Pin process to CPUs: 3
    (...)
    Run 10/10: samples (3): 273 ns, 267 ns, 267 ns

    Metadata:
    - cpu_affinity: 3
    - cpu_config: 3=driver:intel_pstate, intel_pstate:no turbo, governor:performance, nohz_full, isolated
    - cpu_count: 8
    - cpu_freq: 3=3400 MHz
    - cpu_model_name: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
    (...)

    Median +- std dev: 268 ns +- 1 ns

* Wait a few seconds and see CPU 3 and CPU 7 slowly becoming stuck in C0 state.

Terminal 2 when the bug occurs::

    root@smithers$ while true; do cpupower monitor -i 2; done

    (...)
        |Nehalem                    || SandyBridge        || Mperf              || Idle_Stats
    CPU | C3   | C6   | PC3  | PC6  || C7   | PC2  | PC7  || C0   | Cx   | Freq || POLL | C1-S | C1E- | C3-S | C6-S
       0|  3,07| 84,28|  0,00|  0,00||  0,00|  0,00|  0,00||  7,07| 92,93|  3410||  0,00|  0,77|  0,52|  1,86| 89,82
       4|  3,07| 84,28|  0,00|  0,00||  0,00|  0,00|  0,00||  0,25| 99,75|  3410||  0,00|  0,30|  0,34|  0,01| 99,10
       1|  0,41| 87,02|  0,00|  0,00||  0,00|  0,00|  0,00||  3,37| 96,63|  3410||  0,00|  1,74|  0,33|  0,59| 93,98
       5|  0,41| 87,02|  0,00|  0,00||  0,00|  0,00|  0,00||  2,05| 97,95|  3410||  0,00|  7,65|  0,00|  0,00| 90,30
       2|  0,05| 95,01|  0,00|  0,00||  0,00|  0,00|  0,00||  3,39| 96,61|  3411||  0,00|  0,62|  0,03|  0,11| 95,86
       6|  0,05| 95,01|  0,00|  0,00||  0,00|  0,00|  0,00||  0,65| 99,35|  3410||  0,00|  0,00|  0,00|  0,00| 99,35
       3|  0,00|  0,00|  0,00|  0,00||  0,00|  0,00|  0,00||100,00|  0,00|  3411||  0,00|  0,00|  0,00|  0,00|  0,00
       7|  0,00|  0,00|  0,00|  0,00||  0,00|  0,00|  0,00||100,00|  0,00|  3410||  0,00|  0,00|  0,00|  0,00|  0,00


* Now running the benchmark on CPU 3 or CPU 7 is simply 2x slower!

Terminal 1 with monitoring when the C0 bug occurs::

    $ bug_pstate/bin/python -m perf timeit -s 'from functools import partial; f = lambda x, y: None; g = partial(f, 1)' 'g(2)' -v -w0 --metadata -p 10 --affinity=3
    Pin process to CPUs: 7
    (...)
    Run 9/10: samples (3): 528 ns, 522 ns, 526 ns
    Run 10/10: samples (3): 524 ns, 528 ns, 528 ns

    Metadata:
    - cpu_affinity: 7
    - cpu_config: 7=driver:intel_pstate, intel_pstate:no turbo, governor:performance, nohz_full, isolated
    - cpu_count: 8
    - cpu_freq: 7=3400 MHz
    - cpu_model_name: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
    (...)

    Median +- std dev: 521 ns +- 4 ns


Exit the C0 bug
---------------

* Stop monitoring
* Run benchmark on the CPU 3
* Run benchmark on the CPU 7
* Run benchmark on the CPU 3
* The bug is gone!

Output::

    $ bug_pstate/bin/python -m perf timeit -s 'from functools import partial; f = lambda x, y: None; g = partial(f, 1)' 'g(2)' -w0 -p 10 --affinity=3
    ..........
    Median +- std dev: 520 ns +- 3 ns
    $ bug_pstate/bin/python -m perf timeit -s 'from functools import partial; f = lambda x, y: None; g = partial(f, 1)' 'g(2)' -w0 -p 10 --affinity=7
    ..........
    Median +- std dev: 519 ns +- 2 ns
    $ bug_pstate/bin/python -m perf timeit -s 'from functools import partial; f = lambda x, y: None; g = partial(f, 1)' 'g(2)' -w0 -p 10 --affinity=3
    ..........
    Median +- std dev: 267 ns +- 1 ns

Comment 1 Victor Stinner 2016-09-22 16:29:27 UTC

Here is the full story of how I noticed the bug.

Last week, I ran a benchmark for a full day and everything was fine, but at the end of the day, for no obvious reason, the benchmark suddenly became 2x faster!?

Last benchmark before the bug:
--------
$ python3 -m perf show pgo_seed_4.json
Median +- std dev: 20.4 ms +- 0.1 ms
--------


Benchmark when the bug occurred:
--------
haypo@smithers$ python3 -m perf show pgo_seed_5.json
ERROR: the benchmark is very unstable, the standard deviation is very
high (stdev/median: 38%)!
Try to rerun the benchmark with more runs, samples and/or loops

Median +- std dev: 11.0 ms +- 4.1 ms


$ python3 -m perf dump -q pgo_seed_5.json |less
Run 1: samples (3): 20.4 ms (+86%), 20.4 ms (+86%), 20.4 ms (+86%)
Run 2: samples (3): 20.7 ms (+88%), 20.7 ms (+88%), 20.7 ms (+88%)
Run 3: samples (3): 20.3 ms (+85%), 20.3 ms (+85%), 20.3 ms (+85%)
Run 4: samples (3): 20.3 ms (+85%), 20.3 ms (+85%), 20.3 ms (+85%)
(...)
Run 103: samples (3): 20.3 ms (+85%), 20.3 ms (+85%), 20.3 ms (+85%)
Run 104: samples (3): 20.3 ms (+85%), 20.3 ms (+85%), 20.3 ms (+85%)
Run 105: samples (3): 20.3 ms (+85%), 20.3 ms (+85%), 20.3 ms (+85%)
Run 106: samples (3): 11.0 ms, 11.0 ms, 11.0 ms
Run 107: samples (3): 11.2 ms, 11.2 ms, 11.2 ms
Run 108: samples (3): 11.1 ms, 11.1 ms, 11.1 ms
(...)
Run 398: samples (3): 11.0 ms, 11.0 ms, 11.0 ms
Run 399: samples (3): 10.9 ms, 10.8 ms, 10.9 ms
Run 400: samples (3): 10.8 ms, 10.8 ms, 10.8 ms
--------

I don't understand what occurred between run 105 and run 106.

The problem is that for this benchmark, I isolated 3 physical cores, so 6 logical cores, and I used the 6 logical cores for the benchmark. So I don't know exactly on which physical or logical CPU the benchmark ran.


6 hours later, when the system was idle, I was still able to reproduce the major performance difference:

Performance on CPU 2:
--------
$ python3 -m perf show cpu_bug2.json
Median +- std dev: 11.0 ms +- 0.2 ms
--------

Performance on CPU 3:
--------
$ python3 -m perf show cpu_bug3.json
Median +- std dev: 20.3 ms +- 0.2 ms
--------

CPU 3 seems to be 2x slower.

Comment 2 Victor Stinner 2016-09-22 16:53:57 UTC

I'm discussing with Srinivas Pandruvada and Rafael Wysocki who work on the Linux intel_pstate driver.

Srinivas Pandruvada asked me to add rcu_nocbs=3,7 to the kernel parameters: it seems like this parameter works around the bug!


I tried the Linux command line: "... isolcpus=3,7 nohz_full=3,7 rcu_nocbs=3,7".

Using this command line, I'm unable to reproduce the C0 bug anymore!

* When idle, cpupower monitor shows me that CPUs 3 and 7 spend 0% of the time in the C0 state, so the CPUs are seen as idle, as expected. Confirmed by turbostat: the CPU frequency is 0 MHz (or something like 0.01 MHz) when the CPUs are idle

* When running the benchmark on one CPU: cpupower monitor shows me that the tested CPU spends 100% of the time in the C0 state, as expected since it's active, and says that its frequency is 1825 MHz. turbostat confirms that the CPU is active: Busy% close to 100% and Avg_MHz around 1720 MHz.

* After the benchmark, the behaviour changes compared to the default mode without rcu_nocbs=3,7: CPUs 3 and 7 go back to 0% in C0 and their frequency goes back to 0 MHz (or 0.01 MHz).

Comment 3 Victor Stinner 2016-09-22 20:32:36 UTC

I rechecked after a fresh boot and I failed to reproduce the bug. In fact, the bug only occurs with the performance governor.

So I ran the test again after a fresh boot with the performance governor and the Linux command line "(...) isolcpus=3,7 nohz_full=3,7 rcu_nocbs=3,7". And I was unable to reproduce the bug. I confirm that "rcu_nocbs=3,7" works around the issue.

Comment 4 Victor Stinner 2016-09-22 23:25:00 UTC

Created attachment 1203940 [details]
Script to tune Linux kernel to run a benchmark on isolated CPUs: set governor to performance, mask IRQ, etc.

> I rechecked after a fresh boot and I failed to reproduce the bug. In fact, the bug only occurs with the performance governor.

Just after a fresh boot with isolated CPUs, I run my isolcpus.py script, which tunes the system:

* Set the scaling governor of all CPUs to "performance"
* Try to write 1 into /sys/devices/system/cpu/intel_pstate/no_turbo
* Create a CPU mask excluding isolated CPUs and write this mask into /proc/irq/default_smp_affinity and into all /proc/irq/<IRQ number>/smp_affinity

I attached my script to this issue.
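
The attached script is not reproduced here. As a rough sketch of the mask computation described in the last bullet only, the hypothetical helper below builds the hexadecimal CPU mask with the isolated CPUs cleared. It assumes a machine with at most 32 logical CPUs, so that the mask fits in a single hexadecimal group.

```python
def irq_affinity_mask(cpu_count: int, isolated: set[int]) -> str:
    """Hex CPU mask with one bit per logical CPU, isolated CPUs cleared.

    Assumes cpu_count <= 32 so the mask is a single hexadecimal group.
    """
    mask = 0
    for cpu in range(cpu_count):
        if cpu not in isolated:
            mask |= 1 << cpu  # bit N set means "IRQs may run on CPU N"
    return format(mask, "x")


if __name__ == "__main__":
    # 8 logical CPUs with CPUs 3 and 7 isolated, as in this report.
    print(irq_affinity_mask(8, {3, 7}))
```

The resulting string is what would be written (as root) into /proc/irq/default_smp_affinity and each /proc/irq/<IRQ number>/smp_affinity file.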

Comment 5 Victor Stinner 2016-09-26 09:21:47 UTC

FYI I wrote an article about how I noticed, analyzed and then identified the bug:
https://haypo.github.io/intel-cpus-part2.html

Comment 6 Victor Stinner 2016-09-27 10:32:03 UTC

Update:

* When using NOHZ_FULL on a CPU, intel_pstate is not called back by the scheduler and so doesn't update the P-state of the CPU
* It doesn't seem possible to work around the issue by forcing a P-state on the CPU using wrmsr
* In short, the frequency of a CPU using NOHZ_FULL doesn't depend on its own workload, but on the workload of other CPUs. If another CPU is active, the CPU will probably run at full speed (e.g. 3.4 GHz or higher if Turbo Boost is enabled). If other CPUs are idle, the CPU will probably run at the lowest speed (e.g. 1.6 GHz).

If you want stable CPU frequency: force the CPU frequency or don't use NOHZ_FULL.

--

I found a faster, simpler and more reliable scenario to reproduce the bug:

* Boot Linux with nohz_full=3,7 (no more isolcpus, rcu_nocbs or anything like that, only nohz_full!)
* Read the P-state of the CPU 7: rdmsr -p 7 0x198 --bitfield 15:0
* Start stressing the CPU 7
* Read again the P-state of the CPU 7: rdmsr -p 7 0x198 --bitfield 15:0
* The P-state is not updated, and the CPU frequency is not updated either

The P-state of CPU 7 is not updated, whereas it works as expected on CPUs not part of the nohz_full CPU set.

Example on CPU 5 (ok):

* rdmsr -p 5 0x198 --bitfield 15:0 => 1000
* Start stressing CPU 5: turbostat taskset -c 5 python3 -c 'while 1: pass'
* rdmsr -p 5 0x198 --bitfield 15:0 => 2200
* Interrupt the stress test (CTRL+c)
* According to turbostat, CPU 5 used frequency 3.3 GHz
* Sleep 10 seconds
* rdmsr -p 5 0x198 --bitfield 15:0 => 1000

Example on CPU 7 using NOHZ_FULL (BUG!):

* rdmsr -p 7 0x198 --bitfield 15:0 => 1100
* Start stressing CPU 7: turbostat taskset -c 7 python3 -c 'while 1: pass'
* rdmsr -p 7 0x198 --bitfield 15:0 => 1000  ~~~ NO CHANGE!
* Interrupt the stress test (CTRL+c)
* According to turbostat, CPU 7 used frequency 1.6 GHz  ~~~ I expect 3.3 GHz
* Sleep 10 seconds
* rdmsr -p 7 0x198 --bitfield 15:0 => 1000
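
The rdmsr values above are easier to compare with the lscpu frequencies once decoded. The sketch below reads the same register through /dev/cpu/<cpu>/msr and converts it to MHz. It rests on two assumptions that are not stated in this report: the rdmsr output is hexadecimal, and on this Sandy Bridge CPU the frequency ratio sits in bits 15:8 with a 100 MHz bus clock. Under those assumptions 1000 decodes to 1600 MHz and 2200 to 3400 MHz, which matches the MINMHZ and MAXMHZ columns. The function names are made up for illustration.

```python
import os
import struct

IA32_PERF_STATUS = 0x198   # register read with rdmsr in this comment
BUS_CLOCK_MHZ = 100        # assumption: Sandy Bridge bus clock


def read_msr(cpu: int, register: int) -> int:
    """Read a 64-bit MSR via /dev/cpu/<cpu>/msr (needs root and the msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        data = os.pread(fd, 8, register)  # the file offset selects the register
    finally:
        os.close(fd)
    return struct.unpack("<Q", data)[0]


def perf_status_to_mhz(value: int) -> int:
    """Decode MSR 0x198, assuming the frequency ratio is stored in bits 15:8."""
    ratio = (value >> 8) & 0xFF
    return ratio * BUS_CLOCK_MHZ


if __name__ == "__main__":
    # Values reported in this comment, interpreted as hexadecimal.
    for raw in (0x1000, 0x1100, 0x2200):
        print(f"{raw:#06x} -> {perf_status_to_mhz(raw)} MHz")
```

On a live system, `perf_status_to_mhz(read_msr(7, IA32_PERF_STATUS))` would give the decoded value for CPU 7.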

--

I played with Ftrace:

# cd /sys/kernel/debug/tracing
# echo function > current_tracer
# echo intel_pstate_update_util > set_ftrace_filter
# cat trace_pipe |grep '\[00[37]\]'

Notes:

* On a regular CPU (ex: CPU 5), intel_pstate_update_util() is called by task_tick_fair() every millisecond (1000x/sec)
* On CPU 7 using NOHZ_FULL: intel_pstate_update_util() is called once per second

It seems like the scheduler is idle on CPU 7 and so doesn't call the callback used by the intel_pstate driver to update the P-state of CPUs.
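
To put numbers on the call rates noted above, the trace lines can be counted per CPU instead of being filtered with grep. This is a minimal sketch with a hypothetical function name; it only assumes that each line carries the CPU number in a three-digit bracketed field such as [007], which is the same field the grep pattern above matches.

```python
import re
from collections import Counter

# Same field as the grep pattern in this comment: "[005]", "[007]", ...
CPU_FIELD = re.compile(r"\[(\d{3})\]")


def count_calls_per_cpu(lines, function="intel_pstate_update_util"):
    """Count how many trace lines mention `function`, grouped by CPU number."""
    counts = Counter()
    for line in lines:
        if function not in line:
            continue
        match = CPU_FIELD.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Feeding it the lines captured from trace_pipe over a fixed interval (for example 10 seconds) and dividing by the interval gives the calls per second for each CPU.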

--

I tried to manually update the P-state using MSR 199H, but it seems like the intel_pstate driver immediately overrides my choice.

# wrmsr -p 3 0x199 0x2200; wrmsr -p 7 0x199 0x2200; rdmsr -p 7 0x198 --bitfield 15:0; sleep 1; rdmsr -p 7 0x198 --bitfield 15:0
2200
1000

So it doesn't seem possible to work around the issue manually.

If you want stable CPU frequency on CPUs using NOHZ_FULL:

* Force the CPU frequency
* Don't use NOHZ_FULL

Example of a command to force the CPU frequency by setting the scaling minimum frequency to the maximum frequency:

for cpu in /sys/devices/system/cpu/cpu*/cpufreq; do cat "$cpu/cpuinfo_max_freq" | sudo tee "$cpu/scaling_min_freq"; done

Comment 7 Victor Stinner 2016-09-27 11:30:29 UTC

By the way, HyperThreading is unrelated to this bug: the bug can be reproduced with and without HyperThreading.

Comment 8 Victor Stinner 2016-09-27 16:36:03 UTC

Created attachment 1205282 [details]
Script to reproduce the 2x slow down on CPU using NOHZ_FULL using the performance governor

Comment 9 Victor Stinner 2016-09-27 16:36:56 UTC

Created attachment 1205283 [details]
perf JSON file of 10 fast runs and then 10 slow runs of the benchmark for the performance 2x slower issue

Comment 10 Victor Stinner 2016-09-27 16:53:08 UTC

Created attachment 1205284 [details]
Output of performance_bug.sh

I wrote a shell script reproducing the 2x slower issue when running a benchmark on a CPU using NOHZ_FULL. CPU 3 and 7 are using NOHZ_FULL.

* It seems like running turbostat makes CPU 3 fast, so the script starts by running turbostat
* First run of the benchmark: Median +- std dev: 12.9 ms +- 0.2 ms (FAST)
* Sleep 2 minutes (120 seconds)
* Second run of the benchmark: Median +- std dev: 23.6 ms +- 0.3 ms (SLOW)

I added commands to get debug information.

The value of MSR 198H and MSR 199H is 2200H for all CPUs, when the benchmark is fast and when the benchmark is slow. So these registers are not enough to explain the performance drop.

I don't see any significant difference between before (fast) and after (slow). When the benchmark is fast, the CPU temperature is between 44°C and 45°C; when it's slow, the temperature is between 47°C and 49°C. But it may be completely unrelated or a side effect of something else.

Comment 11 Victor Stinner 2016-09-27 19:40:40 UTC

Created attachment 1205306 [details]
performance-bug2.sh: Update script creating the virtual env, with more comments, etc.

I forgot to mention requirements for performance-bug-2.sh (and performance-bug.sh): one CPU must run with NOHZ_FULL. Example of Linux cmdline: "... isolcpus=3,7 nohz_full=3,7 rcu_nocbs=3,7". But "... nohz_full=3,7" is enough to reproduce the bug.

CPU=3 and LAST_CPU=7 variables should be updated in performance-bug-2.sh depending on the computer: $CPU is a CPU running with NOHZ_FULL, $LAST_CPU is the identifier of the last CPU (ex: 7 if there are 8 CPUs).

Comment 12 Victor Stinner 2016-09-27 22:33:18 UTC

Created attachment 1205339 [details]
performance2.log: log where the bug occurs

Oops, I attached the wrong log before where the bug wasn't reproduced :-/

performance2.log shows the bug:

* Bench 1: Median +- std dev: 12.9 ms +- 0.2 ms
* Bench 2: Median +- std dev: 23.6 ms +- 0.3 ms

Comment 13 Victor Stinner 2016-09-27 23:13:50 UTC

Oooh, on #linux-rt of OFTC, JackWinter_ and clark explained to me how to use /dev/cpu_dma_latency to change the minimum C-state dynamically (without rebooting).

When writing 0 into /dev/cpu_dma_latency (and keeping the device open), all (logical) CPUs are stuck in the C0 state and don't go to sleep anymore. In this case, my benchmark becomes 2x slower on all CPUs, not only on CPUs using NOHZ_FULL. It seems like forcing C0 on all logical CPUs with HyperThreading is a bad idea :-/ ~~~ CASE 1

When writing 10 (maximum latency of 10 µs) into /dev/cpu_dma_latency (and keeping the device open), CPUs are able to move to the C1 state. The benchmark runs at nominal speed on all CPUs. Moreover, my shell script (performance_bug-2.sh) is unable to reproduce the bug anymore. ~~~ CASE 2

I see two cases:

* CASE 1: Sometimes, a pair of logical CPUs (of a physical CPU core) using NOHZ_FULL are both stuck in C0: performance is divided by two
* Related to CASE 2: Sometimes, a logical CPU is not woken up and stays in a deep C-state, and so runs slower than the nominal speed.

I understand that the C-state of CPUs using NOHZ_FULL is not properly updated. Again, it might be related to the specific behaviour of the scheduler on CPUs using NOHZ_FULL, or maybe an issue with intel_idle and NOHZ_FULL.
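
The /dev/cpu_dma_latency technique described in this comment can be sketched as follows. The request only lasts while the file stays open, so a context manager is a natural fit. The function names are hypothetical, the value is written as a native 32-bit integer, and the kernel's PM QoS interface interprets it as microseconds. Running it for real requires root.

```python
import os
import struct
from contextlib import contextmanager


def encode_latency(latency_us: int) -> bytes:
    """Encode the requested latency (in microseconds) as a native 32-bit integer."""
    return struct.pack("=i", latency_us)


@contextmanager
def cpu_dma_latency(latency_us: int, path: str = "/dev/cpu_dma_latency"):
    """Hold a latency request for as long as the `with` block runs (needs root)."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, encode_latency(latency_us))
        yield
    finally:
        os.close(fd)  # closing the device drops the request
```

For example, running the benchmark inside `with cpu_dma_latency(10):` would correspond to CASE 2 above, and `with cpu_dma_latency(0):` to CASE 1.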

Comment 14 Victor Stinner 2016-09-27 23:26:10 UTC

> CASE 1: Sometimes, a pair of logical CPUs (of a physical CPU core) using NOHZ_FULL are both stuck in C0: performance is divided by two

Ah, this case seems to be known by some users and is related to HyperThreading.

John McCalpin's message from 2014:
https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/535130
=> "Of course the right thing to do is never use HyperThreading and "idle=poll" at the same time!"

Extract:
"""
If you set "idle=poll" and you are using HyperThreading, then the kernel idle loop will be executing instructions and fighting for resources in the physical processor cores.   If the LINPACK benchmark decides to use only two threads (since there are only two physical cores), then there will be two "idle" threads that spin in a tight polling loop waiting to be assigned a process to run.  This would be fine if they were on their own cores, but with HyperThreading they will slow down the compute threads.

You should get an improvement in performance if you can force the code to use 4 logical processors instead of 2.   The overall result will be slower because the code is typically blocked for one thread per L2 cache (so you will get lots of extra L2 misses), but at least they will be getting real work done instead of just fighting for issue slots.
"""

Comment 15 Justin M. Forbes 2017-04-11 15:02:45 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 24 kernel bugs.

Fedora 24 has now been rebased to 4.10.9-100.fc24.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 26, and are still experiencing this issue, please change the version to Fedora 26.

If you experience different issues, please open a new bug report for those.

Comment 16 Victor Stinner 2017-04-12 11:15:53 UTC

I'm not sure that the issue can really be called a bug. It seems technically impossible to update the C-state/P-state of a CPU using NOHZ_FULL, since NOHZ_FULL means "no interruption". The Intel CPU drivers are implemented with a callback in the Linux scheduler, whereas NOHZ_FULL disables the interruption of the Linux scheduler as well.

The problem is that the Intel CPU drivers are not aware that a few CPUs have interruptions disabled, and are not aware that the workload of the CPUs using NOHZ_FULL is not taken into account when computing the C-state/P-state.

I stopped using NOHZ_FULL, so I now close this bug.
