Description of problem:
When CPU isolation is in use, for example via the cpu-partitioning tuned profile, isolated CPUs can still be interrupted by kernel IPIs initiated from non-isolated CPUs. This can happen in many different ways; a few instances have been diagnosed with the rt-trace-bpf tool:
caused by NetworkManager:
64359.052209596 NetworkManager 0 1405 smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
smp_call_function_many_cond+0x1
smp_call_function+0x39
on_each_cpu+0x2a
flush_tlb_kernel_range+0x7b
__purge_vmap_area_lazy+0x70
_vm_unmap_aliases.part.42+0xdf
change_page_attr_set_clr+0x16a
set_memory_ro+0x26
bpf_int_jit_compile+0x2f9
bpf_prog_select_runtime+0xc6
bpf_prepare_filter+0x523
sk_attach_filter+0x13
sock_setsockopt+0x92c
__sys_setsockopt+0x16a
__x64_sys_setsockopt+0x20
do_syscall_64+0x87
entry_SYSCALL_64_after_hwframe+0x65
caused by the mgag200 kernel module:
238903.096535737 kworker/0:1 0 88579 smp_call_function_many_cond (cpu=0, func=do_flush_tlb_all)
smp_call_function_many_cond+0x1
smp_call_function+0x39
on_each_cpu+0x2a
flush_tlb_kernel_range+0x48
__purge_vmap_area_lazy+0x70
free_vmap_area_noflush+0xf2
remove_vm_area+0x93
__vunmap+0x59
drm_gem_shmem_vunmap+0x6d
mgag200_handle_damage+0x62
mgag200_simple_display_pipe_update+0x69
drm_atomic_helper_commit_planes+0xb3
drm_atomic_helper_commit_tail+0x26
commit_tail+0xc6
drm_atomic_helper_commit+0x103
drm_atomic_helper_dirtyfb+0x20e
drm_fb_helper_damage_work+0x228
process_one_work+0x18f
worker_thread+0x30
kthread+0x15d
ret_from_fork+0x1f
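Both traces converge on the same underlying mechanism: lazy vmap purging ends in flush_tlb_kernel_range(), which on x86 broadcasts the flush to every online CPU via on_each_cpu(), with no regard for CPU isolation. A simplified sketch of that sender-side path, condensed from upstream arch/x86/mm/tlb.c (details may differ in this exact kernel build):

/* Condensed from arch/x86/mm/tlb.c (upstream); illustrative only. */
void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
	/* Large ranges fall back to a full TLB flush on every CPU. */
	if (end == TLB_FLUSH_ALL ||
	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
		/* IPIs every online CPU -- isolated or not. */
		on_each_cpu(do_flush_tlb_all, NULL, 1);
	} else {
		struct flush_tlb_info info;

		info.start = start;
		info.end = end;
		/* Same problem: the target mask is effectively
		 * cpu_online_mask, so isolated CPUs are hit too. */
		on_each_cpu(do_kernel_range_flush, &info, 1);
	}
}

This matches the two callbacks seen in the traces above: do_kernel_range_flush for the NetworkManager/BPF case and do_flush_tlb_all for the mgag200 case.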
Tracing on the isolated CPUs shows preemptions such as this:
58118.769286 | 18) <...>-128143 | | smp_call_function_interrupt() {
58118.769286 | 18) <...>-128143 | | irq_enter() {
58118.769287 | 18) <...>-128143 | 0.101 us | preempt_count_add();
58118.769288 | 18) <...>-128143 | 0.968 us | }
58118.769288 | 18) <...>-128143 | | generic_smp_call_function_single_interrupt() {
58118.769289 | 18) <...>-128143 | | flush_smp_call_function_queue() {
58118.769289 | 18) <...>-128143 | | do_flush_tlb_all() {
58118.769290 | 18) <...>-128143 | 0.453 us | native_flush_tlb_global();
58118.769291 | 18) <...>-128143 | 1.439 us | }
58118.769292 | 18) <...>-128143 | 2.402 us | }
58118.769292 | 18) <...>-128143 | 3.223 us | }
58118.769292 | 18) <...>-128143 | | irq_exit() {
58118.769293 | 18) <...>-128143 | 0.077 us | preempt_count_sub();
58118.769294 | 18) <...>-128143 | 0.201 us | idle_cpu();
58118.769295 | 18) <...>-128143 | | tick_nohz_irq_exit() {
58118.769295 | 18) <...>-128143 | 0.164 us | ktime_get();
58118.769296 | 18) <...>-128143 | | __tick_nohz_full_update_tick() {
58118.769296 | 18) <...>-128143 | 0.079 us | check_tick_dependency();
58118.769297 | 18) <...>-128143 | 0.074 us | check_tick_dependency();
58118.769298 | 18) <...>-128143 | 0.070 us | check_tick_dependency();
58118.769299 | 18) <...>-128143 | 0.101 us | check_tick_dependency();
58118.769300 | 18) <...>-128143 | 1.458 us | tick_nohz_next_event();
58118.769302 | 18) <...>-128143 | 0.082 us | tick_nohz_stop_tick();
58118.769303 | 18) <...>-128143 | 6.229 us | }
58118.769303 | 18) <...>-128143 | 8.124 us | }
58118.769303 | 18) <...>-128143 | + 10.872 us | }
58118.769304 | 18) <...>-128143 | + 17.471 us | }
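On the receive side, the IPI lands in smp_call_function_interrupt(), which drains the per-CPU call-function queue and runs the requested callback (here do_flush_tlb_all()) in hard-IRQ context on the interrupted CPU. A condensed sketch of that path, based on upstream kernel/smp.c (again illustrative, not this exact build):

/* Condensed from kernel/smp.c (upstream); illustrative only. */
static void flush_smp_call_function_queue(bool warn_cpu_offline)
{
	struct llist_head *head = this_cpu_ptr(&call_single_queue);
	struct llist_node *entry = llist_del_all(head);
	call_single_data_t *csd, *csd_next;

	entry = llist_reverse_order(entry);

	/* Every queued callback -- here do_flush_tlb_all() -- runs
	 * in hard-IRQ context on the interrupted (isolated) CPU. */
	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
		smp_call_func_t func = csd->func;
		void *info = csd->info;

		csd_unlock(csd);
		func(info);
	}
}

The 17.471 us total in the trace is this queue drain plus the irq_enter()/irq_exit() and nohz tick bookkeeping around it, which is exactly the latency spike the isolated workload observes.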
Version-Release number of selected component (if applicable):
4.18.0-348.12.2.rt7.143.el8_5.x86_64
How reproducible:
Easily
Steps to Reproduce:
1. Boot the system using an RT kernel and the cpu-partitioning tuned profile
2. Run a workload that measures latency, such as oslat, on the isolated CPUs
3. Trace the kernel activity on the isolated CPUs while the workload is running
Actual results:
Latency spikes caused by IPI processing are observed on the isolated CPUs, even though there is no need to handle the IPI at that moment.
Expected results:
No needless IPI processing should occur on the isolated CPUs -- for example, for a 100% userspace workload such as oslat, there is no need to enter the kernel and service the IPI until a necessary kernel entry occurs anyway (i.e. a system call, timer interrupt, etc.).
Additional info:
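To illustrate the deferral described under Expected results, here is a minimal, purely hypothetical sketch -- all identifiers such as deferred_tlb_flush and kernel_entry_check_deferred_flush are invented for illustration, and this is not a proposed patch. The idea: instead of IPIing isolated CPUs immediately, record the pending flush per CPU and consume it at the next natural kernel entry.

/* HYPOTHETICAL sketch only -- these identifiers are invented and do
 * not come from any real kernel tree. */
static DEFINE_PER_CPU(bool, deferred_tlb_flush);

void flush_tlb_kernel_range_deferred(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		if (housekeeping_cpu(cpu, HK_FLAG_TICK)) {
			/* Housekeeping CPUs take the IPI as before. */
			smp_call_function_single(cpu, do_flush_tlb_all,
						 NULL, 1);
		} else {
			/* Isolated CPUs: record the flush instead of
			 * interrupting a 100% userspace workload. */
			per_cpu(deferred_tlb_flush, cpu) = true;
		}
	}
}

/* Would run on every kernel entry (syscall, interrupt, exception)
 * before any vmalloc-area access can happen. */
void kernel_entry_check_deferred_flush(void)
{
	if (this_cpu_read(deferred_tlb_flush)) {
		this_cpu_write(deferred_tlb_flush, false);
		__flush_tlb_all();
	}
}

A real implementation would have to guarantee the pending flush is consumed before any access to the affected kernel mappings, which is what makes this hard in practice; the sketch above only shows the intended behavior for the oslat-style, 100% userspace case.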