Red Hat Bugzilla – Bug 1401061
RFE: Improve RT throttling mechanism
Last modified: 2018-04-10 05:09:31 EDT
Description of problem:

Currently, we have two throttling modes:

  With RT_RUNTIME_SHARE (default): before throttling, try to borrow some runtime from another CPU.
  Without RT_RUNTIME_SHARE: throttle the RT task, even if there is nothing else to do.

The problem with the first is that a CPU can easily borrow enough runtime to let a spinning RT task run forever, starving the non-RT tasks and thereby defeating the mechanism.

The problem with the second is that (with the default values) the CPU will be idle 5% of the time when there is nothing else to run, wasting CPU time.

So neither solution is perfect. Daniel Bristot suggested a new option for the RT throttling, the RT_RUNTIME_GREED sched feature. The description of the feature is:

------------------------%<-------------
The rt throttling mechanism prevents the starvation of non-real-time tasks by CPU intensive real-time tasks. In terms of percentage, the default behavior allows real-time tasks to run up to 95% of a given period, leaving the other 5% of the period for non-real-time tasks. In the absence of non-rt tasks, the system goes idle for 5% of the period.

Although this behavior works fine for the purpose of avoiding bad real-time tasks that can hang the system, some greedy users want to allow the real-time task to continue running when no non-real-time tasks are starving. In other words, they do not want to see the system going idle.

This patch implements the RT_RUNTIME_GREED scheduler feature for greedy users (TM). When enabled, this feature will check if non-rt tasks are starving before throttling the real-time task. If the real-time task becomes throttled, it will be unthrottled as soon as the system goes idle, or when the next period starts, whichever comes first.

This feature is enabled with the following command:

  # echo RT_RUNTIME_GREED > /sys/kernel/debug/sched_features

The user might also want to disable the RT_RUNTIME_SHARE logic, to keep all CPUs with the same rt_runtime:

  # echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features

With these two options set, the user will guarantee some runtime for non-rt tasks on all CPUs, while keeping real-time tasks running as much as possible.
------------------------>%-------------

Unfortunately, this option was rejected by Peterz, who wants a more complete solution using a deadline server, such as hierarchical scheduling of non-real-time tasks inside a deadline task. Here is peterz's reply:

https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1266868.html

There were some discussions about the implementation of the deadline server, but it will certainly take some time. There is an internal consensus that Daniel's proposal is an acceptable workaround for our customers while waiting for the definitive solution.

So the plan is to use the RT_RUNTIME_GREED sched feature in the real-time kernel until the definitive upstream solution is available.
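The decision described in the patch text above can be sketched as a small user-space model. This is only an illustration of the described behavior, not the kernel patch itself: the struct `rt_rq_model`, its fields, and the functions `should_throttle` and `should_unthrottle` are hypothetical names, and the 950000 us budget mentioned in the comment is an assumption matching the 95% figure with a 1 s period.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model of one CPU's RT runqueue; not kernel code. */
struct rt_rq_model {
	uint64_t rt_time_us;      /* RT time consumed in the current period */
	uint64_t rt_runtime_us;   /* RT budget per period, e.g. 950000 (assumed) */
	bool greed;               /* models RT_RUNTIME_GREED being enabled */
	unsigned int nonrt_runnable; /* non-RT tasks waiting on this CPU */
};

/* Returns true when the RT task should be throttled. */
static bool should_throttle(const struct rt_rq_model *rq)
{
	assert(rq != NULL);

	if (rq->rt_time_us < rq->rt_runtime_us)
		return false;   /* budget not exhausted yet */

	if (rq->greed && rq->nonrt_runnable == 0)
		return false;   /* nothing is starving: keep running */

	return true;
}

/*
 * Returns true when a throttled RT task may run again: at the start of
 * the next period, or (with greed) as soon as the CPU would go idle.
 */
static bool should_unthrottle(const struct rt_rq_model *rq,
			      bool cpu_idle, bool new_period)
{
	assert(rq != NULL);

	if (new_period)
		return true;

	return rq->greed && cpu_idle;
}
```

In this model, a CPU with an exhausted budget and no runnable non-RT task is throttled only when `greed` is false, which corresponds to the 5% idle time that the feature is meant to avoid.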
Hey, BNP complained about this thread being blocked because of a CPU with a spinning -rt task:

crash> bt 6949
PID: 6949   TASK: ffff880418466300  CPU: 10  COMMAND: "force"
 #0 [ffff8800b8793918] __schedule at ffffffff815f31dc
 #1 [ffff8800b87939b0] schedule at ffffffff815f38f4
 #2 [ffff8800b87939d0] wait_transaction_locked at ffffffffa0311a05 [jbd2]
 #3 [ffff8800b8793a40] add_transaction_credits at ffffffffa0311e89 [jbd2]
 #4 [ffff8800b8793ac0] start_this_handle at ffffffffa0312131 [jbd2]
 #5 [ffff8800b8793b60] jbd2__journal_start at ffffffffa0312640 [jbd2]
 #6 [ffff8800b8793bc0] __ext4_journal_start_sb at ffffffffa0370889 [ext4]
 #7 [ffff8800b8793c10] ext4_dirty_inode at ffffffffa0341934 [ext4]
 #8 [ffff8800b8793c30] __mark_inode_dirty at ffffffff811dbd9b
 #9 [ffff8800b8793c60] update_time at ffffffff811c8d41
#10 [ffff8800b8793c90] file_update_time at ffffffff811c8e28
#11 [ffff8800b8793cf0] __generic_file_aio_write at ffffffff8114c028
#12 [ffff8800b8793d80] generic_file_aio_write at ffffffff8114c2b5
#13 [ffff8800b8793dd0] ext4_file_write at ffffffffa0339954 [ext4]
#14 [ffff8800b8793e10] do_sync_write at ffffffff811acdff
#15 [ffff8800b8793ef0] vfs_write at ffffffff811ad31f
#16 [ffff8800b8793f20] sys_write at ffffffff811addd0
#17 [ffff8800b8793f80] tracesys at ffffffff815fdca8 (via system_call)
    RIP: 0000003eb900e6fd  RSP: 00007fdf14f02d60  RFLAGS: 00000293
    RAX: ffffffffffffffda  RBX: ffffffff815fdca8  RCX: ffffffffffffffff
    RDX: 000000000000001f  RSI: 00000000007b9a0c  RDI: 0000000000000022
    RBP: 0000000000000022   R8: 00000000007b99d0   R9: 00000000000001f0
    R10: 00007fdf20909718  R11: 0000000000000293  R12: 00007fdeec1fbe10
    R13: 00007fdf14f02db0  R14: 000000000000001f  R15: 00000000007b9a0c
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

This is the old BZ about not being able to keep a jbd2 thread off an isolated CPU: BZ1306341.

One possible workaround for this problem is to add the patch suggested in this BZ. The other would be to try to make the jbd2 per-cpu kworkers not be per-cpu. But that would be really complex.
How to reproduce the problem:

1) Prepare a busy-loop task, like:

f.c:
------------- %< ------------------
int main (void)
{
	for(;;);
}
------------- >% -------------------

  # gcc -o rt f.c
  # gcc -o nonrt f.c

2) Disable RT runtime sharing:

  # echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features

3) Run the "rt" busy-loop task with the FIFO policy, pinned to a CPU, for instance CPU 1:

  # taskset -c 1 chrt -f 1 ./rt &

4) Check the CPU 1 usage; it should show about 95% busy with the "rt" task and about 5% idle.

5) Then, enable the RT_RUNTIME_GREED feature:

  # echo RT_RUNTIME_GREED > /sys/kernel/debug/sched_features

and check the CPU 1 usage again; now the "rt" task should be taking about 100% of the CPU time. The system should be able to run for a long period without problems such as hung tasks caused by the busy-loop task (that is the feature implemented by this patch).

6) Finally, run the "nonrt" task on CPU 1 as a non-RT task:

  # taskset -c 1 ./nonrt &

Now the "rt" task should be taking 95% and the "nonrt" task 5%.
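The 95% and 5% figures expected in steps 4 and 6 follow from the ratio between the RT runtime and the RT period. A minimal sketch of that arithmetic, assuming the usual defaults of 950000 us of runtime per 1000000 us period (the `sched_rt_runtime_us` and `sched_rt_period_us` sysctls, which this report does not mention explicitly); the function name `rt_share_percent` is hypothetical:

```c
#include <assert.h>

/*
 * Percentage of each period that RT tasks may consume before being
 * throttled. A negative runtime models throttling being disabled.
 */
static double rt_share_percent(long runtime_us, long period_us)
{
	assert(period_us > 0);

	if (runtime_us < 0 || runtime_us >= period_us)
		return 100.0;

	return 100.0 * (double)runtime_us / (double)period_us;
}
```

With the assumed defaults, `rt_share_percent(950000, 1000000)` gives 95, leaving the remaining 5% of each period either idle (step 4) or for the "nonrt" task (step 6).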
Created attachment 1298514: [RT PATCH] sched/rt: RT_RUNTIME_GREED sched feature
Patch posted to the internal list: http://post-office.corp.redhat.com/archives/kernel-rt-team/2017-July/msg00005.html
Patch merged in version 3.10.0-695.rt56.620.
Hello all, the 7.5 flag is not required, as kernel-rt is approved directly for z-stream.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:0676