Bug 1401061

Summary:

RFE: Improve RT throttling mechanism

Product:

Red Hat Enterprise Linux 7

Reporter:

Daniel Bristot de Oliveira <daolivei>

Component:

kernel-rt

Assignee:

Daniel Bristot de Oliveira <daolivei>

kernel-rt sub component:

Memory Management

QA Contact:

Jiri Kastner <jkastner>

Status:

CLOSED ERRATA

Docs Contact:

Jana Heves <jsvarova>

Severity:

medium

Priority:

high

CC:

bhu, cww, daolivei, dhoward, mkolaja, salmy, stalexan, toneata, williams

Version:

7.4

Keywords:

FutureFeature, ZStream

Target Milestone:

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

Enhancement

Doc Text:

Improved RT throttling mechanism

The current real-time throttling mechanism prevents the starvation of non-real-time tasks by CPU-intensive real-time tasks. When a real-time run queue is throttled, non-real-time tasks are allowed to run or, if there are none, the CPU goes idle. To safely maximize CPU usage by decreasing the CPU idle time, the "RT_RUNTIME_GREED" scheduler feature has been implemented. When enabled, this feature checks whether non-real-time tasks are starving before throttling the real-time task. As a result, the "RT_RUNTIME_GREED" scheduler option guarantees some run time on all CPUs for the non-real-time tasks, while keeping the real-time tasks running as much as possible.

Story Points:

---

Clone Of:

Clones:

1505158

Environment:

Last Closed:

2018-04-10 09:07:09 UTC

Type:

Bug

Bug Depends On:

Bug Blocks:

1420851, 1442258, 1505158

Attachments:

Description	Flags
[RT PATCH] sched/rt: RT_RUNTIME_GREED sched feature	none

Description Daniel Bristot de Oliveira 2016-12-02 16:29:01 UTC

Description of problem:

Currently, we have two throttling modes:

With RT_RUNTIME_SHARE (default):
before throttling, try to borrow some runtime from another CPU.

Without RT_RUNTIME_SHARE (that is, NO_RT_RUNTIME_SHARE):
throttle the RT task, even if there is nothing else to do.

The problem with the first is that a CPU can easily borrow enough runtime to
let a spinning RT task run forever, starving the
non-RT tasks and hence defeating the mechanism.

The problem with the second is that (with the default values) the CPU will
be idle 5% of the time, wasting CPU time.

So neither solution is perfect.
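
As a rough illustration of where the 5% comes from: the throttling limits are set by the kernel.sched_rt_runtime_us and kernel.sched_rt_period_us sysctls, whose usual defaults are 950000 and 1000000. Those names and values are not stated in this report; they are assumed here from the stock kernel defaults. The helper name below is hypothetical.

```shell
# Sketch: percentage of each period that RT throttling leaves for non-RT tasks.
# non_rt_share_percent is a hypothetical helper, not part of any tool.
# Arguments: RT runtime and RT period, both in microseconds.
non_rt_share_percent() {
    runtime_us=$1
    period_us=$2
    # A runtime of -1 means RT throttling is disabled: nothing is reserved.
    if [ "$runtime_us" -lt 0 ]; then
        echo 0
        return 0
    fi
    echo $(( (period_us - runtime_us) * 100 / period_us ))
}

# Assumed defaults: sched_rt_runtime_us=950000, sched_rt_period_us=1000000.
non_rt_share_percent 950000 1000000    # prints 5
```

On a live system the two arguments could be read from /proc/sys/kernel/sched_rt_runtime_us and /proc/sys/kernel/sched_rt_period_us.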

Daniel Bristot suggested a new option for the rt throttling, the RT_RUNTIME_GREED sched feature.

The description of the feature is:

------------------------%<-------------
The rt throttling mechanism prevents the starvation of non-real-time
tasks by CPU intensive real-time tasks. In terms of percentage,
the default behavior allows real-time tasks to run up to 95% of a
given period, leaving the other 5% of the period for non-real-time
tasks. In the absence of non-rt tasks, the system goes idle for 5%
of the period.

Although this behavior works fine for the purpose of avoiding
bad real-time tasks that can hang the system, some greedy users
want to allow the real-time task to continue running as long as
no non-real-time tasks are starving. In other words, they do not want to
see the system going idle.

This patch implements the RT_RUNTIME_GREED scheduler feature for greedy
users (TM). When enabled, this feature will check if non-rt tasks are
starving before throttling the real-time task. If the real-time task
becomes throttled, it will be unthrottled as soon as the system goes
idle, or when the next period starts, whichever comes first.

This feature is enabled with the following command:
# echo RT_RUNTIME_GREED > /sys/kernel/debug/sched_features

The user might also want to disable the RT_RUNTIME_SHARE logic,
to keep all CPUs with the same rt_runtime.
# echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features

With these two options set, the user will guarantee some runtime
for non-rt-tasks on all CPUs, while keeping real-time tasks running
as much as possible.
------------------------>%-------------
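
To confirm that both options took effect, the content of /sys/kernel/debug/sched_features can be inspected: it is a space-separated list in which disabled features carry a NO_ prefix. The following is a minimal sketch of such a check. The helper name is hypothetical, and the sample line is made up for illustration rather than captured from a real system.

```shell
# Sketch: report the state of one scheduler feature from a sched_features line.
# feature_state is a hypothetical helper.
# Usage: feature_state FEATURE "CONTENT_OF_SCHED_FEATURES"
# Prints "on", "off" (NO_ prefix present), or "unknown" (feature not listed).
feature_state() {
    feature=$1
    for word in $2; do
        case "$word" in
            "$feature")    echo on;  return 0 ;;
            "NO_$feature") echo off; return 0 ;;
        esac
    done
    echo unknown
}

# Illustrative sample only; a real kernel lists many more features.
sample="GENTLE_FAIR_SLEEPERS NO_RT_RUNTIME_SHARE RT_RUNTIME_GREED"
feature_state RT_RUNTIME_GREED "$sample"    # prints on
feature_state RT_RUNTIME_SHARE "$sample"    # prints off
```

On a live system the second argument would be "$(cat /sys/kernel/debug/sched_features)", which requires root and a mounted debugfs.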

Unfortunately, this option was rejected by Peterz, who wants
a more complete solution using a deadline server, such as
hierarchical scheduling of non-real-time tasks inside a deadline task.

Here is peterz's reply:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1266868.html

There were some discussions about the implementation of the deadline server,
but it will certainly take some time.

There is an internal consensus that Daniel's proposal is an acceptable
workaround for our customers while waiting for the
definitive solution.

So, the plan is to use the RT_RUNTIME_GREED sched feature in the real-time
kernel until a definitive upstream solution is available.

Comment 1 Daniel Bristot de Oliveira 2017-04-18 08:35:12 UTC

Hey,

BNP complained about this thread being blocked because of a CPU with a spinning RT task:

crash> bt 6949
PID: 6949   TASK: ffff880418466300  CPU: 10  COMMAND: "force"
 #0 [ffff8800b8793918] __schedule at ffffffff815f31dc
 #1 [ffff8800b87939b0] schedule at ffffffff815f38f4
 #2 [ffff8800b87939d0] wait_transaction_locked at ffffffffa0311a05 [jbd2]
 #3 [ffff8800b8793a40] add_transaction_credits at ffffffffa0311e89 [jbd2]
 #4 [ffff8800b8793ac0] start_this_handle at ffffffffa0312131 [jbd2]
 #5 [ffff8800b8793b60] jbd2__journal_start at ffffffffa0312640 [jbd2]
 #6 [ffff8800b8793bc0] __ext4_journal_start_sb at ffffffffa0370889 [ext4]
 #7 [ffff8800b8793c10] ext4_dirty_inode at ffffffffa0341934 [ext4]
 #8 [ffff8800b8793c30] __mark_inode_dirty at ffffffff811dbd9b
 #9 [ffff8800b8793c60] update_time at ffffffff811c8d41
#10 [ffff8800b8793c90] file_update_time at ffffffff811c8e28
#11 [ffff8800b8793cf0] __generic_file_aio_write at ffffffff8114c028
#12 [ffff8800b8793d80] generic_file_aio_write at ffffffff8114c2b5
#13 [ffff8800b8793dd0] ext4_file_write at ffffffffa0339954 [ext4]
#14 [ffff8800b8793e10] do_sync_write at ffffffff811acdff
#15 [ffff8800b8793ef0] vfs_write at ffffffff811ad31f
#16 [ffff8800b8793f20] sys_write at ffffffff811addd0
#17 [ffff8800b8793f80] tracesys at ffffffff815fdca8 (via system_call)
    RIP: 0000003eb900e6fd  RSP: 00007fdf14f02d60  RFLAGS: 00000293
    RAX: ffffffffffffffda  RBX: ffffffff815fdca8  RCX: ffffffffffffffff
    RDX: 000000000000001f  RSI: 00000000007b9a0c  RDI: 0000000000000022
    RBP: 0000000000000022   R8: 00000000007b99d0   R9: 00000000000001f0
    R10: 00007fdf20909718  R11: 0000000000000293  R12: 00007fdeec1fbe10
    R13: 00007fdf14f02db0  R14: 000000000000001f  R15: 00000000007b9a0c
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b


This is the old BZ about it not being possible to keep a jbd2 thread off an isolated CPU: BZ1306341.

One possible workaround for this problem is to add the patch suggested in this BZ.

The other would be to make the jbd2 per-CPU kworkers not per-CPU, but that would be really complex.

Comment 3 Daniel Bristot de Oliveira 2017-07-14 16:36:27 UTC

How to reproduce the problem:

1) prepare a busy-loop task, like:

f.c:
------------- %< ------------------
/* Busy loop: spin forever so the task never sleeps or yields the CPU. */
int main(void)
{
	for (;;)
		;
}
------------- >% -------------------

# gcc -o rt f.c
# gcc -o nonrt f.c

2) disable rt runtime sharing

# echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features

3) run the "rt" busy loop task, in the FIFO policy, pinned to a CPU,
for instance, CPU 1:

# taskset -c 1 chrt -f 1 ./rt &

4) check the CPU 1 usage; it should show about 95% busy with the "rt" task,
and about 5% idle.

5) Then, enable the RT_RUNTIME_GREED feature:

# echo RT_RUNTIME_GREED > /sys/kernel/debug/sched_features

and check the CPU 1 usage again; the "rt" task should now be taking about 100% of CPU
time.

The system should be able to run for a long period without causing
problems like hung tasks because of the busy-loop task.

(that is the feature implemented by this patch)

6) Finally, run the "nonrt" task on CPU 1 as a non-RT task:

# taskset -c 1 ./nonrt &

Now, the "rt" task should be taking 95% and the "nonrt" 5%.
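
The per-CPU usage checks in steps 4 to 6 can be done with any monitoring tool. As one possible approach, which this report does not prescribe, the idle percentage of a CPU can be estimated from two samples of its line in /proc/stat. The helper names and the sample lines below are hypothetical.

```shell
# Sketch: idle percentage of one CPU between two samples of its /proc/stat line.
# stat_total, stat_idle and cpu_idle_percent are hypothetical helpers.
# Line layout: cpuN user nice system idle iowait irq softirq steal guest guest_nice

# Sum the first eight counters (guest time is already included in user).
stat_total() {
    set -- $1
    shift    # drop the "cpuN" label
    echo $(( $1 + $2 + $3 + $4 + $5 + $6 + $7 + $8 ))
}

# The idle counter is the fifth word of the line (label plus four counters).
stat_idle() {
    set -- $1
    echo "$5"
}

# Usage: cpu_idle_percent "LINE_BEFORE" "LINE_AFTER"
cpu_idle_percent() {
    total=$(( $(stat_total "$2") - $(stat_total "$1") ))
    idle=$(( $(stat_idle "$2") - $(stat_idle "$1") ))
    # No elapsed ticks between the samples: report 0 instead of dividing by zero.
    if [ "$total" -le 0 ]; then
        echo 0
        return 0
    fi
    echo $(( idle * 100 / total ))
}

# Synthetic samples, not taken from a real run: 95 busy ticks, 5 idle ticks.
before="cpu1 1000 0 0 100 0 0 0 0 0 0"
after="cpu1 1095 0 0 105 0 0 0 0 0 0"
cpu_idle_percent "$before" "$after"    # prints 5
```

On a live system the two samples could be taken with grep '^cpu1 ' /proc/stat a few seconds apart. With RT_RUNTIME_GREED enabled and no "nonrt" task running, the result would be expected to drop to around 0.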

Comment 4 Daniel Bristot de Oliveira 2017-07-14 16:59:10 UTC

Created attachment 1298514
[RT PATCH] sched/rt: RT_RUNTIME_GREED sched feature

Comment 5 Daniel Bristot de Oliveira 2017-07-18 12:26:12 UTC

Patch posted to the internal list:

http://post-office.corp.redhat.com/archives/kernel-rt-team/2017-July/msg00005.html

Comment 6 Daniel Bristot de Oliveira 2017-08-14 11:49:23 UTC

Patch merged in version 3.10.0-695.rt56.620.

Comment 12 Oneata Mircea Teodor 2017-09-21 07:33:32 UTC

Hello All,
The 7.5 flag is not required, as kernel-rt is approved directly for z-stream.

Comment 21 errata-xmlrpc 2018-04-10 09:07:09 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0676