Bug 1550584

Summary: spurious ktimersoftd wake-ups increase latency (rhel-rt 7)
Product: Red Hat Enterprise Linux 7 Reporter: Luiz Capitulino <lcapitulino>
Component: kernel-rt Assignee: Daniel Bristot de Oliveira <daolivei>
kernel-rt sub component: Other QA Contact: Mike Stowell <mstowell>
Status: CLOSED ERRATA Docs Contact: Sujata Kurup <skurup>
Severity: high    
Priority: high CC: aklimov, bhu, chhudson, daolivei, fherrman, fiezzi, jklech, jraju, jreznik, lgoncalv, lmanasko, mstowell, mtosatti, pezhang, ronaldo.mercado, williams, yicwang
Version: 7.5 Keywords: Regression, Reopened
Target Milestone: rc   
Target Release: 7.8   
Hardware: x86_64   
OS: Linux   
Fixed In Version: kernel-rt-3.10.0-1063.rt56.1023.el7 Doc Type: Bug Fix
Doc Text:
.The latency for isolated CPUs is now reduced by avoiding spurious `ktimersoftd` activation
Previously, on a system configured for KVM-RT, the per-CPU `ktimersoftd` kernel threads ran once every second even in the absence of a timer. Consequently, latency increased on the isolated CPUs. This update adds an optimization to the real-time kernel so that `ktimersoftd` is not woken on every tick. As a result, `ktimersoftd` is not raised on isolated CPUs, which prevents the interference and reduces the latency.
Story Points: ---
Clone Of:
: 1723499 1723502 1942495 Environment:
Last Closed: 2020-03-31 19:48:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On: 1568294, 1593361    
Bug Blocks: 1525647, 1672377, 1678810, 1690543, 1693411, 1723499, 1723502, 1942495    

Description Luiz Capitulino 2018-03-01 14:21:55 UTC
Description of problem:

On a system configured for KVM-RT, we can observe that a hog application running under fifo:1 is preempted at least once a second by the ktimersoftd thread. We expect the situation to be worse when running an RT guest, as the preemptions will happen in host and guest (see test results below).

We've debugged this down to the following commit:

commit 4ea0288128fa782063f4742651a545dd74396ec6
Author: Daniel Bristot de Oliveira <bristot>
Date:   Thu Nov 2 18:33:51 2017 +0100

    re-apply Revert "timers: do not raise softirq unconditionally"

This commit is dropping the code that avoids ktimersoftd spurious wake ups from the tick handler. The result is that now the ktimersoftd is woken up at every tick.

We measured the KVM-RT test case with and without the commit above. Here are the results:

 Before revert:

  # Min Latencies: 00005 00010 00012 00010 00010 00010 00012 00010
  # Avg Latencies: 00012 00012 00014 00012 00012 00012 00014 00012
  # Max Latencies: 00026 00026 00033 00026 00026 00026 00033 00025

 After revert:

  # Min Latencies: 00005 00012 00012 00012 00012 00012 00012 00012
  # Avg Latencies: 00012 00013 00013 00013 00013 00013 00014 00013
  # Max Latencies: 00019 00025 00021 00019 00019 00019 00026 00019

So, we observe an increase in latency of around 20% in the worst case scenario.
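
For anyone who wants to redo the arithmetic, here is a minimal sketch that takes only the two "Max Latencies" rows above and compares the worst value in each. The helper name `worst_case` is made up for this note, and which of the two ratios corresponds to the "around 20%" figure is an assumption on my part, so both are printed.

```python
# Max latency rows copied from the two result tables above
# (units are whatever the test reports; the rows do not state them).
before_revert = "00026 00026 00033 00026 00026 00026 00033 00025"
after_revert = "00019 00025 00021 00019 00019 00019 00026 00019"


def worst_case(row: str) -> int:
    """Return the largest per-CPU maximum latency in a results row."""
    return max(int(field) for field in row.split())


worst_before = worst_case(before_revert)
worst_after = worst_case(after_revert)

# Express the same gap relative to each of the two runs.
relative_to_before = (worst_before - worst_after) / worst_before
relative_to_after = (worst_before - worst_after) / worst_after

print(f"worst case: {worst_before} vs {worst_after}")
print(f"gap relative to the higher value: {relative_to_before:.1%}")
print(f"gap relative to the lower value:  {relative_to_after:.1%}")
```

With these rows the worst cases are 33 and 26, which gives 21.2% relative to the higher value and 26.9% relative to the lower one.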

Version-Release number of selected component (if applicable): kernel-3.10.0-858.rt56.799.el7.x86_64

How reproducible:

Steps to Reproduce:
1. Configure the system for KVM-RT or fully isolate a CPU
2. Run a hog application under fifo:1 pinned to the isolated CPU
3. Trace sched_switch events
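
In case a hog application is not at hand for step 2, the following is a rough stand-in, not the program used for the measurements above. It assumes Linux and Python 3; the function name `hog` and its parameters are invented for this sketch. Setting SCHED_FIFO needs root or CAP_SYS_NICE, so the priority is optional.

```python
import os
import time
from typing import Optional


def hog(cpu: int, seconds: float, fifo_priority: Optional[int] = None) -> int:
    """Pin the calling process to `cpu`, spin for `seconds`, return the loop count."""
    os.sched_setaffinity(0, {cpu})  # pid 0 means "the calling process"
    if fifo_priority is not None:
        # fifo:1 in the report corresponds to SCHED_FIFO with priority 1.
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(fifo_priority))
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # busy loop: never sleeps or blocks voluntarily
    return iterations
```

Calling `hog(cpu=3, seconds=600, fifo_priority=1)` as root would occupy CPU 3 (a hypothetical isolated CPU) for ten minutes while sched_switch events are traced.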

Comment 4 Luiz Capitulino 2018-03-01 16:37:20 UTC
Quickly tested with RHEL7's kernel-3.10.0-858.el7.x86_64 and the cpu-partitioning profile and can't reproduce; that is, I don't see any preemptions at all. However, I ran the test case for only a few minutes.

Comment 5 Luiz Capitulino 2018-03-02 21:13:34 UTC
Let me emphasize that this issue is extremely important. We have a confirmed 20% worse latency for 10-minute runs, and I'd expect that right now RHEL+cpu-partitioning offers better latency for DPDK than the RT kernel.

Also, for regular RT usage, if the tick fires 1000 times per second, you'll have 1000 preemptions per second.

Comment 6 Luiz Capitulino 2018-03-05 14:21:00 UTC
Here's a reproducer that doesn't require KVM-RT profiles, but still requires you to completely isolate a CPU:

1. Completely isolate a CPU, use nohz_full, etc
2. Run a hog application pinned to that CPU: taskset -c CPU ./hog
3. Trace sched_switch events

On an unfixed kernel, there will be one context switch to the ktimersoftd thread per second. On a fixed system, there should be no sched_switch events whatsoever.
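
To check step 3 without reading the trace by eye, a small counter along these lines could be used. It is only a sketch: it assumes the textual ftrace layout for sched_switch (a `[cpu]` field followed by `next_comm=...`), and the sample lines, PIDs, priorities and the function name `count_ktimersoftd_switches` are all made up for illustration.

```python
import re
from collections import Counter

# Matches the CPU field and the incoming task name of a sched_switch line.
SWITCH_RE = re.compile(
    r"\[(?P<cpu>\d+)\].*\bsched_switch:.*\bnext_comm=(?P<next>\S+)"
)


def count_ktimersoftd_switches(trace_lines):
    """Count, per CPU, the context switches whose next task is ktimersoftd."""
    counts = Counter()
    for line in trace_lines:
        match = SWITCH_RE.search(line)
        if match and match.group("next").startswith("ktimersoftd"):
            counts[int(match.group("cpu"))] += 1
    return counts


# Invented sample: two switches to ktimersoftd/3 about one second apart.
sample = [
    "hog-4321 [003] d...2.. 100.000123: sched_switch: prev_comm=hog prev_pid=4321 "
    "prev_prio=98 prev_state=R ==> next_comm=ktimersoftd/3 next_pid=40 next_prio=96",
    "ktimersoftd/3-40 [003] d...2.. 100.000140: sched_switch: prev_comm=ktimersoftd/3 "
    "prev_pid=40 prev_prio=96 prev_state=S ==> next_comm=hog next_pid=4321 next_prio=98",
    "hog-4321 [003] d...2.. 101.000125: sched_switch: prev_comm=hog prev_pid=4321 "
    "prev_prio=98 prev_state=R ==> next_comm=ktimersoftd/3 next_pid=40 next_prio=96",
]

print(dict(count_ktimersoftd_switches(sample)))
```

A non-empty result for the isolated CPU would match the unfixed behaviour described above; an empty result is what a fixed system should give.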

Comment 7 Daniel Bristot de Oliveira 2018-03-05 15:38:29 UTC
Hi Luiz,

I am already reproducing the issue on a local system. Currently checking upstream code/reproducer.

I have a question: It is not related to this BZ, but I saw rcuc threads being awakened. To avoid this, we can use rcu_nocb_poll, but I do not see this being enabled in the realtime-virtual-host. Is there a reason for it not being enabled?

-- Daniel

Comment 8 Luiz Capitulino 2018-03-05 15:45:32 UTC
Can you open a BZ? You can assign it to me.

Because of another issue I'm debugging, I've just traced sched_switch on the host when running the KVM-RT test case and didn't see any context switches to rcuc threads. But in any case it's better to investigate.

Comment 9 Daniel Bristot de Oliveira 2018-03-08 16:34:31 UTC
Here is a summary of my findings so far.

I could not reproduce the rcuc wakeup anymore. I do not know why. So I will not file the BZ until I see this again, if I see... maybe there was something wrong with my setup.

Using the kernel with the new timer wheel, I still saw the wakeup of the ktimersoftd. But after applying the patches:

[PATCH v4 1/2] timers: Don't wake ktimersoftd on every tick
[PATCH v4 2/2] timers: Don't search for expired timers while TIMER_SOFTIRQ is scheduled

And adding the following kernel command line parameter:


to avoid the timer that checks the stability of the TSC.

The above-mentioned patches are not part of the PREEMPT_RT patch set yet, so there might be newer versions.

I can see very few ktimersoftd wake-ups. Those that I see seem to be legitimate and occur rarely, each one a long time after the previous one.

My next step is to try to backport the timer wheel to the RHEL-7.6 code.

-- Daniel

Comment 10 Luiz Capitulino 2018-03-08 17:16:03 UTC

Does the first patch depend on the new timer wheel?

Comment 11 Daniel Bristot de Oliveira 2018-03-09 08:24:36 UTC
Hi Luiz,

Unfortunately, yes. These patches improve the timer wheel by forwarding it in the interrupt context. The softirq is then raised only if an armed timer is found in the forwarding process.

-- Daniel
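
As a rough illustration of the difference discussed in this thread, the toy model below counts how often the timer softirq would be raised when it is raised on every tick versus only on ticks where a timer actually expires. This is not kernel code and does not model the timer wheel itself; the function name, the tick numbers and the timer list are invented for the example.

```python
def count_softirq_raises(total_ticks, timer_expiries, unconditional):
    """Count the ticks that raise the timer softirq in this toy model.

    timer_expiries holds the tick numbers at which some timer expires.
    """
    expiring = set(timer_expiries)
    raises = 0
    for tick in range(total_ticks):
        # Unconditional: every tick raises the softirq.
        # Conditional: only ticks with an expiring timer do.
        if unconditional or tick in expiring:
            raises += 1
    return raises


HZ = 1000  # ticks in one second, the figure mentioned in comment 5
timers = [10, 10, 500]  # three timers, two of them expiring on the same tick

print(count_softirq_raises(HZ, timers, unconditional=True))
print(count_softirq_raises(HZ, timers, unconditional=False))
```

In this model the unconditional variant raises the softirq 1000 times in one second, while the conditional variant raises it twice, once per tick that has an expiring timer.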

Comment 12 Luis Claudio R. Goncalves 2018-06-12 13:45:04 UTC
The dependencies for the timer wheel update, first half of the solution for the problem described in this bugzilla ticket, have already been added to kernel-rt-3.10.0-900.rt56.846.el7. These changes are enough to keep us in sync with the changes added to RHEL, respecting the kernel-rt differences.

Changes added:

bb2b1db2d575 timers: Reduce the CPU index space to 256k
34e2660305c9 timers: Use proper base migration in add_timer_on()
9a9ececb8d90 hlist: Add hlist_is_singular_node() helper
0fbdb2309b1a signals: Use hrtimer for sigtimedwait()
f68a3f9917ed timers: Remove the deprecated mod_timer_pinned() API
4da22cacb04c timers, driver/net/ethernet/tile: Initialize the egress timer as pinned
b1aa9f139e6b timers, cpufreq/powernv: Initialize the gpstate timer as pinned
c274cd526b70 timers, x86/apic/uv: Initialize the UV heartbeat timer as pinned
dee6b36a6cc9 timers: Make 'pinned' a timer property
da4f00fe9a87 timer: Minimize nohz off overhead
e79103755f45 timer: Reduce timer migration overhead if disabled (v2)
417bd5c3fdc1 Remove code redundancy while calling get_nohz_timer_target()
dde8ca171b2a timer: Stats: Simplify the flags handling
ce38ad8067ac timer: Replace timer base by a cpu index
6e878157ddce timer: Use timer->base for flag checks
563b84b3a535 tracing: timer: Add deferrable flag to timer_start
66743e988f03 timer: Use hlist for the timer wheel hash buckets
45665e16b954 timer: Remove FIFO "guarantee"
6056d7fc3054 timers: Sanitize catchup_timer_jiffies() usage
9b05865436d7 timer: Put usleep_range into the __sched section
2e80b77a26ca timer: Remove pointless return value of do_usleep_range()
c57ba680b598 timer: Further simplify the SMP and HOTPLUG logic
9960cc117fd1 timer: Don't initialize 'tvec_base' on hotplug
c0327baf7493 timer: Allocate per-cpu tvec_base's statically

Comment 13 Red Hat Bugzilla Rules Engine 2018-06-27 14:12:30 UTC
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.

Comment 20 Daniel Bristot de Oliveira 2019-03-06 10:33:54 UTC

We have the main dependencies for this patch set already in our kernel. But, the problem is still present in the upstream kernel.
We are working with the kernel maintainer (Timer/RT) to find a solution, and this is a high priority BZ for us.
However, this is a very complex problem and will require some time to find an upstream solution.

-- Daniel

Comment 39 Daniel Bristot de Oliveira 2019-11-15 12:23:06 UTC
Doc text reviewed; it is fine.


Comment 49 errata-xmlrpc 2020-03-31 19:48:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.