Description of problem:
On a system configured for KVM-RT, we can observe that a hog application running under fifo:1 is preempted at least once a second by the ktimersoftd thread. We expect the situation to be worse when running an RT guest, as the preemptions will happen in both the host and the guest (see test results below).
We've debugged this down to the following commit:
Author: Daniel Bristot de Oliveira <firstname.lastname@example.org>
Date: Thu Nov 2 18:33:51 2017 +0100
re-apply Revert "timers: do not raise softirq unconditionally"
This commit is dropping the code that avoids ktimersoftd spurious wake ups from the tick handler. The result is that now the ktimersoftd is woken up at every tick.
We measured the KVM-RT test case with and without the commit above; here are the results:
# Min Latencies: 00005 00010 00012 00010 00010 00010 00012 00010
# Avg Latencies: 00012 00012 00014 00012 00012 00012 00014 00012
# Max Latencies: 00026 00026 00033 00026 00026 00026 00033 00025
# Min Latencies: 00005 00012 00012 00012 00012 00012 00012 00012
# Avg Latencies: 00012 00013 00013 00013 00013 00013 00014 00013
# Max Latencies: 00019 00025 00021 00019 00019 00019 00026 00019
So, we observe a latency increase of around 20% in the worst-case scenario.
Version-Release number of selected component (if applicable): kernel-3.10.0-858.rt56.799.el7.x86_64
Steps to Reproduce:
1. Configure the system for KVM-RT or fully isolate a CPU
2. Run a hog application under fifo:1 pinned to the isolated CPU
3. Trace sched_switch events
A quick test with RHEL 7's kernel-3.10.0-858.el7.x86_64 and the cpu-partitioning profile did not reproduce the issue; that is, I don't see any preemptions at all. However, I ran the test case for only a few minutes.
Let me emphasize that this issue is extremely important. We have a confirmed 20% latency regression over 10-minute runs, and I'd expect that right now RHEL+cpu-partitioning offers better latency for DPDK than the RT kernel does.
Also, for regular RT usage, if the tick runs at 1000 times per second you'll get 1000 preemptions per second.
Here's a reproducer that doesn't require KVM-RT profiles, but still requires you to completely isolate a CPU:
1. Completely isolate a CPU, use nohz_full, etc
2. Run a hog application pinned to that CPU: taskset -c CPU ./hog
3. Trace sched_switch events
On an unfixed kernel, there will be one context switch to the ktimersoftd thread per second. On a fixed kernel, there should be no sched_switch events whatsoever.
I am already reproducing the issue in a local system. Currently checking upstream code/reproducer.
A question, although not related to this BZ: I saw rcuc threads being awakened. To avoid this we can use rcu_nocb_poll, but I do not see it enabled in the realtime-virtual-host profile. Is there a reason for not enabling it?
Can you open a BZ? You can assign it to me.
Because of another issue I'm debugging, I've just traced sched_switch on the host when running KVM-RT test-case and didn't see any context switches to rcuc threads. But in any case it's better to investigate.
Here is a summary of my findings so far.
I could not reproduce the rcuc wakeup anymore, and I do not know why. So I will not file the BZ until I see it again, if I ever do; maybe there was something wrong with my setup.
Using the kernel with the new timer wheel, I still saw ktimersoftd wakeups. But after applying the patches:
[PATCH v4 1/2] timers: Don't wake ktimersoftd on every tick
[PATCH v4 2/2] timers: Don't search for expired timers while TIMER_SOFTIRQ is scheduled
And adding a kernel command line parameter to avoid the timer that checks the stability of the TSC.
The above-mentioned patches are not part of the PREEMPT_RT patch set yet, so there might be newer versions.
I can see very few ktimersoftd wake-ups. Those I do see seem to be legitimate; they rarely take place, and each one comes a long time after the previous one.
My next step is to try to backport the timer wheel to the RHEL-7.6 code.
Does the first patch depend on the new timer wheel?
Unfortunately, yes. These patches improve the timer wheel by forwarding it in interrupt context; the softirq is then raised only if a timer is found to have expired during the forwarding.
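To illustrate the mechanism, here is a toy model of the difference (not the actual kernel code; struct toy_base, tick_old, tick_new, and count_raises are made-up names) over 1000 ticks with a single timer armed at jiffy 500:

```c
#include <stdbool.h>

/* Toy model of a per-CPU timer base: next_expiry is the jiffy at which
 * the earliest armed timer fires (0 = no timer armed). */
struct toy_base {
    unsigned long clk;          /* jiffy the base was last forwarded to */
    unsigned long next_expiry;  /* 0 if no timer is pending */
};

/* Pre-patch behavior: the tick raises TIMER_SOFTIRQ unconditionally,
 * waking ktimersoftd on every tick. */
static bool tick_old(struct toy_base *b, unsigned long jiffies)
{
    b->clk = jiffies;
    return true;
}

/* Post-patch behavior: forward the base in hard-irq context and raise
 * the softirq only if a timer actually expired while forwarding. */
static bool tick_new(struct toy_base *b, unsigned long jiffies)
{
    bool expired = b->next_expiry && b->next_expiry <= jiffies;
    b->clk = jiffies;
    return expired;
}

/* Run 1000 ticks with one timer armed at jiffy 500 and count how often
 * the softirq (and hence a ktimersoftd wakeup) would be raised. */
static int count_raises(bool (*tick)(struct toy_base *, unsigned long))
{
    struct toy_base b = { .clk = 0, .next_expiry = 500 };
    int raises = 0;
    for (unsigned long j = 1; j <= 1000; j++) {
        if (tick(&b, j)) {
            raises++;
            b.next_expiry = 0;  /* the armed timer has now fired */
        }
    }
    return raises;
}
```

With the old handler this model raises the softirq 1000 times per simulated second; with the new one, a single raise when the armed timer expires.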
The dependencies for the timer wheel update, the first half of the solution for the problem described in this Bugzilla ticket, have already been added to kernel-rt-3.10.0-900.rt56.846.el7. These changes are enough to keep us in sync with the changes added to RHEL while respecting the kernel-rt differences.
bb2b1db2d575 timers: Reduce the CPU index space to 256k
34e2660305c9 timers: Use proper base migration in add_timer_on()
9a9ececb8d90 hlist: Add hlist_is_singular_node() helper
0fbdb2309b1a signals: Use hrtimer for sigtimedwait()
f68a3f9917ed timers: Remove the deprecated mod_timer_pinned() API
4da22cacb04c timers, driver/net/ethernet/tile: Initialize the egress timer as pinned
b1aa9f139e6b timers, cpufreq/powernv: Initialize the gpstate timer as pinned
c274cd526b70 timers, x86/apic/uv: Initialize the UV heartbeat timer as pinned
dee6b36a6cc9 timers: Make 'pinned' a timer property
da4f00fe9a87 timer: Minimize nohz off overhead
e79103755f45 timer: Reduce timer migration overhead if disabled (v2)
417bd5c3fdc1 Remove code redundancy while calling get_nohz_timer_target()
dde8ca171b2a timer: Stats: Simplify the flags handling
ce38ad8067ac timer: Replace timer base by a cpu index
6e878157ddce timer: Use timer->base for flag checks
563b84b3a535 tracing: timer: Add deferrable flag to timer_start
66743e988f03 timer: Use hlist for the timer wheel hash buckets
45665e16b954 timer: Remove FIFO "guarantee"
6056d7fc3054 timers: Sanitize catchup_timer_jiffies() usage
9b05865436d7 timer: Put usleep_range into the __sched section
2e80b77a26ca timer: Remove pointless return value of do_usleep_range()
c57ba680b598 timer: Further simplify the SMP and HOTPLUG logic
9960cc117fd1 timer: Don't initialize 'tvec_base' on hotplug
c0327baf7493 timer: Allocate per-cpu tvec_base's statically
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.
We already have the main dependencies for this patch set in our kernel, but the problem is still present in the upstream kernel.
We are working with the upstream timer/RT kernel maintainer to find a solution, and this is a high-priority BZ for us.
However, this is a very complex problem, and it will take some time to find an upstream solution.
Doc text reviewed; it is fine.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.