Description of problem:
When cyclictests are running in one realtime guest, reboot or halts of other guests affects the latency spikes reported. Ideally the maximum allowed latency of realtime kernels should be less than 20us.When a halt or a reboot occurs the spikes reported are > 90 us.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1.start 2 realtime VMs
2.run-cyclictest on one guest for 15m . Wait for 5m
3.shutdown the other guest.
4. A spike is seen in the cyclictest results.
The Max Latency reported >90 us
The Max latency should be <20 us
This is very reproducible to me, but only when the other guest halts. I could not reproduce it when the other guest reboots. Did you really get the spike when the other guest reboots (for example, but typing "reboot" in bash)?
(In reply to Luiz Capitulino from comment #2)
> This is very reproducible to me, but only when the other guest halts. I
> could not reproduce it when the other guest reboots. Did you really get the
> spike when the other guest reboots (for example, but typing "reboot" in
Yep. I tested with reboots more than I tested with halts. I got a spike everytime. I tried it again now and got a spike of 102us for 5m test.
How are rebooting the VM?
I'm asking this because debugging a reboot might be easier than debugging the whole VM teardown.
(In reply to Luiz Capitulino from comment #6)
> How are rebooting the VM?
> I'm asking this because debugging a reboot might be easier than debugging
> the whole VM teardown.
I type reboot in the VM2. And wait for the results of the cyclictests in VM1.
I'll separate my findings into what I'm seeing on the core where
the spike happens vs. what I think is the root cause for it.
I'm very confident about what I'm seeing when the spike happens,
but I'm not completely sure my conclusions make sense for the
So, what I'm seeing when the spike happens is:
1. The core receives two reschedule IPIs
2. The migration thread executes on that core
Now, one strange thing is that the execution of the migration
thread and the reschedule IPI handler is quick and don't
*seem* to account for the spike duration (the spike is reported
as dozens of microseconds, while those two events take a few
microseconds). But I could be missing something.
I've tried to track down what's sending reschedule IPIs to that
core and why the migration thread executes. What I found out,
and this is the part where I'm not very sure about my
findings, is the following:
1. synchronize_rcu_expedited() runs on a _housekeeping_ core.
It seems that it's the networking subsystem that is
calling it from synchronize_net().
There's a number of places in the networking subsystem
that calls synchronize_net(). I'm seeing it being
called from tun_detach(), packet_release() and
2. The execution of synchronize_rcu_expedited() triggers
the sending of reschedule IPIs to all cores on the
system. Those make the migration thread execute on
3. The migration thread executes synchronize_sched_expedited_cpu_stop()
Although it's not always that I see it executing this
function in the core that gets the spike
As was discussed on list, synchronize_rcu_expedited is bad for RT.
Try reverting be3fc413da9eb17cce0991f214ab019d16c88c41
and see if that helps.
Assuming the VCPUs that see the latency are running all the time (stress + cyclictest), I have to wonder why synchronize_rcu_expedited is bothering those CPUs at all?
After all, whenever we go into the guest, the CPU gets marked as being in an RCU quiescent state, and there should be no reason at all for synchronize_rcu_expedited to touch that CPU.
What am I overlooking?
Missing commit e0775cefb5ede661dbdc0611d7bf3fcd4640005c?
Guys, thanks for the feedback. I'll try first Michael's suggestion (comment 9), this should help confirming my findings make sense. Then I'll try Paolo's suggestion (comment 11).
I don't get the spike for short duration runs when be3fc413da9eb17cce0991f214ab019d16c88c41 is reverted.
I'll now take a look at Paolo's suggestion for the fix.
Paolo's suggested fix increases the latency from a few dozen microseconds to several hundrends of microseconds. That's unexpected, the fix does look like the right fix to me since it makes synchronize_sched_expedited() skip CPUs in nohz mode. Investigating...
I've been trying to understand why the upstream commit from comment 11 does not solve this issue, given that it should: this commit changes synchronize_sched_expedited() to ignore CPUs which are executing in RCU idle mode (quiescent mode). This is the case for CPUs running real-time vCPUs.
However, what seems to be happening is that if synchronize_sched_expedited() executes on a CPU while a real-time CPU is in a kvm exit (eg. handling a TSC deadline timer), then the real-time CPU will be scheduled to receive IPIs.
The commit from comment 11 should improve the situation for DPDK workloads, as DPDK always executes in user-space (except for some kernel interruptions such as the 1HZ scheduler tick). But cyclictest loops around clock_nanosleep(), so it's certain it will trig this problem.
I don't know how to fix this. Isn't it legit that the real-time CPU receives this IPI since it's executing in kernel-mode and since it may use RCU?
The only thing I can think of doing is trying to reproducing with upstream and if it does reproduce then move this discussion there.
After discussing my findings with Rik, we've concluded that:
1. As an immediate term solution, we could revert commit from comment 9 from the RT kernel
2. Think about a longer term solution, preferably on upstream (although I haven't checked if upstream has this issue. If it doesn't have, we could expect a complex RCU backport, which is also not good as an immediate term solution)
My next step is to repeat my tests with commit from comment 9 reverted for longer durations.
A small update on this one: I've ran longer tests and reverting the commit from comment 9 does fix the issue on 7.3 RT.
However, we're still getting spikes on 7.2.z when the same commit is reverted.
I'm repeating the debug I did on 7.3 in 7.2.z now.
I am also thinking about this and trying to understand.
Regarding commit 11 fix, As its mentioned in commit message says "idle CPUs and nohz_full= userspace CPUs are in extended quiescent states, there is no need to IPI"
which means inside Guest we are running cyclictest on isolated realtime cpu and other CPU does housekeeping tasks. So, in this case we will be getting IPI's on both realtime and housekeeping CPU when ran 'synchronize_rcu_expedited' resulting in vm exit?
At host side, looks like after this change, Whenever realtime vCPU is in host userspace (very less times) then it will not get any IPI from host CPU may take its time to complete/schedule, resulting in delay.
So, commit 13 revert looks like good option. I agree there should be some long term solution.
(In reply to pagupta from comment #18)
> which means inside Guest we are running cyclictest on isolated realtime cpu
> and other CPU does housekeeping tasks. So, in this case we will be getting
> IPI's on both realtime and housekeeping CPU when ran
> 'synchronize_rcu_expedited' resulting in vm exit?
VM exits happen in the host. In the guest cyclictest would be interrupted.
> At host side, looks like after this change, Whenever realtime vCPU is in
> host userspace (very less times) then it will not get any IPI from host CPU
> may take its time to complete/schedule, resulting in delay.
There's no need for those CPUs to receive this IPI, given they are in RCU idle more.
> So, commit 13 revert looks like good option. I agree there should be some
> long term solution.
We need both commits. The fix from comment 11 becomes less relevant because very few code paths will trig this problem, but we need it with or without the revert from comment 9.
Change the BZ summary to reflect how the issue manifests itself in 7.3, given that we're not going to fix this for 7.2.z.
On trying to reproduce this on upstream RT, I found it contains
a fix similar to ours:
Author: Josh Cartwright <email@example.com>
Date: Tue Oct 27 07:31:53 2015 -0500
net: Make synchronize_rcu_expedited() conditional on !RT_FULL
While the use of synchronize_rcu_expedited() might make
synchronize_net() "faster", it does so at significant cost on RT
systems, as expediting a grace period forcibly preempts any
high-priority RT tasks (via the stop_machine() mechanism).
Without this change, we can observe a latency spike up to 30us with
cyclictest by rapidly unplugging/reestablishing an ethernet link.
The best thing to do is to just cherry-pick this fix into the RT tree. Re-assigning to Clark.
Created attachment 1207301 [details]
upstream RT patch to make synchronize_rcu_expedited() conditional on !PREEMPT_RT
Just for the record:
On upstream, booting the kernel with rcupdate.rcu_normal=1 also solves the problem (the RCU subsystem will never call synchronize_rcu_expedited()). Also, there's a discussion on the RT mailing list about making rcu_normal=1 by default.
However, the RHEL7 kernel doesn't have the rcu_normal option. And I'd guess it would take a non-trivial RCU backport to have it. So, the conclusion is: let's stick with the proposed fix for RHEL7 and use rcu_normal= in RHEL8.
Same with bug 1388528 Comment 9.
After running cyclictest in rt VM for 1 hour when the other VM repeat halt/start, didn't cause any spike in rt VM.
# Min Latencies: 00004
# Avg Latencies: 00006
# Max Latencies: 00012
So this bug has been verified.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.