Bug 1378172 - KVM-RT: halting and starting guests cause latency spikes
Summary: KVM-RT: halting and starting guests cause latency spikes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel-rt
Version: 7.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 7.4
Assignee: Clark Williams
QA Contact: Pei Zhang
URL:
Whiteboard:
Depends On:
Blocks: 1175461 1353018 1388528
 
Reported: 2016-09-21 17:28 UTC by Greeshma Gopinath
Modified: 2017-08-02 00:23 UTC
19 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
No 7.4 docs needed; will be released/documented in 7.3.z.
Clone Of:
: 1388528 (view as bug list)
Environment:
Last Closed: 2017-08-01 18:57:57 UTC


Attachments
upstream RT patch to make synchronize_rcu_expedited() conditional on !PREEMPT_RT (1.27 KB, patch)
2016-10-04 17:44 UTC, Clark Williams


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:2077 normal SHIPPED_LIVE Important: kernel-rt security, bug fix, and enhancement update 2017-08-01 18:13:37 UTC

Description Greeshma Gopinath 2016-09-21 17:28:57 UTC
Description of problem:
When cyclictest is running in one realtime guest, rebooting or halting other guests causes latency spikes. Ideally, the maximum latency reported on a realtime kernel should be less than 20us; when a halt or a reboot occurs, spikes of more than 90us are reported.


Version-Release number of selected component (if applicable):
3.10.0-327.36.1.rt56.237.el7.x86_64

How reproducible:
100%


Steps to Reproduce:
1. Start 2 realtime VMs.
2. Run cyclictest on one guest for 15m. Wait for 5m.
3. Shutdown the other guest.
4. A spike is seen in the cyclictest results.

Actual results:
The Max Latency reported >90 us

Expected results:
The Max latency should be <20 us


Additional info:

Comment 2 Luiz Capitulino 2016-09-22 14:23:16 UTC
Greeshma,

This is very reproducible for me, but only when the other guest halts. I could not reproduce it when the other guest reboots. Did you really get the spike when the other guest reboots (for example, by typing "reboot" in bash)?

Comment 5 Greeshma Gopinath 2016-09-22 14:41:10 UTC
(In reply to Luiz Capitulino from comment #2)
> Greeshma,
> 
> This is very reproducible to me, but only when the other guest halts. I
> could not reproduce it when the other guest reboots. Did you really get the
> spike when the other guest reboots (for example, but typing "reboot" in
> bash)?

Yep. I tested with reboots more than I tested with halts. I got a spike every time. I tried it again now and got a spike of 102us for a 5m test.

Comment 6 Luiz Capitulino 2016-09-22 14:48:36 UTC
How are you rebooting the VM?

I'm asking this because debugging a reboot might be easier than debugging the whole VM teardown.

Comment 7 Greeshma Gopinath 2016-09-22 15:03:15 UTC
(In reply to Luiz Capitulino from comment #6)
> How are rebooting the VM?
> 
> I'm asking this because debugging a reboot might be easier than debugging
> the whole VM teardown.

I type reboot in VM2, and wait for the results of the cyclictest in VM1.

Comment 8 Luiz Capitulino 2016-09-23 21:34:42 UTC
I'll separate my findings into what I'm seeing on the core where
the spike happens vs. what I think is the root cause for it.

I'm very confident about what I'm seeing when the spike happens,
but I'm not completely sure my conclusions make sense for the
root cause.

So, what I'm seeing when the spike happens is:

 1. The core receives two reschedule IPIs
 2. The migration thread executes on that core

Now, one strange thing is that the execution of the migration
thread and the reschedule IPI handler is quick and doesn't
*seem* to account for the spike duration (the spike is reported
as dozens of microseconds, while those two events take a few
microseconds). But I could be missing something.

I've tried to track down what's sending reschedule IPIs to that
core and why the migration thread executes. What I found out,
and this is the part where I'm not very sure about my
findings, is the following:

 1. synchronize_rcu_expedited() runs on a _housekeeping_ core.
    It seems that it's the networking subsystem that is
    calling it from synchronize_net().
    
    There are a number of places in the networking subsystem
    that call synchronize_net(). I'm seeing it called
    from tun_detach(), packet_release() and
    rollback_registered_many()
    
 2. The execution of synchronize_rcu_expedited() triggers
    the sending of reschedule IPIs to all cores on the
    system. Those make the migration thread execute on
    each core
    
 3. The migration thread executes synchronize_sched_expedited_cpu_stop(),
    although I don't always see it executing this function
    on the core that gets the spike

Comment 9 Michael S. Tsirkin 2016-09-23 22:36:59 UTC
As was discussed on list, synchronize_rcu_expedited is bad for RT.

Try reverting be3fc413da9eb17cce0991f214ab019d16c88c41
and see if that helps.

Comment 10 Rik van Riel 2016-09-24 01:02:51 UTC
Assuming the VCPUs that see the latency are running all the time (stress + cyclictest), I have to wonder why synchronize_rcu_expedited is bothering those CPUs at all?

After all, whenever we go into the guest, the CPU gets marked as being in an RCU quiescent state, and there should be no reason at all for synchronize_rcu_expedited to touch that CPU.

What am I overlooking?

Comment 11 Paolo Bonzini 2016-09-26 12:04:33 UTC
Missing commit e0775cefb5ede661dbdc0611d7bf3fcd4640005c?

Comment 12 Luiz Capitulino 2016-09-26 13:00:53 UTC
Guys, thanks for the feedback. I'll first try Michael's suggestion (comment 9), which should help confirm that my findings make sense. Then I'll try Paolo's suggestion (comment 11).

Comment 13 Luiz Capitulino 2016-09-26 16:55:36 UTC
Michael,

I don't get the spike for short duration runs when be3fc413da9eb17cce0991f214ab019d16c88c41 is reverted.

I'll now take a look at Paolo's suggestion for the fix.

Comment 14 Luiz Capitulino 2016-09-26 20:57:36 UTC
Paolo's suggested fix increases the latency from a few dozen microseconds to several hundreds of microseconds. That's unexpected, since the fix does look right to me: it makes synchronize_sched_expedited() skip CPUs in nohz mode. Investigating...

Comment 15 Luiz Capitulino 2016-09-28 15:27:00 UTC
I've been trying to understand why the upstream commit from comment 11 does not solve this issue, given that it should: this commit changes synchronize_sched_expedited() to ignore CPUs which are executing in RCU idle mode (quiescent mode). This is the case for CPUs running real-time vCPUs.

However, what seems to be happening is that if synchronize_sched_expedited() executes on a CPU while a real-time CPU is in a KVM exit (e.g. handling a TSC deadline timer), then the real-time CPU will be scheduled to receive IPIs.

The commit from comment 11 should improve the situation for DPDK workloads, as DPDK always executes in user-space (except for some kernel interruptions such as the 1Hz scheduler tick). But cyclictest loops around clock_nanosleep(), so it's certain to trigger this problem.

I don't know how to fix this. Isn't it legitimate for the real-time CPU to receive this IPI, since it's executing in kernel mode and may use RCU?

The only thing I can think of doing is trying to reproduce this with upstream and, if it does reproduce, moving this discussion there.

Comment 16 Luiz Capitulino 2016-09-28 21:19:34 UTC
After discussing my findings with Rik, we've concluded that:

1. As an immediate-term solution, we could revert the commit from comment 9 in the RT kernel

2. Think about a longer-term solution, preferably upstream (although I haven't checked whether upstream has this issue; if it doesn't, we could be looking at a complex RCU backport, which is also not good as an immediate-term solution)

My next step is to repeat my tests, with the commit from comment 9 reverted, for longer durations.

Comment 17 Luiz Capitulino 2016-09-30 18:57:09 UTC
A small update on this one: I've run longer tests, and reverting the commit from comment 9 does fix the issue on 7.3 RT.

However, we're still getting spikes on 7.2.z when the same commit is reverted. 

I'm repeating the debug I did on 7.3 in 7.2.z now.

Comment 18 pagupta 2016-10-03 08:46:22 UTC
I am also thinking about this and trying to understand.

Regarding the fix in comment 11: as its commit message says, "idle CPUs and nohz_full= userspace CPUs are in extended quiescent states, there is no need to IPI".

Inside the guest, we are running cyclictest on an isolated realtime CPU while the other CPU does housekeeping tasks. So in this case, will we get IPIs on both the realtime and the housekeeping CPU when synchronize_rcu_expedited() runs, resulting in a VM exit?

On the host side, it looks like after this change, whenever the realtime vCPU is in host userspace (which is rare), it will not get any IPI from the host CPU and may take its time to complete/schedule, resulting in delay.

So the revert from comment 13 looks like a good option. I agree there should be some long-term solution.

Best regards,
Pankaj

Comment 20 Luiz Capitulino 2016-10-03 13:23:47 UTC
(In reply to pagupta from comment #18)

> which means inside Guest we are running cyclictest on isolated realtime cpu
> and other CPU does housekeeping tasks. So, in this case we will be getting
> IPI's on both realtime and housekeeping CPU when ran
> 'synchronize_rcu_expedited' resulting in vm exit?

VM exits happen in the host. In the guest cyclictest would be interrupted.
 
> At host side, looks like after this change, Whenever realtime vCPU is in
> host userspace (very less times) then it will not get any IPI from host CPU
> may take its time to complete/schedule, resulting in delay.

There's no need for those CPUs to receive this IPI, given that they are in RCU idle mode.

> So, commit 13 revert looks like good option. I agree there should be some
> long term solution.

We need both commits. The fix from comment 11 becomes less relevant because very few code paths will trigger this problem, but we need it with or without the revert from comment 9.

Comment 22 Luiz Capitulino 2016-10-03 18:00:14 UTC
Changed the BZ summary to reflect how the issue manifests itself in 7.3, given that we're not going to fix this for 7.2.z.

Comment 23 Luiz Capitulino 2016-10-04 14:52:13 UTC
On trying to reproduce this on upstream RT, I found it contains
a fix similar to ours:

commit 8a2f07f55ad33c804fc3e69fc9641d4c7e670eef
Author: Josh Cartwright <joshc@ni.com>
Date:   Tue Oct 27 07:31:53 2015 -0500

    net: Make synchronize_rcu_expedited() conditional on !RT_FULL
    
    While the use of synchronize_rcu_expedited() might make
    synchronize_net() "faster", it does so at significant cost on RT
    systems, as expediting a grace period forcibly preempts any
    high-priority RT tasks (via the stop_machine() mechanism).
    
    Without this change, we can observe a latency spike up to 30us with
    cyclictest by rapidly unplugging/reestablishing an ethernet link.

The best thing to do is to just cherry-pick this fix into the RT tree. Re-assigning to Clark.

Comment 24 Clark Williams 2016-10-04 17:44:44 UTC
Created attachment 1207301 [details]
upstream RT patch to make synchronize_rcu_expedited() conditional on !PREEMPT_RT

Comment 27 Luiz Capitulino 2016-10-12 17:18:47 UTC
Just for the record:

On upstream, booting the kernel with rcupdate.rcu_normal=1 also solves the problem (the RCU subsystem will never call synchronize_rcu_expedited()). Also, there's a discussion on the RT mailing list about making rcu_normal=1 by default.

However, the RHEL7 kernel doesn't have the rcu_normal option. And I'd guess it would take a non-trivial RCU backport to have it. So, the conclusion is: let's stick with the proposed fix for RHEL7 and use rcu_normal= in RHEL8.
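For completeness, on a kernel that does have the option (upstream, and per the plan above, RHEL8), the boot-time workaround would be a kernel command-line parameter; a hypothetical grub snippet:

```shell
# /etc/default/grub (on kernels that support rcupdate.rcu_normal; RHEL7
# does not). Append the option to any existing GRUB_CMDLINE_LINUX value,
# then regenerate the grub config and reboot.
GRUB_CMDLINE_LINUX="rcupdate.rcu_normal=1"
```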

Comment 34 Pei Zhang 2017-06-02 07:02:53 UTC
==Verification==
Versions:
3.10.0-675.rt56.597.el7.x86_64
tuned-2.8.0-3.el7.noarch
libvirt-3.2.0-6.el7.x86_64
qemu-kvm-rhev-2.9.0-7.el7.x86_64


Steps:
Same as bug 1388528 comment 9.

Results:
After running cyclictest in the RT VM for 1 hour while the other VM repeatedly halted and started, no spike was seen in the RT VM.
 
# Min Latencies: 00004
# Avg Latencies: 00006
# Max Latencies: 00012

So this bug has been verified.

Comment 35 errata-xmlrpc 2017-08-01 18:57:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2077


