Bug 1388528

Summary: KVM-RT: halting and starting guests cause latency spikes [rhel-rt-7.3.z]
Product: Red Hat Enterprise Linux 7 Reporter: Marcel Kolaja <mkolaja>
Component: kernel-rt Assignee: Clark Williams <williams>
kernel-rt sub component: KVM QA Contact: Pei Zhang <pezhang>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: bhu, chayang, ggopinat, hhuang, jen, jshortt, juzhang, lcapitulino, mkolaja, mst, pagupta, pbonzini, pezhang, riel, sgordon, sherold, snagar, srostedt, virt-maint, williams, xfu
Version: 7.2 Keywords: ZStream
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: synchronize_rcu_expedited() is a call used upstream to increase the priority of RCU synchronize operations.
Consequence: Calling it may hold off realtime operations and cause latency spikes.
Fix: Make the call to synchronize_rcu_expedited() conditional on not being in an RT kernel.
Result: No latency spikes are caused by the RCU expedited call.
Story Points: ---
Clone Of: 1378172 Environment:
Last Closed: 2016-12-06 17:10:29 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1378172    
Bug Blocks: 1353018    

Description Marcel Kolaja 2016-10-25 14:53:24 UTC
This bug has been copied from bug #1378172 and has been proposed
to be backported to 7.3 z-stream (EUS).

Comment 5 Pei Zhang 2016-11-15 10:10:19 UTC
Hi Clark,

QE failed to reproduce this issue with rhel7.3GA version.

The scenarios below were tested, but the issue still failed to reproduce; there were no spikes during testing (Max latencies < 20):
1. Run cyclictest on vm1 for 15 minutes; reboot/halt/shut down vm2 5 minutes later
2. Run cyclictest on vm1, vm2 and vm3 for 15 minutes; reboot vm2 several times


However, QE can reproduce this issue with the rhel7.2.z version (3.10.0-327.36.1.rt56.237.el7.x86_64).


We also tested with the kernel version that contains the fix for this bug, kernel-rt-3.10.0-514.1.1.rt56.422.el7, and no spikes occurred.


Could you give QE some suggestions about this bug?  Thanks.


Best Regards,
-Pei

Comment 6 Pei Zhang 2016-11-15 10:15:45 UTC
The rhel7.3GA version we tested: 3.10.0-514.rt56.420.el7.x86_64

Comment 7 Luiz Capitulino 2016-11-15 14:46:24 UTC
The rhel7.3GA kernel does have the bug; I think it's just a matter of trying harder to reproduce it.

What you could do is:

1. Run cyclictest for longer (eg. 1 hour)

2. The second VM should keep rebooting in a loop while cyclictest runs on the other VM

Comment 8 Luiz Capitulino 2016-11-15 15:03:28 UTC
Another note: make sure that the VM that reboots is a "standard" VM, meaning that it has a NIC, etc.

The best way is probably to install it with virt-install and not change the XML.

Comment 9 Luiz Capitulino 2016-11-16 18:21:34 UTC
I talked to Pei Zhang today on IRC and I think we have found out why the problem is not reproducing. As it turns out, the bug reproduces on halt and re-start, not on reboots (as I mention in bug 1378172 comment 22). Sorry for having forgotten about that.

The reproducer I've been using is:

1. Install a "standard" VM with virt-manager (that is, don't change the XML)

2. In the VM, add "halt -p" to /etc/rc.d/rc.local (save a snapshot before doing this if you plan to use the VM afterwards)

3. In the host, write a script that does "virsh start VM" every few seconds in a loop

Then, while this is running, run the cyclictest test case in the RT VM.
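Step 3 of the reproducer above (the host-side loop) could be sketched roughly as follows. This is only an illustration, not the script that was actually used: the function name restart_loop, the 5-second default interval, and the injectable run/sleep parameters are assumptions made for the example. "VM" is the domain name from step 3. A non-zero exit status from virsh (for example because the guest has not halted yet) is ignored and the start is retried on the next pass.

```python
import subprocess
import time


def restart_loop(domain, interval_s=5.0, iterations=None,
                 run=subprocess.call, sleep=time.sleep):
    """Issue `virsh start <domain>` every `interval_s` seconds.

    The guest is expected to power itself off again via the "halt -p"
    line added to rc.local, so each successful start triggers another
    halt/start cycle.

    iterations=None loops forever; a number bounds the loop (hypothetical
    convenience for short runs). `run` and `sleep` are injectable only so
    the loop can be exercised without libvirt.

    Returns the number of start attempts that exited with status 0.
    """
    started = 0
    attempts = 0
    while iterations is None or attempts < iterations:
        # Non-zero status means the start did not happen this time;
        # just try again after the sleep.
        if run(["virsh", "start", domain]) == 0:
            started += 1
        attempts += 1
        sleep(interval_s)
    return started
```

Calling restart_loop("VM") on the host would then keep restarting the guest until interrupted, while cyclictest runs in the RT VM.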

Comment 10 Pei Zhang 2016-11-17 10:50:01 UTC
Thanks Luiz for providing the detailed reproduction method.

==Reproduce==
Versions:
RHEL7.3GA version: 3.10.0-514.rt56.420.el7.x86_64

Steps:
Same as Comment 9, running cyclictest in the RT VM for 1 hour.

Results:
# Min Latencies: 00003
# Avg Latencies: 00005
# Max Latencies: 00033

The Max latency of 33 is > 20, so this bug has been reproduced.


==Verification==
Versions:
3.10.0-514.1.1.rt56.422.el7.x86_64

Steps:
Same as in the reproduction above.

Results:
# Min Latencies: 00003
# Avg Latencies: 00005
# Max Latencies: 00011


The Max latency of 11 is < 20, so this bug has been fixed.
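The pass/fail check applied to both result blocks above (Max latency compared against 20) could be automated with something like the sketch below. It only assumes the "# Min/Avg/Max Latencies:" summary lines exactly as quoted in this comment. The function names and the regular expression are made up for the example, and the threshold of 20 is taken from this bug without a stated unit (cyclictest reports microseconds by default).

```python
import re

# Matches summary lines such as "# Max Latencies: 00033". Several
# whitespace-separated values on one line are accepted as well.
SUMMARY_RE = re.compile(
    r"^#\s*(Min|Avg|Max) Latencies:((?:[ \t]+\d+)+)[ \t]*$",
    re.MULTILINE,
)


def parse_latency_summary(text):
    """Return e.g. {"Min": [3], "Avg": [5], "Max": [33]} from the summary lines."""
    summary = {}
    for name, values in SUMMARY_RE.findall(text):
        summary[name] = [int(v) for v in values.split()]
    return summary


def has_spike(text, threshold=20):
    """True if the worst Max latency in `text` exceeds `threshold`."""
    summary = parse_latency_summary(text)
    if "Max" not in summary:
        raise ValueError("no '# Max Latencies:' line found")
    return max(summary["Max"]) > threshold
```

Fed the two result blocks from this comment, has_spike() returns True for the reproduction run (Max 33) and False for the verification run (Max 11).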

Comment 11 Pei Zhang 2016-11-17 10:51:04 UTC
Setting this bug to 'VERIFIED' per Comment 10.

Comment 12 Luiz Capitulino 2016-11-17 13:22:05 UTC
Thanks for insisting on having a reproducer, Pei Zhang!

Comment 14 errata-xmlrpc 2016-12-06 17:10:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2883.html