Bug 1569846

Summary: Spike in Single VM with 1 vCPU (Max latency is 45us)
Product: Red Hat Enterprise Linux 7
Component: kernel-rt
kernel-rt sub component: KVM
Version: 7.6
Status: CLOSED WORKSFORME
Severity: medium
Priority: medium
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Reporter: Pei Zhang <pezhang>
Assignee: Luiz Capitulino <lcapitulino>
QA Contact: Pei Zhang <pezhang>
CC: bhu, chayang, daolivei, juzhang, knoel, lcapitulino, lgoncalv, michen, virt-maint, williams
Last Closed: 2018-05-17 17:49:12 UTC
Type: Bug

Description Pei Zhang 2018-04-20 05:42:47 UTC
Description of problem:
The max latency reached 45us during a 24-hour cyclictest run.


Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.10.0-21.el7_5.2.x86_64
tuned-2.9.0-1.el7.noarch
libvirt-3.9.0-14.el7.x86_64
kernel-rt-3.10.0-871.rt56.814.el7.x86_64


How reproducible:
1/1


Steps to Reproduce:
1. Install a RHEL 7.6 host.

2. Set up the RT host.

3. Install a RHEL 7.6 guest.

4. Set up the RT guest.

5. Start a kernel compile in the guest as stress load on the housekeeping vCPUs.
# cd /home/nfv-virt-rt-kvm/src/kernel-rt/kernel-rt-3.10.0-871.rt56.814 && make -j2>/dev/null && make clean>/dev/null

6. Meanwhile, start the cyclictest run.
# taskset -c 1 cyclictest -m -n -q -p95 -D 24h -h60 -t 1 -a 1 --notrace

7. The max latency is 45us.
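
For reference, the cyclictest invocation in step 6 can be assembled programmatically so that the pinned CPU and the optional break-trace threshold stay consistent between runs. This is only a sketch: the function name `cyclictest_argv` and its parameters are hypothetical, and the flag descriptions in the comments reflect my reading of the rt-tests options rather than anything confirmed in this report.

```python
import shlex


def cyclictest_argv(cpu, duration="24h", priority=95, hist_max_us=60,
                    breaktrace_us=None):
    """Build the taskset + cyclictest argument list used in this report.

    Hypothetical helper; flag meanings below are my understanding of rt-tests.
    """
    argv = [
        "taskset", "-c", str(cpu),   # pin the process to one CPU
        "cyclictest",
        "-m",                        # lock memory
        "-n",                        # use clock_nanosleep
        "-q",                        # quiet, print only the summary
        f"-p{priority}",             # real-time priority
        "-D", duration,              # test duration
        f"-h{hist_max_us}",          # latency histogram upper bound (us)
        "-t", "1",                   # one measurement thread
        "-a", str(cpu),              # thread affinity
        "--notrace",
    ]
    if breaktrace_us is not None:
        # Place -bN just before --notrace, matching the later runs in this bug.
        argv.insert(-1, f"-b{breaktrace_us}")
    return argv


if __name__ == "__main__":
    print(shlex.join(cyclictest_argv(1)))
    print(shlex.join(cyclictest_argv(1, breaktrace_us=30)))
```

Calling `cyclictest_argv(1)` reproduces the command from step 6; passing `breaktrace_us=30` adds `-b30` for runs that should stop at the first latency above that value.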


Actual results:
The max latency is 45us.

Expected results:
The max latency should be < 40us.

Additional info:

1. Intel microcode was applied in this testing run.

2. X86 debug pts: pti_enable=1 ibpb_enabled=1 ibrs_enabled=0 retp_enabled=1

3. This spike only happens in the first kvm-rt test case; the other 2 scenarios look good.

(1) Single VM with 1 rt vCPU:
# Min Latencies: 00005
# Avg Latencies: 00007
# Max Latencies: 00045

(2) Single VM with 8 rt vCPUs:
# Min Latencies: 00005 00005 00005 00005 00005 00005 00005 00005
# Avg Latencies: 00007 00006 00006 00006 00006 00006 00006 00006
# Max Latencies: 00029 00021 00021 00020 00021 00022 00021 00021

(3) Multiple VMs each with 1 rt vCPU:
- VM1
# Min Latencies: 00005
# Avg Latencies: 00007
# Max Latencies: 00024

- VM2
# Min Latencies: 00005
# Avg Latencies: 00007
# Max Latencies: 00024

- VM3
# Min Latencies: 00005
# Avg Latencies: 00007
# Max Latencies: 00024

- VM4
# Min Latencies: 00005
# Avg Latencies: 00007
# Max Latencies: 00025
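
The pass/fail check applied to the summaries above (max latency must stay below 40us) could be automated along these lines. This is a minimal sketch that assumes the summary lines always have the `# Min/Avg/Max Latencies:` form shown in this report, with one value per measurement thread; `parse_summary` and `threads_over_limit` are hypothetical names.

```python
import re

# Matches lines such as "# Max Latencies: 00029 00021 00021".
SUMMARY_RE = re.compile(r"^#\s*(Min|Avg|Max) Latencies:\s*(.+)$")


def parse_summary(text):
    """Return {'Min': [...], 'Avg': [...], 'Max': [...]} in microseconds."""
    result = {}
    for line in text.splitlines():
        match = SUMMARY_RE.match(line.strip())
        if match:
            result[match.group(1)] = [int(v) for v in match.group(2).split()]
    return result


def threads_over_limit(summary, limit_us=40):
    """Indexes of threads whose max latency is not below limit_us."""
    return [i for i, v in enumerate(summary.get("Max", [])) if v >= limit_us]


if __name__ == "__main__":
    single_vm_1_vcpu = """\
# Min Latencies: 00005
# Avg Latencies: 00007
# Max Latencies: 00045
"""
    summary = parse_summary(single_vm_1_vcpu)
    print(summary["Max"], threads_over_limit(summary))
```

An empty list from `threads_over_limit` means every thread met the expectation; for scenario (1) the single thread is reported because 45us is above the limit.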

Comment 2 Luiz Capitulino 2018-04-20 12:42:48 UTC
Pei,

Do you have a kernel build running in the host as well? And, can you run cyclictest again with the additional options "-b30 --notrace"?

Thanks

Comment 3 Pei Zhang 2018-04-20 15:15:58 UTC
(In reply to Luiz Capitulino from comment #2)
> Pei,
> 
> Do you have a kernel build running in the host as well? And, can you run
> cyclictest again with the additional options "-b30 --notrace"?
> 
> Thanks

Luiz, I didn't run a kernel build in the host, only in the guest.

OK. I'll test with options "-b30 --notrace".

Best Regards,
Pei

Comment 4 Luiz Capitulino 2018-04-20 17:18:31 UTC
OK, I'm trying to reproduce it too. The first step is to know how long it takes to reproduce.

Comment 5 Pei Zhang 2018-04-23 05:59:17 UTC
I tried 2 runs, but neither reproduced this spike:

run1:
Test started at:    2018-04-21 03:23:13 Saturday
Kernel cmdline:     BOOT_IMAGE=/vmlinuz-3.10.0-871.rt56.814.el7.x86_64 root=/dev/mapper/rhel_bootp--73--75--185-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto rd.lvm.lv=rhel_bootp-73-75-185/root rd.lvm.lv=rhel_bootp-73-75-185/swap rhgb quiet default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1
X86 debug pts:      pti_enable=1 ibpb_enabled=1 ibrs_enabled=0 retp_enabled=1
Machine:            bootp-73-75-185.lab.eng.pek2.redhat.com
CPU:                Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
Test duration:      24h
Test ended at:      2018-04-22 03:23:14 Sunday
cyclictest cmdline: taskset -c 1 cyclictest -m -n -q -p95 -D 24h -h60 -t 1 -a 1 -b30 --notrace
cyclictest results: 

# Min Latencies: 00005
# Avg Latencies: 00007
# Max Latencies: 00025


run2:
Test started at:    2018-04-22 04:42:48 Sunday
Kernel cmdline:     BOOT_IMAGE=/vmlinuz-3.10.0-871.rt56.814.el7.x86_64 root=/dev/mapper/rhel_bootp--73--75--185-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto rd.lvm.lv=rhel_bootp-73-75-185/root rd.lvm.lv=rhel_bootp-73-75-185/swap rhgb quiet default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1
X86 debug pts:      pti_enable=1 ibpb_enabled=1 ibrs_enabled=0 retp_enabled=1
Machine:            bootp-73-75-185.lab.eng.pek2.redhat.com
CPU:                Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
Test duration:      24h
Test ended at:      2018-04-23 04:42:50 Monday
cyclictest cmdline: taskset -c 1 cyclictest -m -n -q -p95 -D 24h -h60 -t 1 -a 1 -b30 --notrace
cyclictest results: 

# Min Latencies: 00005
# Avg Latencies: 00007
# Max Latencies: 00023
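
Since both runs pin cyclictest to CPU 1, it can be useful to confirm that the recorded kernel command line actually isolates that CPU. The sketch below checks the `isolcpus`, `nohz_full` and `rcu_nocbs` parameters; it assumes each one holds a plain CPU list such as `1` or `1-3,5`, and the helper names are hypothetical.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a dict.

    Flags without '=' map to True. Repeated keys (for example console=)
    keep only the last value, which is enough for this check.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


def expand_cpu_list(spec):
    """Expand a CPU list such as '1' or '1-3,5' into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        low, sep, high = part.partition("-")
        cpus.update(range(int(low), int(high if sep else low) + 1))
    return cpus


def missing_isolation(cmdline, cpu,
                      keys=("isolcpus", "nohz_full", "rcu_nocbs")):
    """Return the isolation parameters that do not cover the given CPU."""
    params = parse_cmdline(cmdline)
    missing = []
    for key in keys:
        value = params.get(key)
        if not isinstance(value, str) or not value \
                or cpu not in expand_cpu_list(value):
            missing.append(key)
    return missing


if __name__ == "__main__":
    cmdline = ("skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup "
               "nohz=on nohz_full=1 rcu_nocbs=1")
    print(missing_isolation(cmdline, 1))
```

An empty result for CPU 1 indicates that all three parameters cover the CPU used by cyclictest; any names returned point at a tuning gap worth checking before a 24h run.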

Comment 6 Luiz Capitulino 2018-04-23 13:04:54 UTC
I wasn't able to reproduce it either. The only differences between our setups are that I'm running:

qemu-kvm-rhev-2.10.0-20.el7.x86_64
kernel-3.10.0-873.rt56.816.el7.x86_64

Did you try your run manually? Maybe you could try to run this again through automation to see if it reproduces? In this case, it would be great if you could add "-b30 --notrace".

Otherwise, this would be one of those hard to debug bugs...

Comment 7 Pei Zhang 2018-04-23 15:06:47 UTC
(In reply to Luiz Capitulino from comment #6)
> I wasn't able to reproduce it either. The only differences between our
> setups are that I'm running:
> 
> qemu-kvm-rhev-2.10.0-20.el7.x86_64
> kernel-3.10.0-873.rt56.816.el7.x86_64
> 
> Did you try your run manually? Maybe you could try to run this again through
> automation to see if it reproduces? In this case, it would be great if you
> could add "-b30 --notrace".

Luiz, I did try to reproduce with automation; the results in Comment 5 are from the automated runs. Here are the beaker jobs:

https://beaker.engineering.redhat.com/recipes/5065306#tasks
https://beaker.engineering.redhat.com/recipes/5064248#tasks

Best Regards,
Pei
 
> Otherwise, this would be one of those hard to debug bugs...

Comment 8 Luiz Capitulino 2018-04-23 15:40:18 UTC
Ah, ok. So there's not much we can do. Let's keep this BZ open and keep testing normally. If it happens again, please report here.

Comment 9 Luiz Capitulino 2018-05-17 17:49:12 UTC
Almost a month has gone by and we don't seem to be able to reproduce this, so closing as WORKSFORME.