
Bug 1690543

Summary:	8 vCPU guest need max latency < 20 us with stress
Product:	Red Hat Enterprise Linux 7	Reporter:	jianzzha
Component:	kernel-rt	Assignee:	Marcelo Tosatti <mtosatti>
kernel-rt sub component:	Other	QA Contact:	Pei Zhang <pezhang>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	urgent	CC:	bhu, broskos, chayang, cww, daolivei, derli, dhoward, eelena, fiezzi, hhuang, jhsiao, jinzhao, jlelli, juzhang, kabbott, lcapitulino, mtosatti, ngu, peterx, pezhang, pvaanane, snagar, sputhenp, virt-maint, williams
Version:	7.5	Keywords:	ZStream
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	kernel-rt-3.10.0-1063.rt56.1023.el7	Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:
Clones:	1754846 1754847 1757165 (view as bug list)		Environment:
Last Closed:	2020-03-31 19:48:21 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1550584, 1701509, 1723499, 1730016, 1732264, 1734096, 1942499
Bug Blocks:	1672377, 1715542, 1754846, 1754847

Description jianzzha 2019-03-19 16:27:30 UTC

Description of problem:
We need to achieve a max latency below 20 us in an 8-vCPU cyclictest run with stress.


Steps to Reproduce:
1. set up an 8-vCPU guest: 2 vCPUs for housekeeping, 6 for cyclictest
2. stress all 8 cores with: for i in {0..7}; do taskset -c $i stress-ng --cpu 1 --cpu-load 70 --cpu-method loop --timeout 24h & done
3. run cyclictest 3 times in a row: cyclictest -l 100000 -p 99 -t 6 -h 30 -m -n -a 2-7


Actual results:
Some runs have a max latency between 20 and 30 us.

Expected results:
all 3 runs should have <20 us max latency
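
A minimal sketch of how this pass criterion could be checked automatically. It assumes cyclictest prints a summary line of the form "# Max Latencies: 00017 00027 ..." when run with -h, as in the outputs quoted later in this bug. The function names and the THRESHOLD_US constant are illustrative and not part of any existing test suite.

```python
import re
from typing import List

THRESHOLD_US = 20  # pass criterion from this report: max latency must stay below 20 us


def max_latencies(cyclictest_output: str) -> List[int]:
    """Return the per-thread max latencies (us) from a cyclictest histogram summary."""
    for line in cyclictest_output.splitlines():
        match = re.match(r"#\s*Max Latencies:\s*(.*)", line)
        if match:
            return [int(value) for value in match.group(1).split()]
    return []  # no summary line found


def run_passes(cyclictest_output: str, threshold_us: int = THRESHOLD_US) -> bool:
    """A run passes only if a summary was found and every thread stayed below the threshold."""
    latencies = max_latencies(cyclictest_output)
    return bool(latencies) and max(latencies) < threshold_us
```

Applying run_passes() to the captured output of each of the three runs, and requiring all three to return True, would express the expected result stated above.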

Additional info:

Comment 3 Luiz Capitulino 2019-03-25 18:01:14 UTC

Jianzhu,

Would you please confirm that when you don't run stress-ng on CPU0
the test passes? Thanks.

Comment 4 jianzzha 2019-03-25 18:11:31 UTC

(In reply to Luiz Capitulino from comment #3)
> Jianzhu,
> 
> Would you please confirm that when you don't run stress-ng on CPU0
> the test passes? Thanks.

CPU0 has stress-ng. That's the China Mobile standard test procedure, and all vendors need to follow it in the lab test.

Comment 7 Luiz Capitulino 2019-03-25 21:20:02 UTC

(In reply to jianzzha from comment #4)
> (In reply to Luiz Capitulino from comment #3)
> > Jianzhu,
> > 
> > Would you please confirm that when you don't run stress-ng on CPU0
> > the test passes? Thanks.
> 
> CPU0 has stress-ng. That's china mobile standard test procedure and all
> vendors need to follow it in the lab test

I understand. But it's an important data point for us to know whether this
only happens when there's stress in CPU0.

Comment 9 jianzzha 2019-03-26 12:34:35 UTC

(In reply to Luiz Capitulino from comment #7)
> (In reply to jianzzha from comment #4)
> > (In reply to Luiz Capitulino from comment #3)
> > > Jianzhu,
> > > 
> > > Would you please confirm that when you don't run stress-ng on CPU0
> > > the test passes? Thanks.
> > 
> > CPU0 has stress-ng. That's china mobile standard test procedure and all
> > vendors need to follow it in the lab test
> 
> I understand. But it's an important data point for us to know whether this
> only happens when there's stress in CPU0.

I see what you mean. I saw that Peter was able to reproduce it without OpenStack, which is good: in OpenStack it is much harder to adjust the system because a lot of the libvirt control lives in Nova. Once we know how to improve it (kernel, settings, or whatever), we can get it into OpenStack.

Comment 18 Peter Xu 2019-03-28 06:32:17 UTC

(obviously I didn't really notice that my previous comments are private... 
 this one will be public)

> I don't know how the tools differ, but I don't think we should use my script
> for this BZ. Let's do exactly what Jianzhu is doing.

Yes I was using jianzhu's command line when trying to reproduce.  I
used your script when I wanted to run a baseline test only.

> 
> > This is the nightly test result covering 16h:
> > 
> > - idle housekeeping vcpus (vcpu 0-1)
> > - run stress-ng on real-time vcpus only (vcpu 2-5): taskset -c $i stress-ng
> > --cpu 1 --cpu-load 70 --cpu-method loop 
> > - run cyclictest manually: cyclictest -p 99 -t 4 -h 30 -m -n -a 2-5 -q
> > 
> > # Total: 058864224 058864220 058864217 058864213
> > # Min Latencies: 00005 00005 00005 00005
> > # Avg Latencies: 00007 00007 00008 00007
> > # Max Latencies: 00028 00028 00028 00028

[1]

> > 
> > So the spikes triggered again even without housekeeping vcpu workload. 
> 
> What's your CPU?
> 
> This result could be a spike, but it could also be the baseline for your CPU.
> Note that Jianzhu is able to achieve around 12us for most CPUs.

Ok, if so, then I'm unsure whether the spikes I observed are the same
as Jianzhu's...  But let me try to re-summarize what I have now again
before reusing another host to test, because it seems I already saw 
some issue.

Firstly, my guest has 6 vcpus (2 housekeep, 4 realtime).

(1) use stress workload (close to 24 hours)

- keep vcpu 0-1 idle
- run: "stress --cpu 1" on vcpu 2-5 in the background
- run: "cyclictest -p 99 -t 4 -h 30 -m -n -a 2-5 -q"

# Total: 069851830 069851827 069851824 069851820
# Min Latencies: 00007 00007 00007 00007
# Avg Latencies: 00007 00007 00007 00007
# Max Latencies: 00017 00017 00017 00017

(2) use stress-ng workload (reproduces even within 1 hour, let alone 24h)

- keep vcpu 0-1 idle
- run: "stress-ng --cpu 1 --cpu-load 70 --cpu-method loop" on vcpu 2-5 in the background
- run: "cyclictest -p 99 -t 4 -h 30 -m -n -a 2-5 -q"

# Total: 004082938 004082934 004082931 004082928
# Min Latencies: 00005 00005 00005 00005
# Avg Latencies: 00007 00007 00007 00008
# Max Latencies: 00027 00027 00027 00023

(It's basically the same as what I got from above [1]; and this is 
 very easy to reproduce, say in every 1 hour test)

So I really suspect the workload that we're using is affecting the
test result of cyclictest.

My host/guest kernel version:

3.10.0-957.rt56.910.el7.x86_64

Jianzhu, have you tried to run "stress --cpu 1" as workload on your host?

Comment 19 Marcelo Tosatti 2019-03-28 12:40:38 UTC

(In reply to Peter Xu from comment #18)
> (obviously I didn't really notice that my previous comments are private... 
>  this one will be public)
> 
> > I don't know how the tools differ, but I don't think we should use my script
> > for this BZ. Let's do exactly what Jianzhu is doing.
> 
> Yes I was using jianzhu's command line when trying to reproduce.  I
> used your script when I wanted to run a baseline test only.
> 
> > 
> > > This is the nightly test result covering 16h:
> > > 
> > > - idle housekeeping vcpus (vcpu 0-1)
> > > - run stress-ng on real-time vcpus only (vcpu 2-5): taskset -c $i stress-ng
> > > --cpu 1 --cpu-load 70 --cpu-method loop 
> > > - run cyclictest manually: cyclictest -p 99 -t 4 -h 30 -m -n -a 2-5 -q
> > > 
> > > # Total: 058864224 058864220 058864217 058864213
> > > # Min Latencies: 00005 00005 00005 00005
> > > # Avg Latencies: 00007 00007 00008 00007
> > > # Max Latencies: 00028 00028 00028 00028
> 
> [1]
> 
> > > 
> > > So the spikes triggered again even without housekeeping vcpu workload. 
> > 
> > What's your CPU?
> > 
> > This result could be a spike, but it could also be the baseline for your CPU.
> > Note that Jianzhu is able to achieve around 12us for most CPUs.
> 
> Ok if so then I'm unsure on whether the spikes I observed is the same
> as Jianzhu's...  But let me try to re-summarize what I have now again
> before reusing another host to test, because it seems I already saw 
> some issue.
> 
> Firstly, my guest has 6 vcpus (2 housekeep, 4 realtime).
> 
> (1) use stress workload (close to 24 hours)
> 
> - keep vcpu 0-1 idle
> - run: "stress --cpu 1" on vcpu 2-5 in the background
> - run: "cyclictest -p 99 -t 4 -h 30 -m -n -a 2-5 -q"
> 
> # Total: 069851830 069851827 069851824 069851820
> # Min Latencies: 00007 00007 00007 00007
> # Avg Latencies: 00007 00007 00007 00007
> # Max Latencies: 00017 00017 00017 00017
> 
> (2) use stress-ng workload (reproduce even within 1 hour, not to say 24H)
> 
> - keep vcpu 0-1 idle
> - run: "stress-ng --cpu 1 --cpu-load 70 --cpu-method loop" on vcpu 2-5 in
> the background
> - run: "cyclictest -p 99 -t 4 -h 30 -m -n -a 2-5 -q"
> 
> # Total: 004082938 004082934 004082931 004082928
> # Min Latencies: 00005 00005 00005 00005
> # Avg Latencies: 00007 00007 00007 00008
> # Max Latencies: 00027 00027 00027 00023
> 
> (It's basically the same as what I got from above [1]; and this is 
>  very easy to reproduce, say in every 1 hour test)
> 
> So I really suspect the workload that we're using is affecting the
> test result of cyclictest.
> 
> My host/guest kernel version:
> 
> 3.10.0-957.rt56.910.el7.x86_64
> 
> Jianzhu, have you tried to run "stress --cpu 1" as workload on your host?

Peter Xu,  

You need to use CAT, which is available on Jianzhu's machine as far
as I understand (instructions to set up CAT are in comment #6).

Comment 20 Pei Zhang 2019-03-28 13:47:33 UTC

I reproduced this issue with 3.10.0-862.20.2.rt56.823.el7.x86_64 and 3.10.0-862.rt56.804.el7.x86_64.

== In both host&guest
# cat /sys/kernel/debug/x86/ibpb_enabled
0 
# cat /sys/kernel/debug/x86/pti_enabled
0
# cat /sys/kernel/debug/x86/ibrs_enabled
0
# cat /sys/kernel/debug/x86/retp_enabled
0

== in host
# cat /sys/module/kvm/parameters/halt_poll_ns
0
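
As a hedged illustration, the mitigation checks above could be scripted. The sketch below reads the same four debugfs files listed in this comment; it assumes debugfs is mounted at /sys/kernel/debug and readable (typically root only). The function names are hypothetical.

```python
from pathlib import Path

# Knob names as listed in this comment, found under /sys/kernel/debug/x86 on these kernels.
MITIGATION_KNOBS = ("ibpb_enabled", "pti_enabled", "ibrs_enabled", "retp_enabled")


def read_knobs(debug_dir="/sys/kernel/debug/x86"):
    """Map each knob name to its integer value, or None if it cannot be read."""
    values = {}
    for name in MITIGATION_KNOBS:
        try:
            values[name] = int((Path(debug_dir) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


def all_disabled(values):
    """True only when every knob was read successfully and reports 0."""
    return all(value == 0 for value in values.values())
```

Running all_disabled(read_knobs()) on both host and guest would confirm the "all off" state shown above before starting a latency run.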

== stress-ng: add stress-ng to all cpus in guest
cpu_list="0 1 2 3 4 5 6 7"
for cpu in $cpu_list; do
        taskset -c $cpu stress-ng --cpu 1 --cpu-load 70 --cpu-method loop --timeout 24h &
        #taskset -c $cpu stress --cpu 1 &
done

== cyclictest 
# cyclictest -l 100000 -p 99 -t 6 -h 30 -m -n -a 2-7
(without "-D", the default running time is 100 seconds)
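
The 100-second figure follows from simple arithmetic, sketched below. It assumes cyclictest's interval is 1000 us when -i is not given, which is consistent with the figure above but should be checked against the installed rt-tests version. The function name is illustrative.

```python
DEFAULT_INTERVAL_US = 1000  # assumed cyclictest interval when -i is not specified


def run_seconds(loops, interval_us=DEFAULT_INTERVAL_US):
    """Approximate wall-clock duration of a cyclictest run bounded by -l LOOPS."""
    return loops * interval_us / 1_000_000
```

With -l 100000 this gives 100 seconds per run. The same conversion applies to the "# Total:" counters quoted elsewhere in this bug: for example, 58864224 loops corresponds to roughly 16.4 hours under the same interval assumption.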

I did 20 runs for each version; around 1 to 2 of them exceeded 20us.

# Max Latencies: 00017 00027 00022 00016 00017 00019

# Max Latencies: 00017 00017 00030 00016 00017 00017


Besides, I checked the RHEL 7.5 testing history (https://mojo.redhat.com/docs/DOC-1146668): we switched the latency threshold from 20us to 40us after applying the Spectre and Meltdown fixes.

Comment 21 Juri Lelli 2019-03-28 15:00:08 UTC

Hi,

(In reply to Peter Xu from comment #18)

[...]

> Firstly, my guest has 6 vcpus (2 housekeep, 4 realtime).
> 
> (1) use stress workload (close to 24 hours)
> 
> - keep vcpu 0-1 idle
> - run: "stress --cpu 1" on vcpu 2-5 in the background
> - run: "cyclictest -p 99 -t 4 -h 30 -m -n -a 2-5 -q"

[...] 

> (2) use stress-ng workload (reproduce even within 1 hour, not to say 24H)
> 
> - keep vcpu 0-1 idle
> - run: "stress-ng --cpu 1 --cpu-load 70 --cpu-method loop" on vcpu 2-5 in
> the background
> - run: "cyclictest -p 99 -t 4 -h 30 -m -n -a 2-5 -q"

I believe there is a macro difference between these two, which
is the CPU busy percentage the stress(-ng) is creating.

AFAIK "stress-ng --cpu 1 --cpu-load 70" imposes a 70% busy factor, whereas
"stress --cpu 1" just busy-loops (100%).

Not sure this can make any difference, but maybe it's worth trying to use
--cpu-load 100 with stress-ng and check what happens?

Comment 22 Luiz Capitulino 2019-03-28 19:00:51 UTC

(In reply to Peter Xu from comment #18)

> So I really suspect the workload that we're using is affecting the
> test result of cyclictest.

I also suspected this when we discussed this issue by email before opening
this BZ. I think we have two options: Juri suggestion from previous comment,
or you could try yourself to run "stress --cpu 1" in vcpu0 and vcpu1 to see
if you still get good latencies.

Btw, we have to see how stress-ng calculates load, it may be doing some
system call that's causing this...

Marcelo, do you agree with this plan? I.e., spend some more time understanding the
differences Peter spotted before trying CAT?

Comment 23 Peter Xu 2019-03-29 02:03:19 UTC

Hi, everyone,

(In reply to Luiz Capitulino from comment #22)
> (In reply to Peter Xu from comment #18)
> 
> > So I really suspect the workload that we're using is affecting the
> > test result of cyclictest.
> 
> I also suspected this when we discussed this issue by email before opening
> this BZ. I think we have two options: Juri suggestion from previous comment,
> or you could try yourself to run "stress --cpu 1" in vcpu0 and vcpu1 to see
> if you still get good latencies.
> 
> Btw, we have to see how stress-ng calculates load, it may be doing some
> system call that's causing this...

Yeah, actually I discussed this with Hai offlist days ago but I forgot to update here.  There are at least two ways in which the stress-ng workload differs:

1. stress-ng uses 70% load rather than 100%
2. when the load is <100%, it'll do one select() per loop (please see stress-cpu.c:stress_cpu() - the code path when "cpu_load==100" is different, which will bypass select())
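
The two differences can be illustrated with a rough duty-cycle sketch. This is not stress-ng's code: it only mimics the pattern described above, where a load below 100% busy-loops for part of each time slice and then blocks in select() for the remainder, while a 100% load never sleeps. The 100 ms slice length and the function names are assumptions made for the example.

```python
import select
import time


def split_slice(load_percent, slice_us=100_000):
    """Split one time slice into (busy_us, idle_us) for a target load percentage."""
    busy_us = slice_us * load_percent // 100
    return busy_us, slice_us - busy_us


def run_slices(load_percent, slices, slice_us=100_000):
    """Busy-loop for the busy part of each slice and sleep in select() for the rest.

    Returns how many times select() was called, i.e. how often the task
    entered the kernel to block.
    """
    busy_us, idle_us = split_slice(load_percent, slice_us)
    sleeps = 0
    for _ in range(slices):
        deadline = time.monotonic() + busy_us / 1_000_000
        while time.monotonic() < deadline:
            pass  # burn CPU until the busy part of the slice is over
        if idle_us > 0:
            # With empty fd sets, select() acts purely as a timed sleep (POSIX systems).
            select.select([], [], [], idle_us / 1_000_000)
            sleeps += 1
    return sleeps
```

At 70% load every slice ends with a blocking call, so the CPU repeatedly switches between the stressor and idle. At 100% the sleep path is skipped entirely, which is one way the two workloads could disturb a co-located cyclictest thread differently.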

And before I saw the suggestion from Marcelo and Juri, I've already started a 24h test on --cpu-load=100 so let me update this result first:

# Total: 069240667 069240663 069240659 069240656
# Min Latencies: 00007 00007 00007 00007
# Avg Latencies: 00007 00007 00007 00007
# Max Latencies: 00017 00018 00018 00019

I think this shows that either the cpu load level or the select() syscall has a negative effect on latency (leaving CAT aside for now).

I noticed that Pei has halt_poll_ns set to zero.  Note that I didn't do that (since no one told me to, yet... :) and it's 200000, and it seems there is a difference between mine and Pei too (maybe because of this).  Jianzhu, are you setting halt_poll_ns to 0 on your host?

And I really want to know whether my test result can be reproduced somewhere else, especially in Jianzhu's environment.  Jianzhu, would you give it a shot?

Comment 24 jianzzha 2019-03-29 04:56:10 UTC

(In reply to Marcelo Tosatti from comment #6)
> See man pqos for more details.
> Step 1) Define CLOSID's (each CLOSID will map a certain part of the L3
> cache).
> 
> 
> Follows an example on a local machine (with only 4 COSID's).
> 
> # pqos -e "llc:0=0x000f;llc:1=0x00f0;llc:2=0x0f00;llc:3=0xf000"
> NOTE:  Mixed use of MSR and kernel interfaces to manage
>        CAT or CMT & MBM may lead to unexpected behavior.
> SOCKET 0 L3CA COS0 => MASK 0xf
> SOCKET 1 L3CA COS0 => MASK 0xf
> SOCKET 0 L3CA COS1 => MASK 0xf0
> SOCKET 1 L3CA COS1 => MASK 0xf0
> SOCKET 0 L3CA COS2 => MASK 0xf00
> SOCKET 1 L3CA COS2 => MASK 0xf00
> SOCKET 0 L3CA COS3 => MASK 0xf000
> SOCKET 1 L3CA COS3 => MASK 0xf000
> Allocation configuration altered.
> 
> In your case, there are more COSID's available: you can create 9 CLOSID's,
> each of them with a part of the L3 cache (so that there is no overlap in
> the bits between CLOSID's), as follows:
> 
> CLOSID0 = 0xFFFFF
> CLOSID1 = 0x3
> CLOSID2 = 0xC
> CLOSID3 = 0x30
> CLOSID4 = 0xC0
> CLOSID5 = 0x300
> CLOSID6 = 0xC00
> CLOSID7 = 0x3000
> CLOSID8 = 0xC000
> 
> So command line would be
> 
> pqos -e "llc:0=0xFFFFF;llc:1=0x3;llc:2=0xC;llc:3= ..."
> 
> Step 2) Map CLOSID's to cores.
> 
>        -a CLASS2CORE, --alloc-assoc=CLASS2CORE
>               associate allocation classes with cores. CLASS2CORE format is
> "TYPE:ID=CORE_LIST;...".
>               For CAT, TYPE is "llc" and ID is a class number. CORE_LIST is
> comma or dash separated list of cores.
>               For example "-a llc:0=0,2,4,6-10;llc:1=1;" associates cores 0,
> 2, 4, 6, 7, 8, 9, 10 with CAT class 0 and core  1  with
>               class 1.
> 
> so that would be
> 
> # pqos -a "llc:0=host_cpus;llc:1=pcpu_of_vcpu1;llc:2=pcpu_of_vcpu2;..."

What should host_cpus be set to here? For example, on my system:

NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23


[root@compute-0 ~]# virsh emulatorpin instance-00000001
emulator: CPU Affinity
----------------------------------
       *: 20

[root@compute-0 ~]# virsh vcpupin instance-00000001
VCPU: CPU Affinity
----------------------------------
   0: 4
   1: 6
   2: 8
   3: 10
   4: 12
   5: 14
   6: 16
   7: 18
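
For scripting the CAT setup, the pinning shown above can be turned into a mapping. The sketch below assumes the `virsh vcpupin` output format exactly as pasted here ("VCPU: affinity" rows under a header); the function name is hypothetical.

```python
import re


def parse_vcpupin(text):
    """Parse `virsh vcpupin DOMAIN` output into {vcpu_number: host_cpu_affinity}."""
    pinning = {}
    for line in text.splitlines():
        # Data rows look like "   0: 4"; the header and separator lines do not match.
        match = re.match(r"\s*(\d+):\s*(\S+)\s*$", line)
        if match:
            # The affinity is kept as a string because virsh may print ranges.
            pinning[int(match.group(1))] = match.group(2)
    return pinning
```

For the output above this yields vcpu 0 on host CPU 4 through vcpu 7 on host CPU 18, which are the physical CPUs that would each need their own cache allocation class.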

Comment 25 jianzzha 2019-03-29 05:05:50 UTC

(In reply to Peter Xu from comment #23)

> difference between mine and Pei too (maybe because of this).  Jianzhu, are
> you setting halt_poll_ns to 0 on your host?

I have the default 200000

> And I really want to know whether my test result can reproduce somewhere
> else, especially Jianzhu's environment.  Jianzhu, would you give it a shot?

I have openstack setup again on this system and can try the 100% load.

Why did you use 6 CPUs, though? The requirement was an 8-vCPU VM with vCPUs 0-1 for housekeeping.

Comment 26 Peter Xu 2019-03-29 05:18:02 UTC

(In reply to jianzzha from comment #25)
> (In reply to Peter Xu from comment #23)
> 
> > difference between mine and Pei too (maybe because of this).  Jianzhu, are
> > you setting halt_poll_ns to 0 on your host?
> 
> I have the default 200000
> 
> > And I really want to know whether my test result can reproduce somewhere
> > else, especially Jianzhu's environment.  Jianzhu, would you give it a shot?
> 
> I have openstack setup again on this system and can try the 100% load.
> 
> Why did you use 6 CPU though? the requirement was to have 8 vcpu VM and 0-1
> for housekeeping

I thought it would not matter that much, so I picked an arbitrary number of vcpus.  If you think it matters, I can simply re-run those tests with exactly your cpu allocation.

But you said "0-1 for housekeeping" here, while in comment 0 you were using 2 housekeeping vcpus: how many housekeeping vcpus are you actually using?  I'll use exactly those vcpus in my future tests.

Thanks,

Comment 27 jianzzha 2019-03-29 05:30:48 UTC

(In reply to Peter Xu from comment #26)

> But since you said "0-1 for housekeeping" but instead you were using 2
> housekeeping vcpus in comment 0 - how many housekeeping vcpus are you using
> in fact?  I'll use exactly those vcpus in my future tests.
> 
Ah, I didn't make that clear: vcpus 0-1 are for housekeeping and 2-7 are for cyclictest; when running cyclictest, all of 0-7 are under stress.

I just tried 100% load on all vcpus and I can still see >20us within just 2 runs, so it doesn't really make a difference for me.

I guess we first need to agree on a gold image version to use. On the OpenStack setup I have the following on both host and guest:

3.10.0-862.14.4.rt56.821.el7.x86_64

Comment 28 jianzzha 2019-03-29 05:34:24 UTC

(In reply to Peter Xu from comment #26)

> I thought it should not really matter that much so I used a random number of
> vcpus.  If you think that matters I can simply re-run those tests with
> exactly your cpu allcoation.

The number of vCPUs absolutely makes a big difference: in the OSP test I noticed that an 8-vCPU guest has much higher latency than a 2-vCPU guest. Did you not see this in the RT test?

Comment 29 Peter Xu 2019-03-29 05:48:01 UTC

(In reply to jianzzha from comment #28)
> (In reply to Peter Xu from comment #26)
> 
> > I thought it should not really matter that much so I used a random number of
> > vcpus.  If you think that matters I can simply re-run those tests with
> > exactly your cpu allcoation.
> 
> The number of vCPU absolutely make big difference, in OSP test I noticed a
> 8-vCPU guest has much higher latency than 2-vCPU guest. Did you guys not see
> this in the RT test?

Yeah I think there should be a difference at least for 2 vs 8.  I used 6 only for an initial setup; I'll change to yours.

Comment 34 Marcelo Tosatti 2019-03-29 13:12:32 UTC

(In reply to Luiz Capitulino from comment #22)
> (In reply to Peter Xu from comment #18)
> 
> > So I really suspect the workload that we're using is affecting the
> > test result of cyclictest.
> 
> I also suspected this when we discussed this issue by email before opening
> this BZ. I think we have two options: Juri suggestion from previous comment,
> or you could try yourself to run "stress --cpu 1" in vcpu0 and vcpu1 to see
> if you still get good latencies.
> 
> Btw, we have to see how stress-ng calculates load, it may be doing some
> system call that's causing this...
> 
> Marcelo, Do you agree with this plan? Ie. Spend some more time understand the
> differences Peter spotted before trying CAT?

I agree it's good to attempt to get low latencies before trying CAT, and to use CAT only if absolutely necessary.

Comment 35 Peter Xu 2019-03-29 14:22:56 UTC

Ok, here are the initial results I got after switching to 8 vcpus and updating the kernel.

I'm trying to summarize things a bit since the thread is already getting long, so that we can avoid paging up and down.

   - basic environment
     - host/guest kernel: 3.10.0-862.14.4.rt56.821.el7.x86_64 (Jianzhu's version)
     - spectre/meltdown mitigations: all off
   - host configuration (16 cpus)
     - host node 0: 0,2,4,6,8,10,12,14
     - host node 1: 1,3,5,7,9,11,13,15
   - guest pinning (8 vcpus)
     - guest emulator: using CPU 2,4
     - guest housekeeping: using CPU 6,8
     - guest real-time: using CPU 5,7,9,11,13,15

   |-------+--------------------------+----------+-------------|
   | index | environment              | duration | max latency |
   |-------+--------------------------+----------+-------------|
   |     1 | 6 vcpus, 100% rtcpu-only | 24h      | 17us        |
   |     2 | 8 vcpus, 100% rtcpu-only | 2h       | 16us        |
   |     3 | 8 vcpus, 100% allcpu     | 30m      | 16us        |
   |     4 | 8 vcpus, 70% allcpu      | 1h       | 28us        |
   |     5 | 8 vcpus, 70% rtcpu-only  | 20m      | 24us        |
   |-------+--------------------------+----------+-------------|

   - 100%/70% means the cpu workload of "--cpu-load".
   - rtcpu-only means only adding cpu load to rt cpus, so housekeeping
     cpus are idle; while allcpu means adding load to all cpus

Entry 1 is the 6-vcpu run, kept for reference.  Entries 2-5 are new ones with 8 vcpus.

Conclusions:

1. Compare 1 with 2: the number of vcpus (6 or 8) seems to make no difference.
2. Entry 4: this is the setup from comment 0 when the bug was reported, so the spike is reproduced.
3. Compare 3 with 4: this again confirms that the workload matters.
4. Compare 4 with 5: this suggests that the housekeeping cpu workload does not matter much, because both cases generate spikes.
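
The same reading of the table can be expressed as a small check. The rows below are copied from the results table in this comment; the 20 us threshold comes from comment 0, and the helper names are made up for this sketch.

```python
# Rows copied from the table above: (index, environment, duration, max latency in us).
RESULTS = [
    (1, "6 vcpus, 100% rtcpu-only", "24h", 17),
    (2, "8 vcpus, 100% rtcpu-only", "2h", 16),
    (3, "8 vcpus, 100% allcpu", "30m", 16),
    (4, "8 vcpus, 70% allcpu", "1h", 28),
    (5, "8 vcpus, 70% rtcpu-only", "20m", 24),
]

THRESHOLD_US = 20  # requirement from comment 0


def failing_entries(results, threshold_us=THRESHOLD_US):
    """Return the indexes of entries whose max latency is not below the threshold."""
    return [index for index, _env, _duration, max_us in results if max_us >= threshold_us]


def worst_by_load(results):
    """Group entries by their --cpu-load value and keep the worst max latency per group."""
    worst = {}
    for _index, env, _duration, max_us in results:
        load = env.split(",")[1].split("%")[0].strip()  # "100" or "70"
        worst[load] = max(worst.get(load, 0), max_us)
    return worst
```

On this data only entries 4 and 5 fail, and the worst case per load level is 17 us at 100% versus 28 us at 70%, which is the pattern behind conclusions 3 and 4.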

Jianzhu, in comment 27 you said you can still see spikes even in the case of entry 3. That does not match my test (I ran for 30min, longer than your 1.5min*3).  Before we move on to the final goal (70% load on all cpus), could you help make sure we at least get matching data for entry 3 (100% cpu load on all cpus)?  That should help ensure we have the same baseline before we dig into the 70% issue IMHO.

Please check your environment setup, BIOS, everything.  If you still cannot reproduce, maybe we can ask Pei to run entry 3 again to see what Pei gets, to make sure I didn't mess anything up.

Comment 37 Marcelo Tosatti 2019-03-29 20:52:02 UTC

(In reply to jianzzha from comment #24)
> (In reply to Marcelo Tosatti from comment #6)
> > See man pqos for more details.
> > Step 1) Define CLOSID's (each CLOSID will map a certain part of the L3
> > cache).
> > 
> > 
> > Follows an example on a local machine (with only 4 COSID's).
> > 
> > # pqos -e "llc:0=0x000f;llc:1=0x00f0;llc:2=0x0f00;llc:3=0xf000"
> > NOTE:  Mixed use of MSR and kernel interfaces to manage
> >        CAT or CMT & MBM may lead to unexpected behavior.
> > SOCKET 0 L3CA COS0 => MASK 0xf
> > SOCKET 1 L3CA COS0 => MASK 0xf
> > SOCKET 0 L3CA COS1 => MASK 0xf0
> > SOCKET 1 L3CA COS1 => MASK 0xf0
> > SOCKET 0 L3CA COS2 => MASK 0xf00
> > SOCKET 1 L3CA COS2 => MASK 0xf00
> > SOCKET 0 L3CA COS3 => MASK 0xf000
> > SOCKET 1 L3CA COS3 => MASK 0xf000
> > Allocation configuration altered.
> > 
> > In your case, there are more COSID's available: you can create 9 CLOSID's,
> > each of them with a part of the L3 cache (so that there is no overlap in
> > the bits between CLOSID's), as follows:
> > 
> > CLOSID0 = 0xFFFFF
> > CLOSID1 = 0x3
> > CLOSID2 = 0xC
> > CLOSID3 = 0x30
> > CLOSID4 = 0xC0
> > CLOSID5 = 0x300
> > CLOSID6 = 0xC00
> > CLOSID7 = 0x3000
> > CLOSID8 = 0xC000
> > 
> > So command line would be
> > 
> > pqos -e "llc:0=0xFFFFF;llc:1=0x3;llc:2=0xC;llc:3= ..."
> > 
> > Step 2) Map CLOSID's to cores.
> > 
> >        -a CLASS2CORE, --alloc-assoc=CLASS2CORE
> >               associate allocation classes with cores. CLASS2CORE format is
> > "TYPE:ID=CORE_LIST;...".
> >               For CAT, TYPE is "llc" and ID is a class number. CORE_LIST is
> > comma or dash separated list of cores.
> >               For example "-a llc:0=0,2,4,6-10;llc:1=1;" associates cores 0,
> > 2, 4, 6, 7, 8, 9, 10 with CAT class 0 and core  1  with
> >               class 1.
> > 
> > so that would be
> > 
> > # pqos -a "llc:0=host_cpus;llc:1=pcpu_of_vcpu1;llc:2=pcpu_of_vcpu2;..."
> 
> what host_cpus is set to here, say in my system,
> 
> NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22
> NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23
> 
> 
> [root@compute-0 ~]# virsh emulatorpin instance-00000001
> emulator: CPU Affinity
> ----------------------------------
>        *: 20
> 
> [root@compute-0 ~]# virsh vcpupin instance-00000001
> VCPU: CPU Affinity
> ----------------------------------
>    0: 4
>    1: 6
>    2: 8
>    3: 10
>    4: 12
>    5: 14
>    6: 16
>    7: 18

       -a CLASS2CORE, --alloc-assoc=CLASS2CORE
              associate allocation classes with cores. CLASS2CORE format is "TYPE:ID=CORE_LIST;...".
              For CAT, TYPE is "llc" and ID is a class number. CORE_LIST is comma or dash separated list of cores.
              For example "-a llc:0=0,2,4,6-10;llc:1=1;" associates cores 0, 2, 4, 6, 7, 8, 9, 10 with CAT class 0 and core 1 with class 1.

So it would be 

pqos -a "llc:0=1,3,5,7,9,11,13,15,17,19,20,21,22,23;llc:1=4;llc:2=6;llc:3=8;llc:4=10;llc:5=12;llc:6=14;llc:7=16;llc:8=18"

(check that COSID-0 has non realtime CPUs, and each realtime CPU is assigned one COSID between 1-8). 

COSID's being the values allocated in "pqos -e ...".
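
A hedged sketch of how both pqos arguments could be generated rather than typed by hand. It follows the "llc:ID=MASK" and "llc:ID=CORE_LIST" formats from the man page excerpt and the mask layout proposed for this machine (class 0 with 0xFFFFF, then two cache ways per real-time class). The function names and default parameters are assumptions for illustration, and the right masks depend on the actual cache geometry.

```python
def cat_masks(num_rt_classes, ways_per_class=2, default_mask=0xFFFFF):
    """Return {class_id: mask}: class 0 keeps default_mask, classes 1..N get disjoint slices."""
    masks = {0: default_mask}
    slice_bits = (1 << ways_per_class) - 1  # e.g. 0b11 for two cache ways
    for class_id in range(1, num_rt_classes + 1):
        masks[class_id] = slice_bits << (ways_per_class * (class_id - 1))
    return masks


def pqos_alloc_arg(masks):
    """Build the argument for `pqos -e` in the "llc:ID=MASK;..." form."""
    return ";".join(f"llc:{cid}=0x{mask:X}" for cid, mask in sorted(masks.items()))


def pqos_assoc_arg(host_cpus, rt_cpus):
    """Build the argument for `pqos -a`: class 0 for host CPUs, one class per real-time CPU."""
    parts = ["llc:0=" + ",".join(str(cpu) for cpu in host_cpus)]
    for class_id, cpu in enumerate(rt_cpus, start=1):
        parts.append(f"llc:{class_id}={cpu}")
    return ";".join(parts)
```

For example, pqos_alloc_arg(cat_masks(8)) produces the nine masks 0xFFFFF, 0x3, 0xC, ... 0xC000, and pqos_assoc_arg() with the vcpu pinning 4,6,...,18 as rt_cpus reproduces the class-to-core association written out above. Printing the strings and reviewing them before passing them to pqos is advisable.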

Comment 40 Marcelo Tosatti 2019-03-30 19:47:36 UTC

(In reply to Peter Xu from comment #35)
> Ok here comes the initial results I got after I switch to 8 vcpus and update
> the kernel.

Good report, thanks.
 
> I'm trying to summarize stuff up a bit since the thread is already getting
> long, so we may avoid doing page up and down.
> 
>    - basic environment
>      - host/guest kernel: 3.10.0-862.14.4.rt56.821.el7.x86_64 (Jianzhu's
> version)
>      - spectre/meltdown mitigations: all off
>    - host configuration (16 cpus)
>      - host node 0: 0,2,4,6,8,10,12,14
>      - host node 1: 1,3,5,7,9,11,13,15
>    - guest pinning (8 vcpus)
>      - guest emulator: using CPU 2,4
>      - guest housekeeping: using CPU 6,8
>      - guest real-time: using CPU 5,7,9,11,13,15
> 
>    |-------+--------------------------+----------+-------------|
>    | index | environment              | duration | max latency |
>    |-------+--------------------------+----------+-------------|
>    |     1 | 6 vcpus, 100% rtcpu-only | 24h      | 17us        |
>    |     2 | 8 vcpus, 100% rtcpu-only | 2h       | 16us        |
>    |     3 | 8 vcpus, 100% allcpu     | 30m      | 16us        |
>    |     4 | 8 vcpus, 70% allcpu      | 1h       | 28us        |
>    |     5 | 8 vcpus, 70% rtcpu-only  | 20m      | 24us        |
>    |-------+--------------------------+----------+-------------|
> 
>    - 100%/70% means the cpu workload of "--cpu-load".
>    - rtcpu-only means only adding cpu load to rt cpus, so housekeeping
>      cpus are idle; while allcpu means adding load to all cpus

Note how stress-ng handles the "cpu-load = 100" case and 
the "cpu-load < 100" cases. Perhaps the reason for this spike 
is there.

Can you increase cpu-load to, say, 90% and 99%? (The rtcpu-only case is sufficient.)

Also, can you please show the histogram output from cyclictest, so we
know the frequency of the >20us events?



> 
> Entry 1 is the one of 6 vcpus to reference.  Entries 2-5 are new ones with 8
> vcpus.
> 
> Conclusions:
> 
> 1. Compare 1 with 2: vcpu number (6 or 8) seems to make no difference
> 2. Entry 4: this is the initial state of comment 0 when bug reported, so
> spike reproduced
> 3. Compare 3 with 4: again it verified that the workload should matter
> something
> 4. Compare 4 with 5: it should somehow show that the housekeeping cpu
> workload does not matter much because all these two can generate spikes
> 
> Jianzhu, from comment 27 you said you can still see spikes even with the
> case of entry 3. It does not match with my test (I ran 30min, even longer
> than yours 1.5min*3).  Before we move on to the final goal (70% load on all
> cpus), could you help me to make sure we can at least have the same data
> matched with entry 3 (run 100% cpu load on all cpus)?  That should help us
> to make sure we have the same baseline before we dig into the 70% issue IMHO.
> 
> Please check your environment setup, BIOS, everything.  If you still cannot
> reproduce, maybe we can let Pei to run entry 3 again to see what Pei can get
> with it to make sure I didn't mess up anything.

Comment 41 Marcelo Tosatti 2019-03-30 20:11:34 UTC

(In reply to Marcelo Tosatti from comment #40)
> (In reply to Peter Xu from comment #35)
> > Ok here comes the initial results I got after I switch to 8 vcpus and update
> > the kernel.
> 
> Good report, thanks.
>  
> > I'm trying to summarize stuff up a bit since the thread is already getting
> > long, so we may avoid doing page up and down.
> > 
> >    - basic environment
> >      - host/guest kernel: 3.10.0-862.14.4.rt56.821.el7.x86_64 (Jianzhu's
> > version)
> >      - spectre/meltdown mitigations: all off
> >    - host configuration (16 cpus)
> >      - host node 0: 0,2,4,6,8,10,12,14
> >      - host node 1: 1,3,5,7,9,11,13,15
> >    - guest pinning (8 vcpus)
> >      - guest emulator: using CPU 2,4
> >      - guest housekeeping: using CPU 6,8
> >      - guest real-time: using CPU 5,7,9,11,13,15
> > 
> >    |-------+--------------------------+----------+-------------|
> >    | index | environment              | duration | max latency |
> >    |-------+--------------------------+----------+-------------|
> >    |     1 | 6 vcpus, 100% rtcpu-only | 24h      | 17us        |
> >    |     2 | 8 vcpus, 100% rtcpu-only | 2h       | 16us        |
> >    |     3 | 8 vcpus, 100% allcpu     | 30m      | 16us        |
> >    |     4 | 8 vcpus, 70% allcpu      | 1h       | 28us        |
> >    |     5 | 8 vcpus, 70% rtcpu-only  | 20m      | 24us        |
> >    |-------+--------------------------+----------+-------------|
> > 
> >    - 100%/70% means the cpu workload of "--cpu-load".
> >    - rtcpu-only means only adding cpu load to rt cpus, so housekeeping
> >      cpus are idle; while allcpu means adding load to all cpus
> 
> Note how stress-ng handles the "cpu-load = 100" case and 
> the "cpu-load < 100" cases. Perhaps the reason for this spike 
> is there.

You mentioned select() earlier, but also gettimeofday() is called
very often. 

> Can you increase cpu-load, to say 90% and 99% ? (only the rtcpu-only case is
> sufficient).
> 
> Also, can you please show the histogram output from cyclictest, to 
> know the frequency of the >20us events.

Another thing: now that you have another process running
on the RT CPUs, it's good to give cyclictest an explicit SCHED_FIFO priority:

Replace -p 99 with --policy fifo -p 1

The next step, if SCHED_FIFO priority fails, would be tracing 
to find out where the extra latency comes from.

Comment 42 Peter Xu 2019-04-01 03:33:01 UTC

(In reply to Marcelo Tosatti from comment #41)
> You mentioned select() earlier, but also gettimeofday() is called
> very often. 

True.

> 
> > Can you increase cpu-load, to say 90% and 99% ? (only the rtcpu-only case is
> > sufficient).

Sure.

> > 
> > Also, can you please show the histogram output from cyclictest, to 
> > know the frequency of the >20us events.
> 
> Another thing is now that you have another process running
> on the RT CPUs, its good to have a SCHED_FIFO priority for cyclictest: 
> 
> Replace -p 99 with --policy fifo -p 1
> 
> The next step, if SCHED_FIFO priority fails, would be tracing 
> to find out where the extra latency comes from.

I suspect I've already been using FIFO.  IIUC, cyclictest defaults to FIFO whenever "-p" is specified; see:

		case OPT_PRIORITY:
			priority = atoi(optarg);
			if (policy != SCHED_FIFO && policy != SCHED_RR)
				policy = SCHED_FIFO;
			break;

Since I didn't specify a policy, all my previous test results should already have been using FIFO.
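
As a runtime cross-check, independent of reading the option parser, the policy of a running thread can be queried directly. A minimal Python sketch, assuming Linux and Python 3.3+; `sched_info` is a hypothetical helper name, not part of cyclictest or rt-tests:

```python
import os

# Policy constants exposed by the os module on Linux.
_POLICIES = {
    getattr(os, name): name
    for name in ("SCHED_OTHER", "SCHED_FIFO", "SCHED_RR", "SCHED_BATCH", "SCHED_IDLE")
    if hasattr(os, name)
}

def sched_info(tid=0):
    """Return (policy name, priority) for a thread id; 0 means the caller."""
    policy = os.sched_getscheduler(tid)
    # The kernel may OR this flag into the policy value; mask it off.
    policy &= ~getattr(os, "SCHED_RESET_ON_FORK", 0)
    priority = os.sched_getparam(tid).sched_priority
    return _POLICIES.get(policy, "unknown(%d)" % policy), priority

if __name__ == "__main__":
    print(sched_info())  # policy and priority of this Python process
```

Passing the thread ids listed under /proc/<cyclictest pid>/task/ should report SCHED_FIFO with the priority given to -p, if the reading of the source above is right.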

Thanks,

Comment 43 Peter Xu 2019-04-01 08:21:33 UTC

Following up with the 90%/99% workloads; this time I'm appending the histograms:

   |-------+--------------------------+----------+-------------|
   | index | environment              | duration | max latency |
   |-------+--------------------------+----------+-------------|
   |     6 | 8 vcpus, 90% rtcpu-only  | 1h       | 31us        |
   |     7 | 8 vcpus, 99% rtcpu-only  | 2h       | 19us        |
   |-------+--------------------------+----------+-------------|

   90% cpuload on rtcpu only, ~1H:

# Histogram
000000 000000   000000  000000  000000  000000  000000
000001 000000   000000  000000  000000  000000  000000
000002 000000   000000  000000  000000  000000  000000
000003 000000   000000  000000  000000  000000  000000
000004 000000   000000  000000  000000  000000  000000
000005 000646   000611  000207  000405  000147  000326
000006 005419   004935  001974  000649  001961  003515
000007 2997397  3022392 2954316 1759617 2132180 2979399
000008 017984   016639  029222  1220846 876268  033792
000009 008019   007717  013977  004194  008258  007836
000010 266250   254322  273506  004397  281210  273484
000011 003135   002398  004394  279580  003203  003347
000012 002029   001859  003232  017097  002262  002264
000013 002046   001866  003283  005986  002000  001992
000014 058676   047713  068814  059527  053017  055798
000015 006405   006605  012614  000333  007108  006684
000016 000876   000799  001005  000001  001088  000527
000017 000788   001094  002586  000678  001400  001336
000018 000508   000772  000192  000020  000031  000019
000019 000197   000320  000341  000212  000113  000086
000020 000064   000071  000206  010446  000114  000099
000021 000213   000464  000707  001866  000368  000339
000022 000209   000280  000260  000167  000158  000039
000023 000038   000037  000054  000409  000000  000000
000024 000000   000001  000001  002601  000000  000000
000025 000000   000000  000000  000401  000000  000000
000026 000000   000000  000000  001052  000000  000000
000027 000000   000000  000000  000299  000000  000000
000028 000000   000000  000000  000100  000000  000000
000029 000000   000000  000000  000003  000000  000000
# Total: 003370899 003370895 003370891 003370886 003370886 003370882
# Min Latencies: 00005 00005 00005 00005 00005 00005
# Avg Latencies: 00007 00007 00007 00007 00007 00007
# Max Latencies: 00023 00024 00024 00031 00022 00022
# Histogram Overflows: 00000 00000 00000 00002 00000 00000
# Histogram Overflow at cycle number:
# Thread 0:
# Thread 1:
# Thread 2:
# Thread 3: 44166 3078688
# Thread 4:
# Thread 5:


   99% cpuload on rtcpu only, ~2H:
  
# Histogram
000000 000000   000000  000000  000000  000000  000000
000001 000000   000000  000000  000000  000000  000000
000002 000000   000000  000000  000000  000000  000000
000003 000000   000000  000000  000000  000000  000000
000004 000000   000000  000000  000000  000000  000000
000005 000699   000691  000430  000588  000491  000610
000006 006501   008036  007542  006682  009096  007566
000007 7342617  7341466 7342801 7344175 7345626 7339440
000008 001742   001145  001317  001229  001438  001496
000009 002614   002036  001939  002260  001493  001818
000010 045624   046621  045936  045040  041680  048736
000011 000611   000447  000470  000467  000601  000760
000012 000015   000010  000010  000007  000012  000013
000013 000011   000006  000007  000004  000005  000005
000014 000008   000005  000006  000006  000007  000006
000015 000036   000025  000026  000030  000015  000018
000016 000032   000021  000022  000016  000036  000028
000017 000002   000001  000001  000000  000000  000001
000018 000000   000000  000000  000000  000000  000000
000019 000001   000000  000000  000000  000000  000000
000020 000000   000000  000000  000000  000000  000000
000021 000000   000000  000000  000000  000000  000000
000022 000000   000000  000000  000000  000000  000000
000023 000000   000000  000000  000000  000000  000000
000024 000000   000000  000000  000000  000000  000000
000025 000000   000000  000000  000000  000000  000000
000026 000000   000000  000000  000000  000000  000000
000027 000000   000000  000000  000000  000000  000000
000028 000000   000000  000000  000000  000000  000000
000029 000000   000000  000000  000000  000000  000000
# Total: 007400513 007400510 007400507 007400504 007400500 007400497
# Min Latencies: 00005 00005 00005 00005 00005 00005
# Avg Latencies: 00007 00007 00007 00007 00007 00007
# Max Latencies: 00019 00017 00017 00016 00016 00017
# Histogram Overflows: 00000 00000 00000 00000 00000 00000
# Histogram Overflow at cycle number:
# Thread 0:
# Thread 1:
# Thread 2:
# Thread 3:
# Thread 4:
# Thread 5:
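
To put a number on how often the >= 20us events occur, the bucket rows above can be summed per thread. A minimal Python sketch, assuming the histogram text format shown above (one row per microsecond bucket, one column per thread); `count_at_or_above` is a hypothetical helper. Samples beyond the last bucket are reported separately under "# Histogram Overflows" and are not counted here.

```python
def count_at_or_above(histogram_text, threshold_us):
    """Sum, per thread, the samples in buckets >= threshold_us."""
    totals = None
    for line in histogram_text.splitlines():
        fields = line.split()
        # Data rows are all-numeric: bucket (us), then one count per thread.
        # Header and summary rows start with '#' and are skipped.
        if len(fields) < 2 or not all(f.isdigit() for f in fields):
            continue
        bucket = int(fields[0])
        counts = [int(f) for f in fields[1:]]
        if totals is None:
            totals = [0] * len(counts)
        if bucket >= threshold_us:
            totals = [t + c for t, c in zip(totals, counts)]
    return totals or []

# Three rows copied from the 90% run above, as a small usage example.
SAMPLE = """\
000019 000197   000320  000341  000212  000113  000086
000020 000064   000071  000206  010446  000114  000099
000029 000000   000000  000000  000003  000000  000000
"""

if __name__ == "__main__":
    print(count_at_or_above(SAMPLE, 20))
```

Dividing each count by the matching "# Total:" value gives the per-thread fraction of samples at or above the threshold.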

Comment 44 jianzzha 2019-04-01 11:38:29 UTC

(In reply to Peter Xu from comment #35)
> Ok here comes the initial results I got after I switch to 8 vcpus and update
> the kernel.
> 
> I'm trying to summarize stuff up a bit since the thread is already getting
> long, so we may avoid doing page up and down.
> 
>    - basic environment
>      - host/guest kernel: 3.10.0-862.14.4.rt56.821.el7.x86_64 (Jianzhu's
> version)
>      - spectre/meltdown mitigations: all off
>    - host configuration (16 cpus)
>      - host node 0: 0,2,4,6,8,10,12,14
>      - host node 1: 1,3,5,7,9,11,13,15
>    - guest pinning (8 vcpus)
>      - guest emulator: using CPU 2,4
>      - guest housekeeping: using CPU 6,8
>      - guest real-time: using CPU 5,7,9,11,13,15
> 
>    |-------+--------------------------+----------+-------------|
>    | index | environment              | duration | max latency |
>    |-------+--------------------------+----------+-------------|
>    |     1 | 6 vcpus, 100% rtcpu-only | 24h      | 17us        |
>    |     2 | 8 vcpus, 100% rtcpu-only | 2h       | 16us        |
>    |     3 | 8 vcpus, 100% allcpu     | 30m      | 16us        |
>    |     4 | 8 vcpus, 70% allcpu      | 1h       | 28us        |
>    |     5 | 8 vcpus, 70% rtcpu-only  | 20m      | 24us        |
>    |-------+--------------------------+----------+-------------|
> 
>    - 100%/70% means the cpu workload of "--cpu-load".
>    - rtcpu-only means only adding cpu load to rt cpus, so housekeeping
>      cpus are idle; while allcpu means adding load to all cpus
> 
> Entry 1 is the one of 6 vcpus to reference.  Entries 2-5 are new ones with 8
> vcpus.
> 
> Conclusions:
> 
> 1. Compare 1 with 2: vcpu number (6 or 8) seems to make no difference
> 2. Entry 4: this is the initial state of comment 0 when bug reported, so
> spike reproduced
> 3. Compare 3 with 4: again it verified that the workload should matter
> something
> 4. Compare 4 with 5: it should somehow show that the housekeeping cpu
> workload does not matter much because all these two can generate spikes
> 
> Jianzhu, from comment 27 you said you can still see spikes even with the
> case of entry 3. It does not match with my test (I ran 30min, even longer
> than yours 1.5min*3).  Before we move on to the final goal (70% load on all
> cpus), could you help me to make sure we can at least have the same data
> matched with entry 3 (run 100% cpu load on all cpus)?  That should help us
> to make sure we have the same baseline before we dig into the 70% issue IMHO.
> 
> Please check your environment setup, BIOS, everything.  If you still cannot
> reproduce, maybe we can let Pei to run entry 3 again to see what Pei can get
> with it to make sure I didn't mess up anything.

I tried entry 3) and still had no luck.
One out of 3 runs gave:
# Max Latencies: 00013 00013 00022 00018 00019 00024


The domain XML generated by OSP is:
[root@compute-0 ~]# virsh dumpxml instance-00000002
<domain type='kvm' id='1'>
  <name>instance-00000002</name>
  <uuid>5c0f7592-cc0c-4042-b46e-a0b1429d6310</uuid>
  <metadata>
    <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0">
      <nova:package version="0.0.1-3.d7864fbgit.el7ost"/>
      <nova:name>demo1</nova:name>
      <nova:creationTime>2019-04-01 10:40:16</nova:creationTime>
      <nova:flavor name="nfv">
        <nova:memory>8192</nova:memory>
        <nova:disk>60</nova:disk>
        <nova:swap>0</nova:swap>
        <nova:ephemeral>0</nova:ephemeral>
        <nova:vcpus>8</nova:vcpus>
      </nova:flavor>
      <nova:owner>
        <nova:user uuid="cc878ab426ea4b8fb5e21504409c7935">admin</nova:user>
        <nova:project uuid="458aa608e64f4c09b3d13dec9ae4a6a3">admin</nova:project>
      </nova:owner>
      <nova:root type="image" uuid="53a8a18b-bc55-4195-b6b9-872e490eec2c"/>
    </nova:instance>
  </metadata>
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <memoryBacking>
    <hugepages>
      <page size='1048576' unit='KiB' nodeset='0'/>
    </hugepages>
    <nosharepages/>
    <locked/>
  </memoryBacking>
  <vcpu placement='static'>8</vcpu>
  <cputune>
    <shares>8192</shares>
    <vcpupin vcpu='0' cpuset='4'/>
    <vcpupin vcpu='1' cpuset='6'/>
    <vcpupin vcpu='2' cpuset='8'/>
    <vcpupin vcpu='3' cpuset='10'/>
    <vcpupin vcpu='4' cpuset='12'/>
    <vcpupin vcpu='5' cpuset='14'/>
    <vcpupin vcpu='6' cpuset='16'/>
    <vcpupin vcpu='7' cpuset='18'/>
    <emulatorpin cpuset='20'/>
    <vcpusched vcpus='0' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='2' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='3' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='4' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='5' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='6' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='7' scheduler='fifo' priority='1'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='0'/>
    <memnode cellid='0' mode='strict' nodeset='0'/>
  </numatune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <sysinfo type='smbios'>
    <system>
      <entry name='manufacturer'>Red Hat</entry>
      <entry name='product'>OpenStack Compute</entry>
      <entry name='version'>0.0.1-3.d7864fbgit.el7ost</entry>
      <entry name='serial'>4c4c4544-0052-4e10-8044-b8c04f394e32</entry>
      <entry name='uuid'>5c0f7592-cc0c-4042-b46e-a0b1429d6310</entry>
      <entry name='family'>Virtual Machine</entry>
    </system>
  </sysinfo>
  <os>
    <type arch='x86_64' machine='pc-i440fx-rhel7.5.0'>hvm</type>
    <boot dev='hd'/>
    <smbios mode='sysinfo'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pmu state='off'/>
  </features>
  <cpu mode='host-passthrough' check='none'>
    <topology sockets='8' cores='1' threads='1'/>
    <feature policy='require' name='tsc-deadline'/>
    <numa>
      <cell id='0' cpus='0-7' memory='8388608' unit='KiB' memAccess='shared'/>
    </numa>
  </cpu>
  <clock offset='utc'>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/var/lib/nova/instances/5c0f7592-cc0c-4042-b46e-a0b1429d6310/disk'/>
      <backingStore type='file' index='1'>
        <format type='raw'/>
        <source file='/var/lib/nova/instances/_base/63952bd3e89784b90636bbcb855f4c62b039e2fe'/>
        <backingStore/>
      </backingStore>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <controller type='usb' index='0' model='piix3-uhci'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <interface type='bridge'>
      <mac address='fa:16:3e:15:c5:3f'/>
      <source bridge='qbr633e4bde-e1'/>
      <target dev='tap633e4bde-e1'/>
      <model type='virtio'/>
      <mtu size='9000'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/0'/>
      <log file='/var/lib/nova/instances/5c0f7592-cc0c-4042-b46e-a0b1429d6310/console.log' append='off'/>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/0'>
      <source path='/dev/pts/0'/>
      <log file='/var/lib/nova/instances/5c0f7592-cc0c-4042-b46e-a0b1429d6310/console.log' append='off'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <input type='tablet' bus='usb'>
      <alias name='input0'/>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='ps2'>
      <alias name='input1'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input2'/>
    </input>
    <graphics type='vnc' port='5900' autoport='yes' listen='172.22.33.21' keymap='en-us'>
      <listen type='address' address='172.22.33.21'/>
    </graphics>
    <video>
      <model type='cirrus' vram='16384' heads='1' primary='yes'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <stats period='10'/>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </memballoon>
  </devices>
  <seclabel type='dynamic' model='selinux' relabel='yes'>
    <label>system_u:system_r:svirt_t:s0:c438,c944</label>
    <imagelabel>system_u:object_r:svirt_image_t:s0:c438,c944</imagelabel>
  </seclabel>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+107:+107</label>
    <imagelabel>+107:+107</imagelabel>
  </seclabel>
</domain>

[root@compute-0 ~]# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-3.10.0-862.14.4.rt56.821.el7.x86_64 root=UUID=7aa9d695-b9c7-416f-baf7-7e8f89c1a3bc ro console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet default_hugepagesz=1GB hugepagesz=1G hugepages=32 iommu=pt intel_iommu=on skew_tick=1 isolcpus=4,6,8,10,12,14,16,18,20,22 intel_pstate=disable nosoftlockup nohz=on nohz_full=4,6,8,10,12,14,16,18,20,22 rcu_nocbs=4,6,8,10,12,14,16,18,20,22 spectre_v2=off nopti kvm-intel.vmentry_l1d_flush=never

[root@compute-0 ~]# uname -r
3.10.0-862.14.4.rt56.821.el7.x86_64

in the guest:
[root@host-10-1-1-4 ~]# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-3.10.0-862.14.4.rt56.821.el7.x86_64 root=UUID=6bea2b7b-e6cc-4dba-ac79-be6530d348f5 ro console=tty0 console=ttyS0,115200n8 no_timer_check net.ifnames=0 crashkernel=auto LANG=en_US.UTF-8 skew_tick=1 isolcpus=2-7 intel_pstate=disable nosoftlockup nohz=on nohz_full=2-7 rcu_nocbs=2-7 spectre_v2=off nopti default_hugepagesz=1GB hugepagesz=1G hugepages=1

[root@host-10-1-1-4 ~]# uname -r
3.10.0-862.14.4.rt56.821.el7.x86_64
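
When comparing setups, it may help to check mechanically that every pinned host CPU is covered by the isolation parameters on the kernel command line. A rough Python sketch, assuming plain CPU lists such as "2-7" or "4,6,8" (no flag prefixes); the helper names are hypothetical:

```python
def parse_cpu_list(spec):
    """Expand a CPU list such as '2-7' or '4,6,8' into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus

def isolation_sets(cmdline, keys=("isolcpus", "nohz_full", "rcu_nocbs")):
    """Return {parameter: set of CPUs} for the isolation parameters present."""
    sets = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in keys:
            sets[key] = parse_cpu_list(value)
    return sets

def unisolated(pinned, cmdline):
    """Return {parameter: pinned CPUs that the parameter does not cover}."""
    return {k: sorted(set(pinned) - v) for k, v in isolation_sets(cmdline).items()}

if __name__ == "__main__":
    # Values taken from the host cmdline and the vcpupin/emulatorpin entries above.
    host = ("isolcpus=4,6,8,10,12,14,16,18,20,22 "
            "nohz_full=4,6,8,10,12,14,16,18,20,22 "
            "rcu_nocbs=4,6,8,10,12,14,16,18,20,22")
    pinned = [4, 6, 8, 10, 12, 14, 16, 18, 20]
    print(unisolated(pinned, host))
```

An empty list for every parameter means all pinned CPUs are isolated; on a live host the command line could be read from /proc/cmdline instead of a literal string.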

Comment 45 Marcelo Tosatti 2019-04-01 12:43:38 UTC

(In reply to Peter Xu from comment #43)
> Follow up with 90%/99% workload, this time I'm appending the histogram:
> 
>    |-------+--------------------------+----------+-------------|
>    | index | environment              | duration | max latency |
>    |-------+--------------------------+----------+-------------|
>    |     6 | 8 vcpus, 90% rtcpu-only  | 1h       | 31us        |
>    |     7 | 8 vcpus, 99% rtcpu-only  | 2h       | 19us        |
>    |-------+--------------------------+----------+-------------|
> 
>    90% cpuload on rtcpu only, ~1H:
> 
> # Histogram
> 000000 000000   000000  000000  000000  000000  000000
> 000001 000000   000000  000000  000000  000000  000000
> 000002 000000   000000  000000  000000  000000  000000
> 000003 000000   000000  000000  000000  000000  000000
> 000004 000000   000000  000000  000000  000000  000000
> 000005 000646   000611  000207  000405  000147  000326
> 000006 005419   004935  001974  000649  001961  003515
> 000007 2997397  3022392 2954316 1759617 2132180 2979399
> 000008 017984   016639  029222  1220846 876268  033792
> 000009 008019   007717  013977  004194  008258  007836
> 000010 266250   254322  273506  004397  281210  273484
> 000011 003135   002398  004394  279580  003203  003347
> 000012 002029   001859  003232  017097  002262  002264
> 000013 002046   001866  003283  005986  002000  001992
> 000014 058676   047713  068814  059527  053017  055798
> 000015 006405   006605  012614  000333  007108  006684
> 000016 000876   000799  001005  000001  001088  000527
> 000017 000788   001094  002586  000678  001400  001336
> 000018 000508   000772  000192  000020  000031  000019
> 000019 000197   000320  000341  000212  000113  000086
> 000020 000064   000071  000206  010446  000114  000099
> 000021 000213   000464  000707  001866  000368  000339
> 000022 000209   000280  000260  000167  000158  000039
> 000023 000038   000037  000054  000409  000000  000000
> 000024 000000   000001  000001  002601  000000  000000
> 000025 000000   000000  000000  000401  000000  000000
> 000026 000000   000000  000000  001052  000000  000000
> 000027 000000   000000  000000  000299  000000  000000
> 000028 000000   000000  000000  000100  000000  000000
> 000029 000000   000000  000000  000003  000000  000000
> # Total: 003370899 003370895 003370891 003370886 003370886 003370882
> # Min Latencies: 00005 00005 00005 00005 00005 00005
> # Avg Latencies: 00007 00007 00007 00007 00007 00007
> # Max Latencies: 00023 00024 00024 00031 00022 00022
> # Histogram Overflows: 00000 00000 00000 00002 00000 00000
> # Histogram Overflow at cycle number:
> # Thread 0:
> # Thread 1:
> # Thread 2:
> # Thread 3: 44166 3078688
> # Thread 4:
> # Thread 5:
> 
> 
>    99% cpuload on rtcpu only, ~2H:
>   
> # Histogram
> 000000 000000   000000  000000  000000  000000  000000
> 000001 000000   000000  000000  000000  000000  000000
> 000002 000000   000000  000000  000000  000000  000000
> 000003 000000   000000  000000  000000  000000  000000
> 000004 000000   000000  000000  000000  000000  000000
> 000005 000699   000691  000430  000588  000491  000610
> 000006 006501   008036  007542  006682  009096  007566
> 000007 7342617  7341466 7342801 7344175 7345626 7339440
> 000008 001742   001145  001317  001229  001438  001496
> 000009 002614   002036  001939  002260  001493  001818
> 000010 045624   046621  045936  045040  041680  048736
> 000011 000611   000447  000470  000467  000601  000760
> 000012 000015   000010  000010  000007  000012  000013
> 000013 000011   000006  000007  000004  000005  000005
> 000014 000008   000005  000006  000006  000007  000006
> 000015 000036   000025  000026  000030  000015  000018
> 000016 000032   000021  000022  000016  000036  000028
> 000017 000002   000001  000001  000000  000000  000001
> 000018 000000   000000  000000  000000  000000  000000
> 000019 000001   000000  000000  000000  000000  000000
> 000020 000000   000000  000000  000000  000000  000000
> 000021 000000   000000  000000  000000  000000  000000
> 000022 000000   000000  000000  000000  000000  000000
> 000023 000000   000000  000000  000000  000000  000000
> 000024 000000   000000  000000  000000  000000  000000
> 000025 000000   000000  000000  000000  000000  000000
> 000026 000000   000000  000000  000000  000000  000000
> 000027 000000   000000  000000  000000  000000  000000
> 000028 000000   000000  000000  000000  000000  000000
> 000029 000000   000000  000000  000000  000000  000000
> # Total: 007400513 007400510 007400507 007400504 007400500 007400497
> # Min Latencies: 00005 00005 00005 00005 00005 00005
> # Avg Latencies: 00007 00007 00007 00007 00007 00007
> # Max Latencies: 00019 00017 00017 00016 00016 00017
> # Histogram Overflows: 00000 00000 00000 00000 00000 00000
> # Histogram Overflow at cycle number:
> # Thread 0:
> # Thread 1:
> # Thread 2:
> # Thread 3:
> # Thread 4:
> # Thread 5:

One theory is that the following is happening:

1) stress-ng runs and dirties the cache.
2) stress-ng goes to sleep.
3) cyclictest wakes up the CPU, finds the cache dirtied by the stress-ng run,
and the resulting cache misses incur the additional microseconds being seen.

Comment 50 Peter Xu 2019-04-02 03:03:26 UTC

(In reply to jianzzha from comment #44)
> I tried 3), still no luck. 
> one out of 3 runs,
> # Max Latencies: 00013 00013 00022 00018 00019 00024

...

>     <vcpupin vcpu='0' cpuset='4'/>
>     <vcpupin vcpu='1' cpuset='6'/>
>     <vcpupin vcpu='2' cpuset='8'/>
>     <vcpupin vcpu='3' cpuset='10'/>
>     <vcpupin vcpu='4' cpuset='12'/>
>     <vcpupin vcpu='5' cpuset='14'/>
>     <vcpupin vcpu='6' cpuset='16'/>
>     <vcpupin vcpu='7' cpuset='18'/>
>     <emulatorpin cpuset='20'/>

Jianzhu,

I noticed that you isolated only some CPUs of node 0 and none of node 1.  I'm not sure whether this matters, but is it possible to isolate node 1 instead?  I was trying to run only the rt workload on node 1 and everything else (host and guest housekeeping, and also the emulator code) on node 0:

   - host configuration (16 cpus)
     - host node 0: 0,2,4,6,8,10,12,14
     - host node 1: 1,3,5,7,9,11,13,15
   - guest pinning (8 vcpus)
     - guest emulator: using CPU 2,4                            <-------------- this is node 0 only
     - guest housekeeping: using CPU 6,8                        <-------------- this is node 0 only
     - guest real-time: using CPU 5,7,9,11,13,15                <-------------- this is node 1 only
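
The layout above can be checked with a few lines of Python. This is only a sketch using the CPU numbers listed in this comment; `nodes_used` is a hypothetical helper, and on a real host the node membership would come from lscpu or sysfs rather than a literal table:

```python
def nodes_used(cpus, node_cpus):
    """Return the set of NUMA nodes that the given host CPUs belong to."""
    wanted = set(cpus)
    return {node for node, members in node_cpus.items() if wanted & set(members)}

# Host layout and pinning as listed above.
NODE_CPUS = {0: range(0, 16, 2), 1: range(1, 16, 2)}
ROLES = {
    "emulator": [2, 4],
    "housekeeping": [6, 8],
    "real-time": [5, 7, 9, 11, 13, 15],
}

if __name__ == "__main__":
    for role, cpus in ROLES.items():
        print(role, sorted(nodes_used(cpus, NODE_CPUS)))
```

A role that reports more than one node is straddling NUMA nodes, which is the situation this layout tries to avoid for the real-time vcpus.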

Thanks,

Comment 51 jianzzha 2019-04-02 13:05:55 UTC

(In reply to Peter Xu from comment #50)
> (In reply to jianzzha from comment #44)
> > I tried 3), still no luck. 
> > one out of 3 runs,
> > # Max Latencies: 00013 00013 00022 00018 00019 00024
> 
> ...
> 
> >     <vcpupin vcpu='0' cpuset='4'/>
> >     <vcpupin vcpu='1' cpuset='6'/>
> >     <vcpupin vcpu='2' cpuset='8'/>
> >     <vcpupin vcpu='3' cpuset='10'/>
> >     <vcpupin vcpu='4' cpuset='12'/>
> >     <vcpupin vcpu='5' cpuset='14'/>
> >     <vcpupin vcpu='6' cpuset='16'/>
> >     <vcpupin vcpu='7' cpuset='18'/>
> >     <emulatorpin cpuset='20'/>
> 
> Jianzhu,
> 
> I noticed that you only isolated some cpus of node 0 but not node 1.  I'm
> not sure whether this will matter but... is it possible to isolate the node
> 1 instead?  I was trying to only run rt workload on node 1 and all the rest
> (housekeeping of host and guest, and also the emulator codes) on node 0:
> 
>    - host configuration (16 cpus)
>      - host node 0: 0,2,4,6,8,10,12,14
>      - host node 1: 1,3,5,7,9,11,13,15
>    - guest pinning (8 vcpus)
>      - guest emulator: using CPU 2,4              <-------------- this is node 0 only
>      - guest housekeeping: using CPU 6,8          <-------------- this is node 0 only
>      - guest real-time: using CPU 5,7,9,11,13,15  <-------------- this is node 1 only
> 
> Thanks,

I tried it; no good (actually I think it is even worse, as one spike went up to 34us).

Can you paste your domain XML so we can compare what else differs?

Comment 57 Pei Zhang 2019-04-03 01:33:13 UTC

After talking with Peter, I've made the changes below in my testing:
 - Disabled l1d flush (I missed this step in past testing)
 - Used host cores from NUMA node 1 (instead of NUMA node 0)

The latency looks better than in my past tests, but can still exceed 20us a bit.

In guest:
# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-862.14.4.rt56.821.el7.x86_64 root=/dev/mapper/rhel_vm--74--76-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto rd.lvm.lv=rhel_vm-74-76/root rd.lvm.lv=rhel_vm-74-76/swap rhgb quiet default_hugepagesz=1G iommu=pt intel_iommu=on kvm-intel.vmentry_l1d_flush=never spectre_v2=off nopti skew_tick=1 isolcpus=2,3,4,5,6,7 intel_pstate=disable nosoftlockup nohz=on nohz_full=2,3,4,5,6,7 rcu_nocbs=2,3,4,5,6,7

In host:
# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-862.14.4.rt56.821.el7.x86_64 root=/dev/mapper/rhel_dell--per430--09-root ro crashkernel=auto rd.lvm.lv=rhel_dell-per430-09/root rd.lvm.lv=rhel_dell-per430-09/swap console=ttyS0,115200n81 default_hugepagesz=1G iommu=pt intel_iommu=on kvm-intel.vmentry_l1d_flush=never spectre_v2=off nopti LANG=en_US.UTF-8 skew_tick=1 isolcpus=1,3,5,7,9,11,13,15,17,19,18,16,14 intel_pstate=disable nosoftlockup nohz=on nohz_full=1,3,5,7,9,11,13,15,17,19,18,16,14 rcu_nocbs=1,3,5,7,9,11,13,15,17,19,18,16,14

(1)70% cpu load
# cat stress_ng.sh
echo running stress
cpu_list="0 1 2 3 4 5 6 7"
for cpu in $cpu_list; do
        taskset -c $cpu stress-ng --cpu 1 --cpu-load 70 --cpu-method loop --timeout 24h &
done

# cyclictest -p 99 -t 6 -h 30 -m -n -a 2-7 -D 90m
Result:
# Min Latencies: 00005 00005 00005 00005 00005 00005
# Avg Latencies: 00007 00007 00006 00006 00007 00006
# Max Latencies: 00023 00015 00014 00013 00015 00013

(2)100% cpu load
# cat stress.sh
echo running stress
cpu_list="0 1 2 3 4 5 6 7"
for cpu in $cpu_list; do
        taskset -c $cpu stress --cpu 1 &
done

# cyclictest -p 99 -t 6 -h 30 -m -n -a 2-7 -D 10h
Result:
# Min Latencies: 00007 00007 00007 00007 00007 00007
# Avg Latencies: 00014 00007 00007 00007 00007 00007
# Max Latencies: 00022 00015 00016 00016 00015 00016


Summary: in my setup, with either 70% or 100% cpu load, the max latency can exceed 20us (not always, but in some runs).
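
For repeated runs, the pass/fail check against the 20us requirement can be scripted. A minimal Python sketch, assuming the "# Max Latencies:" summary line format shown above; the helper names are hypothetical:

```python
import re

def max_latencies(cyclictest_output):
    """Return the per-thread max latencies (us) from the summary line."""
    for line in cyclictest_output.splitlines():
        match = re.match(r"#\s*Max Latencies:\s*(.*)", line)
        if match:
            return [int(value) for value in match.group(1).split()]
    return []

def over_budget(cyclictest_output, budget_us=20):
    """Return (thread index, latency) pairs at or above the budget."""
    return [(i, v) for i, v in enumerate(max_latencies(cyclictest_output))
            if v >= budget_us]

if __name__ == "__main__":
    # Summary line from the 70% cpu load run above.
    print(over_budget("# Max Latencies: 00023 00015 00014 00013 00015 00013"))
```

The comparison uses >= because the requirement in the bug summary is a max latency strictly below 20us.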

Comment 58 Peter Xu 2019-04-03 06:10:33 UTC

(In reply to jianzzha from comment #51)
> can you paste your domain xml, let's compare what other difference is.

[root@virtlab422 ~]# virsh dumpxml rhel7-rt
<domain type='kvm' id='3'>
  <name>rhel7-rt</name>
  <uuid>12056a04-48c9-11e9-9820-1866da5ff2ec</uuid>
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <memoryBacking>
    <hugepages>
      <page size='1048576' unit='KiB'/>
    </hugepages>
    <locked/>
  </memoryBacking>
  <vcpu placement='static'>8</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='6'/>
    <vcpupin vcpu='1' cpuset='8'/>
    <vcpupin vcpu='2' cpuset='5'/>
    <vcpupin vcpu='3' cpuset='7'/>
    <vcpupin vcpu='4' cpuset='9'/>
    <vcpupin vcpu='5' cpuset='11'/>
    <vcpupin vcpu='6' cpuset='13'/>
    <vcpupin vcpu='7' cpuset='15'/>
    <emulatorpin cpuset='2,4'/>
    <vcpusched vcpus='0' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='2' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='3' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='4' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='5' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='6' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='7' scheduler='fifo' priority='1'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-q35-rhel7.6.0'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <pmu state='off'/>
    <vmport state='off'/>
    <ioapic driver='qemu'/>
  </features>
  <cpu mode='host-passthrough' check='none'>
    <feature policy='require' name='tsc-deadline'/>
  </cpu>
  <clock offset='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none' io='threads' iommu='on' ats='on'/>
      <source file='/home/images/rhel7-rt.qcow2'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </disk>
    <controller type='usb' index='0' model='none'>
      <alias name='usb'/>
    </controller>
    <controller type='pci' index='0' model='pcie-root'>
      <alias name='pcie.0'/>
    </controller>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0x0'/>
      <alias name='pci.1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </controller>
    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0x0'/>
      <alias name='pci.2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0x0'/>
      <alias name='pci.3'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0x0'/>
      <alias name='pci.4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0x0'/>
      <alias name='pci.5'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </controller>
    <controller type='sata' index='0'>
      <alias name='ide'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <interface type='network'>
      <mac address='52:54:00:c4:6e:1e'/>
      <source network='default' bridge='virbr0'/>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/1'/>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/1'>
      <source path='/dev/pts/1'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <input type='mouse' bus='ps2'>
      <alias name='input0'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input1'/>
    </input>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </memballoon>
    <iommu model='intel'>
      <driver intremap='on' caching_mode='on' iotlb='on'/>
    </iommu>
  </devices>
  <seclabel type='dynamic' model='selinux' relabel='yes'>
    <label>system_u:system_r:svirt_t:s0:c160,c234</label>
    <imagelabel>system_u:object_r:svirt_image_t:s0:c160,c234</imagelabel>
  </seclabel>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+107:+107</label>
    <imagelabel>+107:+107</imagelabel>
  </seclabel>
</domain>

Comment 76 jianzzha 2019-04-08 20:57:33 UTC

@Peter, Pei,

Note that in comments 60 and 61 there are some differences between the domain XML set up by OSP13 and your non-OSP test setup. Some of these differences might account for the latency difference we observed. We need to find out which XML items, if any, affect the cyclictest results.

Nova doesn't allow manual edits of the XML; in most cases it will overwrite them. Instead, can you edit your domain XML to match the OSP13 settings and see which item, if any, causes the degradation? If such an item is identified, we can update the Nova code.

Comment 78 Pei Zhang 2019-04-09 06:38:59 UTC

(In reply to jianzzha from comment #76)
> @Peter, Pei,
> 
> notice in comment 60 and 61, there are some difference on the domain xml
> setup by OSP13 versus your non-osp test setup. Some of the difference might
> account for the latency difference we observed. we need to find out if/what
> the xml item difference can impact the cyclictest difference.
> 
> Nova doesn't allow manual edit of the xml, in most case it will overwrite
> manual editing. In stead, can you edit your domain XML setting to match the
> OSP13 setting and see if/what item cause degradation. If such item
> difference exists and identified, we can update nova code.

Hi Jianzhu, Peter,

My servers are running the regular RHEL 7.7 testing now, and this run will not finish until tomorrow. After that I'll keep the testing environment unchanged (e.g. all package versions) and replace the XML with the OSP XML provided by Jianzhu in Comment 44. Once everything has finished, I'll report the latency difference between the OSP XML config and our current XML config.

Regarding the machine type mentioned in Comment 61: q35 is fully supported on RHEL 7.6 and newer, but the default machine type on RHEL 7.6+ is still pc-i440fx, so my regular testing on RHEL 7.6+ uses pc-i440fx.

As I remember from our past RHEL 7.6 testing, the machine type did not affect the latency results.

Also, q35 is the default machine type on RHEL 8.

Best regards,
Pei

Comment 82 Pei Zhang 2019-04-15 14:30:22 UTC

Hi Jianzhu, 

I've tested the OSP KVM-RT XML from Comment 44, and the max latency is much higher. I tried several times removing some devices (vnc, cirrus, console, usb, ...), but still get much higher latency. I'll keep looking for the config item that causes this spike difference.

KVM-RT XML from OSP: (10h cyclictest result)
# Total: 038718584 038718559 038718545 038718499 038718514 038718411
# Min Latencies: 00006 00006 00006 00006 00006 00006
# Avg Latencies: 00011 00011 00011 00011 00011 00011
# Max Latencies: 00054 00057 00062 00063 00063 00063

KVM-RT XML from our past testings: (20h cyclictest result)
# Total: 081445061 081445077 081445069 081445067 081445056 081445053
# Min Latencies: 00006 00006 00006 00006 00006 00006
# Avg Latencies: 00010 00010 00010 00010 00010 00010
# Max Latencies: 00015 00022 00021 00016 00019 00015
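
The two result sets above can be compared mechanically. This is a minimal sketch, assuming the summary text has the "# Max Latencies: ..." form shown above. The helper name max_latencies and the THRESHOLD_US constant are illustrative and not part of cyclictest; the 20 us value is taken from this bug's summary.

```python
import re

THRESHOLD_US = 20  # target from the bug summary: max latency < 20 us


def max_latencies(summary: str) -> list[int]:
    """Return the per-thread max latencies (us) from cyclictest summary text."""
    match = re.search(r"^# Max Latencies:\s*(.*)$", summary, re.MULTILINE)
    if match is None:
        raise ValueError("no '# Max Latencies:' line found")
    return [int(value) for value in match.group(1).split()]


# Summary lines copied from the two runs reported in this comment.
osp_xml = """\
# Min Latencies: 00006 00006 00006 00006 00006 00006
# Avg Latencies: 00011 00011 00011 00011 00011 00011
# Max Latencies: 00054 00057 00062 00063 00063 00063
"""
past_xml = """\
# Min Latencies: 00006 00006 00006 00006 00006 00006
# Avg Latencies: 00010 00010 00010 00010 00010 00010
# Max Latencies: 00015 00022 00021 00016 00019 00015
"""

for name, text in (("OSP XML", osp_xml), ("past XML", past_xml)):
    worst = max(max_latencies(text))
    verdict = "ok" if worst < THRESHOLD_US else "over threshold"
    print(f"{name}: worst max latency {worst} us ({verdict})")
```

On these numbers the worst case is 63 us with the OSP XML and 22 us with the past XML, so the gap between the two configurations is about 41 us.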

Versions:
3.10.0-957.15.1.rt56.927skipktimersoftd1.el7.x86_64


Best regards,
Pei

Comment 87 Pei Zhang 2019-04-18 00:07:52 UTC

Hi Jianzhu,

With the KVM-RT XML from OSP, the high spike was caused by <stats period='10'/> in the memballoon device:


    <memballoon model='virtio'>
      <stats period='10'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </memballoon>


After removing <stats period='10'/> and keeping the rest of the XML config the same (including all your devices, e.g. usb, vnc, video, ...), the max latency is equal to that of the KVM-RT XML from platform.
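
For a manual libvirt test like this one, the same change can be scripted. This is a minimal sketch using Python's standard xml.etree.ElementTree. The function name drop_balloon_stats and the trimmed sample domain are illustrative only. It assumes the domain XML is available as a string (for example from "virsh dumpxml"), and it does not apply to Nova-managed guests, where Nova regenerates the XML.

```python
import xml.etree.ElementTree as ET


def drop_balloon_stats(domain_xml: str) -> str:
    """Remove every <stats> child from <memballoon> elements in a domain XML."""
    root = ET.fromstring(domain_xml)
    # Materialize the list first so the tree is not modified while iterating.
    for balloon in list(root.iter("memballoon")):
        for stats in balloon.findall("stats"):
            balloon.remove(stats)
    return ET.tostring(root, encoding="unicode")


# Trimmed, hypothetical domain containing only the device discussed here.
sample = """\
<domain type='kvm'>
  <devices>
    <memballoon model='virtio'>
      <stats period='10'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </memballoon>
  </devices>
</domain>
"""

cleaned = drop_balloon_stats(sample)
print(cleaned)
```

The memballoon device and its address are kept; only the periodic statistics polling element is dropped.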


Best regards,
Pei

Comment 89 Luiz Capitulino 2019-04-18 17:29:18 UTC

(In reply to Pei Zhang from comment #87)
> Hi Jianzhu,
> 
> With KVM-RT XML from OSP, the high spike was caused by <stats period='10'/>
> from memballoon device:
> 
> 
>     <memballoon model='virtio'>
>       <stats period='10'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x05'
> function='0x0'/>
>     </memballoon>
> 
> 
> Removing <stats period='10'/>, others XML config keep same(including all
> your devices, eg. usb, vnc, video..), the max latency value is equal to the
> KVM-RT xml from platform.

Great find Pei!

Would you please open a BZ against RHOS for this issue? Please, check
bug 1646397 as an example for product, component, etc. Except that
the memballoon issue you found is a bug for KVM-RT, not an RFE.

Comment 92 Pei Zhang 2019-04-19 10:19:48 UTC

(In reply to Luiz Capitulino from comment #89)
> (In reply to Pei Zhang from comment #87)
> > Hi Jianzhu,
> > 
> > With KVM-RT XML from OSP, the high spike was caused by <stats period='10'/>
> > from memballoon device:
> > 
> > 
> >     <memballoon model='virtio'>
> >       <stats period='10'/>
> >       <address type='pci' domain='0x0000' bus='0x00' slot='0x05'
> > function='0x0'/>
> >     </memballoon>
> > 
> > 
> > Removing <stats period='10'/>, others XML config keep same(including all
> > your devices, eg. usb, vnc, video..), the max latency value is equal to the
> > KVM-RT xml from platform.
> 
> Great find Pei!
> 
> Would you please open a BZ against RHOS for this issue? Please, check
> bug 1646397 as an example for product, component, etc. Except that
> the memballoon issue you found is a bug for KVM-RT, not an RFE.

Thanks, Luiz, for the BZ reference. I filed a new RHOSP BZ to track this issue:

Bug 1701509 - <stats period='10'/> of memballoon device cause high latency spike for KVM-RT guest

Comment 93 jianzzha 2019-04-22 14:13:42 UTC

(In reply to Luiz Capitulino from comment #89)
> (In reply to Pei Zhang from comment #87)
> > Hi Jianzhu,
> > 
> > With KVM-RT XML from OSP, the high spike was caused by <stats period='10'/>
> > from memballoon device:
> > 
> > 
> >     <memballoon model='virtio'>
> >       <stats period='10'/>
> >       <address type='pci' domain='0x0000' bus='0x00' slot='0x05'
> > function='0x0'/>
> >     </memballoon>
> > 
> > 
> > Removing <stats period='10'/>, others XML config keep same(including all
> > your devices, eg. usb, vnc, video..), the max latency value is equal to the
> > KVM-RT xml from platform.
> 
> Great find Pei!
> 
> Would you please open a BZ against RHOS for this issue? Please, check
> bug 1646397 as an example for product, component, etc. Except that
> the memballoon issue you found is a bug for KVM-RT, not an RFE.

Indeed great finding! Excellent team work.

Luiz, is there anything that needs to be tackled on the OpenStack side for this issue?

Comment 96 Luiz Capitulino 2019-04-23 19:29:54 UTC

(In reply to jianzzha from comment #93)
> (In reply to Luiz Capitulino from comment #89)
> > (In reply to Pei Zhang from comment #87)
> > > Hi Jianzhu,
> > > 
> > > With KVM-RT XML from OSP, the high spike was caused by <stats period='10'/>
> > > from memballoon device:
> > > 
> > > 
> > >     <memballoon model='virtio'>
> > >       <stats period='10'/>
> > >       <address type='pci' domain='0x0000' bus='0x00' slot='0x05'
> > > function='0x0'/>
> > >     </memballoon>
> > > 
> > > 
> > > Removing <stats period='10'/>, others XML config keep same(including all
> > > your devices, eg. usb, vnc, video..), the max latency value is equal to the
> > > KVM-RT xml from platform.
> > 
> > Great find Pei!
> > 
> > Would you please open a BZ against RHOS for this issue? Please, check
> > bug 1646397 as an example for product, component, etc. Except that
> > the memballoon issue you found is a bug for KVM-RT, not an RFE.
> 
> Indeed great finding! Excellent team work.
> 
> Luiz, is there anything that need to tackle from openstack side on this
> issue?

Yes, Pei has opened bug 1701509 for this issue. First we need to know what the
implications of not using the balloon stats are. If we can live without them, then
dropping them is the easiest solution.

Comment 99 Luiz Capitulino 2019-05-13 18:53:15 UTC

We're going to use this BZ as the main tracker for achieving max < 20us
in guests. In order to achieve it though, we need to solve two other issues:

o Bug 1550584 - spurious ktimersoftd wake ups increases latency
o Bug 1701509 - <stats period='10'/> of memballoon device cause high latency spike for KVM-RT guest

Setting them as dependencies.

Comment 106 Jean-Tsung Hsiao 2019-09-11 21:46:38 UTC

When the host is running the following kernel:
[root@netqe10 ~]# uname -r
3.10.0-1062.rt56.1022.el7.x86_64
[root@netqe10 ~]# 

The guest comes up successfully.

But when the host is running kernel-rt-3.10.0-1063.rt56.1023.el7.x86_64, the guest fails to start:

virsh # start master-virbr0 
error: Failed to start domain master-virbr0
error: unsupported configuration: Domain requires KVM, but it is not available. Check that virtualization is enabled in the host BIOS, and host configuration is setup to load the kvm modules.

Below is the qemu-kvm info:

[root@netqe10 ~]# rpm -qa | grep qemu
qemu-kvm-common-rhev-2.12.0-33.el7_7.2.x86_64
ipxe-roms-qemu-20180825-2.git133f4c.el7.noarch
libvirt-daemon-driver-qemu-4.5.0-23.el7.x86_64
qemu-kvm-tools-rhev-2.12.0-33.el7_7.2.x86_64
qemu-img-rhev-2.12.0-33.el7_7.2.x86_64
qemu-kvm-rhev-2.12.0-33.el7_7.2.x86_64

Is this a bad kernel or an outdated qemu-kvm-rhev?

Comment 107 Jean-Tsung Hsiao 2019-09-12 00:36:43 UTC

I made it work by updating kernel-rt-kvm to kernel-rt-kvm-3.10.0-1063.rt56.1023.el7.x86_64, running "dracut -v -f", and rebooting.
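
A version mismatch of this kind can be checked before starting the guest. This is a minimal sketch, assuming the failure came from a kernel-rt-kvm package that did not match the running kernel (which the fix above suggests but does not prove). The function name kvm_package_matches and the package lists are hypothetical; in practice the inputs would come from "uname -r" and "rpm -q kernel-rt-kvm".

```python
def kvm_package_matches(running_release: str, installed: list[str]) -> bool:
    """True if a kernel-rt-kvm package for the running kernel release is installed."""
    return f"kernel-rt-kvm-{running_release}" in installed


running = "3.10.0-1063.rt56.1023.el7.x86_64"  # e.g. output of `uname -r`

# Hypothetical package lists: one with only the older build, one after the update.
stale = ["kernel-rt-kvm-3.10.0-1062.rt56.1022.el7.x86_64"]
fresh = stale + ["kernel-rt-kvm-3.10.0-1063.rt56.1023.el7.x86_64"]

print(kvm_package_matches(running, stale))
print(kvm_package_matches(running, fresh))
```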

But a 30m cyclictest run has max latencies of 21 and 22 us.

Please see below and let me know if I am missing something.

[root@localhost rt-scripts]# ./run-cyclictest.sh -d 30m -c 1,2 -k

Test started at Wed Sep 11 19:59:16 EDT 2019

Test duration:    30m
Run rteval:       n
Run stress:       y
Isolated CPUs:    1,2
Kernel:           3.10.0-1063.rt56.1023.el7.x86_64
Kernel cmd-line:  BOOT_IMAGE=/vmlinuz-3.10.0-1063.rt56.1023.el7.x86_64 root=UUID=bc815b9b-25e7-4c22-9498-d1f84c446bcf ro rhgb quiet crashkernel=auto spectre_v2=retpoline console=ttyS0,115200 default_hugepagesz=1G hugepagesz=1G hugepages=4 nohz=on nohz_full=1-4 rcu_nocbs=1-4 tuned.non_isolcpus=00000001 intel_pstate=disable nosoftlockup LANG=en_US.UTF-8 skew_tick=1 isolcpus=1,2,3,4 intel_pstate=disable nosoftlockup nohz=on nohz_full=1,2,3,4 rcu_nocbs=1,2,3,4
x86 debug opts:   retp_enabled=1 pti_enabled=1 ibrs_enabled=0 ibpb_enabled=1
Machine:          localhost.localdomain
CPU:              Intel(R) Xeon(R) CPU E5-2643 v3 @ 3.40GHz
Results dir:      /root/results/cyclictest-results.xfavni

running stress
   taskset -c 1 stress --cpu 1
   taskset -c 2 stress --cpu 1

starting Wed Sep 11 19:59:17 EDT 2019
   taskset -c 1,2 cyclictest -m -n -q -p95 -D 30m -h60 -i 200 -t 2 -a 1,2
ended Wed Sep 11 20:29:17 EDT 2019

output dir is /root/results/cyclictest-results.xfavni


# Min Latencies: 00006 00009
# Avg Latencies: 00010 00010
# Max Latencies: 00022 00021

./run-cyclictest.sh: line 261:  1954 Terminated              $cmdline 2>&1 > $stress_out
./run-cyclictest.sh: line 261:  1955 Terminated              $cmdline 2>&1 > $stress_out
[root@localhost rt-scripts]#

Comment 108 Jean-Tsung Hsiao 2019-09-12 03:24:05 UTC

I have a question on the Reproducer, step 3.

The cyclictest command has no duration (-D) option, so the test duration is determined by "-l 100000". It could be very short, perhaps just 20 seconds.

So, is the intent to run cyclictest for as little as 20 seconds?

NOTE: My test mentioned in Comment #107 had a duration of 30 minutes.
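
As a rough check of the run time implied by "-l 100000", the sketch below multiplies the loop count by the base interval. The reproducer does not pass -i, so the interval is an assumption here: 200 us is the value used in the Comment 107 run (-i 200), and 1000 us is cyclictest's documented default. The function name expected_duration_s is illustrative.

```python
def expected_duration_s(loops: int, interval_us: int) -> float:
    """Approximate run time in seconds: `loops` wakeups, one per base interval."""
    return loops * interval_us / 1_000_000


print(expected_duration_s(100_000, 200))    # 200 us interval -> 20.0
print(expected_duration_s(100_000, 1_000))  # 1000 us interval -> 100.0
```

Either way the run lasts seconds to minutes, far shorter than the 30 minute run.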

Steps to Reproduce: *** COPY from bug description above ***
1. setup 8 vCPU guest, 2 for house keeping, 6 for cyclictest
2. stress all 8 cores with: for i in {0..7}; do taskset -c $i stress-ng --cpu 1 --cpu-load 70 --cpu-method loop --timeout 24h & done
3. run cyclictest 3 times in a row: cyclictest -l 100000 -p 99 -t 6 -h 30 -m -n -a 2-7

Comment 109 Luiz Capitulino 2019-09-16 14:13:47 UTC

Jean,

How is the max=22us impacting you?

Marcelo, can you take a look at this?

Comment 110 Marcelo Tosatti 2019-09-16 15:36:18 UTC

(In reply to Jean-Tsung Hsiao from comment #107)
> I made it work after updating kernel-rt-kvm to
> kernel-rt-kvm-3.10.0-1063.rt56.1023.el7.x86_64, and ran "dracut -v -f", and
> reboot.
> 
> But, 30m cycclitest has max latencies at 21 and 22 us.
> 
> Please below to see if I am missing something.
> 
> [root@localhost rt-scripts]# ./run-cyclictest.sh -d 30m -c 1,2 -k
> 
> Test started at Wed Sep 11 19:59:16 EDT 2019
> 
> Test duration:    30m
> Run rteval:       n
> Run stress:       y
> Isolated CPUs:    1,2
> Kernel:           3.10.0-1063.rt56.1023.el7.x86_64
> Kernel cmd-line:  BOOT_IMAGE=/vmlinuz-3.10.0-1063.rt56.1023.el7.x86_64

This kernel contains the fix for the problem. Make sure you run it on both
host and guest.

Also updated tuned package is necessary (on both host and guest): 
tuned-2.11.0-8.el7 or newer.

Can you confirm you see the problem with an updated setup ? 

> root=UUID=bc815b9b-25e7-4c22-9498-d1f84c446bcf ro rhgb quiet
> crashkernel=auto spectre_v2=retpoline console=ttyS0,115200
> default_hugepagesz=1G hugepagesz=1G hugepages=4 nohz=on nohz_full=1-4
> rcu_nocbs=1-4 tuned.non_isolcpus=00000001 intel_pstate=disable nosoftlockup
> LANG=en_US.UTF-8 skew_tick=1 isolcpus=1,2,3,4 intel_pstate=disable
> nosoftlockup nohz=on nohz_full=1,2,3,4 rcu_nocbs=1,2,3,4
> x86 debug opts:   retp_enabled=1 pti_enabled=1 ibrs_enabled=0 ibpb_enabled=1
> Machine:          localhost.localdomain
> CPU:              Intel(R) Xeon(R) CPU E5-2643 v3 @ 3.40GHz
> Results dir:      /root/results/cyclictest-results.xfavni
> 
> running stress
>    taskset -c 1 stress --cpu 1
>    taskset -c 2 stress --cpu 1
> 
> starting Wed Sep 11 19:59:17 EDT 2019
>    taskset -c 1,2 cyclictest -m -n -q -p95 -D 30m -h60 -i 200 -t 2 -a 1,2
> ended Wed Sep 11 20:29:17 EDT 2019
> 
> output dir is /root/results/cyclictest-results.xfavni
> 
> 
> # Min Latencies: 00006 00009
> # Avg Latencies: 00010 00010
> # Max Latencies: 00022 00021
> 
> ./run-cyclictest.sh: line 261:  1954 Terminated              $cmdline 2>&1 >
> $stress_out
> ./run-cyclictest.sh: line 261:  1955 Terminated              $cmdline 2>&1 >
> $stress_out
> [root@localhost rt-scripts]#

Comment 111 Jean-Tsung Hsiao 2019-09-17 13:43:31 UTC

(In reply to Luiz Capitulino from comment #109)
> Jean,
> 
> How is the max=22us impacting you?

Just curious about the Reproducer described in the description:

run cyclictest 3 times in a row: cyclictest -l 100000 -p 99 -t 6 -h 30 -m -n -a 2-7

I tried "-l 100000". The test took only 20 seconds on my Haswell test bed, and the spikes were well below 20 us.

So, is this a valid test?

All the tests that I have run are without the "-l" option.

> 
> Marcelo, can you take a look at this?

Comment 112 Marcelo Tosatti 2019-09-18 19:18:31 UTC

(In reply to Jean-Tsung Hsiao from comment #111)
> (In reply to Luiz Capitulino from comment #109)
> > Jean,
> > 
> > How is the max=22us impacting you?
> 
> Just curious about the Reproducer described in the description:
> 
> run cyclict test 3 times in a row: cyclictest -l 100000 -p 99 -t 6 -h 30 -m
> -n -a 2-7
> 
> I tried "-l 100000". The test duration took only 20 seconds on my Haswell
> test bed, and the spikes were well below 20 us.
> 
> So, is this a valid test?
> 
> All tests that I have run are without "-l" option.

Yes, this is a valid test.

Can you please reply to comment 110?

Comment 119 Beth Uptagrafft 2019-09-30 17:17:24 UTC

Jean-Tsung Hsiao, can you please reply to comment#110.  Thank you!

Comment 120 Pei Zhang 2019-10-14 03:01:22 UTC

I ran 24h cyclictest in the 3 standard KVM-RT testing scenarios. The max latency is 17 us, which is as expected.

==Results==
(1)Single VM with 1 rt vCPU:
# Min Latencies: 00005
# Avg Latencies: 00006
# Max Latencies: 00015

(2)Single VM with 8 rt vCPUs:
# Min Latencies: 00005 00007 00007 00007 00007 00007 00007 00007
# Avg Latencies: 00006 00007 00007 00007 00007 00007 00007 00007
# Max Latencies: 00015 00015 00015 00015 00015 00015 00016 00015

(3)Multiple VMs each with 1 rt vCPU:
- VM1
# Min Latencies: 00005
# Avg Latencies: 00006
# Max Latencies: 00015

- VM2
# Min Latencies: 00005
# Avg Latencies: 00006
# Max Latencies: 00015

- VM3
# Min Latencies: 00005
# Avg Latencies: 00006
# Max Latencies: 00017

- VM4
# Min Latencies: 00005
# Avg Latencies: 00006
# Max Latencies: 00015


==Versions==
tuned-2.11.0-8.el7.noarch
qemu-kvm-rhev-2.12.0-37.el7.x86_64
libvirt-4.5.0-27.el7.x86_64
kernel-rt-3.10.0-1101.rt56.1061.el7.x86_64


==Details of this testing==
- Host kernel line:
BOOT_IMAGE=/vmlinuz-3.10.0-1101.rt56.1061.el7.x86_64 root=/dev/mapper/rhel_dell--per430--09-root ro crashkernel=auto rd.lvm.lv=rhel_dell-per430-09/root rd.lvm.lv=rhel_dell-per430-09/swap console=ttyS0,115200n81 LANG=en_US.UTF-8 default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=1,3,5,7,9,11,13,15,17,19,12,14,16,18 intel_pstate=disable nosoftlockup nohz=on nohz_full=1,3,5,7,9,11,13,15,17,19,12,14,16,18 rcu_nocbs=1,3,5,7,9,11,13,15,17,19,12,14,16,18 kvm-intel.vmentry_l1d_flush=never spectre_v2=off nopti
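
The isolation parameters on a command line like the one above can be cross-checked with a short script. This is a minimal sketch that only handles plain CPU lists and ranges such as "1,3,5" or "1-4"; the function names are illustrative, and the command line below is an excerpt of the host line with unrelated parameters omitted.

```python
def expand_cpu_list(spec: str) -> set[int]:
    """Expand a kernel CPU list such as '1-4' or '1,3,5' into a set of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def isolation_sets(cmdline: str) -> dict[str, set[int]]:
    """Collect isolcpus, nohz_full and rcu_nocbs from a kernel command line.

    If a parameter is repeated, the last occurrence is kept (a simplification).
    """
    wanted = ("isolcpus", "nohz_full", "rcu_nocbs")
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in wanted:
            found[key] = expand_cpu_list(value)
    return found


# Excerpt of the host kernel line above.
host_cmdline = (
    "iommu=pt intel_iommu=on skew_tick=1 "
    "isolcpus=1,3,5,7,9,11,13,15,17,19,12,14,16,18 intel_pstate=disable "
    "nosoftlockup nohz=on nohz_full=1,3,5,7,9,11,13,15,17,19,12,14,16,18 "
    "rcu_nocbs=1,3,5,7,9,11,13,15,17,19,12,14,16,18"
)

sets = isolation_sets(host_cmdline)
all_equal = len({frozenset(cpus) for cpus in sets.values()}) == 1
print(f"isolated CPUs: {sorted(sets['isolcpus'])}, consistent: {all_equal}")
```

The check only confirms that the three parameters name the same CPUs; it says nothing about whether tuned or the guest pinning actually uses them.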


- Testing info of three test cases:
(1)Single VM with 1 rt vCPU:
Test started at:     2019-10-12 12:52:06 Saturday
Kernel cmdline:      BOOT_IMAGE=/vmlinuz-3.10.0-1101.rt56.1061.el7.x86_64 root=/dev/mapper/rhel_vm--74--14-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto rd.lvm.lv=rhel_vm-74-14/root rd.lvm.lv=rhel_vm-74-14/swap rhgb quiet LANG=en_US.UTF-8 default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1 kvm-intel.vmentry_l1d_flush=never spectre_v2=off nopti
X86 debug pts:       pti_enable=0 ibpb_enabled=1 ibrs_enabled=0 retp_enabled=0
Machine:             vm-74-14.lab.eng.pek2.redhat.com
CPU:                 Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
Test duration(plan): 24h
Test ended at:       2019-10-13 12:52:08 Sunday
cyclictest cmdline:  taskset -c 1 /home/nfv-virt-rt-kvm/tools/cyclictest -m -n -q -p95 -D 24h -h60 -t 1 -a 1 --notrace -i 200
cyclictest results:

# Min Latencies: 00005
# Avg Latencies: 00006
# Max Latencies: 00015


(2)Single VM with 8 rt vCPUs:
Test started at:     2019-10-12 12:54:32 Saturday
Kernel cmdline:      BOOT_IMAGE=/vmlinuz-3.10.0-1101.rt56.1061.el7.x86_64 root=/dev/mapper/rhel_vm--73--228-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto rd.lvm.lv=rhel_vm-73-228/root rd.lvm.lv=rhel_vm-73-228/swap rhgb quiet LANG=en_US.UTF-8 default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=1,2,3,4,5,6,7,8 intel_pstate=disable nosoftlockup nohz=on nohz_full=1,2,3,4,5,6,7,8 rcu_nocbs=1,2,3,4,5,6,7,8 kvm-intel.vmentry_l1d_flush=never spectre_v2=off nopti
X86 debug pts:       pti_enable=0 ibpb_enabled=1 ibrs_enabled=0 retp_enabled=0
Machine:             vm-73-228.lab.eng.pek2.redhat.com
CPU:                 Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
Test duration(plan): 24h
Test ended at:       2019-10-13 12:54:33 Sunday
cyclictest cmdline:  taskset -c 1,2,3,4,5,6,7,8 /home/nfv-virt-rt-kvm/tools/cyclictest -m -n -q -p95 -D 24h -h60 -t 8 -a 1,2,3,4,5,6,7,8 --notrace -i 200
cyclictest results:

# Min Latencies: 00005 00007 00007 00007 00007 00007 00007 00007
# Avg Latencies: 00006 00007 00007 00007 00007 00007 00007 00007
# Max Latencies: 00015 00015 00015 00015 00015 00015 00016 00015

(3)Multiple VMs each with 1 rt vCPU:
- VM1
Test started at:     2019-10-12 23:38:24 Saturday
Kernel cmdline:      BOOT_IMAGE=/vmlinuz-3.10.0-1101.rt56.1061.el7.x86_64 root=/dev/mapper/rhel_bootp--73--75--130-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto rd.lvm.lv=rhel_bootp-73-75-130/root rd.lvm.lv=rhel_bootp-73-75-130/swap rhgb quiet LANG=en_US.UTF-8 default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1 kvm-intel.vmentry_l1d_flush=never spectre_v2=off nopti
X86 debug pts:       pti_enable=0 ibpb_enabled=1 ibrs_enabled=0 retp_enabled=0
Machine:             bootp-73-75-130.lab.eng.pek2.redhat.com
CPU:                 Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
Test duration(plan): 24h
Test ended at:       2019-10-13 23:38:29 Sunday
cyclictest cmdline:  taskset -c 1 /home/nfv-virt-rt-kvm/tools/cyclictest -m -n -q -p95 -D 24h -h60 -t 1 -a 1 --notrace -i 200
cyclictest results:

# Min Latencies: 00005
# Avg Latencies: 00006
# Max Latencies: 00015


- VM2
Test started at:     2019-10-12 23:38:23 Saturday
Kernel cmdline:      BOOT_IMAGE=/vmlinuz-3.10.0-1101.rt56.1061.el7.x86_64 root=/dev/mapper/rhel_vm--74--190-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto rd.lvm.lv=rhel_vm-74-190/root rd.lvm.lv=rhel_vm-74-190/swap rhgb quiet LANG=en_US.UTF-8 default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1 kvm-intel.vmentry_l1d_flush=never spectre_v2=off nopti
X86 debug pts:       pti_enable=0 ibpb_enabled=1 ibrs_enabled=0 retp_enabled=0
Machine:             vm-74-190.lab.eng.pek2.redhat.com
CPU:                 Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
Test duration(plan): 24h
Test ended at:       2019-10-13 23:38:27 Sunday
cyclictest cmdline:  taskset -c 1 /home/nfv-virt-rt-kvm/tools/cyclictest -m -n -q -p95 -D 24h -h60 -t 1 -a 1 --notrace -i 200
cyclictest results:

# Min Latencies: 00005
# Avg Latencies: 00006
# Max Latencies: 00015


- VM3
Test started at:     2019-10-12 23:38:24 Saturday
Kernel cmdline:      BOOT_IMAGE=/vmlinuz-3.10.0-1101.rt56.1061.el7.x86_64 root=/dev/mapper/rhel_vm--74--203-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto rd.lvm.lv=rhel_vm-74-203/root rd.lvm.lv=rhel_vm-74-203/swap rhgb quiet LANG=en_US.UTF-8 default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1 kvm-intel.vmentry_l1d_flush=never spectre_v2=off nopti
X86 debug pts:       pti_enable=0 ibpb_enabled=1 ibrs_enabled=0 retp_enabled=0
Machine:             vm-74-203.lab.eng.pek2.redhat.com
CPU:                 Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
Test duration(plan): 24h
Test ended at:       2019-10-13 23:38:28 Sunday
cyclictest cmdline:  taskset -c 1 /home/nfv-virt-rt-kvm/tools/cyclictest -m -n -q -p95 -D 24h -h60 -t 1 -a 1 --notrace -i 200
cyclictest results:

# Min Latencies: 00005
# Avg Latencies: 00006
# Max Latencies: 00017


- VM4
Test started at:     2019-10-12 23:38:24 Saturday
Kernel cmdline:      BOOT_IMAGE=/vmlinuz-3.10.0-1101.rt56.1061.el7.x86_64 root=/dev/mapper/rhel_vm--74--198-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto rd.lvm.lv=rhel_vm-74-198/root rd.lvm.lv=rhel_vm-74-198/swap rhgb quiet LANG=en_US.UTF-8 default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1 kvm-intel.vmentry_l1d_flush=never spectre_v2=off nopti
X86 debug pts:       pti_enable=0 ibpb_enabled=1 ibrs_enabled=0 retp_enabled=0
Machine:             vm-74-198.lab.eng.pek2.redhat.com
CPU:                 Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
Test duration(plan): 24h
Test ended at:       2019-10-13 23:38:28 Sunday
cyclictest cmdline:  taskset -c 1 /home/nfv-virt-rt-kvm/tools/cyclictest -m -n -q -p95 -D 24h -h60 -t 1 -a 1 --notrace -i 200
cyclictest results:

# Min Latencies: 00005
# Avg Latencies: 00006
# Max Latencies: 00015


So this bug has been fixed. Moving to 'VERIFIED'.

Comment 122 errata-xmlrpc 2020-03-31 19:48:21 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:1070

Comment 123 Jean-Tsung Hsiao 2022-10-31 14:43:51 UTC

(In reply to Marcelo Tosatti from comment #110)
> (In reply to Jean-Tsung Hsiao from comment #107)
> > I made it work after updating kernel-rt-kvm to
> > kernel-rt-kvm-3.10.0-1063.rt56.1023.el7.x86_64, and ran "dracut -v -f", and
> > reboot.
> > 
> > But, 30m cycclitest has max latencies at 21 and 22 us.
> > 
> > Please below to see if I am missing something.
> > 
> > [root@localhost rt-scripts]# ./run-cyclictest.sh -d 30m -c 1,2 -k
> > 
> > Test started at Wed Sep 11 19:59:16 EDT 2019
> > 
> > Test duration:    30m
> > Run rteval:       n
> > Run stress:       y
> > Isolated CPUs:    1,2
> > Kernel:           3.10.0-1063.rt56.1023.el7.x86_64
> > Kernel cmd-line:  BOOT_IMAGE=/vmlinuz-3.10.0-1063.rt56.1023.el7.x86_64
> 
> This kernel contains the fix for the problem. Make sure you run it on both
> host and guest.
> 
> Also updated tuned package is necessary (on both host and guest): 
> tuned-2.11.0-8.el7 or newer.
> 
> Can you confirm you see the problem with an updated setup ? 
> 
> > root=UUID=bc815b9b-25e7-4c22-9498-d1f84c446bcf ro rhgb quiet
> > crashkernel=auto spectre_v2=retpoline console=ttyS0,115200
> > default_hugepagesz=1G hugepagesz=1G hugepages=4 nohz=on nohz_full=1-4
> > rcu_nocbs=1-4 tuned.non_isolcpus=00000001 intel_pstate=disable nosoftlockup
> > LANG=en_US.UTF-8 skew_tick=1 isolcpus=1,2,3,4 intel_pstate=disable
> > nosoftlockup nohz=on nohz_full=1,2,3,4 rcu_nocbs=1,2,3,4
> > x86 debug opts:   retp_enabled=1 pti_enabled=1 ibrs_enabled=0 ibpb_enabled=1
> > Machine:          localhost.localdomain
> > CPU:              Intel(R) Xeon(R) CPU E5-2643 v3 @ 3.40GHz
> > Results dir:      /root/results/cyclictest-results.xfavni
> > 
> > running stress
> >    taskset -c 1 stress --cpu 1
> >    taskset -c 2 stress --cpu 1
> > 
> > starting Wed Sep 11 19:59:17 EDT 2019
> >    taskset -c 1,2 cyclictest -m -n -q -p95 -D 30m -h60 -i 200 -t 2 -a 1,2
> > ended Wed Sep 11 20:29:17 EDT 2019
> > 
> > output dir is /root/results/cyclictest-results.xfavni
> > 
> > 
> > # Min Latencies: 00006 00009
> > # Avg Latencies: 00010 00010
> > # Max Latencies: 00022 00021
> > 
> > ./run-cyclictest.sh: line 261:  1954 Terminated              $cmdline 2>&1 >
> > $stress_out
> > ./run-cyclictest.sh: line 261:  1955 Terminated              $cmdline 2>&1 >
> > $stress_out
> > [root@localhost rt-scripts]#

The bug has been closed.