Bug 1690543

Summary: 8 vCPU guest needs max latency < 20 us with stress
Product: Red Hat Enterprise Linux 7
Component: kernel-rt (sub component: Other)
Version: 7.5
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: urgent
Keywords: ZStream
Target Milestone: rc
Reporter: jianzzha
Assignee: Marcelo Tosatti <mtosatti>
QA Contact: Pei Zhang <pezhang>
CC: bhu, broskos, chayang, cww, daolivei, derli, dhoward, eelena, fiezzi, hhuang, jhsiao, jinzhao, jlelli, juzhang, kabbott, lcapitulino, mtosatti, ngu, peterx, pezhang, pvaanane, snagar, sputhenp, virt-maint, williams
Fixed In Version: kernel-rt-3.10.0-1063.rt56.1023.el7
Doc Type: No Doc Update
Clones: 1754846 1754847 1757165 (view as bug list)
Last Closed: 2020-03-31 19:48:21 UTC
Type: Bug
Bug Depends On: 1550584, 1701509, 1723499, 1730016, 1732264, 1734096, 1942499
Bug Blocks: 1672377, 1715542, 1754846, 1754847
Description (jianzzha, 2019-03-19 16:27:30 UTC)
Jianzhu,

Would you please confirm that when you don't run stress-ng on CPU0 the test passes? Thanks.

(In reply to Luiz Capitulino from comment #3)
> Would you please confirm that when you don't run stress-ng on CPU0
> the test passes? Thanks.

CPU0 has stress-ng. That's China Mobile's standard test procedure, and all vendors need to follow it in the lab test.

(In reply to jianzzha from comment #4)
> CPU0 has stress-ng. That's China Mobile's standard test procedure, and all
> vendors need to follow it in the lab test.

I understand. But it's an important data point for us to know whether this only happens when there's stress on CPU0.

(In reply to Luiz Capitulino from comment #7)
> I understand. But it's an important data point for us to know whether this
> only happens when there's stress on CPU0.

I see what you mean. I saw Peter was able to reproduce it without OpenStack. That's good, as in OpenStack it is much harder to maneuver the system (a lot of the libvirt control is in Nova), but once we know how to improve it (kernel, settings, or whatever) we can get it into OpenStack.

(Obviously I didn't really notice that my previous comments were private... this one will be public.)

> I don't know how the tools differ, but I don't think we should use my script
> for this BZ. Let's do exactly what Jianzhu is doing.

Yes, I was using Jianzhu's command line when trying to reproduce. I used your script when I wanted to run a baseline test only.
> > > This is the nightly test result covering 16h:
> > >
> > > - idle housekeeping vcpus (vcpu 0-1)
> > > - run stress-ng on real-time vcpus only (vcpu 2-5):
> > >   taskset -c $i stress-ng --cpu 1 --cpu-load 70 --cpu-method loop
> > > - run cyclictest manually: cyclictest -p 99 -t 4 -h 30 -m -n -a 2-5 -q
> > >
> > > # Total:         058864224 058864220 058864217 058864213
> > > # Min Latencies: 00005 00005 00005 00005
> > > # Avg Latencies: 00007 00007 00008 00007
> > > # Max Latencies: 00028 00028 00028 00028
> > [1]
> > >
> > > So the spikes triggered again even without housekeeping vcpu workload.
> >
> > What's your CPU?
> >
> > This result could be a spike, but it could also be the baseline for your CPU.
> > Note that Jianzhu is able to achieve around 12us for most CPUs.

Ok, if so then I'm unsure whether the spikes I observed are the same as Jianzhu's... But let me re-summarize what I have now before reusing another host to test, because it seems I already saw some issue.

Firstly, my guest has 6 vcpus (2 housekeeping, 4 realtime).

(1) stress workload (close to 24 hours)

- keep vcpu 0-1 idle
- run "stress --cpu 1" on vcpu 2-5 in the background
- run "cyclictest -p 99 -t 4 -h 30 -m -n -a 2-5 -q"

# Total:         069851830 069851827 069851824 069851820
# Min Latencies: 00007 00007 00007 00007
# Avg Latencies: 00007 00007 00007 00007
# Max Latencies: 00017 00017 00017 00017

(2) stress-ng workload (reproduces even within 1 hour, not to say 24h)

- keep vcpu 0-1 idle
- run "stress-ng --cpu 1 --cpu-load 70 --cpu-method loop" on vcpu 2-5 in the background
- run "cyclictest -p 99 -t 4 -h 30 -m -n -a 2-5 -q"

# Total:         004082938 004082934 004082931 004082928
# Min Latencies: 00005 00005 00005 00005
# Avg Latencies: 00007 00007 00007 00008
# Max Latencies: 00027 00027 00027 00023

(It's basically the same as what I got above [1], and it is very easy to reproduce, say in every 1-hour test.)

So I really suspect the workload that we're using is affecting the test result of cyclictest.
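The difference between "stress --cpu 1" (a pure busy loop) and "stress-ng --cpu-load 70" can be illustrated with a sketch of the duty-cycle load control that partial-load stress generators use (a simplified model of ours, not stress-ng's actual code; the period length is an assumption):

```python
def load_cycle(cpu_load, period_s=0.1):
    """Split one control period into a busy part and an idle part.

    Partial-load generators busy-loop for cpu_load% of each period and
    then sleep the remainder (stress-ng uses select() for the idle
    part).  At cpu_load == 100 the idle branch is skipped entirely, so
    the worker never blocks and the vCPU never goes idle mid-run.
    """
    busy_s = period_s * cpu_load / 100.0
    idle_s = period_s - busy_s
    return busy_s, idle_s
```

At --cpu-load 70 every period therefore ends with the worker blocking and later waking up, which exercises the vCPU halt/wakeup path where latency spikes can hide; at 100% load that path is never taken.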
My host/guest kernel version: 3.10.0-957.rt56.910.el7.x86_64

Jianzhu, have you tried to run "stress --cpu 1" as the workload on your host?

(In reply to Peter Xu from comment #18)
[...]

Peter Xu,

You need to use CAT, which is available on Jianzhu's machine as far as I understand (instructions to set up CAT in comment #6).

I reproduced this issue with 3.10.0-862.20.2.rt56.823.el7.x86_64 and 3.10.0-862.rt56.804.el7.x86_64.
== In both host & guest

# cat /sys/kernel/debug/x86/ibpb_enabled
0
# cat /sys/kernel/debug/x86/pti_enabled
0
# cat /sys/kernel/debug/x86/ibrs_enabled
0
# cat /sys/kernel/debug/x86/retp_enabled
0

== In host

# cat /sys/module/kvm/parameters/halt_poll_ns
0

== stress-ng: add stress-ng to all cpus in guest

cpu_list="0 1 2 3 4 5 6 7"
for cpu in $cpu_list; do
    taskset -c $cpu stress-ng --cpu 1 --cpu-load 70 --cpu-method loop --timeout 24h &
    #taskset -c $cpu stress --cpu 1 &
done

== cyclictest

# cyclictest -l 100000 -p 99 -t 6 -h 30 -m -n -a 2-7

(Without "-D", the default running time is 100 seconds.)

I did 20 runs for each version; around 1-2 runs exceed 20us.

# Max Latencies: 00017 00027 00022 00016 00017 00019
# Max Latencies: 00017 00017 00030 00016 00017 00017

Besides, I checked the RHEL 7.5 history testing (https://mojo.redhat.com/docs/DOC-1146668); we switched the latency threshold from 20us to 40us after applying the Spectre & Meltdown fixes.

Hi,

(In reply to Peter Xu from comment #18)
[...]
> Firstly, my guest has 6 vcpus (2 housekeep, 4 realtime).
>
> (1) use stress workload (close to 24 hours)
>
> - keep vcpu 0-1 idle
> - run: "stress --cpu 1" on vcpu 2-5 in the background
> - run: "cyclictest -p 99 -t 4 -h 30 -m -n -a 2-5 -q"
[...]
> (2) use stress-ng workload (reproduce even within 1 hour, not to say 24H)
>
> - keep vcpu 0-1 idle
> - run: "stress-ng --cpu 1 --cpu-load 70 --cpu-method loop" on vcpu 2-5 in
>   the background
> - run: "cyclictest -p 99 -t 4 -h 30 -m -n -a 2-5 -q"

I believe there is a macroscopic difference between these two, which is the CPU busy percentage stress(-ng) is creating. AFAIK "stress-ng --cpu 1 --cpu-load 70" imposes a 70% busy factor, whereas "stress --cpu 1" just busy-loops (100%). Not sure this can make any difference, but maybe it's worth trying --cpu-load 100 with stress-ng and checking what happens?

(In reply to Peter Xu from comment #18)
> So I really suspect the workload that we're using is affecting the
> test result of cyclictest.
I also suspected this when we discussed the issue by email before opening this BZ. I think we have two options: Juri's suggestion from the previous comment, or you could try yourself to run "stress --cpu 1" on vcpu0 and vcpu1 to see if you still get good latencies.

Btw, we have to see how stress-ng calculates load; it may be doing some system call that's causing this...

Marcelo, do you agree with this plan? I.e., spend some more time understanding the differences Peter spotted before trying CAT?

Hi, everyone,

(In reply to Luiz Capitulino from comment #22)
[...]

Yeah, actually I discussed this with Hai off-list days ago but forgot to update here. There are at least two ways stress-ng can differ:

1. stress-ng uses 70% load rather than 100%
2. when the load is <100%, it does one select() per loop (see stress-cpu.c:stress_cpu() - the code path when "cpu_load==100" is different, and bypasses select())

And before I saw the suggestion from Marcelo and Juri, I had already started a 24h test with --cpu-load=100, so let me update that result first:

# Total:         069240667 069240663 069240659 069240656
# Min Latencies: 00007 00007 00007 00007
# Avg Latencies: 00007 00007 00007 00007
# Max Latencies: 00017 00018 00018 00019

I think it proves that either the CPU load or the select() syscall has at least some bad effect on latency (not to mention CAT so far).

I noticed that Pei has halt_poll_ns set to zero.
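For context, KVM's halt_poll_ns module parameter controls how long the host busy-polls before actually halting an idle vCPU; writing 0 disables the polling. A minimal sketch of flipping the knob (the sysfs path is the real one discussed in this bug; the helper name and the path parameter are ours, for illustration only):

```python
def set_halt_poll_ns(value, path="/sys/module/kvm/parameters/halt_poll_ns"):
    """Write a new halt_poll_ns value and return what the file now reports.

    Writing 0 disables KVM halt polling, so an idle vCPU halts
    immediately instead of busy-polling for up to halt_poll_ns first.
    """
    with open(path, "w") as f:
        f.write(str(value))
    with open(path) as f:
        return f.read().strip()
```

The default on the hosts discussed here is 200000 (200 us of polling); whether 0 or 200000 gives better cyclictest numbers is exactly what the thread is probing.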
Note that I didn't do that (since no one told me to, yet... :) and it's 200000, and it seems there is a difference between my setup and Pei's too (maybe because of this). Jianzhu, are you setting halt_poll_ns to 0 on your host?

And I really want to know whether my test result can be reproduced somewhere else, especially in Jianzhu's environment. Jianzhu, would you give it a shot?

(In reply to Marcelo Tosatti from comment #6)
> See man pqos for more details.
>
> Step 1) Define CLOSID's (each CLOSID will map a certain part of the L3
> cache).
>
> Here follows an example on a local machine (with only 4 COSID's):
>
> # pqos -e "llc:0=0x000f;llc:1=0x00f0;llc:2=0x0f00;llc:3=0xf000"
> NOTE:  Mixed use of MSR and kernel interfaces to manage
>        CAT or CMT & MBM may lead to unexpected behavior.
> SOCKET 0 L3CA COS0 => MASK 0xf
> SOCKET 1 L3CA COS0 => MASK 0xf
> SOCKET 0 L3CA COS1 => MASK 0xf0
> SOCKET 1 L3CA COS1 => MASK 0xf0
> SOCKET 0 L3CA COS2 => MASK 0xf00
> SOCKET 1 L3CA COS2 => MASK 0xf00
> SOCKET 0 L3CA COS3 => MASK 0xf000
> SOCKET 1 L3CA COS3 => MASK 0xf000
> Allocation configuration altered.
>
> In your case, there are more COSID's available: you can create 9 CLOSID's,
> each of them with a part of the L3 cache (so that there is no overlap in
> the bits between CLOSID's), as follows:
>
> CLOSID0 = 0xFFFFF
> CLOSID1 = 0x3
> CLOSID2 = 0xC
> CLOSID3 = 0x30
> CLOSID4 = 0xC0
> CLOSID5 = 0x300
> CLOSID6 = 0xC00
> CLOSID7 = 0x3000
> CLOSID8 = 0xC000
>
> So the command line would be
>
> pqos -e "llc:0=0xFFFFF;llc=1:0x3;llc=2:0xC;llc=3: ..."
>
> Step 2) Map CLOSID's to cores.
>
> -a CLASS2CORE, --alloc-assoc=CLASS2CORE
>        associate allocation classes with cores. CLASS2CORE format is
>        "TYPE:ID=CORE_LIST;...".
>        For CAT, TYPE is "llc" and ID is a class number. CORE_LIST is
>        a comma- or dash-separated list of cores.
>        For example "-a llc:0=0,2,4,6-10;llc:1=1;" associates cores 0,
>        2, 4, 6, 7, 8, 9, 10 with CAT class 0 and core 1 with
>        class 1.
> so that would be
>
> # pqos -a llc:0=host_cpus,llc:1=pcpu_of_vcpu1,llc:2=pcpu_of_vcpu2,..."

What is host_cpus set to here? Say, in my system:

NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23

[root@compute-0 ~]# virsh emulatorpin instance-00000001
emulator: CPU Affinity
----------------------------------
       *: 20

[root@compute-0 ~]# virsh vcpupin instance-00000001
 VCPU: CPU Affinity
----------------------------------
    0: 4
    1: 6
    2: 8
    3: 10
    4: 12
    5: 14
    6: 16
    7: 18

(In reply to Peter Xu from comment #23)
> Jianzhu, are you setting halt_poll_ns to 0 on your host?

I have the default 200000.

> And I really want to know whether my test result can reproduce somewhere
> else, especially Jianzhu's environment. Jianzhu, would you give it a shot?

I have OpenStack set up again on this system and can try the 100% load.

Why did you use 6 CPUs though? The requirement was to have an 8-vCPU VM with 0-1 for housekeeping.

(In reply to jianzzha from comment #25)
> Why did you use 6 CPU though? the requirement was to have 8 vcpu VM and 0-1
> for housekeeping

I thought it should not really matter that much, so I used a random number of vcpus. If you think it matters, I can simply re-run those tests with exactly your CPU allocation.

But since you said "0-1 for housekeeping" while in comment 0 you were using 2 housekeeping vcpus - how many housekeeping vcpus are you using in fact? I'll use exactly those vcpus in my future tests.
Thanks,

(In reply to Peter Xu from comment #26)
> But since you said "0-1 for housekeeping" but instead you were using 2
> housekeeping vcpus in comment 0 - how many housekeeping vcpus are you using
> in fact? I'll use exactly those vcpus in my future tests.

Ah, I didn't make it clear: vcpu 0-1 are for housekeeping, 2-7 for cyclictest; when running cyclictest, 0-7 have stress.

I just tried 100% load on all vcpus and I can still see the >20us in just 2 runs. So it doesn't really make a difference for me. I guess we will first need to agree on a gold image version to use. On the OpenStack setup I have, on both host and guest level: 3.10.0-862.14.4.rt56.821.el7.x86_64

(In reply to Peter Xu from comment #26)
> I thought it should not really matter that much so I used a random number of
> vcpus. If you think that matters I can simply re-run those tests with
> exactly your cpu allocation.

The number of vCPUs absolutely makes a big difference: in the OSP test I noticed an 8-vCPU guest has much higher latency than a 2-vCPU guest. Did you guys not see this in the RT test?

(In reply to jianzzha from comment #28)
> The number of vCPU absolutely make big difference, in OSP test I noticed a
> 8-vCPU guest has much higher latency than 2-vCPU guest. Did you guys not see
> this in the RT test?

Yeah, I think there should be a difference at least for 2 vs 8. I used 6 only for an initial setup; I'll change to yours.

(In reply to Luiz Capitulino from comment #22)
> (In reply to Peter Xu from comment #18)
> > So I really suspect the workload that we're using is affecting the
> > test result of cyclictest.
>
> I also suspected this when we discussed this issue by email before opening
> this BZ.
> I think we have two options: Juri's suggestion from the previous comment,
> or you could try yourself to run "stress --cpu 1" in vcpu0 and vcpu1 to see
> if you still get good latencies.
>
> Btw, we have to see how stress-ng calculates load, it may be doing some
> system call that's causing this...
>
> Marcelo, do you agree with this plan? I.e., spend some more time understanding
> the differences Peter spotted before trying CAT?

I agree it's good to attempt to get low latencies before trying CAT, and to use CAT only if absolutely necessary.

Ok, here come the initial results I got after I switched to 8 vcpus and updated the kernel. I'm trying to summarize things a bit since the thread is already getting long, so we may avoid doing page up and down.

- basic environment
  - host/guest kernel: 3.10.0-862.14.4.rt56.821.el7.x86_64 (Jianzhu's version)
  - spectre/meltdown mitigations: all off
- host configuration (16 cpus)
  - host node 0: 0,2,4,6,8,10,12,14
  - host node 1: 1,3,5,7,9,11,13,15
- guest pinning (8 vcpus)
  - guest emulator: using CPU 2,4
  - guest housekeeping: using CPU 6,8
  - guest real-time: using CPU 5,7,9,11,13,15

|-------+--------------------------+----------+-------------|
| index | environment              | duration | max latency |
|-------+--------------------------+----------+-------------|
| 1     | 6 vcpus, 100% rtcpu-only | 24h      | 17us        |
| 2     | 8 vcpus, 100% rtcpu-only | 2h       | 16us        |
| 3     | 8 vcpus, 100% allcpu     | 30m      | 16us        |
| 4     | 8 vcpus, 70% allcpu      | 1h       | 28us        |
| 5     | 8 vcpus, 70% rtcpu-only  | 20m      | 24us        |
|-------+--------------------------+----------+-------------|

- 100%/70% means the cpu workload of "--cpu-load".
- rtcpu-only means only adding cpu load to the rt cpus, so the housekeeping cpus are idle; allcpu means adding load to all cpus.

Entry 1 is the 6-vcpu one, for reference. Entries 2-5 are new ones with 8 vcpus.

Conclusions:

1. Compare 1 with 2: vcpu number (6 or 8) seems to make no difference
2.
Entry 4: this is the initial state of comment 0 when the bug was reported, so the spike reproduced
3. Compare 3 with 4: again, this verified that the workload matters somehow
4. Compare 4 with 5: it shows that the housekeeping cpu workload does not matter much, because both of these can generate spikes

Jianzhu, from comment 27 you said you can still see spikes even in the case of entry 3. That does not match my test (I ran 30min, even longer than your 1.5min*3). Before we move on to the final goal (70% load on all cpus), could you help me make sure we can at least have the same data matched with entry 3 (run 100% cpu load on all cpus)? That should help us make sure we have the same baseline before we dig into the 70% issue, IMHO.

Please check your environment setup, BIOS, everything. If you still cannot reproduce, maybe we can let Pei run entry 3 again to see what Pei can get with it, to make sure I didn't mess up anything.

(In reply to jianzzha from comment #24)
> (In reply to Marcelo Tosatti from comment #6)
[...]
> what is host_cpus set to here? Say, in my system,
>
> NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22
> NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23
>
> [root@compute-0 ~]# virsh emulatorpin instance-00000001
> emulator: CPU Affinity
> ----------------------------------
>        *: 20
>
> [root@compute-0 ~]# virsh vcpupin instance-00000001
>  VCPU: CPU Affinity
> ----------------------------------
>     0: 4
>     1: 6
>     2: 8
>     3: 10
>     4: 12
>     5: 14
>     6: 16
>     7: 18
So it would be

pqos -a "llc:0=1,3,5,7,9,11,13,15,17,19,20,21,22,23;llc:1=4;llc:2=6;llc:3=8;llc:4=10;llc:5=12;llc:6=14;llc:7=16;llc:8=18"

(Check that COSID 0 has the non-realtime CPUs, and that each realtime CPU is assigned one COSID between 1 and 8.) The COSIDs are the values allocated with "pqos -e ...".

(In reply to Peter Xu from comment #35)
> Ok here comes the initial results I got after I switch to 8 vcpus and update
> the kernel.

Good report, thanks.

> I'm trying to summarize stuff up a bit since the thread is already getting
> long, so we may avoid doing page up and down.
[...]

Note how stress-ng handles the "cpu-load = 100" case and the "cpu-load < 100" cases. Perhaps the reason for this spike is there.

Can you increase cpu-load to, say, 90% and 99%? (Only the rtcpu-only case is sufficient.)

Also, can you please show the histogram output from cyclictest, to know the frequency of the >20us events?

> Entry 1 is the one of 6 vcpus to reference.
> Entries 2-5 are new ones with 8 vcpus.
[...]

(In reply to Marcelo Tosatti from comment #40)
> Note how stress-ng handles the "cpu-load = 100" case and
> the "cpu-load < 100" cases. Perhaps the reason for this spike
> is there.

You mentioned select() earlier, but also gettimeofday() is called very often.

> Can you increase cpu-load, to say 90% and 99% ?
> (only the rtcpu-only case is sufficient).

Sure.

> Also, can you please show the histogram output from cyclictest, to
> know the frequency of the >20us events.

> Another thing is now that you have another process running
> on the RT CPUs, it's good to have a SCHED_FIFO priority for cyclictest:
>
> Replace -p 99 with --policy fifo -p 1
>
> The next step, if SCHED_FIFO priority fails, would be tracing
> to find out where the extra latency comes from.

I suspect I've already been using FIFO. Cyclictest should use FIFO by default, IIUC, if "-p" is specified, see:

    case OPT_PRIORITY:
        priority = atoi(optarg);
        if (policy != SCHED_FIFO && policy != SCHED_RR)
            policy = SCHED_FIFO;
        break;

And since I didn't specify a policy, it should have been using FIFO already in all my previous test results.

Thanks,

Following up with the 90%/99% workload; this time I'm appending the histograms:

|-------+--------------------------+----------+-------------|
| index | environment              | duration | max latency |
|-------+--------------------------+----------+-------------|
| 6     | 8 vcpus, 90% rtcpu-only  | 1h       | 31us        |
| 7     | 8 vcpus, 99% rtcpu-only  | 2h       | 19us        |
|-------+--------------------------+----------+-------------|

90% cpuload on rt cpus only, ~1H:

# Histogram
000000 000000 000000 000000 000000 000000 000000
000001 000000 000000 000000 000000 000000 000000
000002 000000 000000 000000 000000 000000 000000
000003 000000 000000 000000 000000 000000 000000
000004 000000 000000 000000 000000 000000 000000
000005 000646 000611 000207 000405 000147 000326
000006 005419 004935 001974 000649 001961 003515
000007 2997397 3022392 2954316 1759617 2132180 2979399
000008 017984 016639 029222 1220846 876268 033792
000009 008019 007717 013977 004194 008258 007836
000010 266250 254322 273506 004397 281210 273484
000011 003135 002398 004394 279580 003203 003347
000012 002029 001859 003232 017097 002262 002264
000013 002046 001866 003283 005986 002000 001992
000014 058676 047713 068814 059527 053017 055798
000015 006405 006605 012614 000333 007108 006684
000016 000876 000799 001005 000001 001088 000527
000017 000788 001094 002586 000678 001400 001336
000018 000508 000772 000192 000020 000031 000019
000019 000197 000320 000341 000212 000113 000086
000020 000064 000071 000206 010446 000114 000099
000021 000213 000464 000707 001866 000368 000339
000022 000209 000280 000260 000167 000158 000039
000023 000038 000037 000054 000409 000000 000000
000024 000000 000001 000001 002601 000000 000000
000025 000000 000000 000000 000401 000000 000000
000026 000000 000000 000000 001052 000000 000000
000027 000000 000000 000000 000299 000000 000000
000028 000000 000000 000000 000100 000000 000000
000029 000000 000000 000000 000003 000000 000000
# Total: 003370899 003370895 003370891 003370886 003370886 003370882
# Min Latencies: 00005 00005 00005 00005 00005 00005
# Avg Latencies: 00007 00007 00007 00007 00007 00007
# Max Latencies: 00023 00024 00024 00031 00022 00022
# Histogram Overflows: 00000 00000 00000 00002 00000 00000
# Histogram Overflow at cycle number:
# Thread 0:
# Thread 1:
# Thread 2:
# Thread 3: 44166 3078688
# Thread 4:
# Thread 5:

99% cpuload on rt cpus only, ~2H:

# Histogram
000000 000000 000000 000000 000000 000000 000000
000001 000000 000000 000000 000000 000000 000000
000002 000000 000000 000000 000000 000000 000000
000003 000000 000000 000000 000000 000000 000000
000004 000000 000000 000000 000000 000000 000000
000005 000699 000691 000430 000588 000491 000610
000006 006501 008036 007542 006682 009096 007566
000007 7342617 7341466 7342801 7344175 7345626 7339440
000008 001742 001145 001317 001229 001438 001496
000009 002614 002036 001939 002260 001493 001818
000010 045624 046621 045936 045040 041680 048736
000011 000611 000447 000470 000467 000601 000760
000012 000015 000010 000010 000007 000012 000013
000013 000011 000006 000007 000004 000005 000005
000014 000008 000005 000006 000006 000007 000006
000015 000036 000025 000026 000030 000015 000018
000016 000032 000021 000022 000016 000036 000028
000017 000002 000001 000001 000000 000000 000001
000018 000000 000000 000000 000000 000000 000000
000019 000001 000000 000000 000000 000000 000000
000020 000000 000000 000000 000000 000000 000000
000021 000000 000000 000000 000000 000000 000000
000022 000000 000000 000000 000000 000000 000000
000023 000000 000000 000000 000000 000000 000000
000024 000000 000000 000000 000000 000000 000000
000025 000000 000000 000000 000000 000000 000000
000026 000000 000000 000000 000000 000000 000000
000027 000000 000000 000000 000000 000000 000000
000028 000000 000000 000000 000000 000000 000000
000029 000000 000000 000000 000000 000000 000000
# Total: 007400513 007400510 007400507 007400504 007400500 007400497
# Min Latencies: 00005 00005 00005 00005 00005 00005
# Avg Latencies: 00007 00007 00007 00007 00007 00007
# Max Latencies: 00019 00017 00017 00016 00016 00017
# Histogram Overflows: 00000 00000 00000 00000 00000 00000
# Histogram Overflow at cycle number:
# Thread 0:
# Thread 1:
# Thread 2:
# Thread 3:
# Thread 4:
# Thread 5:

(In reply to Peter Xu from comment #35)
> Ok here comes the initial results I got after I switch to 8 vcpus and update
> the kernel.
>
> I'm trying to summarize stuff up a bit since the thread is already getting
> long, so we may avoid doing page up and down.
> > - basic environment > - host/guest kernel: 3.10.0-862.14.4.rt56.821.el7.x86_64 (Jianzhu's > version) > - spectre/meltdown mitigations: all off > - host configuration (16 cpus) > - host node 0: 0,2,4,6,8,10,12,14 > - host node 1: 1,3,5,7,9,11,13,15 > - guest pinning (8 vcpus) > - guest emulator: using CPU 2,4 > - guest housekeeping: using CPU 6,8 > - guest real-time: using CPU 5,7,9,11,13,15 > > |-------+--------------------------+----------+-------------| > | index | environment | duration | max latency | > |-------+--------------------------+----------+-------------| > | 1 | 6 vcpus, 100% rtcpu-only | 24h | 17us | > | 2 | 8 vcpus, 100% rtcpu-only | 2h | 16us | > | 3 | 8 vcpus, 100% allcpu | 30m | 16us | > | 4 | 8 vcpus, 70% allcpu | 1h | 28us | > | 5 | 8 vcpus, 70% rtcpu-only | 20m | 24us | > |-------+--------------------------+----------+-------------| > > - 100%/70% means the cpu workload of "--cpu-load". > - rtcpu-only means only adding cpu load to rt cpus, so housekeeping > cpus are idle; while allcpu means adding load to all cpus > > Entry 1 is the one of 6 vcpus to reference. Entries 2-5 are new ones with 8 > vcpus. > > Conclusions: > > 1. Compare 1 with 2: vcpu number (6 or 8) seems to make no difference > 2. Entry 4: this is the initial state of comment 0 when bug reported, so > spike reproduced > 3. Compare 3 with 4: again it verified that the workload should matter > something > 4. Compare 4 with 5: it should somehow show that the housekeeping cpu > workload does not matter much because all these two can generate spikes > > Jianzhu, from comment 27 you said you can still see spikes even with the > case of entry 3. It does not match with my test (I ran 30min, even longer > than yours 1.5min*3). Before we move on to the final goal (70% load on all > cpus), could you help me to make sure we can at least have the same data > matched with entry 3 (run 100% cpu load on all cpus)? 
That should help us > to make sure we have the same baseline before we dig into the 70% issue IMHO. > > Please check your environment setup, BIOS, everything. If you still cannot > reproduce, maybe we can let Pei to run entry 3 again to see what Pei can get > with it to make sure I didn't mess up anything. I tried 3), still no luck. one out of 3 runs, # Max Latencies: 00013 00013 00022 00018 00019 00024 the domainxml generated from the OSP is: [root@compute-0 ~]# virsh dumpxml instance-00000002 <domain type='kvm' id='1'> <name>instance-00000002</name> <uuid>5c0f7592-cc0c-4042-b46e-a0b1429d6310</uuid> <metadata> <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0"> <nova:package version="0.0.1-3.d7864fbgit.el7ost"/> <nova:name>demo1</nova:name> <nova:creationTime>2019-04-01 10:40:16</nova:creationTime> <nova:flavor name="nfv"> <nova:memory>8192</nova:memory> <nova:disk>60</nova:disk> <nova:swap>0</nova:swap> <nova:ephemeral>0</nova:ephemeral> <nova:vcpus>8</nova:vcpus> </nova:flavor> <nova:owner> <nova:user uuid="cc878ab426ea4b8fb5e21504409c7935">admin</nova:user> <nova:project uuid="458aa608e64f4c09b3d13dec9ae4a6a3">admin</nova:project> </nova:owner> <nova:root type="image" uuid="53a8a18b-bc55-4195-b6b9-872e490eec2c"/> </nova:instance> </metadata> <memory unit='KiB'>8388608</memory> <currentMemory unit='KiB'>8388608</currentMemory> <memoryBacking> <hugepages> <page size='1048576' unit='KiB' nodeset='0'/> </hugepages> <nosharepages/> <locked/> </memoryBacking> <vcpu placement='static'>8</vcpu> <cputune> <shares>8192</shares> <vcpupin vcpu='0' cpuset='4'/> <vcpupin vcpu='1' cpuset='6'/> <vcpupin vcpu='2' cpuset='8'/> <vcpupin vcpu='3' cpuset='10'/> <vcpupin vcpu='4' cpuset='12'/> <vcpupin vcpu='5' cpuset='14'/> <vcpupin vcpu='6' cpuset='16'/> <vcpupin vcpu='7' cpuset='18'/> <emulatorpin cpuset='20'/> <vcpusched vcpus='0' scheduler='fifo' priority='1'/> <vcpusched vcpus='1' scheduler='fifo' priority='1'/> <vcpusched vcpus='2' scheduler='fifo' 
priority='1'/> <vcpusched vcpus='3' scheduler='fifo' priority='1'/> <vcpusched vcpus='4' scheduler='fifo' priority='1'/> <vcpusched vcpus='5' scheduler='fifo' priority='1'/> <vcpusched vcpus='6' scheduler='fifo' priority='1'/> <vcpusched vcpus='7' scheduler='fifo' priority='1'/> </cputune> <numatune> <memory mode='strict' nodeset='0'/> <memnode cellid='0' mode='strict' nodeset='0'/> </numatune> <resource> <partition>/machine</partition> </resource> <sysinfo type='smbios'> <system> <entry name='manufacturer'>Red Hat</entry> <entry name='product'>OpenStack Compute</entry> <entry name='version'>0.0.1-3.d7864fbgit.el7ost</entry> <entry name='serial'>4c4c4544-0052-4e10-8044-b8c04f394e32</entry> <entry name='uuid'>5c0f7592-cc0c-4042-b46e-a0b1429d6310</entry> <entry name='family'>Virtual Machine</entry> </system> </sysinfo> <os> <type arch='x86_64' machine='pc-i440fx-rhel7.5.0'>hvm</type> <boot dev='hd'/> <smbios mode='sysinfo'/> </os> <features> <acpi/> <apic/> <pmu state='off'/> </features> <cpu mode='host-passthrough' check='none'> <topology sockets='8' cores='1' threads='1'/> <feature policy='require' name='tsc-deadline'/> <numa> <cell id='0' cpus='0-7' memory='8388608' unit='KiB' memAccess='shared'/> </numa> </cpu> <clock offset='utc'> <timer name='pit' tickpolicy='delay'/> <timer name='rtc' tickpolicy='catchup'/> <timer name='hpet' present='no'/> </clock> <on_poweroff>destroy</on_poweroff> <on_reboot>restart</on_reboot> <on_crash>destroy</on_crash> <devices> <emulator>/usr/libexec/qemu-kvm</emulator> <disk type='file' device='disk'> <driver name='qemu' type='qcow2' cache='none'/> <source file='/var/lib/nova/instances/5c0f7592-cc0c-4042-b46e-a0b1429d6310/disk'/> <backingStore type='file' index='1'> <format type='raw'/> <source file='/var/lib/nova/instances/_base/63952bd3e89784b90636bbcb855f4c62b039e2fe'/> <backingStore/> </backingStore> <target dev='vda' bus='virtio'/> <alias name='virtio-disk0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x04' 
function='0x0'/> </disk> <controller type='usb' index='0' model='piix3-uhci'> <alias name='usb'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/> </controller> <controller type='pci' index='0' model='pci-root'> <alias name='pci.0'/> </controller> <interface type='bridge'> <mac address='fa:16:3e:15:c5:3f'/> <source bridge='qbr633e4bde-e1'/> <target dev='tap633e4bde-e1'/> <model type='virtio'/> <mtu size='9000'/> <alias name='net0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> </interface> <serial type='pty'> <source path='/dev/pts/0'/> <log file='/var/lib/nova/instances/5c0f7592-cc0c-4042-b46e-a0b1429d6310/console.log' append='off'/> <target type='isa-serial' port='0'> <model name='isa-serial'/> </target> <alias name='serial0'/> </serial> <console type='pty' tty='/dev/pts/0'> <source path='/dev/pts/0'/> <log file='/var/lib/nova/instances/5c0f7592-cc0c-4042-b46e-a0b1429d6310/console.log' append='off'/> <target type='serial' port='0'/> <alias name='serial0'/> </console> <input type='tablet' bus='usb'> <alias name='input0'/> <address type='usb' bus='0' port='1'/> </input> <input type='mouse' bus='ps2'> <alias name='input1'/> </input> <input type='keyboard' bus='ps2'> <alias name='input2'/> </input> <graphics type='vnc' port='5900' autoport='yes' listen='172.22.33.21' keymap='en-us'> <listen type='address' address='172.22.33.21'/> </graphics> <video> <model type='cirrus' vram='16384' heads='1' primary='yes'/> <alias name='video0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> </video> <memballoon model='virtio'> <stats period='10'/> <alias name='balloon0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/> </memballoon> </devices> <seclabel type='dynamic' model='selinux' relabel='yes'> <label>system_u:system_r:svirt_t:s0:c438,c944</label> <imagelabel>system_u:object_r:svirt_image_t:s0:c438,c944</imagelabel> </seclabel> <seclabel type='dynamic' model='dac' 
relabel='yes'> <label>+107:+107</label> <imagelabel>+107:+107</imagelabel> </seclabel> </domain> [root@compute-0 ~]# cat /proc/cmdline BOOT_IMAGE=/boot/vmlinuz-3.10.0-862.14.4.rt56.821.el7.x86_64 root=UUID=7aa9d695-b9c7-416f-baf7-7e8f89c1a3bc ro console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet default_hugepagesz=1GB hugepagesz=1G hugepages=32 iommu=pt intel_iommu=on skew_tick=1 isolcpus=4,6,8,10,12,14,16,18,20,22 intel_pstate=disable nosoftlockup nohz=on nohz_full=4,6,8,10,12,14,16,18,20,22 rcu_nocbs=4,6,8,10,12,14,16,18,20,22 spectre_v2=off nopti kvm-intel.vmentry_l1d_flush=never [root@compute-0 ~]# uname -r 3.10.0-862.14.4.rt56.821.el7.x86_64 in the guest: [root@host-10-1-1-4 ~]# cat /proc/cmdline BOOT_IMAGE=/boot/vmlinuz-3.10.0-862.14.4.rt56.821.el7.x86_64 root=UUID=6bea2b7b-e6cc-4dba-ac79-be6530d348f5 ro console=tty0 console=ttyS0,115200n8 no_timer_check net.ifnames=0 crashkernel=auto LANG=en_US.UTF-8 skew_tick=1 isolcpus=2-7 intel_pstate=disable nosoftlockup nohz=on nohz_full=2-7 rcu_nocbs=2-7 spectre_v2=off nopti default_hugepagesz=1GB hugepagesz=1G hugepages=1 [root@host-10-1-1-4 ~]# uname -r 3.10.0-862.14.4.rt56.821.el7.x86_64 (In reply to Peter Xu from comment #43) > Follow up with 90%/99% workload, this time I'm appending the histogram: > > |-------+--------------------------+----------+-------------| > | index | environment | duration | max latency | > |-------+--------------------------+----------+-------------| > | 6 | 8 vcpus, 90% rtcpu-only | 1h | 31us | > | 7 | 8 vcpus, 99% rtcpu-only | 2h | 19us | > |-------+--------------------------+----------+-------------| > > 90% cpuload on rtcpu only, ~1H: > > # Histogram > 000000 000000 000000 000000 000000 000000 000000 > 000001 000000 000000 000000 000000 000000 000000 > 000002 000000 000000 000000 000000 000000 000000 > 000003 000000 000000 000000 000000 000000 000000 > 000004 000000 000000 000000 000000 000000 000000 > 000005 000646 000611 000207 000405 000147 000326 > 000006 005419 004935 
001974 000649 001961 003515 > 000007 2997397 3022392 2954316 1759617 2132180 2979399 > 000008 017984 016639 029222 1220846 876268 033792 > 000009 008019 007717 013977 004194 008258 007836 > 000010 266250 254322 273506 004397 281210 273484 > 000011 003135 002398 004394 279580 003203 003347 > 000012 002029 001859 003232 017097 002262 002264 > 000013 002046 001866 003283 005986 002000 001992 > 000014 058676 047713 068814 059527 053017 055798 > 000015 006405 006605 012614 000333 007108 006684 > 000016 000876 000799 001005 000001 001088 000527 > 000017 000788 001094 002586 000678 001400 001336 > 000018 000508 000772 000192 000020 000031 000019 > 000019 000197 000320 000341 000212 000113 000086 > 000020 000064 000071 000206 010446 000114 000099 > 000021 000213 000464 000707 001866 000368 000339 > 000022 000209 000280 000260 000167 000158 000039 > 000023 000038 000037 000054 000409 000000 000000 > 000024 000000 000001 000001 002601 000000 000000 > 000025 000000 000000 000000 000401 000000 000000 > 000026 000000 000000 000000 001052 000000 000000 > 000027 000000 000000 000000 000299 000000 000000 > 000028 000000 000000 000000 000100 000000 000000 > 000029 000000 000000 000000 000003 000000 000000 > # Total: 003370899 003370895 003370891 003370886 003370886 003370882 > # Min Latencies: 00005 00005 00005 00005 00005 00005 > # Avg Latencies: 00007 00007 00007 00007 00007 00007 > # Max Latencies: 00023 00024 00024 00031 00022 00022 > # Histogram Overflows: 00000 00000 00000 00002 00000 00000 > # Histogram Overflow at cycle number: > # Thread 0: > # Thread 1: > # Thread 2: > # Thread 3: 44166 3078688 > # Thread 4: > # Thread 5: > > > 99% cpuload on rtcpu only, ~2H: > > # Histogram > 000000 000000 000000 000000 000000 000000 000000 > 000001 000000 000000 000000 000000 000000 000000 > 000002 000000 000000 000000 000000 000000 000000 > 000003 000000 000000 000000 000000 000000 000000 > 000004 000000 000000 000000 000000 000000 000000 > 000005 000699 000691 000430 000588 000491 
000610 > 000006 006501 008036 007542 006682 009096 007566 > 000007 7342617 7341466 7342801 7344175 7345626 7339440 > 000008 001742 001145 001317 001229 001438 001496 > 000009 002614 002036 001939 002260 001493 001818 > 000010 045624 046621 045936 045040 041680 048736 > 000011 000611 000447 000470 000467 000601 000760 > 000012 000015 000010 000010 000007 000012 000013 > 000013 000011 000006 000007 000004 000005 000005 > 000014 000008 000005 000006 000006 000007 000006 > 000015 000036 000025 000026 000030 000015 000018 > 000016 000032 000021 000022 000016 000036 000028 > 000017 000002 000001 000001 000000 000000 000001 > 000018 000000 000000 000000 000000 000000 000000 > 000019 000001 000000 000000 000000 000000 000000 > 000020 000000 000000 000000 000000 000000 000000 > 000021 000000 000000 000000 000000 000000 000000 > 000022 000000 000000 000000 000000 000000 000000 > 000023 000000 000000 000000 000000 000000 000000 > 000024 000000 000000 000000 000000 000000 000000 > 000025 000000 000000 000000 000000 000000 000000 > 000026 000000 000000 000000 000000 000000 000000 > 000027 000000 000000 000000 000000 000000 000000 > 000028 000000 000000 000000 000000 000000 000000 > 000029 000000 000000 000000 000000 000000 000000 > # Total: 007400513 007400510 007400507 007400504 007400500 007400497 > # Min Latencies: 00005 00005 00005 00005 00005 00005 > # Avg Latencies: 00007 00007 00007 00007 00007 00007 > # Max Latencies: 00019 00017 00017 00016 00016 00017 > # Histogram Overflows: 00000 00000 00000 00000 00000 00000 > # Histogram Overflow at cycle number: > # Thread 0: > # Thread 1: > # Thread 2: > # Thread 3: > # Thread 4: > # Thread 5: One theory would be that what is happening is: 1) stress-ng runs, dirties cache. 2) goes to sleep. 3) cyclictest wakes up CPU, finds cache dirty from stress-ng run, and the cache misses incur the additional us being seen. (In reply to jianzzha from comment #44) > I tried 3), still no luck. 
> one out of 3 runs, > # Max Latencies: 00013 00013 00022 00018 00019 00024 ... > <vcpupin vcpu='0' cpuset='4'/> > <vcpupin vcpu='1' cpuset='6'/> > <vcpupin vcpu='2' cpuset='8'/> > <vcpupin vcpu='3' cpuset='10'/> > <vcpupin vcpu='4' cpuset='12'/> > <vcpupin vcpu='5' cpuset='14'/> > <vcpupin vcpu='6' cpuset='16'/> > <vcpupin vcpu='7' cpuset='18'/> > <emulatorpin cpuset='20'/> Jianzhu, I noticed that you only isolated some cpus of node 0 but not node 1. I'm not sure whether this will matter but... is it possible to isolate the node 1 instead? I was trying to only run rt workload on node 1 and all the rest (housekeeping of host and guest, and also the emulator codes) on node 0: - host configuration (16 cpus) - host node 0: 0,2,4,6,8,10,12,14 - host node 1: 1,3,5,7,9,11,13,15 - guest pinning (8 vcpus) - guest emulator: using CPU 2,4 <-------------- this is node 0 only - guest housekeeping: using CPU 6,8 <-------------- this is node 0 only - guest real-time: using CPU 5,7,9,11,13,15 <-------------- this is node 1 only Thanks, (In reply to Peter Xu from comment #50) > (In reply to jianzzha from comment #44) > > I tried 3), still no luck. > > one out of 3 runs, > > # Max Latencies: 00013 00013 00022 00018 00019 00024 > > ... > > > <vcpupin vcpu='0' cpuset='4'/> > > <vcpupin vcpu='1' cpuset='6'/> > > <vcpupin vcpu='2' cpuset='8'/> > > <vcpupin vcpu='3' cpuset='10'/> > > <vcpupin vcpu='4' cpuset='12'/> > > <vcpupin vcpu='5' cpuset='14'/> > > <vcpupin vcpu='6' cpuset='16'/> > > <vcpupin vcpu='7' cpuset='18'/> > > <emulatorpin cpuset='20'/> > > Jianzhu, > > I noticed that you only isolated some cpus of node 0 but not node 1. I'm > not sure whether this will matter but... is it possible to isolate the node > 1 instead? 
I was trying to only run rt workload on node 1 and all the rest > (housekeeping of host and guest, and also the emulator codes) on node 0: > > - host configuration (16 cpus) > - host node 0: 0,2,4,6,8,10,12,14 > - host node 1: 1,3,5,7,9,11,13,15 > - guest pinning (8 vcpus) > - guest emulator: using CPU 2,4 > <-------------- this is node 0 only > - guest housekeeping: using CPU 6,8 > <-------------- this is node 0 only > - guest real-time: using CPU 5,7,9,11,13,15 > <-------------- this is node 1 only > > Thanks, I tried, no good (actually I think it is even worse, as one spike went up to 34us). can you paste your domain xml, let's compare what other difference is. After talking with Peter, I've made below changes in my testing: - Disabling l1d flush (I missed this step in the past testing) - Using host cores from NUMA node 1(replace using NUMA node 0) The latency looks better then my past testings, but still can exceed 20us a bit. In guest: # cat /proc/cmdline BOOT_IMAGE=/vmlinuz-3.10.0-862.14.4.rt56.821.el7.x86_64 root=/dev/mapper/rhel_vm--74--76-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto rd.lvm.lv=rhel_vm-74-76/root rd.lvm.lv=rhel_vm-74-76/swap rhgb quiet default_hugepagesz=1G iommu=pt intel_iommu=on kvm-intel.vmentry_l1d_flush=never spectre_v2=off nopti skew_tick=1 isolcpus=2,3,4,5,6,7 intel_pstate=disable nosoftlockup nohz=on nohz_full=2,3,4,5,6,7 rcu_nocbs=2,3,4,5,6,7 In host: # cat /proc/cmdline BOOT_IMAGE=/vmlinuz-3.10.0-862.14.4.rt56.821.el7.x86_64 root=/dev/mapper/rhel_dell--per430--09-root ro crashkernel=auto rd.lvm.lv=rhel_dell-per430-09/root rd.lvm.lv=rhel_dell-per430-09/swap console=ttyS0,115200n81 default_hugepagesz=1G iommu=pt intel_iommu=on kvm-intel.vmentry_l1d_flush=never spectre_v2=off nopti LANG=en_US.UTF-8 skew_tick=1 isolcpus=1,3,5,7,9,11,13,15,17,19,18,16,14 intel_pstate=disable nosoftlockup nohz=on nohz_full=1,3,5,7,9,11,13,15,17,19,18,16,14 rcu_nocbs=1,3,5,7,9,11,13,15,17,19,18,16,14 (1)70% cpu load # cat 
stress_ng.sh echo running stress cpu_list="0 1 2 3 4 5 6 7" for cpu in $cpu_list; do taskset -c $cpu stress-ng --cpu 1 --cpu-load 70 --cpu-method loop --timeout 24h & done # cyclictest -p 99 -t 6 -h 30 -m -n -a 2-7 -D 90m Result: # Min Latencies: 00005 00005 00005 00005 00005 00005 # Avg Latencies: 00007 00007 00006 00006 00007 00006 # Max Latencies: 00023 00015 00014 00013 00015 00013 (2)100% cpu load # cat stress.sh echo running stress cpu_list="0 1 2 3 4 5 6 7" for cpu in $cpu_list; do taskset -c $cpu stress --cpu 1 & done # cyclictest -p 99 -t 6 -h 30 -m -n -a 2-7 -D 10h Result: # Min Latencies: 00007 00007 00007 00007 00007 00007 # Avg Latencies: 00014 00007 00007 00007 00007 00007 # Max Latencies: 00022 00015 00016 00016 00015 00016 Summary, in my setup, no matter with 70% cpu load or 100% cpu load, the latency > 20us(not always, but in some runs, it exceeds 20us). (In reply to jianzzha from comment #51) > can you paste your domain xml, let's compare what other difference is. [root@virtlab422 ~]# virsh dumpxml rhel7-rt <domain type='kvm' id='3'> <name>rhel7-rt</name> <uuid>12056a04-48c9-11e9-9820-1866da5ff2ec</uuid> <memory unit='KiB'>4194304</memory> <currentMemory unit='KiB'>4194304</currentMemory> <memoryBacking> <hugepages> <page size='1048576' unit='KiB'/> </hugepages> <locked/> </memoryBacking> <vcpu placement='static'>8</vcpu> <cputune> <vcpupin vcpu='0' cpuset='6'/> <vcpupin vcpu='1' cpuset='8'/> <vcpupin vcpu='2' cpuset='5'/> <vcpupin vcpu='3' cpuset='7'/> <vcpupin vcpu='4' cpuset='9'/> <vcpupin vcpu='5' cpuset='11'/> <vcpupin vcpu='6' cpuset='13'/> <vcpupin vcpu='7' cpuset='15'/> <emulatorpin cpuset='2,4'/> <vcpusched vcpus='0' scheduler='fifo' priority='1'/> <vcpusched vcpus='1' scheduler='fifo' priority='1'/> <vcpusched vcpus='2' scheduler='fifo' priority='1'/> <vcpusched vcpus='3' scheduler='fifo' priority='1'/> <vcpusched vcpus='4' scheduler='fifo' priority='1'/> <vcpusched vcpus='5' scheduler='fifo' priority='1'/> <vcpusched vcpus='6' 
scheduler='fifo' priority='1'/> <vcpusched vcpus='7' scheduler='fifo' priority='1'/> </cputune> <numatune> <memory mode='strict' nodeset='0'/> </numatune> <resource> <partition>/machine</partition> </resource> <os> <type arch='x86_64' machine='pc-q35-rhel7.6.0'>hvm</type> <boot dev='hd'/> </os> <features> <acpi/> <pmu state='off'/> <vmport state='off'/> <ioapic driver='qemu'/> </features> <cpu mode='host-passthrough' check='none'> <feature policy='require' name='tsc-deadline'/> </cpu> <clock offset='utc'> <timer name='rtc' tickpolicy='catchup'/> <timer name='pit' tickpolicy='delay'/> <timer name='hpet' present='no'/> </clock> <on_poweroff>destroy</on_poweroff> <on_reboot>restart</on_reboot> <on_crash>restart</on_crash> <devices> <emulator>/usr/libexec/qemu-kvm</emulator> <disk type='file' device='disk'> <driver name='qemu' type='qcow2' cache='none' io='threads' iommu='on' ats='on'/> <source file='/home/images/rhel7-rt.qcow2'/> <backingStore/> <target dev='vda' bus='virtio'/> <alias name='virtio-disk0'/> <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> </disk> <controller type='usb' index='0' model='none'> <alias name='usb'/> </controller> <controller type='pci' index='0' model='pcie-root'> <alias name='pcie.0'/> </controller> <controller type='pci' index='1' model='pcie-root-port'> <model name='pcie-root-port'/> <target chassis='1' port='0x0'/> <alias name='pci.1'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> </controller> <controller type='pci' index='2' model='pcie-root-port'> <model name='pcie-root-port'/> <target chassis='2' port='0x0'/> <alias name='pci.2'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> </controller> <controller type='pci' index='3' model='pcie-root-port'> <model name='pcie-root-port'/> <target chassis='3' port='0x0'/> <alias name='pci.3'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> </controller> <controller type='pci' 
index='4' model='pcie-root-port'> <model name='pcie-root-port'/> <target chassis='4' port='0x0'/> <alias name='pci.4'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> </controller> <controller type='pci' index='5' model='pcie-root-port'> <model name='pcie-root-port'/> <target chassis='5' port='0x0'/> <alias name='pci.5'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/> </controller> <controller type='sata' index='0'> <alias name='ide'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/> </controller> <interface type='network'> <mac address='52:54:00:c4:6e:1e'/> <source network='default' bridge='virbr0'/> <target dev='vnet0'/> <model type='virtio'/> <alias name='net0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/> </interface> <serial type='pty'> <source path='/dev/pts/1'/> <target type='isa-serial' port='0'> <model name='isa-serial'/> </target> <alias name='serial0'/> </serial> <console type='pty' tty='/dev/pts/1'> <source path='/dev/pts/1'/> <target type='serial' port='0'/> <alias name='serial0'/> </console> <input type='mouse' bus='ps2'> <alias name='input0'/> </input> <input type='keyboard' bus='ps2'> <alias name='input1'/> </input> <memballoon model='virtio'> <alias name='balloon0'/> <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/> </memballoon> <iommu model='intel'> <driver intremap='on' caching_mode='on' iotlb='on'/> </iommu> </devices> <seclabel type='dynamic' model='selinux' relabel='yes'> <label>system_u:system_r:svirt_t:s0:c160,c234</label> <imagelabel>system_u:object_r:svirt_image_t:s0:c160,c234</imagelabel> </seclabel> <seclabel type='dynamic' model='dac' relabel='yes'> <label>+107:+107</label> <imagelabel>+107:+107</imagelabel> </seclabel> </domain> @Peter, Pei, notice in comment 60 and 61, there are some difference on the domain xml setup by OSP13 versus your non-osp test setup. 
Some of these differences might account for the latency difference we observed; we need to find out which XML items affect the cyclictest results. Nova doesn't allow manual edits of the XML; in most cases it will overwrite them. Instead, can you edit your domain XML to match the OSP13 settings and see which item causes the degradation? If such a difference exists and is identified, we can update the nova code.

(In reply to jianzzha from comment #76)
> @Peter, Pei,
>
> notice in comment 60 and 61, there are some differences in the domain xml
> setup by OSP13 versus your non-osp test setup. Some of these differences
> might account for the latency difference we observed; we need to find out
> which XML items affect the cyclictest results.
>
> Nova doesn't allow manual edits of the XML; in most cases it will overwrite
> them. Instead, can you edit your domain XML to match the OSP13 settings and
> see which item causes the degradation? If such a difference exists and is
> identified, we can update the nova code.

Hi Jianzhu, Peter,

My servers are running the regular rhel7.7 testing now, and this run will finish tomorrow. Then I'll keep the testing environment (e.g. all package versions) and replace the XML with the OSP XML provided by Jianzhu in Comment 44. I'll report the latency difference between the OSP XML config and our current XML config once everything finishes.

Regarding the machine type mentioned in Comment 61: q35 is fully supported from rhel7.6 onward, but the default machine type on rhel7.6+ is still pc-i440fx, so I actually do regular testing with pc-i440fx on rhel7.6+. I remember that in our past rhel7.6 testing the machine type didn't affect the latency result. Besides, q35 is the default machine type on rhel8.

Best regards,

Pei

Hi Jianzhu,

I've tested the OSP KVM-RT XML from Comment 44, and the max latency looks much higher.
After several tries at removing some devices (vnc, cirrus, console, usb, ...), I still got much higher latency. I'll keep trying to find which config causes this spike difference.

KVM-RT XML from OSP (10h cyclictest result):
# Total: 038718584 038718559 038718545 038718499 038718514 038718411
# Min Latencies: 00006 00006 00006 00006 00006 00006
# Avg Latencies: 00011 00011 00011 00011 00011 00011
# Max Latencies: 00054 00057 00062 00063 00063 00063

KVM-RT XML from our past testings (20h cyclictest result):
# Total: 081445061 081445077 081445069 081445067 081445056 081445053
# Min Latencies: 00006 00006 00006 00006 00006 00006
# Avg Latencies: 00010 00010 00010 00010 00010 00010
# Max Latencies: 00015 00022 00021 00016 00019 00015

Versions:
3.10.0-957.15.1.rt56.927skipktimersoftd1.el7.x86_64

Best regards,

Pei

Hi Jianzhu,

With the KVM-RT XML from OSP, the high spike was caused by <stats period='10'/> in the memballoon device:

<memballoon model='virtio'>
  <stats period='10'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</memballoon>

After removing <stats period='10'/> and keeping all other XML config the same (including all your devices, e.g. usb, vnc, video, ...), the max latency value is equal to that of the KVM-RT XML from the platform.

Best regards,

Pei

(In reply to Pei Zhang from comment #87)
> Hi Jianzhu,
>
> With the KVM-RT XML from OSP, the high spike was caused by
> <stats period='10'/> in the memballoon device:
>
> <memballoon model='virtio'>
>   <stats period='10'/>
>   <address type='pci' domain='0x0000' bus='0x00' slot='0x05'
> function='0x0'/>
> </memballoon>
>
> After removing <stats period='10'/> and keeping all other XML config the
> same (including all your devices, e.g. usb, vnc, video, ...), the max
> latency value is equal to that of the KVM-RT XML from the platform.

Great find Pei!

Would you please open a BZ against RHOS for this issue? Please check bug 1646397 as an example for product, component, etc., except that the memballoon issue you found is a bug for KVM-RT, not an RFE.
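The fix Pei identified is a one-element change to the libvirt domain XML. As a minimal, hypothetical sketch (not the actual Nova-side fix, and operating on a trimmed copy of the stanza rather than a full `virsh dumpxml` output), the offending element can be filtered out like this:

```shell
# Minimal sketch: strip the <stats period=.../> element from a memballoon
# stanza so the periodic virtio-balloon stats polling, identified above as
# the spike source, is disabled. The stanza is a trimmed copy of the
# OSP-generated one; a real workflow would edit the full domain XML.
xml="<memballoon model='virtio'>
  <stats period='10'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</memballoon>"
fixed=$(printf '%s\n' "$xml" | grep -v '<stats period=')
printf '%s\n' "$fixed"
```

In practice the edited XML would be re-applied with `virsh define`; under OSP, Nova regenerates the XML on its own, which is why the proper fix is tracked separately in bug 1701509.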
(In reply to Luiz Capitulino from comment #89) > (In reply to Pei Zhang from comment #87) > > Hi Jianzhu, > > > > With KVM-RT XML from OSP, the high spike was caused by <stats period='10'/> > > from memballoon device: > > > > > > <memballoon model='virtio'> > > <stats period='10'/> > > <address type='pci' domain='0x0000' bus='0x00' slot='0x05' > > function='0x0'/> > > </memballoon> > > > > > > Removing <stats period='10'/>, others XML config keep same(including all > > your devices, eg. usb, vnc, video..), the max latency value is equal to the > > KVM-RT xml from platform. > > Great find Pei! > > Would you please open a BZ against RHOS for this issue? Please, check > bug 1646397 as an example for product, component, etc. Except that > the memballoon issue you found is a bug for KVM-RT, not an RFE. Thanks Luiz for the bz reference, I filed a new RHOSP BZ to track this issue: Bug 1701509 - <stats period='10'/> of memballoon device cause high latency spike for KVM-RT guest (In reply to Luiz Capitulino from comment #89) > (In reply to Pei Zhang from comment #87) > > Hi Jianzhu, > > > > With KVM-RT XML from OSP, the high spike was caused by <stats period='10'/> > > from memballoon device: > > > > > > <memballoon model='virtio'> > > <stats period='10'/> > > <address type='pci' domain='0x0000' bus='0x00' slot='0x05' > > function='0x0'/> > > </memballoon> > > > > > > Removing <stats period='10'/>, others XML config keep same(including all > > your devices, eg. usb, vnc, video..), the max latency value is equal to the > > KVM-RT xml from platform. > > Great find Pei! > > Would you please open a BZ against RHOS for this issue? Please, check > bug 1646397 as an example for product, component, etc. Except that > the memballoon issue you found is a bug for KVM-RT, not an RFE. Indeed great finding! Excellent team work. Luiz, is there anything that need to tackle from openstack side on this issue? 
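Pulling the thread's findings together, a dumped domain XML can be sanity-checked for the two settings that turned out to matter. This is a rough, hypothetical sketch; the stub XML and grep patterns are assumptions modeled on the stanzas quoted in comments 44 and 61:

```shell
# Rough sanity check (hypothetical): scan a domain XML for the two settings
# this thread found to matter for KVM-RT latency. The stub below stands in
# for `virsh dumpxml <domain>` output and mirrors the problematic
# OSP-generated XML from comment 44.
xml="<cputune><vcpusched vcpus='0' scheduler='fifo' priority='1'/></cputune>
<memballoon model='virtio'><stats period='10'/></memballoon>"

# vCPUs should run SCHED_FIFO (vcpusched entries present)
if echo "$xml" | grep -q "scheduler='fifo'"; then fifo=yes; else fifo=no; fi

# memballoon stats polling should be absent (it caused the >20us spikes)
if echo "$xml" | grep -q '<stats period='; then balloon_stats=yes; else balloon_stats=no; fi

echo "fifo=$fifo balloon_stats=$balloon_stats"
```

For the stub above this prints fifo=yes balloon_stats=yes, i.e. the scheduling is fine but the balloon stats polling warning applies.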
(In reply to jianzzha from comment #93)
> (In reply to Luiz Capitulino from comment #89)
> > (In reply to Pei Zhang from comment #87)
> > > Hi Jianzhu,
> > >
> > > With the KVM-RT XML from OSP, the high spike was caused by
> > > <stats period='10'/> in the memballoon device:
> > >
> > > <memballoon model='virtio'>
> > >   <stats period='10'/>
> > >   <address type='pci' domain='0x0000' bus='0x00' slot='0x05'
> > > function='0x0'/>
> > > </memballoon>
> > >
> > > After removing <stats period='10'/> and keeping all other XML config the
> > > same (including all your devices, e.g. usb, vnc, video, ...), the max
> > > latency value is equal to that of the KVM-RT XML from the platform.
> >
> > Great find Pei!
> >
> > Would you please open a BZ against RHOS for this issue? Please check
> > bug 1646397 as an example for product, component, etc., except that
> > the memballoon issue you found is a bug for KVM-RT, not an RFE.
>
> Indeed great finding! Excellent team work.
>
> Luiz, is there anything that needs to be tackled on the openstack side for
> this issue?

Yes, Pei has opened bug 1701509 for this issue. First we need to know the implications of not using the balloon stats. If we can live without them, then dropping them is the easiest solution.

We're going to use this BZ as the main tracker for achieving max < 20us in guests. In order to achieve that, though, we need to solve two other issues:

o Bug 1550584 - spurious ktimersoftd wake ups increases latency
o Bug 1701509 - <stats period='10'/> of memballoon device cause high latency spike for KVM-RT guest

Setting them as dependencies.

When the host is running under

[root@netqe10 ~]# uname -r
3.10.0-1062.rt56.1022.el7.x86_64
[root@netqe10 ~]#

the guest comes up successfully. But when the host is running under kernel-rt-3.10.0-1063.rt56.1023.el7.x86_64, the guest fails to start:

virsh # start master-virbr0
error: Failed to start domain master-virbr0
error: unsupported configuration: Domain requires KVM, but it is not available.
Check that virtualization is enabled in the host BIOS, and that the host configuration is set up to load the kvm modules. Below is the qemu-kvm info:

[root@netqe10 ~]# rpm -qa | grep qemu
qemu-kvm-common-rhev-2.12.0-33.el7_7.2.x86_64
ipxe-roms-qemu-20180825-2.git133f4c.el7.noarch
libvirt-daemon-driver-qemu-4.5.0-23.el7.x86_64
qemu-kvm-tools-rhev-2.12.0-33.el7_7.2.x86_64
qemu-img-rhev-2.12.0-33.el7_7.2.x86_64
qemu-kvm-rhev-2.12.0-33.el7_7.2.x86_64

Bad kernel or outdated qemu-kvm-rhev?

I made it work after updating kernel-rt-kvm to kernel-rt-kvm-3.10.0-1063.rt56.1023.el7.x86_64, running "dracut -v -f", and rebooting.

But a 30m cyclictest run has max latencies of 21 and 22 us.

Please see below to check whether I am missing something.

[root@localhost rt-scripts]# ./run-cyclictest.sh -d 30m -c 1,2 -k

Test started at Wed Sep 11 19:59:16 EDT 2019

Test duration: 30m
Run rteval: n
Run stress: y
Isolated CPUs: 1,2
Kernel: 3.10.0-1063.rt56.1023.el7.x86_64
Kernel cmd-line: BOOT_IMAGE=/vmlinuz-3.10.0-1063.rt56.1023.el7.x86_64 root=UUID=bc815b9b-25e7-4c22-9498-d1f84c446bcf ro rhgb quiet crashkernel=auto spectre_v2=retpoline console=ttyS0,115200 default_hugepagesz=1G hugepagesz=1G hugepages=4 nohz=on nohz_full=1-4 rcu_nocbs=1-4 tuned.non_isolcpus=00000001 intel_pstate=disable nosoftlockup LANG=en_US.UTF-8 skew_tick=1 isolcpus=1,2,3,4 intel_pstate=disable nosoftlockup nohz=on nohz_full=1,2,3,4 rcu_nocbs=1,2,3,4
x86 debug opts: retp_enabled=1 pti_enabled=1 ibrs_enabled=0 ibpb_enabled=1
Machine: localhost.localdomain
CPU: Intel(R) Xeon(R) CPU E5-2643 v3 @ 3.40GHz
Results dir: /root/results/cyclictest-results.xfavni

running stress
taskset -c 1 stress --cpu 1
taskset -c 2 stress --cpu 1

starting Wed Sep 11 19:59:17 EDT 2019
taskset -c 1,2 cyclictest -m -n -q -p95 -D 30m -h60 -i 200 -t 2 -a 1,2
ended Wed Sep 11 20:29:17 EDT 2019

output dir is /root/results/cyclictest-results.xfavni

# Min Latencies: 00006 00009
# Avg Latencies: 00010 00010
# Max Latencies: 00022 00021

./run-cyclictest.sh: line 261: 1954 Terminated
$cmdline 2>&1 > $stress_out
./run-cyclictest.sh: line 261: 1955 Terminated $cmdline 2>&1 > $stress_out
[root@localhost rt-scripts]#

I have a question on the Reproducer, step 3. The cyclictest command has no time duration ("-D") option, so the test duration is determined by "-l 100000", which can be very short: as little as 20 seconds. Is the intent to run cyclictest for as little as 20 seconds? NOTE: My test mentioned in Comment #107 has a duration of 30 minutes.

Steps to Reproduce: *** COPY from bug description above ***
1. setup 8 vCPU guest, 2 for housekeeping, 6 for cyclictest
2. stress all 8 cores with: for i in {0..7}; do taskset -c $i stress-ng --cpu 1 --cpu-load 70 --cpu-method loop --timeout 24h & done
3. run cyclictest 3 times in a row: cyclictest -l 100000 -p 99 -t 6 -h 30 -m -n -a 2-7

Jean,

How is the max=22us impacting you?

Marcelo, can you take a look at this?

(In reply to Jean-Tsung Hsiao from comment #107)
> I made it work after updating kernel-rt-kvm to
> kernel-rt-kvm-3.10.0-1063.rt56.1023.el7.x86_64, and ran "dracut -v -f", and
> reboot.
>
> But, 30m cycclitest has max latencies at 21 and 22 us.
>
> Please below to see if I am missing something.
>
> [root@localhost rt-scripts]# ./run-cyclictest.sh -d 30m -c 1,2 -k
>
> Test started at Wed Sep 11 19:59:16 EDT 2019
>
> Test duration: 30m
> Run rteval: n
> Run stress: y
> Isolated CPUs: 1,2
> Kernel: 3.10.0-1063.rt56.1023.el7.x86_64
> Kernel cmd-line: BOOT_IMAGE=/vmlinuz-3.10.0-1063.rt56.1023.el7.x86_64

This kernel contains the fix for the problem. Make sure you run it on both host and guest.

Also an updated tuned package is necessary (on both host and guest): tuned-2.11.0-8.el7 or newer.

Can you confirm you see the problem with an updated setup?
> root=UUID=bc815b9b-25e7-4c22-9498-d1f84c446bcf ro rhgb quiet
> crashkernel=auto spectre_v2=retpoline console=ttyS0,115200
> default_hugepagesz=1G hugepagesz=1G hugepages=4 nohz=on nohz_full=1-4
> rcu_nocbs=1-4 tuned.non_isolcpus=00000001 intel_pstate=disable nosoftlockup
> LANG=en_US.UTF-8 skew_tick=1 isolcpus=1,2,3,4 intel_pstate=disable
> nosoftlockup nohz=on nohz_full=1,2,3,4 rcu_nocbs=1,2,3,4
> x86 debug opts: retp_enabled=1 pti_enabled=1 ibrs_enabled=0 ibpb_enabled=1
> Machine: localhost.localdomain
> CPU: Intel(R) Xeon(R) CPU E5-2643 v3 @ 3.40GHz
> Results dir: /root/results/cyclictest-results.xfavni
>
> running stress
> taskset -c 1 stress --cpu 1
> taskset -c 2 stress --cpu 1
>
> starting Wed Sep 11 19:59:17 EDT 2019
> taskset -c 1,2 cyclictest -m -n -q -p95 -D 30m -h60 -i 200 -t 2 -a 1,2
> ended Wed Sep 11 20:29:17 EDT 2019
>
> output dir is /root/results/cyclictest-results.xfavni
>
> # Min Latencies: 00006 00009
> # Avg Latencies: 00010 00010
> # Max Latencies: 00022 00021
>
> ./run-cyclictest.sh: line 261: 1954 Terminated $cmdline 2>&1 > $stress_out
> ./run-cyclictest.sh: line 261: 1955 Terminated $cmdline 2>&1 > $stress_out
> [root@localhost rt-scripts]#

(In reply to Luiz Capitulino from comment #109)
> Jean,
>
> How is the max=22us impacting you?

Just curious about the Reproducer described in the description:

run cyclictest 3 times in a row: cyclictest -l 100000 -p 99 -t 6 -h 30 -m -n -a 2-7

I tried "-l 100000". The test duration took only 20 seconds on my Haswell test bed, and the spikes were well below 20 us.

So, is this a valid test?

All the tests that I have run were without the "-l" option.

>
> Marcelo, can you take a look at this?

(In reply to Jean-Tsung Hsiao from comment #111)
> (In reply to Luiz Capitulino from comment #109)
> > Jean,
> >
> > How is the max=22us impacting you?
> > Just curious about the Reproducer described in the description: > > run cyclict test 3 times in a row: cyclictest -l 100000 -p 99 -t 6 -h 30 -m > -n -a 2-7 > > I tried "-l 100000". The test duration took only 20 seconds on my Haswell > test bed, and the spikes were well below 20 us. > > So, is this a valid test? > > All tests that I have run are without "-l" option. Yes, this is a valid test. Can you please reply to comment 110? Jean-Tsung Hsiao, can you please reply to comment#110. Thank you! Testing 24h cyclictest with 3 KVM-RT standard testing scenarios, the max latency is 17us which is expected. ==Results== (1)Single VM with 1 rt vCPU: # Min Latencies: 00005 # Avg Latencies: 00006 # Max Latencies: 00015 (2)Single VM with 8 rt vCPUs: # Min Latencies: 00005 00007 00007 00007 00007 00007 00007 00007 # Avg Latencies: 00006 00007 00007 00007 00007 00007 00007 00007 # Max Latencies: 00015 00015 00015 00015 00015 00015 00016 00015 (3)Multiple VMs each with 1 rt vCPU: - VM1 # Min Latencies: 00005 # Avg Latencies: 00006 # Max Latencies: 00015 - VM2 # Min Latencies: 00005 # Avg Latencies: 00006 # Max Latencies: 00015 - VM3 # Min Latencies: 00005 # Avg Latencies: 00006 # Max Latencies: 00017 - VM4 # Min Latencies: 00005 # Avg Latencies: 00006 # Max Latencies: 00015 ==Versions== tuned-2.11.0-8.el7.noarch qemu-kvm-rhev-2.12.0-37.el7.x86_64 libvirt-4.5.0-27.el7.x86_64 kernel-rt-3.10.0-1101.rt56.1061.el7.x86_64 ==Details of this testing== - Host kernel line: BOOT_IMAGE=/vmlinuz-3.10.0-1101.rt56.1061.el7.x86_64 root=/dev/mapper/rhel_dell--per430--09-root ro crashkernel=auto rd.lvm.lv=rhel_dell-per430-09/root rd.lvm.lv=rhel_dell-per430-09/swap console=ttyS0,115200n81 LANG=en_US.UTF-8 default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=1,3,5,7,9,11,13,15,17,19,12,14,16,18 intel_pstate=disable nosoftlockup nohz=on nohz_full=1,3,5,7,9,11,13,15,17,19,12,14,16,18 rcu_nocbs=1,3,5,7,9,11,13,15,17,19,12,14,16,18 kvm-intel.vmentry_l1d_flush=never spectre_v2=off nopti 
- Testing info of three test cases: (1)Single VM with 1 rt vCPU: Test started at: 2019-10-12 12:52:06 Saturday Kernel cmdline: BOOT_IMAGE=/vmlinuz-3.10.0-1101.rt56.1061.el7.x86_64 root=/dev/mapper/rhel_vm--74--14-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto rd.lvm.lv=rhel_vm-74-14/root rd.lvm.lv=rhel_vm-74-14/swap rhgb quiet LANG=en_US.UTF-8 default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1 kvm-intel.vmentry_l1d_flush=never spectre_v2=off nopti X86 debug pts: pti_enable=0 ibpb_enabled=1 ibrs_enabled=0 retp_enabled=0 Machine: vm-74-14.lab.eng.pek2.redhat.com CPU: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz Test duration(plan): 24h Test ended at: 2019-10-13 12:52:08 Sunday cyclictest cmdline: taskset -c 1 /home/nfv-virt-rt-kvm/tools/cyclictest -m -n -q -p95 -D 24h -h60 -t 1 -a 1 --notrace -i 200 cyclictest results: # Min Latencies: 00005 # Avg Latencies: 00006 # Max Latencies: 00015 (2)Single VM with 8 rt vCPUs: Test started at: 2019-10-12 12:54:32 Saturday Kernel cmdline: BOOT_IMAGE=/vmlinuz-3.10.0-1101.rt56.1061.el7.x86_64 root=/dev/mapper/rhel_vm--73--228-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto rd.lvm.lv=rhel_vm-73-228/root rd.lvm.lv=rhel_vm-73-228/swap rhgb quiet LANG=en_US.UTF-8 default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=1,2,3,4,5,6,7,8 intel_pstate=disable nosoftlockup nohz=on nohz_full=1,2,3,4,5,6,7,8 rcu_nocbs=1,2,3,4,5,6,7,8 kvm-intel.vmentry_l1d_flush=never spectre_v2=off nopti X86 debug pts: pti_enable=0 ibpb_enabled=1 ibrs_enabled=0 retp_enabled=0 Machine: vm-73-228.lab.eng.pek2.redhat.com CPU: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz Test duration(plan): 24h Test ended at: 2019-10-13 12:54:33 Sunday cyclictest cmdline: taskset -c 1,2,3,4,5,6,7,8 /home/nfv-virt-rt-kvm/tools/cyclictest -m -n -q -p95 -D 24h -h60 -t 8 -a 1,2,3,4,5,6,7,8 --notrace -i 200 cyclictest results: # Min 
Latencies: 00005 00007 00007 00007 00007 00007 00007 00007 # Avg Latencies: 00006 00007 00007 00007 00007 00007 00007 00007 # Max Latencies: 00015 00015 00015 00015 00015 00015 00016 00015 (3)Multiple VMs each with 1 rt vCPU: - VM1 Test started at: 2019-10-12 23:38:24 Saturday Kernel cmdline: BOOT_IMAGE=/vmlinuz-3.10.0-1101.rt56.1061.el7.x86_64 root=/dev/mapper/rhel_bootp--73--75--130-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto rd.lvm.lv=rhel_bootp-73-75-130/root rd.lvm.lv=rhel_bootp-73-75-130/swap rhgb quiet LANG=en_US.UTF-8 default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1 kvm-intel.vmentry_l1d_flush=never spectre_v2=off nopti X86 debug pts: pti_enable=0 ibpb_enabled=1 ibrs_enabled=0 retp_enabled=0 Machine: bootp-73-75-130.lab.eng.pek2.redhat.com CPU: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz Test duration(plan): 24h Test ended at: 2019-10-13 23:38:29 Sunday cyclictest cmdline: taskset -c 1 /home/nfv-virt-rt-kvm/tools/cyclictest -m -n -q -p95 -D 24h -h60 -t 1 -a 1 --notrace -i 200 cyclictest results: # Min Latencies: 00005 # Avg Latencies: 00006 # Max Latencies: 00015 - VM2 Test started at: 2019-10-12 23:38:23 Saturday Kernel cmdline: BOOT_IMAGE=/vmlinuz-3.10.0-1101.rt56.1061.el7.x86_64 root=/dev/mapper/rhel_vm--74--190-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto rd.lvm.lv=rhel_vm-74-190/root rd.lvm.lv=rhel_vm-74-190/swap rhgb quiet LANG=en_US.UTF-8 default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1 kvm-intel.vmentry_l1d_flush=never spectre_v2=off nopti X86 debug pts: pti_enable=0 ibpb_enabled=1 ibrs_enabled=0 retp_enabled=0 Machine: vm-74-190.lab.eng.pek2.redhat.com CPU: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz Test duration(plan): 24h Test ended at: 2019-10-13 23:38:27 Sunday cyclictest cmdline: taskset -c 1 
/home/nfv-virt-rt-kvm/tools/cyclictest -m -n -q -p95 -D 24h -h60 -t 1 -a 1 --notrace -i 200 cyclictest results: # Min Latencies: 00005 # Avg Latencies: 00006 # Max Latencies: 00015 - VM3 Test started at: 2019-10-12 23:38:24 Saturday Kernel cmdline: BOOT_IMAGE=/vmlinuz-3.10.0-1101.rt56.1061.el7.x86_64 root=/dev/mapper/rhel_vm--74--203-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto rd.lvm.lv=rhel_vm-74-203/root rd.lvm.lv=rhel_vm-74-203/swap rhgb quiet LANG=en_US.UTF-8 default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1 kvm-intel.vmentry_l1d_flush=never spectre_v2=off nopti X86 debug pts: pti_enable=0 ibpb_enabled=1 ibrs_enabled=0 retp_enabled=0 Machine: vm-74-203.lab.eng.pek2.redhat.com CPU: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz Test duration(plan): 24h Test ended at: 2019-10-13 23:38:28 Sunday cyclictest cmdline: taskset -c 1 /home/nfv-virt-rt-kvm/tools/cyclictest -m -n -q -p95 -D 24h -h60 -t 1 -a 1 --notrace -i 200 cyclictest results: # Min Latencies: 00005 # Avg Latencies: 00006 # Max Latencies: 00017 - VM4 Test started at: 2019-10-12 23:38:24 Saturday Kernel cmdline: BOOT_IMAGE=/vmlinuz-3.10.0-1101.rt56.1061.el7.x86_64 root=/dev/mapper/rhel_vm--74--198-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto rd.lvm.lv=rhel_vm-74-198/root rd.lvm.lv=rhel_vm-74-198/swap rhgb quiet LANG=en_US.UTF-8 default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1 kvm-intel.vmentry_l1d_flush=never spectre_v2=off nopti X86 debug pts: pti_enable=0 ibpb_enabled=1 ibrs_enabled=0 retp_enabled=0 Machine: vm-74-198.lab.eng.pek2.redhat.com CPU: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz Test duration(plan): 24h Test ended at: 2019-10-13 23:38:28 Sunday cyclictest cmdline: taskset -c 1 /home/nfv-virt-rt-kvm/tools/cyclictest -m -n -q -p95 -D 24h -h60 -t 1 -a 1 --notrace -i 200 
cyclictest results: # Min Latencies: 00005 # Avg Latencies: 00006 # Max Latencies: 00015 So this bug has been fixed very well. Move to 'VERIFIED'. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:1070 (In reply to Marcelo Tosatti from comment #110) > (In reply to Jean-Tsung Hsiao from comment #107) > > I made it work after updating kernel-rt-kvm to > > kernel-rt-kvm-3.10.0-1063.rt56.1023.el7.x86_64, and ran "dracut -v -f", and > > reboot. > > > > But, 30m cycclitest has max latencies at 21 and 22 us. > > > > Please below to see if I am missing something. > > > > [root@localhost rt-scripts]# ./run-cyclictest.sh -d 30m -c 1,2 -k > > > > Test started at Wed Sep 11 19:59:16 EDT 2019 > > > > Test duration: 30m > > Run rteval: n > > Run stress: y > > Isolated CPUs: 1,2 > > Kernel: 3.10.0-1063.rt56.1023.el7.x86_64 > > Kernel cmd-line: BOOT_IMAGE=/vmlinuz-3.10.0-1063.rt56.1023.el7.x86_64 > > This kernel contains the fix for the problem. Make sure you run it on both > host and guest. > > Also updated tuned package is necessary (on both host and guest): > tuned-2.11.0-8.el7 or newer. > > Can you confirm you see the problem with an updated setup ? 
> > > root=UUID=bc815b9b-25e7-4c22-9498-d1f84c446bcf ro rhgb quiet > > crashkernel=auto spectre_v2=retpoline console=ttyS0,115200 > > default_hugepagesz=1G hugepagesz=1G hugepages=4 nohz=on nohz_full=1-4 > > rcu_nocbs=1-4 tuned.non_isolcpus=00000001 intel_pstate=disable nosoftlockup > > LANG=en_US.UTF-8 skew_tick=1 isolcpus=1,2,3,4 intel_pstate=disable > > nosoftlockup nohz=on nohz_full=1,2,3,4 rcu_nocbs=1,2,3,4 > > x86 debug opts: retp_enabled=1 pti_enabled=1 ibrs_enabled=0 ibpb_enabled=1 > > Machine: localhost.localdomain > > CPU: Intel(R) Xeon(R) CPU E5-2643 v3 @ 3.40GHz > > Results dir: /root/results/cyclictest-results.xfavni > > > > running stress > > taskset -c 1 stress --cpu 1 > > taskset -c 2 stress --cpu 1 > > > > starting Wed Sep 11 19:59:17 EDT 2019 > > taskset -c 1,2 cyclictest -m -n -q -p95 -D 30m -h60 -i 200 -t 2 -a 1,2 > > ended Wed Sep 11 20:29:17 EDT 2019 > > > > output dir is /root/results/cyclictest-results.xfavni > > > > > > # Min Latencies: 00006 00009 > > # Avg Latencies: 00010 00010 > > # Max Latencies: 00022 00021 > > > > ./run-cyclictest.sh: line 261: 1954 Terminated $cmdline 2>&1 > > > $stress_out > > ./run-cyclictest.sh: line 261: 1955 Terminated $cmdline 2>&1 > > > $stress_out > > [root@localhost rt-scripts]# The bug has been closed. |
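The acceptance criterion used throughout this BZ (every per-thread max latency below 20 us) can be checked mechanically against the "# Max Latencies:" summary line that cyclictest and run-cyclictest.sh print. A sketch, using the numbers from comment #107 as sample input:

```shell
#!/bin/bash
# Compare a cyclictest "# Max Latencies:" summary line against the 20 us
# target. The sample line is the result reported in comment #107; in real
# use it would be read from the run-cyclictest.sh results directory.
summary='# Max Latencies: 00022 00021'

target=20
fail=0
worst=0
for v in ${summary#*Latencies:}; do
    lat=$((10#$v))                      # base 10, so leading zeros are not octal
    if [ "$lat" -gt "$worst" ]; then worst=$lat; fi
    if [ "$lat" -ge "$target" ]; then fail=1; fi
done

echo "worst max latency: ${worst} us (target < ${target} us)"
if [ "$fail" -eq 0 ]; then echo "PASS"; else echo "FAIL"; fi
```

Fed the sample above it prints FAIL; against Pei's verification runs (max 15 to 17 us) the same check prints PASS.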