Bug 1678810 - Starting 2nd VM on same NUMA socket is causing huge latency during startup of second VM
Summary: Starting 2nd VM on same NUMA socket is causing huge latency during startup of second VM
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel-rt
Version: 7.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Luiz Capitulino
QA Contact: Pei Zhang
URL:
Whiteboard:
Depends On: 1550584 1723499 1723502
Blocks: 1672377
 
Reported: 2019-02-19 16:19 UTC by Ajay Simha
Modified: 2021-03-24 13:19 UTC
CC: 25 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-05 15:52:54 UTC
Target Upstream Version:
Embargoed:



Description Ajay Simha 2019-02-19 16:19:23 UTC
Description of problem:
There is an increase in measured latency on VM1 when VM2 is being launched on the same NUMA node. Can you, or other experts at Red Hat, suggest any modifications to the KVM configuration to reduce the noisy-neighbor effect during the VM2 launch?



Version-Release number of selected component (if applicable):
Host OS: RHEL 7.5 
Guest OS: CentOS 7.5

Both running RT kernels.

How reproducible:
To be furnished by Cisco


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Luiz Capitulino 2019-02-19 16:30:22 UTC
How many microseconds is "huge"? What's the maximum latency you expect, and what is
the maximum latency you get without the second VM starting on the same
NUMA node?

Comment 3 Luiz Capitulino 2019-02-19 21:52:28 UTC
I've spent some hours trying to reproduce this, but I'm not sure I was able
to reproduce the issue you describe.

All my tests were short runs (from 10 minutes to 40 minutes). The guest
I ran cyclictest in had 2 vCPUs (one real-time and one housekeeping). The second
guest is also real-time and had the same configuration, except that I just
used it to start and power off in a loop.

The maximum baseline latency in this system is around 24us (again, short
runs only). Having the second guest starting and powering off for 40 minutes
only increased this result by a few microseconds when using our recommended
configuration, which is to have the vcpu0 and emulator threads of each guest in a
different NUMA node from the real-time vcpu threads.

When all threads of all guests are in the same numa node, I get a maximum
latency of 38us. This is 60% higher. However, this is not the recommended
configuration. Actually, I'm not even sure this is supported if you don't use
CAT (Cache Allocation Technology).

Now, to be able to help more, I need to know all the questions asked in
comment 1 plus:

1. All versions (qemu, kernel, tuned)

2. Is host and guest running the same version of everything?

3. You mentioned by email you're not using tuned. I'm not sure this is supported.
   In any case, have you tried with tuned?

4. What is the detailed configuration of hosts and guests? (guest size,
   vcpu count, pinning, etc)

5. What is your test-case?

6. What is your expected latency?

Comment 4 Yichen Wang 2019-02-20 01:54:33 UTC
Answers to all questions below:

Steps to Reproduce:
1. Spawn a CentOS VM with hw:cpu_policy=dedicated, hw:cpu_realtime=yes, hw:cpu_realtime_mask=^0, and the fix back-ported from Rocky to Queens so that hw:emulator_threads_policy=share will pin emulator threads to a dedicated pool;
2. Update CentOS VM to latest RT kernel, and run cyclictest in that VM
3. Spawn another VM with the same socket
4. Observe a spike in the first VM spawned;

All versions (qemu, kernel, tuned):
Kernel: 3.10.0-957.1.3.rt56.913.el7.x86_64
QEMU: qemu-kvm-rhev-2.12.0-18.el7_6.1.x86_64
tuned: N/A

Is host and guest running the same version of everything?
Host is RHEL, Guest is CentOS 3.10.0-957.5.1.rt56.916.el7.x86_64

You mentioned by email you're not using tuned. I'm not sure this is supported. In any case, have you tried with tuned?
No. We will not use tuned. tuned does nothing but change GRUB. Moreover, it changes GRUB to add "isolcpus", which causes problems for the QEMU emulator thread and results in very slow I/O. We use "tuna" instead; the same recommendation was given by Red Hat some time ago, and I am not sure if that has been fixed already.
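As an illustration only (the core list below is an example, not the actual configuration), isolating cores with tuna looks roughly like:

tuna --cpus=2-19 --isolate     # move movable threads and IRQs away from cores 2-19
tuna --show_threads            # then review thread affinities and priorities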

What is the detailed configuration of hosts and guests? (guest size, vcpu count, pinning, etc)
Host: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, HT OFF.
BOOT_IMAGE=/vmlinuz-3.10.0-957.1.3.rt56.913.el7.x86_64 root=/dev/mapper/quincy--compute--2_vg_root-lv_root ro audit=1 crashkernel=auto rd.lvm.lv=quincy-compute-2_vg_root/lv_root rd.lvm.lv=quincy-compute-2_vg_root/lv_swap nomodeset console=tty0 elevator=cfq hugepagesz=2M hugepages=89993 intel_idle.max_cstate=1 nomodeset nohz_full=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39 intel_iommu=on clocksource=tsc skew_tick=1 rcu_nocbs=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39 rcu_nocb_poll=1 pcie_aspm.policy=performance pci-stub.ids=1137:0071,8086:10ed,8086:154c console=ttyS1,115200n8 console=tty0 nosoftlockup iommu=pt nohz=on tsc=reliable idle=poll intel_pstate=disable nmi_watchdog=0 transparent_hugepage=never

Guest: 9 vCPU, 1G Huge Page * 4
novalibvirt_18079 [root@quincy-compute-2 /]# virsh dumpxml 1 | grep vcpu
        <nova:vcpus>9</nova:vcpus>
  <vcpu placement='static'>9</vcpu>
    <vcpupin vcpu='0' cpuset='2'/>
    <vcpupin vcpu='1' cpuset='3'/>
    <vcpupin vcpu='2' cpuset='4'/>
    <vcpupin vcpu='3' cpuset='5'/>
    <vcpupin vcpu='4' cpuset='6'/>
    <vcpupin vcpu='5' cpuset='7'/>
    <vcpupin vcpu='6' cpuset='8'/>
    <vcpupin vcpu='7' cpuset='9'/>
    <vcpupin vcpu='8' cpuset='10'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='2' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='3' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='4' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='5' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='6' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='7' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='8' scheduler='fifo' priority='1'/>
novalibvirt_18079 [root@quincy-compute-2 /]# virsh emulatorpin 1
emulator: CPU Affinity
----------------------------------
       *: 1 

What is your test-case?
See above, steps to reproduce.

What is your expected latency?
With just the first VM running alone, cyclictest shows the max to be around 30us (short runs):
[root@test-vm1 ~]# sudo ./cyclictest_new -S -m -n -p99 -d0 -A ffff
# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 0.35 0.13 0.08 1/214 8550

T: 0 ( 8542) P:99 I:1000 C:  24414 Min:     11 Act:   14 Avg:   14 Max:      31 GT25:      22 GT50:   0 GT75:   0 GT100:   0 GT150:   0 GT200:   0 GT250:   0
T: 1 ( 8543) P:99 I:1000 C:  24414 Min:      7 Act:   11 Avg:    9 Max:      34 GT25:     160 GT50:   0 GT75:   0 GT100:   0 GT150:   0 GT200:   0 GT250:   0
T: 2 ( 8544) P:99 I:1000 C:  24414 Min:      7 Act:    7 Avg:   11 Max:      24 GT25:       0 GT50:   0 GT75:   0 GT100:   0 GT150:   0 GT200:   0 GT250:   0
T: 3 ( 8545) P:99 I:1000 C:  24414 Min:      7 Act:   16 Avg:   12 Max:      25 GT25:       0 GT50:   0 GT75:   0 GT100:   0 GT150:   0 GT200:   0 GT250:   0
T: 4 ( 8546) P:99 I:1000 C:  24414 Min:      7 Act:   10 Avg:   10 Max:      28 GT25:       2 GT50:   0 GT75:   0 GT100:   0 GT150:   0 GT200:   0 GT250:   0
T: 5 ( 8547) P:99 I:1000 C:  24414 Min:      7 Act:   12 Avg:    9 Max:      26 GT25:       1 GT50:   0 GT75:   0 GT100:   0 GT150:   0 GT200:   0 GT250:   0
T: 6 ( 8548) P:99 I:1000 C:  24414 Min:      7 Act:    9 Avg:   11 Max:      24 GT25:       0 GT50:   0 GT75:   0 GT100:   0 GT150:   0 GT200:   0 GT250:   0
T: 7 ( 8549) P:99 I:1000 C:  24414 Min:      7 Act:   12 Avg:   10 Max:      25 GT25:       0 GT50:   0 GT75:   0 GT100:   0 GT150:   0 GT200:   0 GT250:   0
T: 8 ( 8550) P:99 I:1000 C:  24414 Min:      7 Act:   13 Avg:   12 Max:      29 GT25:       1 GT50:   0 GT75:   0 GT100:   0 GT150:   0 GT200:   0 GT250:   0

When the second VM is coming up, we see the max is around 250us. The expected should be no impact.
[root@test-vm1 ~]# sudo ./cyclictest_new -S -m -n -p99 -d0 -A ffff
# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 0.04 0.08 0.07 1/214 8578

T: 0 ( 8569) P:99 I:1000 C: 131085 Min:      6 Act:   13 Avg:   14 Max:     260 GT25:     334 GT50:  10 GT75:  11 GT100:  26 GT150:   8 GT200:   2 GT250:   1
T: 1 ( 8570) P:99 I:1000 C: 131085 Min:      6 Act:    7 Avg:    9 Max:     235 GT25:      47 GT50:  15 GT75:  11 GT100:   8 GT150:   2 GT200:   1 GT250:   0
T: 2 ( 8571) P:99 I:1000 C: 131085 Min:      7 Act:   14 Avg:   12 Max:     231 GT25:      67 GT50:  11 GT75:   8 GT100:  17 GT150:   9 GT200:   1 GT250:   0
T: 3 ( 8572) P:99 I:1000 C: 131085 Min:      7 Act:   12 Avg:   11 Max:     253 GT25:      98 GT50:   7 GT75:  16 GT100:  14 GT150:   3 GT200:   0 GT250:   1
T: 4 ( 8573) P:99 I:1000 C: 131085 Min:      7 Act:   10 Avg:   11 Max:     223 GT25:     104 GT50:  10 GT75:   7 GT100:  13 GT150:   0 GT200:   2 GT250:   0
T: 5 ( 8574) P:99 I:1000 C: 131085 Min:      7 Act:    7 Avg:    9 Max:     254 GT25:      37 GT50:  13 GT75:   8 GT100:  17 GT150:   4 GT200:   1 GT250:   1
T: 6 ( 8575) P:99 I:1000 C: 131085 Min:      7 Act:    9 Avg:   11 Max:     258 GT25:      75 GT50:  10 GT75:  12 GT100:  11 GT150:   9 GT200:   1 GT250:   1
T: 7 ( 8576) P:99 I:1000 C: 131085 Min:      7 Act:   13 Avg:   11 Max:     260 GT25:      33 GT50:  12 GT75:   9 GT100:  12 GT150:   2 GT200:   2 GT250:   1
T: 8 ( 8577) P:99 I:1000 C: 131085 Min:      7 Act:   10 Avg:    9 Max:     183 GT25:      46 GT50:   9 GT75:  11 GT100:   7 GT150:   4 GT200:   0 GT250:   0

Additional info:
So we suspect that when a VM comes up, KVM must be doing something which causes the other VM to stall even though the core assignments are exclusive. This is just cyclictest; when doing SR-IOV with DPDK on the NICs, packet drops are also observed. So clearly something is happening.

Comment 5 Luiz Capitulino 2019-02-20 13:30:55 UTC
Thanks for the detailed information. There are a number of things to note:

1. We recently found out a number of configuration issues in
   OpenStack for RT. Some of those need manual workarounds, others
   depend on fixes in latest OpenStack. Andrew, Jianzhu, can you
   help with that or give us pointers?

2. It is not true that tuned only changes grub; it does a lot
   more than that. As far as I remember, OpenStack got that
   issue fixed. If not, you still must use tuned in host and
   guest and apply your workaround on top of tuned. Andrew and/or
   Jianzhu can also help with this.

3. I don't think it is supported to run all guest threads in the
   same NUMA node if you don't have CAT. Please pin the vcpu0 and
   emulator threads to a different node from the one where you have
   the real-time vcpu threads and the DPDK threads.

Finally, when I said I couldn't reproduce this issue in comment 3 I was
using a more up to date kernel. I'll downgrade it to your version and
try to reproduce again, but my initial hypothesis is that this is a
configuration issue with OpenStack.

Comment 6 Yichen Wang 2019-02-20 22:23:16 UTC
Hi, Luiz. Thanks a lot for your quick follow up. Some more comments:

1. Jianzhu already helped me with the configuration before. I've tried everything he asked except tuned. If there is more, or if there is documentation on this, I am more than glad to take it.

2. I actually tried to look at what tuned does when Jianzhu shared the same with me. Unfortunately I didn't find much information about it. All I found is this: https://github.com/redhat-performance/tuned/blob/master/profiles/realtime/tuned.conf. Please confirm: is that all tuned is doing? If yes, I would just try those settings manually and see if things are better.

3. Can you elaborate more on "run all guest threads in the same numa node if you don't have CAT"? Why can't I put two VMs in the same NUMA node if they have available cores? Also, what is CAT? For now I've tried to pin vcpu0 and the emulator to a different core; is that what you mean by "node"?

We are not on the bleeding-edge latest, but we should not be too far behind. Let me know if you think the kernel version matters.

Comment 7 Eric Elena 2019-02-21 09:48:26 UTC
Hello Yichen,

(chiming in as I just started to get involved in this project)

2. tuned does other things than adding isolcpus to grub. Looking at the realtime-virtual-host profile, it is true that it modifies the grub cmdline, but it also changes the priority of some processes (Linux scheduler) and includes the profile realtime.
realtime modifies some sysctl and sysfs values, modifies the grub cmdline (again), tunes the IRQ affinity of isolated cores, and includes the profile network-latency.
network-latency disables transparent hugepages, modifies some sysctls (again), modifies the grub cmdline (again), and includes the profile latency-performance.
latency-performance modifies some sysctls and applies a specific CPU governor to disable power-saving mechanisms and get the best performance.

Even though grub cmdline is modified three times, I expect that the new grub is really installed only once.

3. CAT stands for Cache Allocation Technology. In a nutshell, it's a feature that controls access to the shared Last Level Cache (LLC, L3 cache) and can prioritize LLC access for higher-priority applications. Without this feature, an application (read: a VM) can monopolize the resources available in the LLC, as no constraint is applied to it. This is called a noisy neighbour. With this feature, you can reserve a part of the LLC for a high-priority application. A noisy neighbour will no longer be able to exhaust the resources available in the LLC, and the high-priority application can use the reserved part of the LLC at any time.

Without CAT, running 2 VMs on the same NUMA node may lead to a situation where one of the VMs is a noisy neighbour and the performance of the other VM is (slightly?) impacted, which is far from optimal for a realtime workload. With CAT, you can partition the LLC, and one VM won't be impacted if the other VM wants to use all of the LLC. pqos is the application used to dynamically configure CAT (a reboot is not required).

[0] will give you more information about CAT; [1] is the result of using CAT in a telco NFV environment.

At least in Red Hat's terminology a NUMA node is a physical CPU that you put in a socket.

I hope this helps.

[0] https://software.intel.com/en-us/articles/introduction-to-cache-allocation-technology

[1] https://software.intel.com/en-us/articles/cache-allocation-technology-telco-nfv-noisy-neighbor-experiments

Comment 8 Luiz Capitulino 2019-02-21 13:26:18 UTC
(In reply to Yichen Wang from comment #6)
> Hi, Luiz. Thanks a lot for your quick follow up. Some more comments:
> 
> 1. Jianzhu already helped me on the configuration before. I've tried
> everything he asked except tuned. If there are more, or we have a
> documentation on this, I am more than glad to take it.

Yes, Jianzhu wrote a document with all necessary tweaks. Jianzhu, can
you jump in?
 
> 2. I actually tried to look what tuned does when Jianzhu shared the same to
> me. Unfortunately I don't find much information about it. All I found is
> this guy:
> https://github.com/redhat-performance/tuned/blob/master/profiles/realtime/
> tuned.conf, please confirm is that all tuned is doing? If yes I would just
> try those manually and see if things are better.

As Eric mentioned in the previous comment, there's a tuned real-time profile
for KVM-RT in the host and the guest. Again, tuned is required. Not using tuned
makes your setup unsupported.
 
> 3. Can you elaborate more about "run all guest threads in the same numa node
> if you don't have CAT"? Why can't I put two VMs in the same NUMA node if
> they are have available cores? Also what is CAT? For now I've tried to pin
> vcpu0 and emulator in different core, is that what you mean by "node"?

I think Eric explained CAT in the previous comment. However, I talked with
Andrew Theurer from the performance team (who built a reference implementation
of KVM-RT with Jianzhu) and he told me that for OSP, what they did was to
separate emulator threads and vcpu threads in different nodes. For example,
on a two numa node system, you pin all vcpu threads to node0 and all emulator
threads to node1.

But I'm not familiar with OpenStack, so we need Jianzhu or Andrew to jump
in and help review your configuration.
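At the libvirt level, the node split described above can be checked and adjusted with something like the following (a sketch; "rt-vm1" is a placeholder domain name, and the CPU list assumes node1 covers host CPUs 20-39):

lscpu | grep NUMA                        # confirm which host CPUs belong to each node
virsh vcpupin rt-vm1                     # show the current vCPU pinning
virsh emulatorpin rt-vm1 20-39 --live    # example: move the emulator threads to node1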

Comment 9 Luiz Capitulino 2019-02-21 13:39:29 UTC
Following up myself on comment 5: I've updated my kernel and qemu to match yours
and I also created a larger guest (6 vcpus). I ran a few different variations of
your test-case (i.e. starting the guest & powering it off vs. rebooting). In all tests
I get good latencies for 1-hour runs (max is about 30us). The only detail
is that I had to disable Spectre & Meltdown mitigations in host and guest, and L1D
flushing in the host; otherwise I do get a relevant spike (not as huge as yours, but
still relevant, close to 50us).

I think the next step for this issue is for someone with OSP RT background to
review your configuration and/or to try to reproduce this issue in our
reference implementation in our lab (this would be Jianzhu).

Meanwhile, if you want to try something you could:

1. Use tuned in host and guest as required

2. Disable spectre, meltdown and L1D flushing:

IMPORTANT: This is disabling mitigations for security issues. This will cause
           the system to become vulnerable. The implications of this have to be
           discussed with the customer.

In host and guest do:

# cd /sys/kernel/debug/x86
# echo 0 > ibrs_enabled
# echo 0 > ibpb_enabled
# echo 0 > pti_enabled
# echo 0 > retp_enabled

In host only do:

# echo never > /sys/module/kvm_intel/parameters/vmentry_l1d_flush
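To verify the settings took effect, the same files can simply be read back:

# cd /sys/kernel/debug/x86
# grep . ibrs_enabled ibpb_enabled pti_enabled retp_enabled
# cat /sys/module/kvm_intel/parameters/vmentry_l1d_flush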

Comment 12 Andrew Theurer 2019-02-21 17:35:33 UTC
(In reply to Yichen Wang from comment #4)
> Answers to all questions below:
> 
> Steps to Reproduce:
> 1. Spawn a CentOS VM with hw:cpu_policy=dedicated, hw:cpu_realtime=yes,
> hw:cpu_realtime_mask=^0, back-ported the fix from Rocky to Queens so
> hw:emulator_threads_policy=share will pin emulator threads to a dedicate
> pool;
> 2. Update CentOS VM to latest RT kernel, and run cyclictest in that VM
> 3. Spawn another VM with the same socket
> 4. Observe a spike in the first VM spawned;
> 
> All versions (qemu, kernel, tuned):
> Kernel: 3.10.0-957.1.3.rt56.913.el7.x86_64
> QEMU: qemu-kvm-rhev-2.12.0-18.el7_6.1.x86_64
> tuned: N/A
> 
> Is host and guest running the same version of everything?
> Host is RHEL, Guest is CentOS 3.10.0-957.5.1.rt56.916.el7.x86_64
> 
> You mentioned by email you're not using tuned. I'm not sure this is
> supported. In any case, have you tried with tuned?
> No. We will not use tuned. tuned will do nothing but GRUB.

It does a lot more than GRUB.

> However it is
> changing GRUB with "isolcpus" which will cause problems in QEMU emulator
> thread and results in a very slow IO.

This is not a problem when emulator threads are pinned outside the isolcpus list, which should be accomplished with the emulatorpin feature in Nova.

> We use "tuna" instead, and the same
> recommendation is given by RedHat some time ago, and I am not sure if it
> fixed already.

This is not good enough.  You must include isolcpus by way of using tuned virtual-host.  The tuned virtual-host profile includes tunings other than options on GRUB.  There is no support for a RT_KVM host without tuned-virtual-host.  Support for a guest is maintained while using tuned realtime-virtual-guest.
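For reference, a minimal sketch of applying these profiles (the package names match those listed later in this bug; the isolated-core lists are placeholders to adjust to the actual layout, and a reboot is needed afterwards for the kernel cmdline changes):

On the host:
  yum install -y tuned-profiles-nfv-host
  echo "isolated_cores=2-19" >> /etc/tuned/realtime-virtual-host-variables.conf
  tuned-adm profile realtime-virtual-host
  tuned-adm active

Inside the guest:
  yum install -y tuned-profiles-nfv-guest
  echo "isolated_cores=1-8" >> /etc/tuned/realtime-virtual-guest-variables.conf
  tuned-adm profile realtime-virtual-guest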

Comment 13 Bandan Das 2019-02-21 17:55:22 UTC
(In reply to Eric Elena from comment #7)
> Hello Yichen,
> 
> (chiming in as I just started to get involved in this project)
> 
> 2. tuned does other things than adding isolcpus to grub. Looking at the
> realtime-virtual-host profile, it is true that is modifies grub cmdline but
> it also changes the priority of some processes (linux scheduler) and
> includes the profile realtime.
> realtime modifies some sysctl and sysfs values, modifies grub cmdline
> (again), tune the IRQ affinity of isolated cores and includes the profile
> network-latency.
> network-latency disables transparent hugepages, modifies some sysctl
> (again), modifies grub cmdline (again) and includes the profile
> latency-performance.
> latency-performance modifies some sysctl and applies a specific governor to
> the CPU to disable power saving mechanisms and have the best performances.
> 

Even without using tuned, if we can get the setup to match what tuned does, we will
have a known starting point. So, even though we might eventually diverge and do 
things differently later on, debugging would be easier.

Comment 15 Pei Zhang 2019-02-22 15:22:02 UTC
I would like to update more tuning info in our kvm-rt testing:

1. Using tuned-profiles-realtime/tuned-profiles-nfv-host/tuned packages to isolate host cores.

2. Guest XML of cputune

(1) pin each vCPU thread to a different isolated host core.

(2) pin qemu emulator thread (I/O thread) to non-isolated cores. 

(3) For a server with 2 NUMA nodes, we should pin all RT vCPUs to host cores from one NUMA node, and pin all non-RT vCPUs to host cores from the other NUMA node.

3. Using tuned-profiles-realtime/tuned-profiles-nfv-guest/tuned packages to isolate guest cores.

For example:

- Server has 2 NUMA nodes:
# lscpu | grep NUMA
NUMA node(s):          2
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19

- Isolated cores 2,4,6,8,10,12,14,16,18,19,17,15,13:
# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-1006.rt56.963.el7.x86_64 root=/dev/mapper/rhel_dell--per430--09-root ro crashkernel=auto rd.lvm.lv=rhel_dell-per430-09/root rd.lvm.lv=rhel_dell-per430-09/swap console=ttyS0,115200n81 default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=2,4,6,8,10,12,14,16,18,19,17,15,13 intel_pstate=disable nosoftlockup nohz=on nohz_full=2,4,6,8,10,12,14,16,18,19,17,15,13 rcu_nocbs=2,4,6,8,10,12,14,16,18,19,17,15,13

- Pin RT vCPUs 1~8 to host cores 2,4,6,8,10,12,14,16, which are all from NUMA node 0.
- Pin non-RT vCPUs 0,9 to host cores 19,17 from NUMA node 1.
- The QEMU emulator threads use non-isolated host cores 3,5,7:
<vcpu placement="static">10</vcpu>
<cputune>
    <vcpupin cpuset="19" vcpu="0"/>
    <vcpupin cpuset="2" vcpu="1"/>
    <vcpupin cpuset="4" vcpu="2"/>
    <vcpupin cpuset="6" vcpu="3"/>
    <vcpupin cpuset="8" vcpu="4"/>
    <vcpupin cpuset="10" vcpu="5"/>
    <vcpupin cpuset="12" vcpu="6"/>
    <vcpupin cpuset="14" vcpu="7"/>
    <vcpupin cpuset="16" vcpu="8"/>
    <vcpupin cpuset="17" vcpu="9"/>
    <emulatorpin cpuset="3,5,7"/>
    <vcpusched priority="1" scheduler="fifo" vcpus="0-9"/>
</cputune>
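A quick way to double-check the resulting placement from the host (the domain name and QEMU PID below are placeholders):

virsh vcpupin rhel-rt-guest                # vCPU -> host CPU pinning
virsh emulatorpin rhel-rt-guest            # emulator thread pinning
ps -T -o pid,tid,psr,comm -p <qemu-pid>    # which host CPU each QEMU thread last ran on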

Comment 16 Yichen Wang 2019-02-22 23:28:58 UTC
Hi Luiz/Eric/Andrew/Bandan/Pei,

Thanks a lot for your prompt help. I followed the suggestions and tried 3 more options. As Ian stated before, we don't really have control over the VMs, so tunings are only being done at the host level. Also, the fact that spawning the 2nd VM affects the 1st VM has to be something at the host level, as the VM is not even aware of such events. However, the configuration in (3) does improve the latency in general, and we will make that recommendation to our customer.

Reproduction of the issue is not really consistent, i.e. the spike ranges from 50us to 200us, with an average of 100us based on my 10 tries. The impression is that this does get better with those tunings, but I believe up to 200us is still not "perfect". I even tried to isolate which configuration contributes the most ((1)/(2)/(3)). Since the deviation varies too much, it is really hard to say which one, or all of them, are winners. So here are the details of the work that I did:

(1) tuned:
# rpm -qa | grep tuned
tuned-profiles-realtime-2.10.0-6.el7.noarch
tuned-2.10.0-6.el7.noarch
tuned-profiles-nfv-guest-2.10.0-6.el7.noarch
tuned-profiles-nfv-host-2.10.0-6.el7.noarch
tuned-profiles-nfv-2.10.0-6.el7.noarch

nfv-host profiles are applied on host level.

(2) Additional tuned:
These are coming from the link I shared above. For some reason I don't see them being applied automatically; can you share whether the settings below are really needed, or whether they are indeed missing from the above profile RPMs?
sysctl -w kernel.hung_task_timeout_secs=600
sysctl -w kernel.nmi_watchdog=0
sysctl -w kernel.sched_rt_runtime_us=-1
sysctl -w vm.stat_interval=10
sysctl -w kernel.timer_migration=0
# echo ${not_isolated_cpumask} > /sys/bus/workqueue/devices/writeback/cpumask
# echo ${not_isolated_cpumask} > /sys/devices/virtual/workqueue/cpumask
# /sys/devices/system/machinecheck/machinecheck*/ignore_ce = 1
cd /sys/devices/system/machinecheck/
for f in `ls -r */ignore_ce`; do echo 1 > $f; done

(3) Additional configuration suggested by Luiz:
cd /sys/kernel/debug/x86
echo 0 > ibrs_enabled
echo 0 > ibpb_enabled # This actually failed
echo 0 > pti_enabled
echo 0 > retp_enabled
echo never > /sys/module/kvm_intel/parameters/vmentry_l1d_flush

(4) More points replied inline:
Disable spectre, meltdown and L1D flushing
==> Need to work with VNF vendor to see if this is a concern. Will do this if customer is OK with it.

CAT:
==> In addition to the above comments, CAT might be needed. We are using Intel Skylake; can you give us more information on how to tune with this technology?

Emulator/Worker in different NUMA:
==> We can't really do that in a production setup. The reason is that we will consume all available cores for real workloads on the hypervisor, and all the workloads besides vcpu0 are RT threads. With your suggestion there is a big waste of computing power, as you are suggesting leaving one NUMA node just for QEMU emulator threads. Otherwise we will end up with all VMs having their emulator threads in the far-end NUMA node, and I don't see how that would help. We would be glad if you could elaborate on the reasoning behind it.

Recomended Configs from above:
==> It is hard for us to tell which of (1)(2)(3), or all of them, contribute to better performance, so we would appreciate your recommendation on which of the above configurations make the most sense.

Tools for measuring:
==> As mentioned above, the issue is not 100% consistent based on the data. So I would like to ask: what kinds of tools or methods should we use to take a deeper dive, instead of just blindly tuning different things without knowing what the real issue is?

Comment 17 Ian Wells 2019-02-23 00:26:42 UTC
Just to clarify, if you can point me at the relevant documentation:

"There is no support for a RT_KVM host without tuned-virtual-host."

Where is this written down?

Comment 18 Eric Elena 2019-02-25 10:08:05 UTC
Hello Yichen,

I don't have a CAT-enabled CPU at hand to test the CAT configuration. If you want to give it a try anyway:
- install intel-cmt-cat (included in the RHEL repository)

- check the number of cache ways: sudo pqos -v

- define the mask for your workload. The mask will be applied against the LLC and the resulting memory area will be assigned to a class of service (COS). Masks can overlap. The bits composing the mask must be contiguous. There are N bits, N being the number of cache ways.

For example, let's say there are 12 cache ways, 2 sockets, and 20 cores per socket, and you want to allocate COS1 with 6 cache ways (50%) on both sockets to cores 4-9 and 24-29, COS2 with 4 cache ways (~33%) on socket 0 to cores 10-15, COS2 with 2 cache ways (~16%) on socket 1 to cores 30-33, and COS3 to the remaining cores:
|      | Mask socket 0   | Mask socket 0 | Core       | Mask socket 1   | Mask socket 1 | Core         |
|      | 11            0 | hexa          |            | 11            0 | hexa          |              |
+------+-----------------+---------------+------------+-----------------+---------------+--------------+
| COS0 |  1111 1111 1111 | 0xfff         | all        |  1111 1111 1111 | 0xfff         | all          |
| COS1 |  1111 1100 0000 | 0xfc0         | 4-9        |  1111 1100 0000 | 0xfc0         | 24-29        |
| COS2 |  0000 0011 1100 | 0x3c          | 10-15      |  0000 0011 0000 | 0x30          | 30-33        |
| COS3 |  0000 0000 0011 | 0x3           | 0-3, 16-19 |  0000 0000 1111 | 0xf           | 20-23, 34-39 |
| COS4 |  1111 1111 1111 | 0xfff         | all        |  1111 1111 1111 | 0xfff         | all          |
| ...  |  ...            | ...           |            |  ...            | ...           | ...          |

COS0, COS4 and following are defined by default (catch-all) and are an example of overlapping allocation.

- create the COS:
  sudo pqos -e "llc:1=0xfc0"
  sudo pqos -e "llc@0:2=0x3c"
  sudo pqos -e "llc@1:2=0x30"
  sudo pqos -e "llc@0:3=0x3"
  sudo pqos -e "llc@1:3=0xf"

This can be combined in a one-liner but it's harder to read it:
  sudo pqos -e "llc:1=0xfc0;llc@0:2=0x3c;llc@1:2=0x30;llc@0:3=0x3;llc@1:3=0xf"

- map the COS and the cores:
  sudo pqos -a "llc:1=4-9"
  sudo pqos -a "llc:1=24-29"
  sudo pqos -a "llc:2=10-15"
  sudo pqos -a "llc:2=30-33"
  sudo pqos -a "llc:3=0-3,16-19"
  sudo pqos -a "llc:320-23,34-39"

This can also be combined with a one-liner:
  sudo pqos -a "llc:1=4-9;llc:1=24-29;llc:2=10-15;llc:2=30-33;llc:3=0-3,16-19;llc:320-23,34-39"

- check the configuration: sudo pqos -s

- check the results: sudo pqos -T

Remarks:
o The numbers are given as examples only, you should check with your own configuration for the number of cache ways, the strategy to partition the LLC, where the cores are located, etc. as I cannot test it myself
o The COS should be mapped per socket (there is nothing in the doc to select the socket when doing the mapping), in case of doubt it's better to define different COS for each socket
o It may be possible to map a COS per PID by using the kernel interface (MSR by default). Add -I to each command:
  sudo pqos -I -s # show the configuration
  sudo pqos -I -T # show the results
  sudo pqos -I -V -s # show the PID association
  sudo pqos -I -e "llc:1=0xfc0" # create a COS
  sudo pqos -I -a "core:1=4-9" # map a COS to cores
  sudo pqos -I -a "pid:2=36489" # map a COS to a PID
  sudo pqos -I -R # reset the configuration

However as you should use CPU pinning, doing the mapping with the cores should be enough.

Please be aware of the following issue [0] if you plan to use secure boot.

There are other tools to manage or monitor the memory (Intel Resource Director Technology) but at that point it would be better to ask Intel directly.

I guess the recommendation to deploy 1 VM per NUMA node is to avoid this kind of tuning.

[0] https://github.com/intel/intel-cmt-cat/wiki/UEFI-Secure-Boot-Compatibility

Comment 19 Andrew Theurer 2019-02-25 11:22:20 UTC
(In reply to Yichen Wang from comment #16)
> Hi, Luiz/Eric/Andrew/Branden/Pei,
> 
> Thanks a lot for your prompt helps on us. I am following the suggestions and
> tried 3 more options. As Ian stated before, we don't really have control on
> VMs, so tunings are only being done in host level. Also this spawning 2nd VM
> affects the 1st VM has to be something to do with host level, as VM is not
> even aware such events. However, configurations in (3) does improve the
> latency in general, and we will make that recommendation to our customer.
> 
> The reproduce of the issue is not really consistent, i.e. the spike ranges
> from 50us to 200us, in a average of 100us based on my 10 tries. The
> impression is this does get better with those tunings, but I believe up to
> 200us is still not a "perfect". I even tried to isolate which configuration
> contributes the most ((1)/(2)/(3)). Since the deviation varys too much, it
> is really hard to say which one or all of them are winners. So here are the
> details for the work that I did:
> 
> (1) tuned:
> # rpm -qa | grep tuned
> tuned-profiles-realtime-2.10.0-6.el7.noarch
> tuned-2.10.0-6.el7.noarch
> tuned-profiles-nfv-guest-2.10.0-6.el7.noarch
> tuned-profiles-nfv-host-2.10.0-6.el7.noarch
> tuned-profiles-nfv-2.10.0-6.el7.noarch
> 
> nfv-host profiles are applied on host level.

Do you have realtime-virtual-host?  This is the one we need.
 
> (2) Additional tuned:
> These are coming from the link I shared above. I don't see these are
> automatically applied for some reason, maybe you guys can share if below are
> really needed or it is indeed missing in the above profile RPM?
> sysctl -w kernel.hung_task_timeout_secs=600
> sysctl -w kernel.nmi_watchdog=0
> sysctl -w kernel.sched_rt_runtime_us=-1
> sysctl -w vm.stat_interval=10
> sysctl -w kernel.timer_migration=0
> # echo ${not_isolated_cpumask} > /sys/bus/workqueue/devices/writeback/cpumask
> # echo ${not_isolated_cpumask} > /sys/devices/virtual/workqueue/cpumask
> # /sys/devices/system/machinecheck/machinecheck*/ignore_ce = 1
> cd /sys/devices/system/machinecheck/
> for f in `ls -r */ignore_ce`; do echo 1 > $f; done

Luiz can better answer this, but I believe the realtime-virtual-host will do these for you.

> (3) Additional configuration suggested by Luiz:
> cd /sys/kernel/debug/x86
> echo 0 > ibrs_enabled
> echo 0 > ibpb_enabled # This actually failed
> echo 0 > pti_enabled
> echo 0 > retp_enabled
> echo never > /sys/module/kvm_intel/parameters/vmentry_l1d_flush
> 
> (4) More points replied inline:
> Disable spectre, meltdown and L1D flushing
> ==> Need to work with VNF vendor to see if this is a concern. Will do this
> if customer is OK with it.
> 
> CAT:
> ==> Additonal to above comments, CAT might be needed. We are using Intel
> Skylake, can you give us more information on how to tune with utilizing the
> technology?
> 
> Emulator/Worker in different NUMA:
> ==> We can't really do that in production setup. The reason is, We are going
> to for sure consume all available cores for doing real workload in the
> hypervisor, and all the workloads besides vcpu0 are rt threads. With your
> suggestions, there is a big waste of computing powers, as you are suggesting
> leaving one side of NUMA just for QEMU emulator threads. Otherwise we will
> end up with all VMs have their emulator threads in the far end NUMA, and I
> don't see how that would help. We are glad if you guys can elaborate on the
> reason behind it?

This boils down to how much effect sharing the cache has.  In the context of this test, I would hope it would be minimal, but I don't think we know for sure.  Would it be possible to configure this way just for this test, to see if it makes a difference?

> Recomended Configs from above:
> ==> It is hard for us to tell among (1)(2)(3) which or all contribute to
> better performance, so would appreciate your recommendations based on the
> fact that above configurations (1)(2)(3) do, and suggest us what makes the
> most sense?

I believe getting 1 and 2 covered with a single tuned profile is the most important right now.  Spectre/Meltdown may produce spikes, but IMO unlikely that those would be 200us.  If we are seeing up to 200us, my suspicion is that something is interrupting the vcpu thread by either sending a signal or scheduling another task on the same host cpu.

> Tools for measuring:
> ==> As mentioned above, the issue is not 100% consistent by reading the
> data. So I would want to ask you guys, what kind of tools or methods should
> we use to taking deeper dive, instead of just blindly tuning different
> things without knowing the what is the real issue?

First, cyclictest should run for at least an hour, and 12 hours would be much better, especially for the baseline measurement when there is only 1 VM.  That way we have a better idea whether these spikes happen eventually even with no other VM starting.
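For example, the cyclictest invocation from comment 4 can simply be given a fixed duration and run quietly (-D and -q are standard cyclictest options):

./cyclictest_new -S -m -n -p99 -d0 -A ffff -D 12h -q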

Tracing certain functions (ftrace, trace-cmd) is typically what is done to identify specific problems, but you can also start by looking for higher-level things that are likely to cause latency spikes.  I would start by monitoring /proc/interrupts (capture the file multiple times during the test) and seeing what interrupts are being delivered to the affected CPUs during the test.  If we don't see anything in /proc/interrupts, we can capture /proc/sched_debug (multiple times during the test) and see if there are threads other than the vcpu threads running on the host cpus (the ones that run the vcpu threads involved in the cyclictest).
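A minimal capture loop along these lines (file names, interval, and the trace-cmd CPU mask are only examples):

for i in $(seq 600); do
    date >> /tmp/interrupts.log;  cat /proc/interrupts  >> /tmp/interrupts.log
    date >> /tmp/sched_debug.log; cat /proc/sched_debug >> /tmp/sched_debug.log
    sleep 1
done

trace-cmd record -e sched:sched_switch -M 7fc sleep 600   # 0x7fc = host CPUs 2-10
trace-cmd report | less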

Comment 20 Luiz Capitulino 2019-02-25 13:40:09 UTC
(In reply to Andrew Theurer from comment #19)

> > (2) Additional tuned:
> > These are coming from the link I shared above. I don't see these are
> > automatically applied for some reason, maybe you guys can share if below are
> > really needed or it is indeed missing in the above profile RPM?
> > sysctl -w kernel.hung_task_timeout_secs=600
> > sysctl -w kernel.nmi_watchdog=0
> > sysctl -w kernel.sched_rt_runtime_us=-1
> > sysctl -w vm.stat_interval=10
> > sysctl -w kernel.timer_migration=0
> > # echo ${not_isolated_cpumask} > /sys/bus/workqueue/devices/writeback/cpumask
> > # echo ${not_isolated_cpumask} > /sys/devices/virtual/workqueue/cpumask
> > # /sys/devices/system/machinecheck/machinecheck*/ignore_ce = 1
> > cd /sys/devices/system/machinecheck/
> > for f in `ls -r */ignore_ce`; do echo 1 > $f; done
> 
> Luiz can better answer this, but I believe the realtime-virtual-host will do
> these for you.

That's correct.
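A quick way to spot-check these on a running host (a sketch; the exact values depend on the profile version and the isolated-core configuration):

sysctl kernel.sched_rt_runtime_us kernel.timer_migration vm.stat_interval kernel.nmi_watchdog
cat /sys/devices/virtual/workqueue/cpumask
cat /sys/devices/system/machinecheck/machinecheck0/ignore_ce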

Comment 21 jianzzha 2019-02-26 18:55:11 UTC
(In reply to Yichen Wang from comment #6)
Yichen, apologies if you have already sent it; can you paste the lscpu output from your compute node?

Comment 22 Yichen Wang 2019-02-27 05:30:52 UTC
@jianzzha, this is lscpu:
[root@quincy-compute-2 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    1
Core(s) per socket:    20
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Stepping:              4
CPU MHz:               1601.000
CPU max MHz:           1601.0000
CPU min MHz:           1000.0000
BogoMIPS:              3200.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              28160K
NUMA node0 CPU(s):     0-19
NUMA node1 CPU(s):     20-39
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke spec_ctrl intel_stibp flush_l1d

We tried to use CAT, and it does help us a lot with this noisy-neighbor issue. However, during the debugging we saw more issues with the ktimersoftd threads while running a workload in the VM at 100% CPU usage. On the host, we constantly see ktimersoftd being woken up every 1 sec, which is stealing CPU cycles:
Samples do not have callchains.
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
  427156.801036 [0004]  CPU 2/KVM[5438/5414]                0.007      0.000      0.078
  427156.801043 [0004]  <idle>                              0.078      0.000      0.007
  427166.801662 [0004]  CPU 2/KVM[5438/5414]                0.007      0.000  10000.619
  427166.801665 [0004]  ktimersoftd/4[52]               10000.709   9999.745      0.002
  427166.801670 [0004]  <idle>                          10000.622      0.000      0.005
  427166.801746 [0004]  CPU 2/KVM[5438/5414]                0.007      0.000      0.076
  427166.801754 [0004]  <idle>                              0.076      0.000      0.007
  427176.801518 [0004]  CPU 2/KVM[5438/5414]                0.007      0.000   9999.764
  427176.801522 [0004]  ktimersoftd/4[52]                9999.852   9998.881      0.003
  427176.801525 [0004]  <idle>                           9999.768      0.000      0.003
  427176.801604 [0004]  CPU 2/KVM[5438/5414]                0.007      0.000      0.078
  427176.801611 [0004]  <idle>                              0.078      0.000      0.007
  427186.802079 [0004]  CPU 2/KVM[5438/5414]                0.007      0.000  10000.467
  427186.802084 [0004]  ktimersoftd/4[52]               10000.557   9999.722      0.004
  427186.802089 [0004]  <idle>                          10000.472      0.000      0.005

After some research, there is really limited documentation on ktimersoftd, but a lot of threads report the same thing that I saw, even on Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1550584. It is not clear to me what Red Hat will do, but I would vote +1 on that for sure. Can anyone here give updates on that one?
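For reference, a per-CPU timeline like the one above can be captured with perf sched, if the installed perf supports timehist (the CPU number and duration are examples):

perf sched record -C 4 sleep 90    # record scheduling events on host CPU 4 only
perf sched timehist                # wait time / sched delay / run time per event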

Comment 23 jianzzha 2019-02-27 13:20:18 UTC
From the CPU count I assume you are using host CPU 1 as the shared emulator CPU for both VMs. What if you use hw:emulator_threads_policy=isolate rather than share and see if the 2nd VM still has an impact on the first? You will not have enough CPUs if you still use 9 CPUs for each VM, but for test purposes you can use fewer CPUs for the 2nd VM.

Comment 25 Yichen Wang 2019-02-27 19:17:27 UTC
@jianzzha
I did try that; it made no difference until I played with CAT.

Everything seems good now, except the timer interruption issue in Bugzilla 1550584 mentioned above. Please help us see what we can do about that...

Comment 26 Luiz Capitulino 2019-02-27 20:28:09 UTC
Yichen,

Are you still seeing the 250us spike when a 2nd guest is started? I'm
asking because bug 1550584 is a relatively small spike that's always
there, it is not caused by starting other guests.

PS: Restoring NEEDINFO for asimha

Comment 29 Yichen Wang 2019-03-01 01:07:39 UTC
No. I don't see spike goes up to 250us with CAT. Further debugging leads me to 1550584.

Comment 32 Ian Wells 2019-03-01 17:38:35 UTC
To expand on that a bit:

What is the 1 second interrupt for?
How would we find that out?
How do we disable it?

We can certainly out-compete ktimersoftd on priority, but having an unnecessary timer interrupt firing still introduces variability, and it's also unclear whether there's actually a scheduled task that should be running and wouldn't run if ktimersoftd were prevented from running.
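If you want to experiment with out-competing it, the priorities can be inspected and adjusted per thread (a sketch only; the TID is a placeholder, and raising RT priorities has its own risks):

tuna --threads="ktimersoftd*" --show_threads    # inspect ktimersoftd scheduling priorities
chrt -f -p 2 <vcpu-thread-tid>                  # example: move one vCPU thread above them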

Comment 33 Luiz Capitulino 2019-03-01 20:55:52 UTC
(In reply to Yichen Wang from comment #29)
> No. I don't see spike goes up to 250us with CAT. Further debugging leads me
> to 1550584.

OK, so I'm setting bug 1550584 as a dependency for this BZ.

Comment 35 Luiz Capitulino 2019-03-01 21:07:29 UTC
(In reply to Ian Wells from comment #32)
> To expand on that a bit:
> 
> What is the 1 second interrupt for?
> How would we find that out?
> How do we disable it?
> 
> We can certainly out-compete ktimersoftd on priority, but having an
> unnecessary timer interrupt firing is still introducing variability and it's
> also unclear if there's actually a scheduled task that should be running and
> wouldn't if ktimersoftd was prevented from running.

Ian,

The 1 second tick is used by the Linux scheduler to keep running tasks
statistics up to date. It is not possible to disable it in the RHEL7
kernel.

Please note that the 1 second tick and bug 1550584 are not the same. Bug 1550584
is about a spurious wakeup of the ktimersoftd thread that should be possible to
avoid, whereas the 1 second tick is a hardcoded design decision in the kernel
that's necessary for the kernel to work properly.

However, even with the 1 second tick, we've been able to keep latencies extremely
low and did not experience important drops with DPDK as long as our configuration
recommendations are fully implemented.

Comment 37 Yichen Wang 2019-03-01 21:46:17 UTC
Just want to add more information here: ktimersoftd is woken up and causes context switches. We are trying to avoid everything we possibly can in order to get purely isolated cores, as shown below:
 -----------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
 -----------------------------------------------------------------------------------------------------------------
  ktimersoftd/3:42      |      0.000 ms |       89 | avg:    0.002 ms | max:    0.004 ms | max at:    539.583098 s
  ktimersoftd/8:92      |      0.000 ms |       89 | avg:    0.002 ms | max:    0.003 ms | max at:    539.586096 s
  ktimersoftd/4:52      |      0.000 ms |       89 | avg:    0.002 ms | max:    0.003 ms | max at:    539.586096 s
  ktimersoftd/9:102     |      0.000 ms |       89 | avg:    0.002 ms | max:    0.003 ms | max at:    539.586096 s
  ktimersoftd/7:82      |      0.000 ms |       89 | avg:    0.002 ms | max:    0.003 ms | max at:    539.586097 s
  ktimersoftd/5:62      |      0.000 ms |       89 | avg:    0.002 ms | max:    0.004 ms | max at:    539.586097 s
  ktimersoftd/10:112    |      0.000 ms |       89 | avg:    0.002 ms | max:    0.003 ms | max at:    539.586096 s
  ktimersoftd/6:72      |      0.000 ms |       89 | avg:    0.002 ms | max:    0.003 ms | max at:    539.586096 s
  CPU 2/KVM:6171        |      0.000 ms |       18 | avg:    0.001 ms | max:    0.002 ms | max at:    544.585825 s
  CPU 3/KVM:6172        |      0.000 ms |       18 | avg:    0.001 ms | max:    0.002 ms | max at:    554.585816 s
  CPU 5/KVM:6174        |      0.000 ms |       18 | avg:    0.001 ms | max:    0.002 ms | max at:    544.585842 s
  CPU 4/KVM:6173        |      0.000 ms |       18 | avg:    0.001 ms | max:    0.002 ms | max at:    584.585771 s
  CPU 8/KVM:6177        |      0.000 ms |       18 | avg:    0.001 ms | max:    0.002 ms | max at:    544.585860 s
  CPU 6/KVM:6175        |      0.000 ms |       18 | avg:    0.001 ms | max:    0.002 ms | max at:    544.585854 s
  CPU 7/KVM:6176        |      0.000 ms |       18 | avg:    0.001 ms | max:    0.002 ms | max at:    544.585848 s
 -----------------------------------------------------------------------------------------------------------------
  TOTAL:                |      0.000 ms |      838 |
 ---------------------------------------------------

You can see that over a roughly 90-second sample period, we see 89 context switches for running ktimersoftd, and 18 context switches for running the worker threads. Hence I really want to stop ktimersoftd from waking up, and see if we can get better RT stability for our VM.
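For reference, a summary table like the one above is what "perf sched latency" produces; one way to capture it for the isolated CPUs (the CPU list and duration are examples):

perf sched record -C 2-10 sleep 90
perf sched latency --sort max    # per-task switch counts and worst-case scheduling delay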

Comment 38 Luiz Capitulino 2019-03-04 17:01:10 UTC
(In reply to Yichen Wang from comment #37)

> You can see for about 90 seconds sample period, we are seeing 89 context
> switches for running ktimersoftd, and 18 context switches for running worker
> thread. Hence I really want to stop ktimersoftd from waking up, and see if
> we can get better RT stability for our VM.

Yichen,

Since you mentioned on comment 29 that with CAT you don't see the 250us
spike when starting a second VM anymore, are you still getting a spike?
If yes, what is the spike and what is your latency requirement?

We've been working with a threshold of 50us for maximum latency, and we get
less than that with cyclictest for 24-hour runs, even with the 1 sec tick
and the spurious ktimersoftd wake-ups (bug 1550584).

Also, we're unable to reproduce the 250us spike when starting a second
VM when using a RHEL-RT system, configured with the tuned profiles and OSP13.

Comment 39 Yichen Wang 2019-03-04 19:21:49 UTC
(In reply to Luiz Capitulino from comment #38)
> (In reply to Yichen Wang from comment #37)
> 
> > You can see for about 90 seconds sample period, we are seeing 89 context
> > switches for running ktimersoftd, and 18 context switches for running worker
> > thread. Hence I really want to stop ktimersoftd from waking up, and see if
> > we can get better RT stability for our VM.
> 
> Yichen,
> 
> Since you mentioned on comment 29 that with CAT you don't see the 250us
> spike when starting a second VM anymore, are you still getting a spike?
> If yes, what is the spike and what is your latency requirement?

We have a slightly different working profile for our VNFs, but anyway. With all cores dedicated and exclusive to the worker and emulator threads, and all tunings in place, my spikes range from 0us to 50us, which is good enough. Our latency requirement is spikes of less than 50us, so we should be good. We are also looking at memory bandwidth allocation as well, but we will see.

> 
> We've been working with a treshold of 50us for maximum latency, and we get
> less than that with cyclictest for 24 hours runs even with the 1 sec tick
> and the the spurious ktimersoftd wake ups (bug 1550584).

Cyclictest is good for testing the "scheduling latency", so even with the 1 sec tick, yes, cyclictest will still do well. However, when you run test cases which simulate what DPDK does in pure userspace, you will see that in every 1-second window you lose some CPU cycles to ktimersoftd and context switches. This is something cyclictest won't show, but it will affect the jitter in a real DPDK-based application. So we have to fix it.

> 
> Also, we're unable to reproduce the 250us spike when starting a second
> VM when using a RHEL-RT system, configured with the tuned profiles and OSP13.

Yes, I confirm that, with all the tuning in place, I don't see spike goes up to 250us.

Comment 40 Luiz Capitulino 2019-03-04 21:13:54 UTC
(In reply to Yichen Wang from comment #39)

> > We've been working with a treshold of 50us for maximum latency, and we get
> > less than that with cyclictest for 24 hours runs even with the 1 sec tick
> > and the the spurious ktimersoftd wake ups (bug 1550584).
> 
> Cyclictest is good for testing out the "scheduling latency", so even with 1
> sec tick, yes, cyclictest will still do good. However, when you are running
> test cases which simulating what DPDK doing in pure userspace, you will see
> for every 1 second window, you will loss some CPU cycles for doing
> ktimersoftd and context switchings. This is something cylictest won't show,
> but it will affect the jitter in a real DPDK-based application. So we have
> to fix it.

Yichen,

You're right that cyclictest is mostly good for scheduling latency. However,
in our lab we also tested with DPDK itself and with a queueing packet simulator
we wrote. In both cases, we still got the expected throughput for a specified
packet loss.

If you think the 1 second tick and/or the ktimersoftd wakeups are hurting you, may I
suggest we move forward by doing the following:

1. Let's close this BZ, since you're not observing the original problem anymore

2. Open a new BZ for each of the issues you think is hurting your workload. For
the ktimersoftd we already have bug 1550584, so we don't need a new one. For
each BZ, please provide an explanation how it is affecting the workload. For example,
are you getting packet drops? If you're running a simulator, that's fine, but we'll
need the simulator to reproduce the problem in our lab

Once we have the BZs with all the necessary details, we can study the feasibility
of having a fix or workaround, and prioritize them accordingly.

Comment 41 Yichen Wang 2019-03-04 22:49:27 UTC
(In reply to Luiz Capitulino from comment #40)
> (In reply to Yichen Wang from comment #39)
> 
> > > We've been working with a treshold of 50us for maximum latency, and we get
> > > less than that with cyclictest for 24 hours runs even with the 1 sec tick
> > > and the the spurious ktimersoftd wake ups (bug 1550584).
> > 
> > Cyclictest is good for testing out the "scheduling latency", so even with 1
> > sec tick, yes, cyclictest will still do good. However, when you are running
> > test cases which simulating what DPDK doing in pure userspace, you will see
> > for every 1 second window, you will loss some CPU cycles for doing
> > ktimersoftd and context switchings. This is something cylictest won't show,
> > but it will affect the jitter in a real DPDK-based application. So we have
> > to fix it.
> 
> Yichen,
> 
> You're right that cyclictest is mostly good for scheduling latency. However,
> in our lab we also tested with DPDK itself and with a queueing packet
> simulator
> we wrote. In both cases, we still got the expected throughput for an
> specified
> packet loss.
> 
> If you think the 1 second tick and/or the ktimersoftd wakeups are hurting
> you, may I
> suggest we move forward by doing the following:
> 
> 1. Let's close this BZ, since you're not observing the original problem
> anymore
> 
> 2. Open a new BZ for each of the issues you think is hurting your workload.
> For
> the ktimersoftd we already have bug 1550584, so we don't need a new one. For
> each BZ, please provide an explanation how it is affecting the workload. For
> example,
> are you getting packet drops? If you're running a simulator, that's fine,
> but we'll
> need the simulator to reproduce the problem in our lab
> 
> Once we have the BZs with all the necessary details, we can study the
> feasibility
> of having a fix or workaround, and prioritize them accordingly.

I have no problem closing this BZ, and there is nothing else I need to follow up on other than 1550584.

Our test tool is fairly simple: it just runs a tight loop and counts how many cycles are missing in a 1-second window. I agree with you, this interruption is really small, and even with DPDK it might still be OK to fit in our SLA, but at this moment I cannot tell you; we need to move forward with our testing and will let you know.

Thanks very much for all the help!

Comment 42 Luiz Capitulino 2019-03-05 15:52:54 UTC
Yichen,

In the beginning of KVM-RT, we also used a tight loop the same way you're using it.
However, we found out that not all interruptions affect DPDK performance. For an
interruption to be a problem, it has to have a certain duration and a certain
frequency. So, trying to remove every possible interruption is counterproductive:
it can take several years upstream, and in the end DPDK performance may not
improve as expected.

So, what would be wonderful for us is if you could run a more realistic test-case
and then we can work together on debugging and classifying interruptions as blockers
and non-blockers. FWIW, this is exactly what we did for current KVM-RT.

I'm closing this as NOTABUG as per your feedback on comment 41.

Comment 43 Ian Wells 2019-03-05 18:51:27 UTC
Luiz - I see you've closed the bug and we will open another, but just to add context: the test you describe is the wrong one for the vRAN circumstance, which has specific hard real-time requirements and not simply a general purpose one to drop no packets.

Comment 44 Luiz Capitulino 2019-03-05 19:09:58 UTC
Ian,

Sorry if I was too generic. With DPDK, we did three kinds of testing: maximum throughput with zero-loss,
maximum throughput with 1% loss, and packet round-trip latency (I guess with zero-loss). Those were
long ago. But more recently, we tested with FlexRAN. They were all a pass.

