Bug 1447938
| Summary: | Boot single VM with 8 rt vCPUs, the latency value is high | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Pei Zhang <pezhang> |
| Component: | tuned | Assignee: | Jaroslav Škarvada <jskarvad> |
| Status: | CLOSED ERRATA | QA Contact: | Tereza Cerna <tcerna> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.4 | CC: | atragler, bhu, chayang, hhuang, jeder, jmario, jskarvad, juzhang, knoel, lcapitulino, michen, mtessun, pezhang, sgordon, tcerna, thozza, virt-maint, williams |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | tuned-2.8.0-4.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-01 12:35:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1240765 | | |
Description (Pei Zhang, 2017-05-04 09:28:13 UTC)
Close to 1ms! I have some questions:

- You seem to have lots of pCPUs. Are you sure you don't have Hyper-Threading enabled in the BIOS?
- Did the same test case (same number of vCPUs, same duration, etc.) work before kernel -581 without a spike?
- Does this need 12 hours to reproduce?
- Have you tried to reproduce with fewer than 8 RT vCPUs?

I'll try to reproduce this today.

While trying to reproduce this issue I was able to get max=30us in one of the vCPUs. When debugging it I found that rteval was pinning load processes to isolated CPUs (this is bug 1448201).

It's hard to tell whether or not that rteval issue has anything to do with this spike. We'll have to wait for a fixed rteval and try to reproduce again.

It's also a bit unexpected that, even with that rteval issue, cyclictest gives good results most of the time.

(In reply to Luiz Capitulino from comment #6)
> While trying to reproduce this issue I was able to get max=30us in one of
> the vCPUs. When debugging it I found that rteval was pinning load processes
> to isolated CPUs (this is bug 1448201).

Luiz, it seems you have reproduced the problem and found the reason. I'm not sure whether my reproduction is still needed to confirm that this issue is reproducible rather than occasional. Today I tried 1 hour, 2 hours and 3 hours with 8 RT vCPUs, and the latency values look good. Now I'm testing 12 hours again. If the reproduction is not needed anymore, please let me know.

Thanks,
Pei

> It's hard to tell whether or not that rteval issue has anything to do with
> this spike. We'll have to wait for a fixed rteval and try to reproduce again.
>
> It's also a bit unexpected that even with that rteval issue cyclictest gives
> good results most of the time.

Here's my 12h result without rteval (so without any load whatsoever):

```
# Min Latencies: 00005 00005 00005 00005 00005 00005 00005 00005
# Avg Latencies: 00005 00005 00005 00005 00005 00005 00005 00005
# Max Latencies: 00011 00010 00010 00010 00010 00010 00010 00010
```

Yesterday I was able to get one of the vCPUs close to max=30 when running for 1 hour with rteval; we can't tell whether that was caused by the rteval pinning bug or not. I did another run where I ran a kernel build on the housekeeping CPUs instead of running rteval. After a few tries, I got this (1 hour run):

```
# Min Latencies: 00003 00005 00005 00005 00005 00005 00005 00005
# Avg Latencies: 00005 00005 00005 00005 00005 00005 00005 00005
# Max Latencies: 00010 00010 00010 00021 00034 00010 00028 00010
```

So there's a problem besides rteval. I'll debug it.
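For reference, the "# Min/Avg/Max Latencies" summary lines above are the footer cyclictest prints when run with a histogram. The exact options and CPU numbers used in these tests are not recorded in this report, so the commands below are only a sketch under assumed values:

```sh
# Hypothetical guest-side run: 8 measurement threads pinned to the 8 RT
# vCPUs (assumed here to be CPUs 1-8), SCHED_FIFO priority 95, histogram
# enabled so the "# Min/Avg/Max Latencies" summary is printed, 1-hour run.
cyclictest -m -q -p95 -t8 -a1-8 -h60 -D1h > cyclictest.log

# Load on the host's housekeeping CPUs instead of rteval, e.g. a kernel
# build pinned away from the isolated cores (CPU numbers and path made up):
taskset -c 0,9-15 make -j8 -C /path/to/linux-src
```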
I think I might have found the problem. In my debugging I was seeing two things:

1. Increasing the CPU count makes the problem worse.
2. When tracing, I saw this:

```
guest-trace.dat: stress-8443 [002] 3369.175510046: (+542) function: tick_sched_handle.isra.18 <-- tick_sched_timer
guest-trace.dat: stress-8447 [008] 3369.175510050: (+4)   function: tick_sched_handle.isra.18 <-- tick_sched_timer
guest-trace.dat: stress-8448 [003] 3369.175510051: (+1)   function: tick_sched_handle.isra.18 <-- tick_sched_timer
guest-trace.dat: stress-8445 [009] 3369.175510052: (+1)   function: tick_sched_handle.isra.18 <-- tick_sched_timer
guest-trace.dat: stress-8449 [006] 3369.175510053: (+1)   function: tick_sched_handle.isra.18 <-- tick_sched_timer
guest-trace.dat: stress-8444 [007] 3369.175510053: (+0)   function: tick_sched_handle.isra.18 <-- tick_sched_timer
guest-trace.dat: stress-8446 [004] 3369.175510055: (+2)   function: tick_sched_handle.isra.18 <-- tick_sched_timer
guest-trace.dat: stress-8442 [005] 3369.175510057: (+2)   function: tick_sched_handle.isra.18 <-- tick_sched_timer
```

In this trace we can see that tick_sched_handle() is being called at almost the same time on all CPUs. The two points above seem to indicate that we may be running into lock contention because the tick fires at the same time on all CPUs. This is a known problem and is avoided with skew_tick=1.

Indeed, I can't reproduce the problem when I boot the host and the guest with skew_tick=1. My reproducer consists of running the 1h test case a few times; I'd always get the results from comment 10 in the second or third run. With skew_tick=1, however, I'm not able to reproduce this for a whole day. Also, running the test case for 15 hours I got:

```
# Min Latencies: 00003 00005 00005 00005 00005 00005 00005 00005
# Avg Latencies: 00005 00005 00005 00005 00005 00005 00005 00005
# Max Latencies: 00010 00010 00010 00010 00010 00010 00010 00010
```

I never got a result this good before when running with 6 RT vCPUs; I'd always get one CPU hitting max=21us. Now, when running for one hour I did get max=17us only once. While this is higher than the other CPUs, we are still below max=20us (our threshold).

I did some additional tests with skew_tick=1 and it really seems to fix the issue for me. So I'm moving this to tuned, as it can be addressed in the RT profiles. I'm also clearing the Regression keyword, as this issue has always existed.

Jaroslav, I'm going to post a patch for this issue. I'm waiting for input from the RT team in order to know whether I should add this to the KVM-RT profiles only or to the main RT profile too.
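For anyone retracing the analysis above, the boot parameter and a similar function trace can be checked roughly as follows. The exact tracing setup used here is not recorded in the report, so the trace-cmd invocation is an assumption:

```sh
# Confirm the running kernel (host and guest) was booted with skew_tick=1.
grep -o 'skew_tick=[01]' /proc/cmdline

# Hypothetical capture of tick-handler activity with the function tracer,
# filtered to tick_sched_handle (the wildcard covers the .isra.* suffix):
trace-cmd record -p function -l 'tick_sched_handle*' -o guest-trace.dat sleep 10
trace-cmd report -i guest-trace.dat | head
```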
Does it need to be conditionalized on the number of CPUs or sockets, or does it belong everywhere? What about non-RT kernel scenarios?

(In reply to Jeremy Eder from comment #14)
> Does it need to be conditionalized on the number of cpus or sockets or does
> it belong everywhere?

I _guess_ it won't hurt small boxes, but as Clark had the same question, I'll look more into this.

> What about non-RT kernel scenarios?

I think anything that's very sensitive to jitter may benefit. One good example is zero-loss (I'll open a separate BZ for it). But the kernel docs mention that a drawback is higher power consumption (which doesn't matter much for KVM-RT and zero-loss).

I propose adding this to the network-latency profile then, which currently sits in the following inheritance chain:

- cpu-partitioning
- realtime
- and network-latency itself, of course

If we need it to be set only when cores/sockets are greater than X, then perhaps we should add that detection capability to the tuned infrastructure so the profiles can leverage it? Contingent on Luiz's investigation.

We will also need to clearly document this feature as a comment in the network-latency tuned profile, because I am sure we will get questions:

- What exactly does it do, in plain English?
- How does it relate to nohz_full?
- When should it be set and when not (e.g. if I use nohz_full, should I avoid skew_tick)?

Does this sound reasonable?

Yes, it does! Just a note that the profiles are not yet making use of kernel command-line inheritance, so I'll have to change that first, and this makes me nervous because a small mistake can cause a big breakage. On the other hand, adding the same option to all those profiles by hand is not an option.

I did some additional research on this: read more code, understood the history behind skew_tick, and talked a bit with skew_tick author Mike Galbraith. Two conclusions:

1. skew_tick=1 is a MUST HAVE for RT and KVM-RT.
2. The benefits on non-RT are not entirely clear to me, but according to Mike they do exist.

Here's the main problem: by default, all CPUs tick at exactly the same time. In the RT kernel, jiffies_lock is a raw_spinlock. All CPUs usually read jiffies from the tick handler, so they end up waiting for each other. skew_tick=1 solves this because it causes each CPU to tick at a different time, avoiding reading jiffies at the same moment. The only theoretical problem with skew_tick is that it may increase power consumption, but this is disputed.

Upstream and in the RHEL kernel, jiffies_lock is not a spinlock, it's a seqlock, so I'd expect the situation to be much less severe than on RT, although some contention is still possible. Also, it seems there are two additional sources of contention when all CPUs tick at the same time:

- RCU. See upstream commit 02482880092, which advises skew_tick=1 due to RCU lock contention. Not sure whether this affects RHEL7, though.
- hrtimer's cpu_base lock is a raw spinlock, so I guess this could run into contention too?

Now, I'm still trying to decide whether I add skew_tick=1:

1. Only to the RT and KVM-RT profiles for now; or
2. To the network-latency profile, reworking all the profiles that include it to use kernel command-line inheritance.

Item 2 is the Right Thing, but I'm afraid it's an intrusive change: I'll have to touch several profiles, and I don't know how much kernel command-line inheritance has been exercised so far, since it's a new feature. Opinions?

I've decided to add skew_tick=1 to the RT profiles first. I'll open a new BZ to discuss its inclusion in non-RT profiles.

Here's the BZ for adding skew_tick=1 to non-RT profiles: Bug 1451073 - tuned: use skew_tick=1 in all low-latency profiles
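Once the fixed package is installed, a quick sanity check along these lines should show the parameter being picked up. The profile name and file path below are assumptions (use whichever RT profile is active, e.g. realtime-virtual-host for KVM-RT):

```sh
# Check that the shipped RT profile now carries skew_tick=1 in its
# [bootloader] cmdline (path assumed for RHEL 7's tuned packaging).
grep -n 'skew_tick' /usr/lib/tuned/realtime/tuned.conf

# Re-apply the profile so tuned regenerates the bootloader entry, then
# reboot and confirm the parameter reached the kernel command line.
tuned-adm profile realtime-virtual-host
grep -o 'skew_tick=1' /proc/cmdline
```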
@pezhang Thank you for testing it, this helps me a lot. I'm adding qa_ack for this bug.

Hi Pei,

I see everything is ready for testing. Can you do some functional test?

Regards,
Tereza

(In reply to Tereza Cerna from comment #32)
> Hi Pei,
>
> I see everything is ready for testing. Can you do some functional test?
>
> Regards,
> Tereza

I'm doing latency testing now; it needs to run for 12 hours, so we can have the results by tomorrow. Actually, I tested with tuned-2.8.0-5.el7.noarch and the latest kernel-rt yesterday, but after running for about 10 hours the guest hung, hitting the existing bug [1]:

[1] Bug 1448770 - INFO: task xfsaild/dm-2:1175 blocked for more than 600 seconds

Now I'm trying the older kernel-rt version (3.10.0-663.rt56.581.el7.x86_64) with tuned-2.8.0-5.el7.noarch.

I need to confirm: what do you mean by 'functional test'? If you mean the functionality of tuned itself, please see Comment 31.

Regards,
Pei

Hi Pei,

I need you to test your component with the new version of tuned, to see whether it works well and whether the two work well together. Thank you so much for your effort.

Update: Latency testing results look good with tuned-2.8.0-5.el7.noarch.

Versions:
3.10.0-663.rt56.581.el7.x86_64
tuned-2.8.0-5.el7.noarch
libvirt-3.2.0-10.el7.x86_64
qemu-kvm-rhev-2.9.0-10.el7.x86_64

Latency results after running 12 hours:

```
# Min Latencies: 00004 00006 00006 00006 00005 00006 00006 00006
# Avg Latencies: 00006 00006 00006 00006 00006 00006 00006 00006
# Max Latencies: 00012 00011 00011 00011 00011 00011 00011 00011
```

Hi Pei. Thank you so much for your testing. Based on the testing in #35, the bug is switched to VERIFIED. Everything looks good; the patch was applied.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2102

tuned-2.9.0-0.1.rc1.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-d9c6b990df

tuned-2.9.0-1.fc27 has been submitted as an update to Fedora 27. https://bodhi.fedoraproject.org/updates/FEDORA-2017-0e45ce4685

tuned-2.9.0-1.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-c30e9bd1ea