Bug 1665995

Summary:	Huge latency spike for VM(s) with 1 rt vCPU
Product:	Red Hat Enterprise Linux 8	Reporter:	Pei Zhang <pezhang>
Component:	kernel-rt	Assignee:	Juri Lelli <jlelli>
kernel-rt sub component:	Other	QA Contact:	Pei Zhang <pezhang>
Status:	CLOSED CURRENTRELEASE	Docs Contact:
Severity:	high
Priority:	unspecified	CC:	bhu, chayang, hhuang, juzhang, lcapitulino, lgoncalv, mtosatti, pezhang, virt-maint
Version:	8.1	Keywords:	Regression
Target Milestone:	rc	Flags:	bhu: mirror+
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-06-03 15:56:48 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1666562, 1666680, 1669294
Bug Blocks:	1640832, 1649535, 1701002

Description Pei Zhang 2019-01-14 15:58:51 UTC

Description of problem:
Boot VM with 1 rt vCPU, the max latency exceeds 40us.

Version-Release number of selected component (if applicable):
4.18.0-60.rt9.112.el8.x86_64
qemu-kvm-3.1.0-4.module+el8+2676+33bd6e2b.x86_64

How reproducible:
3/3

Steps to Reproduce:
1. Install a RHEL8 host

2. Setup RT host

3. Install a RHEL8 guest

4. Setup RT guest

5. Start kernel compiling in host with $twice_of_host_housekeeping_cores

6. Start kernel compiling in guest with $twice_of_guest_housekeeping_cores

7. Start cyclictest
# taskset -c 1 /home/nfv-virt-rt-kvm/tools/cyclictest -m -n -q -p95 -D 24h -h60 -t 1 -a 1 -b40 --notrace

Step 5,6,7 are executing at same time.

The max latency of cyclictest exceeds 40us after several minutes. See below:

==Latency testing results==
Test started at:     2019-01-14 04:40:38 Monday
Kernel cmdline:      BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-60.rt9.112.el8.x86_64 root=/dev/mapper/rhel_vm--74--21-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto resume=/dev/mapper/rhel_vm--74--21-swap rd.lvm.lv=rhel_vm-74-21/root rd.lvm.lv=rhel_vm-74-21/swap skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1 default_hugepagesz=1G iommu=pt intel_iommu=on
X86 debug pts:       pti_enable= ibpb_enabled= ibrs_enabled= retp_enabled=
Machine:             vm-74-254.lab.eng.pek2.redhat.com
CPU:                 Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
Test duration(plan): 24h
Test ended at:       2019-01-14 05:01:58 Monday
cyclictest cmdline:  taskset -c 1 /home/nfv-virt-rt-kvm/tools/cyclictest -m -n -q -p95 -D 24h -h60 -t 1 -a 1 -b40 --notrace
cyclictest results: 

# Min Latencies: 00007
# Avg Latencies: 00016
# Max Latencies: 38483


Actual results:
The max latency > 40us.


Expected results:
The max latency should be no more than 40us.


Additional info:
1. Single VM with 8 rt vCPUs, 24 hours latency looks good.

2. Multiple VMs each with 1 rt vCPU, also hit same high spike issue. See below:
==VM1==
Test started at:     2019-01-14 05:11:40 Monday
Kernel cmdline:      BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-60.rt9.112.el8.x86_64 root=/dev/mapper/rhel_bootp--73--75--148-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto resume=/dev/mapper/rhel_bootp--73--75--148-swap rd.lvm.lv=rhel_bootp-73-75-148/root rd.lvm.lv=rhel_bootp-73-75-148/swap skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1 default_hugepagesz=1G iommu=pt intel_iommu=on
X86 debug pts:       pti_enable= ibpb_enabled= ibrs_enabled= retp_enabled=
Machine:             vm-74-204.lab.eng.pek2.redhat.com
CPU:                 Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
Test duration(plan): 24h
Test ended at:       2019-01-14 05:25:04 Monday
cyclictest cmdline:  taskset -c 1 /home/nfv-virt-rt-kvm/tools/cyclictest -m -n -q -p95 -D 24h -h60 -t 1 -a 1 -b40 --notrace
cyclictest results: 

# Min Latencies: 00007
# Avg Latencies: 00014
# Max Latencies: 00814


==VM2==
Test started at:     2019-01-14 05:11:40 Monday
Kernel cmdline:      BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-60.rt9.112.el8.x86_64 root=/dev/mapper/rhel_vm--74--199-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto resume=/dev/mapper/rhel_vm--74--199-swap rd.lvm.lv=rhel_vm-74-199/root rd.lvm.lv=rhel_vm-74-199/swap skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1 default_hugepagesz=1G iommu=pt intel_iommu=on
X86 debug pts:       pti_enable= ibpb_enabled= ibrs_enabled= retp_enabled=
Machine:             vm-74-216.lab.eng.pek2.redhat.com
CPU:                 Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
Test duration(plan): 24h
Test ended at:       2019-01-14 05:25:04 Monday
cyclictest cmdline:  taskset -c 1 /home/nfv-virt-rt-kvm/tools/cyclictest -m -n -q -p95 -D 24h -h60 -t 1 -a 1 -b40 --notrace
cyclictest results: 

# Min Latencies: 00007
# Avg Latencies: 00014
# Max Latencies: 00284


==VM3==
Test started at:     2019-01-14 05:11:40 Monday
Kernel cmdline:      BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-60.rt9.112.el8.x86_64 root=/dev/mapper/rhel_bootp--73--75--32-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto resume=/dev/mapper/rhel_bootp--73--75--32-swap rd.lvm.lv=rhel_bootp-73-75-32/root rd.lvm.lv=rhel_bootp-73-75-32/swap skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1 default_hugepagesz=1G iommu=pt intel_iommu=on
X86 debug pts:       pti_enable= ibpb_enabled= ibrs_enabled= retp_enabled=
Machine:             vm-74-215.lab.eng.pek2.redhat.com
CPU:                 Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
Test duration(plan): 24h
Test ended at:       2019-01-14 05:25:04 Monday
cyclictest cmdline:  taskset -c 1 /home/nfv-virt-rt-kvm/tools/cyclictest -m -n -q -p95 -D 24h -h60 -t 1 -a 1 -b40 --notrace
cyclictest results: 

# Min Latencies: 00007
# Avg Latencies: 00014
# Max Latencies: 00985


==VM4==
Test started at:     2019-01-14 05:11:40 Monday
Kernel cmdline:      BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-60.rt9.112.el8.x86_64 root=/dev/mapper/rhel_bootp--73--75--106-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto resume=/dev/mapper/rhel_bootp--73--75--106-swap rd.lvm.lv=rhel_bootp-73-75-106/root rd.lvm.lv=rhel_bootp-73-75-106/swap skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1 default_hugepagesz=1G iommu=pt intel_iommu=on
X86 debug pts:       pti_enable= ibpb_enabled= ibrs_enabled= retp_enabled=
Machine:             vm-74-196.lab.eng.pek2.redhat.com
CPU:                 Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
Test duration(plan): 24h
Test ended at:       2019-01-14 05:25:04 Monday
cyclictest cmdline:  taskset -c 1 /home/nfv-virt-rt-kvm/tools/cyclictest -m -n -q -p95 -D 24h -h60 -t 1 -a 1 -b40 --notrace
cyclictest results: 

# Min Latencies: 00007
# Avg Latencies: 00014
# Max Latencies: 31783

Comment 1 Pei Zhang 2019-01-15 10:25:35 UTC

I checked the difference between rhel7 and rhel8 hosts today:

In rhel7.7 host:
# cat /usr/lib/tuned/realtime-virtual-host/lapic_timer_adv_ns
2000

# cat /sys/module/kvm/parameters/lapic_timer_advance_ns
2000


In rhel8.0 host:
# cat /usr/lib/tuned/realtime-virtual-host/lapic_timer_adv_ns
cat: /usr/lib/tuned/realtime-virtual-host/lapic_timer_adv_ns: No such file or directory

# cat /sys/module/kvm/parameters/lapic_timer_advance_ns
0


Luiz, could you please have a look if this is a possible reason of spike? Thanks.


Best regards,
Pei

Comment 2 Luiz Capitulino 2019-01-15 17:52:12 UTC

Pei,

Yes, I'll look into this later today or tomorrow. I'll have to use your machine
(but I already have the login details).

Comment 3 Luiz Capitulino 2019-01-15 21:39:09 UTC

I took a very quick look and it seems we have two problems:

1. Sometime after booting, the kernel decides that the TSC on CPU19 is unstable
   and changes the clocksource to hpet. I didn't know the kernel could mark
   the TSC of a single CPU as unstable, but in any case KVM-RT requires a
   stable TSC (and you're using CPU19 on your KVM-RT guest)

[ 2611.752700] clocksource: timekeeping watchdog on CPU19: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 2611.765227] clocksource:                       'hpet' wd_now: b5059197 wd_last: bbcfe69d mask: ffffffff
[ 2611.775703] clocksource:                       'tsc' cs_now: 5d733215270 cs_last: 218c2d31186 mask: ffffffffffffffff
[ 2611.787464] tsc: Marking TSC unstable due to clocksource watchdog
[ 2611.787472] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[ 2611.787475] sched_clock: Marking unstable (2611880509082, -95077647)<-(2611847545715, -60076864)
[ 2611.822971] clocksource: Switched to clocksource hpet

2. tuned is failing when applying the realtime-virtual-host profile:

2019-01-14 09:01:12,100 ERROR    tuned.plugins.plugin_script: script '/usr/lib/tuned/realtime-virtual-host/script.sh' error output: 'Failed to start rt-entsk.service: Unit rt-entsk.service not found.'
2019-01-14 09:01:12,156 ERROR    tuned.units.manager: BUG: Unhandled exception in start_tuning: dictionary changed size during iteration

Although the TSC issue seems quite serious and impactful for KVM-RT, I can't
tell with 100% certainty that one or both issues are the root cause of the
spike. We'll have to fix them to know it.

I'll open a different BZ for issue 2 and will follow up with instructions
for issue 1.

Comment 4 Luiz Capitulino 2019-01-15 21:46:01 UTC

Pei,

For issue 1, I think you said you were using this machine with a
RHEL7 kernel and you didn't have any issues, right? Do you remember
if you were using CPU19 (ie. if a qemu thread was pinned to that CPU).

Maybe Marcelo will have better recommendations, but I think I'd first
try to determine if we need a guest to cause this issue. One way of
doing this is to just reboot the system and wait 1 hour or 2. Then
check if the error is in dmesg. If it is there, then we don't need a
guest and you could try the same procedure with the non-RT kernel to
see if this problem is specific to RT or not.

Comment 5 Pei Zhang 2019-01-16 09:21:09 UTC

(In reply to Luiz Capitulino from comment #3)

> 
> 2. tuned is failing when applying the realtime-virtual-host profile:
> 
> 2019-01-14 09:01:12,100 ERROR    tuned.plugins.plugin_script: script
> '/usr/lib/tuned/realtime-virtual-host/script.sh' error output: 'Failed to
> start rt-entsk.service: Unit rt-entsk.service not found.'
> 2019-01-14 09:01:12,156 ERROR    tuned.units.manager: BUG: Unhandled
> exception in start_tuning: dictionary changed size during iteration
> 
> Although the TSC issue seems quite serious and impactful for KVM-RT, I can't
> tell with 100% certainty that one or both issues are the root cause of the
> spike. We'll have to fix them to know it.
> 
> I'll open a different BZ for issue 2 and will follow up with instructions
> for issue 1.

Hi Luiz, 

I've filed bug 1666562 to track it.

(In reply to Luiz Capitulino from comment #4)
> Pei,
> 
> For issue 1, I think you said you were using this machine with a
> RHEL7 kernel and you didn't have any issues, right? Do you remember
> if you were using CPU19 (ie. if a qemu thread was pinned to that CPU).

Yes, the RHEL7.7 works well.

Yes, I was using CPU19 as guest housekeeping cpu in scenario "Single VM with 1 rt vCPU". We test RHEL7.7 and RHEL8 with same configurations, including cores isolation, cpu pin.

I checked history testing of RHEL8. (As the spike looks very big, so I ran several runs to double confirm before reporting bug).

==For scenario "Single VM with 1 rt vCPU"==

Server: dell-per430-09.lab.eng.pek2.redhat.com 

TSC related log: "clocksource: timekeeping watchdog on CPU19..."

Running: 3 times, all 3 runs hit spike and showed above TSC log.

CPU pin info:
  <cputune>
    <vcpupin vcpu='0' cpuset='19'/>
    <vcpupin vcpu='1' cpuset='18'/>
    <emulatorpin cpuset='1,3,5,7,9'/>
    <vcpusched vcpus='0' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='1'/>
  </cputune>


==For scenario "Multiple VMs each with 1 rt vCPU"==

Server: dell-per430-11.lab.eng.pek2.redhat.com 

TSC related log: 
"clocksource: timekeeping watchdog on CPU13:.."
"clocksource: timekeeping watchdog on CPU15:.."
"clocksource: timekeeping watchdog on CPU19:.."

Running: 4 times, all 4 runs hit spike and showed one of above TSC logs.

CPU pin info:

-VM1
  <cputune>
    <vcpupin vcpu='0' cpuset='19'/>
    <vcpupin vcpu='1' cpuset='18'/>
    <emulatorpin cpuset='1,3,5,7,9'/>
    <vcpusched vcpus='0' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='1'/>
  </cputune>

-VM2
  <cputune>
    <vcpupin vcpu='0' cpuset='17'/>
    <vcpupin vcpu='1' cpuset='16'/>
    <emulatorpin cpuset='1,3,5,7,9'/>
    <vcpusched vcpus='0' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='1'/>
  </cputune>

-VM3
  <cputune>
    <vcpupin vcpu='0' cpuset='15'/>
    <vcpupin vcpu='1' cpuset='14'/>
    <emulatorpin cpuset='1,3,5,7,9'/>
    <vcpusched vcpus='0' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='1'/>
  </cputune>

-VM4
  <cputune>
    <vcpupin vcpu='0' cpuset='13'/>
    <vcpupin vcpu='1' cpuset='12'/>
    <emulatorpin cpuset='1,3,5,7,9'/>
    <vcpusched vcpus='0' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='1'/>
  </cputune>


For both scenarios:
Host cores isolation:
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-60.rt9.112.el8.x86_64 root=/dev/mapper/rhel_dell--per430--11-root ro crashkernel=auto resume=/dev/mapper/rhel_dell--per430--11-swap rd.lvm.lv=rhel_dell-per430-11/root rd.lvm.lv=rhel_dell-per430-11/swap console=ttyS0,115200n81 skew_tick=1 isolcpus=2,4,6,8,10,12,14,16,18,19,17,15,13 intel_pstate=disable nosoftlockup nohz=on nohz_full=2,4,6,8,10,12,14,16,18,19,17,15,13 rcu_nocbs=2,4,6,8,10,12,14,16,18,19,17,15,13 default_hugepagesz=1G iommu=pt intel_iommu=on

Guest cores isolation:
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-60.rt9.112.el8.x86_64 root=/dev/mapper/rhel_bootp--73--75--106-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto resume=/dev/mapper/rhel_bootp--73--75--106-swap rd.lvm.lv=rhel_bootp-73-75-106/root rd.lvm.lv=rhel_bootp-73-75-106/swap skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1 default_hugepagesz=1G iommu=pt intel_iommu=on

> 
> Maybe Marcelo will have better recommendations, but I think I'd first
> try to determine if we need a guest to cause this issue. One way of
> doing this is to just reboot the system and wait 1 hour or 2. Then
> check if the error is in dmesg. If it is there, then we don't need a
> guest and you could try the same procedure with the non-RT kernel to
> see if this problem is specific to RT or not.

I'll update the results after rebooting host and waiting 2 hours later.

Thank you.


Best regards,
Pei

Comment 6 Pei Zhang 2019-01-16 10:54:06 UTC

(In reply to Pei Zhang from comment #5)
[...]
> > Maybe Marcelo will have better recommendations, but I think I'd first
> > try to determine if we need a guest to cause this issue. One way of
> > doing this is to just reboot the system and wait 1 hour or 2. Then
> > check if the error is in dmesg. If it is there, then we don't need a
> > guest and you could try the same procedure with the non-RT kernel to
> > see if this problem is specific to RT or not.
> 
> I'll update the results after rebooting host and waiting 2 hours later.

Waiting 2 hours after rebooting rt host, there is no TSC related log in rt host.


Best regards,
Pei

Comment 7 Luiz Capitulino 2019-01-16 16:56:12 UTC

Thanks for this info. There are two other things you could do:

- Since this appears to be a regression, you could go back some kernels
  to see if the issue goes away. If it does, then that's excellent news

- Try the following procedure to see if it reproduces without KVM:

1. Reboot the system

2. Write a hog app, say hog.c:

int main(void) { for (;;); return 0 };

3. Build it and dun it with fifo:1 pinned to CPU19:

# chrt -f 1 taskset -c 19 hog

4. Wait a few hours to see if it will trig the issue

If none of the above reveals any new information, then KVM is probably
involved in this issue and we'll have to debug it. I'll try to reproduce it too.

Comment 8 Pei Zhang 2019-01-17 11:38:01 UTC

(In reply to Luiz Capitulino from comment #7)
> Thanks for this info. There are two other things you could do:
> 
> - Since this appears to be a regression, you could go back some kernels
>   to see if the issue goes away. If it does, then that's excellent news
> 
> - Try the following procedure to see if it reproduces without KVM:
> 
> 1. Reboot the system
> 
> 2. Write a hog app, say hog.c:
> 
> int main(void) { for (;;); return 0 };
> 
> 3. Build it and dun it with fifo:1 pinned to CPU19:
> 
> # chrt -f 1 taskset -c 19 hog
> 
> 4. Wait a few hours to see if it will trig the issue
> 
> If none of the above reveals any new information, then KVM is probably
> involved in this issue and we'll have to debug it. I'll try to reproduce it
> too.

OK. I'm doing these 2 things. Will update the results once finish.


Best regards,
Pei

Comment 9 Luiz Capitulino 2019-01-17 13:37:28 UTC

There are two things I forgot to tell you.

The first one is that our first step for this BZ should be to make
it reproducible. That is, boot the machine, run the single guest
test-case and monitor dmesg to see how long we need until the
kernel marks the TSC as unstable. Then we do what I described in
comment 7, since we'll know how long it takes to reproduce.

The other thing is, I think you could try with the non-RT kernel
too as a last resort (ie. after trying the 2 things from
comment 7). The latencies may not be good with the non-RT kernel,
but we just want to know if the TSC will become unstable too.

Comment 10 Luiz Capitulino 2019-01-17 19:48:08 UTC

Pei,

I think I may have found the root cause of this issue. Please, let's
cancel the debugging from comment 7 and comment 9. I'm working on
a kernel that I plan to post here soon for you to test.

Comment 11 Pei Zhang 2019-01-18 02:45:06 UTC

(In reply to Luiz Capitulino from comment #10)
> Pei,
> 
> I think I may have found the root cause of this issue. Please, let's
> cancel the debugging from comment 7 and comment 9. I'm working on
> a kernel that I plan to post here soon for you to test.

Got it.

Best regards,
Pei

Comment 12 Marcelo Tosatti 2019-01-18 14:30:31 UTC

Hello!

(In reply to Luiz Capitulino from comment #3)
> I took a very quick look and it seems we have two problems:
> 
> 1. Sometime after booting, the kernel decides that the TSC on CPU19 is
> unstable
>    and changes the clocksource to hpet. I didn't know the kernel could mark
>    the TSC of a single CPU as unstable, but in any case KVM-RT requires a
>    stable TSC (and you're using CPU19 on your KVM-RT guest)
> 
> [ 2611.752700] clocksource: timekeeping watchdog on CPU19: Marking
> clocksource 'tsc' as unstable because the skew is too large:
> [ 2611.765227] clocksource:                       'hpet' wd_now: b5059197
> wd_last: bbcfe69d mask: ffffffff
> [ 2611.775703] clocksource:                       'tsc' cs_now: 5d733215270
> cs_last: 218c2d31186 mask: ffffffffffffffff
> [ 2611.787464] tsc: Marking TSC unstable due to clocksource watchdog
> [ 2611.787472] TSC found unstable after boot, most likely due to broken
> BIOS. Use 'tsc=unstable'.
> [ 2611.787475] sched_clock: Marking unstable (2611880509082,
> -95077647)<-(2611847545715, -60076864)
> [ 2611.822971] clocksource: Switched to clocksource hpet

Yes this can slowdown (although not that significantly)

So the  

And BTW, such machines (with unstable TSC) should not be supported by
RHEL.
 
Separate bug, though.

> 
> 2. tuned is failing when applying the realtime-virtual-host profile:
> 
> 2019-01-14 09:01:12,100 ERROR    tuned.plugins.plugin_script: script
> '/usr/lib/tuned/realtime-virtual-host/script.sh' error output: 'Failed to
> start rt-entsk.service: Unit rt-entsk.service not found.'

Make sure rt-entsk binary is running to workaround the chronyd initiated
problem.

> 2019-01-14 09:01:12,156 ERROR    tuned.units.manager: BUG: Unhandled
> exception in start_tuning: dictionary changed size during iteration
> 
> Although the TSC issue seems quite serious and impactful for KVM-RT, I can't
> tell with 100% certainty that one or both issues are the root cause of the
> spike. We'll have to fix them to know it.
> 
> I'll open a different BZ for issue 2 and will follow up with instructions
> for issue 1.

Comment 13 Luiz Capitulino 2019-01-18 22:10:19 UTC

(In reply to Marcelo Tosatti from comment #12)

> And BTW, such machines (with unstable TSC) should not be supported by
> RHEL.

It is actually a false positive, I'll follow up with an explanation.

> > 2. tuned is failing when applying the realtime-virtual-host profile:
> > 
> > 2019-01-14 09:01:12,100 ERROR    tuned.plugins.plugin_script: script
> > '/usr/lib/tuned/realtime-virtual-host/script.sh' error output: 'Failed to
> > start rt-entsk.service: Unit rt-entsk.service not found.'
> 
> Make sure rt-entsk binary is running to workaround the chronyd initiated
> problem.

As reported in bug 1666562, this binary doesn't seem to exist in RHEL8.

Comment 14 Luiz Capitulino 2019-01-18 22:12:18 UTC

I was able to reproduce this issue and I believe I have debugged it down
to the root cause. What follows is my hypothesis. However, I'm still
running a long duration test to confirm my findinds and would like Pei
to do the same before we commit to moving forward.

The kernel clocksource code has a feature called the clocksource
watchdog. The goal of the clocksource watchdog is to detect unstable
TSCs at run-time. It does this by scheduling a timer on each CPU
every few seconds. The timer handler will read the time from different
sources and do some calculations to find out if there's a skew in
the TSC. It's a complex code that take locks, disable interrupts, etc.

In the host, since our RT threads run at 100%, the clocksource watchdog
seems to hang between clock reads, which causes it to think there's a
skew, which in turn cause it to mark the TSC as unstable.

In the guest, it actually succeeds running, however it causes nohz full
to de-activate.

In both cases it also generates a lot of interrupts. Everything
contributes to the big spike we're seeing.

Now, why are we only seeing this in RHEL8? Because in RHEL7, the
clocksource watchdog was optional: you can disable it and we did
disable it for RHEL7. However, on recent upstream and RHEL8, the
clocksource watchdog is hardcoded enabled in x86_64. It seems to
be a platform requirement to have it.

To test my hypothesis, I forcefully disabled the clocksource watchdog
in the code. Which fixed the TSC unstable thing and the spike.

PS: I use this machine since the beggining of time for KVM-RT and
never had TSC problems with it (just like Pei)

PPS: Note that we also have tuned issues. I had to downgrade my tuned,
which caused a bunch of issues on its own but I don't have spikes anymore

Comment 18 Pei Zhang 2019-01-24 11:25:58 UTC

12 hours testing looks good too, didn't hit huge spike.


(1)Single VM with 1 rt vCPU:

# Min Latencies: 00007
# Avg Latencies: 00014
# Max Latencies: 00031

(2)Multiple VMs each with 1 rt vCPU:
-VM1
# Min Latencies: 00007
# Avg Latencies: 00014
# Max Latencies: 00032

-VM2
# Min Latencies: 00007
# Avg Latencies: 00014
# Max Latencies: 00031

-VM3
# Min Latencies: 00007
# Avg Latencies: 00015
# Max Latencies: 00037

-VM4
# Min Latencies: 00007
# Avg Latencies: 00015
# Max Latencies: 00032

Versions:
qemu-kvm-3.1.0-5.module+el8+2708+fbd828c6.x86_64
tuned-2.10.0-14.el8.noarch
libvirt-4.10.0-1.module+el8+2317+367e35b5.x86_64
kernel-rt-4.18.0-60.rt9.112.el8.test.x86_64


Beaker job: https://beaker.engineering.redhat.com/recipes/6422241#tasks

Comment 19 Luiz Capitulino 2019-01-24 13:31:15 UTC

Excellent! So, this means the root cause from comment 14 is correct.

Pei, if you have the bandwidth to now run the complete suite for
24 hours, you could do it. This is very useful, because we'll
test most components except for tuned (which is older).

Comment 20 Pei Zhang 2019-01-25 13:51:39 UTC

(In reply to Luiz Capitulino from comment #19)
> Excellent! So, this means the root cause from comment 14 is correct.
> 
> Pei, if you have the bandwidth to now run the complete suite for
> 24 hours, you could do it. This is very useful, because we'll
> test most components except for tuned (which is older).

Hi Luiz, 

OK. I'm testing full suit for 24 hours now. Will update the results once finish.

About tuned, I think we can test with latest version. As Bug 1666562 mentioned, need to install rt-setup manually. I tried and rt-entsk.service ERROR disappears. Also, In above 12 hours tests in Comment 18, I've tried with latest tuned version and it looks good.

Please let me know if you have any comment. Thanks.


Best regards,
Pei

Comment 21 Luiz Capitulino 2019-01-25 14:11:57 UTC

(In reply to Pei Zhang from comment #20)
> (In reply to Luiz Capitulino from comment #19)
> > Excellent! So, this means the root cause from comment 14 is correct.
> > 
> > Pei, if you have the bandwidth to now run the complete suite for
> > 24 hours, you could do it. This is very useful, because we'll
> > test most components except for tuned (which is older).
> 
> Hi Luiz, 
> 
> OK. I'm testing full suit for 24 hours now. Will update the results once
> finish.
> 
> About tuned, I think we can test with latest version. As Bug 1666562
> mentioned, need to install rt-setup manually. I tried and rt-entsk.service
> ERROR disappears. Also, In above 12 hours tests in Comment 18, I've tried
> with latest tuned version and it looks good.

Yes, you're right about bug 1666562. But note that we also have bug 1669294,
which doesn't seem to happen 100% of the time. So, before starting the
test you can check /var/log/tuned.log and see if you have this error
message:

2019-01-16 03:22:34,859 ERROR    tuned.units.manager: BUG: Unhandled exception in start_tuning: dictionary changed size during iteration

If you do, reboot & try again. Note that the error message has to
contain "start_tuning" to be harmful, if it contains "stop_tuning"
then it's fine to run the test since it means the bug triggered
when rebooting.

Comment 22 Pei Zhang 2019-01-25 14:42:17 UTC

(In reply to Luiz Capitulino from comment #21)
> (In reply to Pei Zhang from comment #20)
> > (In reply to Luiz Capitulino from comment #19)
> > > Excellent! So, this means the root cause from comment 14 is correct.
> > > 
> > > Pei, if you have the bandwidth to now run the complete suite for
> > > 24 hours, you could do it. This is very useful, because we'll
> > > test most components except for tuned (which is older).
> > 
> > Hi Luiz, 
> > 
> > OK. I'm testing full suit for 24 hours now. Will update the results once
> > finish.
> > 
> > About tuned, I think we can test with latest version. As Bug 1666562
> > mentioned, need to install rt-setup manually. I tried and rt-entsk.service
> > ERROR disappears. Also, In above 12 hours tests in Comment 18, I've tried
> > with latest tuned version and it looks good.
> 
> Yes, you're right about bug 1666562. But note that we also have bug 1669294,
> which doesn't seem to happen 100% of the time. So, before starting the
> test you can check /var/log/tuned.log and see if you have this error
> message:
> 
> 2019-01-16 03:22:34,859 ERROR    tuned.units.manager: BUG: Unhandled
> exception in start_tuning: dictionary changed size during iteration
> If you do, reboot & try again. Note that the error message has to
> contain "start_tuning" to be harmful, if it contains "stop_tuning"
> then it's fine to run the test since it means the bug triggered
> when rebooting.

Make sense. I checked and this "start_tuning" error still exists in latest tuned version. So I agree and I'll change back to use old tuned-2.10.0-7.el8.noarch version.

Thanks you Luiz for reminding.


Best regards,
Pei

Comment 23 Luiz Capitulino 2019-01-25 14:59:33 UTC

Pei,

You can try rebooting to see if it goes away. On my system, it takes two or three
reboots for it to trigger.

Comment 24 Pei Zhang 2019-01-29 00:57:05 UTC

Testing kernel with Luiz's patch, full suit cases tests of 24 hours get pass. Max latency looks good.

(1)Single VM with 1 rt vCPU:
# Min Latencies: 00007
# Avg Latencies: 00014
# Max Latencies: 00033

(2)Multiple VMs each with 1 rt vCPU:
-VM1
# Min Latencies: 00007
# Avg Latencies: 00014
# Max Latencies: 00037

-VM2
# Min Latencies: 00007
# Avg Latencies: 00014
# Max Latencies: 00031

-VM3
# Min Latencies: 00007
# Avg Latencies: 00014
# Max Latencies: 00035

-VM4
# Min Latencies: 00007
# Avg Latencies: 00014
# Max Latencies: 00032

(3)Single VM with 8 rt vCPUs:
# Min Latencies: 00013 00007 00012 00012 00012 00014 00013 00013
# Avg Latencies: 00017 00014 00017 00017 00017 00017 00017 00017
# Max Latencies: 00036 00026 00029 00028 00029 00029 00028 00034


Versions:
kernel-rt-4.18.0-60.rt9.112.el8.test.x86_64
libvirt-4.10.0-1.module+el8+2317+367e35b5.x86_64
tuned-2.10.0-7.el8.noarch
qemu-kvm-3.1.0-7.module+el8+2715+f4b84bed.x86_64


Beaker job: https://beaker.engineering.redhat.com/recipes/6431467#tasks

Comment 25 Pei Zhang 2019-01-29 01:00:02 UTC

(In reply to Luiz Capitulino from comment #23)
> Pei,
> 
> You can try rebooting to see if it goes away. On my system, it takes two or
> three
> reboots for it to trigger.

OK, I'll have a try manually. Thanks.

Best regards,
Pei

Comment 26 Luiz Capitulino 2019-01-29 13:29:04 UTC

Here's a description of the test-case that reproduces the bug.

In the host, we isolate one or more CPUs (that is, we move off all
user-space tasks, all kernel threads that can be moved, set irq
masks, workqueue cpumasks, etc). Then, we start a guest with two
CPUs, each of them pinned to an isolated CPU in the host. In the guest,
one of the CPUs is isolated just like in the host, the other is not
so that it can run housekeeping tasks. In the host, the guest CPU
threads are assigned RT priority.

In the guest, we start a hog thread, and pin it to the isolated CPU.
We also run cyclictest on that CPU, with the intent of measuring
all the interruptions the hog thread would incur.

The issue described in this bug happens towards the end of the test-run,
for tests with at least 30 minutes of duration.

Comment 27 Juri Lelli 2019-02-12 12:58:52 UTC

(In reply to Luiz Capitulino from comment #26)
> Here's a description of the test-case that reproduces the bug.
> 
> In the host, we isolate one or more CPUs (that is, we move off all
> user-space tasks, all kernel threads that can be moved, set irq
> masks, workqueue cpumasks, etc). Then, we start a guest with two
> CPUs, each of them pinned to an isolated CPU in the host. In the guest,
> one of the CPUs is isolated just like in the host, the other is not
> so that it can run housekeeping tasks. In the host, the guest CPU
> threads are assigned RT priority.
> 
> In the guest, we start a hog thread, and pin it to the isolated CPU.
> We also run cyclictest on that CPU, with the intent of measuring
> all the interruptions the hog thread would incur.
> 
> The issue described in this bug happens towards the end of the test-run,
> for tests with at least 30 minutes of duration.

Hi Luiz,

I believe I've been able to tune a system to achieve acceptable latencies
(both host and guest). So, I'd like to reproduce the problem.

Re-reading the comments, however, it's not clear to me whether the hog
thread is supposed to be running in the host or in the guest and if it's
a FIFO or NORMAL task. If it's FIFO, what is the priority it is running at?
What about cyclictest and kwatchdogd(s)?

Could you please clarify?

Thanks!

Comment 29 Juri Lelli 2019-02-21 14:30:40 UTC

I've been able to reproduce both symptoms (latency spike and tsc
unstable warning) running the scripts Luiz shared on a system
configured following kvm-rt tuning guide.

However, I'm not able to reproduce the tsc unstable warning anymore
after upgrading to latest version of python-linux-procfs
(https://bugzilla.redhat.com/show_bug.cgi?id=1669294).
My understanding is that, before the above package got fixed,
tuned was failing to apply required tuning at boot and this
was causing, for example, ktimersoftd to run at the same
priority of the isolated guest vcpu. It couldn't thus preempt
the vcpu when required (for running clocksource watchdog
calculations) and was ending up seeing huge clock skew when
finally got to run (thus disabling tsc and printing the
warning).

That said, I'm still seeing differences between having
clocksource watchdog enabled or not, but not so big. In tests
I performed (a couple of hours runtime) having watchdog
enabled showed ~50us max latency against ~30us when disabling
it. I also performed a test disabling the watchdog only
inside the guest and this one showed ~30us max latency as
well.

So, I was wondering if it could make sense to run the
tests again on the reference test box(es) just to check
where we stand after packages affecting tuned behavior
got fixed. In case we still see latencies that are not
acceptable it would maybe interesting to disable clocksource
watchdog inside the guest only.

I build a kernel where an additional cmdline can enable
(by default) or disable the watchdog. "nocswd" disables
it at boot. It might be useful to quickly test different
configurations:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=20274388

Testing with latest kernel-rt can work as well of course
to see where we stand.

Thanks!

Comment 30 Luiz Capitulino 2019-02-21 19:02:17 UTC

(In reply to Juri Lelli from comment #29)
> I've been able to reproduce both symptoms (latency spike and tsc
> unstable warning) running the scripts Luiz shared on a system
> configured following kvm-rt tuning guide.
> 
> However, I'm not able to reproduce the tsc unstable warning anymore
> after upgrading to latest version of python-linux-procfs
> (https://bugzilla.redhat.com/show_bug.cgi?id=1669294).
> My understanding is that, before the above package got fixed,
> tuned was failing to apply required tuning at boot and this
> was causing, for example, ktimersoftd to run at the same
> priority of the isolated guest vcpu. It couldn't thus preempt
> the vcpu when required (for running clocksource watchdog
> calculations) and was ending up seeing huge clock skew when
> finally got to run (thus disabling tsc and printing the
> warning).

This makes a lot of sense! Thanks a lot for checking this out.
Now with bug 1669294 fixed, we went from total catastrophe
to having a spike.
 
> That said, I'm still seeing differences between having
> clocksource watchdog enabled or not, but not so big. In tests
> I performed (a couple of hours runtime) having watchdog
> enabled showed ~50us max latency against ~30us when disabling
> it. I also performed a test disabling the watchdog only
> inside the guest and this one showed ~30us max latency as
> well.
> 
> So, I was wondering if it could make sense to run the
> tests again on the reference test box(es) just to check
> where we stand after packages affecting tuned behavior
> got fixed. In case we still see latencies that are not
> acceptable it would maybe interesting to disable clocksource
> watchdog inside the guest only.
> 
> I build a kernel where an additional cmdline can enable
> (by default) or disable the watchdog. "nocswd" disables
> it at boot. It might be useful to quickly test different
> configurations:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=20274388

I think it makes sense to do this testing. However, going
from 30us to 50us is still a big spike. So, I still think
we have to disable the clocksource watchdog.

Pei, can you run this testing? Basically, you have to install
the fixed version of python-linux-procfs and run all three
test-cases against latest RHEL8 RT kernel and Juri's kernel.
I think running for 1 hour is enough.

Btw, Juri, I don't think your comment has to be private. So, if you
make your comment public, you can make mine too.

> 
> Testing with latest kernel-rt can work as well of course
> to see where we stand.
> 
> Thanks!

Comment 31 Juri Lelli 2019-02-21 19:16:16 UTC

(In reply to Luiz Capitulino from comment #30)

[...]

> Pei, can you run this testing? Basically, you have to install
> the fixed version of python-linux-procfs and run all three
> test-cases against latest RHEL8 RT kernel and Juri's kernel.
> I think running for 1 hour is enough.

Maybe also double check that ktimersoftd(s) have FIFO:3 priority,
ksoftirqd(s) have FIFO:2 priority and that timer_migration is
disabled (/proc/sys/kernel/timer_migration == 0) before starting
the tests (if possible). Both host and guest.

Comment 32 Pei Zhang 2019-02-22 11:28:16 UTC

Hi Luiz, Juri,

Currently I just finished one test case testing based on above testing requirements, there is no huge latency any more, however the max latency still exceeds 40us for both 2 versions:

(1) With 4.18.0-68.rt9.123.el8.x86_64

Test scenario: Single VM with single rt vCPU

1 hour max latency is:

# Min Latencies: 00007
# Avg Latencies: 00016
# Max Latencies: 00045


(2) With 4.18.0-68.rt9.123.el8.nocswd.x86_64 (build provided by Juri)

Test scenario: Single VM with single rt vCPU

Host kernel line (adding nocswd):

# cat /proc/cmdline 
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-68.rt9.123.el8.nocswd.x86_64 root=/dev/mapper/rhel_dell--per430--11-root ro crashkernel=auto resume=/dev/mapper/rhel_dell--per430--11-swap rd.lvm.lv=rhel_dell-per430-11/root rd.lvm.lv=rhel_dell-per430-11/swap console=ttyS0,115200n81 skew_tick=1 isolcpus=2,4,6,8,10,12,14,16,18,19,17,15,13 intel_pstate=disable nosoftlockup nohz=on nohz_full=2,4,6,8,10,12,14,16,18,19,17,15,13 rcu_nocbs=2,4,6,8,10,12,14,16,18,19,17,15,13 nocswd default_hugepagesz=1G iommu=pt intel_iommu=on

Guest kernel line(adding nocswd):

BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-68.rt9.123.el8.nocswd.x86_64 root=/dev/mapper/rhel_vm--74--52-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto resume=/dev/mapper/rhel_vm--74--52-swap rd.lvm.lv=rhel_vm-74-52/root rd.lvm.lv=rhel_vm-74-52/swap skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1 nocswd default_hugepagesz=1G iommu=pt intel_iommu=on

1 hour max latency is:

# Min Latencies: 00008
# Avg Latencies: 00015
# Max Latencies: 00049



Other infos for both 2 versions:

1. Versions besides kernel-rt:

tuned-2.10.0-14.el8.noarch
python3-linux-procfs-0.6-6.el8.noarch

2. For both host and guest:

# cat /proc/sys/kernel/timer_migration
0

Juri, could you please share how to check the FIFO value of ktimersoftd(s) and ksoftirqd(s)? Thanks.


Best regards,
Pei

Comment 33 Juri Lelli 2019-02-22 11:52:22 UTC

(In reply to Pei Zhang from comment #32)
> Hi Luiz, Juri,
> 
> Currently I just finished one test case testing based on above testing
> requirements, there is no huge latency any more, however the max latency
> still exceeds 40us for both 2 versions:
> 
> (1) With 4.18.0-68.rt9.123.el8.x86_64
> 
> Test scenario: Single VM with single rt vCPU
> 
> 1 hour max latency is:
> 
> # Min Latencies: 00007
> # Avg Latencies: 00016
> # Max Latencies: 00045
> 
> 
> (2) With 4.18.0-68.rt9.123.el8.nocswd.x86_64 (build provided by Juri)
> 
> Test scenario: Single VM with single rt vCPU
> 
> Host kernel line (adding nocswd):
> 
> # cat /proc/cmdline 
> BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-68.rt9.123.el8.nocswd.x86_64
> root=/dev/mapper/rhel_dell--per430--11-root ro crashkernel=auto
> resume=/dev/mapper/rhel_dell--per430--11-swap
> rd.lvm.lv=rhel_dell-per430-11/root rd.lvm.lv=rhel_dell-per430-11/swap
> console=ttyS0,115200n81 skew_tick=1
> isolcpus=2,4,6,8,10,12,14,16,18,19,17,15,13 intel_pstate=disable
> nosoftlockup nohz=on nohz_full=2,4,6,8,10,12,14,16,18,19,17,15,13
> rcu_nocbs=2,4,6,8,10,12,14,16,18,19,17,15,13 nocswd default_hugepagesz=1G
> iommu=pt intel_iommu=on
> 
> Guest kernel line(adding nocswd):
> 
> BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-68.rt9.123.el8.nocswd.x86_64
> root=/dev/mapper/rhel_vm--74--52-root ro console=tty0 console=ttyS0,115200n8
> biosdevname=0 crashkernel=auto resume=/dev/mapper/rhel_vm--74--52-swap
> rd.lvm.lv=rhel_vm-74-52/root rd.lvm.lv=rhel_vm-74-52/swap skew_tick=1
> isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1
> nocswd default_hugepagesz=1G iommu=pt intel_iommu=on
> 
> 1 hour max latency is:
> 
> # Min Latencies: 00008
> # Avg Latencies: 00015
> # Max Latencies: 00049

Mmm. Would you be able to trace both guest and host while idle with
something like the following?

trace-cmd start -e 'sched:sched_wakeup' -e 'sched:sched_migrate_task' -e 'sched:sched_switch' -e 'timer:*' -e 'irq:*' -b 10000 && sleep 5 && trace-cmd stop && trace-cmd extract

trace-cmd report 2>/dev/null | grep clocksource_watchdog | wc -l

Just to check that clocksource watchdog is in fact disabled.

> 
> 
> 
> Other infos for both 2 versions:
> 
> 1. Versions besides kernel-rt:
> 
> tuned-2.10.0-14.el8.noarch
> python3-linux-procfs-0.6-6.el8.noarch
> 
> 2. For both host and guest:
> 
> # cat /proc/sys/kernel/timer_migration
> 0
> 
> Juri, could you please share how to check the FIFO value of ktimersoftd(s)
> and ksoftirqd(s)? Thanks.

Sure.

for PID in $(pgrep ksoftirqd); do chrt -p ${PID}; done

for PID in $(pgrep ktimersoftd); do chrt -p ${PID}; done

Thanks.

Comment 34 Juri Lelli 2019-02-22 12:33:48 UTC

Also, what is the value for the following (on the host) you run with?

/sys/module/kvm/parameters/lapic_timer_advance_ns

I saw that it has quite an impact on latencies. (Anything below 3000
doesn't really work for me).

Comment 35 Pei Zhang 2019-02-22 15:50:36 UTC

(In reply to Juri Lelli from comment #33)
> 
> for PID in $(pgrep ksoftirqd); do chrt -p ${PID}; done

pid 10's current scheduling policy: SCHED_FIFO
pid 10's current scheduling priority: 2
...

> for PID in $(pgrep ktimersoftd); do chrt -p ${PID}; done

pid 11's current scheduling policy: SCHED_FIFO
pid 11's current scheduling priority: 3
...

> Thanks.


(In reply to Juri Lelli from comment #34)
> Also, what is the value for the following (on the host) you run with?
> 
> /sys/module/kvm/parameters/lapic_timer_advance_ns
> 
> I saw that it has quite an impact on latencies. (Anything below 3000
> doesn't really work for me).

# cat /sys/module/kvm/parameters/lapic_timer_advance_ns
0

Comment 36 Juri Lelli 2019-02-22 15:56:17 UTC

(In reply to Pei Zhang from comment #35)
> (In reply to Juri Lelli from comment #33)
> > 
> > for PID in $(pgrep ksoftirqd); do chrt -p ${PID}; done
> 
> pid 10's current scheduling policy: SCHED_FIFO
> pid 10's current scheduling priority: 2
> ...
> 
> > for PID in $(pgrep ktimersoftd); do chrt -p ${PID}; done
> 
> pid 11's current scheduling policy: SCHED_FIFO
> pid 11's current scheduling priority: 3
> ...
> 

They look OK. Thanks for checking.

> (In reply to Juri Lelli from comment #34)
> > Also, what is the value for the following (on the host) you run with?
> > 
> > /sys/module/kvm/parameters/lapic_timer_advance_ns
> > 
> > I saw that it has quite an impact on latencies. (Anything below 3000
> > doesn't really work for me).
> 
> # cat /sys/module/kvm/parameters/lapic_timer_advance_ns
> 0

This instead I think needs tuning. I'd start with 1000, but, as said,
in my test box I couldn't get below 40us latencies until I put 3000 in
this parameter (I believe tuning guide advise this tuning strategy as
well).

Could you please try again with 1000 and increments of 500 until we
hopefully see acceptable latencies? I'd anyway stop at 5000.

Thanks!

Comment 37 Pei Zhang 2019-02-22 16:01:55 UTC

(In reply to Juri Lelli from comment #33)
[...]
> 
> Mmm. Would you be able to trace both guest and host while idle with
> something like the following?
> 
> trace-cmd start -e 'sched:sched_wakeup' -e 'sched:sched_migrate_task' -e
> 'sched:sched_switch' -e 'timer:*' -e 'irq:*' -b 10000 && sleep 5 &&
> trace-cmd stop && trace-cmd extract
> 
> trace-cmd report 2>/dev/null | grep clocksource_watchdog | wc -l
> 
> Just to check that clocksource watchdog is in fact disabled.
> 

# trace-cmd start -e 'sched:sched_wakeup' -e 'sched:sched_migrate_task' -e 'sched:sched_switch' -e 'timer:*' -e 'irq:*' -b 10000 && sleep 5 && trace-cmd stop && trace-cmd extract
CPU0 data recorded at offset=0x5b5000
    3477504 bytes in size
CPU1 data recorded at offset=0x906000
    0 bytes in size

# trace-cmd report 2>/dev/null | grep clocksource_watchdog | wc -l
0

Is this expected?

Comment 38 Pei Zhang 2019-02-22 16:03:41 UTC

(In reply to Juri Lelli from comment #36)
> (In reply to Pei Zhang from comment #35)
> > (In reply to Juri Lelli from comment #33)
> > > 
> > > for PID in $(pgrep ksoftirqd); do chrt -p ${PID}; done
> > 
> > pid 10's current scheduling policy: SCHED_FIFO
> > pid 10's current scheduling priority: 2
> > ...
> > 
> > > for PID in $(pgrep ktimersoftd); do chrt -p ${PID}; done
> > 
> > pid 11's current scheduling policy: SCHED_FIFO
> > pid 11's current scheduling priority: 3
> > ...
> > 
> 
> They look OK. Thanks for checking.
> 
> > (In reply to Juri Lelli from comment #34)
> > > Also, what is the value for the following (on the host) you run with?
> > > 
> > > /sys/module/kvm/parameters/lapic_timer_advance_ns
> > > 
> > > I saw that it has quite an impact on latencies. (Anything below 3000
> > > doesn't really work for me).
> > 
> > # cat /sys/module/kvm/parameters/lapic_timer_advance_ns
> > 0
> 
> This instead I think needs tuning. I'd start with 1000, but, as said,
> in my test box I couldn't get below 40us latencies until I put 3000 in
> this parameter (I believe tuning guide advise this tuning strategy as
> well).
> 
> Could you please try again with 1000 and increments of 500 until we
> hopefully see acceptable latencies? I'd anyway stop at 5000.

OK. I'll have a try.

Best regards,
Pei

Comment 39 Juri Lelli 2019-02-22 16:06:10 UTC

(In reply to Pei Zhang from comment #37)
> (In reply to Juri Lelli from comment #33)
> [...]
> > 
> > Mmm. Would you be able to trace both guest and host while idle with
> > something like the following?
> > 
> > trace-cmd start -e 'sched:sched_wakeup' -e 'sched:sched_migrate_task' -e
> > 'sched:sched_switch' -e 'timer:*' -e 'irq:*' -b 10000 && sleep 5 &&
> > trace-cmd stop && trace-cmd extract
> > 
> > trace-cmd report 2>/dev/null | grep clocksource_watchdog | wc -l
> > 
> > Just to check that clocksource watchdog is in fact disabled.
> > 
> 
> # trace-cmd start -e 'sched:sched_wakeup' -e 'sched:sched_migrate_task' -e
> 'sched:sched_switch' -e 'timer:*' -e 'irq:*' -b 10000 && sleep 5 &&
> trace-cmd stop && trace-cmd extract
> CPU0 data recorded at offset=0x5b5000
>     3477504 bytes in size
> CPU1 data recorded at offset=0x906000
>     0 bytes in size
> 
> # trace-cmd report 2>/dev/null | grep clocksource_watchdog | wc -l
> 0
> 
> Is this expected?

If kernel booted with "nocswd" yes.

Getting something different from 0 would mean that some watchdog has
been running.

Comment 40 Luiz Capitulino 2019-02-22 19:34:34 UTC

(In reply to Juri Lelli from comment #36)

> > (In reply to Juri Lelli from comment #34)
> > > Also, what is the value for the following (on the host) you run with?
> > > 
> > > /sys/module/kvm/parameters/lapic_timer_advance_ns
> > > 
> > > I saw that it has quite an impact on latencies. (Anything below 3000
> > > doesn't really work for me).
> > 
> > # cat /sys/module/kvm/parameters/lapic_timer_advance_ns
> > 0
> 
> This instead I think needs tuning. I'd start with 1000, but, as said,
> in my test box I couldn't get below 40us latencies until I put 3000 in
> this parameter (I believe tuning guide advise this tuning strategy as
> well).
> 
> Could you please try again with 1000 and increments of 500 until we
> hopefully see acceptable latencies? I'd anyway stop at 5000.

Juri,

I think you just found another important issue! This manual work you're
(rightfully) doing, is supposed to be automated by the tuned host profile
and hence lapic_timer_advance_ns is supposed to be non-zero. If that's
not the case, then that's probably a bug. Would you please open a BZ? The
component should be tuned, and the affected profile is the realtime-virtual-host
one. You can assign it to me.

Comment 41 Pei Zhang 2019-02-23 01:48:17 UTC

(In reply to Luiz Capitulino from comment #40)
> (In reply to Juri Lelli from comment #36)
> 
> > > (In reply to Juri Lelli from comment #34)
> > > > Also, what is the value for the following (on the host) you run with?
> > > > 
> > > > /sys/module/kvm/parameters/lapic_timer_advance_ns
> > > > 
> > > > I saw that it has quite an impact on latencies. (Anything below 3000
> > > > doesn't really work for me).
> > > 
> > > # cat /sys/module/kvm/parameters/lapic_timer_advance_ns
> > > 0
> > 
> > This instead I think needs tuning. I'd start with 1000, but, as said,
> > in my test box I couldn't get below 40us latencies until I put 3000 in
> > this parameter (I believe tuning guide advise this tuning strategy as
> > well).
> > 
> > Could you please try again with 1000 and increments of 500 until we
> > hopefully see acceptable latencies? I'd anyway stop at 5000.
> 
> Juri,
> 
> I think you just found another important issue! This manual work you're
> (rightfully) doing, is supposed to be automated by the tuned host profile
> and hence lapic_timer_advance_ns is supposed to be non-zero. If that's
> not the case, then that's probably a bug. Would you please open a BZ? The
> component should be tuned, and the affected profile is the
> realtime-virtual-host
> one. You can assign it to me.

Luiz, I've filed this bz[1], already cc you and Juri.

[1]Bug 1680199 - default value of lapic_timer_advance_ns is set 0, but we expect non-zero value

Comment 42 Pei Zhang 2019-02-25 15:44:20 UTC

No lapic_timer_advance_ns  version                      max_latency
1  0                       4.18.0-71.rt9.127.el8         63us
2  0                       4.18.0-68.rt9.123.el8.nocswd  67us
3  1000                    4.18.0-71.rt9.127.el8         46us
4  1000                    4.18.0-68.rt9.123.el8.nocswd  47us
5  2000                    4.18.0-71.rt9.127.el8         47us
6  2000                    4.18.0-68.rt9.123.el8.nocswd  38us

Beak links:
1.https://beaker.engineering.redhat.com/jobs/3361245
2.https://beaker.engineering.redhat.com/jobs/3361244
3.https://beaker.engineering.redhat.com/jobs/3362923 (manaully run the 3rd scenario in this run)
4.https://beaker.engineering.redhat.com/jobs/3362924
5.https://beaker.engineering.redhat.com/jobs/3363722
6.https://beaker.engineering.redhat.com/jobs/3363720

Comment 43 Pei Zhang 2019-02-25 15:45:54 UTC

In Comment 42, I tested all 3 kvm-rt standard scenarios and each running 1 hour cyclictest.

Comment 44 Juri Lelli 2019-02-25 15:52:21 UTC

Thanks for doing all the runs!

Even though run 6. is meeting latency requirements already,
given the trend, I wonder if it would make sense to further
test with 2500, 3000 (at least) and see if latency keeps
getting smaller.

Luiz?

Comment 45 Luiz Capitulino 2019-02-26 13:24:43 UTC

Yeah, I think it's good to have the numbers. But I don't think we can avoid
disabling the clocksource watchdog: if it's not causing a spike, then the
timer is probably starving on the cores where there's vcpu threads pinned
running the test-case.

Also, Pei, did you run the nocswd kernel in the guest? If not, you should
since the clocksource watchdog will run there too.

Comment 46 Juri Lelli 2019-02-26 13:42:09 UTC

(In reply to Luiz Capitulino from comment #45)
> Yeah, I think it's good to have the numbers. But I don't think we can avoid
> disabling the clocksource watchdog: if it's not causing a spike, then the
> timer is probably starving on the cores where there's vcpu threads pinned
> running the test-case.

Mmm, why so? ktimersoftd(s) run it, and they should have higher priority than
the vcpus threads on the host.

That said, I'm not in principle against disabling the watchdog. Just making
sure we agree on why we are disabling it and potentially collecting the best
info we can collect to report upstream.

> Also, Pei, did you run the nocswd kernel in the guest? If not, you should
> since the clocksource watchdog will run there too.

Agree. Quite a lot of configurations of configurations to test (sorry), but
I believe it's good info at the end.

Thanks!

Comment 47 Luiz Capitulino 2019-02-26 13:53:36 UTC

(In reply to Juri Lelli from comment #46)
> (In reply to Luiz Capitulino from comment #45)
> > Yeah, I think it's good to have the numbers. But I don't think we can avoid
> > disabling the clocksource watchdog: if it's not causing a spike, then the
> > timer is probably starving on the cores where there's vcpu threads pinned
> > running the test-case.
> 
> Mmm, why so? ktimersoftd(s) run it, and they should have higher priority than
> the vcpus threads on the host.

Oh, you're right. For some reason I thought they had the same priority, but
you're right that ktimersoftd is higher.
 
> That said, I'm not in principle against disabling the watchdog. Just making
> sure we agree on why we are disabling it and potentially collecting the best
> info we can collect to report upstream.

So, I'm a bit surprised that the clocksource watchdog is not causing a higher
latency in cyclictest. In any case, it's causing interruptions which is also
bad for DPDK. There's a tool we use sometimes to collect interruptions
called sysjitter, but you can probably do this with tracing.

> 
> > Also, Pei, did you run the nocswd kernel in the guest? If not, you should
> > since the clocksource watchdog will run there too.
> 
> Agree. Quite a lot of configurations of configurations to test (sorry), but
> I believe it's good info at the end.

Yeah, this all looks good to me.

Comment 48 Pei Zhang 2019-02-28 11:01:57 UTC

(In reply to Juri Lelli from comment #44)
> Thanks for doing all the runs!
> 
> Even though run 6. is meeting latency requirements already,
> given the trend, I wonder if it would make sense to further
> test with 2500, 3000 (at least) and see if latency keeps
> getting smaller.

I run with lapic_timer_advance_ns=3000, each version running 2 times. 

No lapic_timer_advance_ns  version                      max_latency
1  3000                    4.18.0-73.rt9.129.el8         48us
2  3000                    4.18.0-73.rt9.129.el8         48us
3  3000                    4.18.0-68.rt9.123.el8.nocswd  31us
2  3000                    4.18.0-68.rt9.123.el8.nocswd  37us

Beak links:
1.https://beaker.engineering.redhat.com/jobs/3369048
2.https://beaker.engineering.redhat.com/jobs/3366432
3.https://beaker.engineering.redhat.com/jobs/3369047
4.https://beaker.engineering.redhat.com/jobs/3366436

Comment 49 Pei Zhang 2019-02-28 11:03:39 UTC

(In reply to Luiz Capitulino from comment #45)
> Yeah, I think it's good to have the numbers. But I don't think we can avoid
> disabling the clocksource watchdog: if it's not causing a spike, then the
> timer is probably starving on the cores where there's vcpu threads pinned
> running the test-case.
> 
> Also, Pei, did you run the nocswd kernel in the guest? If not, you should
> since the clocksource watchdog will run there too.

Yes, I'm running nocswd kernel in both host and guest. In our testings, I always run same versions on host and guest.

Comment 50 Juri Lelli 2019-03-01 09:48:44 UTC

I just discovered that clocksource watchdog doesn't run if
"tsc=reliable" is added to the kernel cmdline.

This should allow us to fix the issue at hand w/o additional
kernel side patches.

Worth a try?

Comment 51 Pei Zhang 2019-03-01 09:59:40 UTC

(In reply to Juri Lelli from comment #50)
> I just discovered that clocksource watchdog doesn't run if
> "tsc=reliable" is added to the kernel cmdline.
> 
> This should allow us to fix the issue at hand w/o additional
> kernel side patches.
> 
> Worth a try?

Hi Juri,

Do you mean testing normal kernel-rt with adding tsc=reliable in both host and guest kernel line?

Also add lapic_timer_advance_ns=3000,right?

I can submit a 24 hours run with all 3 testing scenarios this weekend.  

Thanks.

Comment 52 Juri Lelli 2019-03-01 10:04:46 UTC

(In reply to Pei Zhang from comment #51)
> (In reply to Juri Lelli from comment #50)
> > I just discovered that clocksource watchdog doesn't run if
> > "tsc=reliable" is added to the kernel cmdline.
> > 
> > This should allow us to fix the issue at hand w/o additional
> > kernel side patches.
> > 
> > Worth a try?
> 
> Hi Juri,
> 
> Do you mean testing normal kernel-rt with adding tsc=reliable in both host
> and guest kernel line?

Yes, add "tsc=reliable" to the command line you already use for
running tests. Both host and guest. And use latest (unpatched)
kernel-rt.

> 
> Also add lapic_timer_advance_ns=3000, right?

Yes, that or best value, if tuned gets fixed.

For now, 3000 should be anyway OK.

> I can submit a 24 hours run with all 3 testing scenarios this weekend.  

That would be cool, thanks.

Comment 53 Pei Zhang 2019-03-01 10:12:03 UTC

(In reply to Juri Lelli from comment #52)
> (In reply to Pei Zhang from comment #51)
> > (In reply to Juri Lelli from comment #50)
> > > I just discovered that clocksource watchdog doesn't run if
> > > "tsc=reliable" is added to the kernel cmdline.
> > > 
> > > This should allow us to fix the issue at hand w/o additional
> > > kernel side patches.
> > > 
> > > Worth a try?
> > 
> > Hi Juri,
> > 
> > Do you mean testing normal kernel-rt with adding tsc=reliable in both host
> > and guest kernel line?
> 
> Yes, add "tsc=reliable" to the command line you already use for
> running tests. Both host and guest. And use latest (unpatched)
> kernel-rt.
> 
> > 
> > Also add lapic_timer_advance_ns=3000, right?
> 
> Yes, that or best value, if tuned gets fixed.
> 
> For now, 3000 should be anyway OK.
> 
> > I can submit a 24 hours run with all 3 testing scenarios this weekend.  
> 
> That would be cool, thanks.

Well noted. I'll test following these requirements. Thanks.

Comment 54 Juri Lelli 2019-03-01 10:19:59 UTC

Oh, actually I would keep kernel-rt version at 4.18.0-71.rt9.127.el8

I believe there are changes in newer kernels that might introduce
variability that we don't want at this stage.

Thanks!

Comment 55 Pei Zhang 2019-03-01 13:06:49 UTC

(In reply to Juri Lelli from comment #54)
> Oh, actually I would keep kernel-rt version at 4.18.0-71.rt9.127.el8
> 
> I believe there are changes in newer kernels that might introduce
> variability that we don't want at this stage.

Ok. Get it. I'm submitting both 4.18.0-71.rt9.127.el8 and 4.18.0-74.rt9.130.el8(latest one) separately on 2 servers.
 
> Thanks!

Comment 56 Luiz Capitulino 2019-03-01 13:29:40 UTC

(In reply to Juri Lelli from comment #50)
> I just discovered that clocksource watchdog doesn't run if
> "tsc=reliable" is added to the kernel cmdline.
> 
> This should allow us to fix the issue at hand w/o additional
> kernel side patches.
> 
> Worth a try?

Honest question: does tsc=reliable disable all tsc checks?
If yes, what happens if we pass tsc=reliable and the tsc is
not reliable? I guess we wouldn't be warned, right? I'm not sure
we can afford that.

Marcelo, what you think about this?

Btw, I think we'd have to hardcode tsc=reliable in the tuned
profiles...

Comment 57 Juri Lelli 2019-03-01 13:39:44 UTC

(In reply to Luiz Capitulino from comment #56)
> (In reply to Juri Lelli from comment #50)
> > I just discovered that clocksource watchdog doesn't run if
> > "tsc=reliable" is added to the kernel cmdline.
> > 
> > This should allow us to fix the issue at hand w/o additional
> > kernel side patches.
> > 
> > Worth a try?
> 
> Honest question: does tsc=reliable disable all tsc checks?

Not sure if it does something else then disabling the clocksource
watchdog.

> If yes, what happens if we pass tsc=reliable and the tsc is
> not reliable? I guess we wouldn't be warned, right? I'm not sure
> we can afford that.

But, then won't we be in the same situation as with disabling
clocksource watchdog only? That is the thing, AFAIK, that checks
for TSC reliability at runtime.

> 
> Marcelo, what you think about this?
> 
> Btw, I think we'd have to hardcode tsc=reliable in the tuned
> profiles...

Comment 58 Luiz Capitulino 2019-03-01 13:42:38 UTC

(In reply to Juri Lelli from comment #57)
> (In reply to Luiz Capitulino from comment #56)
> > (In reply to Juri Lelli from comment #50)
> > > I just discovered that clocksource watchdog doesn't run if
> > > "tsc=reliable" is added to the kernel cmdline.
> > > 
> > > This should allow us to fix the issue at hand w/o additional
> > > kernel side patches.
> > > 
> > > Worth a try?
> > 
> > Honest question: does tsc=reliable disable all tsc checks?
> 
> Not sure if it does something else then disabling the clocksource
> watchdog.
> 
> > If yes, what happens if we pass tsc=reliable and the tsc is
> > not reliable? I guess we wouldn't be warned, right? I'm not sure
> > we can afford that.
> 
> But, then won't we be in the same situation as with disabling
> clocksource watchdog only? That is the thing, AFAIK, that checks
> for TSC reliability at runtime.

I think there are reliability checks at boot too.

> 
> > 
> > Marcelo, what you think about this?
> > 
> > Btw, I think we'd have to hardcode tsc=reliable in the tuned
> > profiles...

Comment 59 Juri Lelli 2019-03-01 13:53:28 UTC

(In reply to Luiz Capitulino from comment #58)
> (In reply to Juri Lelli from comment #57)
> > (In reply to Luiz Capitulino from comment #56)
> > > (In reply to Juri Lelli from comment #50)
> > > > I just discovered that clocksource watchdog doesn't run if
> > > > "tsc=reliable" is added to the kernel cmdline.
> > > > 
> > > > This should allow us to fix the issue at hand w/o additional
> > > > kernel side patches.
> > > > 
> > > > Worth a try?
> > > 
> > > Honest question: does tsc=reliable disable all tsc checks?
> > 
> > Not sure if it does something else then disabling the clocksource
> > watchdog.
> > 
> > > If yes, what happens if we pass tsc=reliable and the tsc is
> > > not reliable? I guess we wouldn't be warned, right? I'm not sure
> > > we can afford that.
> > 
> > But, then won't we be in the same situation as with disabling
> > clocksource watchdog only? That is the thing, AFAIK, that checks
> > for TSC reliability at runtime.
> 
> I think there are reliability checks at boot too.

Mmm, but even in this case, if tsc becomes unreliable after boot and
we don't detect that, bad things would happen as well, no?

So, it seems to me that on systems where we can afford to disable the
watchdog, we better be sure that TSC is stable indeed.

> 
> > 
> > > 
> > > Marcelo, what you think about this?
> > > 
> > > Btw, I think we'd have to hardcode tsc=reliable in the tuned
> > > profiles...

Comment 60 Luiz Capitulino 2019-03-01 14:06:58 UTC

(In reply to Juri Lelli from comment #59)
> (In reply to Luiz Capitulino from comment #58)
> > (In reply to Juri Lelli from comment #57)
> > > (In reply to Luiz Capitulino from comment #56)
> > > > (In reply to Juri Lelli from comment #50)
> > > > > I just discovered that clocksource watchdog doesn't run if
> > > > > "tsc=reliable" is added to the kernel cmdline.
> > > > > 
> > > > > This should allow us to fix the issue at hand w/o additional
> > > > > kernel side patches.
> > > > > 
> > > > > Worth a try?
> > > > 
> > > > Honest question: does tsc=reliable disable all tsc checks?
> > > 
> > > Not sure if it does something else then disabling the clocksource
> > > watchdog.
> > > 
> > > > If yes, what happens if we pass tsc=reliable and the tsc is
> > > > not reliable? I guess we wouldn't be warned, right? I'm not sure
> > > > we can afford that.
> > > 
> > > But, then won't we be in the same situation as with disabling
> > > clocksource watchdog only? That is the thing, AFAIK, that checks
> > > for TSC reliability at runtime.
> > 
> > I think there are reliability checks at boot too.
> 
> Mmm, but even in this case, if tsc becomes unreliable after boot and
> we don't detect that, bad things would happen as well, no?

Yes, but not having a runtime check has been the case for several years.
Now, the question that I have is: is the boot check not enough? Or
even unreliable so that the runtime check is in fact, necessary?

If that's the case, then I agree disabling the clocksource watchdog
will have the same effect as passing tsc=reliable.
 
> So, it seems to me that on systems where we can afford to disable the
> watchdog, we better be sure that TSC is stable indeed.

KVM-RT does require a stable tsc. I think what I'm asking is, what are
the implications of having tsc=reliable hardcoded in the tuned profiles?

Comment 61 Marcelo Tosatti 2019-03-01 14:17:41 UTC

(In reply to Luiz Capitulino from comment #56)
> (In reply to Juri Lelli from comment #50)
> > I just discovered that clocksource watchdog doesn't run if
> > "tsc=reliable" is added to the kernel cmdline.
> > 
> > This should allow us to fix the issue at hand w/o additional
> > kernel side patches.
> > 
> > Worth a try?
> 
> Honest question: does tsc=reliable disable all tsc checks?
> If yes, what happens if we pass tsc=reliable and the tsc is
> not reliable? I guess we wouldn't be warned, right? I'm not sure
> we can afford that.
> 
> Marcelo, what you think about this?
> 
> Btw, I think we'd have to hardcode tsc=reliable in the tuned
> profiles...

First of all, i don't know what are all the consequences of tsc=reliable. 

But for the clocksource watchdog, what is necessary is to disable 
clocksource watchdog on a per-CPU basis, with justification
that it adds X us to processing latency and that is not acceptable
to the workload being handled (upstream patch).
Which was not properly measured AFAIK.

From what i understand from this thread, we are not certain how
much latency TSC watchdog adds.

About TSC going unstable, on comment 13 Luiz made a hypothesis:

"In the host, since our RT threads run at 100%, the clocksource watchdog
seems to hang between clock reads, which causes it to think there's a
skew, which in turn cause it to mark the TSC as unstable."

But later on the discussion it was noted that ktimersoftd priority is higher.

So is the actual reason TSC is being marked unstable?

Comment 62 Marcelo Tosatti 2019-03-01 14:21:02 UTC

(In reply to Luiz Capitulino from comment #60)
> (In reply to Juri Lelli from comment #59)
> > (In reply to Luiz Capitulino from comment #58)
> > > (In reply to Juri Lelli from comment #57)
> > > > (In reply to Luiz Capitulino from comment #56)
> > > > > (In reply to Juri Lelli from comment #50)
> > > > > > I just discovered that clocksource watchdog doesn't run if
> > > > > > "tsc=reliable" is added to the kernel cmdline.
> > > > > > 
> > > > > > This should allow us to fix the issue at hand w/o additional
> > > > > > kernel side patches.
> > > > > > 
> > > > > > Worth a try?
> > > > > 
> > > > > Honest question: does tsc=reliable disable all tsc checks?
> > > > 
> > > > Not sure if it does something else then disabling the clocksource
> > > > watchdog.
> > > > 
> > > > > If yes, what happens if we pass tsc=reliable and the tsc is
> > > > > not reliable? I guess we wouldn't be warned, right? I'm not sure
> > > > > we can afford that.
> > > > 
> > > > But, then won't we be in the same situation as with disabling
> > > > clocksource watchdog only? That is the thing, AFAIK, that checks
> > > > for TSC reliability at runtime.
> > > 
> > > I think there are reliability checks at boot too.
> > 
> > Mmm, but even in this case, if tsc becomes unreliable after boot and
> > we don't detect that, bad things would happen as well, no?
> 
> Yes, but not having a runtime check has been the case for several years.
> Now, the question that I have is: is the boot check not enough? Or
> even unreliable so that the runtime check is in fact, necessary?
> 
> If that's the case, then I agree disabling the clocksource watchdog
> will have the same effect as passing tsc=reliable.
>  
> > So, it seems to me that on systems where we can afford to disable the
> > watchdog, we better be sure that TSC is stable indeed.
> 
> KVM-RT does require a stable tsc. I think what I'm asking is, what are
> the implications of having tsc=reliable hardcoded in the tuned profiles?

You don't check whether TSC is unstable. So you rely on properly performing hardware 
to guarantee TSC stability (which is a bad idea).

IMO should find out what is the reason for TSC being marked unstable.

Comment 63 Juri Lelli 2019-03-01 14:27:12 UTC

(In reply to Marcelo Tosatti from comment #62)
> (In reply to Luiz Capitulino from comment #60)
> > (In reply to Juri Lelli from comment #59)
> > > (In reply to Luiz Capitulino from comment #58)
> > > > (In reply to Juri Lelli from comment #57)
> > > > > (In reply to Luiz Capitulino from comment #56)
> > > > > > (In reply to Juri Lelli from comment #50)
> > > > > > > I just discovered that clocksource watchdog doesn't run if
> > > > > > > "tsc=reliable" is added to the kernel cmdline.
> > > > > > > 
> > > > > > > This should allow us to fix the issue at hand w/o additional
> > > > > > > kernel side patches.
> > > > > > > 
> > > > > > > Worth a try?
> > > > > > 
> > > > > > Honest question: does tsc=reliable disable all tsc checks?
> > > > > 
> > > > > Not sure if it does something else then disabling the clocksource
> > > > > watchdog.
> > > > > 
> > > > > > If yes, what happens if we pass tsc=reliable and the tsc is
> > > > > > not reliable? I guess we wouldn't be warned, right? I'm not sure
> > > > > > we can afford that.
> > > > > 
> > > > > But, then won't we be in the same situation as with disabling
> > > > > clocksource watchdog only? That is the thing, AFAIK, that checks
> > > > > for TSC reliability at runtime.
> > > > 
> > > > I think there are reliability checks at boot too.
> > > 
> > > Mmm, but even in this case, if tsc becomes unreliable after boot and
> > > we don't detect that, bad things would happen as well, no?
> > 
> > Yes, but not having a runtime check has been the case for several years.
> > Now, the question that I have is: is the boot check not enough? Or
> > even unreliable so that the runtime check is in fact, necessary?
> > 
> > If that's the case, then I agree disabling the clocksource watchdog
> > will have the same effect as passing tsc=reliable.
> >  
> > > So, it seems to me that on systems where we can afford to disable the
> > > watchdog, we better be sure that TSC is stable indeed.
> > 
> > KVM-RT does require a stable tsc. I think what I'm asking is, what are
> > the implications of having tsc=reliable hardcoded in the tuned profiles?
> 
> You don't check whether TSC is unstable. So you rely on properly performing
> hardware 
> to guarantee TSC stability (which is a bad idea).
> 
> IMO should find out what is the reason for TSC being marked unstable.

This was happening because tuned wasn't properly setting up ktimersoftd(s)
priorities after reboot. And this issue has been fixed.

Comment 64 Luiz Capitulino 2019-03-01 14:32:39 UTC

(In reply to Marcelo Tosatti from comment #61)

> About TSC going unstable, on comment 13 Luiz made a hypothesis:
> 
> "In the host, since our RT threads run at 100%, the clocksource watchdog
> seems to hang between clock reads, which causes it to think there's a
> skew, which in turn cause it to mark the TSC as unstable."
> 
> But later on the discussion it was noted that ktimersoftd priority is higher.
> 
> So is the actual reason TSC is being marked unstable?

Good point. Juri found out that due to a bug in tuned the ktimersoftd thread
had a lower priority than the vcpu thread, so ktimersoftd was getting stuck.
That's why there was a delta, and that's why the tsc was being marked as
unstable. Now that bug is fixed, we don't see the tsc being marked as
unstable anymore.

However, having the timer firing every few seconds does increase the
latency. That's the problem we want to fix next. What I'm hearing is
that we have two options:

1. Use tsc=reliable in the profiles. To do that, we have to understand
all the implications of having tsc=reliable (ie. what the kernel does
or doesn't do)

2. Make clocksource watchdog respect a CPU mask provided by the user

What do you guys think we should do? My only concern about item 2 is to
get stuck on upstream for too long. Maybe we could use tsc=reliable
as a workaround meanwhile?

Comment 65 Juri Lelli 2019-03-01 14:42:43 UTC

(In reply to Luiz Capitulino from comment #64)
> (In reply to Marcelo Tosatti from comment #61)
> 
> > About TSC going unstable, on comment 13 Luiz made a hypothesis:
> > 
> > "In the host, since our RT threads run at 100%, the clocksource watchdog
> > seems to hang between clock reads, which causes it to think there's a
> > skew, which in turn cause it to mark the TSC as unstable."
> > 
> > But later on the discussion it was noted that ktimersoftd priority is higher.
> > 
> > So is the actual reason TSC is being marked unstable?
> 
> Good point. Juri found out that due to a bug in tuned the ktimersoftd thread
> had a lower priority than the vcpu thread, so ktimersoftd was getting stuck.
> That's why there was a delta, and that's why the tsc was being marked as
> unstable. Now that bug is fixed, we don't see the tsc being marked as
> unstable anymore.
> 
> However, having the timer firing every few seconds does increase the
> latency. That's the problem we want to fix next. What I'm hearing is
> that we have two options:
> 
> 1. Use tsc=reliable in the profiles. To do that, we have to understand
> all the implications of having tsc=reliable (ie. what the kernel does
> or doesn't do)

I'll look into this.

> 2. Make clocksource watchdog respect a CPU mask provided by the user

And already started asking upstream if this would be a sensible and
viable option. First reply was however that this might not work, as
checking for stability is an all-or-nothing thing. Will try to get
a definitive answer from experts (Thomas for example).

> What do you guys think we should do? My only concern about item 2 is to
> get stuck on upstream for too long. Maybe we could use tsc=reliable
> as a workaround meanwhile?

Comment 66 Marcelo Tosatti 2019-03-01 17:01:29 UTC

(In reply to Juri Lelli from comment #65)
> (In reply to Luiz Capitulino from comment #64)
> > (In reply to Marcelo Tosatti from comment #61)
> > 
> > > About TSC going unstable, on comment 13 Luiz made a hypothesis:
> > > 
> > > "In the host, since our RT threads run at 100%, the clocksource watchdog
> > > seems to hang between clock reads, which causes it to think there's a
> > > skew, which in turn cause it to mark the TSC as unstable."
> > > 
> > > But later on the discussion it was noted that ktimersoftd priority is higher.
> > > 
> > > So is the actual reason TSC is being marked unstable?
> > 
> > Good point. Juri found out that due to a bug in tuned the ktimersoftd thread
> > had a lower priority than the vcpu thread, so ktimersoftd was getting stuck.
> > That's why there was a delta, and that's why the tsc was being marked as
> > unstable. Now that bug is fixed, we don't see the tsc being marked as
> > unstable anymore.
> > 
> > However, having the timer firing every few seconds does increase the
> > latency. That's the problem we want to fix next. What I'm hearing is
> > that we have two options:
> > 
> > 1. Use tsc=reliable in the profiles. To do that, we have to understand
> > all the implications of having tsc=reliable (ie. what the kernel does
> > or doesn't do)
> 
> I'll look into this.
> 
> > 2. Make clocksource watchdog respect a CPU mask provided by the user
> 
> And already started asking upstream if this would be a sensible and
> viable option. First reply was however that this might not work, as
> checking for stability is an all-or-nothing thing. Will try to get
> a definitive answer from experts (Thomas for example).

Please mention the latency numbers you get with and without 
cs watchdog on the isolated CPUs... and then mention NFV wants < 20us. 
 
> > What do you guys think we should do? My only concern about item 2 is to
> > get stuck on upstream for too long. Maybe we could use tsc=reliable
> > as a workaround meanwhile?

Comment 67 Pei Zhang 2019-03-04 14:49:46 UTC

Testing 24 hours cyclictest, we have finished 2 testing scenarios on 2 standard kernel-rt versions up to now.

Note:
(1) Adding tsc=reliable to both host and guest kernel line

(2) # cat /sys/module/kvm/parameters/lapic_timer_advance_ns
3000


==With kernel-rt-4.18.0-74.rt9.130.el8.x86_64, max latency is 52us.
Beaker job: https://beaker.engineering.redhat.com/jobs/3374286

(1)Single VM with 1 rt vCPU

# Min Latencies: 00006
# Avg Latencies: 00013
# Max Latencies: 00049

(3)Multiple VMs each with 1 rt vCPU:
-VM1
# Min Latencies: 00005
# Avg Latencies: 00012
# Max Latencies: 00033

-VM2
# Min Latencies: 00005
# Avg Latencies: 00012
# Max Latencies: 00052

-VM3
# Min Latencies: 00006
# Avg Latencies: 00012
# Max Latencies: 00030

-VM4
# Min Latencies: 00005
# Avg Latencies: 00012
# Max Latencies: 00031


==With kernel-rt-4.18.0-71.rt9.127.el8.x86_64, max latency is 44us.
Beaker job: https://beaker.engineering.redhat.com/jobs/3374285
(1)Single VM with 1 rt vCPU

# Min Latencies: 00006
# Avg Latencies: 00011
# Max Latencies: 00031

(3)Multiple VMs each with 1 rt vCPU:

# Min Latencies: 00006
# Avg Latencies: 00011
# Max Latencies: 00042

# Min Latencies: 00006
# Avg Latencies: 00011
# Max Latencies: 00029

# Min Latencies: 00006
# Avg Latencies: 00012
# Max Latencies: 00044


# Min Latencies: 00006
# Avg Latencies: 00011
# Max Latencies: 00029

Comment 68 Pei Zhang 2019-03-05 08:32:11 UTC

Update the third testing scenario of Comment 67:

Note:
(1) Adding tsc=reliable to both host and guest kernel line

(2) # cat /sys/module/kvm/parameters/lapic_timer_advance_ns
3000

==With kernel-rt-4.18.0-74.rt9.130.el8.x86_64, max latency is 84us.
Beaker job: https://beaker.engineering.redhat.com/jobs/3374286

(3)Single VM with 8 rt vCPUs
# Min Latencies: 00005 00014 00014 00014 00014 00014 00014 00011
# Avg Latencies: 00013 00015 00015 00015 00014 00015 00015 00018
# Max Latencies: 00031 00021 00022 00022 00021 00021 00021 00084


==With kernel-rt-4.18.0-71.rt9.127.el8.x86_64, max latency is 49us.
Beaker job: https://beaker.engineering.redhat.com/jobs/3374285
(3)Single VM with 8 rt vCPUs
# Min Latencies: 00006 00013 00013 00013 00013 00013 00013 00011
# Avg Latencies: 00011 00014 00014 00014 00014 00014 00014 00014
# Max Latencies: 00049 00042 00042 00042 00043 00042 00042 00043

Comment 69 Pei Zhang 2019-03-22 02:54:59 UTC

Update latest testing: max cyclictest latency is 47us.

Note: No extra options were set, we just tested following standard testing steps/configurations in this run.

Versions:
libvirt-5.0.0-7.module+el8+2887+effa3c42.x86_64
tuned-2.10.0-15.el8.noarch
kernel-rt-4.18.0-80.rt9.138.el8.x86_64
qemu-kvm-3.1.0-20.module+el8+2888+cdc893a8.x86_64

(1)Single VM with 1 rt vCPU:
# Min Latencies: 00006
# Avg Latencies: 00017
# Max Latencies: 00040

(2)Single VM with 8 rt vCPUs:
# Min Latencies: 00007 00015 00019 00019 00018 00019 00014 00012
# Avg Latencies: 00016 00019 00019 00019 00019 00019 00019 00019
# Max Latencies: 00038 00028 00027 00027 00028 00031 00027 00044


(3)Multiple VMs each with 1 rt vCPU:
- VM1
# Min Latencies: 00007
# Avg Latencies: 00015
# Max Latencies: 00036

- VM2
# Min Latencies: 00007
# Avg Latencies: 00015
# Max Latencies: 00037

- VM3
# Min Latencies: 00007
# Avg Latencies: 00016
# Max Latencies: 00047

- VM4
# Min Latencies: 00007
# Avg Latencies: 00016
# Max Latencies: 00038

Comment 70 Juri Lelli 2019-06-03 15:56:48 UTC

Closing this as the root problem for the huge (~200us) spike (and related dmesg warnings) was fixed by https://bugzilla.redhat.com/show_bug.cgi?id=1669294 (userspace component).

Going to open a new BZ for tracking the remaining issue related to clocksource watchdog timer causing latencies >40us.